Modeling Unstructured Document Using N-gram Consecutive and WordNet Dictionary
The main issue in Text Document Clustering (TDC) is measuring document similarity. To measure similarity, documents must first be transformed into numerical values. The Vector Space Model (VSM) is one technique for converting documents into numerical values. In VSM, a document is represented by the frequencies of the terms it contains, so it works like a Bag of Words (BOW). BOW causes two major problems because it ignores relationships between terms, treating each term as single and independent. These problems, Polysemy and Synonymity, both concern the relationships between terms. This study combines WordNet and N-grams to overcome both problems. Modifying the document features from single terms into Polysemy and Synonymity concepts improved VSM performance. The experiment has four steps: text document selection, preprocessing, clustering, and cluster evaluation using F-measure. On the reuters50_50 dataset obtained from the UCI repository, the experiment was successful and the results are promising.
Keywords: TDC, TD, VSM, Polysemy, Synonymity, WordNet, N-gram, K-Means Synset Syngram, Cosine Similarity and F-Measure.
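To illustrate the Synonymity problem described in the abstract, the following is a minimal sketch, not the authors' implementation. It builds term-frequency vectors and compares them with cosine similarity, first on raw terms and then after mapping terms to shared concepts. The names `bag_of_words`, `map_to_concepts`, and the tiny `TOY_SYNSETS` table are hypothetical; the table merely stands in for a WordNet synset lookup, and the two sample sentences are invented for illustration.

```python
import math
from collections import Counter

# Hypothetical toy synonym table standing in for a WordNet synset lookup.
TOY_SYNSETS = {"automobile": "car", "purchased": "bought"}


def bag_of_words(text):
    """Represent a document by the frequency of each whitespace-separated term."""
    return Counter(text.lower().split())


def map_to_concepts(text):
    """Replace each term by its concept label (if any) before counting."""
    return Counter(TOY_SYNSETS.get(term, term) for term in text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


doc_a = "he bought a car"
doc_b = "she purchased an automobile"

# Raw terms share nothing, so BOW treats the documents as unrelated.
plain = cosine_similarity(bag_of_words(doc_a), bag_of_words(doc_b))

# After concept mapping, "bought" and "car" are shared by both documents.
concept = cosine_similarity(map_to_concepts(doc_a), map_to_concepts(doc_b))

print(plain, concept)
```

With raw terms the two sentences have no term in common, whereas after concept mapping they share two of four terms each. A real system would replace the toy table with an actual WordNet lookup and would also need to handle Polysemy, which this sketch does not address.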
ABOUT THE AUTHOR
Abdul Halim Omar
A.H. Omar is a research assistant in the Faculty of Computer Science at Tun Hussein Onn Malaysia University. He works in data mining, specializing in clustering. He received his Bachelor's Degree in Information Technology (Computer Networking) from Tun Hussein Onn Malaysia University. In early 2007 he worked as a programmer at a software house in Kuala Lumpur, Malaysia. He is currently pursuing a Master's Degree in Information Technology, majoring in Text Document Clustering.