
Modeling Unstructured Document Using N-gram Consecutive and WordNet Dictionary


Abdul Halim Omar

The main issue in Text Document Clustering (TDC) is document similarity. To measure similarity, documents must first be transformed into numerical values. The Vector Space Model (VSM) is one technique for converting documents into numerical form. In VSM, a document is represented by the frequencies of the terms it contains, so it works like a Bag of Words (BOW). BOW leads to two major problems because it ignores relationships between terms, treating each term as single and independent. These problems are polysemy and synonymity, both of which concern the relationships between terms. This study combines WordNet and N-grams to address both problems. Replacing single-term document features with features that capture polysemy and synonymity improved VSM performance. The experiment has four steps: text document selection, preprocessing, clustering, and cluster evaluation using the F-measure. On the reuters50_50 dataset, obtained from the UCI repository, the experiment was successful and the results are promising.

Keywords: TDC, TD, VSM, Polysemy, Synonymity, WordNet, N-gram, K-Means, Synset, Syngram, Cosine Similarity, F-Measure.
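
The synonymity problem described in the abstract can be illustrated with a minimal Python sketch. It is not the author's implementation: the function names are illustrative, and `TOY_SYNSETS` is a hypothetical hand-written mapping that only stands in for a real WordNet lookup. The sketch builds term-frequency (bag-of-words) vectors, compares them with cosine similarity, and shows how mapping synonyms to a shared concept changes the score. It also shows one plausible reading of consecutive N-grams, as adjacent token sequences.

```python
import math
from collections import Counter

# Hypothetical toy mapping standing in for a WordNet synonym lookup.
# Real WordNet data is NOT used here; this is an assumption for illustration.
TOY_SYNSETS = {"car": "automobile", "auto": "automobile"}


def tokenize(text):
    """Lowercase the text and split it on whitespace."""
    return text.lower().split()


def ngrams(tokens, n):
    """Return consecutive n-grams as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def normalize_synonyms(tokens):
    """Replace each token by its shared concept label, if one is known."""
    return [TOY_SYNSETS.get(t, t) for t in tokens]


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


doc_a = tokenize("the car was sold")
doc_b = tokenize("the auto was sold")

# Plain bag of words: "car" and "auto" count as unrelated terms.
plain = cosine(Counter(doc_a), Counter(doc_b))

# After mapping synonyms to one concept, the two vectors coincide.
merged = cosine(Counter(normalize_synonyms(doc_a)),
                Counter(normalize_synonyms(doc_b)))

# Consecutive bigrams of the first document.
bigrams_a = ngrams(doc_a, 2)

print(plain, merged, bigrams_a)
```

With plain term frequencies the two sentences share three of four terms, so the similarity is 0.75. After synonym normalization it rises to 1.0, which is the kind of effect the abstract attributes to replacing single terms with concept-level features.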


ABOUT THE AUTHOR

Abdul Halim Omar
A. H. Omar is a research assistant in the Faculty of Computer Science at Tun Hussein Onn Malaysia University. He works in data mining, specializing in clustering. He received his Bachelor's degree in Information Technology (Computer Networking) from Tun Hussein Onn Malaysia University. In early 2007 he worked as a programmer at a software house in Kuala Lumpur, Malaysia. He is currently pursuing a Master's degree in Information Technology, majoring in Text Document Clustering.


About IJCSI

IJCSI is a refereed open access international journal for scientific papers dealing in all areas of computer science research...
