International Journal of Computer Science Issues

Analysis of Stemming Algorithm for Text Clustering

N.Sandhya, Y.Srilalitha, V.Sowmya, Dr.K.Anuradha and Dr.A.Govardhan

Text document clustering plays an important role in providing intuitive navigation and browsing mechanisms by organizing large amounts of information into a small number of meaningful clusters. In the bag-of-words representation of documents, the words that appear in documents often have many morphological variants. In most cases these variants have similar semantic interpretations and can be treated as equivalent for the purposes of clustering. For this reason, a number of stemming algorithms, or stemmers, have been developed, which attempt to reduce a word to its stem or root form. The key terms of a document are then represented by stems rather than by the original words. In this work we study the impact of a stemming algorithm on cluster quality, together with four popular similarity measures (Euclidean, cosine, Pearson correlation and extended Jaccard) and three types of vector representation (boolean, term frequency, and term frequency-inverse document frequency). To cluster the documents we use the partitional clustering technique K-Means. Performance is measured against a human-imposed classification of the Classic data set. We conducted a number of experiments and used the entropy measure to ensure the statistical significance of the results. Cosine, Pearson correlation and extended Jaccard similarities emerge as the best measures for capturing human categorization behavior, while the Euclidean measure performs poorly. After the stemming algorithm is applied, the Euclidean measure shows little improvement.

Keywords: Text clustering, Stemming Algorithm, Similarity Measures, Cluster Accuracy.
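
As a rough illustration of the quantities named in the abstract, the following minimal sketch computes the four similarity measures on two term vectors, along with a size-weighted cluster entropy. It is not taken from the paper: the function names are hypothetical, the vectors are assumed to be plain lists of equal length, and the entropy shown uses the natural logarithm without normalization, which may differ from the authors' exact formulation.

```python
import math

def euclidean(a, b):
    # Euclidean distance: smaller values mean more similar documents.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine(a, b):
    # Cosine of the angle between the two term vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def pearson(a, b):
    # Pearson correlation equals the cosine of the mean-centered vectors.
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    return cosine([x - ma for x in a], [y - mb for y in b])

def extended_jaccard(a, b):
    # Extended (Tanimoto) Jaccard: dot / (|a|^2 + |b|^2 - dot).
    dot = sum(x * y for x, y in zip(a, b))
    denom = sum(x * x for x in a) + sum(y * y for y in b) - dot
    return dot / denom if denom else 0.0

def cluster_entropy(clusters):
    # clusters: list of clusters, each a list of human-assigned class labels.
    # Returns the size-weighted average entropy (assumed: natural log,
    # unnormalized); 0 means every cluster holds a single class.
    total = sum(len(c) for c in clusters)
    result = 0.0
    for cluster in clusters:
        if not cluster:
            continue
        counts = {}
        for label in cluster:
            counts[label] = counts.get(label, 0) + 1
        h = -sum((k / len(cluster)) * math.log(k / len(cluster))
                 for k in counts.values())
        result += (len(cluster) / total) * h
    return result
```

Lower entropy indicates clusters that agree more closely with the human-imposed classes, so a comparison of this kind would rank each combination of similarity measure and vector representation by the entropy it yields.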


ABOUT THE AUTHORS

N.Sandhya
N.Sandhya B.Tech, M.Tech (Ph.D). I received my B.Tech in 2000 and my M.Tech in 2007, and registered for a Ph.D in 2008. I have 11 years of teaching experience and work at GRIET. My areas of interest are Databases, Data Mining, Information Retrieval and Text Mining.

Y.Srilalitha
Y.Srilalitha M.Tech (Ph.D). I completed my M.Tech in 2001 and registered for a Ph.D in 2008. I have 16 years of teaching experience and work at GRIET. My areas of interest are Information Retrieval, Data Mining and Text Mining.

V.Sowmya
V.Sowmya M.Tech (Ph.D). I completed my M.Tech in 2009 and registered for a Ph.D in 2011. I have 6 years of teaching experience and work at GRIET. My areas of interest are Information Retrieval, Data Mining and Text Mining.

Dr.K.Anuradha
Dr.K.Anuradha M.Tech, Ph.D. I completed my Ph.D in 2011. I am working as Professor and Head of the CSE Department at GRIET. My areas of interest are Information Retrieval, Data Mining and Text Mining.

Dr.A.Govardhan
I received a Ph.D. degree in Computer Science and Engineering from Jawaharlal Nehru Technological University in 2003, an M.Tech. from Jawaharlal Nehru University in 1994, and a B.E. from Osmania University in 1992. I am working as Principal of Jawaharlal Nehru Technological University, Jagitial. My areas of interest are Information Retrieval, Databases, Data Mining and Text Mining.
