Hybrid Distance Based Document Clustering with Keyword and Phrase Indexing

Subhadra Kompella and M. Shashi

Document clustering algorithms group a set of documents into subsets, or clusters, and have several applications in information retrieval. Our proposed method uses the Scatter-Gather approach to cluster groups of documents selected from an entire collection. The selected groups are merged and the resulting set is clustered again. This process is repeated until a cluster of interest is found. This research presents a model for document clustering that arranges unstructured documents into content-based homogeneous groups. The clustering approach uses the popular cosine similarity measure combined with the Euclidean distance measure. To the best of our knowledge, much work has been carried out on keyword-based clustering and on phrase-index-based clustering. Our method attempts to combine the two. The method has been applied to the standard NewsGroup-20 dataset, whose documents are distributed over 20 different topics. Results have been verified in two settings: a fixed number of clusters with different corpora, and a variable number of clusters with a fixed corpus. Both sets of results indicate a steady increase in the overall purity of clustering compared to the keyword-based clustering method. With keyword-based clustering, purity increased with the number of clusters for a fixed corpus, but decreased when the number of clusters was fixed and the number of corpora increased. In our method, the increase in purity was more pronounced as the number of clusters increased.
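As an illustration of the two quantities the abstract refers to, the following minimal Python sketch combines a cosine-based distance with Euclidean distance and computes cluster purity. The abstract does not state how the two measures are combined, so the weighted sum and its `alpha` parameter are assumptions made here for illustration, not the authors' formulation; the function names are likewise hypothetical. Purity follows the standard definition: the fraction of documents that belong to the majority class of their cluster.

```python
import math
from collections import Counter


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length term vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 1.0  # treat an empty document as maximally dissimilar
    return 1.0 - dot / (norm_u * norm_v)


def euclidean_distance(u, v):
    """Return the Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def hybrid_distance(u, v, alpha=0.5):
    """Weighted sum of cosine and Euclidean distance.

    ASSUMPTION: the weighting scheme and `alpha` are illustrative only;
    the source does not specify how the two measures are combined.
    """
    return alpha * cosine_distance(u, v) + (1 - alpha) * euclidean_distance(u, v)


def purity(cluster_ids, class_labels):
    """Fraction of documents assigned to the majority class of their cluster."""
    counts_by_cluster = {}
    for cluster, label in zip(cluster_ids, class_labels):
        counts_by_cluster.setdefault(cluster, Counter())[label] += 1
    majority_total = sum(max(c.values()) for c in counts_by_cluster.values())
    return majority_total / len(class_labels)
```

For example, `purity([0, 0, 1, 1], ["a", "a", "b", "a"])` counts two majority documents in cluster 0 and one in cluster 1, giving 3 of 4 documents.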

Keywords: Document clustering, Phrase index, Purity

Subhadra Kompella
Subhadra Kompella has over 6 years of teaching experience and is currently working as an Assistant Professor in the Department of CSE, GIT, GITAM University. Her areas of interest include Data Mining, Data Structures, and Algorithms.

M. Shashi
Prof. Dr. M. Shashi has over 24 years of teaching experience and is currently working as the head of the Department of CSSE in the College of Engineering, Andhra University. Her research areas include Data Warehousing & Mining, AI, and Machine Learning.

IJCSI is a refereed open access international journal for scientific papers dealing in all areas of computer science research...
