
Content-based Text Categorization using Wikitology


Muhammad Rafi, Sundus Hassan and Muhammad Shahid Shaikh

Text categorization assigns labels or categories to each text document according to the document's semantic content. Traditional approaches to text categorization used features from the text, such as words, phrases, and concept hierarchies, to represent the documents and reduce their dimensionality. More recently, researchers have addressed the brittleness of these representations by incorporating background knowledge into the document representation from an external knowledge base, for example WordNet, the Open Directory Project (ODP), or Wikipedia. In this paper we try to enhance text categorization by integrating knowledge from Wikitology. Wikitology is a knowledge repository that extracts knowledge from Wikipedia in structured and unstructured forms and wraps it in an ontological structure. We augment each text document by exploring Wikitology fields such as bag of words, titles, redirects, entity types, categories, and linked entities. We also propose and evaluate different text representations and text enrichment techniques. Classification is performed with a Support Vector Machine (SVM), and we validate the experiments using 4-fold cross-validation.

Keywords: Text Categorization, Machine Learning, Wikitology, Support Vector Machine, 20-Newsgroup, Reuters-21578
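
The enrichment idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: `TOY_KNOWLEDGE` is a hypothetical stand-in for a Wikitology lookup, the field names `categories` and `linked_entities` are borrowed from the abstract only as labels, and the helper functions are invented for illustration. The sketch appends knowledge-base terms to a document's tokens, builds bag-of-words counts, and produces 4-fold train/test splits; the resulting vectors would then be passed to an SVM, which is omitted here.

```python
from collections import Counter

# Hypothetical stand-in for a Wikitology lookup (assumption, not real data):
# maps a surface term to related knowledge-base fields.
TOY_KNOWLEDGE = {
    "python": {"categories": ["programming_languages"],
               "linked_entities": ["guido_van_rossum"]},
    "svm": {"categories": ["machine_learning"],
            "linked_entities": ["kernel_method"]},
}

PUNCT = ".,;:!?()"


def tokenize(text):
    """Lowercase, split on whitespace, and strip surrounding punctuation."""
    stripped = (w.strip(PUNCT).lower() for w in text.split())
    return [w for w in stripped if w]


def enrich(tokens, fields=("categories", "linked_entities")):
    """Append knowledge-base terms for every token found in the toy lookup."""
    extra = []
    for tok in tokens:
        entry = TOY_KNOWLEDGE.get(tok, {})
        for field in fields:
            extra.extend(entry.get(field, []))
    return tokens + extra


def bag_of_words(tokens):
    """Represent a document as term counts."""
    return Counter(tokens)


def k_fold_indices(n_docs, k=4):
    """Yield (train, test) index lists for k contiguous folds."""
    sizes = [n_docs // k + (1 if i < n_docs % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n_docs) if i < start or i >= start + size]
        yield train, test
        start += size


docs = [
    "Python is popular.",
    "An SVM separates classes.",
    "Plain text only.",
    "Python and SVM together.",
]
vectors = [bag_of_words(enrich(tokenize(d))) for d in docs]
folds = list(k_fold_indices(len(docs), k=4))
```

Restricting `fields` to a single entry, for example `enrich(tokens, fields=("categories",))`, gives a simple way to compare how different knowledge-base fields change the document representation.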



ABOUT THE AUTHORS

Muhammad Rafi
Muhammad Rafi received his MS degree in computer science with a Gold Medal from the National University of Computer and Emerging Sciences (FAST-NU), Karachi campus, in 2000. He is currently pursuing his PhD degree in computer science at the same university. He is an assistant professor in the computer science department of FAST-NU, Karachi campus. His research interests include machine learning, algorithm design, data/text mining, and information retrieval. He is also a member of ACM and IEEE.

Sundus Hassan
Sundus Hassan received her MS in computer science in 2010. She is currently a software engineer at a local software company.

Muhammad Shahid Shaikh
Muhammad Shahid Shaikh received the BE degree from Mehran University of Engineering and Technology, Pakistan, in 1986, the MS degree from Michigan State University in 1989, and the PhD degree from McGill University, Montreal, in 2004, all in Electrical Engineering. He is currently an associate professor and head of the Department of Electrical Engineering at the National University of Computer and Emerging Sciences, Karachi, Pakistan.


About IJCSI

IJCSI is a refereed open access international journal for scientific papers dealing in all areas of computer science research...
