Graph Theoretic and Genetic Algorithm-Based Model for Web Content Mining
The World Wide Web (WWW) is arguably the largest and most heterogeneous repository of data, and it continues to grow in size and complexity. With this steady expansion, retrieving the required web pages and information has become a herculean task for web users because of information overload. Worse still, existing web content retrieval techniques have not been efficient enough in terms of speed and accuracy. This paper presents a Graph Theoretic (GT) and Genetic Algorithm (GA)-based technique for mining web documents. The technique uses graph representations of document content to address the problems of initialization, convergence to local minima and failure to handle large datasets. It works in three phases, namely content extraction, preprocessing and database formulation, and uses the Maximum Common Sub-graph (MCS) to calculate the distance between clusters. Results of a web-based experimental study, run on a 2 GHz Pentium 4 machine with 1 GB of RAM under the Windows 7 operating system, with a web scraper (import.io) as the front end and PHP 6 and MySQL 5 as back ends, show the applicability of the new technique and its superiority over some existing ones.
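As an illustration of how an MCS-based distance between document graphs can be computed, the sketch below is a minimal example and not the authors' implementation. It assumes that each document is represented as a graph whose nodes are unique terms and whose directed edges link adjacent terms, and that distance is taken as 1 - |mcs(G1, G2)| / max(|G1|, |G2|), a commonly used MCS-based measure in which graph size is the node count. The abstract does not specify either choice, and the function names (build_graph, mcs, mcs_distance) are hypothetical.

```python
def build_graph(tokens):
    """Build a simple document graph (assumed representation).

    Nodes are the unique terms; directed edges link each term to the
    term that immediately follows it in the text.
    """
    nodes = set(tokens)
    edges = {(a, b) for a, b in zip(tokens, tokens[1:]) if a != b}
    return nodes, edges


def mcs(g1, g2):
    """Maximum common subgraph of two graphs with unique node labels.

    Because every term labels at most one node, the MCS reduces to the
    shared nodes together with the shared edges.
    """
    nodes1, edges1 = g1
    nodes2, edges2 = g2
    return nodes1 & nodes2, edges1 & edges2


def mcs_distance(g1, g2):
    """Assumed distance: 1 - |mcs| / max(|G1|, |G2|), sizes in nodes."""
    largest = max(len(g1[0]), len(g2[0]))
    if largest == 0:
        return 0.0  # two empty graphs are treated as identical
    common_nodes, _ = mcs(g1, g2)
    return 1.0 - len(common_nodes) / largest


if __name__ == "__main__":
    doc_a = build_graph("web content mining with graph".split())
    doc_b = build_graph("web content mining with genetic algorithm".split())
    print(round(mcs_distance(doc_a, doc_b), 3))
```

Under these assumptions the distance is 0 for identical term sets and 1 for documents that share no terms, so it can be used directly as a dissimilarity measure when comparing documents or cluster representatives.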
Keywords: Web mining, graph theory, genetic algorithm, knowledge discovery
ABOUT THE AUTHORS
Moses Akinjide Adelola
PhD Research student
Sunday Olumide Adewale
Professor of Computer Science
Gabriel Babatunde Iwasokun
Lecturer/Researcher