International Journal of Computer Science Issues

Language Specific Crawler for Myanmar Web Pages

Pann Yu Mon, Chew Yew Choong, Yoshiki Mikami

With the enormous growth of the World Wide Web, search engines play a critical role in retrieving information from the borderless Web. Although many search engines can search content in numerous major languages, they cannot search pages in less-computerized languages such as Myanmar, because Myanmar Web pages use multiple non-standard encodings. Since the Web is a distributed, dynamic, and rapidly growing information resource, a normal Web crawler cannot download all pages. A language-specific search engine therefore needs a Language Specific Crawler (LSC) to collect the targeted pages. This paper presents an LSC implemented as multi-threaded objects that run concurrently with a language identifier. The LSC aims to collect as many Myanmar Web pages as possible. In experiments, the implemented algorithm collected Myanmar pages at a satisfactory level of coverage. The paper also discusses an evaluation of the LSC by two criteria, recall and precision, and a method for estimating the total number of Myanmar Web pages on the entire Web. Finally, an analysis of the locations of the servers hosting Myanmar Web content is presented.

Keywords: Language Specific Crawling, Myanmar, Web Search, Language Identification
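
The two evaluation criteria named in the abstract, recall and precision, can be illustrated with a minimal sketch. It uses the standard set-based definitions, not necessarily the exact procedure of the paper; the function name precision_recall and both input sets are hypothetical and chosen only for illustration.

```python
def precision_recall(collected, relevant):
    """Return (precision, recall) for one crawl.

    collected: URLs the crawler downloaded (hypothetical input)
    relevant:  URLs known to be in the target language
               (hypothetical ground truth, e.g. a hand-labelled sample)
    """
    collected = set(collected)
    relevant = set(relevant)
    hits = len(collected & relevant)  # target-language pages actually collected

    # Precision: fraction of collected pages that are in the target language.
    precision = hits / len(collected) if collected else 0.0
    # Recall: fraction of target-language pages that were collected.
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


if __name__ == "__main__":
    crawled = {"a", "b", "c", "d"}   # placeholder page identifiers
    myanmar = {"a", "b", "e"}        # placeholder ground truth
    p, r = precision_recall(crawled, myanmar)
    print(f"precision={p:.2f} recall={r:.2f}")
```

Here two of the four collected pages are relevant, and two of the three relevant pages were collected. In practice the ground-truth set is the hard part, which is why the abstract pairs the evaluation with a method for estimating the total number of Myanmar pages on the Web.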
