International Journal of Computer Science Issues

Language Identification of Web Pages Based on Improved N-gram Algorithm

Chew Yew Choong, Yoshiki Mikami, Robin Lee Nagano

Language identification of written text is a well-studied research field for Latin-script based languages. However, new challenges arise when it is applied to non-Latin-script based languages, especially to web pages in Asian languages. The objective of this paper is to propose and evaluate the effectiveness of adapting the Universal Declaration of Human Rights and Biblical texts as a training corpus, together with two new heuristics, to improve an n-gram based language identification algorithm for Asian languages. Extending the training corpus improved accuracy. Further improvement was achieved by using a byte-sequence based HTML parser and an HTML character entities converter. The performance of the algorithm was evaluated on a written text corpus of 1,660 web pages, spanning 182 languages from Asia, Africa, the Americas, Europe and Oceania. Experimental results showed that the algorithm achieved a language identification accuracy rate of 94.04%.

Keywords: Asian Language, Byte-Sequences, HTML Character Entities, N-gram, Non-Latin-Script, Language Identification
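
The abstract names three ingredients: conversion of HTML character entities, byte-sequence processing, and n-gram profiles. The sketch below shows how they might fit together in a generic rank-order n-gram classifier. It is an illustration under stated assumptions, not the authors' algorithm: every function name is hypothetical, the regex tag stripper is only a stand-in for the paper's byte-sequence based HTML parser, and the n-gram length, profile size, and distance measure are assumed values.

```python
import html
import re
from collections import Counter

TOP = 300  # assumed profile size; also used as the penalty for unseen n-grams


def strip_tags(markup):
    """Crudely drop HTML tags, then convert character entities to characters."""
    text = re.sub(r"<[^>]+>", " ", markup)
    return html.unescape(text)  # e.g. "&#12354;" becomes the character U+3042


def byte_ngram_profile(text, n=3, top=TOP):
    """Return the most frequent byte n-grams of the UTF-8 encoding, ranked."""
    data = text.encode("utf-8")
    counts = Counter(data[i:i + n] for i in range(len(data) - n + 1))
    return [gram for gram, _ in counts.most_common(top)]


def out_of_place(doc_profile, lang_profile, penalty=TOP):
    """Sum of rank differences; n-grams missing from the language cost `penalty`."""
    rank = {gram: r for r, gram in enumerate(lang_profile)}
    return sum(
        abs(r - rank[gram]) if gram in rank else penalty
        for r, gram in enumerate(doc_profile)
    )


def identify(markup, training, n=3):
    """Pick the training language whose profile is closest to the page's."""
    doc = byte_ngram_profile(strip_tags(markup), n)
    profiles = {lang: byte_ngram_profile(sample, n) for lang, sample in training.items()}
    return min(profiles, key=lambda lang: out_of_place(doc, profiles[lang]))


# Toy training data (hypothetical; the paper's corpus is far larger).
training = {
    "en": "Hello world, this is a sample of English text.",
    "ja": "こんにちは、世界。これは日本語の文章です。",
}
page = "<p>&#12371;&#12435;&#12395;&#12385;&#12399;</p>"
print(identify(page, training))
```

Without the entity conversion step, the page above would be profiled as a run of ASCII digits and punctuation, which hints at why a character entities converter matters for non-Latin-script pages.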
