Language Identification of Web Pages Based on Improved N-gram Algorithm



Language identification of written text is a well-studied research field for languages based on Latin script. However, new challenges arise when it is applied to non-Latin-script languages, especially to web pages in Asian languages. This paper proposes and evaluates the use of the Universal Declaration of Human Rights and Biblical texts as a training corpus, together with two new heuristics, to improve an n-gram based language identification algorithm for Asian languages. Extending the training corpus improved accuracy. Further improvement was achieved by using a byte-sequence based HTML parser and an HTML character entities converter. The algorithm was evaluated on a written text corpus of 1,660 web pages spanning 182 languages from Asia, Africa, the Americas, Europe and Oceania. Experimental results showed that the algorithm achieved a language identification accuracy rate of 94.04%.
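The pipeline summarized in the abstract can be illustrated with a minimal sketch. The abstract does not specify the algorithm's details, so the code below makes several assumptions: it uses a rank-order ("out-of-place") comparison of n-gram profiles in the style of Cavnar and Trenkle, which may differ from the paper's improved algorithm; the profile size of 300 and the n-gram lengths 1 to 3 are arbitrary; the function names (`strip_html`, `profile`, `out_of_place`, `identify`) are hypothetical; and the two short training sentences are toy stand-ins for the Universal Declaration of Human Rights and Biblical corpus.

```python
import html
import re
from collections import Counter

PROFILE_SIZE = 300  # assumed cut-off; the abstract gives no value


def strip_html(raw: bytes) -> str:
    """Remove tags at the byte level, then convert character entities to text."""
    without_tags = re.sub(rb"<[^>]*>", b" ", raw)
    text = without_tags.decode("utf-8", errors="replace")
    return html.unescape(text)  # e.g. "&#3617;" -> "ม", "&amp;" -> "&"


def profile(text: str, n_max: int = 3, size: int = PROFILE_SIZE) -> dict:
    """Map the most frequent byte n-grams (n = 1..n_max) to their rank."""
    data = text.encode("utf-8")
    counts = Counter(
        data[i:i + n]
        for n in range(1, n_max + 1)
        for i in range(len(data) - n + 1)
    )
    return {gram: rank for rank, (gram, _) in enumerate(counts.most_common(size))}


def out_of_place(doc: dict, lang: dict, penalty: int = PROFILE_SIZE) -> int:
    """Sum rank differences; n-grams missing from the language profile cost `penalty`."""
    return sum(
        abs(rank - lang[gram]) if gram in lang else penalty
        for gram, rank in doc.items()
    )


def identify(raw_page: bytes, lang_profiles: dict) -> str:
    """Return the name of the language profile closest to the page's profile."""
    doc = profile(strip_html(raw_page))
    return min(lang_profiles, key=lambda name: out_of_place(doc, lang_profiles[name]))


# Toy training samples (illustrative only, not the paper's corpus).
TRAINING = {
    "en": "All human beings are born free and equal in dignity and rights.",
    "th": "มนุษย์ทั้งหลายเกิดมามีอิสระและเสมอภาคกันในศักดิ์ศรีและสิทธิ",
}
PROFILES = {name: profile(sample) for name, sample in TRAINING.items()}

# A page whose Thai text is written entirely as numeric character entities.
page = b"<html><body><p>&#3617;&#3609;&#3640;&#3625;&#3618;&#3660;</p></body></html>"
print(identify(page, PROFILES))
```

Converting entities before building the profile matters for non-Latin scripts: without it, a page written with numeric entities would be profiled as ASCII digits and punctuation rather than as the byte sequences of the underlying script.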

Keywords: Asian Language, Byte-Sequences, HTML Character Entities, N-gram, Non-Latin-Script, Language Identification

About IJCSI

IJCSI is a refereed open access international journal for scientific papers dealing in all areas of computer science research...
