International Journal of Computer Science Issues

An Accurate Arabic Root-Based Lemmatizer for Information Retrieval Purposes

Tarek El-Shishtawy and Fatma El-Ghannam

Despite its robust syntax, semantic cohesion, and lower ambiguity, lemma-level analysis and generation has not yet received much attention in the Arabic NLP literature. In the current research, we propose the first accurate non-statistical Arabic lemmatizer algorithm that is suitable for information retrieval (IR) systems. The proposed lemmatizer makes use of different Arabic language knowledge resources to generate an accurate lemma form, together with the relevant features that support IR purposes. Experimental results show that, as a POS tagger, the proposed algorithm achieves a maximum accuracy of 94.8%. For previously unseen documents, it achieves an accuracy of 89.15%, compared to 76.7% for the up-to-date Stanford accurate Arabic model on the same dataset.

Keywords: Arabic NLP, Information Retrieval, Arabic Lemmatizer, POS tagger


ABOUT THE AUTHORS

Tarek El-Shishtawy
Dr. Tarek El-Shishtawy has participated in many Arabic computational linguistics projects. The Large Scale Arabic Annotated Corpus (1995) was one of the important projects for the Egyptian Computer Society and the Academy of Scientific Research and Technology. He has many publications on Arabic corpora, machine translation, and text and data mining.

Fatma El-Ghannam
Fatma El-Ghannam has strong research interests in Arabic language generation and analysis. She is currently preparing for a Ph.D. degree in NLP.
