Parallelization of Maximum Entropy POS Tagging for Bahasa Indonesia with MapReduce
In this paper, the MapReduce programming model is used to parallelize the training and tagging processes of maximum entropy part-of-speech (POS) tagging for Bahasa Indonesia. In the training process, MapReduce is applied to dictionary, tagtoken, and feature creation. In the tagging process, MapReduce is used to tag the lines of a document in parallel. The training experiments showed that total training time is lower with MapReduce, although the time spent reading results inside the process slows the overall training down. The tagging experiments, run with different numbers of map and reduce processes, showed that the MapReduce implementation can speed up the tagging process. The fastest tagging result was obtained with a 1,000,000-word corpus and 30 map processes.
Keywords: POS tagging, Maximum Entropy, MapReduce
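The line-parallel tagging described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the map and reduce functions, the thread pool standing in for a MapReduce cluster, the toy lexicon standing in for a trained maximum entropy model, and the tag names are all assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for a trained maximum entropy model: a fixed lookup table.
# The tags used here are illustrative only.
TOY_LEXICON = {"saya": "PRP", "makan": "VB", "nasi": "NN"}


def tag_line(line):
    """Tag each token of one line; unknown tokens fall back to 'NN' (arbitrary)."""
    return " ".join(
        f"{tok}/{TOY_LEXICON.get(tok.lower(), 'NN')}" for tok in line.split()
    )


def map_task(chunk):
    """Map: emit (line_number, tagged_line) for every line in one input split."""
    return [(line_no, tag_line(line)) for line_no, line in chunk]


def reduce_task(pairs):
    """Reduce: sort by line number so the output follows the original order."""
    return [tagged for _, tagged in sorted(pairs)]


def tag_document(lines, num_maps=2):
    """Tag a document by spreading its lines over num_maps map tasks."""
    numbered = list(enumerate(lines))
    # Round-robin split of the numbered lines into one chunk per map task.
    chunks = [numbered[i::num_maps] for i in range(num_maps)]
    with ThreadPoolExecutor(max_workers=num_maps) as pool:
        mapped = list(pool.map(map_task, chunks))
    # Shuffle step: gather every emitted pair before reducing.
    pairs = [pair for part in mapped for pair in part]
    return reduce_task(pairs)


if __name__ == "__main__":
    for tagged in tag_document(["Saya makan nasi", "nasi goreng"], num_maps=2):
        print(tagged)
```

Because each line is tagged independently, the number of map tasks can be varied without changing the output, which is the property the tagging experiments rely on when comparing different numbers of map processes.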
ABOUT THE AUTHORS
Arif Nurwidyantoro
Arif Nurwidyantoro received his bachelor's degree from Institut Pertanian Bogor, Indonesia, and his master's degree from Universitas Gadjah Mada, Indonesia, both in Computer Sciences. He currently works as a teaching assistant at Universitas Gadjah Mada. His interests are data mining, especially text and web mining, and large data processing.
Edi Winarko
Edi Winarko received his bachelor's degree in Statistics from Universitas Gadjah Mada, Indonesia, his M.Sc. in Computer Sciences from Queen's University, Canada, and his Ph.D. in Computer Sciences from Flinders University, Australia. He currently works as a lecturer at the Department of Computer Sciences and Electronics, Faculty of Mathematics and Natural Sciences, Universitas Gadjah Mada. His research interests are data warehousing, data mining, and information retrieval. He is a member of ACM and IEEE.