Efficient Algorithm for Near Duplicate Documents Detection
Identifying duplicate or near-duplicate documents in a set of documents is one of the major problems in information retrieval. Several methods to detect such documents have been proposed, but their relevance is still an issue. In this paper we propose an algorithm based on word position that provides a reduced candidate size to search in and increases efficiency and effectiveness for partial document relevance. In our experiments, the results show that during the search process for a query the candidate size is reduced up to 12% of the size of the document set, which leads to a decreased search time. The results also show higher accuracy, thus helping the user not to waste time waiting for a query and getting unwanted documents.
Keywords: Inverted Index, Near-Duplicate Document, Partial Document, Document Relevance, Duplicates Detection.
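The idea of using word positions to shrink the candidate set can be illustrated with a small positional inverted index. The sketch below is not the algorithm of the paper, whose details are not given in this abstract. It is a minimal illustration under stated assumptions: whitespace tokenization, a hypothetical `window` tolerance on word positions, and a hypothetical `min_overlap` threshold. All function and parameter names are illustrative.

```python
from collections import defaultdict


def tokenize(text):
    """Lowercase and split on whitespace (a simplifying assumption)."""
    return text.lower().split()


def build_positional_index(docs):
    """Map each word to {doc_id: set of positions where the word occurs}."""
    index = defaultdict(lambda: defaultdict(set))
    for doc_id, text in docs.items():
        for pos, word in enumerate(tokenize(text)):
            index[word][doc_id].add(pos)
    return index


def candidates(index, query, min_overlap=0.5, window=2):
    """Return {doc_id: score} for documents worth a full comparison.

    A query word counts as matched in a document when the document
    contains that word within `window` positions of its position in
    the query. The score is the fraction of query words matched.
    Both `window` and `min_overlap` are assumed tuning parameters.
    """
    words = tokenize(query)
    if not words:
        return {}
    hits = defaultdict(int)
    for pos, word in enumerate(words):
        # .get avoids creating empty entries for unseen words
        for doc_id, positions in index.get(word, {}).items():
            if any(abs(p - pos) <= window for p in positions):
                hits[doc_id] += 1
    scores = {doc_id: n / len(words) for doc_id, n in hits.items()}
    return {d: s for d, s in scores.items() if s >= min_overlap}


docs = {
    "a": "the quick brown fox jumps over the lazy dog",
    "b": "the quick brown fox leaps over the lazy dog",
    "c": "inverted indexes map words to documents",
}
index = build_positional_index(docs)
result = candidates(index, "the quick brown fox jumps over the lazy dog")
```

Only documents that share words at nearby positions with the query are returned, so the more expensive near-duplicate comparison runs on a subset of the collection. Matching on absolute positions is a simplification: a partial document that starts in the middle of a longer one would need an offset-tolerant comparison.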
ABOUT THE AUTHORS
Gaudence Uwamahoro
Gaudence Uwamahoro received the Master of Engineering degree in Computer Science and Technology from Central South University in 2010. She is currently working towards her Ph.D. degree at the School of Information Science and Engineering, Central South University, China. Her research interests include information systems, database technology, and data mining.
Zuping Zhang received the Ph.D. degree in Information Science and Engineering from Central South University in 2005. He is now a Professor in the School of Information Science and Engineering, Central South University. His current research interests include information fusion and information systems, parameter computing, and biology computing.