Friday 23rd of February 2018

Efficient Algorithm for Near Duplicate Documents Detection

Gaudence Uwamahoro and Zuping Zhang

Identifying duplicate or near-duplicate documents in a document collection is one of the major problems in information retrieval. Several methods for detecting such documents have been proposed, but their relevance is still an issue. In this paper we propose an algorithm based on word position that reduces the size of the candidate set to be searched and increases efficiency and effectiveness for partial document relevance. In our experiments, the candidate set searched for a query was reduced to as little as 12% of the size of the document set, which decreases search time. The results also show higher accuracy, so the user does not waste time waiting for a query and receiving unwanted documents.

Keywords: Inverted Index, Near-Duplicate Document, Partial Document, Document Relevance, Duplicates Detection.
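
The abstract does not spell out the algorithm, so the following is only a minimal sketch of the general idea it names: a positional inverted index used to shrink the candidate set before any detailed comparison. It is not the authors' method. The function names (tokenize, build_positional_index, candidates), the min_run parameter, and the sample documents are all hypothetical and chosen for illustration.

```python
from collections import defaultdict


def tokenize(text):
    # Simplifying assumption: lowercase and split on whitespace only.
    return text.lower().split()


def build_positional_index(docs):
    """Map each word to {doc_id: [positions where the word occurs]}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, word in enumerate(tokenize(text)):
            index[word][doc_id].append(pos)
    return index


def candidates(index, query, min_run=3):
    """Return ids of documents that contain at least `min_run` consecutive
    query words at consecutive positions (a hypothetical filtering rule)."""
    words = tokenize(query)
    found = set()
    for start in range(len(words) - min_run + 1):
        window = words[start:start + min_run]
        # Only documents containing the first word of the window can match.
        for doc_id, positions in index.get(window[0], {}).items():
            if doc_id in found:
                continue
            for p in positions:
                # Check that each following word sits at the next position.
                if all(
                    (p + k) in index.get(window[k], {}).get(doc_id, ())
                    for k in range(1, min_run)
                ):
                    found.add(doc_id)
                    break
    return found


# Illustrative data, not taken from the paper.
docs = {
    "d1": "the quick brown fox jumps over the lazy dog",
    "d2": "a quick brown fox leaped over a sleeping dog",
    "d3": "inverted index structures speed up retrieval",
}
index = build_positional_index(docs)
print(candidates(index, "quick brown fox jumps"))
```

In this sketch only the documents that share a run of words with the query survive as candidates, so any more expensive similarity check would run on a subset of the collection rather than on every document.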


Gaudence Uwamahoro
Gaudence Uwamahoro received a Master's degree in Engineering in Computer Science and Technology from Central South University in 2010. She is currently working towards her Ph.D. degree at the School of Information Science and Engineering, Central South University, China. Her research interests include information systems, database technology and data mining.

Zuping Zhang received the Ph.D. degree in Information Science and Engineering from Central South University in 2005. He is now a Professor in the School of Information Science and Engineering, Central South University. His current research interests include information fusion and information systems, parameter computing and biology computing.

IJCSI is a refereed open access international journal for scientific papers dealing in all areas of computer science research...
