Web Scraping Framework based on Combining Tag and Value Similarity


Shridevi Appayya Swami

When a user submits a query to a data-intensive web site, the response is a dynamically generated web page consisting of Query Relevant Records (QRRs). Along with the data the user wants, these QRRs are surrounded by irrelevant content such as advertisements and navigational panels. Deciding which region of the result page contains the relevant data is easy for a human but not for a computer program. To make use of this data, the irrelevant content must therefore be removed and the QRRs extracted automatically from the result pages, so that they can feed value-added services such as comparison shopping, data integration, and meta-querying. This paper discusses various automatic data extraction techniques and proposes a new approach that uses tag similarity and value similarity together to extract QRRs automatically from a query result page and to align the extracted QRRs in a structured format, e.g. tables, where they can be easily aggregated and compared. The proposed approach faces two challenges: handling QRRs that are not contiguous, since a query result page often contains auxiliary, query-irrelevant information; and aligning the data values of the extracted records into a table so that the data values for the same attribute in each record are placed in the same column.

Keywords: Data aggregation, Data integration, Data scraping, Data values alignment, Wrapper.
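
The idea of combining tag and value similarity can be illustrated with a minimal sketch. It is not the paper's algorithm: the region representation, the `value_shape` abstraction, the equal weighting, the 0.5 threshold, and the positional column alignment are all assumptions chosen for illustration. Each candidate page region is given as a list of tag names plus a list of text values; a region is kept as a QRR if at least one other region resembles it closely enough, which lets the sketch skip an advertisement sitting between two records.

```python
import re
from difflib import SequenceMatcher


def value_shape(text):
    """Abstract a value to a coarse shape: letter runs become 'a', digit runs '9'."""
    shape = re.sub(r"[A-Za-z]+", "a", text)
    return re.sub(r"\d+", "9", shape)


def tag_similarity(tags_a, tags_b):
    """Similarity of two tag-name sequences, between 0.0 and 1.0."""
    return SequenceMatcher(None, tags_a, tags_b).ratio()


def value_similarity(values_a, values_b):
    """Similarity of two value lists, compared by their coarse shapes."""
    shapes_a = [value_shape(v) for v in values_a]
    shapes_b = [value_shape(v) for v in values_b]
    return SequenceMatcher(None, shapes_a, shapes_b).ratio()


def combined_similarity(region_a, region_b, tag_weight=0.5):
    """Weighted mix of tag and value similarity (the weight is an assumption)."""
    tag_score = tag_similarity(region_a["tags"], region_b["tags"])
    value_score = value_similarity(region_a["values"], region_b["values"])
    return tag_weight * tag_score + (1.0 - tag_weight) * value_score


def extract_records(regions, threshold=0.5):
    """Keep regions that resemble at least one other region on the page."""
    kept = []
    for i, region in enumerate(regions):
        if any(
            combined_similarity(region, other) >= threshold
            for j, other in enumerate(regions)
            if j != i
        ):
            kept.append(region)
    return kept


# Hypothetical result page: two product records separated by an advertisement.
regions = [
    {"tags": ["div", "h3", "span", "span"],
     "values": ["Laptop A", "$500", "In stock"]},
    {"tags": ["div", "img", "a"],
     "values": ["Buy now!"]},
    {"tags": ["div", "h3", "span", "span"],
     "values": ["Laptop B", "$650", "In stock"]},
]

records = extract_records(regions)

# Naive alignment: the i-th value of every record goes into column i.
table = [record["values"] for record in records]
for row in table:
    print(" | ".join(row))
```

Positional alignment only works here because both records have the same number of values in the same order; records with missing or optional attributes would need a real alignment step.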



ABOUT THE AUTHOR

Shridevi Appayya Swami
Student of Master of Computer Engineering


About IJCSI

IJCSI is a refereed open access international journal for scientific papers dealing in all areas of computer science research...
