Web Scraping Framework based on Combining Tag and Value Similarity
When a user submits a query to a data-intensive web site, the response is a dynamically generated web page consisting of Query Relevant Records (QRRs). Along with the data the user wants, these QRRs are surrounded by irrelevant data such as advertisements and navigational panels. Deciding which region of the result page contains the relevant data is easy for a human but not for a computer program. To make use of this data, the irrelevant data must therefore be removed and the QRRs extracted atomically from the result pages, so that they can be used in value-added services such as comparison shopping, data integration, and meta querying.
This paper discusses various atomic data extraction techniques and proposes a new approach that uses tag similarity and value similarity together to extract QRRs automatically from a query result page and to align the extracted QRRs in a structured format, such as a table, where they can be easily aggregated and compared. The proposed automatic data extraction faces two challenges. The first is handling QRRs that are not contiguous, because a query result page often contains auxiliary information that is irrelevant to the query. The second is aligning the data values of the extracted records into a table so that the values for the same attribute in each record are placed in the same column.
Keywords: Data aggregation, Data integration, Data scraping, Data values alignment, Wrapper.
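The idea of combining tag and value similarity can be illustrated with a minimal sketch. It is not the paper's algorithm: the record representation (a list of `(tag, value)` pairs), the coarse `value_type` labels, the equal weighting, and the 0.8 threshold are all assumptions made for illustration. Candidate blocks are kept when they resemble at least one other block on the page, without requiring adjacency, and the kept records are then placed into columns.

```python
import re
from difflib import SequenceMatcher

def value_type(text):
    """Coarse type label for a data value (hypothetical labelling scheme)."""
    text = text.strip()
    if re.fullmatch(r"[$€£]\s?\d[\d,]*(\.\d+)?", text):
        return "price"
    if re.fullmatch(r"\d[\d,]*(\.\d+)?", text):
        return "number"
    return "text"

def tag_similarity(a, b):
    """Similarity of the two blocks' tag sequences, in [0, 1]."""
    return SequenceMatcher(None, [t for t, _ in a], [t for t, _ in b]).ratio()

def value_similarity(a, b):
    """Similarity of the two blocks' value-type sequences, in [0, 1]."""
    types_a = [value_type(v) for _, v in a]
    types_b = [value_type(v) for _, v in b]
    return SequenceMatcher(None, types_a, types_b).ratio()

def combined_similarity(a, b, tag_weight=0.5):
    """Weighted mix of tag and value similarity (the weight is an assumption)."""
    return tag_weight * tag_similarity(a, b) + (1 - tag_weight) * value_similarity(a, b)

def extract_qrrs(blocks, threshold=0.8):
    """Keep blocks that resemble at least one other block; adjacency is not required."""
    kept = []
    for i, block in enumerate(blocks):
        best = max(
            (combined_similarity(block, other)
             for j, other in enumerate(blocks) if j != i),
            default=0.0,
        )
        if best >= threshold:
            kept.append(block)
    return kept

def align(records):
    """Put values sharing the same (tag, value type) into the same column."""
    columns = []
    for record in records:
        for tag, value in record:
            key = (tag, value_type(value))
            if key not in columns:
                columns.append(key)
    table = []
    for record in records:
        row = {key: "" for key in columns}
        for tag, value in record:
            key = (tag, value_type(value))
            if not row[key]:  # keep the first value seen for this column
                row[key] = value
        table.append([row[key] for key in columns])
    return columns, table

# Made-up page blocks: an advertisement sits between two records.
blocks = [
    [("a", "Camera X"), ("span", "$199.00")],
    [("div", "Advertisement: buy now")],
    [("a", "Camera Y"), ("span", "$249.50")],
    [("a", "Camera Z"), ("span", "$99.99")],
]
records = extract_qrrs(blocks)
columns, table = align(records)
for row in table:
    print(row)
```

In this toy input the advertisement block shares neither the tag sequence nor the value-type pattern of the camera records, so its combined similarity stays below the threshold and it is dropped, even though it interrupts the records. A real result page would need a DOM parser and a more careful alignment step than this column-key heuristic.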
ABOUT THE AUTHOR
Shridevi Appayya Swami
Student of Master of Computer Engineering