International Journal of Computer Science Issues

Web Scraping Framework based on Combining Tag and Value Similarity

Shridevi Appayya Swami

When a user submits a query to a data-intensive web site, the response is a dynamically generated web page containing Query Relevant Records (QRRs). Alongside the data the user wants, such a page carries irrelevant content such as advertisements and navigational panels. Deciding which region of the result page contains the relevant data is easy for a human but not for a computer program. To make use of this data, the irrelevant content must be removed and the QRRs extracted automatically from the result pages, so that they can feed value-added services such as comparison shopping, data integration, and meta-querying. This paper discusses various automatic data extraction techniques and proposes a new approach that uses tag similarity and value similarity together to extract QRRs automatically from a query result page and to align the extracted QRRs in a structured format, e.g. a table, where they can be easily aggregated and compared. The proposed approach faces two challenges: handling QRRs that are not contiguous, since a query result page often contains auxiliary information irrelevant to the query, and aligning the data values of the extracted records into a table so that the values for the same attribute in each record are placed in the same column.

Keywords: Data aggregation, Data integration, Data scraping, Data values alignment, Wrapper.
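
The idea of combining tag and value similarity, and of aligning extracted records into columns, can be illustrated with a minimal sketch. It is not the method of the paper, whose details are not given in this abstract. Everything in it is an assumption made for illustration: a record is modeled as a list of (tag path, text value) pairs, tag similarity is a sequence-matching ratio over tag paths, value similarity is a Jaccard overlap of lowercased tokens, the two scores are mixed with a hypothetical weight `tag_weight`, and alignment simply groups values by tag path.

```python
from difflib import SequenceMatcher

# Assumption: a record is a list of (tag_path, text_value) pairs,
# e.g. [("div/a", "Widget A"), ("div/span", "$10")].


def tag_similarity(rec_a, rec_b):
    """Similarity of the two records' tag-path sequences, in [0, 1]."""
    tags_a = [tag for tag, _ in rec_a]
    tags_b = [tag for tag, _ in rec_b]
    return SequenceMatcher(None, tags_a, tags_b).ratio()


def value_similarity(rec_a, rec_b):
    """Jaccard overlap of the lowercased tokens in the records' values."""
    tokens_a = {tok.lower() for _, val in rec_a for tok in val.split()}
    tokens_b = {tok.lower() for _, val in rec_b for tok in val.split()}
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def combined_similarity(rec_a, rec_b, tag_weight=0.5):
    """Weighted mix of tag and value similarity (the weight is hypothetical)."""
    return (tag_weight * tag_similarity(rec_a, rec_b)
            + (1.0 - tag_weight) * value_similarity(rec_a, rec_b))


def align_records(records):
    """Place values that share a tag path in the same column.

    Returns (columns, rows); a missing value becomes an empty string.
    If a tag path repeats inside one record, the last value is kept.
    """
    columns = []
    for rec in records:
        for tag, _ in rec:
            if tag not in columns:
                columns.append(tag)
    rows = []
    for rec in records:
        values = dict(rec)
        rows.append([values.get(col, "") for col in columns])
    return columns, rows


if __name__ == "__main__":
    product_1 = [("div/a", "Widget A"), ("div/span", "$10")]
    product_2 = [("div/a", "Widget B"), ("div/span", "$12")]
    advert = [("div/img", "Buy now")]
    print(combined_similarity(product_1, product_2))
    print(combined_similarity(product_1, advert))
    print(align_records([product_1, product_2]))
```

In this sketch two product records with the same tag structure score higher than a product record compared with an advertisement block, which is the kind of signal that could separate QRRs from auxiliary content even when the QRRs are not contiguous. A real extractor would derive the records from the page's tag tree and would need a more careful alignment step than grouping by tag path.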


ABOUT THE AUTHOR

Shridevi Appayya Swami
Student of Master of Computer Engineering
