Information Extraction and Webpage Understanding


M.Sharmila Begum, L.Dinesh and P.Aruna

The two most important tasks in information extraction from the Web are webpage structure understanding and natural language sentence processing. However, little work has been done toward an integrated statistical model that both understands webpage structure and processes the natural language sentences within the HTML elements. Our recent work on webpage understanding introduced a joint model of Hierarchical Conditional Random Fields (HCRFs) and extended Semi-Markov Conditional Random Fields (Semi-CRFs) that leverages the results of page structure understanding in free-text segmentation and labeling. In this top-down integration model, the decision of the HCRF model can guide the decision making of the Semi-CRF model. The drawback of the top-down integration strategy is also apparent: the decision of the Semi-CRF model cannot be used by the HCRF model to guide its own decision making. This paper proposes a novel framework called WebNLP, which enables bidirectional integration of page structure understanding and text understanding in an iterative manner. We have applied the proposed framework to local business entity extraction and to Chinese person and organization name extraction. Experiments show that the WebNLP framework achieves significantly better performance than existing methods.

Keywords: Natural language processing, Webpage understanding, Information Extraction, Conditional Random Fields
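
The difference between top-down and bidirectional integration described in the abstract can be illustrated with a small sketch. This is not the WebNLP implementation: the two labelers below are hypothetical rule-based stand-ins for the HCRF and Semi-CRF models, and the names `iterate_bidirectional`, `label_structure`, `label_text`, and the toy page are assumptions made only for illustration. The sketch shows just the control flow, in which each model's output is fed back to the other until neither changes.

```python
from typing import Callable, List, Tuple

Labels = List[str]

def iterate_bidirectional(
    label_structure: Callable[[Labels], Labels],
    label_text: Callable[[Labels], Labels],
    init_text: Labels,
    max_iters: int = 10,
) -> Tuple[Labels, Labels, int]:
    """Alternate the two labelers until both outputs stop changing."""
    text = init_text
    struct = label_structure(text)
    for i in range(1, max_iters + 1):
        new_text = label_text(struct)            # structure guides text
        new_struct = label_structure(new_text)   # text guides structure
        if new_text == text and new_struct == struct:
            return struct, text, i
        struct, text = new_struct, new_text
    return struct, text, max_iters

# Toy page: one text line per HTML element (hypothetical data).
ELEMENTS = ["12 Main St", "Springfield"]

def label_text(struct: Labels) -> Labels:
    # Stand-in for the text model: a line is address text if it
    # contains a digit or sits inside a block labeled as an address.
    return [
        "ADDR" if any(c.isdigit() for c in line) or s == "ADDR_BLOCK" else "O"
        for line, s in zip(ELEMENTS, struct)
    ]

def label_structure(text: Labels) -> Labels:
    # Stand-in for the structure model: an element is an address block
    # if its own text, or the text of the element before it, is address text.
    return [
        "ADDR_BLOCK" if t == "ADDR" or (i > 0 and text[i - 1] == "ADDR") else "OTHER"
        for i, t in enumerate(text)
    ]

no_text_labels = ["O"] * len(ELEMENTS)

# One top-down pass: structure is decided once, then text is labeled.
top_down_text = label_text(label_structure(no_text_labels))

# Bidirectional iteration: decisions flow in both directions.
final_struct, final_text, n_iters = iterate_bidirectional(
    label_structure, label_text, no_text_labels
)
```

On this toy input the single top-down pass labels only the first line as address text, because the structure labeler had no text evidence when it ran. The iterative loop lets the text decision on the first line change the structure decision for the second element, which in turn changes the text label of the second line.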


ABOUT THE AUTHORS

M.Sharmila Begum
Sharmila Begum received the M.E. degree in Computer Science and Engineering. She is currently working as an Assistant Professor in the Department of Software Engineering at Periyar Maniammai University, Thanjavur, Tamil Nadu, India. She has presented several papers at international conferences, published a few papers in the PMU journal, and published a book titled Design and Analysis of Algorithms. Her research areas are Data Mining, Bio-Medical, OOAD, Networking, and Web Programming.

L.Dinesh
Dinesh received the M.Sc. degree (5-year integrated) in Software Engineering. He is currently pursuing his M.E. in Software Engineering at Periyar Maniammai University, Thanjavur, Tamil Nadu, India.

P.Aruna
Aruna received the MCA and M.Phil. degrees in Computer Applications. She is currently working as an Assistant Professor in the Department of Software Engineering at Periyar Maniammai University, Thanjavur, Tamil Nadu, India. She has presented several papers at international conferences, and her research area is Mobile Ad Hoc Networks.


About IJCSI

IJCSI is a refereed open access international journal for scientific papers dealing in all areas of computer science research...
