Discovering Informative Blocks from Web Pages for Efficient Information Extraction using DOM tree

Rakesh M. Kohale and Shreyash G. Balbudhe
Volume 1: Issue 2, Revised on – 30 June 2020, pp 81-84


Author's Information
Rakesh M. Kohale1 
Corresponding Author
1Datta Meghe Institute of Engineering, Technology and Research, Sawangi, Wardha, India
rakeshkohale77@gmail.com

Shreyash G. Balbudhe2
2Datta Meghe Institute of Engineering, Technology and Research, Sawangi, Wardha, India.


Review Article -- Peer Reviewed
First online on – 30 March 2015, Revised on – 30 June 2020

Open Access article under Creative Commons License

Cite this article – Rakesh M. Kohale and Shreyash G. Balbudhe, “Discovering Informative Blocks from Web Pages for Efficient Information Extraction using DOM tree”, International Journal of Computational and Electronics Aspects in Engineering, RAME Publishers, vol. 1, issue 2, pp. 81-84, 2015, Revised in 2020.
https://doi.org/10.26706/ijceae.1.2.20150108


Abstract:-
A web page generally contains its main data along with navigation panels, advertisements, and copyright and privacy notices. Apart from the main data, these elements carry no important information and can be called non-informative blocks. Because they are non-informative, they can degrade the results of web data mining. To avoid this, it is important to separate the informative blocks (the main data) from the non-informative blocks of a web page. Within a website, non-informative blocks generally appear on many different pages with the same format and the same content. Informative blocks, in contrast, differ from page to page in both content and format. We therefore need a site-level structure that captures the formats shared by the blocks and the data they contain. The DOM tree is a page-level structure, and many tools are available to construct the DOM tree of a web page, but a DOM tree is not useful at site level. We therefore need to construct a Site Style Tree (SST) for the website. By analyzing the SST, we can identify which parts of it are informative and which are non-informative. There is no tool available to construct a style tree for a given website. This work aims at constructing a style tree for a given website and separating the informative blocks from the non-informative blocks of that website.
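
The idea of comparing block format and content across the pages of one site can be illustrated with a minimal Python sketch. It is not the Site Style Tree construction of this work: it is a simplified stand-in that assumes well-nested HTML, indexes every DOM node by its tag path, and flags a block as non-informative only when it occurs at the same path on every page with identical format and text. All names (`Node`, `DomBuilder`, `parse`, `signature`, `collect`, `non_informative_paths`) are hypothetical and chosen for illustration.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Tags that never have a closing tag (a small, non-exhaustive assumption).
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}


class Node:
    """One element of a minimal page-level DOM tree."""

    def __init__(self, tag):
        self.tag = tag
        self.children = []
        self.text = []


class DomBuilder(HTMLParser):
    """Builds a minimal DOM tree; assumes the HTML is well nested."""

    def __init__(self):
        super().__init__()
        self.root = Node("document")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag)
        self.stack[-1].children.append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        if len(self.stack) > 1 and self.stack[-1].tag == tag:
            self.stack.pop()

    def handle_data(self, data):
        if data.strip():
            self.stack[-1].text.append(data.strip())


def parse(html):
    """Return the root of the DOM tree for one page."""
    builder = DomBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def signature(node):
    """Format and content of a subtree as a hashable nested tuple."""
    children = tuple(signature(child) for child in node.children)
    return (node.tag, tuple(node.text), children)


def collect(node, path, table):
    """Record the signature of every subtree under its (tag, index) path."""
    for index, child in enumerate(node.children):
        child_path = path + ((child.tag, index),)
        table[child_path].append(signature(child))
        collect(child, child_path, table)


def non_informative_paths(pages):
    """Paths of topmost blocks that are identical on every page of the site."""
    table = defaultdict(list)
    for html in pages:
        collect(parse(html), (), table)
    shared = {
        path
        for path, signatures in table.items()
        if len(signatures) == len(pages) and len(set(signatures)) == 1
    }
    # Report only the topmost shared block, not each of its descendants.
    return sorted(path for path in shared if path[:-1] not in shared)


if __name__ == "__main__":
    site = [
        "<html><body><div><a>Home</a></div><p>First article</p></body></html>",
        "<html><body><div><a>Home</a></div><p>Second article</p></body></html>",
    ]
    for path in non_informative_paths(site):
        print("/".join(f"{tag}[{index}]" for tag, index in path))
```

The sketch treats exact repetition on all pages as the only signal. A real site-level structure would also have to tolerate blocks that appear on most but not all pages, or whose content varies slightly, which is why a plain page-level DOM comparison like this one is only a starting point.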
Index Terms:-
DOM tree, Site Style Tree, Tokens, Parsing, Informative blocks, Non-informative blocks.