Research on Web Page Information Extraction Based on Text Tag Attributes - Journal of Wuhan Polytechnic (《武汉职业技术学院学报》)

Article Info

Title:: Research on Information Extraction of Webpage Based on Text Tagging Attributes

Keywords:: HTML DOM Tree; text tagging attributes; Web news; information extraction

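Abstract (translated from the Chinese original):: As the Internet has developed rapidly, the volume of information available online has exploded, and extracting the information one actually needs from this mass of data has become increasingly difficult. Building on an analysis of the current state of Web information extraction technology and the challenges it faces, this paper designs a Web news information extraction model based on text tag attributes. It introduces the tag-based Web information extraction algorithm, gives the concrete implementation process, describes a text tag filtering algorithm based on DOM tree node traversal, and reports extraction experiments on mainstream news websites that verify the feasibility of the algorithm.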

Abstract:: With the rapid development of the Internet, online information resources have grown explosively, and it has become increasingly difficult to extract the information we need from such a large volume of data. After studying existing Web information extraction technology and the challenges it faces, we design a Web news information extraction model based on text tag attributes. This paper introduces the tag-based Web information extraction technique, presents the specific implementation process, describes a text tag filtering algorithm based on DOM tree node traversal, and carries out extraction experiments on mainstream news sites to verify the feasibility of the algorithm.
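
The abstract does not spell out the filtering rules, so the following is only a minimal sketch of the general idea of tag-based text filtering, not the paper's algorithm. It assumes that body text lives in a small set of text-bearing tags and that some container tags hold only noise; the tag sets `TEXT_TAGS` and `NOISE_TAGS`, the class `TextTagFilter`, and the function `extract_text` are all hypothetical names chosen for illustration. For brevity it walks the tags in document order with Python's standard `html.parser` rather than building an explicit DOM tree.

```python
from html.parser import HTMLParser

# Hypothetical tag sets for illustration only; the paper's actual
# tag-attribute rules are not given in the abstract.
TEXT_TAGS = {"p", "h1", "h2", "h3"}
NOISE_TAGS = {"script", "style", "nav", "footer", "aside"}


class TextTagFilter(HTMLParser):
    """Collect text found inside text-bearing tags and outside noise tags."""

    def __init__(self):
        super().__init__()
        self.noise_depth = 0   # > 0 while inside any noise container
        self.text_depth = 0    # > 0 while inside any text-bearing tag
        self.blocks = []       # extracted text blocks, in document order
        self._buf = []         # pieces of the block currently being read

    def handle_starttag(self, tag, attrs):
        if tag in NOISE_TAGS:
            self.noise_depth += 1
        elif tag in TEXT_TAGS:
            self.text_depth += 1

    def handle_endtag(self, tag):
        if tag in NOISE_TAGS and self.noise_depth > 0:
            self.noise_depth -= 1
        elif tag in TEXT_TAGS and self.text_depth > 0:
            self.text_depth -= 1
            if self.text_depth == 0:
                # Normalize whitespace and keep non-empty blocks only.
                text = " ".join("".join(self._buf).split())
                if text:
                    self.blocks.append(text)
                self._buf = []

    def handle_data(self, data):
        # Keep data only when inside a text tag and not inside noise.
        if self.noise_depth == 0 and self.text_depth > 0:
            self._buf.append(data)


def extract_text(html):
    """Return the list of text blocks that survive the tag filter."""
    parser = TextTagFilter()
    parser.feed(html)
    parser.close()
    return parser.blocks
```

Called on a page string, `extract_text` returns the headline and paragraph text while skipping navigation, scripts, and footers. The sketch assumes well-formed markup with explicit closing tags; real news pages would likely need a tolerant DOM parser and site-specific rules.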