[1] SHEN Na. Research on Information Extraction of Webpage Based on Text Tagging Attributes[J]. Journal of Wuhan Polytechnic, 2016, (01): 62-65.

Research on Information Extraction of Webpage Based on Text Tagging Attributes

Journal of Wuhan Polytechnic [ISSN:1006-6977/CN:61-1281/TN]

Volume:
Issue:
2016, No. 01
Pages:
62-65
Column:
Publication date:
2016-02-28

Article Info

Title:
Research on Information Extraction of Webpage Based on Text Tagging Attributes
Article ID:
1671-931X(2016)01-0062-04
Author(s):
SHEN Na
Suqian Open University, Suqian 223800, Jiangsu, China
Keywords:
HTML DOM tree; text tagging attributes; Web news; information extraction
CLC number:
TP391.1
Document code:
A
Abstract:
With the rapid development of the Internet, the volume of information available online has exploded, and extracting the information one actually needs from this mass of data has become increasingly difficult. Based on an analysis of the current state of Web information extraction technology and the challenges it faces, this paper designs a Web news information extraction model based on text tag attributes. It introduces the tag-based Web information extraction algorithm, gives the concrete implementation process of the extraction, describes a text tag filtering algorithm based on traversing the nodes of the DOM tree, and reports extraction experiments on mainstream news websites that verify the feasibility of the algorithm.
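
The abstract names the technique (filtering text tags while traversing DOM tree nodes) but this record does not give the algorithm itself. The following is only a minimal illustrative sketch of that general idea, not the paper's method. It assumes reasonably well-formed HTML with explicit closing tags, and the tag sets `SKIP_TAGS` and `TEXT_TAGS`, the class `TextTagExtractor`, and the function `extract_news_text` are all hypothetical names chosen for this example.

```python
from html.parser import HTMLParser

# Assumed tag sets; the abstract does not list the tags the paper actually uses.
SKIP_TAGS = {"script", "style", "nav", "footer", "aside", "form"}
TEXT_TAGS = {"h1", "h2", "p"}


class TextTagExtractor(HTMLParser):
    """Collect text found inside TEXT_TAGS, ignoring anything under SKIP_TAGS."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # > 0 while inside a skipped subtree
        self.text_depth = 0  # > 0 while inside a text-bearing tag
        self.buffer = []     # pieces of the current text block
        self.blocks = []     # finished (tag, text) pairs

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag in TEXT_TAGS and self.skip_depth == 0:
            self.text_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1
        elif tag in TEXT_TAGS and self.text_depth > 0:
            self.text_depth -= 1
            if self.text_depth == 0:
                # Join the pieces and normalize whitespace.
                text = " ".join("".join(self.buffer).split())
                if text:
                    self.blocks.append((tag, text))
                self.buffer = []

    def handle_data(self, data):
        # Keep text only when inside a text tag and outside skipped subtrees.
        if self.skip_depth == 0 and self.text_depth > 0:
            self.buffer.append(data)


def extract_news_text(html):
    """Return a list of (tag, text) blocks in document order."""
    parser = TextTagExtractor()
    parser.feed(html)
    parser.close()
    return parser.blocks


sample = """
<html><body>
<nav><p>Home | Sports</p></nav>
<h1>Headline</h1>
<script>var x = 1;</script>
<p>First <b>paragraph</b> of the story.</p>
<footer><p>Copyright</p></footer>
</body></html>
"""

for tag, text in extract_news_text(sample):
    print(tag, text)
```

In this sketch the navigation and footer paragraphs are dropped because they sit under skipped subtrees, while inline markup such as `<b>` inside a paragraph is kept as part of that paragraph's text. A real extractor would likely also weigh tag attributes and text density, which the abstract mentions only in general terms.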

Memo

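Memo:
Received: 2015-05-11. About the author: SHEN Na (born 1984), female, from Suqian, Jiangsu; Master of Engineering; lecturer at Suqian Open University. Research interests: database technology, network security and application technology.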
Last Update: 2016-02-28