Converting HTML to plain text with Python

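I wanted to extract all of the text from a batch of scraped web pages and save it as txt files.
How it works:
1. Match the content between all DIV tags
2. Strip the HTML tags from that content
3. Remove extra line breaks, spaces, tabs and &nbsp; entities
4. Check whether the remaining length exceeds 300 characters, which suggests the block is body text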
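
The four steps can be sketched on a single in-memory string. This is a minimal illustration, not the script itself: the function name html_to_text, the sample markup, and the exact patterns are assumptions made for the example.

```python
import re

def html_to_text(htm, word_len=300):
    """Return the text of every DIV block longer than word_len characters."""
    new_text = ""
    # Step 1: capture what sits between a DIV tag and the next DIV tag
    for _attr, text, _flag in re.findall(r"<div([^>]+)?>([\s\S]+?)<([\/]+)?div", htm):
        # Step 2: drop script/style blocks, then turn remaining tags into newlines
        text = re.sub(r"<script([^>]+)?>([\s\S]+?)<\/script>", "", text)
        text = re.sub(r"<style([^>]+)?>([\s\S]+?)<\/style>", "", text)
        text = re.sub(r"<([^>]+)>", "\n", text)
        # Step 3: remove &nbsp; and collapse each whitespace run into one newline
        text = text.replace("&nbsp;", "")
        text = re.sub(r"\s+", "\n", text)
        # Step 4: keep the block only if it is long enough to be body text
        if len(text) > word_len:
            new_text += text
    return new_text

# Hypothetical page: a short navigation DIV followed by a long content DIV
sample = "<div class='nav'><a href='/'>Home</a></div><div><p>" + "body text " * 50 + "</p></div>"
result = html_to_text(sample)
```

Here the navigation block is discarded because its cleaned text is far shorter than 300 characters, while the long block survives with its tags removed.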

In testing, this turns most web pages and blog posts into clean txt files.
The source code follows.

#coding:utf-8
"""
Convert HTML to txt
2016-02-06 11:36
"""
import os
import re

# Minimum cleaned length (in characters) for a DIV to count as body text
word_len = 300

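for root, dirs, files in os.walk(os.getcwd()):
	for name in files:
		path = os.path.join(root, name)
		ext = path.split(".")[-1]
		if ext in ["htm", "html"]:
			# Assumes the pages are UTF-8; undecodable bytes are dropped
			with open(path, encoding="utf-8", errors="ignore") as f:
				htm = f.read()
			newText = ""
			# Find every DIV, strip the HTML inside it, then check the length
			divs = re.findall(r"<div([^>]+)?>([\s|\S]+?)<([\/]+)?div", htm)
			for attr, text, flag in divs:
				# Remove JavaScript and style blocks
				text = re.sub(r"<script([^>]+)?>([\s|\S]+?)<\/script>", "", text)
				text = re.sub(r"<style([^>]+)?>([\s|\S]+?)<\/style>", "", text)
				# Turn the remaining HTML tags into line breaks
				text = re.sub(r"<([^>]+)>", "\r\n", text)
				# Remove tabs, &nbsp; entities and carriage returns
				text = text.replace("\t", "")
				text = text.replace("&nbsp;", "")
				text = text.replace("\r", "")
				# Collapse every run of whitespace into a single newline
				text = re.sub(r"[\s]+", "\n", text)
				text = re.sub(r"[ ]+", "", text)
				# Keep only blocks long enough to be body text
				if word_len < len(text):
					newText += text
			# Write the result next to the source file, with a .txt extension
			with open(path[:-len(ext)] + "txt", "w", encoding="utf-8") as out:
				out.write(newText)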
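
Regular expressions tend to struggle with nested DIV blocks, so an alternative worth knowing is the standard library's html.parser module. The sketch below is not the method used in the script above: it swaps in HTMLParser, skips script and style content, and leaves out the per-DIV length filter. The class name TextExtractor, the helper parse_text, and the sample page are hypothetical.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, ignoring anything inside <script> or <style>."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &nbsp; before handle_data
        super().__init__(convert_charrefs=True)
        self.skip_depth = 0   # greater than 0 while inside <script> or <style>
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # split() with no argument also splits on non-breaking spaces
        words = data.split()
        if words and self.skip_depth == 0:
            self.chunks.append(" ".join(words))

def parse_text(htm):
    """Return the visible text of htm, one line per text node."""
    parser = TextExtractor()
    parser.feed(htm)
    parser.close()
    return "\n".join(parser.chunks)

# Hypothetical page with a script and a nested DIV
page = "<div><script>var a = 1;</script><div><p>Inner&nbsp;text</p></div>Outer text</div>"
extracted = parse_text(page)
```

Because the parser tracks tags as events rather than matching spans, text inside and after a nested DIV is both kept. A length check like the one above could be added per block if navigation and sidebar text needs filtering out.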
