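BeautifulSoup is a third-party Python library for parsing data from web pages. It helps with web scraping, which is the process of extracting, using, and processing data from different sources.
Web scraping can also be used to extract data for research purposes, to understand and compare market trends, to perform SEO monitoring, and so on.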
BeautifulSoup can be installed on Windows by running the following command:
pip install beautifulsoup4
from urllib.request import urlopen
from bs4 import BeautifulSoup

url = 'https://en.wikipedia.org/wiki/Algorithm'
# Download the raw HTML of the page
html = urlopen(url).read()
print("Reading the web page...")
soup = BeautifulSoup(html, features="html.parser")
print("Parsing the web page...")
# Remove the script and style elements, which hold no readable text
for script in soup(["script", "style"]):
    script.extract()
print("Extracting text from the web page...")
text = soup.get_text()
print("Cleaning the data...")
# Strip leading and trailing whitespace from every line
lines = (line.strip() for line in text.splitlines())
# Split on double spaces so that run-together headlines land on separate lines
chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
# Drop the empty chunks and join the rest, one per line
text = '\n'.join(chunk for chunk in chunks if chunk)
print(text)

Output
Reading the web page...
Parsing the web page...
Extracting text from the web page...
Cleaning the data...
Recursive C implementation of Euclid's algorithm from the above flowchart
Recursion
A recursive algorithm is one that invokes (makes reference to) itself repeatedly until a certain condition (also known as termination condition) matches, which is a method common to functional programming….
…..
Developers
Statistics
Cookie statement
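The required packages are imported.
The URL of the website is defined.
The URL is opened, and the 'script' and 'style' tags, which carry no readable content, are removed.
The get_text function is used to extract the text from the parsed page.
Extra whitespace and empty lines are removed.
The text is printed on the console.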
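
The tag removal, text extraction, and cleanup steps listed above can also be tried without a network connection. The following is a minimal sketch, not a definitive implementation. It assumes a small inline HTML string in place of the Wikipedia page. The names SAMPLE_HTML and extract_clean_text are hypothetical and do not come from the original program. It also uses decompose, which discards the removed tags, where the original program uses extract.

```python
from bs4 import BeautifulSoup

# Hypothetical sample page, used here in place of a live URL (an assumption).
SAMPLE_HTML = """
<html>
  <head>
    <title>Demo</title>
    <style>body { color: red; }</style>
    <script>console.log("hidden");</script>
  </head>
  <body>
    <h1>Algorithms</h1>
    <p>  An algorithm is a finite sequence of steps.  </p>
  </body>
</html>
"""


def extract_clean_text(html):
    """Return the readable text of an HTML string, one chunk per line."""
    soup = BeautifulSoup(html, features="html.parser")
    # Remove <script> and <style> elements so their contents never reach the text.
    for tag in soup(["script", "style"]):
        tag.decompose()
    text = soup.get_text()
    # Strip every line, split on double spaces, and drop the empty chunks.
    lines = (line.strip() for line in text.splitlines())
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    return "\n".join(chunk for chunk in chunks if chunk)


if __name__ == "__main__":
    print(extract_clean_text(SAMPLE_HTML))
```

Wrapping the steps in a function makes it possible to apply the same cleanup to any HTML string, whether it was downloaded with urlopen or read from a local file.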