BeautifulSoup is a Python library for extracting data from HTML and XML files. It can also be used to remove empty tags from an HTML or XML document, leaving markup that is easier to read.
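First, install the BeautifulSoup library in your local environment with the following command:
pip install beautifulsoup4
The example below uses the lxml parser, which is a separate package. If it is not already installed, add it with:
pip install lxml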
# Import the BeautifulSoup library
from bs4 import BeautifulSoup

# The HTML document to clean
html_object = """
<p>Python is an interpreted, high-level and general-purpose programming language. Python's design philosophy emphasizes code readability with its notable use of significant indentation.</p>
"""

# Create the soup for the given HTML document
soup = BeautifulSoup(html_object, "lxml")

# Visit every tag in the document and remove those that contain no text
for x in soup.find_all():
    if len(x.get_text(strip=True)) == 0:
        x.extract()

print(soup)

Output
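Running the code above prints the document with any tags that contain no text removed. The sample paragraph contains text, so it is kept. Because the lxml parser builds a complete document, the result is wrapped in <html> and <body> tags.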
<html><body><p>Python is an interpreted, high-level and general-purpose programming language. Python's design philosophy emphasizes code readability with its notable use of significant indentation.</p> </body></html>
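The sample document above has no empty tags, so the effect of the loop is hard to see. The following is a minimal sketch of the same technique applied to markup that does contain empty tags. The sample markup, the VOID_TAGS list, and the remove_empty_tags helper are assumptions made for illustration and are not part of the original example. The sketch uses Python's built-in "html.parser" so no extra parser is needed, and it skips elements such as <img> and <br>, which never contain text and would otherwise be removed by the text-length check.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup with an empty <p> and a <div> holding only whitespace.
HTML = "<div><p>Hello</p><p></p><div><span> </span></div><img src='a.png'/></div>"

# Assumption: elements that are legitimately text-free and should be kept.
VOID_TAGS = ["img", "br", "hr", "input", "meta", "link"]


def remove_empty_tags(markup: str) -> str:
    """Return the markup with text-free tags removed (illustrative helper)."""
    soup = BeautifulSoup(markup, "html.parser")
    for tag in soup.find_all():
        # Keep void elements and any tag that contains one.
        if tag.name in VOID_TAGS or tag.find(VOID_TAGS) is not None:
            continue
        # A tag whose text is empty after stripping whitespace counts as empty.
        if len(tag.get_text(strip=True)) == 0:
            tag.extract()
    return str(soup)


cleaned = remove_empty_tags(HTML)
print(cleaned)
```

Here the empty <p> and the whitespace-only <div> (together with its <span>) are dropped, while the paragraph with text and the image are kept. Whether void elements should survive depends on the use case, so the VOID_TAGS list is only one possible choice.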