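Content extracted from web pages is useful in many fields, such as data mining and information retrieval. To extract information from newspaper and magazine websites, we will use the newspaper library.
The main purpose of this library is to extract and curate articles from newspapers and similar websites.
To install the newspaper library, run the following in your terminal: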
$ pip install newspaper3k
For the lxml dependency, run the following command in your terminal:
$ pip install lxml
To install PIL, run:
$ pip install Pillow
Then download the NLP corpora:
$ curl https://raw.githubusercontent.com/codelucas/newspaper/master/download_corpora.py | python
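Before moving on, it can help to confirm that the packages installed above are importable. The following is a minimal sketch, not part of the newspaper library itself: the helper name `check_modules` and the list `REQUIRED_MODULES` are made up for illustration, and it assumes the usual import names (`newspaper` for newspaper3k, `PIL` for Pillow, plus `nltk`, which the corpora script relies on).

```python
from importlib.util import find_spec

# Import names differ from the pip package names for two of these:
# newspaper3k is imported as `newspaper`, and Pillow is imported as `PIL`.
REQUIRED_MODULES = ["newspaper", "lxml", "PIL", "nltk"]


def check_modules(names):
    """Return a dict mapping each top-level module name to True if it can be found."""
    return {name: find_spec(name) is not None for name in names}


if __name__ == "__main__":
    for name, found in check_modules(REQUIRED_MODULES).items():
        print(f"{name}: {'installed' if found else 'MISSING'}")
```

Any module reported as MISSING can be installed with the matching pip command above.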
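The Python newspaper library collects the information associated with an article. This includes the authors' names, the main image in the article, the publication date, the videos in the article, the keywords that describe the article, and a summary of the article.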
>>> # Import the required library
>>> from newspaper import Article
>>> # URL of the article you want to extract
>>> url = "https://www.wsj.com/articles/lawmakers-to-resume-stalled-border-security-talks-11549901117"
>>> # Download the article
>>> article = Article(url)
>>> article.download()
>>> # Parse the article and fetch the authors' names
>>> article.parse()
>>> print(article.authors)
['Kristina Peterson', 'Andrew Duehren', 'Natalie Andrews', 'Kristina.Peterson Wsj.Com', 'Andrew.Duehren Wsj.Com', 'Natalie.Andrews Wsj.Com']
>>> # Extract the publication date
>>> print("Article publication date:")
>>> print(article.publish_date)
>>> # Extract the URL of the main image
>>> print(article.top_image)
https://images.wsj.net/im-51122/social
>>> # Run the NLP step; keywords and summary are populated by nlp()
>>> article.nlp()
>>> # Extract keywords using NLP
>>> print("Keywords in the article", article.keywords)
>>> # Extract the summary of the article
>>> print("Article Summary", article.summary)
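When many articles are processed, it is often convenient to gather these attributes into one record. The sketch below is only an illustration: `article_to_dict` and `FIELDS` are hypothetical names, and the `demo` object is a stand-in with made-up values so the example runs without network access. In real use you would pass an `Article` on which `download()`, `parse()` and `nlp()` have already been called.

```python
from types import SimpleNamespace

# Attribute names read from the article object (videos are exposed as `movies`).
FIELDS = ("authors", "publish_date", "top_image", "movies", "keywords", "summary")


def article_to_dict(article):
    """Copy the metadata fields from an article-like object into a plain dict.

    Attributes that are missing become None instead of raising AttributeError.
    """
    return {field: getattr(article, field, None) for field in FIELDS}


# Stand-in object with placeholder values, not real extraction results.
demo = SimpleNamespace(
    authors=["Jane Doe"],
    publish_date=None,
    top_image="https://example.com/lead.jpg",
    keywords=["border", "security"],
    summary="A one-sentence placeholder summary.",
)

record = article_to_dict(demo)
print(record["authors"])
print(record["movies"])  # the stand-in defines no `movies` attribute
```

A plain dict like this is easy to serialize or load into a table, which is one reason to prefer it over passing the article object around.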
Here is the complete program:
from newspaper import Article

url = "https://www.wsj.com/articles/lawmakers-to-resume-stalled-border-security-talks-11549901117"
article = Article(url)
article.download()  # fetch the page
article.parse()  # extract authors, publication date and main image
print(article.authors)
print("Article publication date:")
print(article.publish_date)
print("Main image in the article:")
print(article.top_image)
article.nlp()  # populate keywords and summary
print("Keywords in the article")
print(article.keywords)
print("Article Summary")
print(article.summary)

Output:

['Kristina Peterson', 'Andrew Duehren', 'Natalie Andrews', 'Kristina.Peterson Wsj.Com', 'Andrew.Duehren Wsj.Com', 'Natalie.Andrews Wsj.Com']
Article publication date:
None
Main image in the article:
https://images.wsj.net/im-51122/social
Keywords in the article
['state', 'spending', 'sweeping', 'southern', 'security', 'border', 'principle', 'lawmakers', 'avoid', 'shutdown', 'reach', 'weekendthe', 'fund', 'trump', 'union', 'agreement', 'wall']
Article Summary
President Trump made the case in his State of the Union address for the construction of a wall along the southern U.S. border, calling it a “moral issue."
Photo: GettyWASHINGTON—Senior lawmakers said Monday night they had reached an agreement in principle on a sweeping deal to end a monthslong fight over border security and avoid a partial government shutdown this weekend.
The top four lawmakers on the House and Senate Appropriations Committees emerged after three closed-door meetings Monday and announced that they had agreed to a framework for all seven spending bills whose funding expires at 12:01 a.m. Saturday.
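Note that the authors list in this output contains entries such as 'Kristina.Peterson Wsj.Com', which look like residue from email addresses rather than names. One possible way to filter them is sketched below. This is a heuristic and not a feature of the newspaper library: the helper `clean_authors` and the pattern `DOMAIN_LIKE` are made up for illustration, and the pattern only recognizes a few common domain endings.

```python
import re

# Matches a token that looks like a domain name, e.g. "Wsj.Com".
DOMAIN_LIKE = re.compile(r"\b[\w-]+\.(?:com|org|net)\b", re.IGNORECASE)


def clean_authors(authors):
    """Drop author entries that contain a domain-like token."""
    return [name for name in authors if not DOMAIN_LIKE.search(name)]


raw_authors = [
    "Kristina Peterson",
    "Andrew Duehren",
    "Natalie Andrews",
    "Kristina.Peterson Wsj.Com",
    "Andrew.Duehren Wsj.Com",
    "Natalie.Andrews Wsj.Com",
]
print(clean_authors(raw_authors))
```

For the list above, this keeps only the three plain names. Initials such as "J.K." pass through because they do not end in one of the listed domain suffixes, but other sites may need a different rule.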