News Corpus Builder
A simple module for quickly building a corpus from news articles. The generated corpus can be stored in a SQLite database or as flat files.
Use Case
News Corpus Builder can be used to quickly and reliably generate a corpus specific to the user's interests or topics. This allows the user to focus on using the generated corpus in a variety of Natural Language Processing and Machine Learning tasks instead of crawling web pages.
How To Use
To install the module, simply grab it from PyPI:
pip install news-corpus-builder
Using:
from news_corpus_builder import NewsCorpusGenerator
# Location to save generated corpus
corpus_dir = '/Users/skillachie/finance_corpus'
# Save results to sqlite or files per article
ex = NewsCorpusGenerator(corpus_dir,'sqlite')
# Retrieve 50 links related to the search term dogs and assign a category of Pet to the retrieved links
links = ex.google_news_search('dogs','Pet',50)
# Generate and save corpus
ex.generate_corpus(links)
Please see example.py that was used to generate a corpus with over 2500 articles across 11 finance topics.
Limitations
- The module currently uses Google News as the source to obtain the links for the relevant articles.
- Google News returns at most 100 articles per search term. To build a bigger corpus, you can specify multiple related terms or run the module again the following day to add the new articles to the previously generated corpus.
- Additional sources can be added.
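The multiple-terms workaround for the 100-article limit can be sketched as below. This is a minimal sketch, not part of the module: `collect_links` is a hypothetical helper, and it assumes that `google_news_search` returns a list-like collection of links that can be concatenated across calls, as the usage example suggests.

```python
def collect_links(generator, terms, category, per_term=100):
    """Gather links for several related search terms under one category.

    `generator` is expected to be a NewsCorpusGenerator (or any object with a
    compatible google_news_search(term, category, count) method).
    Assumption: the method returns an iterable of links.
    """
    all_links = []
    for term in terms:
        # Same call shape as in the usage example: term, category, count
        all_links.extend(generator.google_news_search(term, category, per_term))
    return all_links


# Intended usage (requires the installed module and network access):
# from news_corpus_builder import NewsCorpusGenerator
# ex = NewsCorpusGenerator('/Users/skillachie/finance_corpus', 'sqlite')
# links = collect_links(ex, ['dogs', 'puppies', 'dog breeds'], 'Pet')
# ex.generate_corpus(links)
```

Related terms may return overlapping articles. Whether duplicates need to be filtered out before calling `generate_corpus` depends on the structure of the returned links, which this sketch does not assume.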
Support or Contact
Feel free to contribute or send suggestions and feature requests. If you run into any issues, feel free to send a message to @skillachie.