News Corpus Builder

A simple module for quickly building a corpus from news articles. The generated corpus can be stored in a SQLite database or as flat files.

Use Case

News Corpus Builder can be used to quickly and reliably generate a corpus specific to the user's interests or topics. This allows the user to focus on using the generated corpus in a variety of Natural Language Processing and Machine Learning tasks instead of crawling web pages.

How To Use

To install the module, simply grab it from PyPI:

pip install news-corpus-builder


from news_corpus_builder import NewsCorpusGenerator

# Location to save generated corpus
corpus_dir = '/Users/skillachie/finance_corpus'

# Save results to sqlite or files per article
ex = NewsCorpusGenerator(corpus_dir, 'sqlite')

# Retrieve 50 links related to the search term dogs and assign a category of Pet to the retrieved links
links = ex.google_news_search('dogs', 'Pet', 50)

# Generate and save corpus

Please see the example that was used to generate a corpus with over 2,500 articles across 11 finance topics.
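
Once a corpus has been saved with the 'sqlite' option, it can be inspected with Python's standard sqlite3 module. The table layout of the generated database is not described here, so the sketch below assumes nothing about it and simply lists every table it finds along with its row count. The function name summarize_sqlite and the database path in the comment are hypothetical and used only for illustration.

```python
import sqlite3


def summarize_sqlite(db_path):
    """Return a dict mapping each table name in the database to its row count."""
    # Note: sqlite3.connect creates an empty database if the file does not exist.
    conn = sqlite3.connect(db_path)
    try:
        # sqlite_master is SQLite's built-in catalog of tables, indexes, etc.
        names = [
            row[0]
            for row in conn.execute(
                "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
            )
        ]
        counts = {}
        for name in names:
            # Quote the identifier so unusual table names do not break the query.
            quoted = '"{}"'.format(name.replace('"', '""'))
            counts[name] = conn.execute(
                "SELECT COUNT(*) FROM {}".format(quoted)
            ).fetchone()[0]
        return counts
    finally:
        conn.close()


# Hypothetical usage; the actual database file name depends on the module:
# print(summarize_sqlite('/path/to/corpus.db'))
```

Listing the tables first is a safe way to discover the schema before writing queries that depend on specific column names.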


Support or Contact

Feel free to contribute, make suggestions, or submit feature requests. If you run into any issues, send a message to @skillachie.