News Corpus Builder

A simple module that can be used to quickly build a corpus from news articles. The generated corpus can be stored in a sqlite database or as flat files.

Use Case

News Corpus Builder can be used to generate a corpus specific to the user's interests or topics quickly and reliably. This allows the user to focus on using the generated corpus in a variety of Natural Language Processing and Machine Learning tasks instead of crawling web pages.

How To Use

To install the module, simply grab it from PyPI:

pip install news-corpus-builder

Usage:

from news_corpus_builder import NewsCorpusGenerator

# Location to save generated corpus
corpus_dir = '/Users/skillachie/finance_corpus'

# Save results to a sqlite database or as one file per article
ex = NewsCorpusGenerator(corpus_dir, 'sqlite')

# Retrieve 50 links related to the search term dogs and assign a category of Pet to the retrieved links
links = ex.google_news_search('dogs', 'Pet', 50)

# Generate and save corpus
ex.generate_corpus(links)

See example.py, which was used to generate a corpus of over 2,500 articles across 11 finance topics.
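
Building a corpus that spans several topics amounts to repeating the search-and-save steps above for each search term. The following is a minimal sketch that relies only on the two methods shown in the usage example; the build_corpus helper and the topics mapping are hypothetical names introduced here for illustration and are not part of the module, and example.py may be organized differently.

```python
def build_corpus(generator, topics, links_per_term=50):
    """Search every term under its category and save the retrieved articles.

    `generator` is any object exposing google_news_search(term, category, count)
    and generate_corpus(links), as NewsCorpusGenerator does in the usage above.
    `topics` maps a category name to a list of search terms (hypothetical layout).
    Returns the number of searches performed.
    """
    searches = 0
    for category, terms in topics.items():
        for term in terms:
            # Same two calls as in the single-topic usage above.
            links = generator.google_news_search(term, category, links_per_term)
            generator.generate_corpus(links)
            searches += 1
    return searches


# Example call (the categories and terms are made up for illustration):
#
#   from news_corpus_builder import NewsCorpusGenerator
#   ex = NewsCorpusGenerator('/Users/skillachie/finance_corpus', 'sqlite')
#   build_corpus(ex, {'Pet': ['dogs', 'cats']}, links_per_term=50)
```

Saving after each search, rather than collecting all links first, avoids any assumption about the type of value that google_news_search returns.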

Limitations

Support or Contact

Feel free to contribute, make suggestions, or request features. If you run into any issues, send a message to @skillachie.