topic_modeller is a program for training LDA (Latent Dirichlet Allocation)-based topic models, implemented using Python and scikit-learn.
topic_modeller is written in Python, and so a recent version of Python 3 should be downloaded before using it. Downloads for Python can be found at https://www.python.org/downloads/.
topic_modeller depends on a number of external libraries, which are listed in the requirements.txt file in the root directory. To install them from the command line, type:
pip install -r requirements.txt
topic_modeller's entry point is the TopicModeller class, which can be imported and used either in the Python interpreter or as part of your own Python project.
To train a topic model, create an instance of the TopicModeller class and call its build_topic_model method, passing the following arguments:
- input file path - .csv file containing training data to be processed.
- dataset (keyword arg) - the type of dataset being loaded ("abcnews" [default]).
For example:
from topic_modeller import TopicModeller
modeller = TopicModeller()
modeller.build_topic_model("relative/path/to/input.csv", dataset="abcnews")
Once training has completed successfully, a new TopicModel object is created, containing the trained LDA model and the count vectoriser created during training, and is assigned to the TopicModeller instance's topic_model attribute.
To save a TopicModeller instance's trained topic_model (a TopicModel object), call the instance's save_topic_model_with_name method, passing the following argument:
- output model name - name with which to save the trained topic model.
For example:
modeller.save_topic_model_with_name("new_model")
A new directory, output/models/new_model/, will be created, containing topic_model.pkl and vectoriser.pkl.
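The sketch below shows one way a directory with this layout could be written and read back using Python's pickle module. It assumes the .pkl files are ordinary pickles, which the file extension suggests but this README does not confirm. The helpers save_model_files and load_model_files are hypothetical and are not part of topic_modeller.

```python
import pickle
import tempfile
from pathlib import Path


def save_model_files(model, vectoriser, model_name, base_dir="output/models"):
    """Pickle a model and its vectoriser into base_dir/model_name/ (hypothetical helper)."""
    model_dir = Path(base_dir) / model_name
    model_dir.mkdir(parents=True, exist_ok=True)
    with open(model_dir / "topic_model.pkl", "wb") as f:
        pickle.dump(model, f)
    with open(model_dir / "vectoriser.pkl", "wb") as f:
        pickle.dump(vectoriser, f)
    return model_dir


def load_model_files(model_name, base_dir="output/models"):
    """Load the two pickles written by save_model_files (hypothetical helper)."""
    model_dir = Path(base_dir) / model_name
    with open(model_dir / "topic_model.pkl", "rb") as f:
        model = pickle.load(f)
    with open(model_dir / "vectoriser.pkl", "rb") as f:
        vectoriser = pickle.load(f)
    return model, vectoriser


# Placeholder objects stand in for the trained LDA model and count vectoriser.
with tempfile.TemporaryDirectory() as tmp:
    saved_dir = save_model_files(
        {"kind": "lda"}, {"kind": "vectoriser"}, "new_model", base_dir=tmp
    )
    print(sorted(p.name for p in saved_dir.iterdir()))
    # prints ['topic_model.pkl', 'vectoriser.pkl']
    model, vectoriser = load_model_files("new_model", base_dir=tmp)
```

Only unpickle files from sources you trust, since loading a pickle can execute arbitrary code.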