The goal of this project is to construct a dataset of Twitter posts written by English-language news channels such as CNN, BBC, Fox, and Reuters, along with their political bias labels, and to perform several natural language processing tasks on the collected data.
- Constructed a novel dataset of Twitter posts from well-known news media, labeled with political bias.
- Conducted multiple preprocessing and data analysis experiments on the collected data.
- Ran multiple NLP tasks, including word similarity, named entity recognition (NER), and dependency parsing (DP).
- Trained deep language models, namely GPT-2 as a causal LM and RoBERTa as a masked LM.
- Trained deep classification models using word2vec, LSTM, BERT, and CNN.
- Gained experience running models on a CUDA-enabled GPU on my local machine.
- The structure of the codebase follows the Cookiecutter Data Science template.
- Code for collecting and processing data is placed in the `src/data` directory.
- Code for designing and running models is placed in the `src/models` directory.
- Reports for every phase are placed in the `docs` directory.
- Run the `src.data.make_dataset` module to download and build the dataset.
- Run the `src.data.make_analysis_reports` module to extract the figures and tables needed to compile the LaTeX reports.
- Compile `docs/phase_1_report/report.tex` to make the `report.pdf` of phase 1.
- Compile `docs/phase_2_report/report.tex` to make the `report.pdf` of phase 2.
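
The run and compile steps above could be chained in a small driver script. The following is only a sketch under stated assumptions: the module names and report paths come from the list above, but the helper names (`module_command`, `latex_command`, `build_all`), the use of `pdflatex` as the LaTeX engine, and compiling from inside each report directory are illustrative choices, not something the repository is known to provide.

```python
import subprocess
import sys
from pathlib import Path

# Module names and report paths are taken from the list above.
DATA_MODULES = ["src.data.make_dataset", "src.data.make_analysis_reports"]
REPORTS = [
    Path("docs/phase_1_report/report.tex"),
    Path("docs/phase_2_report/report.tex"),
]


def module_command(module: str) -> list[str]:
    """Build the command that runs a module with the current interpreter."""
    return [sys.executable, "-m", module]


def latex_command(tex_file: Path, engine: str = "pdflatex") -> list[str]:
    """Build the command that compiles a .tex file.

    The engine is an assumption; the project may use a different one.
    """
    return [engine, "-interaction=nonstopmode", tex_file.name]


def build_all(repo_root: Path = Path(".")) -> None:
    """Run the data modules, then compile both reports (hypothetical helper)."""
    for module in DATA_MODULES:
        # Modules are run from the repository root so that `src` is importable.
        subprocess.run(module_command(module), cwd=repo_root, check=True)
    for tex_file in REPORTS:
        # Assumption: each report is compiled from its own directory.
        subprocess.run(
            latex_command(tex_file),
            cwd=repo_root / tex_file.parent,
            check=True,
        )
```

Calling `build_all()` from the repository root would run each step in order and stop at the first failing command, because `check=True` raises `subprocess.CalledProcessError` on a non-zero exit code.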