In this project I investigated several approaches to news classification. A corpus of Russian-language news items dedicated to Chinese AI was collected within the "Китайский Искусственный Интеллект | 中国人工智能" news aggregation project. The news items are divided into two classes: "Published" and "Not published". The problem was solved with three natural language processing approaches: TF-IDF with Logistic Regression, BERT, and GPT. A comparative analysis of the results of these approaches was carried out. The repository of this project is located at https://github.com/ArtyomR/AI-News-Classification.
The dataset was created within the "Китайский Искусственный Интеллект | 中国人工智能" news aggregation project. It can be found in ai_articles_prep_231124.xlsx. Each record combines the title and first paragraph of a news item dedicated to Chinese high technologies. Originally the news items were in English or Chinese; they were then translated into Russian with the Google translation engine. An example of the dataset records is shown in the picture below.
After pre-processing and lemmatization the dataset looks as shown in the picture below.
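The pre-processing step can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the stop-word list here is a tiny hand-picked sample, and the real pipeline in 01. Dataset preparation.ipynb presumably also applies a proper Russian lemmatizer (e.g. pymorphy2), which is omitted here.

```python
import re

# Tiny illustrative Russian stop-word list (assumption: the notebook
# uses a fuller list; lemmatization is also omitted in this sketch).
STOPWORDS = {"и", "в", "на", "с", "по", "для", "не", "что", "это"}

def preprocess(text: str) -> str:
    """Lower-case, keep only letters, drop stop words."""
    text = text.lower()
    # Keep Cyrillic/Latin letters and whitespace only.
    text = re.sub(r"[^a-zа-яё\s]", " ", text)
    tokens = [t for t in text.split() if t not in STOPWORDS]
    return " ".join(tokens)

print(preprocess("Китай инвестирует в ИИ!"))
```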
The dataset contains 1711 (55.48%) 'Not published' and 1373 (44.52%) 'Published' news items.
The dataset was split into a training set (2158 records, 70%) and a test set (926 records, 30%).
For the Transformer (rubert-tiny2) model the training set was further split into two subsets: a train subset of 1726 records and a validation subset of 432 records.
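The two-stage split described above can be sketched with scikit-learn. The data here are synthetic stand-ins, and the stratified split and random seed are assumptions; in the project the real sizes are 2158/926 and then 1726/432.

```python
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; in the project, X holds the lemmatized texts
# and y the "Published" / "Not published" labels.
X = [f"news item {i}" for i in range(100)]
y = [i % 2 for i in range(100)]

# 70/30 train/test split, stratified by class (assumption).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

# Further ~80/20 split of the training set for the rubert-tiny2 model
# (1726 / 432 records in the actual dataset).
X_tr, X_val, y_tr, y_val = train_test_split(
    X_train, y_train, test_size=0.2, stratify=y_train, random_state=42)

print(len(X_train), len(X_test), len(X_tr), len(X_val))
```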
- 01. Dataset preparation.ipynb - Jupyter notebook for dataset pre-processing and exploratory data analysis (EDA).
- 02_News_classification_with_TF_IDF_and_Logistic_Regression_v2.ipynb - Jupyter notebook for news classification with TF-IDF and Logistic Regression.
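The TF-IDF + Logistic Regression approach can be sketched as a scikit-learn pipeline. The toy corpus and labels below are illustrative, and the hyperparameters are library defaults, not necessarily those used in the notebook.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy corpus; the real inputs are the lemmatized Russian news texts.
texts = ["политика новости китай", "ии нейросеть модель",
         "политика выборы китай", "нейросеть обучение модель"]
labels = [0, 1, 0, 1]  # 0 = "Not published", 1 = "Published" (illustrative)

# Vectorize with TF-IDF, then classify with Logistic Regression.
clf = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("logreg", LogisticRegression(max_iter=1000)),
])
clf.fit(texts, labels)
print(clf.predict(["ии модель обучение"]))
```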
- 03_News_classification_with_BERT_v6.ipynb - Jupyter notebook for news classification with the BERT (rubert-tiny2) model. Google Colab is recommended, because the code can use a GPU there and runs faster.
- 04_News_classification_with_LLM_and_RAG_v2.ipynb - Jupyter notebook for news classification with the GPT (Saiga/Mistral) model. Google Colab is recommended, because the code can use a GPU there and runs faster.
Confusion matrix for Logistic Regression.
Confusion matrix for rubert-tiny2.
Confusion matrix for GPT (Saiga/Mistral) model.
The best performance in terms of F1-score was shown by the rubert-tiny2 model.
The best performance in terms of prediction time was shown by the Logistic Regression model.
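The metrics used in the comparison can be computed with scikit-learn; the labels below are synthetic placeholders, while in the project y_true and y_pred come from the 926-record test set.

```python
from sklearn.metrics import f1_score, confusion_matrix

# Illustrative ground truth and model predictions (0 = "Not published",
# 1 = "Published"); real values come from the held-out test set.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

print("F1:", f1_score(y_true, y_pred))
# Rows are true classes, columns are predicted classes.
print(confusion_matrix(y_true, y_pred))
```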
Please read AI News Classification (LogReg, BERT, GPT) v4.pdf for additional information.