AI News Classification. (1) TF-IDF and Logistic Regression. (2) BERT (3) LLM (Saiga/Mistral) and RAG
News classification with Logistic Regression, BERT and LLM/GPT

Abstract

In this project I investigated several approaches to news classification. A corpus of Russian-language news dedicated to Chinese AI was collected within the "Китайский Искусственный Интеллект | 中国人工智能" news aggregation project. The news items are divided into two classes: "Published" and "Not published". The problem was solved with three natural language processing approaches: TF-IDF with Logistic Regression, BERT, and GPT. A comparative analysis of the results of these approaches was carried out. The project repository is located at https://github.com/ArtyomR/AI-News-Classification.

Dataset

The dataset was created within the "Китайский Искусственный Интеллект | 中国人工智能" news aggregation project. It can be found in ai_articles_prep_231124.xlsx. Each record combines the title and first paragraph of a news item about Chinese high technologies. Originally the news items were in English or Chinese; they were then translated into Russian with the Google translation engine. An example of the dataset records is shown in the picture below.

After pre-processing and lemmatization the dataset looks as shown in the picture below. There are 1711 (55.48%) 'Not published' and 1373 (44.52%) 'Published' news items in this dataset.
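A minimal text clean-up step before lemmatization might look like the sketch below (pure Python; this is an illustrative assumption, not the project's actual code — real lemmatization of Russian text is typically done with a morphological analyzer such as pymorphy2, which is not shown here):

```python
import re

def clean(text: str) -> str:
    """Lowercase and strip punctuation, keeping Cyrillic word characters."""
    text = text.lower()
    # \w matches Cyrillic letters in Python 3, so only punctuation is replaced
    text = re.sub(r"[^\w\s]", " ", text)
    return " ".join(text.split())

print(clean("Китай представил новую ИИ-модель!"))
# китай представил новую ии модель
```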

This dataset was split into a train set (2158 records, 70%) and a test set (926 records, 30%).
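Such a 70/30 hold-out is commonly done with a stratified split so that both sets keep the class balance. A sketch, assuming scikit-learn (the toy data and `random_state` here are illustrative assumptions):

```python
from sklearn.model_selection import train_test_split

# Toy stand-in: 100 items with roughly the 55/45 class balance reported above
texts = [f"news {i}" for i in range(100)]
labels = [0] * 55 + [1] * 45  # 0 = Not published, 1 = Published

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.30, stratify=labels, random_state=42
)
print(len(X_train), len(X_test))  # 70 30
```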

For the Transformer (rubert-tiny2) model, the train set was further divided into two subsets: a train subset of 1726 records and a validation subset of 432 records.

Experiment Setup
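As a sketch of the TF-IDF + Logistic Regression baseline described in the abstract (the toy texts and hyperparameters here are illustrative assumptions, not the project's actual code; scikit-learn is assumed):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy stand-ins for the lemmatized news texts and their labels
texts = [
    "китай нейросеть запуск модель",
    "компания выпустить чип ии",
    "слух обзор мнение блог",
    "реклама скидка подписка",
] * 10
labels = [1, 1, 0, 0] * 10  # 1 = Published, 0 = Not published

clf = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000),
)
clf.fit(texts, labels)
# Tokens overlap only with the 'Published' training text,
# so the sketch should recover class 1 here
print(clf.predict(["китай запуск модель"]))
```

The same `fit`/`predict` interface would then be evaluated on the held-out test set to produce the confusion matrices shown below.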

Results

Confusion matrix for the Logistic Regression model.


Confusion matrix for rubert-tiny2.


Confusion matrix for the GPT (Saiga/Mistral) model.

The best performance in terms of F1-score was shown by the rubert-tiny2 model.

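F1 is the harmonic mean of precision and recall, so it can be computed directly from a binary confusion matrix. A small sketch (the counts below are made up for illustration, not the project's actual results):

```python
def f1_from_confusion(tp: int, fp: int, fn: int) -> float:
    """F1 = 2 * precision * recall / (precision + recall)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Illustrative counts only
print(round(f1_from_confusion(tp=400, fp=50, fn=100), 3))
```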

The best performance in terms of prediction time was shown by the Logistic Regression model.


Additional information

Please read AI News Classification (LogReg, BERT, GPT) v4.pdf for additional information.
