Update README.md
Egoluback authored Jun 2, 2021
1 parent 63d25d4 commit 91703a1
Showing 1 changed file with 7 additions and 0 deletions.
7 changes: 7 additions & 0 deletions README.md
@@ -4,3 +4,10 @@ ML-bot that detects toxicity in Russian texts.
Works on TeleBot (Telegram API) <br />
Data: [Train&Test](https://www.kaggle.com/alexandersemiletov/toxic-russian-comments) [Test](https://www.kaggle.com/blackmoon/russian-language-toxic-comments) <br />
The bot combines 3 models that classify insults, threats and obscenities.

# How it works
Each word in a sentence is represented as a vector using a word2vec model ([Model 204 on nlpl.eu, trained on RNC, Wikipedia, News and ARM](http://vectors.nlpl.eu/repository/)). The resulting sentence vector is the average of its word vectors. <br />
Training data shape: Nx300 (N sentences, 300-dimensional vectors). I train 3 CatBoostClassifier models, one each on the insults, threats and obscenities datasets.
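
The averaging step described above can be sketched roughly as follows. This is a minimal illustration, not the bot's actual code: the helper name `sentence_vector` and the toy 3-dimensional vectors are assumptions made for the example, and a plain `dict` stands in for the loaded word2vec model (any mapping that supports `word in vectors` and `vectors[word]` would work the same way).

```python
import numpy as np


def sentence_vector(tokens, word_vectors, dim=300):
    """Average the vectors of all tokens that the model knows.

    `word_vectors` is any mapping from token to a 1-D array
    (a toy dict here; an assumption, not the bot's real model object).
    """
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        # No known words: fall back to a zero vector of the right size.
        return np.zeros(dim, dtype=np.float32)
    return np.mean(known, axis=0)


# Toy 3-dimensional "embeddings" instead of the real 300-dimensional ones.
toy_vectors = {
    "good": np.array([1.0, 0.0, 2.0]),
    "day": np.array([3.0, 2.0, 0.0]),
}

vec = sentence_vector(["good", "day", "unknown"], toy_vectors, dim=3)
print(vec)  # [2. 1. 1.]
```

Out-of-vocabulary tokens are simply skipped here; how the bot itself handles them is not stated in this README.
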
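
A rough sketch of the "one classifier per category" setup is shown below. The function names, the `make_model` parameter and the hyperparameters (`iterations=200`) are assumptions chosen for illustration; the README does not state the settings actually used. Labels are assumed to be 0/1 integers.

```python
import numpy as np


def default_model():
    # Imported lazily so the helpers below do not require catboost at import time.
    from catboost import CatBoostClassifier

    # Hyperparameters are illustrative assumptions, not the bot's real settings.
    return CatBoostClassifier(iterations=200, verbose=False)


def train_per_category(X, labels_by_category, make_model=default_model):
    """Fit one independent binary classifier per category on the same features.

    X: array of shape (N, 300) holding the averaged sentence vectors.
    labels_by_category: dict such as {"insult": y1, "threat": y2, "obscenity": y3},
    where each y is a length-N sequence of 0/1 labels.
    """
    models = {}
    for category, y in labels_by_category.items():
        model = make_model()
        model.fit(X, y)
        models[category] = model
    return models


def predict_flags(models, x):
    """Return {category: bool} for a single sentence vector `x`."""
    row = np.asarray(x).reshape(1, -1)
    return {
        name: bool(np.ravel(model.predict(row))[0])
        for name, model in models.items()
    }
```

With real data this would be called as `models = train_per_category(X_train, labels)` followed by `predict_flags(models, sentence_vec)`, giving one yes/no answer per category for each incoming message.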
# Result
This architecture works well for capturing the main topic of a sentence (the mean word2vec vector captures semantics well), but it is not well suited to predicting the tone of a sentence. For that task it is better to use a different way of vectorizing sentences and different models (neural networks such as an RNN or CNN rather than decision trees). <br />
Maybe I'll come back to this task later with a better method.
