
# Model

The `/models` directory contains ML models used for detecting sample repositories (SRs).

## How to use it?

You can use these models for detecting SRs:

### IF model

TBD..

### Transformer BERT model

TBD..

## How to train it?

To train the models, run:

```shell
docker run -e "GH_TOKEN=..." abialiauski/samples-filter-models model/<model-name>.py
```

For `<model-name>`, provide the name of the Python training script, for instance `isolation-forest` or `t_bert`. For `GH_TOKEN`, provide a GitHub PAT used to push the `isolation-forest` model files into the `results` branch of h1alexbel/samples-filter. If you are training the `t_bert` model and want to export the output model files, also pass `-e "HF_TOKEN=..."` to push them to HuggingFace.

You will need Docker installed.
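As an illustration, a `t_bert` training run that also exports model files to HuggingFace might look like the following (the `...` token values are placeholders you must supply yourself):

```shell
docker run \
  -e "GH_TOKEN=..." \
  -e "HF_TOKEN=..." \
  abialiauski/samples-filter-models model/t_bert.py
```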

## How to build a new dataset?

To build a new dataset, run `srdataset` either on a cloud VM or locally. The build process can take a while. After it completes, you should have a `repos.csv` file with all collected repositories.
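Once `repos.csv` exists, you can sanity-check it with a few lines of Python. The column names below are only an illustration; the actual schema produced by `srdataset` may differ:

```python
import csv
import io

# Hypothetical excerpt of repos.csv; the real columns may differ.
sample = io.StringIO(
    "full_name,description,stars\n"
    "jeff/spring-boot-sample,A sample Spring Boot app,12\n"
    "acme/payments,Production payment service,340\n"
)
rows = list(csv.DictReader(sample))
print(len(rows), rows[0]["full_name"])  # → 2 jeff/spring-boot-sample
```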

All features must be preprocessed and vectorized using `pipeline.py`. Once you have the vectors, you can feed them to the models.
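The preprocess, vectorize, and predict flow can be sketched as follows. This is a minimal scikit-learn illustration, not the actual contents of `pipeline.py`; the feature choice (repository descriptions vectorized with TF-IDF) is an assumption:

```python
from sklearn.ensemble import IsolationForest
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-ins for collected repository descriptions.
descriptions = [
    "sample spring boot application",
    "sample repository for the tutorial",
    "payment processing microservice",
    "sample project demonstrating a REST API",
]

# Vectorize the text features, then fit an isolation forest on the vectors.
vectors = TfidfVectorizer().fit_transform(descriptions)
model = IsolationForest(random_state=0).fit(vectors)
labels = model.predict(vectors)  # 1 = inlier, -1 = outlier
```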