The `/models` directory contains ML models used for detecting sample repositories (SR). You can use these models for detecting SRs:
TBD..
TBD..
To train models, run:

```bash
docker run -e "GH_TOKEN=..." abialiauski/samples-filter-models model/<model-name>.py
```
For `<model-name>`, provide the name of the Python training script, for instance `isolation-forest` or `t_bert`. For `GH_TOKEN`, provide a GitHub PAT that will be used to push the `isolation-forest` model files into the `results` branch of this repository, h1alexbel/samples-filter. If you are training the `t_bert` model and want to export the output model files, also pass `-e "HF_TOKEN=..."` to push them to HuggingFace.
You will need Docker installed.
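For example, a full invocation that trains the isolation-forest model might look like this (the token value is a placeholder; substitute your own GitHub PAT):

```bash
docker run -e "GH_TOKEN=<your GitHub PAT>" \
  abialiauski/samples-filter-models model/isolation-forest.py
```

Similarly, to train `t_bert` and export the model files to HuggingFace, add the `HF_TOKEN` variable:

```bash
docker run -e "GH_TOKEN=<your GitHub PAT>" -e "HF_TOKEN=<your HuggingFace token>" \
  abialiauski/samples-filter-models model/t_bert.py
```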
To build a new dataset, run `srdataset` either on a cloud VM or locally. The building process can take a while. After it completes, you should have a `repos.csv` file with all collected repositories.
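A minimal sketch of such a run, assuming `srdataset` is available on your `PATH` and can be invoked without arguments, writing its output to the current directory (its exact flags are not covered here):

```bash
# Collect repositories into repos.csv; this can take a while.
srdataset

# Inspect the first rows of the resulting dataset.
head repos.csv
```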
All features must be preprocessed and vectorized using `pipeline.py`. Once you have the vectors, you can feed them to the models.
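A hypothetical sketch of that step; the actual interface of `pipeline.py` may differ, and the `repos.csv` input and `vectors.csv` output paths here are assumptions for illustration:

```bash
# Preprocess and vectorize the collected features (paths are illustrative,
# not the script's documented interface).
python pipeline.py repos.csv vectors.csv
```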