This repository contains an HTTP service for automatic text classification, designed for pipeline integration with the SuperAnnotate platform.
To integrate this HTTP service into your pipeline on the SuperAnnotate platform, follow these steps:
- Create and set up a text project on the SuperAnnotate platform.
- Deploy this HTTP service to a globally accessible location.
- Configure a pipeline on the SuperAnnotate platform to link this service to your project.
The project facilitates the automatic training of a text classification and data tagging model on the SuperAnnotate platform.
Here's a high-level overview of the process:
- Annotate Data: Annotate approximately 100 items per class using the SuperAnnotate platform.
- Model Fine-Tuning: Fine-tune the text classification model using the annotated data.
- Prediction: Use the fine-tuned model to predict labels for other items in your dataset.
Everything happens in three main stages: data loading, model fine-tuning, and prediction.
- Annotations, along with file names, are loaded from the specified project (and, optionally, its folders) on the platform.
- Document texts are also loaded through the selected integration.
- All this data is combined into a dataset and undergoes standard processing: empty texts, duplicates, and extremely short or long texts (fewer than 10 or more than 2,000 words) are removed. Texts are also preprocessed by lowercasing them and stripping redundant spaces and line breaks. You can change the text preprocessing function (`text_auto_classification/utils/data/data_processing.py`) at any time to suit your needs; a sketch of such a function appears after this list.
- The training data is split into training and validation sets to evaluate the quality of the model and the learning process.
- Next, the hyperparameters are initialized; they can be customized through the training config file.
- Auto fine-tuning of the model specified in the config then begins. The model, training arguments, and trainer are all defined via standard HuggingFace abstractions (see the sketch after this list).
- The model's output layer is sized to match the number of classes in the training data.
- During data preparation, all items downloaded from the platform that do not yet have labels are set aside in a separate set for future prediction.
- At this stage, the texts of these items are run through the model to obtain predictions.
- These predictions are then uploaded to the platform.
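For reference, a preprocessing routine along the lines described above might look like the sketch below; the actual implementation lives in `text_auto_classification/utils/data/data_processing.py` and may differ in detail:

```python
import re


def preprocess_text(text: str) -> str:
    """Lowercase the text and collapse redundant whitespace and line breaks."""
    text = text.lower()
    text = re.sub(r"\s+", " ", text)  # collapse spaces, tabs, and newlines
    return text.strip()


def has_valid_length(text: str, min_words: int = 10, max_words: int = 2000) -> bool:
    """Filter out extremely short/long texts (fewer than 10 or more than 2000 words)."""
    n_words = len(text.split())
    return min_words <= n_words <= max_words
```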
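The fine-tuning stage follows the standard HuggingFace pattern. The snippet below is a minimal, hypothetical sketch of that pattern, not the project's exact code; in practice, `pretrain_model` and the training arguments come from `train_config.json`, and the dataset is built from your annotations rather than the toy examples shown here:

```python
from datasets import Dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

pretrain_model = "bert-base-uncased"  # taken from train_config.json in practice
texts = ["first example text", "second example text"]  # toy stand-in for real data
labels = [0, 1]
num_classes = len(set(labels))  # the output layer is sized to the class count

tokenizer = AutoTokenizer.from_pretrained(pretrain_model)


def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)


train_dataset = Dataset.from_dict({"text": texts, "label": labels}).map(
    tokenize, batched=True
)

model = AutoModelForSequenceClassification.from_pretrained(
    pretrain_model, num_labels=num_classes
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="checkpoints", num_train_epochs=1),
    train_dataset=train_dataset,
    tokenizer=tokenizer,  # enables padding via the default data collator
)
trainer.train()
```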
To configure the pipeline and service operation from the platform side, read this Tutorial.
You can deploy the service wherever it is convenient; one of the basic options is a newly created EC2 instance. Learn about instance creation and setup here.
NOTES:
- To verify that everything is functioning correctly, try calling the healthcheck endpoint (see the example after these notes).
- Hardware requirements depend largely on your training arguments and the base model being used; however, an NVIDIA GPU is recommended. For the basic configuration of the default service, a g3s.xlarge instance is advisable.
- Also, ensure that the port on which your service is deployed (8080 by default) is open to the public internet. Refer to this tutorial for guidance on opening a port on an EC2 instance.
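For example, a quick liveness check from Python (assuming the service is deployed locally on the default port 8080) could look like this:

```python
import requests

# Expect HTTP 200 and a body of {"healthy": true}
response = requests.get("http://localhost:8080/healthcheck")
print(response.status_code, response.json())
```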
To get started with the project, you first need to set up all the necessary configuration files. By default, they are located at `etc/configs`. Namely, there are 3 configs:
- `SA_config.ini`: the configuration file for working with the SuperAnnotate SDK. It contains your key to the platform and is needed for authorization in `SAClient`. You can read more here. (A hypothetical example is shown after this list.)
- `service_config.json`: this file contains the basic fields for the operation of the service as a whole (see the example after this list). It contains the following fields:
  - `SA_CONFIG_PATH`: the path to the first config (`SA_config.ini`).
  - `SA_PROJECT_NAME`, `SA_FOLDERS`: the name and, optionally, the folders of the project on the platform to work with.
  - `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`: AWS keys; more details here.
  - `AWS_URL_FOR_DATA_DOWNLOADS`, `AWS_URL_TO_MODEL_UPLOAD`: S3 URLs pointing to the location of the original documents and the place to save model checkpoints, respectively. More details here.
  - NOTE: please ensure that the file structure under `AWS_URL_FOR_DATA_DOWNLOADS` matches the folders in `SA_FOLDERS`.
- `train_config.json`: this config contains the basic arguments needed for training (see the example after this list):
  - `pretrain_model`: the name of the pre-trained model from HuggingFace (a BERT-like model is recommended).
  - `validation_ratio`: a value from 0 to 1 representing the proportion of the data that will be used to validate the model.
  - `max_length`: the maximum text length for the tokenizer; by default, 512 is the limit for BERT-like models.
  - The remaining keys correspond to the arguments of the HuggingFace `TrainingArguments` class. More details here.
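For illustration, `SA_config.ini` might look like the following; the exact key names depend on your SuperAnnotate SDK version, and the token is a placeholder:

```ini
[DEFAULT]
SA_TOKEN = <your-superannotate-sdk-token>
```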
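A hypothetical `service_config.json` could look like this; every value below is a placeholder rather than a shipped default:

```json
{
    "SA_CONFIG_PATH": "etc/configs/SA_config.ini",
    "SA_PROJECT_NAME": "My Text Project",
    "SA_FOLDERS": ["folder_1", "folder_2"],
    "AWS_ACCESS_KEY_ID": "<your-access-key-id>",
    "AWS_SECRET_ACCESS_KEY": "<your-secret-access-key>",
    "AWS_URL_FOR_DATA_DOWNLOADS": "s3://my-bucket/documents",
    "AWS_URL_TO_MODEL_UPLOAD": "s3://my-bucket/checkpoints"
}
```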
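Similarly, a hypothetical `train_config.json` might combine the fields above with standard `TrainingArguments` keys; the values are purely illustrative:

```json
{
    "pretrain_model": "bert-base-uncased",
    "validation_ratio": 0.2,
    "max_length": 512,
    "num_train_epochs": 3,
    "per_device_train_batch_size": 16,
    "learning_rate": 2e-5
}
```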
After all the configs are set up as described, you can start the service. There are 2 options: running it directly or in a Docker container.
Option 1: run directly.
- Install Python version 3.11. More details here.
- Install the Nvidia drivers and CUDA toolkit using, for example, these instructions: Nvidia drivers and CUDA toolkit.
- Install the dependencies: `pip install -r ./text_auto_classification/requirements.txt`
- Set the Python path variable: `export PYTHONPATH="."`
- Initialize the environment variables:
  - `DEFAULT_SERVICE_CONFIG`, the path to the general configuration file: `export DEFAULT_SERVICE_CONFIG=etc/configs/service_config.json`
  - `DEFAULT_TRAINING_CONFIG`, the path to the configuration file with training parameters: `export DEFAULT_TRAINING_CONFIG=etc/configs/train_config.json`
- Run the API: `uvicorn --host 0.0.0.0 --port 8080 text_auto_classification.fastapi_app:app`
Option 2: run with Docker.
- Install Docker, the Nvidia drivers, the CUDA toolkit, and the NVIDIA Container Toolkit using, for example, these instructions: Docker; Nvidia drivers; CUDA toolkit; NVIDIA Container Toolkit.
- Build the Docker image: `sudo docker build -t text_auto_classification --build-arg DEFAULT_SERVICE_CONFIG=$DEFAULT_SERVICE_CONFIG --build-arg DEFAULT_TRAINING_CONFIG=$DEFAULT_TRAINING_CONFIG .`
- Run a container: `sudo docker run --gpus all -p 8080:8080 -d text_auto_classification`
The following endpoints are available in the Text Auto Classification service:
- `GET /healthcheck`:
  - Summary: Ping
  - Description: Alive method
  - Input type: none
  - Output type: JSON
  - Output values: `{"healthy": true}`
  - Status codes: `200`: Successful Response
- `POST /train_predict`:
  - Summary: Train Predict
  - Description: Train the model on annotated data from the SA project and auto-annotate the other data
  - Input type: none
  - Output type: JSON
  - Output values: `{"status": "Pipeline successfully started"}`, `{"status": "Pipeline is already started"}`
  - Status codes: `200`: Pipeline successfully started; `429`: Pipeline is already started
- `GET /status`:
  - Summary: Status
  - Description: Method for status tracking
  - Input type: none
  - Output type: JSON
  - Output values: `{"status": "Not started"}`, `{"status": "Downloading data"}`, `{"status": "Model training"}`, `{"status": "Predicting other items"}`, `{"status": "Completed"}`, `{"status": "Failed"}`
  - Status codes: `200`: Successful Response
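Taken together, the full cycle can be driven with a couple of HTTP calls. The snippet below is a minimal sketch using Python's `requests` library, assuming the service is reachable at `localhost:8080`:

```python
import time

import requests

BASE_URL = "http://localhost:8080"  # adjust to wherever the service is deployed

# Kick off training and prediction; HTTP 429 means a pipeline is already running.
response = requests.post(f"{BASE_URL}/train_predict")
print(response.status_code, response.json())

# Poll the status endpoint until the pipeline finishes.
while True:
    status = requests.get(f"{BASE_URL}/status").json()["status"]
    print("Current status:", status)
    if status in ("Completed", "Failed"):
        break
    time.sleep(60)  # training can take a while; poll sparingly
```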
There are several areas where the project can be further improved:
- Implement support for multi-label classification: currently, the project focuses on single-label classification. Adding support for multi-label classification would enhance its versatility and applicability in various use cases.
- Add logic for working with long texts (automatic chunking): handling long texts efficiently is crucial for many natural language processing tasks. Implementing logic to handle long texts, such as automatic chunking, would improve the project's performance and scalability when dealing with lengthy documents.