Cyberbullying is a serious issue, especially in the age of social media, where online interactions can quickly turn hurtful. Some applications that this cyberbullying-detection code can support:
- Real-time Monitoring: Continuously scan social media posts or messages for signs of cyberbullying.
- Alerts and Notifications: Notify users when potentially harmful content is detected.
- Reporting Mechanism: Allow users to report incidents and take appropriate action.
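As a sketch of how the trained classifier could back these applications, the loop below flags posts whose predicted bullying score crosses a threshold. `classify` is a stand-in for the fine-tuned model, and the 0.5 threshold is an illustrative assumption, not a tuned value.

```python
from typing import Callable, Iterable, List, Tuple

def monitor_stream(posts: Iterable[str],
                   classify: Callable[[str], float],
                   threshold: float = 0.5) -> List[Tuple[str, float]]:
    """Return (post, score) alerts for posts scored at or above threshold."""
    alerts = []
    for post in posts:
        score = classify(post)  # plug in the fine-tuned model's probability here
        if score >= threshold:
            alerts.append((post, score))  # hook for notifications / reports
    return alerts
```

The same function serves all three use cases: run it continuously for monitoring, push each alert as a notification, and log alerts for the reporting mechanism.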
DistilBERT (`distilbert-base-uncased`) is chosen over other transformers:
- Sentiment analysis needs bidirectional semantic comprehension, which rules out GPT (a decoder-only model better suited to text generation)
- It offers the same loss calculation and Masked Language Modeling (MLM) capability as BERT, while being lighter and faster
- Text cleaning lowercases all input anyway, so the uncased model version is used, as casing carries no useful signal here
- GPU availability is limited/irregular and the corpus is small (~2,000 examples), making DistilBERT the most appropriate choice
EDA and Feature Engineering to remove unwanted columns, treat missing values, and ensure correct column datatypes
Note: the target labels must be integers (not floats); otherwise `BCEWithLogitsLoss` throws an error during model training
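A minimal pandas sketch of this cleanup, including the label-dtype fix; the column names `text` and `label` and the toy rows are assumptions for illustration.

```python
import pandas as pd

# Toy frame standing in for the corpus; labels often load as floats from CSV
df = pd.DataFrame({"text": ["you are great", "you are awful", None],
                   "label": [0.0, 1.0, 1.0]})

df = df.dropna(subset=["text", "label"])  # treat missing values
df["label"] = df["label"].astype(int)     # integer labels avoid the loss dtype error
```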
Data Processing using Regex and NLTK
- Convert text to lower case (strictly not necessary, since the uncased tokenizer lowercases input anyway)
- Remove all hashtags (#), handles (@), hyperlinks (http) and URLs (www.)
- Remove all characters other than letters and digits (emoticons, punctuation, multi-space blocks)
- Identify commonly occurring irrelevant words, append them to the stopword list, and lemmatize
- Remove duplicate rows
- Analyze the length of each sequence; this informs the padding/truncation length during tokenization
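The cleaning steps above can be sketched with plain regex. The stopword set here is a tiny illustrative stand-in for NLTK's `stopwords.words('english')` plus the appended irrelevant words, and lemmatization (NLTK's `WordNetLemmatizer`) is omitted to keep the sketch dependency-free.

```python
import re

# Illustrative stand-in for the NLTK stopword list plus appended irrelevant words
STOPWORDS = {"rt", "amp", "u", "im"}

def clean_text(text: str) -> str:
    text = text.lower()                           # lower case (uncased model)
    text = re.sub(r"[@#]\w+", " ", text)          # strip handles and hashtags
    text = re.sub(r"(http|www\.)\S+", " ", text)  # strip hyperlinks and URLs
    text = re.sub(r"[^a-z0-9\s]", " ", text)      # drop emoticons/punctuation
    text = re.sub(r"\s+", " ", text).strip()      # collapse multi-space blocks
    return " ".join(t for t in text.split() if t not in STOPWORDS)
```

After cleaning, `df["clean"].str.split().str.len()` gives the sequence-length distribution used to pick the tokenizer's truncation length.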
Defining the Transformer Dataset and Training
- Convert the cleaned dataframe to a Hugging Face `Dataset` to leverage its fast, batched processing
- Use the `AutoTokenizer` associated with DistilBERT with the longest-padding and truncation strategy
- Split into train and test datasets
- Define the DistilBERT sequence-classification model
- Leverage the `Trainer` class for streamlined training, initialize its arguments, and define the evaluation functions
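The steps above can be sketched as follows: `compute_metrics` is a plain accuracy function for the `Trainer`'s evaluation loop, and `build_trainer` wires up tokenization, the train/test split, and the model. The `max_length=128`, the 80/20 split, and the training hyperparameters are illustrative assumptions, not the exact values used.

```python
import numpy as np

def compute_metrics(eval_pred):
    """Accuracy over the evaluation set, as expected by Trainer."""
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {"accuracy": float((preds == labels).mean())}

def build_trainer(dataset):
    # Deferred imports: only needed when actually training
    from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                              TrainingArguments, Trainer)
    tok = AutoTokenizer.from_pretrained("distilbert-base-uncased")

    def tokenize(batch):
        # Longest-padding + truncation strategy from the steps above
        return tok(batch["text"], padding="longest",
                   truncation=True, max_length=128)

    dataset = dataset.map(tokenize, batched=True)
    split = dataset.train_test_split(test_size=0.2)
    model = AutoModelForSequenceClassification.from_pretrained(
        "distilbert-base-uncased", num_labels=2)
    args = TrainingArguments(output_dir="out", num_train_epochs=3,
                             per_device_train_batch_size=16)
    return Trainer(model=model, args=args,
                   train_dataset=split["train"], eval_dataset=split["test"],
                   compute_metrics=compute_metrics)
```

Calling `build_trainer(dataset).train()` then runs the fine-tuning loop.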
Deployment
The fine-tuned model is published on the Hugging Face Hub: https://huggingface.co/LalasaMynalli/LalasaMynalli_First_LLM/resolve/main/README.md
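Once uploaded, the model can be pulled back for inference with the `pipeline` API; a sketch, with the repo id taken from the link above.

```python
def load_classifier():
    # Deferred import so this sketch doesn't require transformers at definition time
    from transformers import pipeline
    # Repo id from the deployment link above
    return pipeline("text-classification",
                    model="LalasaMynalli/LalasaMynalli_First_LLM")

# Example usage (downloads the model from the Hub):
# scores = load_classifier()("you are a terrible person")
```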
Next Steps
- Hyperparameter tuning
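Hyperparameter tuning can plug into the `Trainer`'s built-in `hyperparameter_search` (Optuna backend); the search ranges below are assumptions to be adjusted for this dataset.

```python
def hp_space(trial):
    """Optuna-style search space for Trainer.hyperparameter_search."""
    return {
        "learning_rate": trial.suggest_float("learning_rate",
                                             1e-5, 5e-5, log=True),
        "num_train_epochs": trial.suggest_int("num_train_epochs", 2, 5),
        "per_device_train_batch_size": trial.suggest_categorical(
            "per_device_train_batch_size", [8, 16, 32]),
    }

# With a trainer built as in the training section:
# best = trainer.hyperparameter_search(hp_space=hp_space, n_trials=10,
#                                      direction="maximize")
```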