The original dataset is available at the link below. It must be preprocessed before it can be used for training.
https://huggingface.co/datasets/CyberNative/Code_Vulnerability_Security_DPO/tree/main
Alternatively, you can use our preprocessed CSV directly in Colab:
https://huggingface.co/datasets/nomiaow/badcode/resolve/main/ALL_code.csv
- Convert output.json to generate data.csv. You can choose which files to train on according to your needs.
- data.csv is identical to pythondata.csv.
- output.json contains only Python data.
- The ALLcodedet folder contains all the code used for data preprocessing
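The conversion from output.json to data.csv could look roughly like the sketch below. It is not the repository's preprocessing code. It assumes output.json holds a JSON array of flat objects, which may not match the real file, and the function name `json_to_csv` is hypothetical.

```python
import csv
import json
from pathlib import Path


def json_to_csv(json_path, csv_path):
    """Convert a JSON array of flat objects into a CSV file.

    Assumption: the JSON file holds a list of dicts. The real layout of
    output.json may differ, so adjust the loading step as needed.
    """
    records = json.loads(Path(json_path).read_text(encoding="utf-8"))
    if not isinstance(records, list) or not records:
        raise ValueError("expected a non-empty JSON array of objects")

    # Collect column names in first-seen order so no field is dropped.
    fieldnames = []
    for record in records:
        for key in record:
            if key not in fieldnames:
                fieldnames.append(key)

    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(records)  # keys missing from a record become empty cells
    return len(records)


if __name__ == "__main__":
    # Only runs the conversion when output.json is present.
    if Path("output.json").exists():
        print(json_to_csv("output.json", "data.csv"), "rows written")
```

Collecting the header from every record, rather than only the first one, avoids silently dropping fields that appear later in the file.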
Register for Google Cloud (no payment required, but please follow the steps below and do not link a payment card!)
Before training, you need to register with Google Cloud and complete the OAuth setup. If you exceed the free request quota, you may be required to add a payment card. Please refer to this article for details.
- First, go to Google Cloud and enable the API: API -> Generative Language API
After you get the client_secret.json file from the OAuth setup, open the Google Cloud CLI and run the following command. client_secret.json must be in the directory where you run the command, and your Google Cloud project must have a list of allowed test users. After running the command, you will be asked to log in with a tester's account.
gcloud auth application-default login --client-id-file client_secret.json --scopes='https://www.googleapis.com/auth/cloud-platform,https://www.googleapis.com/auth/generative-language.tuning'
Credentials saved to file: [/<YOUR_HOME_DIR>/.config/gcloud/application_default_credentials.json]
These credentials will be used by any library that requests Application Default Credentials (ADC).
- application_default_credentials.json must be generated for the code to run correctly.
Remember to install the client library:
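Before training, it can help to confirm that the credentials file really exists. The following is a minimal sketch, assuming the default Linux/macOS location printed by the gcloud command above (on Windows the file usually lives under `%APPDATA%\gcloud`). The helper name `adc_is_ready` is hypothetical.

```python
import json
from pathlib import Path

# Default ADC location on Linux/macOS, matching the path gcloud prints.
DEFAULT_ADC_PATH = (
    Path.home() / ".config" / "gcloud" / "application_default_credentials.json"
)


def adc_is_ready(path=DEFAULT_ADC_PATH):
    """Return True if the ADC file exists and parses as a JSON object."""
    path = Path(path)
    if not path.is_file():
        return False
    try:
        content = json.loads(path.read_text(encoding="utf-8"))
    except (json.JSONDecodeError, OSError, UnicodeDecodeError):
        return False
    return isinstance(content, dict)


if __name__ == "__main__":
    print("ADC ready:", adc_is_ready())
```

This only checks that the file is present and well-formed. It does not verify that the credentials are still valid or carry the tuning scope.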
- pip install google-generativeai
The main training program is finetunellm.py. Before training, make sure you have obtained client_secret.json from the OAuth setup and have our data.csv data set, and that both are in the same directory as the program.
- After training, run elev.py and enter the code file you want to check. Remember: if you trained with data.csv only, the model produced by finetunellm.py has only learned to detect improperly written Python code, so enter only a xxx.py file (the code file you test must match the language the model was trained on).
The result of a run is shown in the video below.
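To show how a training step like this might be structured, here is a hedged sketch. It is not the contents of finetunellm.py. The column names `input` and `output` are hypothetical defaults, the hyperparameters are illustrative, and `source_model` must be a tunable base model name taken from the tuning guide linked in this README. The `genai.create_tuned_model` call comes from the google-generativeai package and needs the OAuth credentials described above.

```python
import csv


def load_training_pairs(csv_path, input_col="input", output_col="output"):
    """Read training pairs from a CSV file.

    Assumption: the column names are hypothetical defaults. Check the
    header row of data.csv and pass the real names.
    """
    pairs = []
    with open(csv_path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            text_input = (row.get(input_col) or "").strip()
            output = (row.get(output_col) or "").strip()
            if text_input and output:  # skip incomplete rows
                pairs.append({"text_input": text_input, "output": output})
    return pairs


def start_tuning(pairs, tuned_id, source_model):
    """Start a tuning job and return the long-running operation."""
    # Imported lazily so that loading data works without the package.
    import google.generativeai as genai

    return genai.create_tuned_model(
        source_model=source_model,
        training_data=pairs,
        id=tuned_id,
        epoch_count=5,        # illustrative values, not the repository's
        batch_size=4,
        learning_rate=0.001,
    )
```

A possible use is `start_tuning(load_training_pairs("data.csv"), "my-code-checker", source_model)`, where the tuned model ID is also a made-up example.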
2.mp4
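Because a model trained on data.csv only covers Python, the input file could be validated before it is sent for detection. This is a minimal sketch of such a check, not the logic in elev.py. The function name `read_python_source` is hypothetical.

```python
from pathlib import Path


def read_python_source(filename):
    """Return the text of a .py file, rejecting any other extension."""
    path = Path(filename)
    if path.suffix.lower() != ".py":
        raise ValueError(f"expected a .py file, got: {path.name}")
    if not path.is_file():
        raise FileNotFoundError(f"no such file: {path}")
    return path.read_text(encoding="utf-8")
```

Rejecting other extensions early gives a clear error instead of a misleading result from a model that was never trained on that language.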
- data.csv contains only Python data.
- To train on improperly written code in multiple languages, use the data sets in the ALLcodedet folder.
- An install_mode file is generated after each training run.
- Requirements.txt lists the packages to install. Depending on your environment, you may need more or fewer packages.
- The un_sec_code folder contains examples of improperly written code that you can test with elev.py.
- Example:
RAGCONDE1.mp4
- https://ai.google.dev/gemini-api/docs/model-tuning/python
- https://www.kaggle.com/code/bhavikjikadara/gemini-api-with-python
If you do not want to train a model, you can get a developer API key from Google Gemini. It is free as long as your requests per minute stay within the official limit.
After getting the API key, use notraintogemini.py directly.
The running results are as follows
1.mp4
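The no-training path described above might be sketched as follows. This is not the contents of notraintogemini.py. The prompt wording, the `GOOGLE_API_KEY` environment variable, and the default model name are all assumptions, so replace the model name with one currently offered by the Gemini API.

```python
import os


def build_prompt(code):
    """Wrap source code in a review prompt (hypothetical wording)."""
    return (
        "Review the following code for security vulnerabilities and "
        "explain each issue briefly.\n\n" + code
    )


def review_code(code, model_name="gemini-1.5-flash"):
    """Send the prompt to Gemini and return the text of the reply."""
    # Imported lazily so build_prompt works without the package installed.
    import google.generativeai as genai

    # Assumption: the API key is stored in the GOOGLE_API_KEY variable.
    genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
    model = genai.GenerativeModel(model_name)
    return model.generate_content(build_prompt(code)).text
```

Reading the key from the environment keeps it out of the source file, which matters if the script is committed to a public repository.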
- Noflag
- Jimmy Liao
Send a PR to add new features and we will review it!
- test results
- web display
- PDF report
- add Security Protect into project +RAG
- Ollama ...