In this study, we present a vulnerability data extraction tool to detect vulnerabilities in the C/C++ source code of several operating systems (OS) and software projects. The source code of major software was used to create a binary and multi-class labeled dataset including both vulnerable and benign samples. The vulnerability types in the extracted dataset are linked to the Common Weakness Enumeration (CWE) records.
- Python (3.7)
- pip 23.3.1
- FlawFinder 2.0.19
- Cppcheck 2.10.3
- RATS v2.4 - Rough Auditing Tool for Security
- Clang Static Analyzer 15.0.0
The code is written in Python 3.7. The Python packages required to reproduce the results are listed in `requirements.txt`. Run the following commands to create a virtual environment, activate it, and install all the required Python dependencies.
conda create -n vulnminer python==3.8
conda activate vulnminer
pip install pip==23.3.1
pip install -r requirements.txt
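
The external analyzers listed above (FlawFinder, Cppcheck, RATS, Clang) are not installed by `pip install -r requirements.txt`, so it can help to confirm they are reachable before running the extraction. The following is a minimal sketch, not part of the repository: the executable names in `ASSUMED_TOOLS` are assumptions about how the tools are typically installed and may differ on your system.

```python
import shutil

# Assumed executable names for the external analyzers (not taken from the
# repository); adjust them to match your installation.
ASSUMED_TOOLS = ["flawfinder", "cppcheck", "rats", "clang"]


def find_missing_tools(tools):
    """Return the entries of `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = find_missing_tools(ASSUMED_TOOLS)
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All assumed analyzer executables were found.")
```
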
Once the required packages are installed, run the following command to extract the vulnerability data from the input projects listed in `config.yaml`:
python3 -m source.extract
To extract the vulnerability data from any source-code project and store it in a given SQLite3 database (overriding the `config.yaml` parameters), run the command as follows:
python3 -m source.extract --project [project-dir] --database [db-name.db]
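
When several projects need to be processed into the same database, the command above can be wrapped in a small driver. This is a hedged sketch rather than a feature of the tool: it relies only on the `--project` and `--database` options shown above, the helper names and example paths are hypothetical, and it uses the current interpreter (`sys.executable`) in place of `python3`. It should be run from the repository root so that `source.extract` is importable.

```python
import subprocess
import sys


def build_extract_command(project_dir, database):
    """Build the argument list for one extraction run."""
    return [
        sys.executable, "-m", "source.extract",
        "--project", str(project_dir),
        "--database", str(database),
    ]


def extract_all(project_dirs, database, dry_run=True):
    """Run (or, with dry_run=True, only print) one extraction per project."""
    for project_dir in project_dirs:
        cmd = build_extract_command(project_dir, database)
        if dry_run:
            print(" ".join(cmd))
        else:
            # check=True stops the loop if an extraction fails.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Hypothetical project directories, used only for illustration.
    extract_all(["projects/contiki", "projects/RIOT"], "data/VulnMiner.db")
```

With `dry_run=True` (the default) the commands are only printed, which makes it easy to check the paths before starting a long run.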
Once the script finishes, it saves all the collected vulnerability data (statements and functions) to the database file specified by the `database` parameter in `config.yaml`, i.e., `data/VulnMiner.db`.
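
The resulting file can be inspected with Python's built-in `sqlite3` module. Because the table names are not documented here, the sketch below makes no assumption about the schema: it discovers the tables from `sqlite_master` and counts the rows in each. The function name `table_row_counts` is hypothetical.

```python
import sqlite3
from contextlib import closing
from pathlib import Path


def table_row_counts(db_path):
    """Return {table_name: row_count} for every user table in a SQLite file."""
    with closing(sqlite3.connect(str(db_path))) as conn:
        names = [
            row[0]
            for row in conn.execute(
                "SELECT name FROM sqlite_master WHERE type = 'table'"
            )
            # Skip SQLite's internal bookkeeping tables.
            if not row[0].startswith("sqlite_")
        ]
        return {
            name: conn.execute(f'SELECT COUNT(*) FROM "{name}"').fetchone()[0]
            for name in names
        }


if __name__ == "__main__":
    db_file = Path("data/VulnMiner.db")
    # sqlite3.connect would create an empty file if it is missing, so check first.
    if db_file.exists():
        for table, count in table_row_counts(db_file).items():
            print(f"{table}: {count} rows")
    else:
        print(f"{db_file} not found; run the extraction first.")
```
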
The `VulnMiner.db` dataset is a collection of vulnerable code from various projects. The current VulnMiner dataset has 2,263,907 statements (2,165,850 benign and 98,057 vulnerable) and 1,026,111 functions (922,473 benign and 103,638 vulnerable). Among all the projects analyzed, `linux` has the highest number of entries, totaling 1,193,381 statements and 451,021 functions, followed by `chromium` with 216,724 statements and 257,917 functions. Although the number of vulnerability and weakness samples (statements and functions) gives some indication of the severity of a project, larger projects such as `linux` naturally tend to contain more vulnerable samples.
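
The class balance implied by these figures can be checked with a few lines of arithmetic. The sketch below uses only the counts quoted above; the dictionary layout and function name are illustrative choices, not part of the dataset.

```python
# Counts copied from the dataset description above.
COUNTS = {
    "statements": {"total": 2_263_907, "benign": 2_165_850, "vulnerable": 98_057},
    "functions": {"total": 1_026_111, "benign": 922_473, "vulnerable": 103_638},
}


def vulnerable_share(level):
    """Fraction of samples at `level` that are labeled vulnerable."""
    entry = COUNTS[level]
    return entry["vulnerable"] / entry["total"]


if __name__ == "__main__":
    for level, entry in COUNTS.items():
        # Sanity check: the two classes should add up to the reported total.
        assert entry["benign"] + entry["vulnerable"] == entry["total"]
        print(f"{level}: {vulnerable_share(level):.2%} vulnerable")
```

The vulnerable class makes up roughly 4.3% of statements and 10.1% of functions, so the dataset is strongly imbalanced at both levels, which is worth accounting for when training or evaluating a classifier on it.
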
Project | Version | URL |
---|---|---|
linux-rpi | 6.1.y | www.raspberrypi.com/software/ |
ARMmbed | 6.17.0 | https://os.mbed.com/mbed-os/ |
FreeRTOS | 202212.01 | www.freertos.org/a00104.html |
RIOT | 2023.07 | https://github.com/RIOT-OS/RIOT |
contiki | 2.4 | https://github.com/contiki-os/contiki |
gnucobol | 3.2 | https://gnucobol.sourceforge.io/ |
mbed-os | 6.17.0 | https://github.com/ARMmbed/mbed-os |
micropython | 1.12.0 | https://micropython.org/ |
mosquitto | 2.0.18 | https://github.com/eclipse/mosquitto |
openwrt | 23.05.2 | https://github.com/openwrt/openwrt |
The initial release of the extracted dataset is available on Zenodo, and an earlier IoT-specific version of the dataset is also available on Zenodo.
The research presented in this paper has benefited from the Kristiania-HPC, which is financially supported by Kristiania University College.