# DeepSQLi

**DeepSQLi** is a project designed to detect SQL injection (SQLi) attacks using deep learning models. It provides a **Prediction API**, a Flask application that tokenizes and sequences SQL queries, and then predicts whether a query is an SQL injection.

## Project Overview

The **Prediction API** handles both tokenization and prediction in a single endpoint, simplifying the detection workflow. It processes incoming SQL queries and determines whether they contain SQL injection patterns.

### GatewayD Integration

This project can be integrated with [GatewayD](https://github.com/gatewayd-io/gatewayd) using the [GatewayD SQL IDS/IPS plugin](https://github.com/gatewayd-io/gatewayd-plugin-sql-ids-ips). The plugin acts as middleware between clients and the database, intercepting SQL queries and sending them to the Prediction API for analysis. If a query is classified as malicious, the plugin blocks it; otherwise, it forwards the query to the database.

### Architecture

```mermaid
flowchart TD
    Client <-- PostgreSQL wire protocol:15432 --> GatewayD
    GatewayD <--> P["Prediction API"]
    P -- loads --> SM["SQLi Model"]
    P -- threshold: 80% --> D{Malicious query?}
    D -->|No: send to| Database
    D -->|Yes: terminate request| GatewayD
```

## Models and Tokenization

### Model Versions

- **LSTM Models**: The first two models are LSTM-based and are trained on different datasets:
  - **`sqli_model/1`**: Trained on [dataset v1](./dataset/sqli_dataset1.csv) (kept for historical purposes; will be removed in the future).
  - **`sqli_model/2`**: Trained on [dataset v2](./dataset/sqli_dataset2.csv) (kept for historical purposes; will be removed in the future).
- **CNN-LSTM Model**: The third model, **`sqli_model/3`**, is a hybrid CNN-LSTM model with a custom SQL tokenizer that improves performance on SQL injection patterns (recommended).

### Tokenization

The Prediction API performs tokenization internally:

- **Default Tokenizer** (Models 1 and 2): Uses Keras’s `Tokenizer` for general tokenization.
- **Custom SQL Tokenizer** (Model 3): A custom tokenizer designed to handle SQL syntax and injection-specific patterns.
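
The custom SQL tokenizer itself isn't shown here; as a rough illustration of the idea, a pure-Python sketch of SQL-aware tokenization and sequencing (the regex pattern, helper names, and padding scheme are illustrative assumptions, not the project's actual implementation):

```python
import re

# Illustrative pattern: keep quoted strings, numbers, words, comparison
# operators, and punctuation that generic word tokenizers often drop.
TOKEN_PATTERN = re.compile(r"'[^']*'|\d+|\w+|[=<>!]+|[();,*]")

def sql_tokenize(query: str) -> list[str]:
    """Lowercase the query and split it into SQL-aware tokens."""
    return [t.lower() for t in TOKEN_PATTERN.findall(query)]

def to_sequence(tokens: list[str], vocab: dict[str, int], max_len: int = 100) -> list[int]:
    """Map tokens to integer IDs (0 for out-of-vocabulary) and left-pad to max_len."""
    ids = [vocab.get(t, 0) for t in tokens][:max_len]
    return [0] * (max_len - len(ids)) + ids
```

A tokenizer like this keeps `=` and `'` intact, so tautologies such as `OR 1=1` survive as distinct tokens the model can learn from.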

## Installation

### Docker Compose

To start the Prediction API with Docker Compose (`--build` is only needed the first time and can be omitted later):

```bash
docker compose up --build -d
```

To stop and remove the containers:

```bash
docker compose down
```

### Docker (Manual Setup)

#### Build the Image

```bash
docker build --no-cache --tag prediction-api:latest -f Dockerfile .
```

#### Run the Container

```bash
docker run --rm --name prediction-api -p 8000:8000 -d prediction-api:latest
```

## Usage

Once the Prediction API is running, use the `/predict` endpoint to classify SQL queries.

### Prediction API

```bash
curl 'http://localhost:8000/predict' -X POST -H 'Content-Type: application/json' \
  --data-raw '{"query":"SELECT * FROM users WHERE id=1 OR 1=1;"}'
```
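
The same call can be made from Python. A minimal sketch using only the standard library; the endpoint and port come from the setup above, while the helper name is illustrative:

```python
import json
from urllib import request

API_URL = "http://localhost:8000/predict"  # default port from the Docker setup

def build_predict_request(query: str, url: str = API_URL) -> request.Request:
    """Build the POST request the /predict endpoint expects."""
    body = json.dumps({"query": query}).encode("utf-8")
    return request.Request(url, data=body, method="POST",
                           headers={"Content-Type": "application/json"})

# Sending the request requires the API container to be running:
# with request.urlopen(build_predict_request("SELECT * FROM users WHERE id=1 OR 1=1;")) as resp:
#     print(json.load(resp))
```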

### Response Format

The response includes the prediction (`1` for SQL injection, `0` for a legitimate query) and a confidence score. Note that the confidence score is only available for the CNN-LSTM model (`sqli_model/3`).

```json
{
  "confidence": 0.9722
}
```
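
The GatewayD plugin blocks a query when the confidence crosses the 80% threshold shown in the architecture diagram. A minimal sketch of that decision, assuming the response shape above (the function name is illustrative):

```python
def is_malicious(response: dict, threshold: float = 0.8) -> bool:
    """Apply the plugin's 80% confidence threshold to a /predict response."""
    return response.get("confidence", 0.0) >= threshold

# A high-confidence response is blocked; a low-confidence one is forwarded.
print(is_malicious({"confidence": 0.9722}))  # → True
print(is_malicious({"confidence": 0.12}))    # → False
```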