
Commit d8892cb

Add READMEs

1 parent 7db6c54

3 files changed: +101 -67 lines

**README.md** (+45 -66)

# DeepSQLi

**DeepSQLi** is a project designed to detect SQL injection (SQLi) attacks using deep learning models. It provides a **Prediction API**, a Flask application that tokenizes and sequences SQL queries, and then predicts whether a query is an SQL injection.

## Project Overview

The **Prediction API** handles both tokenization and prediction in a single endpoint, simplifying the detection workflow. It processes incoming SQL queries and determines if they contain SQL injection patterns.

### GatewayD Integration

This project can be integrated with [GatewayD](https://github.com/gatewayd-io/gatewayd) using the [GatewayD SQL IDS/IPS plugin](https://github.com/gatewayd-io/gatewayd-plugin-sql-ids-ips). The plugin acts as middleware between clients and the database, intercepting SQL queries and sending them to the Prediction API for analysis. If a query is classified as malicious by the Prediction API, the plugin blocks the query; otherwise, it forwards the query to the database.

### Architecture

```mermaid
flowchart TD
    Client <-- PostgreSQL wire protocol:15432 --> GatewayD
    GatewayD <--> P["Prediction API"]
    P -- loads --> SM["SQLi Model"]
    P -- threshold: 80% --> D{Malicious query?}
    D -->|No: send to| Database
    D -->|Yes: terminate request| GatewayD
```

## Models and Tokenization

### Model Versions

- **LSTM Models**: The first two models are LSTM-based and are trained on different datasets:
  - **`sqli_model/1`**: Trained on dataset v1 (kept for historical purposes; will be removed in the future).
  - **`sqli_model/2`**: Trained on dataset v2 (kept for historical purposes; will be removed in the future).

- **CNN-LSTM Model**: The third model, **`sqli_model/3`**, is a hybrid CNN-LSTM model with a custom SQL tokenizer that improves performance on SQL injection patterns (recommended).

### Tokenization

The Prediction API performs tokenization internally (a minimal sketch follows the list below):

- **Default Tokenizer** (Models 1 and 2): Uses Keras's `Tokenizer` for general-purpose tokenization.
- **Custom SQL Tokenizer** (Model 3): A custom tokenizer designed to handle SQL syntax and injection-specific patterns.
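
For illustration, here is a minimal sketch of the default (Keras) tokenization step. The vocabulary size and padded length are illustrative assumptions, and the Prediction API performs the equivalent step internally, so you never need to run this yourself:

```python
# Sketch of Keras-style tokenization; vocabulary size and sequence length are
# illustrative assumptions. The Prediction API does this step internally.
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

training_queries = [
    "select name from users where id = 1",
    "select * from users where id = 1 or 1=1",
]

tokenizer = Tokenizer(num_words=10000)    # cap the vocabulary (assumed value)
tokenizer.fit_on_texts(training_queries)  # build the word -> integer index

sequences = tokenizer.texts_to_sequences(["select * from users where id = 1 or 1=1"])
padded = pad_sequences(sequences, maxlen=100)  # fixed-length model input (assumed length)
print(padded.shape)  # (1, 100)
```

The custom SQL tokenizer used by model 3 replaces this step with SQL-aware token rules; see [training/README.md](./training/README.md) for details.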

## Installation

### Docker Compose

To start the Prediction API with Docker Compose:

```bash
docker compose up --build -d
```

To stop and remove the containers:

```bash
docker compose down
```

### Docker (Manual Setup)

#### Build the Image

```bash
docker build --no-cache --tag prediction-api:latest -f Dockerfile .
```

#### Run the Container

```bash
docker run --rm --name prediction-api -p 8000:8000 -d prediction-api:latest
```

## Usage

Once the Prediction API is running, use the `/predict` endpoint to classify SQL queries.

### Prediction API

```bash
curl 'http://localhost:8000/predict' -X POST -H 'Content-Type: application/json' \
  --data-raw '{"query":"SELECT * FROM users WHERE id=1 OR 1=1;"}'
```

### Response Format

The response includes the prediction (`1` for SQL injection, `0` for a legitimate query) and a confidence score. Note that the confidence score is only available for the CNN-LSTM model (`sqli_model/3`).

```json
{
  "confidence": 0.9722
}
```
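
For programmatic use, a minimal Python client sketch is shown below; it assumes the endpoint and `confidence` field from the examples above and mirrors the 80% threshold used by the GatewayD plugin:

```python
# Minimal Prediction API client sketch. The endpoint and response field follow
# the examples above; the 0.8 cut-off mirrors the plugin's 80% threshold.
import requests

response = requests.post(
    "http://localhost:8000/predict",
    json={"query": "SELECT * FROM users WHERE id=1 OR 1=1;"},
    timeout=5,
)
response.raise_for_status()
result = response.json()

if result.get("confidence", 0.0) >= 0.8:
    print("Query flagged as SQL injection:", result)
else:
    print("Query looks legitimate:", result)
```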

**training/README.md** (+44)

# SQL Injection Detection Model Training and Tokenization

This repository contains code for training SQL injection detection models using various deep learning architectures and tokenization methods. The models are saved in the [`sqli_model`](../sqli_model/) directory, organized into versions based on their architecture and tokenization strategy.

## Overview of Training Files

### Model Training Scripts

- **[`train.py`](train.py)**: Trains the models saved in [`sqli_model/1`](../sqli_model/1/) and [`sqli_model/2`](../sqli_model/2/). It uses an **LSTM-based architecture** and the default Keras tokenizer.
- **[`train_v3.py`](train_v3.py)**: Trains the model saved in [`sqli_model/3`](../sqli_model/3/), which uses a **CNN-LSTM hybrid model** with a custom SQL tokenizer designed to handle SQL syntax and injection patterns more effectively.

### Tokenization Methods

Each training script employs a different tokenization strategy, suited to its model architecture:

- **Keras Default Tokenizer (`train.py` for `sqli_model/1` and `sqli_model/2`)**:
  - The default Keras tokenizer (`tensorflow.keras.preprocessing.text.Tokenizer`) performs basic tokenization by splitting the text into words and mapping each word to an integer. This simple approach is effective for general text but may miss nuances specific to SQL syntax.

- **Custom SQL Tokenizer (`train_v3.py` for `sqli_model/3`)**:
  - The `train_v3.py` script employs a [custom SQL tokenizer](sql_tokenizer.py) that recognizes SQL-specific keywords, operators, and punctuation, providing a more robust representation for SQL injection detection and capturing the complex SQL expressions that are crucial for detecting injection patterns accurately (a hypothetical sketch follows below).
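
As a rough illustration of what SQL-aware tokenization involves, here is a hypothetical, simplified tokenizer. It is not the implementation in `sql_tokenizer.py`; it only sketches the idea of splitting on keywords, literals, operators, and punctuation instead of whitespace alone:

```python
# Hypothetical sketch of SQL-aware tokenization (the real sql_tokenizer.py may
# differ): recognize identifiers, literals, operators, and punctuation.
import re

SQL_TOKEN_RE = re.compile(
    r"[A-Za-z_][A-Za-z0-9_]*"    # keywords and identifiers
    r"|\d+"                      # numeric literals
    r"|'(?:[^']|'')*'"           # quoted string literals
    r"|--|/\*|\*/|<>|<=|>=|!="   # comment markers and multi-char operators
    r"|[=<>*,;()+\-]"            # single-char operators and punctuation
)

def tokenize_sql(query: str) -> list[str]:
    """Return lower-cased SQL tokens, keeping operators and punctuation."""
    return [token.lower() for token in SQL_TOKEN_RE.findall(query)]

print(tokenize_sql("SELECT * FROM users WHERE id=1 OR 1=1; --"))
# ['select', '*', 'from', 'users', 'where', 'id', '=', '1', 'or', '1', '=', '1', ';', '--']
```
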
## Model Architectures

- **LSTM Model** (`train.py` for `sqli_model/1` and `sqli_model/2`):
  - The models in `sqli_model/1` and `sqli_model/2` are trained using an LSTM (Long Short-Term Memory) network. LSTMs are particularly suited to sequential data, making them effective at capturing dependencies in SQL query patterns.

- **CNN-LSTM Hybrid Model** (`train_v3.py` for `sqli_model/3`):
  - The model in `sqli_model/3` is a hybrid CNN-LSTM model. It combines convolutional layers, which detect local patterns in SQL syntax, with LSTM layers, which capture sequential dependencies. This architecture, combined with the custom SQL tokenizer, enhances the model's ability to detect complex injection patterns (an illustrative sketch follows).
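
To make the hybrid architecture concrete, here is an illustrative Keras sketch. The layer sizes, vocabulary cap, and sequence length are assumptions for illustration, not the exact configuration used in `train_v3.py`:

```python
# Illustrative CNN-LSTM hybrid (layer sizes are assumptions, not the exact
# configuration in train_v3.py): Conv1D captures local token patterns, LSTM
# captures longer-range order, and a sigmoid unit outputs P(SQL injection).
import tensorflow as tf

VOCAB_SIZE = 10_000  # assumed vocabulary cap
MAX_LEN = 100        # assumed padded sequence length

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(VOCAB_SIZE, 128),
    tf.keras.layers.Conv1D(64, kernel_size=3, activation="relu"),
    tf.keras.layers.MaxPooling1D(pool_size=2),
    tf.keras.layers.LSTM(64),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.build(input_shape=(None, MAX_LEN))
model.summary()
```
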
## How to Use

1. **Training**:
   - Run `train.py` to train models using the default Keras tokenizer and save them in `sqli_model/1` or `sqli_model/2`.
   - Run `train_v3.py` to train the CNN-LSTM model with the custom SQL tokenizer and save it in `sqli_model/3`.

2. **Tokenization**:
   - The default tokenizer from Keras is used automatically within `train.py`.
   - For `train_v3.py`, the custom SQL tokenizer defined in [`sql_tokenizer.py`](sql_tokenizer.py) is used automatically.

## File Structure

- **[`train.py`](train.py)** - Trains models with the default Keras tokenizer and LSTM architecture.
- **[`train_v3.py`](train_v3.py)** - Trains the model with the custom SQL tokenizer and CNN-LSTM architecture.
- **[`sql_tokenizer.py`](sql_tokenizer.py)** - Custom SQL tokenizer for handling SQL-specific patterns, used by `train_v3.py`.

**vulnerable_app/README.md** (+12 -1)

# Vulnerable Customer App

This is a simple, intentionally vulnerable web application designed for testing SQL injection (SQLi) detection. The app is built with Flask, connects to a PostgreSQL database via GatewayD, and exposes a vulnerability in the `/customer/<customer_id>` endpoint that allows SQL injection through unsanitized user input.

> [!WARNING]
> This application is vulnerable to SQL injection and is for testing purposes only. Do not deploy this application in production or on a public server.
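
To illustrate the vulnerability pattern, the endpoint builds its SQL string directly from user input. The sketch below is hypothetical (the actual `main.py`, table name, and connection settings may differ):

```python
# Hypothetical sketch of the vulnerable pattern (the real main.py may differ):
# customer_id is interpolated into the SQL string instead of being passed as a
# bound parameter, so input such as "1 OR 1=1" rewrites the query.
from flask import Flask, jsonify
import psycopg2

app = Flask(__name__)

@app.route("/customer/<customer_id>")
def get_customer(customer_id):
    conn = psycopg2.connect(host="localhost", port=15432, dbname="postgres",
                            user="postgres", password="postgres")  # via GatewayD
    cur = conn.cursor()
    # VULNERABLE: unsanitized user input concatenated into the query.
    cur.execute(f"SELECT * FROM customers WHERE id = {customer_id}")
    rows = cur.fetchall()
    conn.close()
    return jsonify(rows)

if __name__ == "__main__":
    app.run(port=3000, debug=True)
```
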
## Setup and Installation

```bash
git clone git@github.com:gatewayd-io/DeepSQLi.git
cd DeepSQLi/vulnerable_app
pip install -r requirements.txt
python main.py
```

The app will start in debug mode and listen on `http://localhost:3000`.
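
Once the app is running (ideally with GatewayD and the SQL IDS/IPS plugin in front of the database), you can exercise the endpoint. A small sketch, where the injected `id` value is just an example payload:

```python
# Exercise the vulnerable endpoint (assumes the app is running on port 3000).
# The "1 OR 1=1" path segment piggybacks a condition onto the id value.
import requests

legit = requests.get("http://localhost:3000/customer/1", timeout=5)
print(legit.status_code, legit.text)

injected = requests.get("http://localhost:3000/customer/1 OR 1=1", timeout=5)
print(injected.status_code, injected.text)
```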
