Quickstart: Fine-tuning a Pre-trained BERT Model on a WAF Dataset¶
This guide demonstrates how to fine-tune a BERT pre-trained model on a Web Application Firewall (WAF) dataset for malicious request classification.
Introduction¶
In this example, we fine-tune the compact bert_uncased_L-2_H-128_A-2 model from Hugging Face to classify HTTP requests as malicious or benign using a WAF dataset. This process leverages the smf framework for efficient model training. For background, refer to [1].
Prerequisites:

- The latest version of the smf framework and its dependencies, installed per the installation guide. Using the provided Docker image is recommended.
- A GPU with at least 16 GB of memory (e.g., NVIDIA RTX 3090 or RTX 4090); a quick availability check is shown after this list.
- Python 3.9+ and Conda installed.
- Access to the WAF dataset repository on CodeHub.
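To confirm that a suitable GPU is visible before training, you can run a quick PyTorch check (this assumes PyTorch is installed as part of the smf dependencies):

```python
import torch

# Verify that a CUDA device is available and report its total memory.
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    total_gb = props.total_memory / 1024**3
    print(f"GPU: {props.name}, memory: {total_gb:.1f} GB")
else:
    print("No CUDA device found; training would fall back to CPU.")
```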
Dataset Introduction¶
The WAF dataset contains HTTP request logs labeled as malicious or benign, designed for training models to detect web-based attacks (e.g., SQL injection, XSS). It includes features such as URL paths, headers, and payloads, preprocessed into a format suitable for BERT’s input pipeline. The dataset is hosted on the CodeHub dataset repository and consists of approximately 50,000 samples, split into 80% training and 20% validation sets.
Step 1: Prepare the Dataset¶
Clone the WAF dataset from the CodeHub repository and initialize submodules.
git clone https://codehub.example.com/waf-dataset.git
cd waf-dataset
git submodule init
git submodule update
The dataset will be downloaded to the data/ directory in CSV format, with columns for request text and labels (0 for benign, 1 for malicious).
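To sanity-check the download, you can inspect a split with pandas. A minimal sketch, assuming the columns are named text and label (verify against the actual CSV header of your copy):

```python
import pandas as pd

# Load the training split and report basic statistics.
# Column names "text" and "label" are assumptions; check the CSV header.
df = pd.read_csv("data/train.csv")
print(df.shape)
print(df["label"].value_counts())  # expect 0 = benign, 1 = malicious
print(df["text"].iloc[0][:200])    # preview one raw HTTP request
```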
Step 2: Download the Base Model¶
Download the pre-trained bert_uncased_L-2_H-128_A-2 model from Hugging Face.
mkdir -p models
cd models
git lfs install
git clone https://huggingface.co/google/bert_uncased_L-2_H-128_A-2
The model weights and configuration will be stored in models/bert_uncased_L-2_H-128_A-2/.
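You can verify that the checkpoint loads correctly with the Hugging Face transformers library before training. Note that the two-label classification head is randomly initialized at this point, which is expected:

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Load the tokenizer and attach a fresh 2-way classification head.
path = "models/bert_uncased_L-2_H-128_A-2"
tokenizer = AutoTokenizer.from_pretrained(path)
model = AutoModelForSequenceClassification.from_pretrained(path, num_labels=2)

# Tokenize a sample request and run one forward pass.
inputs = tokenizer("GET /index.php?id=1' OR '1'='1", return_tensors="pt",
                   truncation=True, max_length=128)
print(model(**inputs).logits.shape)  # torch.Size([1, 2])
```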
Step 3: Fine-tune the Model¶
Fine-tune the model using a configuration file tailored for the WAF dataset.
Ensure the smf_env Conda environment is activated:
conda activate smf_env
Navigate to the smf source directory and run the training script with the provided configuration:
cd smf/src
python main.py --config ai_waf/bert_L2H128A2.yaml
Sample configuration file (ai_waf/bert_L2H128A2.yaml):
model:
  pretrained_path: ../models/bert_uncased_L-2_H-128_A-2
  num_labels: 2
dataset:
  path: ../../waf-dataset/data
  train_split: train.csv
  val_split: val.csv
  max_length: 128
training:
  batch_size: 32
  learning_rate: 2e-5
  epochs: 3
  optimizer: adamw
  scheduler: linear
output:
  save_dir: ../outputs/waf_finetuned
  log_interval: 100
This configuration specifies the model path, dataset details, and training hyperparameters. Adjust paths as needed based on your directory structure.
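The smf training internals are not shown here, but the configuration maps onto a standard sequence-classification fine-tuning setup. As a point of reference, a minimal sketch of the equivalent loop using the Hugging Face Trainer, assuming the CSV columns are named text and label (the smf implementation may differ):

```python
import pandas as pd
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

path = "../models/bert_uncased_L-2_H-128_A-2"
tokenizer = AutoTokenizer.from_pretrained(path)
model = AutoModelForSequenceClassification.from_pretrained(path, num_labels=2)

def tokenize(batch):
    # max_length mirrors the dataset.max_length setting in the YAML config.
    return tokenizer(batch["text"], truncation=True, max_length=128)

train = Dataset.from_pandas(pd.read_csv("../../waf-dataset/data/train.csv"))
val = Dataset.from_pandas(pd.read_csv("../../waf-dataset/data/val.csv"))
train, val = train.map(tokenize, batched=True), val.map(tokenize, batched=True)

# Hyperparameters mirror the training section of the YAML config;
# AdamW with a linear schedule is the Trainer default.
args = TrainingArguments(output_dir="../outputs/waf_finetuned",
                         per_device_train_batch_size=32,
                         learning_rate=2e-5,
                         num_train_epochs=3,
                         logging_steps=100)
Trainer(model=model, args=args, train_dataset=train, eval_dataset=val,
        tokenizer=tokenizer).train()
```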
Step 4: Evaluate the Model¶
After training, evaluate the model on the validation set:
python evaluate.py --model ../outputs/waf_finetuned --data ../../waf-dataset/data/val.csv
The script outputs metrics such as accuracy, precision, recall, and F1-score.
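If you need these metrics outside the provided script, they can be reproduced with scikit-learn from the model's predictions. A sketch, assuming the ground-truth labels and predictions are available as 0/1 integer arrays (the placeholder values below are illustrative only):

```python
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score)

# y_true and y_pred are 0/1 arrays (benign/malicious); placeholders here.
y_true = [0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1]

print(f"accuracy:  {accuracy_score(y_true, y_pred):.3f}")
print(f"precision: {precision_score(y_true, y_pred):.3f}")
print(f"recall:    {recall_score(y_true, y_pred):.3f}")
print(f"f1:        {f1_score(y_true, y_pred):.3f}")
```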