Quickstart: BERT Pre-trained Model Fine-tuning on WAF Dataset

This guide demonstrates how to fine-tune a BERT pre-trained model on a Web Application Firewall (WAF) dataset for malicious request classification.

Introduction

In this example, we fine-tune the compact bert_uncased_L-2_H-128_A-2 model (BERT with 2 layers, a hidden size of 128, and 2 attention heads) from Hugging Face to classify HTTP requests as malicious or benign using a WAF dataset. Training is driven by the smf framework. For background, refer to [1].

Prerequisites:

  • The latest version of the smf framework and its dependencies, installed per the installation guide. Using the provided Docker image is recommended.

  • A GPU with at least 16 GB of memory (e.g., NVIDIA RTX 3090 or RTX 4090).

  • Python 3.9+ and Conda installed.

  • Access to the WAF dataset repository on CodeHub.

Dataset Introduction

The WAF dataset contains HTTP request logs labeled as malicious or benign, designed for training models to detect web-based attacks (e.g., SQL injection, XSS). It includes features such as URL paths, headers, and payloads, preprocessed into a format suitable for BERT’s input pipeline. The dataset is hosted on the CodeHub dataset repository and consists of approximately 50,000 samples, split into 80% training and 20% validation sets.

Step 1: Prepare the Dataset

Clone the WAF dataset from the CodeHub repository and initialize submodules.

git clone https://codehub.example.com/waf-dataset.git
cd waf-dataset
git submodule init
git submodule update

The dataset will be downloaded to the data/ directory in CSV format, with columns for request text and labels (0 for benign, 1 for malicious).
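
As a quick sanity check, you can load the CSV files and inspect the label balance before training. The sketch below is run from the waf-dataset directory and assumes pandas is installed; the column names text and label are assumptions, so adjust them to match the actual CSV headers.

# Sanity check: load the WAF CSVs and inspect the label balance.
# Assumes pandas is installed; the column names "text" and "label" are
# assumptions, so adjust them to match the actual CSV headers.
import pandas as pd

train_df = pd.read_csv("data/train.csv")
val_df = pd.read_csv("data/val.csv")

print(f"train samples: {len(train_df)}, val samples: {len(val_df)}")
print(train_df["label"].value_counts())  # 0 = benign, 1 = malicious
print(train_df["text"].iloc[0][:200])    # peek at one raw request string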

Step 2: Download the Base Model

Download the pre-trained bert_uncased_L-2_H-128_A-2 model from Hugging Face.

mkdir -p models
cd models
git lfs install
git clone https://huggingface.co/google/bert_uncased_L-2_H-128_A-2

The model weights and configuration will be stored in models/bert_uncased_L-2_H-128_A-2/.
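
To verify that the checkpoint loads correctly, you can open it with the Hugging Face transformers library. This is a standalone check run from the directory containing models/, not part of the smf pipeline; it assumes transformers and PyTorch are installed.

# Standalone check that the downloaded checkpoint loads and tokenizes input.
# Assumes the transformers library and PyTorch are installed; this is not
# part of the smf training pipeline.
from transformers import AutoTokenizer, AutoModelForSequenceClassification

path = "models/bert_uncased_L-2_H-128_A-2"
tokenizer = AutoTokenizer.from_pretrained(path)
model = AutoModelForSequenceClassification.from_pretrained(path, num_labels=2)

# Tokenize a sample HTTP request the way the fine-tuning pipeline would
sample = "GET /index.php?id=1' OR '1'='1 HTTP/1.1"
inputs = tokenizer(sample, truncation=True, max_length=128, return_tensors="pt")
print(inputs["input_ids"].shape)  # (1, sequence_length)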

Step 3: Fine-tune the Model

Fine-tune the model using a configuration file tailored for the WAF dataset.

  1. Ensure the smf_env Conda environment is activated:

conda activate smf_env

  2. Navigate to the smf source directory and run the training script with the provided configuration:

cd smf/src
python main.py --config ai_waf/bert_L2H128A2.yaml

Sample configuration file (ai_waf/bert_L2H128A2.yaml):

model:
  pretrained_path: ../models/bert_uncased_L-2_H-128_A-2
  num_labels: 2
dataset:
  path: ../../waf-dataset/data
  train_split: train.csv
  val_split: val.csv
  max_length: 128
training:
  batch_size: 32
  learning_rate: 2e-5
  epochs: 3
  optimizer: adamw
  scheduler: linear
output:
  save_dir: ../outputs/waf_finetuned
  log_interval: 100

This configuration specifies the model path, dataset details, and training hyperparameters. Adjust paths as needed based on your directory structure.
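
For reference, the fine-tuning step this configuration drives is roughly equivalent to the standalone sketch below, built on the Hugging Face Trainer API. It is an illustration of the hyperparameters above, not the smf implementation; the column names (text, label) and the relative paths are assumptions to adjust for your setup.

# Sketch: a rough standalone equivalent of the fine-tuning step, using the
# Hugging Face Trainer API with the hyperparameters from the YAML config.
# This is NOT the smf implementation; column names ("text", "label") and
# paths are assumptions.
import pandas as pd
import torch
from torch.utils.data import Dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

class WafDataset(Dataset):
    def __init__(self, csv_path, tokenizer, max_length=128):
        df = pd.read_csv(csv_path)
        self.enc = tokenizer(list(df["text"]), truncation=True,
                             max_length=max_length, padding="max_length")
        self.labels = list(df["label"])

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, i):
        item = {k: torch.tensor(v[i]) for k, v in self.enc.items()}
        item["labels"] = torch.tensor(self.labels[i])
        return item

tokenizer = AutoTokenizer.from_pretrained("models/bert_uncased_L-2_H-128_A-2")
model = AutoModelForSequenceClassification.from_pretrained(
    "models/bert_uncased_L-2_H-128_A-2", num_labels=2)

args = TrainingArguments(
    output_dir="outputs/waf_finetuned",
    per_device_train_batch_size=32,  # training.batch_size
    learning_rate=2e-5,              # training.learning_rate
    num_train_epochs=3,              # training.epochs
    lr_scheduler_type="linear",      # training.scheduler; AdamW is the default optimizer
    logging_steps=100,               # output.log_interval
)

trainer = Trainer(model=model, args=args,
                  train_dataset=WafDataset("waf-dataset/data/train.csv", tokenizer),
                  eval_dataset=WafDataset("waf-dataset/data/val.csv", tokenizer))
trainer.train()
trainer.save_model("outputs/waf_finetuned")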

Step 4: Evaluate the Model

After training, evaluate the model on the validation set:

python evaluate.py --model ../outputs/waf_finetuned --data ../../waf-dataset/data/val.csv

The script outputs metrics such as accuracy, precision, recall, and F1-score.
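
If you want to reproduce these metrics manually from model predictions, a minimal sketch with scikit-learn (assumed to be installed) looks like this, where y_true and y_pred stand in for the validation labels and the model's predicted classes:

# Sketch: compute the reported metrics from predictions with scikit-learn
# (assumed installed). y_true / y_pred are placeholders for the validation
# labels and the model's argmax predictions.
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score)

y_true = [0, 1, 1, 0, 1]   # ground-truth labels (0 = benign, 1 = malicious)
y_pred = [0, 1, 0, 0, 1]   # model predictions

print(f"accuracy:  {accuracy_score(y_true, y_pred):.3f}")
print(f"precision: {precision_score(y_true, y_pred):.3f}")
print(f"recall:    {recall_score(y_true, y_pred):.3f}")
print(f"f1:        {f1_score(y_true, y_pred):.3f}")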

References