.. _quickstart:

==============================================================
Quickstart: BERT Pre-trained Model Fine-tuning on WAF Dataset
==============================================================

This guide demonstrates how to fine-tune a BERT pre-trained model on a Web
Application Firewall (WAF) dataset for malicious request classification.

Introduction
------------

.. _bert_uncased_L-2_H-128_A-2: https://huggingface.co/google/bert_uncased_L-2_H-128_A-2

In this example, we fine-tune the compact `bert_uncased_L-2_H-128_A-2`_ model
from Hugging Face to classify HTTP requests as malicious or benign using a WAF
dataset. This process leverages the ``smf`` framework for efficient model
training. For background, refer to [1]_.

Prerequisites:

- The latest version of the ``smf`` framework and its dependencies, installed
  per the `installation guide`_. Using the provided Docker image is recommended.
- A GPU with at least 16 GB of memory (e.g., NVIDIA RTX 3090 or RTX 4090).
- Python 3.9+ and Conda installed.
- Access to the WAF dataset repository on CodeHub.

Dataset Introduction
--------------------

The WAF dataset contains HTTP request logs labeled as malicious or benign,
designed for training models to detect web-based attacks (e.g., SQL injection,
XSS). It includes features such as URL paths, headers, and payloads,
preprocessed into a format suitable for BERT's input pipeline. The dataset is
hosted on the CodeHub dataset repository and consists of approximately 50,000
samples, split into 80% training and 20% validation sets.

Step 1: Prepare the Dataset
---------------------------

Clone the WAF dataset from the CodeHub repository and initialize its submodules:

.. code-block:: bash

   git clone https://codehub.example.com/waf-dataset.git
   cd waf-dataset
   git submodule init
   git submodule update

The dataset is downloaded to the ``data/`` directory in CSV format, with
columns for the request text and labels (0 for benign, 1 for malicious).

Step 2: Download the Base Model
-------------------------------

Download the pre-trained ``bert_uncased_L-2_H-128_A-2`` model from Hugging Face:

.. code-block:: bash

   mkdir -p models
   cd models
   git lfs install
   git clone https://huggingface.co/google/bert_uncased_L-2_H-128_A-2

The model weights and configuration are stored in
``models/bert_uncased_L-2_H-128_A-2/``.

Step 3: Fine-tune the Model
---------------------------

Fine-tune the model using a configuration file tailored for the WAF dataset.

1. Ensure the ``smf_env`` Conda environment is activated:

   .. code-block:: bash

      conda activate smf_env

2. Navigate to the ``smf`` source directory and run the training script with
   the provided configuration:

   .. code-block:: bash

      cd smf/src
      python main.py --config ai_waf/bert_L2H128A2.yaml

Sample configuration file (``ai_waf/bert_L2H128A2.yaml``):

.. code-block:: yaml

   model:
     pretrained_path: ../models/bert_uncased_L-2_H-128_A-2
     num_labels: 2
   dataset:
     path: ../../waf-dataset/data
     train_split: train.csv
     val_split: val.csv
     max_length: 128
   training:
     batch_size: 32
     learning_rate: 2e-5
     epochs: 3
     optimizer: adamw
     scheduler: linear
   output:
     save_dir: ../outputs/waf_finetuned
     log_interval: 100

This configuration specifies the model path, dataset details, and training
hyperparameters. Adjust the paths as needed for your directory structure.
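To make the training step concrete, the sketch below shows roughly what this
run corresponds to when written directly against the Hugging Face
``transformers`` and ``datasets`` libraries. It is an illustrative
approximation, not the ``smf`` implementation: the CSV column names ``text``
and ``label`` are assumptions about the dataset layout and may differ from the
actual files.

.. code-block:: python

   # Illustrative sketch only -- not the smf training code.
   # Assumes the CSV files expose "text" and "label" columns.
   from datasets import load_dataset
   from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                             Trainer, TrainingArguments)

   model_path = "../models/bert_uncased_L-2_H-128_A-2"
   data_dir = "../../waf-dataset/data"

   tokenizer = AutoTokenizer.from_pretrained(model_path)
   model = AutoModelForSequenceClassification.from_pretrained(model_path,
                                                              num_labels=2)

   # Each row holds one HTTP request and a 0 (benign) / 1 (malicious) label.
   raw = load_dataset("csv", data_files={"train": f"{data_dir}/train.csv",
                                         "validation": f"{data_dir}/val.csv"})

   def tokenize(batch):
       # Same max_length as in bert_L2H128A2.yaml.
       return tokenizer(batch["text"], truncation=True,
                        padding="max_length", max_length=128)

   encoded = raw.map(tokenize, batched=True)

   # TrainingArguments defaults to AdamW with a linear learning-rate schedule,
   # matching the optimizer/scheduler entries in the YAML configuration.
   args = TrainingArguments(
       output_dir="../outputs/waf_finetuned",
       per_device_train_batch_size=32,
       learning_rate=2e-5,
       num_train_epochs=3,
   )

   trainer = Trainer(model=model, args=args,
                     train_dataset=encoded["train"],
                     eval_dataset=encoded["validation"])
   trainer.train()
   print(trainer.evaluate())

In the actual workflow these details are handled by ``main.py`` and the YAML
configuration shown above; the sketch is only meant to clarify what the
hyperparameters control.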
Step 4: Evaluate the Model
--------------------------

After training, evaluate the model on the validation set:

.. code-block:: bash

   python evaluate.py --model ../outputs/waf_finetuned --data ../../waf-dataset/data/val.csv

The script outputs metrics such as accuracy, precision, recall, and F1-score.

References
----------

.. [1] Turc, I., Chang, M.-W., Lee, K., and Toutanova, K. (2019). Well-Read
   Students Learn Better: On the Importance of Pre-training Compact Models.
   https://arxiv.org/abs/1908.08962. This paper highlights the benefits of
   pre-training compact models like BERT for downstream tasks.