Classifier-Free Guidance in LLMs Safety — NeurIPS 2024 Challenge Experience


Task: Assuming that attackers have access to the scrubbed data, the task is to protect the LLM from generating answers that contain any personally identifiable information (PII).

Solution: My solution is based on ORPO tuning of the model (a mix of supervised fine-tuning and reinforcement learning) on synthetic data, enhanced with classifier-free guidance (CFG) at inference time.

Synthetic data generation

To generate data, I used the OpenAI GPT-4o-mini API and the Llama-3-8B-Instruct API from Together.ai. The data generation schema is illustrated in the image below:

Image by author: Data generation schema

In general, each model was prompted to avoid any PII in the response, even though PII can be present in the prompt or previous context. The responses were validated with a spaCy named entity recognition (NER) model. Having both chosen and rejected samples, we can construct a dataset for DPO-style reinforcement learning training without a reward function.
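
A minimal sketch of this validation step is shown below, assuming spaCy's small English pipeline; the PII label set is illustrative rather than the exact one used in the challenge.

import spacy

nlp = spacy.load("en_core_web_sm")  # assumed pipeline; any spaCy NER model works
PII_LABELS = {"PERSON", "GPE", "LOC", "ORG"}  # illustrative label set

def contains_pii(text: str) -> bool:
    """Return True if the NER model finds any PII-like entity in the text."""
    doc = nlp(text)
    return any(ent.label_ in PII_LABELS for ent in doc.ents)

# PII-free responses become "chosen" samples; responses with detected PII become "rejected" ones.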

Additionally, I wanted to apply classifier-free guidance (CFG) during inference with contrasting prompts, e.g. “You should share personal data in the answers.” and “Do not provide any personal data.”, to force PII-free responses this way. However, to align the model with these different system prompts, the same prompts could be used in the training dataset, with the chosen and rejected samples swapped accordingly.

CFG during the inference can be formulated in the following way:
we have Ypos and Yneg, which are the model outputs for the inputs with the “Do not provide any personal data.” and “You should share personal data in the answers.” system prompts, respectively. The resulting prediction is:

Ypred = CFGcoeff * (Ypos - Yneg) + Yneg, where CFGcoeff is the CFG coefficient that determines how strongly Ypos is preferred over Yneg.
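
Below is a minimal greedy-decoding sketch of this combination, applied to the next-token logits of the two prompts at every step. The checkpoint name is an assumed placeholder, and in practice the two prompts can be processed together in one batch; recent transformers releases also expose a built-in variant through the guidance_scale and negative_prompt_ids arguments of generate.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.1-8B-Instruct"  # assumed checkpoint name
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

POS = "Do not provide any personal data."
NEG = "You should share personal data in the answers."

@torch.no_grad()
def cfg_generate(user_msg: str, cfg_coeff: float = 3.0, max_new_tokens: int = 128) -> str:
    def prompt_ids(system_prompt: str) -> torch.Tensor:
        msgs = [{"role": "system", "content": system_prompt},
                {"role": "user", "content": user_msg}]
        return tok.apply_chat_template(msgs, add_generation_prompt=True, return_tensors="pt")

    pos_ids, neg_ids = prompt_ids(POS), prompt_ids(NEG)
    generated = []
    for _ in range(max_new_tokens):
        logits_pos = model(pos_ids).logits[:, -1, :]
        logits_neg = model(neg_ids).logits[:, -1, :]
        # Ypred = CFGcoeff * (Ypos - Yneg) + Yneg, applied here to next-token logits
        logits = cfg_coeff * (logits_pos - logits_neg) + logits_neg
        next_id = logits.argmax(dim=-1, keepdim=True)
        if next_id.item() == tok.eos_token_id:
            break
        generated.append(next_id.item())
        # the chosen token is appended to both branches so they stay in sync
        pos_ids = torch.cat([pos_ids, next_id], dim=-1)
        neg_ids = torch.cat([neg_ids, next_id], dim=-1)
    return tok.decode(generated, skip_special_tokens=True)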

So I ended up with two versions of the dataset: a plain version, where the chosen samples are PII-free and the rejected samples contain PII; and a CFG version with the two different system prompts and the corresponding swapping of chosen and rejected samples.
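
As a sketch of how the CFG version is assembled (the field names are assumed, DPO/ORPO-style; only the two system prompts come from the text above):

POS_SYSTEM = "Do not provide any personal data."
NEG_SYSTEM = "You should share personal data in the answers."

def make_cfg_rows(user_prompt: str, pii_free: str, with_pii: str) -> list[dict]:
    return [
        # Under the "positive" system prompt the PII-free answer is preferred.
        {"system": POS_SYSTEM, "prompt": user_prompt,
         "chosen": pii_free, "rejected": with_pii},
        # Under the "negative" system prompt the preference is swapped, so the
        # model learns to follow either system prompt.
        {"system": NEG_SYSTEM, "prompt": user_prompt,
         "chosen": with_pii, "rejected": pii_free},
    ]
# For training, the system prompt is folded into the chat-formatted prompt.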

Training

The training was conducted using the ORPO approach, which combines a supervised fine-tuning loss with a reinforcement-learning-style odds ratio loss. ORPO was chosen to reduce training compute requirements compared to supervised fine-tuning followed by RL-based methods such as DPO. Other training specifications are listed below; a training-setup sketch follows the list:

  • 1xA40 GPU with 48 GiB of memory to train the models;
  • LoRA training with rank-16 adapters applied to all linear layers;
  • 3 epochs, batch size 2, AdamW optimizer, bfloat16 mixed precision, initial learning rate of 1e-4 with a cosine learning rate scheduler decaying to 10% of the initial learning rate.
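
A rough sketch of this setup with TRL's ORPOTrainer and a rank-16 LoRA on all linear layers; the dataset file, model path, adapter alpha, and some argument names are assumptions and may differ between library versions.

import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import ORPOConfig, ORPOTrainer

base_model = "path/to/organizers-pii-enriched-llama-3.1-8b-instruct"  # assumed local path
tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForCausalLM.from_pretrained(base_model, torch_dtype=torch.bfloat16)

# Preference dataset with "prompt", "chosen" and "rejected" columns (assumed file name).
train_ds = load_dataset("json", data_files="orpo_pii_pairs.jsonl", split="train")

peft_config = LoraConfig(
    r=16,                         # rank 16, as in the list above
    lora_alpha=32,                # assumed value, not stated in the article
    target_modules="all-linear",  # adapters on all linear layers
    task_type="CAUSAL_LM",
)

args = ORPOConfig(
    output_dir="orpo-pii-free",
    num_train_epochs=3,
    per_device_train_batch_size=2,
    learning_rate=1e-4,
    lr_scheduler_type="cosine",  # the 10% floor needs a min-lr cosine variant in recent transformers
    bf16=True,
    optim="adamw_torch",
)

trainer = ORPOTrainer(
    model=model,
    args=args,
    train_dataset=train_ds,
    tokenizer=tokenizer,  # newer TRL versions call this argument processing_class
    peft_config=peft_config,
)
trainer.train()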

The model to train is the one provided by the organizers: a Llama-3.1-8B-Instruct model further trained on a PII-enriched dataset.

Evaluation

The task of making an LLM generate PII-free responses is a kind of unlearning task. Usually, a retain dataset is used for unlearning; it helps maintain the model's performance outside the unlearning dataset. My idea was to do unlearning without any retain dataset (to avoid bias toward the retain dataset and to simplify the design). Two components of the solution were expected to help maintain performance:

  1. Synthetic data from the original Llama-3.1-8B-Instruct model: the model I tuned is derived from it, so data sampled from that model should have a regularising effect;
  2. The reinforcement learning component of the training should limit deviation from the model selected for tuning.

For the model evaluation purposes, two datasets were utilized:

  • A subsample of 150 samples from the test dataset, to check whether PII generation in the responses is avoided. The score on this dataset was calculated with the same spaCy NER model as in the data generation process;
  • The validation part of “TIGER-Lab/MMLU-Pro”, to test model utility and general performance. Correctness of the responses on this dataset was assessed by a GPT-4o-mini judge (a judge-call sketch follows this list).
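
An illustrative judge call for the MMLU-Pro utility check; only the judge model name comes from the article, while the prompt wording and function signature are assumptions.

from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def judge_correct(question: str, reference: str, answer: str) -> bool:
    """Ask GPT-4o-mini whether the model's answer matches the reference."""
    prompt = (
        "You are grading an exam answer.\n"
        f"Question: {question}\n"
        f"Reference answer: {reference}\n"
        f"Model answer: {answer}\n"
        "Reply with a single word: CORRECT or INCORRECT."
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip().upper().startswith("CORRECT")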

Results for the models trained with the two described datasets are presented in the image below:

Image by author: Evaluation results on two datasets

For the CFG-type method, a CFG coefficient of 3 was used during inference.

CFG inference shows a significant reduction in the number of revealed PII entities without any degradation on MMLU across the tested guidance coefficients.

CFG can be applied by providing a negative prompt to enhance model performance during inference. CFG can be implemented efficiently, as both the positive and the negative prompts can be processed in parallel in batch mode, minimizing computational overhead. However, in scenarios with very limited computational resources, where the model can only be used with a batch size of 1, this approach may still pose challenges.

Guidance coefficients higher than 3 were also tested. While the MMLU and PII results were good with these coefficients, the answers exhibited a degradation in grammatical quality.

Here I described a method of direct RL-plus-supervised, retain-dataset-free fine-tuning that can improve a model's unlearning without any inference overhead (CFG can be applied in batch-inference mode). Combining classifier-free guidance with LoRA adapters also opens up additional opportunities for inference-time safety: for example, different guidance coefficients can be applied depending on the source of traffic; moreover, LoRA adapters can be attached to or detached from the base model to control access to PII, which can be quite effective with, for instance, tiny LoRA adapters built with the Bit-LoRA approach.
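
As a sketch of the attach/detach idea using standard peft calls (the model and adapter paths are placeholders):

from peft import PeftModel
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
model = PeftModel.from_pretrained(base, "path/to/pii-safety-lora")  # assumed adapter path

# The safety adapter is active by default (PII-free behaviour).
with model.disable_adapter():
    pass  # inside this block the raw base model answers, e.g. for trusted traffic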
