Batch Calibration: A Simple and Effective Method for AI Model Calibration
Large language models (LLMs) are powerful tools that can be adapted to new tasks by providing human-designed instructions. However, LLM predictions are highly sensitive to the design of the prompt and are often biased toward certain labels, resulting in unexpected performance degradation. To address this problem, calibration methods have been developed to mitigate these biases and improve LLM performance. In our research, we conducted a systematic analysis of existing calibration methods and proposed a new method called Batch Calibration (BC) that addresses the limitations of previous approaches.
Understanding the Problem
Uncalibrated LLM predictions suffer from contextual bias: the model unfairly favors certain labels regardless of the input. In analyzing existing calibration methods, we found that non-linear decision boundaries, although more flexible, are prone to overfitting, while linear boundaries are more robust and generalizable. Additionally, relying on content-free inputs (such as a placeholder like "N/A") to estimate contextual bias may itself introduce additional bias, depending on the task.
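To make the content-free baseline concrete, here is a minimal NumPy sketch of that prior approach (in the style of Contextual Calibration): the model's label probabilities on a content-free input are used to rescale its predictions. The function name and array shapes are illustrative assumptions, not an interface from the paper.

```python
import numpy as np

def content_free_calibrate(probs, content_free_probs):
    """Rescale predictions by the model's bias on a content-free input.

    probs: (batch, classes) label probabilities for real inputs.
    content_free_probs: (classes,) label probabilities the model assigns
        to a content-free input such as "N/A".
    """
    # A label the model favors on "N/A" gets down-weighted, and vice versa.
    calibrated = probs / content_free_probs
    # Renormalize so each row is a probability distribution again.
    return calibrated / calibrated.sum(axis=-1, keepdims=True)
```

The weakness noted above is visible here: if the content-free input is not truly neutral for a given task, `content_free_probs` encodes a bias of its own, and dividing by it can skew rather than correct the predictions.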
Introducing Batch Calibration
BC is a zero-shot, self-adaptive calibration method that estimates the contextual bias for each class directly from a batch of inputs. We use a linear decision boundary for robustness and take the mean score for each class within the batch as the bias estimate. Subtracting this estimate centers the LLM's log-probability scores for each class, yielding calibrated probabilities. BC requires no additional inputs and incurs negligible computational cost.
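The procedure above can be sketched in a few lines of NumPy. This is a minimal illustration under the assumption that the LLM's scores are log-probabilities arranged as a `(batch, classes)` array; the function name is ours, not from the paper.

```python
import numpy as np

def batch_calibrate(log_probs):
    """Batch Calibration sketch.

    log_probs: (batch, classes) array of LLM log-probability scores.
    Returns calibrated scores; predictions are taken via argmax per row.
    """
    # Estimate the contextual bias of each class as its mean score
    # over the batch -- no content-free inputs or extra queries needed.
    bias = log_probs.mean(axis=0, keepdims=True)
    # Subtracting the per-class mean is a linear correction that centers
    # each class's scores, removing the estimated contextual bias.
    return log_probs - bias
```

For example, if the raw scores consistently favor one label across a batch, subtracting the per-class mean removes that shared offset, so only the input-specific evidence drives the argmax decision.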
We tested BC on 13 diverse natural language understanding tasks and achieved state-of-the-art performance with the PaLM 2 and CLIP models. BC consistently outperformed existing calibration methods, demonstrating its effectiveness in mitigating contextual bias and improving LLM performance. We also observed that BC remained stable and effective across tasks and when varying the number of few-shot examples.
Improving Prompt Engineering
BC proved robust to prompt engineering choices, such as the selection and order of in-context examples, and it was resilient to variations in the design of the label space. This makes prompt engineering easier and enhances LLM performance.
Batch Calibration is a simple and effective method for calibrating AI models. It mitigates contextual bias, improves LLM performance, and makes prompt engineering easier. BC outperforms existing calibration methods and remains stable across tasks and across varying numbers of few-shot examples. By addressing the limitations of previous approaches, BC shows promise in advancing robust and efficient LLM applications.