Improving Explanations through ML Techniques
Although the vast majority of our explanations score poorly, we believe we can use machine learning (ML) techniques to further improve the explanations we produce. We found three ways to improve explanation scores:
1. Iterating on explanations: We can increase scores by asking GPT-4 to propose possible counterexamples and then revising the explanation in light of the neuron's activations on them (see the sketch after this list).
2. Using larger models to give explanations: The average score rises as the explainer model's capabilities increase. However, even GPT-4 gives worse explanations than humans, leaving room for improvement.
3. Modifying the architecture of the explained model: Training models with different activation functions improved explanation scores.
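As a concrete illustration of the first technique, here is a minimal sketch of a counterexample-driven revision loop. The callables it takes are hypothetical stand-ins (`ask_model`, `get_activations`, and `score_explanation` are not names from our released code), and the prompts are illustrative rather than the ones we used.

```python
# A minimal sketch of the counterexample-driven revision loop (assumed
# structure, not the released implementation). The callables passed in are
# hypothetical: ask_model(prompt) -> str, get_activations(neuron, text)
# -> list[float], and score_explanation(explanation, neuron) -> float.

def refine_explanation(explanation, neuron, ask_model, get_activations,
                       score_explanation, rounds=3):
    best, best_score = explanation, score_explanation(explanation, neuron)
    for _ in range(rounds):
        # 1. Ask the explainer model for snippets that might contradict the explanation.
        counterexamples = ask_model(
            "Propose 5 short text snippets that would test whether this "
            f"neuron explanation is wrong: {best!r}"
        ).splitlines()
        # 2. Record how the neuron actually activates on those snippets.
        evidence = [(text, get_activations(neuron, text))
                    for text in counterexamples if text.strip()]
        # 3. Ask the explainer model to revise the explanation in light of the evidence.
        revised = ask_model(
            f"Current explanation: {best!r}\n"
            f"Observed (snippet, activations) pairs: {evidence!r}\n"
            "Rewrite the explanation so it accounts for this evidence."
        )
        # Keep the revision only if it scores better under the same scorer.
        revised_score = score_explanation(revised, neuron)
        if revised_score > best_score:
            best, best_score = revised, revised_score
    return best
```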
We are open-sourcing our datasets and visualization tools for GPT-4-written explanations of all 307,200 neurons in GPT-2, as well as code for explanation and scoring using publicly available models on the OpenAI API. We hope the research community will develop new techniques for generating higher-scoring explanations and better tools for exploring GPT-2 by means of explanations.
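To make the explain-and-score workflow concrete, below is a minimal sketch of driving it through the OpenAI API: explain a neuron from (token, activation) pairs, then score the explanation by having the model simulate activations and correlating them with the real ones. The prompt wording, the `records` format, and the 0–10 activation scale are assumptions made for this example, not the released code.

```python
# Illustrative sketch only: explain a neuron from (token, activation) records,
# then score the explanation via simulate-and-correlate. Prompts and data
# format are assumptions, not the released pipeline.
import numpy as np
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def ask(prompt, model="gpt-4"):
    resp = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}]
    )
    return resp.choices[0].message.content


def explain(records):
    """records: list of (token, activation) pairs from top-activating text excerpts."""
    listing = "\n".join(f"{tok!r}: {act:.2f}" for tok, act in records)
    return ask(
        "Here are tokens and one neuron's activations on them:\n"
        f"{listing}\n"
        "In one sentence, what does this neuron appear to respond to?"
    ).strip()


def score(explanation, records):
    """Correlation between simulated and real activations (simulate-and-score idea)."""
    tokens, real = zip(*records)
    reply = ask(
        f"A neuron is described as: {explanation}\n"
        "For each token below, output only an integer 0-10 guessing how strongly "
        "the neuron activates, one number per line:\n"
        + "\n".join(repr(t) for t in tokens)
    )
    simulated = [float(x) for x in reply.split()]  # naive parsing; sketch only
    return float(np.corrcoef(real, simulated)[0, 1])
```

As a usage example, `score(explain(records), records)` gives a rough score for a single neuron's records under these assumptions.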
We found more than 1,000 neurons with explanations that scored at least 0.8, meaning that according to GPT-4 the explanation accounts for most of the neuron's top-activating behavior. Most of these well-explained neurons are not very interesting, but we also discovered many interesting neurons that GPT-4 didn't understand. We anticipate that as explanations improve, we will rapidly uncover interesting qualitative insights into the computations performed by the model.