Introducing Automated Interpretability Agents in AI Research
Explaining the behavior of trained artificial intelligence (AI) systems is difficult, because understanding how they work typically requires extensive experimentation. A team of researchers from MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) has developed an approach that uses AI models to run experiments on other systems and explain their behavior.
Automated Interpretability Agent Method
The new approach is based on the Automated Interpretability Agent (AIA), which is designed to mimic a scientist’s experimental processes. The AIA can plan and perform tests on other computational systems and produce explanations in the form of language descriptions and code that reproduces a system’s behavior.
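To make the idea concrete, the following is a minimal sketch of an experiment-then-explain loop, not the CSAIL team's implementation. The names `black_box`, `run_experiments`, `fit_linear_hypothesis`, and `explain` are hypothetical, and the sketch assumes the system under study is a simple numeric function that can be summarized by a straight line. A real AIA would choose its own experiments and hypotheses using a language model.

```python
from typing import Callable, List, Tuple


def black_box(x: float) -> float:
    """Hypothetical system under study; the explainer may only call it."""
    return 3.0 * x + 2.0


def run_experiments(system: Callable[[float], float],
                    inputs: List[float]) -> List[Tuple[float, float]]:
    """Probe the system on chosen inputs and record (input, output) pairs."""
    return [(x, system(x)) for x in inputs]


def fit_linear_hypothesis(observations: List[Tuple[float, float]]) -> Tuple[float, float]:
    """Least-squares fit of y = a * x + b to the recorded observations."""
    n = len(observations)
    mean_x = sum(x for x, _ in observations) / n
    mean_y = sum(y for _, y in observations) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in observations)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in observations)
    a = cov_xy / var_x
    b = mean_y - a * mean_x
    return a, b


def explain(system: Callable[[float], float],
            probe_inputs: List[float]) -> Tuple[str, Callable[[float], float]]:
    """Return a language description plus code that reproduces the behavior."""
    observations = run_experiments(system, probe_inputs)
    a, b = fit_linear_hypothesis(observations)
    description = f"The system behaves like a linear function: y = {a:.2f} * x + {b:.2f}"

    def surrogate(x: float) -> float:
        # Code-form explanation: an executable stand-in for the system.
        return a * x + b

    return description, surrogate


description, surrogate = explain(black_box, [-2.0, -1.0, 0.0, 1.0, 2.0])
print(description)
```

The two return values mirror the two forms of explanation mentioned above: a description in language and code that reproduces the system's behavior.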
Function Interpretation and Description Benchmark
To evaluate this approach, the team developed the Function Interpretation and Description (FIND) benchmark, which tests explanations of functions in AI systems. This benchmark allows researchers to compare the capabilities of AIAs to other methods in the literature and provides a reliable standard for evaluating interpretability procedures.
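One way to picture how a code-form explanation could be checked against a known function is sketched below. This is an illustrative assumption only: the article does not describe how FIND scores explanations, and `ground_truth`, `candidate_explanation`, and `agreement_score` are hypothetical names. The sketch measures the fraction of test inputs on which the candidate reproduces the reference output.

```python
from typing import Callable, List


def ground_truth(x: float) -> float:
    """Hypothetical benchmark function with known behavior."""
    return abs(x)


def candidate_explanation(x: float) -> float:
    """Hypothetical, deliberately incomplete explanation: correct only for x >= 0."""
    return x


def agreement_score(reference: Callable[[float], float],
                    candidate: Callable[[float], float],
                    inputs: List[float],
                    tol: float = 1e-6) -> float:
    """Fraction of inputs on which the candidate matches the reference within tol."""
    matches = sum(1 for x in inputs if abs(reference(x) - candidate(x)) <= tol)
    return matches / len(inputs)


test_inputs = [float(x) for x in range(-5, 5)]  # -5.0 through 4.0
score = agreement_score(ground_truth, candidate_explanation, test_inputs)
print(f"agreement: {score:.2f}")
```

A score below 1.0 signals that the explanation captures only part of the function's behavior, here the non-negative inputs.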
Automated Interpretability’s Future
While AIAs outperform existing interpretability methods, they still struggle to accurately describe almost half of the functions in the benchmark. The researchers are developing a toolkit to enhance AIAs’ ability to analyze neural networks and develop automated interpretability procedures for real-world scenarios such as autonomous driving and face recognition systems.
The team envisions nearly autonomous AIAs that work under the oversight and guidance of human scientists. Their focus is on expanding AI interpretability to cover more complex behaviors and on predicting inputs that might lead to undesired behaviors.