The extraordinary performance of language models (LMs) suggests that they can effectively learn from text data and apply that knowledge to downstream tasks. LMs have achieved impressive results in natural language processing, matching or surpassing state-of-the-art methods and even outperforming humans on tasks that require complex reasoning. However, it remains important to determine whether this success reflects general reasoning abilities or the ability to recognize and recall specific tasks encountered during training.
Previous research has mainly focused on how well LMs perform specific tasks, an approach complicated by data contamination. In this study, the researchers instead investigated the generalizability of LMs by altering the conditions or rules under which tasks the models already perform well are carried out. These altered tasks, called counterfactual tasks, deviate from the default task conditions and measure a model's ability to adapt to different variations of a task.
The researchers designed a suite of 11 counterfactual evaluation tasks spanning several categories and domains, including deductive reasoning, code generation, drawing, and spatial reasoning. The reasoning procedure remains the same across each original task and its counterfactual variant, but the input-output mappings differ. This setup isolates how well LMs can adapt to new variations of a task, rather than simply recalling the familiar version.
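To make the idea of a counterfactual variant concrete, here is a minimal sketch (our own illustration, not code from the paper): multi-digit addition uses the same reasoning procedure regardless of number base, but switching from the familiar base 10 to base 9 changes the input-output mapping. The function name `add_in_base` and the specific example values are ours.

```python
# Hypothetical illustration of a counterfactual task variant:
# the reasoning procedure (multi-digit addition with carries) is
# unchanged, but the input-output mapping shifts with the base.

def add_in_base(a: str, b: str, base: int) -> str:
    """Add two numbers written as digit strings in the given base."""
    total = int(a, base) + int(b, base)  # parse, then add
    if total == 0:
        return "0"
    digits = []
    while total:  # convert the sum back to a digit string in `base`
        digits.append("0123456789abcdefghijklmnopqrstuvwxyz"[total % base])
        total //= base
    return "".join(reversed(digits))

# Default condition: base 10, the form most common in training data.
print(add_in_base("27", "35", 10))  # → 62

# Counterfactual condition: base 9. Same algorithm, different answer:
# "27" and "35" now denote 25 and 32, whose sum 57 is "63" in base 9.
print(add_in_base("27", "35", 9))   # → 63
```

A model that has genuinely learned the addition procedure should handle both conditions; a model that has mostly memorized base-10 patterns will degrade on the base-9 variant, which is exactly the gap the counterfactual evaluation measures.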
The performance of several language models, including GPT-4, GPT-3.5, Claude, and PaLM-2, was evaluated under both the default and counterfactual conditions of the tasks. While the LMs performed above random chance on the counterfactual tasks, their performance consistently degraded relative to the default settings. This suggests that the models' success on these tasks is partly attributable to abilities specific to the default conditions rather than to general reasoning.
The findings also revealed interesting relationships between model behavior on default and counterfactual tasks, including correlations between default and counterfactual performance, the effectiveness of zero-shot chain-of-thought prompting, and interactions between task- and instance-level frequency effects. Overall, even slight variations on the default instantiations of tasks challenged the LMs, indicating that the success of existing models should not be attributed solely to a general capacity for the target task.
For more details, you can check out the full paper [here](https://arxiv.org/abs/2307.02477).