Introduction to Large Language Models (LLMs)
Recent progress in Large Language Models (LLMs) has drawn wide attention in the AI community. These models produce human-like language by building on advances in Natural Language Processing (NLP), Natural Language Generation (NLG), and Natural Language Understanding (NLU). LLMs have gained popularity for their ability to engage in realistic conversations, answer questions, generate content, complete code, translate languages, and summarize text. NLP aims to enable computers to understand and respond to natural language commands, allowing more natural and flexible interaction between humans and machines. One notable application is instruction-following models.
Instruction-Following Models in Question-Answering Tasks
Instruction-following models are built on LLMs and trained with supervised examples or other forms of guidance. During training they are exposed to thousands of tasks presented as natural language instructions. A recent study by a team from Mila Quebec AI Institute, McGill University, and Facebook CIFAR AI Chair evaluated the performance of instruction-following models on question-answering (QA) tasks. Given a prompt that combines a task description, the question, and relevant text passages returned by a retriever, these models can produce answers that are natural and informative, which can enhance user trust and engagement.
Measuring Instruction-Following Model Performance
Traditional QA evaluation metrics such as exact match (EM) and F1 score struggle to quantify the performance of instruction-following models because these models give verbose responses. A response may contain more detail than the reference answer and still be accurate, yet both metrics penalize the extra words. To address this, the research team proposes two criteria for measuring the performance of instruction-following models in retrieval-augmented question answering (QA):
- Information Necessity and Accuracy: This dimension evaluates how well the model satisfies the user’s informational requirements. It considers whether the generated response includes relevant information, even if it goes beyond what is explicitly mentioned in the reference answer.
- Fidelity in Relation to Provided Information: This dimension assesses how well the model grounds its answers in the presented knowledge. A good model should abstain from responding when irrelevant information is provided and provide accurate answers when the relevant information is available.
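To make the verbosity problem with EM and F1 concrete, here is a minimal Python sketch. It assumes a common SQuAD-style normalization (lowercasing, removing punctuation and English articles); the function names and the example strings are illustrative and are not taken from the study's code.

```python
import re
import string
from collections import Counter
from typing import List


def normalize(text: str) -> List[str]:
    """Lowercase, strip punctuation and English articles, split into tokens."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def exact_match(response: str, reference: str) -> float:
    """1.0 if the normalized response equals the normalized reference, else 0.0."""
    return float(normalize(response) == normalize(reference))


def f1_score(response: str, reference: str) -> float:
    """Token-level F1: harmonic mean of precision and recall over shared tokens."""
    resp, ref = normalize(response), normalize(reference)
    overlap = sum((Counter(resp) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(resp)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: both responses are correct, only one is terse.
reference = "Paris"
short_response = "Paris"
verbose_response = "The capital of France is Paris, which is also its largest city."

print(exact_match(short_response, reference))            # 1.0
print(exact_match(verbose_response, reference))          # 0.0
print(round(f1_score(verbose_response, reference), 3))   # 0.167
```

The verbose response is correct, but EM drops to zero and F1 falls to about 0.17 because precision is diluted by the extra tokens.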
The authors evaluated several instruction-following models on three diverse QA datasets: Natural Questions for open-domain QA, HotpotQA for multi-hop QA, and TopiOCQA for conversational QA. They manually analyzed 900 model responses and compared these human judgments with a range of automatic metrics for correctness and faithfulness. The results suggest that recall, which measures how much of the reference answer appears in the model response, is a stronger indicator of correctness than stricter lexical overlap metrics such as EM or F1 score, which penalize additional detail. For faithfulness, K-Precision, which measures how much of the model's answer is covered by the provided knowledge snippet, shows a stronger correlation with human judgments.
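The two token-overlap metrics can be sketched in a few lines of Python. This is a simplified illustration, assuming recall is the fraction of reference tokens found in the response and K-Precision is the fraction of response tokens found in the knowledge snippet. The tokenization (lowercasing, punctuation removal), the helper names, and the example strings are assumptions made here, so the study's own implementation may differ in detail.

```python
import string
from collections import Counter


def tokens(text):
    """Lowercase, strip punctuation, split on whitespace (assumed tokenization)."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    return text.split()


def overlap_fraction(source, target):
    """Fraction of tokens in `source` that also occur in `target` (multiset overlap)."""
    src, tgt = tokens(source), tokens(target)
    if not src:
        return 0.0
    shared = sum((Counter(src) & Counter(tgt)).values())
    return shared / len(src)


def recall(response, reference):
    """Correctness proxy: how much of the reference answer the response covers."""
    return overlap_fraction(reference, response)


def k_precision(response, knowledge):
    """Faithfulness proxy: how much of the response is supported by the knowledge."""
    return overlap_fraction(response, knowledge)


# Hypothetical example data.
reference = "Paris"
knowledge = "Paris is the capital and largest city of France."
grounded = "The capital of France is Paris."
ungrounded = "Paris, famous for the Eiffel Tower."

print(recall(grounded, reference))                   # 1.0
print(k_precision(grounded, knowledge))              # 1.0
print(recall(ungrounded, reference))                 # 1.0
print(round(k_precision(ungrounded, knowledge), 3))  # 0.333
```

Both responses contain the reference answer, so recall is 1.0 for each, but only a third of the second response's tokens appear in the knowledge snippet. This is why the two dimensions are measured separately: a response can be correct without being grounded in the provided passage.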
In conclusion, this study presents a comprehensive assessment of instruction-following models for QA tasks, considering both their strengths and their limitations. The research team has released their code and data in a GitHub repository to support further work in this area.