The Use of Large-Scale Language Models in Evaluating Natural Language Processing
Human evaluation has traditionally been the standard way to assess the performance of natural language processing models and algorithms. However, this method has its limitations: human ratings vary between evaluators and may not be reproducible. To address this issue, researchers from National Taiwan University have explored the use of large-scale language models as a new evaluation method.
Open-Ended Story Generation
In the field of open-ended story generation, the researchers aimed to compare the quality of stories written by humans with stories produced by a generative model. They used large-scale language models to evaluate these stories and compared the results to human evaluations. The evaluation criteria were grammatical accuracy, consistency, likability, and relevance.
English teachers, who served as the human evaluators, rated human-written stories higher than those generated by the generative model, indicating that they could distinguish between the two. The large-scale language models, however, showed mixed results: some models preferred human-written stories, while others showed no clear preference.
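The rating procedure described above can be sketched in a few lines of Python. This is only a minimal illustration, not the researchers' actual setup: the prompt wording, the 1 to 5 scale, and the names `build_prompt`, `parse_rating`, `mean_rating`, and `stub_llm` are all assumptions made for this example, and the model call is replaced by a stub so the code runs without any external service.

```python
import re
from statistics import mean
from typing import Callable, Iterable, Optional


def build_prompt(story: str, criterion: str) -> str:
    """Build a rating prompt for one story and one criterion.

    The wording and the 1-5 scale are assumptions for illustration.
    """
    return (
        f"Please rate the following story on {criterion}, "
        "using a scale from 1 (worst) to 5 (best).\n\n"
        f"Story:\n{story}\n\n"
        "Answer with a single number."
    )


def parse_rating(reply: str) -> Optional[int]:
    """Return the first standalone digit from 1 to 5 in the reply, or None."""
    match = re.search(r"\b([1-5])\b", reply)
    return int(match.group(1)) if match else None


def mean_rating(
    stories: Iterable[str],
    criterion: str,
    llm: Callable[[str], str],
) -> Optional[float]:
    """Average the parsed ratings over all stories; skip unparseable replies."""
    ratings = []
    for story in stories:
        rating = parse_rating(llm(build_prompt(story, criterion)))
        if rating is not None:
            ratings.append(rating)
    return mean(ratings) if ratings else None


def stub_llm(prompt: str) -> str:
    """Hypothetical stand-in for a real language model call."""
    return "Rating: 3"


if __name__ == "__main__":
    stories = ["Once upon a time ...", "The ship left the harbor ..."]
    print(mean_rating(stories, "relevance", stub_llm))
```

Running the same function once on human-written stories and once on model-generated stories, for each criterion, gives the per-group averages that can then be set against the teachers' ratings. Passing the model in as a plain callable keeps the sketch independent of any particular provider.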
Adversarial Attacks
The researchers also studied the ability of AI models to classify sentences in the presence of adversarial attacks. They used a large-scale language model and human evaluations to assess the impact of these attacks on sentence classification. English teachers rated the fluency and preservation of meaning lower for sentences produced by adversarial attacks than for the original sentences. The large-scale language model showed similar results to the human evaluations.
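One simple way to check whether a language model and human evaluators "show similar results" is to compare the direction of the rating change between original and attacked sentences. The sketch below assumes numeric ratings on a shared scale; the function names `mean_drop` and `agree_on_direction` are hypothetical, and the numbers are invented placeholders for illustration, not results from the study.

```python
from statistics import mean
from typing import Sequence


def mean_drop(original: Sequence[float], attacked: Sequence[float]) -> float:
    """Mean rating of original sentences minus mean rating of attacked ones.

    A positive value means the attacked sentences were rated lower.
    """
    return mean(original) - mean(attacked)


def agree_on_direction(human_drop: float, llm_drop: float) -> bool:
    """True if both evaluators report a drop, or both report no drop."""
    return (human_drop > 0) == (llm_drop > 0)


if __name__ == "__main__":
    # Placeholder fluency ratings, invented for illustration only.
    human_original, human_attacked = [5, 4, 5], [2, 3, 2]
    llm_original, llm_attacked = [5, 5, 4], [3, 3, 4]

    human = mean_drop(human_original, human_attacked)
    llm = mean_drop(llm_original, llm_attacked)
    print(agree_on_direction(human, llm))
```

The same comparison would be repeated for each criterion, such as fluency and preservation of meaning. Agreement on direction is a deliberately coarse check; it says nothing about whether the two evaluators agree on the size of the drop.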
Advantages and Limitations
The use of large-scale language models in evaluation offers several advantages, including reproducibility, independence, cost efficiency, speed, and reduced exposure of human evaluators to objectionable content. However, these models can also misinterpret facts and may not effectively assess tasks involving emotions. A combination of human evaluations and large-scale language models is likely to provide the most accurate and comprehensive evaluation.
Overall, the research conducted at National Taiwan University highlights the potential of large-scale language models in evaluating natural language processing tasks and provides insights into their strengths and weaknesses.