The Skills of Large Language Models (LLMs) in Program Synthesis and Understanding
Large language models (LLMs) have demonstrated strong representation-learning abilities on program synthesis and understanding tasks. These models have gained popularity because their architecture is conceptually simple and technically lightweight, a single model can serve multiple tasks, which saves training time and cost, and larger models tend to perform better across a wide range of tasks.
However, several challenges remain. Although the self-attention circuit itself is simple, an attention-masking scheme still has to be chosen, and that choice determines whether the model learns bidirectional or unidirectional representations. Synthesis and understanding tasks have not yet been fully unified, so separate models are trained for different tasks, which is expensive. It is also difficult to determine the best model architecture, learning objective, and data distribution.
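To make the masking choice concrete, the sketch below (in Python with NumPy; the function names are illustrative and not from the paper) contrasts the two regimes: a causal mask restricts each position to earlier positions and yields unidirectional, generation-friendly representations, while an all-ones mask gives every position full context and yields bidirectional, understanding-friendly representations.

```python
# Illustrative only: causal (unidirectional) vs. bidirectional attention masks.
import numpy as np

def causal_mask(seq_len: int) -> np.ndarray:
    # mask[i, j] == 1 iff token i may attend to token j; the lower-triangular
    # shape means each token sees only itself and earlier tokens.
    return np.tril(np.ones((seq_len, seq_len), dtype=np.int8))

def bidirectional_mask(seq_len: int) -> np.ndarray:
    # Full attention: every token may attend to every other token.
    return np.ones((seq_len, seq_len), dtype=np.int8)

print(causal_mask(4))
print(bidirectional_mask(4))
```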
To overcome these challenges, the researchers aim to distill a standardized recipe for training a single, universally applicable model, and they plan to release the training code and refined models as open source. The study focuses on the prefix-LM architecture, infill sampling, the choice of learning objective, and mixtures of natural-language and programming-language data.
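The prefix-LM sits between the two masking regimes shown above: the prompt (prefix) is attended to bidirectionally while the continuation is decoded causally. The sketch below is a minimal illustration of that mask, assuming the first prefix_len tokens form the prefix; it is not the authors' implementation.

```python
# Illustrative prefix-LM attention mask: bidirectional over the prefix,
# causal over the remainder of the sequence.
import numpy as np

def prefix_lm_mask(seq_len: int, prefix_len: int) -> np.ndarray:
    mask = np.tril(np.ones((seq_len, seq_len), dtype=np.int8))  # causal base
    mask[:prefix_len, :prefix_len] = 1  # full attention within the prefix
    return mask

# Tokens 0-2 form the prefix; tokens 3-5 are decoded left to right.
print(prefix_lm_mask(6, 3))
```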
The researchers propose a simple blend of uncorrupted sequences and span-corrupted sequences, which lets one auto-regressive model support both left-to-right and fill-in-the-middle sampling. The reference implementation for LLM training will be released as open-source software, along with CodeGen2.5, a small yet capable model. Despite the trend toward ever-larger LLMs, the study demonstrates that a modestly sized model can achieve impressive results with proper training.
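A minimal sketch of that blend is given below, assuming placeholder sentinel tokens (<mask_1>, <sep>, <eom>) and an illustrative mixing probability; the exact tokens, span-sampling scheme, and mixing ratio used in the paper may differ. Each training sequence is either left uncorrupted (plain next-token prediction) or has a random span cut out, replaced by a sentinel, and appended after a separator, so the same left-to-right model also learns to fill in the middle.

```python
# Hypothetical sketch of mixing uncorrupted and span-corrupted sequences.
import random
from typing import List

MASK, SEP, EOM = "<mask_1>", "<sep>", "<eom>"  # placeholder sentinels

def make_training_sequence(tokens: List[str], p_causal: float = 0.5) -> List[str]:
    if random.random() < p_causal or len(tokens) < 3:
        return tokens  # uncorrupted: ordinary next-token prediction
    # Span corruption: cut out a random contiguous middle span.
    start = random.randrange(1, len(tokens) - 1)
    length = random.randrange(1, len(tokens) - start)
    prefix, middle, suffix = (tokens[:start],
                              tokens[start:start + length],
                              tokens[start + length:])
    # Layout: prefix <mask_1> suffix <sep> <mask_1> middle <eom>
    return prefix + [MASK] + suffix + [SEP, MASK] + middle + [EOM]

random.seed(0)
code = "def add ( a , b ) : return a + b".split()
print(" ".join(make_training_sequence(code)))
```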
The key contributions include incorporating the latest improvements into CodeGen's LLM training recipe and releasing CodeGen2.5 with 7B parameters, which is competitive on the HumanEval benchmark despite being smaller than previous code-generation models. The model offers robust infill sampling and is optimized for fast sampling. CodeGen2.5 is a family of auto-regressive (AR) language models for code generation, trained on a wide range of programming languages.
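To illustrate how infill sampling is used at inference time, the sketch below builds an infill prompt from a prefix and a suffix and splices the generated middle back in. The sentinel strings are the same placeholders as above, and the pretend model output stands in for a real decoder call; neither is taken verbatim from the released tokenizer.

```python
# Hypothetical infill round trip: build the prompt, then splice the result.
MASK, SEP, EOM = "<mask_1>", "<sep>", "<eom>"  # placeholder sentinels

def build_infill_prompt(prefix: str, suffix: str) -> str:
    # The model sees prefix and suffix, then is asked to produce the middle.
    return prefix + MASK + suffix + SEP + MASK

def splice_infill(prefix: str, suffix: str, generated: str) -> str:
    # Keep only the text produced before the end-of-mask marker.
    middle = generated.split(EOM, 1)[0]
    return prefix + middle + suffix

prefix = "def hello(name):\n    "
suffix = "\n    return greeting\n"
prompt = build_infill_prompt(prefix, suffix)
# Pretend the model returned this continuation for the prompt:
generated = 'greeting = f"Hello, {name}!"' + EOM
print(splice_infill(prefix, suffix, generated))
```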
The family is released in several variants, including CodeGen2.5-7B-multi, CodeGen2.5-7B-mono, and CodeGen2.5-7B-instruct. Even though complete unification of synthesis and understanding has not been achieved, these models aim to provide practitioners with useful tools and insights.
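For completeness, here is a hedged sketch of loading one of these variants with the Hugging Face transformers library and sampling left to right; the hub model id, the trust_remote_code flag, and the sampling settings are assumptions and may not match the actual release.

```python
# Assumed usage with Hugging Face transformers; the model id is a guess.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Salesforce/codegen25-7b-mono"  # assumed hub id for the mono variant
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tokenizer("def fibonacci(n):", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=True, temperature=0.2)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```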
In summary, this study explores the skills of large language models in program synthesis and understanding tasks. It provides insights into model architecture, infill sampling, learning-objective selection, and data distribution. The aim is to establish a standardized recipe for training a universally applicable model and to release refined models and open-source code to the public.