Microsoft researchers have introduced a new method that generates diverse, high-quality instruction data from open-source code. The resulting dataset, named CodeOcean, is intended to improve the effectiveness of instruction tuning and strengthen the generalization ability of fine-tuned models. The work targets common challenges in instruction data generation, such as duplicated data and insufficient control over data quality.
CodeOcean: Augmenting Performance with Instruction Tuning
CodeOcean, the dataset developed by the researchers, contains 20,000 instruction instances spanning four code-related tasks: Code Summarization, Code Generation, Code Translation, and Code Repair. Alongside CodeOcean, the team introduced WaveCoder, a Code LLM fine-tuned with enhanced instruction tuning and designed to improve the generalization ability of Code LLMs. The WaveCoder models outperform other models on a range of benchmarks, demonstrating their effectiveness in code generation, code repair, and code summarization tasks.
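To make the dataset concrete, the sketch below shows what a single instruction instance for one of the four tasks might look like and how it could be rendered into a training prompt. The field names (`task`, `instruction`, `input`, `output`) and the prompt template are illustrative assumptions, not the published CodeOcean schema.

```python
# Hypothetical sketch: field names and prompt template are assumptions,
# not the published CodeOcean schema.

# One instruction instance, here for the Code Repair task.
example_instance = {
    "task": "Code Repair",
    "instruction": "Fix the bug in the following function.",
    "input": "def add(a, b):\n    return a - b",
    "output": "def add(a, b):\n    return a + b",
}

def format_prompt(instance: dict) -> str:
    """Render an instance into a single training prompt for instruction tuning."""
    return (
        f"### Instruction:\n{instance['instruction']}\n\n"
        f"### Input:\n{instance['input']}\n\n"
        f"### Response:\n{instance['output']}"
    )

if __name__ == "__main__":
    print(format_prompt(example_instance))
```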
Enhancing Instruction Data Generation with a Generator-Discriminator Framework
Recent advances in Large Language Models (LLMs) have demonstrated the potential of instruction tuning to improve model capabilities across a variety of tasks. The proposed LLM Generator-Discriminator framework proves effective at producing realistic, diverse instruction data and at improving performance on code-related tasks. It leverages source code to control data quality during generation and adjusts the distribution of the raw code to increase data diversity, yielding more realistic instruction data.
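The framework can be pictured as a generate-then-filter loop over raw code snippets: a generator LLM turns each snippet into an instruction/response pair, and a discriminator LLM decides whether the pair is good enough to keep. The sketch below is a minimal illustration of that idea under stated assumptions; `call_llm` is a hypothetical stand-in for whatever LLM backend is used, and the prompts and acceptance criterion are assumptions rather than the paper's exact setup.

```python
# Minimal generate-then-filter sketch of an LLM Generator-Discriminator pipeline.
# `call_llm` is a hypothetical stand-in for an actual LLM API call.
from typing import Callable

def call_llm(prompt: str) -> str:
    raise NotImplementedError("Plug in your LLM backend here.")

def generate_instance(raw_code: str, task: str, llm: Callable[[str], str]) -> dict:
    """Generator: turn a raw code snippet into an instruction/response pair."""
    instruction = llm(f"Write a {task} instruction grounded in this code:\n{raw_code}")
    response = llm(f"{instruction}\n\nCode:\n{raw_code}")
    return {"task": task, "instruction": instruction, "input": raw_code, "output": response}

def discriminate(instance: dict, llm: Callable[[str], str]) -> bool:
    """Discriminator: ask the LLM to judge whether the pair is correct and useful."""
    verdict = llm(
        "Answer YES or NO: is this instruction/response pair correct and non-trivial?\n"
        f"Instruction: {instance['instruction']}\nResponse: {instance['output']}"
    )
    return verdict.strip().upper().startswith("YES")

def build_dataset(raw_snippets: list[str], task: str, llm: Callable[[str], str]) -> list[dict]:
    """Keep only the generated instances that the discriminator accepts."""
    kept = []
    for snippet in raw_snippets:
        candidate = generate_instance(snippet, task, llm)
        if discriminate(candidate, llm):
            kept.append(candidate)
    return kept
```

Filtering with a discriminator, rather than accepting every generated pair, is what gives the pipeline control over data quality; diversity comes from how the raw snippets are sampled before this loop runs.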
This research introduces CodeOcean as a practical way to refine and enhance instruction data for better performance on code-related tasks. Pairing CodeOcean with the WaveCoder models shows the potential to further improve single-task performance and generalization across diverse scenarios.