The development of large language models (LLMs) such as OpenAI's ChatGPT and GPT-4 has strongly influenced artificial intelligence across fields such as natural language processing, computer vision, and bioinformatics. However, the details of ChatGPT's training procedure and the architectures of its variants remain undisclosed. LLaMA, an open-source foundational language model, is believed to perform poorly in applications that demand extensive domain knowledge, likely because domain-specific data was scarce during its pre-training stage.
Numerous studies have explored ways to adapt open-source LLMs for specialized purposes. Projects such as Alpaca and Vicuna, for example, enhance a model's interaction capabilities by fine-tuning it on automatically generated instruction-following examples.
A recent project by Shanghai Jiao Tong University and Shanghai AI Laboratory took a different approach, injecting domain knowledge into a single pre-trained LLaMA model to tailor it to the medical domain. The team created PMC-LLaMA, a publicly available language model, by further training LLaMA-7B on 4.8 million medical academic papers. They believe a foundational language model with a medical focus would greatly benefit medical discussions and consultations.
To create PMC-LLaMA, the team drew on the S2ORC dataset, which consists of 81.1 million English academic papers. Filtering the papers by their PubMed Central (PMC) id left approximately 4.9 million papers that are highly relevant to medical knowledge. LLaMA-7B was then fine-tuned on these PMC papers with an autoregressive generation objective, using the bf16 (bfloat16) data format and the FSDP (Fully Sharded Data Parallel) acceleration approach.
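The autoregressive generation objective mentioned above can be illustrated with a toy example: the loss is the average negative log-probability the model assigns to each token given the tokens before it. The bigram lookup table below is a stand-in for LLaMA-7B, purely for illustration; its vocabulary and probabilities are invented for this sketch.

```python
import math

def next_token_probs(prev_token):
    # Hypothetical bigram distribution; a real LLM conditions on the
    # entire prefix with a transformer rather than one previous token.
    table = {
        "the": {"patient": 0.6, "dose": 0.4},
        "patient": {"was": 1.0},
    }
    return table[prev_token]

def autoregressive_nll(tokens):
    """Average negative log-likelihood of tokens[1:], each token
    conditioned on its predecessor."""
    total = 0.0
    for prev, cur in zip(tokens, tokens[1:]):
        total += -math.log(next_token_probs(prev)[cur])
    return total / (len(tokens) - 1)

loss = autoregressive_nll(["the", "patient", "was"])
# loss = (-log 0.6 - log 1.0) / 2, roughly 0.2554
```

Fine-tuning minimizes exactly this quantity over the medical corpus, so the model's probability mass shifts toward medically plausible continuations.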
The team evaluated three fine-tuning regimes for PMC-LLaMA on associated medical QA datasets: full fine-tuning, parameter-efficient fine-tuning, and data-efficient fine-tuning. The experiments showed that, with suitably adapted instructions, PMC-LLaMA outperforms LLaMA and other LLaMA-based instruction-tuned models in the medical domain.
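To make the parameter-efficient regime concrete, here is a minimal sketch using low-rank adaptation (LoRA) as one common such technique; the article does not specify which method the team actually used, and the dimensions and scaling factor below are illustrative assumptions. A frozen weight matrix W is augmented with a trainable low-rank update B @ A, so only r * (d_in + d_out) parameters are trained instead of d_in * d_out.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r = 8, 8, 2          # toy sizes; real layers are far larger

W = rng.standard_normal((d_out, d_in))  # frozen pre-trained weight
A = rng.standard_normal((r, d_in))      # trainable rank-r factor
B = np.zeros((d_out, r))                # trainable, initialized to zero
alpha = 4.0                             # adapter scaling factor

def adapted_forward(x):
    """Forward pass with the low-rank adapter added to the frozen W."""
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.standard_normal(d_in)
# Before any training, B is all zeros, so the adapter is a no-op:
assert np.allclose(adapted_forward(x), W @ x)
```

Here only A.size + B.size = 32 parameters would be updated during fine-tuning, versus 64 for the full matrix; at realistic model scales the savings are several orders of magnitude.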
One limitation is that PMC-LLaMA has been trained for only five epochs, so the model has not yet seen every token in the 4.8 million papers. In the future, the team plans to gradually train PMC-LLaMA models with more parameters, continue training PMC-LLaMA, and keep the base model on the Hugging Face page up to date.
For more information, refer to the research paper and code provided by the team.
Author: Tanushree Shenwai – Consulting Intern at MarktechPost