Large language models such as GPT-4 and Claude 2 perform impressively across tasks ranging from marketing to medical analysis. Trained on vast amounts of text, they produce accurate-looking predictions and adapt well to different contexts. It remains difficult, however, to tell whether that performance reflects deep understanding or superficial memorization.
Researchers from MIT conducted two studies to evaluate the true reasoning capabilities of these models. In the first, an ensemble of twelve LLMs produced probabilistic predictions for binary questions, and the aggregated LLM crowd matched the accuracy of human forecasters in a forecasting tournament. The second study asked whether individual models, specifically GPT-4 and Claude 2, make better predictions when given human cognitive input.
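As a rough illustration of the first study's crowd setup, the sketch below aggregates per-model probability estimates for a single binary question into one crowd forecast and scores it with the Brier score. The median aggregation rule, the example probabilities, and the function names are illustrative assumptions, not details taken from the paper.

```python
from statistics import median

def aggregate_crowd_forecast(probabilities: list[float]) -> float:
    """Combine individual model probabilities for a binary question
    into a single crowd forecast (here, via the median; an assumption)."""
    return median(probabilities)

def brier_score(forecast: float, outcome: int) -> float:
    """Brier score for a binary question: squared error between the
    forecast probability and the realized outcome (0 or 1).
    Lower is better; an uninformative 0.5 forecast scores 0.25."""
    return (forecast - outcome) ** 2

# Hypothetical forecasts from twelve models for a question that resolved "yes" (1).
model_probs = [0.62, 0.55, 0.70, 0.48, 0.66, 0.59,
               0.73, 0.51, 0.64, 0.58, 0.69, 0.61]
crowd = aggregate_crowd_forecast(model_probs)
print(f"crowd forecast = {crowd:.2f}, Brier score = {brier_score(crowd, 1):.3f}")
```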
Comparing the first study's LLM predictions with human forecasts also revealed a systematic bias toward positive outcomes. The second study's analysis of GPT-4 and Claude 2 showed that exposure to human forecasts improved the models' accuracy. Taken together, the studies demonstrate that LLMs working as a crowd can rival human-based methods in probabilistic forecasting.
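The second study's human-exposure step can be pictured as a revision prompt that shows a model the human crowd's median estimate alongside its own earlier answer. The sketch below only builds such a prompt; the wording and numbers are assumptions rather than the materials used in the study, and no real model API is called.

```python
def revision_prompt(question: str, own_estimate: float, human_median: float) -> str:
    """Build a prompt asking a model to revise its probability estimate
    after seeing the median forecast from a human crowd."""
    return (
        f"Question: {question}\n"
        f"Your earlier probability estimate: {own_estimate:.2f}\n"
        f"Median estimate from a crowd of human forecasters: {human_median:.2f}\n"
        "Taking this information into account, reply with a single revised "
        "probability between 0 and 1 that the event will occur."
    )

# Example with made-up numbers; send the resulting text through whatever
# client you use to reach the model (no real API call is shown here).
print(revision_prompt(
    "Will the measure pass before 1 July?",
    own_estimate=0.40,
    human_median=0.65,
))
```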
These findings have practical implications for real-world applications: decision-makers in fields such as politics, economics, and technology could draw on LLM crowds for reliable forecasts. Aggregating several models smooths out individual errors and yields more dependable predictions, an approach that could bring LLM forecasting into broader societal use across industries and sectors.