Humans vs Large Language Models: Judgmental Forecasting in an Era of Advanced AI

Read original: arXiv:2312.06941 - Published 5/20/2024 by MAhdi Abolghasemi, Odkhishig Ganbold, Kristian Rotaru

💬

Overview

This study investigates the forecasting accuracy of human experts versus Large Language Models (LLMs) in the retail sector, particularly during standard and promotional sales periods.
The researchers conducted a controlled experiment involving 123 human forecasters and five LLMs, including ChatGPT4, ChatGPT3.5, Bard, Bing, and Llama2.
They evaluated forecasting precision through Mean Absolute Percentage Error and analyzed the effects of factors like the supporting statistical model, product promotions, and external impacts.

Plain English Explanation

The study compares the forecasting abilities of human experts and large language models (LLMs) in the retail industry. The researchers wanted to see how well these two groups could predict sales, especially during regular and promotional sales periods.

They set up a controlled experiment with 123 human forecasters and five LLMs, including well-known models like ChatGPT4, ChatGPT3.5, Bard, Bing, and Llama2. The researchers measured the accuracy of the forecasts using a metric called Mean Absolute Percentage Error.

The study looked at how factors like the statistical model used (basic or advanced), whether the product was on sale, and external influences affected the performance of the human and LLM forecasters. The key finding is that LLMs don't consistently outperform human experts in forecasting accuracy. Additionally, advanced statistical models don't always improve the performance of either humans or LLMs.

Both the human and LLM forecasters had more trouble making accurate predictions during promotional periods and when there were positive external factors influencing the sales. This suggests that integrating LLMs into practical forecasting processes should be done carefully, as they may not always outperform human experts.

Technical Explanation

The researchers evaluated forecasting precision through Mean Absolute Percentage Error and analyzed the effect of the following factors on forecasters' performance: the supporting statistical model (baseline and advanced), whether the product was on promotion, and the nature of external impact.

The findings indicate that LLMs do not consistently outperform humans in forecasting accuracy, and that advanced statistical forecasting models do not uniformly enhance the performance of either human forecasters or LLMs. Both human and LLM forecasters exhibited increased forecasting errors, particularly during promotional periods and under the influence of positive external impacts.

Critical Analysis

The study's findings call for careful consideration when integrating LLMs into practical forecasting processes. While the researchers acknowledge the potential of LLMs in certain forecasting tasks, the results suggest that these models may not always outperform human experts, especially in complex retail environments with factors like promotions and external influences.

One potential limitation of the study is that it only examined a limited set of LLMs, and the performance of these models may vary depending on their specific capabilities and training. Additionally, the study did not explore the potential benefits of using a combination of human and LLM forecasts, which could potentially enhance overall accuracy.

Further research may be needed to better understand the strengths and limitations of LLMs in more diverse forecasting scenarios, as well as to explore ways to leverage LLMs as research assistants to support and augment human expertise.

Conclusion

This study provides important insights into the relative forecasting capabilities of human experts and Large Language Models in the retail sector. While LLMs show promise, the findings suggest that they do not consistently outperform human forecasters, particularly in complex scenarios with factors like promotions and external influences.

The implications of this research highlight the need for a nuanced approach when integrating LLMs into practical forecasting processes. Rather than viewing these models as a replacement for human expertise, the findings suggest that a collaborative approach, where LLMs and human experts work together, may be the most effective way to enhance forecasting accuracy and decision-making in the retail industry.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Follow @aimodelsfyi on 𝕏 →

Related Papers

💬

Humans vs Large Language Models: Judgmental Forecasting in an Era of Advanced AI

MAhdi Abolghasemi, Odkhishig Ganbold, Kristian Rotaru

This study investigates the forecasting accuracy of human experts versus Large Language Models (LLMs) in the retail sector, particularly during standard and promotional sales periods. Utilizing a controlled experimental setup with 123 human forecasters and five LLMs, including ChatGPT4, ChatGPT3.5, Bard, Bing, and Llama2, we evaluated forecasting precision through Mean Absolute Percentage Error. Our analysis centered on the effect of the following factors on forecasters performance: the supporting statistical model (baseline and advanced), whether the product was on promotion, and the nature of external impact. The findings indicate that LLMs do not consistently outperform humans in forecasting accuracy and that advanced statistical forecasting models do not uniformly enhance the performance of either human forecasters or LLMs. Both human and LLM forecasters exhibited increased forecasting errors, particularly during promotional periods and under the influence of positive external impacts. Our findings call for careful consideration when integrating LLMs into practical forecasting processes.

5/20/2024

Can Language Models Use Forecasting Strategies?

Sarah Pratt, Seth Blumberg, Pietro Kreitlon Carolino, Meredith Ringel Morris

Advances in deep learning systems have allowed large models to match or surpass human accuracy on a number of skills such as image classification, basic programming, and standardized test taking. As the performance of the most capable models begin to saturate on tasks where humans already achieve high accuracy, it becomes necessary to benchmark models on increasingly complex abilities. One such task is forecasting the future outcome of events. In this work we describe experiments using a novel dataset of real world events and associated human predictions, an evaluation metric to measure forecasting ability, and the accuracy of a number of different LLM based forecasting designs on the provided dataset. Additionally, we analyze the performance of the LLM forecasters against human predictions and find that models still struggle to make accurate predictions about the future. Our follow-up experiments indicate this is likely due to models' tendency to guess that most events are unlikely to occur (which tends to be true for many prediction datasets, but does not reflect actual forecasting abilities). We reflect on next steps for developing a systematic and reliable approach to studying LLM forecasting.

6/10/2024

🎯

AI-Augmented Predictions: LLM Assistants Improve Human Forecasting Accuracy

Philipp Schoenegger, Peter S. Park, Ezra Karger, Sean Trott, Philip E. Tetlock

Large language models (LLMs) match and sometimes exceeding human performance in many domains. This study explores the potential of LLMs to augment human judgement in a forecasting task. We evaluate the effect on human forecasters of two LLM assistants: one designed to provide high-quality (superforecasting) advice, and the other designed to be overconfident and base-rate neglecting, thus providing noisy forecasting advice. We compare participants using these assistants to a control group that received a less advanced model that did not provide numerical predictions or engaged in explicit discussion of predictions. Participants (N = 991) answered a set of six forecasting questions and had the option to consult their assigned LLM assistant throughout. Our preregistered analyses show that interacting with each of our frontier LLM assistants significantly enhances prediction accuracy by between 24 percent and 28 percent compared to the control group. Exploratory analyses showed a pronounced outlier effect in one forecasting item, without which we find that the superforecasting assistant increased accuracy by 41 percent, compared with 29 percent for the noisy assistant. We further examine whether LLM forecasting augmentation disproportionately benefits less skilled forecasters, degrades the wisdom-of-the-crowd by reducing prediction diversity, or varies in effectiveness with question difficulty. Our data do not consistently support these hypotheses. Our results suggest that access to a frontier LLM assistant, even a noisy one, can be a helpful decision aid in cognitively demanding tasks compared to a less powerful model that does not provide specific forecasting advice. However, the effects of outliers suggest that further research into the robustness of this pattern is needed.

8/23/2024

Wisdom of the Silicon Crowd: LLM Ensemble Prediction Capabilities Rival Human Crowd Accuracy

Philipp Schoenegger, Indre Tuminauskaite, Peter S. Park, Rafael Valdece Sousa Bastos, Philip E. Tetlock

Human forecasting accuracy in practice relies on the 'wisdom of the crowd' effect, in which predictions about future events are significantly improved by aggregating across a crowd of individual forecasters. Past work on the forecasting ability of large language models (LLMs) suggests that frontier LLMs, as individual forecasters, underperform compared to the gold standard of a human-crowd forecasting-tournament aggregate. In Study 1, we expand this research by using an LLM ensemble approach consisting of a crowd of 12 LLMs. We compare the aggregated LLM predictions on 31 binary questions to those of a crowd of 925 human forecasters from a three-month forecasting tournament. Our preregistered main analysis shows that the LLM crowd outperforms a simple no-information benchmark, and is not statistically different from the human crowd. We also observe a set of human-like biases in machine responses, such as an acquiescence effect and a tendency to favour round numbers. In Study 2, we test whether LLM predictions (of GPT-4 and Claude 2) can be improved by drawing on human cognitive output. We find that both models' forecasting accuracy benefits from exposure to the median human prediction as information, improving accuracy by between 17% and 28%, though this leads to less accurate predictions than simply averaging human and machine forecasts. Our results suggest that LLMs can achieve forecasting accuracy rivaling that of the human crowd: via the simple, practically applicable method of forecast aggregation.

6/18/2024