Researchers from the University of Chicago have shown that large language models (LLMs) can perform financial statement analysis with an accuracy that rivals, and in some cases exceeds, that of professional analysts. The findings, presented in a working paper entitled “Financial Statement Analysis with Large Language Models,” could have significant implications for the future of financial analysis and decision-making.
The researchers tested the performance of GPT-4, a cutting-edge LLM developed by OpenAI, on the task of analyzing corporate financial statements to predict future earnings growth. Remarkably, GPT-4 was able to outperform human analysts even when provided only with standardized, anonymized balance sheets and income statements, without any textual context.
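For readers curious how such a test might be wired up, here is a minimal sketch, assuming the openai Python client (v1+). The `anonymize` helper, the sample statement, and the prompt wording are illustrative stand-ins, not the authors' actual protocol.

```python
# Minimal sketch (not the authors' code): querying GPT-4 with a
# standardized, anonymized financial statement and asking for a
# directional earnings forecast.
import re
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def anonymize(statement: str, firm_name: str) -> str:
    # Illustrative redaction: remove the firm name and absolute years
    # so the model cannot recognize the company from training memory.
    statement = statement.replace(firm_name, "COMPANY")
    return re.sub(r"\b(19|20)\d{2}\b", "YEAR", statement)

raw_statement = """Acme Corp income statement, 2023:
Revenue 1,200; COGS 700; SG&A 300; Net income 150
Acme Corp balance sheet, 2023:
Total assets 2,400; Total liabilities 1,500; Equity 900"""

prompt = (
    "Below are a firm's standardized balance sheet and income statement, "
    "with names and dates removed. Will earnings increase or decrease in "
    "the next period? Answer 'increase' or 'decrease'.\n\n"
    + anonymize(raw_statement, "Acme Corp")
)

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```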
“We find that the LLM's prediction accuracy is on par with the performance of a narrowly trained state-of-the-art ML model,” the authors write. “The LLM's prediction does not stem from its training memory. Instead, we find that the LLM generates useful narrative insights about a firm's future performance.”
Chain-of-thought prompts emulate the thinking of human analysts
A key innovation was the use of “chain of thought” prompts, which guided GPT-4 to emulate the analytical process of a financial analyst: identifying trends, calculating ratios, and synthesizing the information into a forecast. With this approach, GPT-4 achieved 60% accuracy in predicting the direction of future earnings, notably higher than the 53–57% range of human analyst forecasts.
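As an illustration of the idea, a chain-of-thought prompt might spell out those analyst steps explicitly. The wording below is a sketch of the technique's spirit, not the exact prompt used in the study.

```python
# Illustrative chain-of-thought prompt mirroring the steps the article
# describes: identify trends, calculate ratios, then form a forecast.
COT_PROMPT = """You are a financial analyst. Using only the anonymized
statements below, work step by step:
1. Identify notable trends in the line items across periods.
2. Compute key financial ratios (e.g., operating margin, current ratio,
   asset turnover), showing the formula for each.
3. Interpret what these trends and ratios imply about the firm.
4. Conclude with a prediction: will earnings increase or decrease next
   period? Give a brief rationale and a confidence level.

{statements}
"""

# Usage: fill in the anonymized statements and send the result through
# the same chat.completions call shown in the earlier sketch.
prompt = COT_PROMPT.format(statements="<anonymized statements here>")
```

Prompting the model to verbalize intermediate steps, rather than jump straight to an answer, is what the article credits for lifting accuracy to 60%.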
“Taken together, our results suggest that LLMs could play a central role in decision-making,” the researchers conclude. They suggest that the LLM's advantage likely stems from its vast knowledge base and its ability to recognize patterns and business concepts, allowing it to draw intuitive conclusions even when information is incomplete.
Despite challenges, LLMs are poised to revolutionize financial analysis
The results are all the more remarkable because numerical analysis has traditionally been difficult for language models. “One of the most challenging areas for a language model is the numerical domain, where the model must perform calculations, make human-like interpretations, and make complex judgments,” said Alex Kim, one of the study's co-authors. “Although LLMs are effective at textual tasks, their understanding of numbers often comes from narrative context, and they lack deep numerical reasoning or the flexibility of a human mind.”
Some experts caution that the artificial neural network (ANN) model used as a benchmark in the study may not represent the current state of the art in quantitative finance. “This ANN benchmark is far from state-of-the-art,” commented one practitioner on the Hacker News forum. “People didn't stop working on it in 1989 – they realized they could make a lot of money from it and run it privately.”
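For context, benchmarks of this kind in the earnings-prediction literature are typically small feedforward networks trained on a handful of financial ratios. Below is a minimal sketch of such a model using scikit-learn and placeholder data; the paper's exact architecture and feature set are not reproduced here, so treat every choice in it as an assumption.

```python
# Illustrative sketch of a small feedforward ANN benchmark for
# directional earnings prediction (placeholder data and architecture,
# not the paper's actual model).
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 12))    # placeholder: 12 financial ratios per firm-year
y = rng.integers(0, 2, size=1000)  # placeholder: earnings up (1) or down (0)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

ann = MLPClassifier(hidden_layer_sizes=(32, 16), max_iter=500, random_state=0)
ann.fit(X_train, y_train)
print(f"Directional accuracy: {ann.score(X_test, y_test):.2f}")
```

The practitioners' point is that architectures of roughly this vintage are a modest bar; proprietary quantitative models in industry have moved well beyond them.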
Nevertheless, the ability of a general-purpose language model to match the performance of specialized ML models and outperform human experts points to the disruptive potential of LLMs in finance. The authors have also created an interactive web application to showcase GPT-4's capabilities to curious readers, but note that its accuracy should be independently verified.
As AI continues its rapid advance, the role of the financial analyst may be the next to change. While human expertise and judgment are unlikely to be completely replaced any time soon, powerful tools like GPT-4 could significantly augment and streamline the work of analysts, potentially reshaping the field of financial statement analysis in the years to come.