
AI isn't very good at history, a new paper finds

AI may excel at certain tasks like coding or generating a podcast. But it struggles to pass a high-level history exam, a new paper has found.

A team of researchers has created a new benchmark to test three top large language models (LLMs) – OpenAI's GPT-4, Meta's Llama and Google's Gemini – on historical questions. The Hist-LLM benchmark tests the accuracy of answers against the Seshat Global History Databank, a vast database of historical knowledge named after the ancient Egyptian goddess of wisdom.

The results, which were presented last month at the high-profile AI conference NeurIPS, were disappointing, according to researchers affiliated with the Complexity Science Hub (CSH), a research institute based in Austria. The best-performing LLM was GPT-4 Turbo, but it only achieved about 46% accuracy – not much better than random guessing.

“The main takeaway from this study is that while LLMs are impressive, they still lack the depth of understanding required for advanced history. They are great for basic facts, but when it comes to more nuanced, PhD-level historical inquiry, they are not yet up to the task,” said Maria del Rio-Chanona, one of the paper’s co-authors and an associate professor of computer science at University College London.

The researchers shared with TechCrunch sample historical questions that LLMs got wrong. For example, GPT-4 Turbo was asked whether scale armor was present in ancient Egypt during a specific time period. The LLM said yes, but the technology only appeared in Egypt 1,500 years later.

Why are LLMs bad at answering technical historical questions when they can be so good at answering very complicated questions about things like coding? Del Rio-Chanona told TechCrunch that it's likely because LLMs tend to extrapolate from historical data that is very prominent, and therefore find it difficult to retrieve more obscure historical knowledge.

For example, the researchers asked GPT-4 whether ancient Egypt had a professional standing army during a specific historical period. While the correct answer is “no,” the LLM incorrectly answered that it did. This is likely because there is a lot of public information about other ancient empires, like Persia, having standing armies.

“If you're told A and B 100 times, and C one time, and then you're asked a question about C, you might just remember A and B and try to extrapolate from that,” del Rio-Chanona said.

The researchers also identified other trends, including that the OpenAI and Llama models performed worse on questions about certain regions, such as sub-Saharan Africa, suggesting potential biases in their training data.

The results show that LLMs are still not a substitute for humans in certain domains, said Peter Turchin, who led the study and is a faculty member at CSH.

However, the researchers remain hopeful that LLMs can help historians in the future. They are working to refine their benchmark by including more data from underrepresented regions and adding more complex questions.

“Overall, while our results highlight areas where LLMs need improvement, they also underscore the potential of these models to aid historical research,” the paper says.
