
Stanford study: AI tools for legal research are prone to hallucinations

Large Language Models (LLMs) are increasingly used for tasks that require extensive information processing. Several firms have launched dedicated tools that use LLMs and data retrieval systems to support legal research.

However, a new study by researchers at Stanford University finds that, despite vendors' claims, these tools still produce a significant rate of hallucinations, or demonstrably false outputs.

The study, which the authors describe as the first “pre-registered empirical evaluation of AI-driven legal research tools,” tested products from major legal research providers and compared them to OpenAI's GPT-4 on over 200 manually crafted legal queries. The researchers found that while hallucinations decreased compared with general-purpose chatbots, the legal AI tools still hallucinated at an alarming rate.

The challenge of retrieval-augmented generation in law

Many legal AI tools use retrieval-augmented generation (RAG) to reduce the risk of hallucinations. Unlike plain LLMs, which rely solely on the knowledge acquired during training, RAG systems first retrieve relevant documents from a knowledge base and provide them to the model as context for its answer. RAG is widely regarded as the gold standard for companies seeking to reduce hallucinations in a range of fields.
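At a high level, a RAG loop looks something like the minimal sketch below. The toy corpus, the keyword-overlap scoring, and the generate() stub are assumptions made purely for illustration; they are not the retrieval logic or models used by any of the products in the study.

```python
# Minimal, illustrative RAG loop: retrieve documents, pack them into the
# prompt, then ask the model to answer from that context.

def score(query: str, document: str) -> float:
    """Toy relevance score: fraction of query words that also appear in the document."""
    query_words = set(query.lower().split())
    doc_words = set(document.lower().split())
    return len(query_words & doc_words) / max(len(query_words), 1)

def retrieve(query: str, corpus: list[str], k: int = 2) -> list[str]:
    """Return the k documents with the highest relevance score for the query."""
    return sorted(corpus, key=lambda doc: score(query, doc), reverse=True)[:k]

def build_prompt(query: str, context_docs: list[str]) -> str:
    """Pack the retrieved documents into the prompt so the model answers from them."""
    context = "\n\n".join(context_docs)
    return (
        "Answer the question using only the sources below.\n\n"
        f"Sources:\n{context}\n\n"
        f"Question: {query}"
    )

def generate(prompt: str) -> str:
    """Stand-in for a call to an actual LLM API."""
    return f"[model answer grounded in a {len(prompt)}-character prompt]"

corpus = [
    "Case A holds that a valid contract requires mutual assent.",
    "Statute B sets a two-year limitations period for tort claims.",
    "Case C addresses the admissibility of expert testimony.",
]
query = "What is the limitations period for tort claims?"
print(generate(build_prompt(query, retrieve(query, corpus))))
```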

However, the researchers point out that legal questions often do not have a single clear answer that can be retrieved from a set of documents. Deciding what to retrieve can be difficult, because the system may need to pull information from multiple sources spanning different time periods. In some cases, documents that definitively answer the query may not exist at all if the question is novel or legally indeterminate.

In addition, the researchers caution that hallucinations are not clearly defined in the context of legal research. In their study, they consider a model's response to be a hallucination if it is either incorrect or misgrounded, meaning the stated facts may be correct but the cited source does not actually support them in the context of the legal question at hand. “In other words, if a model makes a false statement or falsely claims that a source supports a statement, it’s a hallucination,” they write.
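The two-part criterion can be stated compactly, as in the sketch below. The function name and its boolean inputs are illustrative, not the authors' evaluation code.

```python
# A response counts as a hallucination if it is factually incorrect OR if the
# cited source does not actually support the claim (misgrounded).

def is_hallucination(factually_correct: bool, citation_supports_claim: bool) -> bool:
    return (not factually_correct) or (not citation_supports_claim)

# A response whose facts are right but whose citation does not back them up
# still counts as a hallucination under this definition:
print(is_hallucination(factually_correct=True, citation_supports_claim=False))  # True
```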

The study also points out that document relevance in the legal field is not determined by textual similarity alone, which is how most RAG systems rank candidates. Retrieving documents that merely appear textually similar but are in fact irrelevant can hurt the system's performance.

“Our team has a previous study showing that general-purpose AI tools are prone to legal hallucinations – the tendency to invent false facts, cases, holdings, laws and regulations,” Daniel E. Ho, a Stanford law professor and co-author of the paper, told VentureBeat. “As elsewhere in AI, the legal tech industry has relied on RAG and boldly claimed to offer 'hallucination-free' products. This led us to design a study to evaluate these claims for legal RAG tools, and we show that, contrary to those marketing claims, legal RAG has not solved the problem of hallucinations.”

The researchers designed a diverse set of legal queries representing real-world research scenarios and tested them on three leading AI-powered legal research tools: Lexis+ AI from LexisNexis, and Westlaw AI-Assisted Research and Ask Practical Law AI from Thomson Reuters. Although the tools are not open source, all of them indicate that they use some form of RAG under the hood.

The researchers manually checked the tools' outputs and compared them with GPT-4 without RAG as a baseline. The study found that all three tools perform significantly better than GPT-4 but are far from perfect, producing errors on 17–33% of queries.

The researchers also found that the systems struggled with basic legal comprehension tasks that require close analysis of the sources they cite. They argue that the closed nature of legal AI tools makes it difficult for lawyers to assess when they can safely rely on them.

However, the authors note that despite its current limitations, AI-assisted legal research can still add value compared with traditional keyword search or general-purpose AI, especially when used as a starting point rather than the final word.

“One of the positive findings of our study is that RAG leads to fewer legal hallucinations compared with general-purpose AI,” Ho said. “But our paper also shows that RAG is not a panacea. Errors can occur along the RAG pipeline, for instance if the retrieved documents are inappropriate, and retrieval over legal documents is especially difficult.”

The need for transparency

“One of the key arguments we make in the paper is that we urgently need transparency and benchmarking in the legal AI space,” Ho said. “Unlike general AI research, legal technology is uniquely closed, with vendors offering virtually no technical information or evidence of performance for their products. This poses an enormous risk to lawyers.”

According to Ho, one large law firm spent nearly a year and a half evaluating a product and came away with no better answer than “whether the lawyers liked using the tool.”

“The paper calls for public benchmarking, and we’re pleased that the vendors we have spoken to agree on the immense value of the kind of benchmarking that has been done elsewhere in the AI space,” he said.

In a blog post responding to the paper, Mike Dahn, head of Westlaw product management at Thomson Reuters, described the company's process for testing the tool, which included rigorous testing with lawyers and customers.

“We strongly support efforts to test and compare solutions such as these, and we support the intent of the Stanford research team in conducting its recent study of RAG-based solutions for legal research,” Dahn wrote, “but we were quite surprised to hear the claims of significant problems with hallucinations in AI-assisted research.”

Dahn suspects that the Stanford researchers may have found higher inaccuracy rates than Thomson Reuters' internal tests because “the study included question types that we very rarely or never see in AI-assisted research.”

Dahn also emphasized that the company makes it “very clear to customers that the product can produce inaccuracies.”

However, Ho said that these tools “are marketed as general legal research tools, and our questions include bar exam questions, appellate questions, and Supreme Court questions – precisely the sorts of questions that require legal research.”

Pablo Arredondo, vice president of CoCounsel at Thomson Reuters, told VentureBeat: “I welcome the discussion Stanford has started with this study, and we look forward to exploring these results and other potential benchmarks in more detail. We are in initial discussions with the university to form a consortium of universities, law firms and legal tech companies to develop and maintain state-of-the-art benchmarks across a range of legal use cases.”

VentureBeat has also reached out to LexisNexis for comment and will update this post if we hear back. In a blog post published after the study, LexisNexis wrote: “It is important to understand that we do not promise perfection, but rather that all linked legal citations are free of misrepresentation. No gen-AI tool today can deliver 100% accuracy, regardless of the provider.”

LexisNexis also stressed that Lexis+ AI is meant to “enhance the work of a lawyer, not replace it. No technology application or software product can ever replace the judgment and reasoning of a lawyer.”
