Most models hallucinate

It is common knowledge, but apparently it takes a CS degree to “publish” it.

Our results are stark: Most models struggle to produce relevant sources. Four out of five models hallucinate a significant proportion of sources by producing invalid URLs. This problem goes away with the retrieval augmented generation (RAG) model, which first performs a web search for relevant sources before producing a summary of its findings. However, even in the GPT-4 RAG model, we find that up to 30% of statements made are not supported by any sources provided, with nearly half of responses containing at least one unsupported statement. This finding is more exaggerated in the other four models, with as few as 10% of responses fully supported in Gemini Pro, Google’s recently released LLM.

Source: Generating Medical Errors: GenAI and Erroneous Medical References by Stanford HAI
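
The excerpt describes the retrieve-then-generate flow that makes the URL-hallucination problem largely disappear: search for sources first, then summarize only what was retrieved. The sketch below is just a minimal illustration of that pattern, not the setup used in the study; web_search and llm_generate are hypothetical placeholders for a real search API and a real model call.

```python
# Minimal sketch of the retrieve-then-generate (RAG) flow described above.
# `web_search` and `llm_generate` are hypothetical stand-ins, not the APIs
# used in the Stanford HAI study.

from dataclasses import dataclass


@dataclass
class SourcedAnswer:
    text: str            # the model's summary
    sources: list[str]   # URLs retrieved *before* generation


def web_search(query: str, k: int = 5) -> list[str]:
    """Hypothetical search step: return the top-k source URLs for a query."""
    raise NotImplementedError("plug in a real search API here")


def llm_generate(prompt: str) -> str:
    """Hypothetical generation step: return the model's answer text."""
    raise NotImplementedError("plug in a real LLM call here")


def rag_answer(question: str) -> SourcedAnswer:
    # Retrieve first, so every cited URL actually exists ...
    sources = web_search(question)
    # ... then ask the model to summarize only what the retrieved sources say.
    prompt = (
        "Answer the question using only these sources, and cite them:\n"
        f"{sources}\nQuestion: {question}"
    )
    return SourcedAnswer(text=llm_generate(prompt), sources=sources)
```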

Summary:

  • The passage evaluates how well large language models (LLMs) can cite relevant medical references to support the claims they make in responses.
  • It finds that 50% to 90% of the statements LLMs make are not fully supported by the sources they provide. Retrieval augmented generation (RAG) models perform better, but still fail to fully support responses around half the time.
  • The passage proposes SourceCheckup, an automated evaluation framework that generates medical questions and scores how well LLMs back their responses with relevant sources; its judgments agree with medical doctors 88% of the time (a rough sketch of such a pipeline follows this list).
  • Top-performing LLMs such as GPT-4, Claude, and MedPaLM were evaluated. GPT-4 with RAG performed best but still left nearly half of responses not fully supported; the other models supported their statements even less.
  • LLMs predominantly cited US-based sources, which may not adequately represent international users. Performance also varied by question source, with Reddit questions faring worst.
  • Accurate source verification matters for earning user and regulator trust, and for informing future regulation of medical LLMs.
  • The rapid pace of LLM development requires understanding their ability to produce trustworthy, verifiable medical information.
  • Source verification extends beyond medicine to other domains like law and journalism that require evidence-backed claims.
  • Failure to provide sources accounted for many unsupported responses from GPT-4 RAG.
  • Future work should aim to improve LLMs’ ability to ground responses in credible, accessible sources.
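
The summary describes SourceCheckup only at a high level, so the following is a rough sketch of what such an evaluation loop might look like, not the framework's actual code. The question-generation, statement-splitting, and support-checking helpers are hypothetical stubs; the point is the two metrics the loop reports, statement-level support and fully supported responses, which mirror the figures discussed above.

```python
# Rough sketch of a SourceCheckup-style evaluation loop, based only on the
# summary above; the helper functions are assumptions, not the framework's
# actual implementation.

from typing import Callable


def generate_medical_question(seed_text: str) -> str:
    """Hypothetical: turn a source passage or forum post into a question."""
    raise NotImplementedError


def split_into_statements(response_text: str) -> list[str]:
    """Hypothetical: break an LLM response into individual factual claims."""
    raise NotImplementedError


def statement_supported(statement: str, cited_urls: list[str]) -> bool:
    """Hypothetical: check whether any cited source backs the statement."""
    raise NotImplementedError


def evaluate(answer_fn: Callable[[str], tuple[str, list[str]]],
             questions: list[str]) -> dict[str, float]:
    """Score one model: statement-level and response-level support rates."""
    supported = total = fully_supported_responses = 0
    for question in questions:
        text, urls = answer_fn(question)           # model answer + cited URLs
        flags = [statement_supported(s, urls)
                 for s in split_into_statements(text)]
        supported += sum(flags)
        total += len(flags)
        fully_supported_responses += all(flags)    # every claim backed by a source
    return {
        "statement_support_rate": supported / total,
        "fully_supported_response_rate": fully_supported_responses / len(questions),
    }


# Typical use (all names hypothetical):
#   questions = [generate_medical_question(p) for p in seed_passages]
#   results = evaluate(my_rag_model, questions)
```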
