A new study from Google researchers introduces “sufficient context,” a novel perspective for understanding and improving retrieval-augmented generation (RAG) systems in large language models (LLMs). This approach makes it possible to determine whether an LLM has enough information to answer a query accurately, a critical factor for developers building real-world enterprise applications where reliability and factual correctness are paramount.

The researchers found that Google’s Gemini 1.5 Pro model, with a single example (1-shot), performed best at classifying context sufficiency, achieving high F1 scores and accuracy. The paper notes, “In real-world scenarios, we cannot expect candidate answers when evaluating model performance. Hence, it is desirable to use a method that works using only the query and context.”

Interestingly, while RAG generally improves overall performance, additional context can also reduce a model’s ability to abstain from answering when it doesn’t have sufficient information. Given the finding that models may hallucinate rather than abstain, especially with RAG compared to a no-RAG setting, the researchers explored techniques to mitigate this.

They developed a new “selective generation” framework. This method uses a smaller, separate “intervention model” to decide whether the main LLM should generate an answer or abstain, offering a controllable trade-off between accuracy and coverage. The framework can be combined with any LLM, including proprietary models like Gemini and GPT.

The study found that using sufficient context as an additional signal in this framework leads to significantly higher accuracy on answered queries across various models and datasets. The method improved the fraction of correct answers among model responses by 2–10% for Gemini, GPT, and Gemma models.

For enterprise teams looking to apply these insights to their own RAG systems, such as those powering internal knowledge bases or customer support AI, Cyrus Rashtchian, co-author of the study, outlines a practical approach. He suggests first collecting a dataset of query-context pairs that represent the kinds of examples the model will see in production. The next step is to use an LLM-based autorater to label each example as having sufficient or insufficient context. “This already will give a good estimate of the % of sufficient context,” Rashtchian said. “If it is less than 80-90%, then there is likely a lot of room to improve on the retrieval or knowledge base side of things — this is a good observable symptom.”
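As a rough illustration of that workflow, the sketch below labels query-context pairs with an LLM-based autorater and reports the share judged sufficient. The prompt wording and the `call_llm` placeholder are assumptions for illustration, not the autorater or prompt used in the study.

```python
# Illustrative sketch: estimate the share of retrieval examples whose context
# is "sufficient" to answer the query, using an LLM as an autorater.
# `call_llm` is a placeholder for whichever model API the team already uses
# (Gemini, GPT, an open-weights model, etc.).

from dataclasses import dataclass

AUTORATER_PROMPT = """You are judging whether the provided context contains
enough information to answer the question. Reply with exactly one word:
SUFFICIENT or INSUFFICIENT.

Question: {query}
Context: {context}
Answer:"""


@dataclass
class Example:
    query: str
    context: str


def call_llm(prompt: str) -> str:
    """Placeholder: call your LLM of choice and return its text output."""
    raise NotImplementedError


def is_sufficient(example: Example) -> bool:
    """Ask the autorater whether the context is sufficient for the query."""
    reply = call_llm(AUTORATER_PROMPT.format(query=example.query, context=example.context))
    return reply.strip().upper().startswith("SUFFICIENT")


def sufficient_context_rate(examples: list[Example]) -> float:
    """Fraction of examples the autorater labels as having sufficient context."""
    labels = [is_sufficient(ex) for ex in examples]
    return sum(labels) / len(labels) if labels else 0.0
```

If the resulting rate falls below the 80-90% range Rashtchian mentions, the retrieval pipeline or the underlying knowledge base is the natural place to start debugging.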
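The “selective generation” framework described earlier can be sketched in a similarly hedged way: a small scoring model gates the main LLM, and raising the threshold trades answer coverage for accuracy on the queries that do get answered. The `intervention_score` and `answer_with_llm` functions and the default threshold are illustrative assumptions, not the components from the paper.

```python
# Illustrative sketch of selective generation: a small "intervention model"
# decides whether the main LLM should answer or abstain. A higher threshold
# means fewer answered queries but higher accuracy on the answered ones.

ABSTAIN = "I don't have enough information to answer that."


def intervention_score(query: str, context: str) -> float:
    """Placeholder: a small model returning a confidence in [0, 1] that
    answering now would be correct (for example, combining the main model's
    self-reported confidence with a sufficient-context label)."""
    raise NotImplementedError


def answer_with_llm(query: str, context: str) -> str:
    """Placeholder: call the main LLM (Gemini, GPT, Gemma, ...) with the RAG context."""
    raise NotImplementedError


def selective_generate(query: str, context: str, threshold: float = 0.5) -> str:
    """Answer only when the intervention model is confident enough; otherwise abstain."""
    if intervention_score(query, context) >= threshold:
        return answer_with_llm(query, context)
    return ABSTAIN
```

Per the study, it is the use of a sufficient-context signal, such as the autorater label computed above, as an additional input to this gating decision that drives the reported accuracy gains; in the sketch that signal is simply folded into the placeholder score.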