Apple researchers have developed a framework called EPICACHE that allows large language models to maintain context across extended conversations while using as little as one-sixth of the memory required by current approaches. The technique could prove crucial as businesses increasingly deploy AI systems for customer service, technical support, and other applications requiring sustained dialogue.

“Recent advances in large language models (LLMs) have extended context lengths, enabling assistants to sustain long histories for coherent, personalized responses,” the researchers wrote in their paper. “This ability, however, hinges on Key-Value (KV) caching, whose memory grows linearly with dialogue length and quickly dominates under strict resource constraints.” (A back-of-the-envelope sketch of that linear growth appears at the end of this piece.)

The Apple team’s solution breaks long conversations into coherent “episodes” based on topic, then selectively retrieves the relevant portions when responding to new queries. This approach, they say, mimics how humans might recall specific parts of a long conversation; a toy sketch of the clustering-and-retrieval idea also follows below.

“EPICACHE bounds cache growth through block-wise prefill and preserves topic-relevant context via episodic KV compression, which clusters conversation history into coherent episodes and applies episode-specific KV cache eviction,” the researchers explained. (The final sketch below illustrates the bounded, block-wise prefill idea.)

In tests across three conversational AI benchmarks, the system showed marked improvements. “Across three LongConvQA benchmarks, EPICACHE improves accuracy by up to 40% over recent baselines, sustains near-full KV accuracy under 4–6× compression, and reduces latency and memory by up to 2.4× and 3.5×,” according to the study.

The new framework could be particularly valuable for enterprise applications where cost efficiency matters. By reducing both memory usage and computational latency, EPICACHE could make it more economical to deploy sophisticated AI assistants for customer service, technical support, and internal business processes.
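To see why KV-cache memory becomes the bottleneck, consider a rough estimate. The sketch below assumes a hypothetical decoder-only model (32 layers, 8 KV heads, 128-dimensional heads, 16-bit values); these dimensions are illustrative assumptions, not figures from the Apple paper, but they show the linear scaling the researchers describe.

```python
# Back-of-the-envelope KV-cache size for a hypothetical decoder-only model.
# All model dimensions below are illustrative assumptions, not paper values.
def kv_cache_bytes(seq_len, n_layers=32, n_kv_heads=8, head_dim=128, dtype_bytes=2):
    # Each token stores one key and one value vector per layer per KV head.
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes * seq_len

for tokens in (4_096, 32_768, 131_072):
    gib = kv_cache_bytes(tokens) / 2**30
    print(f"{tokens:>7} tokens -> {gib:5.1f} GiB of KV cache")
```

At these assumed dimensions the cache costs about 128 KiB per token, so a 131,072-token dialogue history alone consumes 16 GiB before any model weights are loaded, which is the “quickly dominates” problem the quote describes.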
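The paper’s episodic compression operates on the KV cache itself, and its exact segmentation method is not reproduced here. Purely as an illustration of the clustering-and-retrieval idea, the following toy sketch groups conversation turns by topic using TF-IDF and k-means (both stand-in techniques, not the paper’s method) and fetches the episode most relevant to a new query.

```python
# Toy illustration of episodic clustering and retrieval: group conversation
# turns into topic "episodes", then fetch the episode most relevant to a new
# query. A stand-in for the idea only; EPICACHE operates on KV-cache entries
# and its actual segmentation method is not reproduced here.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

turns = [
    "How do I reset my router?",
    "The router admin page is at 192.168.1.1.",
    "Now my printer won't connect to wifi.",
    "Try re-adding the printer after the router reboot.",
    "Separately, can you update my billing address?",
    "Sure, your billing address is updated.",
]

vec = TfidfVectorizer()
X = vec.fit_transform(turns)

# Cluster turns into a small number of topical episodes.
n_episodes = 3
labels = KMeans(n_clusters=n_episodes, n_init=10, random_state=0).fit_predict(X)
episodes = {e: [t for t, l in zip(turns, labels) if l == e] for e in range(n_episodes)}

# Retrieve the episode whose turns are most similar to the new query on average.
query = "My printer still isn't showing up on the network."
q = vec.transform([query])
scores = [cosine_similarity(q, vec.transform(episodes[e])).mean() for e in range(n_episodes)]
best = max(range(n_episodes), key=scores.__getitem__)
print("Relevant episode:", episodes[best])
```

In this toy version, only the printer-related turns would be surfaced for the follow-up question, while the billing exchange stays untouched, mirroring the “recall specific parts of a long conversation” behavior the researchers describe.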
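Finally, a minimal sketch of the bounded, block-wise prefill idea: process a long history in fixed-size blocks and, after each block, evict the least important cache entries down to a fixed budget. The random importance scores and the simple top-k eviction rule here are placeholders; EPICACHE’s episode-specific eviction policy is not reproduced.

```python
# Minimal sketch of block-wise prefill with a bounded KV cache: ingest the
# history in fixed-size blocks and, after each block, evict the lowest-scoring
# entries down to a budget. Scores are random stand-ins for the per-episode
# importance signal EPICACHE would compute; this is not the paper's policy.
import numpy as np

rng = np.random.default_rng(0)
budget, block, d = 64, 32, 16           # max cached tokens, prefill block size, head dim
keys = np.empty((0, d)); vals = np.empty((0, d)); scores = np.empty(0)

history = rng.normal(size=(200, d))     # 200 token embeddings standing in for a long dialogue
for start in range(0, len(history), block):
    blk = history[start:start + block]
    # Append this block's K/V entries (identity projections for simplicity).
    keys = np.vstack([keys, blk]); vals = np.vstack([vals, blk])
    scores = np.concatenate([scores, rng.random(len(blk))])
    if len(keys) > budget:              # evict lowest-importance entries
        keep = np.argsort(scores)[-budget:]
        keep.sort()                     # preserve token order among survivors
        keys, vals, scores = keys[keep], vals[keep], scores[keep]

print(f"Cache bounded at {len(keys)} entries despite a {len(history)}-token history")
```

The point of the design is visible even in this simplification: because eviction runs after every block, peak cache memory stays constant no matter how long the conversation grows, which is what “bounds cache growth through block-wise prefill” refers to.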