Researchers from the University of California, Berkeley; Stanford University; and Databricks have introduced a new AI optimization method called GEPA (Genetic-Pareto) that significantly outperforms traditional reinforcement learning (RL) techniques for adapting LLMs to specialized tasks. GEPA uses an LLM's own language understanding to reflect on its performance, diagnose errors, and iteratively evolve its instructions. In addition to being more accurate than established techniques, GEPA is significantly more efficient, achieving superior results with up to 35 times fewer trial runs.

For businesses building complex AI agents and workflows, this translates directly into faster development cycles, substantially lower computational costs, and higher-performing, more reliable applications. GEPA is designed for teams that need to optimize systems built on top-tier models that often can't be fine-tuned, allowing them to improve performance without managing custom GPU clusters.

GEPA is a prompt optimizer that tackles a central weakness of RL-based adaptation, the sparse scalar reward, by replacing it with rich, natural language feedback. It leverages the fact that the entire execution of an AI system (including its reasoning steps, tool calls, and even error messages) can be serialized into text that an LLM can read and understand.

GEPA's methodology is built on three core pillars. First is "genetic prompt evolution," where GEPA treats a population of prompts like a gene pool and iteratively "mutates" them to create new, potentially better versions. This mutation is an intelligent process driven by the second pillar, "reflection with natural language feedback." After a few rollouts, GEPA provides an LLM with the full execution trace (what the system tried to do) and the outcome (what went right or wrong). The LLM then "reflects" on this feedback in natural language to diagnose the problem and write an improved, more detailed prompt.

The third pillar is "Pareto-based selection," which ensures smart exploration. Instead of focusing only on the single best-performing prompt, which can lead to getting stuck in a suboptimal solution (a "local optimum"), GEPA maintains a diverse roster of "specialist" prompts. It tracks which prompts perform best on different individual examples, creating a list of top candidates. By sampling from this diverse set of winning strategies, GEPA explores more solutions and is more likely to discover a prompt that generalizes well across a wide range of inputs. The code sketches at the end of this section make these mechanics concrete.

GEPA's core guidance is to structure feedback that surfaces not only outcomes but also intermediate trajectories and errors in plain text: the same evidence a human would use to diagnose system behavior.

A major practical benefit is that GEPA's instruction-based prompts are up to 9.2 times shorter than prompts produced by optimizers like MIPROv2, which include many few-shot examples. Shorter prompts decrease latency and reduce costs for API-based models, making the final application faster and cheaper to run in production.
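To make the reflection pillar concrete, here is a minimal sketch of reflective prompt mutation. The helper names (`build_reflection_prompt`, `reflect_and_mutate`) and the `call_llm` stub are illustrative assumptions, not GEPA's actual API; any chat-completion client can stand in for the stub.

```python
def build_reflection_prompt(current_instruction: str, rollouts: list[dict]) -> str:
    """Serialize rollout traces (reasoning steps, tool calls, errors)
    and their textual feedback into a prompt for the reflecting LLM."""
    rendered = "\n\n".join(
        f"--- Rollout {i} ---\nTrace:\n{r['trace']}\nFeedback:\n{r['feedback']}"
        for i, r in enumerate(rollouts, start=1)
    )
    return (
        "You are improving the instruction for an AI system.\n\n"
        f"Current instruction:\n{current_instruction}\n\n"
        f"Recent rollouts and feedback:\n{rendered}\n\n"
        "Diagnose what went wrong, then write an improved, more detailed "
        "instruction. Return only the new instruction."
    )

def reflect_and_mutate(current_instruction, rollouts, call_llm):
    """One 'mutation': the LLM reads the traces and rewrites the prompt."""
    return call_llm(build_reflection_prompt(current_instruction, rollouts))

# Stub standing in for a real model call; replace with your LLM client.
def call_llm(prompt: str) -> str:
    return "Extract the invoice total; if the OCR text is ambiguous, say so."

new_prompt = reflect_and_mutate(
    "Extract the invoice total.",
    [{"trace": "Tool call: ocr(page=1) -> garbled text...",
      "feedback": "Wrong total; model guessed instead of flagging OCR noise."}],
    call_llm,
)
print(new_prompt)
```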
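The Pareto-based selection pillar can also be sketched in a few lines. This simplified version (the function names and the win-count weighting are my own, not the paper's exact procedure) keeps every candidate that is the top scorer on at least one training example, then samples parents in proportion to how many examples they win.

```python
import random

def pareto_front(scores):
    """Indices of prompts that are top scorers on at least one example,
    plus a per-candidate count of examples each one wins.

    scores[i][j] is candidate i's score on training example j.
    """
    winners, win_counts = set(), [0] * len(scores)
    for j in range(len(scores[0])):
        best = max(row[j] for row in scores)
        for i, row in enumerate(scores):
            if row[j] == best:
                winners.add(i)
                win_counts[i] += 1
    return sorted(winners), win_counts

def sample_candidate(scores, rng=random):
    """Pick a parent prompt from the specialist roster, weighted by the
    number of examples it wins, so exploration stays diverse."""
    winners, counts = pareto_front(scores)
    return rng.choices(winners, weights=[counts[i] for i in winners], k=1)[0]

# Three candidate prompts scored on four examples:
scores = [
    [0.9, 0.2, 0.5, 0.4],  # specialist on example 0
    [0.3, 0.8, 0.5, 0.4],  # specialist on example 1
    [0.4, 0.4, 0.6, 0.7],  # specialist on examples 2 and 3
]
print(sample_candidate(scores, random.Random(0)))
```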
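Putting the pillars together, the overall method reads like a small genetic algorithm: pick a parent from the specialist roster, mutate it via reflection, score the child, and repeat under a fixed budget. This loop is a simplified assumption about how the pieces compose (GEPA reflects on fresh minibatch rollouts, which the placeholder feedback below glosses over) and reuses `sample_candidate` and `reflect_and_mutate` from the sketches above; `evaluate(prompt, example) -> float` is a task-specific scorer you supply.

```python
def gepa_style_optimize(seed_prompt, examples, evaluate, call_llm, budget=20):
    """Minimal genetic-Pareto loop over a growing pool of prompts."""
    candidates = [seed_prompt]
    scores = [[evaluate(seed_prompt, ex) for ex in examples]]
    for _ in range(budget):
        parent = sample_candidate(scores)   # Pareto-based parent pick
        rollouts = [                        # placeholder traces + feedback
            {"trace": f"input={ex!r}", "feedback": f"score={s}"}
            for ex, s in zip(examples, scores[parent])
        ]
        child = reflect_and_mutate(candidates[parent], rollouts, call_llm)
        candidates.append(child)
        scores.append([evaluate(child, ex) for ex in examples])
    # Return the candidate with the best total score across examples.
    best = max(range(len(candidates)), key=lambda i: sum(scores[i]))
    return candidates[best]
```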