Researchers at the Massachusetts Institute of Technology (MIT) are gaining renewed attention for developing and open-sourcing SEAL (Self-Adapting LLMs), a technique that allows large language models (LLMs), like those underpinning ChatGPT and most modern AI chatbots, to improve themselves by generating synthetic data and fine-tuning on it. Unlike conventional models that rely on fixed external data and human-crafted optimization pipelines, SEAL lets a model evolve by autonomously producing its own synthetic training data and the corresponding optimization directives, then applying them as fine-tuning updates.

The new version expands on the prior framework by demonstrating that SEAL's self-adaptation ability scales with model size, by integrating reinforcement learning more effectively to reduce catastrophic forgetting, and by formalizing SEAL's dual-loop structure (inner supervised fine-tuning and outer reinforcement optimization) for reproducibility.

SEAL operates as a two-loop system. The inner loop performs supervised fine-tuning based on a "self-edit," the model-generated synthetic training data and optimization directives, while the outer loop uses reinforcement learning to refine the policy that generates those self-edits. The reinforcement learning algorithm is based on ReSTEM, which combines sampling with filtered behavior cloning: during training, only self-edits that lead to performance improvements are reinforced, effectively teaching the model which kinds of edits are most beneficial for its own learning. For efficiency, SEAL applies LoRA-based fine-tuning rather than full parameter updates, enabling rapid experimentation and low-cost adaptation.
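To make that control flow concrete, here is a minimal, self-contained Python sketch of the two-loop structure described above. Everything in it, including the stand-in model parameters and the `generate_self_edit`, `lora_finetune`, and `evaluate` functions, is a hypothetical toy rather than MIT's released code; it only illustrates how an inner supervised fine-tuning step and an outer, ReSTEM-style filter that keeps only performance-improving self-edits fit together.

```python
import random

# Toy sketch of SEAL's two-loop structure (illustrative only).
# A real run would use an actual LLM, LoRA adapters, and a
# downstream evaluation task instead of these stand-ins.

def generate_self_edit(model_params, context):
    """Sample a 'self-edit': synthetic training data plus optimization
    directives proposed by the current model (toy stand-in)."""
    lr = random.choice([1e-5, 3e-5, 1e-4])
    synthetic_examples = [f"implication {i} of: {context}" for i in range(3)]
    return {"data": synthetic_examples, "lr": lr}

def lora_finetune(model_params, self_edit):
    """Inner loop: supervised fine-tuning on the self-edit.
    A LoRA-style low-cost update is faked here as a small nudge
    to a single scalar parameter."""
    delta = self_edit["lr"] * len(self_edit["data"])
    return model_params + random.uniform(-1.0, 1.0) * delta

def evaluate(model_params, task):
    """Toy downstream evaluation; higher is better."""
    return -abs(model_params - task["optimum"])

def seal_outer_loop(model_params, tasks, samples_per_task=4):
    """Outer loop (ReSTEM-style filtered behavior cloning): sample several
    self-edits per task, keep only those whose inner-loop update improves
    downstream performance, and collect them as the reinforcement signal
    for the self-edit-generating policy."""
    reinforced_edits = []
    for task in tasks:
        baseline = evaluate(model_params, task)
        for _ in range(samples_per_task):
            edit = generate_self_edit(model_params, task["context"])
            updated = lora_finetune(model_params, edit)
            if evaluate(updated, task) > baseline:
                # Only performance-improving self-edits are reinforced.
                reinforced_edits.append((task["context"], edit))
                model_params = updated
    return model_params, reinforced_edits

if __name__ == "__main__":
    tasks = [{"context": "passage about solar power", "optimum": 0.5},
             {"context": "passage about LoRA", "optimum": -0.2}]
    params, kept = seal_outer_loop(0.0, tasks)
    print(f"kept {len(kept)} beneficial self-edits; final params={params:.3f}")
```

In the actual framework, the filtered self-edits would then be used to further train the model that generated them, so it learns to propose better edits over time; the toy above compresses that step into simply keeping the beneficial update.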