RL Fine-Tuning for Small Models
RL, SFT, and Agentic Memory
Focus: Improving small language models under limited context, memory, and compute budgets.
Ongoing project: Reinforcement-learning fine-tuning, supervised fine-tuning, and agentic-memory policies that decide what to write, retain, and recall for downstream solvers.
Representative work: EMBER studies budgeted evidence retention for long-horizon agents; CPPO trains coordinated pass@K policies for diverse code-reasoning attempts.