From Slop to Skill: An Agent Proficiency Engine using RLM
I set out to learn how Recursive Language Models work in practice. Several iterations later, I’d accidentally built a proficiency development engine for agents.
The starting point was AsyncReview, a clean reference implementation of PR review using an RLM loop. Rather than read the code passively, I gave myself a concrete goal: use it to understand the mem0 codebase well enough to write a complete, accurate integration guide for a new datastore backend called zvec.
The result is real and testable: mem0 PR #4124.
Where AsyncReview Left Off
AsyncReview already does the hardest part well: recursive reasoning over code with tool use and grounded evidence gathering. For one-shot analysis, it’s excellent. However, my goal needed more: a reusable investigation artifact I could refine across runs, with persistent learning between iterations and some way to know when the work was good enough to consider finished.
AsyncReview gave me an engine and I built a harness around it.
The Runbook
I introduced what I’m calling a runbook: a sequenced technical interview of a codebase.
Each question is designed to unlock context for the next one. The model moves from high-level architecture down through concrete implementation details to validation patterns. That sequencing is the whole point: later questions depend on artifacts from earlier ones (interfaces, wiring paths, test patterns), and that dependency is what makes the final output complete rather than shallow.
The inspiration was a well-run interview. A good interviewer doesn’t just ask questions but leads the conversation somewhere, surfacing the most relevant knowledge in a deliberate order.
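The sequencing idea can be sketched in code. Everything below is a hypothetical illustration — the `Question` structure and `validate_sequencing` helper are mine, not part of AsyncReview — but it captures the core constraint: each question may only depend on artifacts produced by earlier questions.

```python
from dataclasses import dataclass, field

# Hypothetical sketch of a runbook entry; names are illustrative,
# not taken from AsyncReview.
@dataclass
class Question:
    prompt: str
    produces: list                               # artifacts this question should surface
    requires: list = field(default_factory=list)  # artifacts from earlier questions

def validate_sequencing(runbook):
    """Check that every question depends only on artifacts produced earlier."""
    seen = set()
    for q in runbook:
        missing = [a for a in q.requires if a not in seen]
        if missing:
            raise ValueError(f"{q.prompt!r} depends on {missing} before they exist")
        seen.update(q.produces)
    return True

runbook = [
    Question("Map the high-level architecture.", produces=["module_map"]),
    Question("Trace the datastore interface.", produces=["interface"],
             requires=["module_map"]),
    Question("Identify test patterns for backends.", produces=["test_patterns"],
             requires=["interface"]),
]
```

Reordering the list so a dependent question runs first fails validation — which is exactly the property that keeps the final report from going shallow.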
Iteration One Was… Lacking
The first runbook produced a weak report: gaps everywhere, shallow on the details that mattered most. That is the usual result of shallow agent prompting, even with the strongest models.
That failure made the next step obvious. This wasn’t a prompting problem I could fix in one shot. I needed a feedback loop.
So, I built one:
- Generate a runbook.
- Execute it against the target repo to produce a report.
- Evaluate gaps and inaccuracies in the report.
- Persist lessons to a memory file.
- Refine the runbook and repeat — until no new high-value improvements appear.
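The loop above can be sketched as a small driver. The helper names (`generate_runbook`, `execute`, `evaluate_gaps`, `refine`) are hypothetical stand-ins for real agent calls, not actual AsyncReview APIs — this is a minimal sketch of the control flow, not the implementation.

```python
# Minimal sketch of the refinement loop. The callables passed in are
# hypothetical stand-ins for the real agent steps.
def proficiency_loop(generate_runbook, execute, evaluate_gaps, refine,
                     max_iters=10):
    memory = []                         # lessons persisted across iterations
    runbook = generate_runbook(memory)  # 1. generate a runbook
    report = None
    for _ in range(max_iters):
        report = execute(runbook)       # 2. run it against the target repo
        gaps = evaluate_gaps(report)    # 3. find gaps and inaccuracies
        if not gaps:                    # converged: no high-value improvements left
            break
        memory.extend(gaps)             # 4. persist lessons to memory
        runbook = refine(runbook, memory)  # 5. refine and repeat
    return report, memory
```

With toy stubs whose reports improve each round, the loop terminates as soon as the evaluator stops finding gaps; the accumulated `memory` is what carries learning between iterations.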
When Workflow Becomes Something Else
By the third iteration, something had shifted. This stopped feeling like prompting and started feeling like training, with the runbook as curriculum and the report as what persisted.
Once the loop converged, I had more than a good report on mem0. I had an accurate, complete, and reusable process for developing deep codebase proficiency from scratch.
The interesting implication: point this at a different project, define new proficiency goals, and develop them the same way. The loop is skill development. The runbook is the specification of how you're trying to learn.
Agents can do a lot of things. Without structured guidance, they do them rather poorly. This is a direct attempt at solving that problem by engineering a system that refines the ability to investigate and improve.
What’s Next
A nascent “skill development” category seems to be forming. As I was finishing this, I saw the GRPO team working on a related idea from a different angle: https://x.com/skylar_b_payne/status/2024699151221739643
We’re approaching the same problem differently. Their path is reinforcement-based; mine is runbook-driven with explicit memory. I have real opinions about the tradeoffs — and I’ll write those up once I’ve stress-tested this on a second project.
For now: I came in trying to understand RLMs. I left with a new way to think about what agent capability actually means and how you build it deliberately.
My work: https://github.com/Dowwie/AsyncReview
Special thanks to Sheing Ng, author of AsyncReview.