Post-Training Is Just Dog Training
(And You Can Do It for $44)
There is a mystique around AI training that the industry loves to maintain. You hear phrases like “reinforcement learning,” “post-training alignment,” and “parameter-efficient fine-tuning,” and the implied message is: this is beyond you. This requires a PhD and a million-dollar GPU cluster. Leave it to the experts.
I am going to try to peel that back. Not because the techniques are trivial, but because the core ideas are genuinely intuitive once you strip away the jargon. I know this because I spent the better part of a weekend doing post-training at a hackathon, on a single rented GPU, for less than what you would spend on a nice dinner. And along the way I learned something real about how AI models develop new behaviors, something I think is worth sharing even if you have never written a line of code.
Start with the dog
The simplest way to understand how AI models learn new behaviors after their initial training is to think about training a dog.
When you bring home a puppy, it already knows how to walk, eat, bark, and interact with the world. It has general intelligence. What it does not know is how you want it to behave in your house: sit before eating, come when called, stay off the couch. Training the dog does not give it new physical capabilities. It shapes existing capabilities into specific behaviors through feedback. You give it a treat when it sits on command. You withhold the treat when it does not. Over time, the dog learns the strategy that earns rewards.
Post-training an AI model works the same way. A model like Mistral Small 24B has already been trained on massive amounts of text (that is the “puppy growing up” phase, called pre-training, and it costs millions of dollars). It already knows how to understand language, write code, follow instructions, and produce structured output. Post-training does not teach it new fundamental skills. It shapes the skills it already has into specific behaviors you want. And just like dog training, there are different techniques depending on what you are trying to teach.
The simplest technique is called Supervised Fine-Tuning, or SFT. Think of it as training by demonstration. You show the dog exactly what “sit” looks like, over and over, in different contexts, until it learns to reproduce the behavior. In AI terms, you give the model hundreds of examples of the behavior you want (inputs paired with ideal outputs) and it adjusts its internal weights to produce responses that look like those examples. “Learn by watching an expert.”
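If you want to see what that looks like in practice, here is a minimal sketch of a single training example. The prompt and response are invented for illustration (our real dataset is not reproduced in this post), but the shape, an input paired with the ideal output stored one JSON object per line, is what most fine-tuning stacks expect.

```python
# A minimal sketch of one SFT training example: an input paired with the
# ideal output. The content here is invented for illustration; real datasets
# hold hundreds of these, stored one JSON object per line (JSONL).
import json

example = {
    "messages": [
        {"role": "user", "content": "Audit auth.py for security issues."},
        {"role": "assistant", "content": (
            "Plan: 1) read auth.py, 2) dispatch a SQL-injection check on the "
            "query builders, 3) dispatch a session-handling review, "
            "4) cross-reference both reports before summarizing."
        )},
    ]
}

with open("sft_examples.jsonl", "a") as f:
    f.write(json.dumps(example) + "\n")
```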
The more advanced technique is Reinforcement Learning, or RL. This is the treat-and-no-treat approach. Instead of showing the model what the right answer looks like, you let it try things and then score the result. Good output gets a reward. Bad output does not. Over many rounds, the model figures out its own strategy for earning rewards, which sometimes produces behaviors more creative than what any expert demonstration would have taught. This is how DeepSeek trained their R1 model to reason through math problems, and it is genuinely exciting. But it is also slower, more expensive, and harder to set up.
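Here is the treat-and-no-treat idea boiled down to a few lines of Python. To be clear, this is not DeepSeek's actual recipe (R1 uses policy-gradient updates over many sampled attempts); it is only the core intuition of scoring attempts and keeping the ones that earn reward. The generate function is a hypothetical stand-in for a model call.

```python
# The treat-and-no-treat intuition in miniature: sample several attempts,
# score each with a reward function, keep the best. Not the DeepSeek-R1
# training algorithm; `generate` is a hypothetical stand-in for a model call.
def reward(answer: str, expected: str) -> float:
    """Treat if the final answer is right, no treat otherwise."""
    return 1.0 if answer.strip() == expected.strip() else 0.0

def best_of_n(generate, prompt: str, expected: str, n: int = 8) -> str:
    """Let the model try n times and keep the attempt that earned the reward."""
    attempts = [generate(prompt) for _ in range(n)]
    return max(attempts, key=lambda a: reward(a, expected))
```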
We used SFT with a twist. And the twist is where things get interesting.
The observation that started the project
Before the hackathon, I had been building DeepRepo, a system where a powerful AI model acts as a manager coordinating a team of cheaper, specialized AI models to analyze codebases. The manager model receives a task like “find all security vulnerabilities in this authentication system,” breaks it into subtasks, dispatches specialists, evaluates their work, and synthesizes the findings.
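For readers who think in code, the manager's job reduces to a loop like the sketch below. Every name in it is a hypothetical stand-in; DeepRepo's real interfaces are not reproduced here.

```python
# A skeleton of the manager/specialist pattern. All names are hypothetical
# stand-ins, not DeepRepo's actual interfaces.
from dataclasses import dataclass

@dataclass
class Subtask:
    specialist: str       # e.g. "security", "performance"
    files: list[str]      # which files to look at
    instruction: str      # the focused question to answer

def manage(task, plan, dispatch, evaluate, synthesize, max_retries=2):
    """Break a task into focused subtasks, check each result, retry shallow ones."""
    findings = []
    for subtask in plan(task):              # plan() yields Subtask objects;
        result = dispatch(subtask)          # the manager decides how many
        for _ in range(max_retries):
            if evaluate(result) >= 0.5:     # good enough, move on
                break
            result = dispatch(subtask)      # push back on a half-baked report
        findings.append(result)
    return synthesize(findings)             # cross-reference before reporting
```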
I noticed something while benchmarking different models in the manager seat. When Claude Opus (an expensive, powerful model) managed the team, it dispatched roughly 61 focused subtasks. It explored the codebase methodically, broke the analysis into specific pieces, checked the specialists’ work, retried when results were shallow, and cross-referenced findings before synthesizing. When Claude Sonnet (a cheaper, faster model) managed the exact same team on the exact same task, it dispatched 9. It sent out a few vague requests, accepted whatever came back, and called it done.
The specialists were equally competent in both cases. The tools were identical. The only difference was management style.
This is a distinction anyone who has worked on a team understands intuitively. A bad manager says “hey team, find all the bugs” and accepts whatever slides across the desk by 5 PM. A great manager says “check auth.py for SQL injection, verify the middleware tokens, cross-check your findings against the session handling docs,” and follows up when a report comes back half-baked. Both managers have the same team. One gets dramatically better results because of how they orchestrate the work.
The critical realization: Sonnet is perfectly capable of dispatching 61 tasks. It has the intelligence. It just does not use it. The gap between the great manager and the mediocre manager is not IQ. It is strategy. And strategy can be taught.
What we built at the hackathon
The Mistral Worldwide Hackathon SF gave us roughly 24 hours to test this thesis. Could we take a mid-tier model (Mistral Small, 24 billion parameters) and teach it to manage like an expensive one?
Here is what we did, translated into the dog training analogy.
First, we recorded the expert. We used Claude to generate 150 “demonstration sessions” showing perfect orchestration behavior across varied scenarios: security audits, bug detection, performance analysis, dependency checks, refactoring assessment. Each session showed the model exploring a codebase, breaking the problem into focused subtasks, dispatching specialists, evaluating their output critically, retrying when quality was low, and synthesizing findings with specific citations. These are the equivalent of showing the dog what “sit” looks like, 150 times, in 150 different rooms.
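A single demonstration session, flattened into a chat-style record, might look roughly like this. The scenario text is invented; only the shape of the behavior (explore, dispatch focused subtasks, evaluate critically, retry, synthesize with citations) matches what the 150 sessions demonstrated.

```python
# A sketch of one demonstration session. The content is invented; the real
# schema and scenarios from our dataset are not reproduced here.
session = {
    "scenario": "security_audit",
    "messages": [
        {"role": "user", "content": "Find all vulnerabilities in the authentication system."},
        {"role": "assistant", "content": "EXPLORE: list files under auth/ and read auth/login.py"},
        {"role": "assistant", "content": "DISPATCH security -> auth/login.py: check the query builders for SQL injection"},
        {"role": "assistant", "content": "EVALUATE: report is shallow, RETRY with a narrower instruction"},
        {"role": "assistant", "content": "SYNTHESIZE: two confirmed issues, each cited by file and line"},
    ],
}
```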
Second, we fine-tuned. Using a technique called QLoRA (which makes fine-tuning feasible on a single GPU by being very efficient about which parts of the model it adjusts), we trained Mistral Small on those demonstrations. The model has 23.7 billion parameters, but we only adjusted 92 million of them, about 0.39%. That sounds tiny until you realize 92 million parameters is roughly the size of the entire GPT-2 model, which was considered impressive in 2019. The key insight from the research literature is that when you teach a model a new behavior, the actual changes to its weights are concentrated in a small subspace. Most parameters barely move. QLoRA just makes this explicit by only training the parts that matter.
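For the curious, a representative QLoRA setup with the standard Hugging Face stack (transformers, peft, bitsandbytes) looks like the sketch below. Treat the rank, target modules, and model id as illustrative assumptions rather than the exact configuration from our run.

```python
# A representative QLoRA setup: a 4-bit quantized, frozen base model plus
# small trainable low-rank adapters. Hyperparameters and model id are
# illustrative assumptions, not our exact configuration.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb = BitsAndBytesConfig(
    load_in_4bit=True,                      # quantize the frozen base weights
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

base = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-Small-24B-Instruct-2501",  # assumed model id
    quantization_config=bnb,
    device_map="auto",
)

lora = LoraConfig(
    r=32, lora_alpha=64, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora)
model.print_trainable_parameters()  # only the adapters (a fraction of a percent) train
```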
Third, and this is the twist, we built a self-improvement loop. After training, we tested the model against 25 ground-truth orchestration scenarios and scored it on four dimensions: Did it assign the right specialist to the right files? Did it catch known bugs? Did it know when to retry versus accept? Did its final synthesis cross-reference findings and cite specifics? Then the system diagnosed which dimension was weakest, generated 40 new training examples specifically targeting that weakness, merged them into the training set, and retrained. Automatically. Without us touching anything.
This loop is where the SFT-plus-a-twist sits on the spectrum between pure demonstration learning and full RL. We are still showing the model examples (SFT), but the evaluation scores function like a rough reward signal, telling the system what the model is bad at so it can generate better training data. It is not RL in the formal sense (no policy gradients, no exploration), but it produces a similar feedback-driven improvement dynamic. Think of it as showing the dog progressively harder tricks based on which ones it is struggling with, rather than just repeating the same tricks forever.
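Stripped to its control flow, the loop is short. The evaluator, data generator, and trainer below are hypothetical stand-ins; only the shape (score on the four dimensions, diagnose the weakest, generate 40 targeted examples, retrain) mirrors what actually ran.

```python
# Skeleton of the self-improvement loop. The evaluation harness, data
# generator, and trainer are hypothetical stand-ins; dimension names are
# approximated from the four questions described above.
DIMENSIONS = ["specialist_assignment", "error_detection",
              "retry_intelligence", "synthesis_quality"]

def improvement_loop(model, train_set, evaluate, generate_examples, finetune, cycles=7):
    for i in range(cycles):
        scores = evaluate(model, DIMENSIONS)            # dimension -> score on the 25 scenarios
        weakest = min(scores, key=scores.get)           # diagnose the weakest behavior
        train_set += generate_examples(weakest, n=40)   # target it with new demonstrations
        model = finetune(model, train_set)              # retrain and go again
        print(f"cycle {i}: weakest={weakest} scores={scores}")
    return model
```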
We kicked off the loop at midnight on a rented A100 GPU and went to sleep. It ran for 7.5 hours, completed 7 full improvement cycles, generated 875 traceable decision records in W&B Weave, and cost $44 total including the GPU rental and the API calls for generating training data. When we woke up, there were results.
What actually happened (honestly)
The results were genuinely interesting, and genuinely messy. Here is the table.
Scores did not go straight up. If you look at the overall column, it zigzags: up, down, up, down, up. That probably looks like failure if you expected a smooth climb. It is not. Let me explain what is actually happening.
The loop always targets the weakest dimension. In iteration 1, retry intelligence dropped to 0.300, so the loop generated heavily retry-focused training examples. In iteration 2, retry jumped to 0.530 (a 77% improvement in a single cycle). That is the system working exactly as designed, correctly diagnosing and then fixing a specific weakness. But the retry-heavy training data slightly diluted the other dimensions, so they dipped. Then the loop pivoted to target those, retry dropped again, and the cycle repeated.
It is like a dog trainer who spends a week drilling “stay” until it is perfect, then realizes the dog has gotten rusty on “come.” So they drill “come” for a week, and now “stay” has slipped a bit. The dog is not getting worse. The trainer is just juggling multiple behaviors with limited practice time.
Three things in the data tell me the system is converging, not spinning its wheels. The valleys are rising (the worst iterations score 0.202, then 0.204, then 0.239, each “bad” cycle better than the last). The swings are getting smaller (the gap between peaks and valleys went from 0.053 to 0.011). And synthesis quality completely ignores the oscillation, climbing steadily from 0.375 to 0.459 over the full run, a clean 22% improvement. The fix is straightforward: balanced training that maintains previously improved behaviors while targeting new ones. Solvable engineering, not a fundamental wall.
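If you want to check those signals on your own runs, the test fits in a few lines of Python. The full per-iteration score table is not reproduced here, so this helper takes any series of overall scores.

```python
# Making the convergence argument checkable: given the overall score from
# each iteration, test whether local minima rise and peak-to-valley swings
# shrink over time.
def convergence_signals(scores: list[float]) -> dict:
    valleys = [s for i, s in enumerate(scores)
               if 0 < i < len(scores) - 1 and scores[i - 1] > s < scores[i + 1]]
    swings = [abs(a - b) for a, b in zip(scores, scores[1:])]
    return {
        "valleys_rising": valleys == sorted(valleys),
        "swings_shrinking": len(swings) > 1 and swings[-1] < swings[0],
    }
```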
The honest limitation
The most conspicuous number in the table is error detection stuck at zero. The loop diagnosed it correctly as the top failure mode in every single iteration. It tried to fix it every time. It could not.
Two possible explanations. The first is a measurement problem: the model detects errors but describes them differently than the scorer expects (think user_py_line45_string_format_vulnerability versus sql_injection_user_py_line_45: the same bug in different words, which a strict string-matching scorer counts as a complete miss). The second is a genuine limit of the SFT approach: detecting novel bugs might require reasoning from first principles rather than imitating patterns, and that kind of transfer might need RL rather than demonstration learning.
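Here is the measurement problem in miniature, using the two identifiers from the paragraph above. The real scorer is not reproduced here; this just shows why an exact-match comparison records a miss that a looser token-overlap comparison would flag as a probable hit.

```python
# The measurement problem in miniature: the model and the ground truth label
# the same bug with different identifiers, so an exact-match scorer records a
# miss that a token-overlap comparison would flag as a probable hit.
def exact_match(predicted: str, expected: str) -> bool:
    return predicted == expected

def token_overlap(predicted: str, expected: str) -> float:
    a, b = set(predicted.split("_")), set(expected.split("_"))
    return len(a & b) / len(a | b)              # Jaccard similarity over tokens

pred = "user_py_line45_string_format_vulnerability"
gold = "sql_injection_user_py_line_45"

print(exact_match(pred, gold))                  # False: counted as a missed bug
print(round(token_overlap(pred, gold), 2))      # 0.2: shared tokens, likely the same finding
```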
I think it is mostly the measurement problem, but I am not certain. That uncertainty is worth stating rather than burying.
What I think this means
This project convinced me of three things.
First, the behavioral gap between expensive and cheap models in orchestration tasks is real, large, and trainable. The difference between 61 dispatches and 9 dispatches is not intelligence. It is learned strategy, and you can teach that strategy to a smaller model for $44 on a single GPU.
Second, self-improvement loops that combine SFT with automated evaluation work as a mechanism. The 77% retry improvement in one cycle is not noise. The 22% steady climb in synthesis quality is not luck. The diagnose-and-intervene pattern produces real, measurable gains.
Third, and this is the one I keep coming back to, post-training is not the exclusive domain of frontier labs with million-dollar budgets. The ideas are intuitive (show the model good behavior, score the results, target the weaknesses, repeat). The tools exist. The compute is rentable by the hour. A single person at a weekend hackathon can run a self-improving training pipeline overnight and wake up to real results.
The natural next step is proper reinforcement learning on top of this SFT foundation, which is where the model stops just imitating the expert and starts developing its own orchestration strategies. DeepRepo already validated this path with a different model. SFT gives you 80% of the benefit in hours. RL is where the last 20% lives, and where genuinely novel behavior, strategies the expensive model itself never used, might emerge.
That is the part I find most exciting: not just compressing expensive behavior into cheap models, but eventually growing behavior that did not exist anywhere before. We are training the dog, but someday the dog might invent a trick the trainer never taught it.
RLM Distiller was built at the Mistral Worldwide Hackathon SF (Feb 28 to Mar 1, 2026) for Track 02: Fine-Tuning by W&B. The interactive demo dashboard is at demo-site-mu-rust.vercel.app. The underlying research comes from DeepRepo, a Dolores Research project for multi-agent codebase analysis.