blog.krauza.com
~ / posts / raising_ai_toddler / index.md

Raising Your AI from Toddler to Adult

A practical guide to shifting your mental model from treating AI as a search engine to treating it as a colleague, with three components (prompts, agents, instructions) that make the difference between chaos and consistent output.

Most people use AI the way they used StackOverflow: paste a question in, pull a fragment out, keep moving. That works right up until you need something consistent, repeatable, and debuggable, and then it falls apart. The difference between AI as a search bar and AI as something genuinely useful comes down to three things: how you prompt it, what kind of agent it is, and what instructions you give it. Get all three right and you stop getting random output. You start getting a system.

But understanding why those three things matter requires understanding what AI is actually doing when you hand it a task. It’s not reasoning the way you think it is. It’s doing something much closer to pattern completion, and if you haven’t given it enough signal to work with, it will complete the pattern with whatever it has, which is usually not what you wanted. The most honest comparison I’ve found is a toddler.

The Toddler Problem

A pair of bright pink shoes is resting on top of a wooden box - <a href='https://www.vecteezy.com/free-photos/toddler-shoes'>Toddler Shoes Stock photos by Vecteezy</a>
A pair of bright pink shoes is resting on top of a wooden box - Toddler Shoes Stock photos by Vecteezy

I spent time with my niece a few weekends ago and couldn’t stop thinking about how much she has in common with an LLM. Tell a toddler to go put on her shoes and you might come back to shoes, or ballet flats instead of sneakers, or no clothes at all, because you only specified footwear. That’s indeterminate output. The AI does the same thing: without enough context, it either does something meaningless or invents something plausible-sounding and entirely wrong.

The difference between deterministic and non-deterministic output matters when you’re building anything that has to work reliably. Ask the same model fifty times what color the sky is, and you’ll get blue maybe six times out of ten. The other four times you’ll get something else entirely. That’s not a foundation you can build a production system on.

The solution isn’t to lower your expectations of AI. It’s to stop treating it like a vending machine (input goes in, output comes out, no conversation required) and start treating it like the smartest person in the room who still needs you to explain what room they’re in.

Shifting Mental Models

The old model: AI as search engine. You type a question, copy the answer, and move on. It’s StackOverflow with better prose and no comment section arguing about whether your use case is off-topic.

The better model: AI as colleague. Specifically, a colleague who has read everything, asks clarifying questions before starting, and doesn’t assume they know what you want. The shift sounds small. The practical difference is significant. A search engine returns information. A colleague helps you figure out whether you’re asking the right question in the first place.

Seven out of ten times, your intuition about what needs to happen is right, but the implementation is off. The AI helps you find the problem before you’ve committed to a solution. When it asks “What’s the expected output?” or “Are there constraints I should know about?” before starting on the task, that’s not friction. That’s the process working correctly.

Context is what makes the difference. The AI only knows what you tell it. Hand it a project brief, a folder of documents, or a half-formed problem statement and say “go do something” and it will, just not necessarily the something you wanted, because it has one piece of a much larger picture. The upfront work of explaining what you’re trying to accomplish, what the constraints are, and what success looks like is not overhead. It’s the job.

Which brings me to the actual mechanics. Most people who struggle with AI reliability are conflating three things that need to stay separate: what you’re asking for, who you’re asking, and how that entity is supposed to operate. Collapse all three into a single message and you get a coin flip every time. Keep them distinct and you have something you can reason about, tune, and actually trust.

The Three Components

Understanding these three individually, and how they work together, is what separates a one-shot query from a system you can actually rely on.

Prompts are your requests. In a well-structured system, they’re shorthand. “Go clean up the living room” is enough. You shouldn’t have to say “invoke the toddler agent with the blocks-in-the-blue-bin instruction set and execute the toy pickup skill.” The whole point of defining agents, instructions, and skills in advance is that the system can figure out what a natural language prompt implies and route it to the right combination automatically. The prompt carries your intent; the structure you’ve already built supplies the rest.

That only works, though, when the structure actually exists. “Go clean up the living room” only works as a prompt if the toddler already knows what clean means, which bin the blocks go in, and how to pick things up without breaking them. Strip that away and “go clean up the living room” produces whatever the toddler decides it means. Same with AI: each prompt is shorthand for something more specific: a particular agent, a particular set of criteria, a particular procedure. If those things aren’t defined, the prompt lands in a vacuum and the model fills in the blanks however it sees fit.

Agents are personality. Think of them as the toddler’s disposition: meticulous, creative, or genuinely chaotic, depending on what you got. The difference is that with an AI agent, you define the personality rather than inherit it. You can specify an agent as a senior technical writer who values clarity over cleverness, or a research analyst who always cites sources and flags uncertainty, or a project manager who asks about dependencies before committing to a timeline. You give it a role, a domain, quality standards it applies every time without exception. That definition becomes the consistent foundation for every interaction, rather than being rebuilt from scratch each time you start a new conversation. The agent knows who it is regardless of what task you put in front of it.

Instructions are the standards and criteria. Sneakers for the park, dress shoes for church. Blocks go in the blue bin, cars go in the red bin, toys away when done. For a research agent, that might mean: always cite your sources, flag anything you’re uncertain about, summarize in under three paragraphs, and never recommend a course of action without listing the tradeoffs. The point is that instructions can change without changing the agent. Tomorrow you hand that same research agent a competitive analysis task, and the instructions swap out entirely: focus only on pricing and positioning, ignore product features, structure the output as a comparison table. The agent is who it is. The instructions tell it what to do right now.

The separation sounds like extra overhead until you’ve debugged an agent that went wrong and had no idea whether the problem was the prompt, the personality, or the rules it was supposed to follow. When everything lives in one blob, every failure is a mystery. When it’s separated, failures have addresses.

Why Separation Matters

Three reasons, none of them abstract.

Reusability: The rule that says blocks go in the blue bin doesn’t care who’s doing the cleaning: Teddy, the babysitter, or a visiting cousin who’s never been to the house before. Hand them the list and they can follow it. Instructions work the same way: they live in their own file, separate from the agent definition, so any agent can pick them up without those rules being baked into one specific personality. One agent uses them today. A completely different agent uses them tomorrow for a different kind of task. The core rules stay consistent; the personalities executing them can change.

Consistency: “Clean” means something different to everyone. Ask ten people what it looks like and you’ll get ten different answers. Writing the standard down turns a vague expectation into something the agent can evaluate its own output against. When you define “excellent” explicitly (everything in a bin, lid closed), the agent can check its work against that definition rather than deciding arbitrarily that it’s finished. Without that self-check, agents tend to stop either too early or too late, and you won’t know which until something breaks downstream.

Traceability: This is the one that matters most when things go sideways. If an agent goes left when it should have gone right, was that because the instruction was ambiguous or because it hallucinated past the guidance you gave it? Separated concerns let you look at exactly which instruction produced which behavior, track changes to your instructions over time the same way you’d track changes to any important document, and make targeted fixes without tearing apart the whole system to find the one thing that broke. When you can answer “what state were my agents in when this happened,” debugging becomes something you can actually do rather than something you guess at.

Once you have the three-component structure working for a single agent, the natural next step is chaining them together, which introduces a new problem. If every agent in your pipeline is running on your most capable model with full context of everything happening everywhere, you’ve built something expensive, slow, and brittle. The interesting design question isn’t whether to use multiple agents. It’s how to coordinate them without making a mess.

Orchestrating the Chaos

Not every task needs your most capable model. Generating documentation is time-consuming but not computationally hard: no real reasoning required, just reading something that exists and rendering it in plain language. Running expensive compute on that is like calling in a senior engineer to write meeting notes. A cheaper model handles it just as well, and the cost difference compounds fast across a real workload.

The orchestration pattern works like a delegation chain. A capable orchestrator receives the task, breaks it into discrete pieces, and spawns sub-agents to handle each one, with each running on whatever model is appropriate for its complexity. Sub-agents run with isolated memory and context, because they don’t need the full picture to do a narrow job. An agent summarizing one section of a report doesn’t need the entire project history and stakeholder map. Keeping that context scoped keeps each agent faster, more focused, and less likely to hallucinate from noise. More context is not always better: past a certain threshold, it starts working against you by crowding out the signal with irrelevant information the model now has to reason around.

The orchestrator takes what comes back from the sub-agents and decides what happens next: what needs a second pass, what can proceed, what needs human input before anything else moves. That’s the role a senior person on any team used to play: pulling together output from multiple contributors, deciding what’s ready, and flagging what isn’t. Now that coordination logic lives in a file.

So what does this actually look like on disk?

Why Markdown

Everything is markdown, plain .md files. If you haven’t worked with it before: markdown is essentially HTML with the complexity stripped out. Instead of wrapping text in <h1> tags to make a heading, you put a # at the start of the line. Instead of <strong> for bold, you wrap the word in **asterisks**. A - at the start of a line is a bullet point where you’d otherwise write <li>. When markdown is rendered in GitHub, VS Code, your notes app, or a browser, it produces the same formatted output as the equivalent HTML, just written in a fraction of the time with a fraction of the syntax.

Not Word documents, not PDFs, not YAML configs or JSON schemas. That choice isn’t aesthetic. It’s functional, and it matters for a few reasons that compound on each other.

First, markdown is human-readable without tooling. You can open any of these files in a text editor, read them, edit them, and understand exactly what the agent is going to receive. There’s no serialization layer to reason about, no rendering step required before you can tell whether what you wrote is actually what you meant.

Second, markdown is natively readable by models. Hand a model a PDF and it has to figure out the file format before it can start on your content. That’s wasted compute and an opportunity for things to go wrong before the actual task even begins. Markdown is plain text, so the model starts reading your content immediately.

Third, and most importantly, markdown carries semantic structure that models actually use. A ## header signals a section boundary. A table is a table. A bullet list of rules reads differently from a paragraph describing the same rules. When the model encounters | Rating | Criteria | it knows it’s looking at a classification schema, not a narrative passage. That structure isn’t decoration. It’s part of the information you’re communicating, and the model uses it to understand the shape of what you’re giving it before it processes the content.

The practical consequence is that how you format within each file matters as much as what you put in it. A flat wall of prose in your agent file and a flat wall of prose in your instructions file look identical to the model. It has to infer what’s a rule versus what’s context versus what’s an example. Use headers to separate sections, tables for classification and rating criteria, and bullet lists for sequential rules. Give the model anchors it can orient around, not a document it has to interpret from scratch.

Putting It Together

Before deciding which files you need, it helps to ask a more basic question: does this project actually need agents and instructions at all, or will a context file alone do the job?

The Context File (CLAUDE.md)

Context Matters. Illustration by Christophe Vorlet. - samim
Context Matters. Illustration by Christophe Vorlet. - samim

The context file is the foundation, and for a lot of use cases it’s the only file you need. Think of it as the note you leave for the babysitter: here’s the bedtime, here’s what she can and can’t have for a snack, here’s what to do if she wakes up crying, here are the things you never decide without calling us first. It’s not instructions for a specific task and it’s not defining anyone’s personality. It’s the background that applies to everything, every session, regardless of what the specific ask is. If your work is varied enough that you don’t want to constrain the AI to a fixed personality or a fixed set of rules, but you do want it to always understand the context it’s operating in — a context file alone gets you most of the way there.

The honest question to ask before building out agents and instructions is: do you have a repeatable task with a consistent set of expectations? If the work is varied, open-ended, or different every time, you probably don’t need an agent. A context file and a good prompt will do.

There’s a second question worth asking even before that one: does the output actually need to be deterministic? If you’re processing files in a consistent way, generating a report with a fixed structure, or running the same sequence of operations every time, an agent is the wrong tool. Agents are non-deterministic by nature. That flexibility is valuable when you need judgment; it’s a liability when you need the same result every time. In those cases, the better move is to use the LLM to help you write a script or build a tool that does the work deterministically, then run the tool. The LLM is excellent at writing that code. The output of the code will be consistent in a way that an agent never will be. Reach for agents when you need judgment. Reach for a script when you need reliability.

Agents earn their place when the same kind of work shows up often enough that it’s worth encoding how it should be approached, and when the work genuinely requires reasoning rather than repetition. Instructions earn their place when “done” needs a specific definition that isn’t obvious from the task alone. When both of those conditions are true, agents and instructions are the right tool. But they layer on top of a context file, not instead of one.

It’s also worth knowing that not all AI tools support file-based context, and the category of tool you’re using determines what’s available to you.

Chat interfaces (Claude Chat, ChatGPT, Gemini, and similar) are conversational. Every session starts cold. The model has no memory of what you worked on yesterday, no awareness of your project, no knowledge of your conventions. You can paste context in manually or build it up through the conversation, but nothing persists between sessions. This is fine for one-off questions. It breaks down the moment you’re doing repeated work on the same project, because you’re re-explaining yourself every time.

Agentic task systems (Claude CoWork, and similar tools) are designed for complex, multi-step work that benefits from file access and extended execution time: document creation, research synthesis, batch processing, data transformation. Work you can hand off and come back to when it’s done. These tools support context files, agent files, and instruction files, which means the full structure we’ve been talking about is available to you here.

CLI-based agentic tools (Claude Code, and similar tools) are built for sustained, multi-turn work directly inside a project. Claude Code is what I’m using to write and edit this post right now. It has a specific file called CLAUDE.md that gets automatically loaded into every session as persistent project context. You write it once; it’s there every time. No pasting, no re-explaining, no hoping you included the important constraints before the model started doing something you didn’t intend.

The context file stands apart from the agent and instruction files for a reason: it doesn’t belong to any one agent, and it doesn’t belong to any one task. Every agent that touches the project loads it. Think of it less as a configuration file and more as institutional memory: the things that are always true about this project that no agent should ever have to be told twice.

If you’re not using Claude, the same pattern applies under a different name: AGENT_CONTEXT.md, project.md, or whatever your tooling picks up automatically. The name matters less than having a place for this that isn’t tangled up with personality or task rules.

The Agent and Instruction Files

Once you have context sorted, agents and instructions are the next layer, and they’re always paired.

The agent file (.agent.md) is descriptive and qualitative, using prose with clear section headers for role, domain expertise, quality standards, and behavioral rules. Think of it as a resume and a personality profile in one document. This file rarely changes. The agent’s identity stays consistent whether it’s drafting a report, reviewing a contract, or summarizing research.

The instructions file (.instructions.md) is structured and evaluative. Where the agent file describes identity, the instructions file describes criteria, and criteria work best as tables, scales, and lists rather than prose. It typically includes a classification table for when to use which approach, a rating scale for outputs (excellent, good, fair, poor), explicit positive and negative indicators, and specific validation requirements. This is what lets the agent self-evaluate. Without a concrete definition of “done,” an agent decides arbitrarily when to stop.

The agent is the constant. Its identity, its expertise, its standards don’t change. The instructions are the variable. You swap them out depending on the task at hand. One agent, many possible instruction sets.

The prompts file (prompts.md) is the entry point, how users or other agents invoke the system. Three interaction modes cover most cases: Create (generate something new from a request), Critique (evaluate existing output against the standards in the instructions file), and Refine (iterate on a draft until it clears the threshold). This file changes most often as you learn how people actually use the agent and what they need from it.

Skills Files

Skills are the most commonly misunderstood piece of this whole structure, and the confusion is understandable. Both instructions and skills look like lists of rules in a text file. The difference isn’t in what they look like. It’s in when they activate and what they’re for.

Instructions are always on. They’re the agent’s operating principles, present in every interaction, shaping every output, governing how the agent thinks regardless of what task is in front of it. An instructions file might say: always cite your sources, keep responses under three paragraphs, flag anything you’re not confident about, never make a recommendation without listing the tradeoffs. These apply whether the agent is writing a summary, reviewing a document, or answering a question. The agent doesn’t decide to apply them. They’re just how it works.

Skills are dormant until invoked. They’re discrete, named procedures with a defined input and a defined output, called by name when a specific task needs to happen, then done. A skill called draft_weekly_update might be: (1) pull the list of completed items, (2) group them by theme, (3) write one sentence per group in past tense, (4) flag anything still in progress, (5) return a formatted summary ready to send. That sequence only runs when something explicitly calls draft_weekly_update. It’s not shaping every response. It’s waiting to be used.

The analogy that makes this click: go back to the toddler. You’ve told her to clean her room. The skill is how she physically executes that: pick up each toy with your hands (not your feet, not your head, not by kicking things under the bed), carry it to the bin, set it down gently. That’s the procedure. It’s the same sequence of steps every time she cleans up, regardless of what the room looks like or what standard she’s working toward.

The instructions are what “clean” actually means in this house. Blocks go in the blue bin. Cars go in the red bin. Books go on the bottom shelf. Lids go on the bins when you’re done. Nothing left on the floor. Those are the standards, the definition of done she’s working toward. The skill tells her how to move the toys. The instructions tell her where things belong and what the finished state should look like.

You can run the same skill against different instructions. At grandma’s house, everything goes in one big basket and clean means nothing on the floor. At home, every category has its own bin and the lids have to be on. The procedure for picking things up and placing them doesn’t change. What changes is the standard being applied and the organization being enforced. That’s the difference: skills are the execution, instructions are the criteria. One tells you how to do the work. The other tells you what the work is supposed to produce.

Where it gets confusing is that skills can include procedural steps that look a lot like instructions. The difference is scope and activation. An instruction like “always verify sources” applies to everything, always. A skill that includes a step “verify that sources are credible before including them” only applies when that specific skill is running. If you find yourself writing the same sequence of steps into your prompts every time you need a specific type of output — that’s a skill waiting to be extracted. If you’re writing rules that should shape how the agent approaches all of its work — that’s instructions.

In practice: instructions change when your standards change. Skills change when your process changes. They evolve for different reasons on different timescales, which is another reason they belong in separate files.

Recap and Where to Start

A lot of ground covered, so here’s the decision hierarchy distilled:

Start with a context file. Before anything else, write down what’s always true about your project: the audience, the tone, the constraints, the things that don’t change task to task. This is your foundation, and for a lot of work, it’s the only file you need. If you’re doing varied, open-ended work and a good prompt gets you there, stop here.

Ask whether you need an agent at all, or a tool. If the output needs to be deterministic and repeatable in exactly the same way every time, an agent is the wrong choice. Have the LLM help you write a script or build a tool that does the work reliably, then run the tool. Reach for agents when you need judgment. Reach for code when you need consistency.

Reach for agents and instructions when the work is repeatable and “done” needs a definition. The agent is who’s doing the work: the identity, the expertise, the standards. The instructions define what done looks like for this specific task. They’re always paired, and the instructions are the thing you swap out when the task changes.

Extract skills when you’re repeating the same procedure. If you’re writing the same sequence of steps into your prompts more than a few times, that’s a skill. Pull it out, name it, let agents call it. Skills are the execution; instructions are the criteria. One tells you how to move the toys. The other tells you where they belong.

Orchestrate when you have multiple agents. Not every task needs your most capable model. Delegate narrow work to smaller models, keep context scoped to what each agent actually needs, and put the critical thinking at the orchestration layer where it belongs.

None of this has to be figured out alone, and there’s something fitting about that. If you’re not sure where to start, or which of these layers your problem actually needs, feed this post to your LLM of choice and ask it. Describe what you’re trying to do, tell it what you’ve read, and let it help you work out whether you need a context file, an agent, a skill, or just a well-written script. That’s the whole point of treating it like a colleague rather than a search bar. It can help you figure out the right tool before you build the wrong one.

Search