Context Engineering: Why AI Keeps Forgetting, and How to Fix It
Source "How to Master Context Engineering" — Digital Bricks
Adapted and expanded by the Tiro team.
We all thought a good prompt was enough. Then, at some point, the chatbot's answers start to drift. It forgets something you clearly told it five minutes ago, your coding assistant keeps losing track of your project structure, and the RAG-based internal bot can't seem to reference two documents at once.
Here's the key: none of this is because "the prompt was weak." As AI applications grow more complex, a one-line prompt becomes the tip of the iceberg. What really matters is designing—at the system level—what the model sees and what it remembers. That is exactly what context engineering is about.
In this post, we'll look at what context engineering is, how it differs from prompt engineering, the shapes it takes in production (RAG, agents, coding assistants), and—most importantly—the four classic ways context breaks down and how to handle each one.
What Is Context Engineering?
In one sentence: context engineering is the practice of designing, at the system level, what a model gets to see before it produces an answer. It's not about polishing a single line of prompt—it's about building something like a living information ecosystem around the model.
The "context" here is far broader than people assume. It isn't just the question the user just asked. It's everything the model receives right up to the moment it generates a response:
System instructions: role and rule setting, like "You are a friendly customer support agent."
The current question + conversation history: what was said five minutes ago.
Long-term memory and user preferences: the tone this person tends to prefer.
Retrieved external material: internal wikis, policy docs, database query results.
Available tools: things like "check calendar" or "send email."
Output format scaffolding: a JSON schema, an email template, or other fixed structure.
Real-time data: a stock price or weather reading a freshly called API just returned.
The catch is that the amount of context a model can take in isn't infinite. So context engineering is, in practice, the art of selection and compression. What do you show right now, what do you summarize, and what do you set aside for later? Short-term memory for recent exchanges, long-term memory for older facts, summaries for things that have aged—you have to layer it deliberately.
When this is done well, the AI finally feels like it "remembers like a person." It pulls in yesterday's conversation, reflects the user's preferences, references company documents, and factors in real-time information as it answers. Only then does it stop being one-off Q&A and start feeling like "working together."
How Is It Different from Prompt Engineering?
Typing "write a polite reply to this email" into ChatGPT is prompt engineering. But the moment you build a company support chatbot that has to remember earlier conversations, pull in the user's account details, look up relevant product docs, and hold a consistent tone throughout—that's where you've entered the territory of context engineering.
The difference, in short:
Prompt engineering: designing what to ask the model.
Context engineering: designing, as a system, what to give the model.
Andrej Karpathy's framing makes it easy to grasp:
"People associate prompts with short task descriptions you'd give an LLM in your day-to-day use. […] In every industrial-strength LLM app, context engineering is the delicate art and science of filling the context window with just the right information for the next step."
Good prompts still matter. The difference is whether that prompt operates in empty space, or on top of well-organized background information.
What It Looks Like in Practice
In real products, context engineering shows up in three forms.
RAG: Letting the Model Take an "Open-Book Exam"
The most familiar pattern is RAG (Retrieval-Augmented Generation). It sounds intimidating, but the idea is simple. When a user asks something, you retrieve relevant material from an external knowledge base and show it to the model alongside the question. That lets the model answer using information it never saw during training.
It used to be that answering questions about internal policy meant fine-tuning a model on company documents. RAG flips this. You leave the model as-is and retrieve the information you need, when you need it, slotting it into the context. The fastest way to understand it: you're letting the model take an open-book exam.
The benefits are clear. The model makes fewer mistakes, and answers reflect facts that are actually relevant to your company. That's why RAG is the first step in almost every case where company material is central—internal Q&A bots, customer support bots, personal assistants, and more.
Agents: Models That Fill Their Own Context
If RAG puts material into the context, agents go one step further. The model picks and calls tools itself, then feeds the results back into its own context as it works through a task. The context shifts from a static block of text into a live workbench.
Say you ask, "Tell me Tesla's stock price and summarize the related news." A well-built agent moves like this: (1) recognizes "I need fresh data" → (2) calls a stock-price API → (3) calls a news-search tool → (4) ties the two together into a summary. Each step's result accumulates in the context, and the agent decides its next action based on what it sees. It's a loop of observe → act → update context → observe again.
Lately, instead of one agent doing everything, it's increasingly common to split the work across several agents. One summarizes documents, another searches the database, and a third merges the two results. The trap: when multiple agents pass context back and forth incorrectly, they confuse one another and easily fall into infinite loops. That's exactly why context engineering matters even more here.
Coding Assistants: The Hardest Case
Coding assistants like GitHub Copilot, Cursor, and Windsurf are arguably the ultimate test of context engineering. Codebases are usually huge, deeply interconnected, and a single file can affect dozens of others.
"Refactor the process_data function so it also handles empty values"—to handle this one-line request well, the assistant needs to know a lot. Where this function is called, what types it takes and returns, the project's coding style, what recent commits have touched it. Filling all of that into the context is the assistant's real job. In a sense, you could call it RAG running on top of a codebase.
The same tool tends to feel better the longer you use it on a project—and naturally so, because its context keeps getting richer.
The Four Ways Context Breaks Down
A question naturally surfaces here: "So can't we just crank the agent's usable context up to a million tokens and solve everything?"
Recent research says the opposite. AI researcher Drew Breunig is blunt about it: "longer context does not generate better responses." In fact, once there's too much context, models start to break down in new ways—they get poisoned, distracted, confused, or end up clashing with themselves.
1. Context Poisoning: When False Information Lodges In as Truth
If, even once during a conversation, a model lodges something untrue into the context as a "fact," it will keep reasoning on top of that false fact from then on. One small hallucination wrecks the entire flow.
DeepMind's Gemini agent playing Pokémon is the textbook example. The moment it wrongly registered a perfectly healthy character as "fainted," that information got lodged into the agent's goal list. From there, every action the agent took fell apart.
The fix is Quarantine. Don't just accept whatever the model claims to know—verify it once, then move it into long-term memory. Handle anything suspicious in a separate session or a separate agent, so that one stream of errors doesn't contaminate the others. It's like keeping several separate notebooks: even if one is a mess, the rest stay clean.
2. Context Distraction: A Model Captivated by Its Own Log
When the context grows too large, a model starts to lean more on the log piling up in front of it than on the knowledge it learned during training. It ends up repeating its own past actions instead of generating new answers.
The same Gemini agent from before stopped producing new plans and just repeated past behavior once its context passed 100,000 tokens. According to Databricks' measurements, even Llama 3.1 with 405 billion parameters starts losing accuracy around 32,000 tokens. Being able to receive a million tokens and being able to use a million tokens are two different things.
The fix is summarization and pruning. When a conversation gets long, don't drag the raw log along—swap in a summary that distills only the essentials. It's like wrapping up a long meeting with a few lines of action items. You're tidying the desk now and then so the model can focus on the work.
3. Context Confusion: More Tools Means Worse Choices
If you tell a model "here's this tool, and that tool, and this one too," it actually gets worse at picking the one that matters. UC Berkeley's function-calling leaderboard showed the same thing: nearly every model performed worse when given several tools than when given just one.
The "Less is More" research is even more concrete. A Llama 3.1 8B model failed outright when shown all 46 tools, but did fine when shown only the 19 most relevant ones. And that's even though they all fit inside the context window. The problem wasn't length—it was "too many irrelevant options."
The fix is Tool Loadout Management. Just as a gamer picks the weapons to bring for each mission, you look at the user's question and then hand the model only the two or three tools that fit. Interestingly, putting tool descriptions into a vector database and selecting them with RAG works well—which means you end up reapplying the principle of context engineering to context engineering itself.
4. Context Clash: A Model Fighting Its Own Words
The last one is the most serious. When a user doesn't give all the information at once but drips it out over several turns, the model puts out a half-baked guess before all the information is in. That half-baked guess then stays in the context. Even when the real information arrives later, the model starts getting confused between what it said earlier and the new input.
A joint Microsoft–Salesforce study measured exactly this: compared to giving the same information all at once, splitting it across multiple turns dropped performance by an average of 39%. One model crashed from 98% down to 64%. Same amount of information—just for the reason that it "trickled in step by step."
There are two fixes here.
Pruning: if a wrong intermediate guess is sitting in the context, cut that part out when building the next turn. You clear it away in advance so new information doesn't fight with old guesses.
Offloading: move the intermediate reasoning the model churns through in its "head" onto a separate "scratchpad." Anthropic's think tool, added to Claude, is a prime example. You strip that scratchpad out of the context that produces the final answer, so when it answers, it works from a clean state—without the "messy intermediate work." Anthropic reported that this method alone improved performance by up to 54% on certain agent tasks.
Wrapping Up
Looking at all this, context engineering might seem grand and daunting—but you don't have to start with everything in place. Starting small is more than enough to see results.
Add just one step—"search the internal wiki"—to your company chatbot (basic RAG).
Attach a small memory buffer that remembers just the last few turns of conversation.
Slip a single "one-line summary about this user" in front of a prompt that already works well.
After that, you keep asking one question: "Does our AI have all the context it needs to do its job well?" If the answer is "no," think of this post as a map for where to take the next step.
We're moving from an era of good model + good prompt to an era of good model + good context. From the user's side, that difference feels huge—"the AI is actually doing the work with me" versus "it forgot again." For your next project, we'd suggest tuning the context before the model.
Meetings Are the Biggest Blank Spot in Your Context
One last point to draw out of this. We said that getting external material to the model well is the starting point of context engineering—yet at most companies, the most important context isn't digitized anywhere.
Documents tidied up in Notion, messages left in Slack, tickets in Jira—these are only about 10% of a company's context. The other 90% is the decisions made in meetings, the reasons behind "why we settled on this," the concern someone voiced in passing—in other words, everything that was exchanged out loud. And this 90% vanishes the moment the meeting ends. No matter how good an AI agent's RAG is at digging through the internal wiki, it can't reach into this space.
Tiro is built to fill exactly that gap. It takes meeting audio and turns it into a form an AI agent can read directly (LLM-Ready Data). And it treats that not as the finish line but as the starting point, exposing that data to agents through four paths: API, MCP, CLI, and Skill.
What's interesting is that the context-breakdown patterns covered in this post are reflected directly in Tiro's own design. To prevent context confusion, the MCP uses a Progressive Disclosure pattern that spends tokens in stages (list → summary → full text); to prevent context distraction, large meeting transcripts are dropped to disk via the CLI, leaving only a receipt in the main context. If you're curious what these principles look like once they're translated into actual product design, we go deeper in How Tiro Remembers Business Context.
If you'd like to connect your company's meeting data to your AI agents, start below.