
The Post-MCP Era: Token Economics, Context Bloat, and the No-MCP Shift

Explore how to build an MCP server, the hidden costs of context bloat, and why token economics are driving the "No-MCP" shift in lean agentic engineering.


In my last article in August 2025, I promised we’d explore writing your own MCP server in this follow-up piece. Since then, there has been an unbelievable amount of news about AI, agents and the Model Context Protocol. To stay on top of this ever-evolving space, I started organizing and moderating an AI Monthly meeting at InCrowd. I also transitioned to entirely agent-driven development at the beginning of 2026. In that time, I've led and delivered a web project in 65 days that would normally take 100, migrated a legacy project from Next.js 12 and SCSS to Next.js 16 and Tailwind CSS within a couple of weeks, and shipped several smaller pieces of work, relying heavily on tools like the Figma and Chrome DevTools MCP servers.

Note

Before we dive into the details, I recommend reading my previous article, which features a practical demo of how to connect and use the Figma MCP server with gemini-cli to build a layout using Vue, Tailwind CSS, and Anime.js.

The Model Context Protocol (MCP) is an open-source standard designed by Anthropic to standardize how AI systems interact with external tools and resources. Let’s take a look at the structure of a standard TypeScript MCP server:

index.ts

import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";
import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js";
import { z } from "zod";

/**
 * We use an in-memory array to ensure this example is fully standalone.
 * For production, this would be replaced with a database client.
*/
interface Task {
  id: number;
  text: string;
  status: "todo" | "in-progress" | "done";
  priority: "low" | "medium" | "high" | "unassigned";
  estimatedMinutes: number | null;
}

let taskDatabase: Task[] = [
  { id: 1, text: "Draft MCP architecture article", status: "in-progress", priority: "high", estimatedMinutes: 120 },
  { id: 2, text: "Update documentation examples", status: "todo", priority: "unassigned", estimatedMinutes: null },
];

const server = new McpServer({
  name: "standalone-task-manager",
  description: "A server to add and triage tasks",
  version: "1.0.0"
});

/**
 * Tools allow the AI to perform side effects. 
 * We define a focused tool with a clear schema using Zod 
 * so the LLM knows exactly what data types to provide.
*/
server.tool("add_task",
  "Add a new task to the manager.",
  { 
    text: z.string().describe("Description of the task"),
    status: z.enum(["todo", "in-progress", "done"]).optional(),
    priority: z.enum(["low", "medium", "high"]).optional(),
    minutes: z.number().optional().describe("Estimated time in minutes")
  },

  async ({ text, status, priority, minutes }) => {
    const newTask: Task = {
      id: taskDatabase.length > 0 ? Math.max(...taskDatabase.map(t => t.id)) + 1 : 1,
      text,
      status: status ?? "todo",
      priority: priority ?? "unassigned",
      estimatedMinutes: minutes ?? null
    };
    taskDatabase.push(newTask);
    return { content: [{ type: "text", text: `Added task ${newTask.id}: ${text}` }] };
  }
);

/**
 * Resources act as read-only state indicators. 
 * We use the application/json mimeType so structured-data-aware LLMs 
 * can parse the state accurately.
*/
server.resource("current_tasks", "tasks://list",
  async (uri) => ({
    contents: [{
      uri: uri.href,
      text: JSON.stringify(taskDatabase, null, 2),
      mimeType: "application/json"
    }]
  })
);

/**
 * Prompts define templated workflows. 
 * They are beneficial for guiding the model toward specific reasoning patterns,
 * like estimating complexity before taking an action.
*/
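// The description is passed as a plain string. An object in this position
// would be interpreted as the prompt's argument schema.
server.prompt("triage_tasks",
  "Analyze the task list and propose follow-up tasks",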
  () => ({
    messages: [{
      role: "user",
      content: {
        type: "text", 
        text: `Please examine the tasks in tasks://list. 
        Based on the current tasks, identify a logical next step or follow-up task.
        1. Estimate the total time required in minutes for this new task.
        2. Assign a priority (low/medium/high) based on its impact.
        3. Use the add_task tool to add it to the list.`
      }
    }]
  })
);

async function main() {
  const transport = new StdioServerTransport();
  await server.connect(transport);
}

main().catch(console.error);

package.json

{
  "name": "standalone-task-manager",
  "version": "1.0.0",
  "description": "A standalone MCP server for task management",
  "type": "module",
  "main": "dist/index.js",
  "scripts": {
    "build": "tsc",
    "watch": "tsc --watch",
    "start": "node dist/index.js"
  },
  "dependencies": {
    "@modelcontextprotocol/sdk": "^1.0.0",
    "zod": "^3.23.8"
  },
  "devDependencies": {
    "@types/node": "^22.0.0",
    "typescript": "^5.6.0"
  }
}

Bash

# Install dependencies
npm install

# Compile the code to the dist/ directory
npm run build

# For best practice debugging
# https://github.com/modelcontextprotocol/inspector
# Spins up a local web app where you can manually trigger tools, read resources, and inspect the transport layer cleanly
npx @modelcontextprotocol/inspector node dist/index.js

mcp.json

{
  "mcpServers": {
    "standalone-task-manager": {
      "command": "node",
      "args": [
        "/absolute/path/to/your-project/dist/index.js"
      ]
    }
  }
}

At the top of index.ts, we import McpServer from the @modelcontextprotocol/sdk package. It lets us register the tools, resources and prompts that the server offers to your AI agent.

Tools let the AI take action, such as adding an entry to our task array. Resources give the AI read access to live state, such as the existing task list. Prompts are reusable instruction templates: in the example above, the triage_tasks prompt tells the model to review the list and add a follow-up task with a time estimate and a priority.

To build dist/index.js, put index.ts and package.json in one folder and run the npm (Node package manager) commands shown in the "Bash" tab. Note that tsc also needs a tsconfig.json that compiles index.ts into dist/, which is not shown in the tabs above. Afterwards, you can either debug tools, resources and prompts individually with the @modelcontextprotocol/inspector command (also in the Bash tab), or connect the server to your agent right away by adding it as shown in the mcp.json tab above.

WebMCP is a recent W3C standard proposal introduced by Google and Microsoft. It takes the core MCP paradigm a step further by allowing websites to natively expose tools, resources and prompts directly to browser-based AI agents via a standard client-side API. Because the native API is still rolling out, developers currently utilize the open-source MCP-B (WebMCP Bridge) polyfill to support newer WebMCP features across all browsers. Here is what that looks like in practice:

// Implemented via @mcp-b/global (MCP-B Polyfill) to enable Resources & Prompts
import '@mcp-b/global'; 

// 1. TOOL: Actionable function (Mutation)
navigator.modelContext.registerTool({
  name: "add_to_cart",
  description: "Adds a specific product to the user's shopping cart.",
  inputSchema: { 
    type: "object", 
    properties: {
      productId: { type: "string", description: "The ID or name of the product" },
      quantity: { type: "number", default: 1 }
    },
    required: ["productId"]
  },

  // The schema's "default" is only an annotation, so default quantity here as well.
  async execute({ productId, quantity = 1 }) {
    window.dispatchEvent(new CustomEvent('cart:add', { detail: { productId, quantity } }));
    return { content: [{ type: "text", text: `Added ${quantity}x ${productId} to cart.` }] };
  }
});

// 2. RESOURCE: Real-time context (State)
navigator.modelContext.registerResource({
  uri: "site://cart/items",
  name: "Current Cart Contents",
  description: "A live JSON list of items in the user's shopping cart",
  async read() {
    const items = JSON.parse(localStorage.getItem('cart') || '[]');
    return { contents: [{ uri: "site://cart/items", text: JSON.stringify(items), mimeType: "application/json" }] };
  }
});

// 3. PROMPT: Discovery Assistant (Interrogative)
navigator.modelContext.registerPrompt({
  name: "find_fitting_products",
  description: "Asks the user 3 specific questions to find the best product for them",
  async execute() {
    return {
      messages: [{
        role: "user",
        content: { 
          type: "text", 
          text: `I want you to act as a Personal Shopping Assistant. 
Before making any recommendations, look at the items currently in 'site://cart/items'. 
Then, ask me exactly three targeted questions to understand my specific needs, skill level, and budget. Once I answer, use your tools to suggest the best fitting product from the store.` 
        }
      }]
    };
  }
});

With this architecture, an autonomous agent browsing an e-commerce platform can dynamically discover tool, resource and prompt definitions to inspect the shopping cart, ask targeted product questions, add items to the cart, and execute checkouts, all without the user or developer having to use or build a custom API integration.

Despite their massive adoption, MCP servers have their disadvantages. A growing "No-MCP" movement argues that the architecture introduces structural overhead where lightweight alternatives could perform better. To understand why, we have to look closely at token economics and context window management when interacting with a Large Language Model (LLM).

Every prompt sent to an LLM like Gemini, Claude or ChatGPT is a stateless payload handled by an API. The provider tokenizes the text, and the model runs those tokens through its billions of parameters to predict the answer one token at a time. The currency of this ecosystem is the token (~4 characters of English text). The larger your input payload, the more inference compute the model provider spends, which directly increases latency and cost.

For instance, at the time of writing, Google's "Gemini 3 Flash Preview" costs $0.50 per 1M input tokens and $3 per 1M output tokens for regular text. While that sounds cheap, a bloated development loop that runs hundreds of times a day accumulates massive overhead if the baseline context payload is weighed down by heavy schemas.
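To make the arithmetic concrete, here is a minimal sketch that prices a request using the rates quoted above and the rough 4-characters-per-token heuristic. The function names and the workload figures (20,000 schema tokens, 500 turns per day) are illustrative assumptions, not measurements.

```typescript
// Prices quoted in the article for Gemini 3 Flash Preview (USD per 1M tokens).
const INPUT_USD_PER_MILLION = 0.5;
const OUTPUT_USD_PER_MILLION = 3;

// Rough heuristic from the text: about 4 characters of English per token.
// Real tokenizers differ, so treat the result as an estimate only.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Cost in USD of one request with the given input and output token counts.
function requestCostUsd(inputTokens: number, outputTokens: number): number {
  return (
    (inputTokens / 1_000_000) * INPUT_USD_PER_MILLION +
    (outputTokens / 1_000_000) * OUTPUT_USD_PER_MILLION
  );
}

// Hypothetical workload: 20,000 tokens of tool schemas resent on 500 turns a day.
const schemaTokens = 20_000;
const turnsPerDay = 500;
const dailySchemaCostUsd = requestCostUsd(schemaTokens * turnsPerDay, 0);

console.log(estimateTokens("Look at the following Figma design"));
console.log(`Schemas alone: $${dailySchemaCostUsd.toFixed(2)} per day`);
```

Under these assumptions the schemas account for 10M input tokens per day. The figure scales linearly with schema size, turn count and the model's input price.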

| Context Window Tokens | Input from AI Agent | Output from AI Model |
| --- | --- | --- |
| 200 | [Setup] AI Agent with System prompt e.g. "You are a helpful AI Agent, helping the user with Coding Web application" (+200 tokens) | |
| +600 = 800 | [Setup] MCP Server Feature (Tools, Resources, Prompts) definitions (+600 tokens) | |
| +400 = 1200 | [Manual Input] Prompt: "Look at the following Figma Design [link] to code the Fixtures Carousel with TypeScript and React" (+400 tokens) | |
| +200 + 400 = 1800 | | [Answer] "Sure" Editing File 1 and File 2 with [Code] (+200 Thinking & 400 Output tokens) |
| Reset = 0 | [Auto Prompt / Compaction] Maximum Context window hit (e.g., 2000 tokens). Agent sends automatic "Summarize the current conversation..." prompt and resets window. | |
| +1000 = 1000 | | [Answer] Summary of previous context window: "We are building a feature from a Figma layout... We already edited File 1 and 2..." (+1000 tokens) |

  • Setup: Initial configuration, system instructions, and tool/MCP definitions.
  • Manual Input: Direct request or data provided by the user.
  • Answer: AI model response (includes thinking process and generated output).
  • Automatic Prompt: System-triggered action for context management (e.g., compaction/summarization).

As this sample timeline demonstrates, an AI agent doesn't evaluate your most recent prompt in a vacuum. Instead, it continuously reads a trailing window of your entire conversation history to maintain context.

A turn represents a user prompt combined with the model's generated answer. Every turn appends text to the active context window, where it stays until the window is compacted.

When the cumulative token count of these turns breaches a safe operating threshold, the agent client runs an automated garbage-collection routine called compaction. It sends a behind-the-scenes prompt asking the model to compress the entire conversation history into a condensed "summary checkpoint". This drops raw details but keeps a lightweight summary of key objectives, findings and completed tasks to pass into subsequent turns, so the session can continue instead of failing once it hits the maximum number of context tokens.
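The compaction routine can be sketched roughly as follows. This is a simplified model, not how any particular agent implements it: the `Entry` type, the `appendWithCompaction` helper and the 1,800-token threshold are assumptions, and `fakeSummarize` stands in for the real model call that writes the summary.

```typescript
// One entry in the conversation history with its token count.
interface Entry {
  role: "system" | "user" | "assistant";
  text: string;
  tokens: number;
}

// Hypothetical stand-in for the model call that writes the summary.
type Summarize = (entries: Entry[]) => Entry;

const totalTokens = (entries: Entry[]): number =>
  entries.reduce((sum, entry) => sum + entry.tokens, 0);

// Appends an entry and compacts the history once the threshold is reached.
function appendWithCompaction(
  entries: Entry[],
  next: Entry,
  thresholdTokens: number,
  summarize: Summarize
): Entry[] {
  const updated = [...entries, next];
  if (totalTokens(updated) < thresholdTokens) {
    return updated;
  }
  // Replace the raw history with a single summary checkpoint.
  return [summarize(updated)];
}

// Usage with the token counts from the sample timeline.
const fakeSummarize: Summarize = (entries) => ({
  role: "assistant",
  text: `Summary of ${entries.length} earlier entries`,
  tokens: 1000,
});

let conversation: Entry[] = [];
const steps: Entry[] = [
  { role: "system", text: "System prompt", tokens: 200 },
  { role: "system", text: "MCP definitions", tokens: 600 },
  { role: "user", text: "Build the Fixtures Carousel", tokens: 400 },
  { role: "assistant", text: "Edited File 1 and File 2", tokens: 600 },
];
for (const step of steps) {
  conversation = appendWithCompaction(conversation, step, 1800, fakeSummarize);
}
console.log(conversation.length, totalTokens(conversation));
```

After the fourth entry the history reaches the assumed threshold and collapses into one 1,000-token summary, mirroring the reset in the timeline.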

When you connect to an MCP server, you declare its path or command in your local agent configuration. The underlying catch is that this awareness is not a one-time registration handshake. Because LLM APIs are entirely stateless, your agent application must include the definition schemas for all enabled MCP server tools, resources and prompts in the prompt payload of every single turn.

As a multi-turn workflow unfolds, these lengthy JSON definitions remain trapped in your trailing history payload. You pay a compounding token tax just to remind the model that those capabilities still exist. This constant structural overhead rapidly accelerates the compaction timeline, leading to what the community calls Context Bloat or Context Rot.
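A small sketch of that compounding effect: because each request resends the fixed overhead plus all earlier turns, the billed input grows with every turn. The function name and the session figures are assumptions for illustration, and the model ignores compaction and any prompt-caching discounts a provider may offer.

```typescript
// Total input tokens billed across a session when every request resends
// the fixed overhead (system prompt + tool schemas) plus all earlier turns.
function billedInputTokens(
  fixedOverheadTokens: number,
  tokensAddedPerTurn: number,
  turns: number
): number {
  let billed = 0;
  let historyTokens = 0;
  for (let turn = 1; turn <= turns; turn++) {
    // Each request carries the overhead and everything said in earlier turns.
    billed += fixedOverheadTokens + historyTokens;
    historyTokens += tokensAddedPerTurn;
  }
  return billed;
}

// Hypothetical session: 10 turns that each add 1,000 tokens of conversation.
// 200 = system prompt, 600 = MCP definitions, as in the sample timeline.
const withSchemas = billedInputTokens(200 + 600, 1000, 10);
const withoutSchemas = billedInputTokens(200, 1000, 10);

console.log(withSchemas, withoutSchemas);
console.log(withSchemas - withoutSchemas); // 6000: the 600 schema tokens, paid 10 times
```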

Context Bloat occurs when your context includes a lot of information that is irrelevant to the majority of prompts you send. Imagine a frontend engineer working with a Figma MCP server. The server exposes dozens of tools to get context, screenshots and metadata for design layouts. If you shift gears mid-session to ask a localized TypeScript refactoring question, those heavy Figma tool definitions are still injected into the prompt payload.

When this happens, you experience two immediate penalties:

  1. Financial Overhead: Every following turn costs more because you are paying for redundant schema text over and over.

  2. Degraded Execution Quality: The model's attention is divided across unused tool definitions, causing it to lose focus on the actual codebase problem at hand, which results in worse answers.

This directly forces the agent into early compaction. To preserve long-term coherence within a restricted window, agents use historical summaries once a specific token threshold is breached. If an MCP schema occupies 60% of your maximum window budget, the agent is forced to constantly compress your actual code conversations into compact summaries far sooner than it normally would, stripping away critical implementation details.
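As a back-of-the-envelope check of the 60% scenario, the sketch below counts how many turns fit before the compaction threshold is reached. The threshold of 100,000 tokens and the 2,000 tokens per turn are hypothetical figures, and `turnsBeforeCompaction` is an illustrative helper.

```typescript
// Number of whole turns that fit before the context reaches the threshold,
// given a fixed overhead that is present from the first turn.
function turnsBeforeCompaction(
  thresholdTokens: number,
  fixedOverheadTokens: number,
  tokensPerTurn: number
): number {
  const available = thresholdTokens - fixedOverheadTokens;
  // Guard against invalid input and against overhead that fills the window.
  if (available <= 0 || tokensPerTurn <= 0) {
    return 0;
  }
  return Math.floor(available / tokensPerTurn);
}

// Hypothetical budget: 100,000-token threshold, 2,000 tokens per turn.
const leanTurns = turnsBeforeCompaction(100_000, 5_000, 2_000); // 5% overhead
const bloatedTurns = turnsBeforeCompaction(100_000, 60_000, 2_000); // 60% schemas

console.log(leanTurns, bloatedTurns);
```

With these numbers the lean session gets 47 turns before its first compaction, while the schema-heavy session gets 20.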

To mitigate context bloat, engineering teams at the forefront of the ecosystem are shifting towards lazy loading instructions and relying on capable AI models to extract what they need directly from your code or API definitions. This strategy is far more token-efficient than injecting massive, manually written tool descriptions upfront.

Anthropic popularized this via the SKILL.md specification. Instead of registering explicit structural schemas for every possible operation upfront, a skill is a standard Markdown file: a lightweight YAML frontmatter header names and describes the skill, and the body explains how to do the task (e.g. pre-pr-security-audit/SKILL.md):

---
name: pre-pr-security-audit
description: Scans staged changes for OWASP vulnerabilities, secrets, and logic flaws to ensure a "zero-comment" PR approval.
---

# Pre-PR Security & Quality Audit
A comprehensive "self-review" protocol designed to catch blockers, secrets, and nitpicks before code reaches a human reviewer.

## When to use
Trigger this skill when a developer says "I'm ready to push," "Review my staged changes," or "Check this for security" to ensure the submission is production-ready and friction-free.

## How it works
1.  **Analyze Diffs:** Use `git diff --cached` to isolate new logic and ignore unrelated noise in the codebase.
2.  **Secret & Vulnerability Scan:** Search for high-entropy strings (keys/tokens) and cross-reference new patterns against **OWASP Top 10** risks (e.g., SQL injection, insecure direct object references).
3.  **Logic & Resilience:** 
    *   **Edge Cases:** Verify how the code handles `null`, `empty`, or `undefined` inputs.
    *   **Error Handling:** Ensure all new I/O operations (API calls/DB queries) have explicit `try/catch` blocks.
4.  **Style & "Nitpick" Check:** 
    *   Validate naming conventions (descriptive vs. generic).
    *   Confirm that new functionality is reflected in the **README** or relevant documentation.
    *   Check for 1:1 test coverage on new logic.
5.  **Summary:** Output a "Pass/Fail" report categorized by **Critical Blockers** (Security) and **Suggested Refactors** (Style).

With this setup, the agent loads only the name and description of each available skill at startup and keeps that short listing in its context. When you submit a prompt, the model compares your request against those descriptions and picks the skills that match your intent. Only for these matched skills does the agent read the full Markdown instructions below the closing "---" separator into the active context window. This shifts the payload cost from a permanent tax to a just-in-time asset.
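The lazy-loading idea can be sketched like this: keep only each skill's name and description available up front, and return the body when a skill is selected. How the agent decides which skill to select is deliberately left out. The helper names are hypothetical, and the hand-rolled parser only handles flat "key: value" frontmatter; a real implementation would use a YAML library.

```typescript
// Metadata kept available at startup; the body is only read on demand.
interface SkillMetadata {
  name: string;
  description: string;
}

interface ParsedSkill {
  metadata: SkillMetadata;
  body: string;
}

// Splits a SKILL.md string into frontmatter fields and Markdown body.
function parseSkill(markdown: string): ParsedSkill {
  const match = /^---\r?\n([\s\S]*?)\r?\n---\r?\n?([\s\S]*)$/.exec(markdown);
  if (!match) {
    throw new Error("SKILL.md is missing its frontmatter block");
  }
  const fields = new Map<string, string>();
  for (const line of match[1].split(/\r?\n/)) {
    const separator = line.indexOf(":");
    if (separator === -1) continue;
    fields.set(line.slice(0, separator).trim(), line.slice(separator + 1).trim());
  }
  const skillName = fields.get("name");
  const description = fields.get("description");
  if (!skillName || !description) {
    throw new Error("Frontmatter needs both name and description");
  }
  return { metadata: { name: skillName, description }, body: match[2].trim() };
}

// Startup: only this short listing is made available to the model.
function startupListing(skills: ParsedSkill[]): string {
  return skills.map((s) => `- ${s.metadata.name}: ${s.metadata.description}`).join("\n");
}

// On demand: the full instructions of one selected skill.
function loadSkillBody(skills: ParsedSkill[], selected: string): string {
  const skill = skills.find((s) => s.metadata.name === selected);
  if (!skill) throw new Error(`Unknown skill: ${selected}`);
  return skill.body;
}

const exampleSkill = parseSkill(
  "---\nname: pre-pr-security-audit\ndescription: Scans staged changes before a PR.\n---\n\n# Pre-PR Security & Quality Audit\nFull instructions go here."
);
console.log(startupListing([exampleSkill]));
console.log(loadSkillBody([exampleSkill], "pre-pr-security-audit"));
```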

Since then, Anthropic has released MCP Tool Search as a native, automatic feature within Claude Code. Rather than loading all MCP server tools right away, it indexes your connected tool descriptions and lazy-loads tools dynamically based on your specific prompt. However, this only delays the inevitable. The moment your prompt triggers a match or, worse, matches multiple MCP server tools, the agent dumps all of those schemas straight into the context window. From that turn onwards, those definitions are trapped in your trailing history, sticking you with the exact same compounding token tax on every subsequent prompt.

A more sophisticated approach is Cloudflare's Code Mode MCP Server, which is now considered the best practice for an MCP Server with more than 2 tools. Rather than exposing hundreds of distinct API endpoints as individual tools, Cloudflare exposes only a search() and an execute() tool. One tool searches their embedded OpenAPI specification to find the right API and schema. The other tool executes the needed API calls in a secure JavaScript sandbox, providing the agent with the tools it needs and the results of their execution.
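The shape of that two-tool surface can be illustrated with a toy example. This is not Cloudflare's implementation: the catalogue entries are hypothetical, and `execute` here is a plain lookup, whereas a real server would run model-written code in an isolated sandbox. The point is only that the model sees two tool definitions no matter how many operations sit behind them.

```typescript
// A tiny in-memory stand-in for an API catalogue (hypothetical entries).
interface Operation {
  id: string;
  summary: string;
  run: (params: Record<string, string>) => string;
}

const catalogue: Operation[] = [
  { id: "zones.list", summary: "List all zones in an account", run: () => "zone-a, zone-b" },
  {
    id: "dns.create",
    summary: "Create a DNS record in a zone",
    run: (params) => `created ${params["recordName"] || "record"}`,
  },
  { id: "cache.purge", summary: "Purge cached files for a zone", run: () => "purged" },
];

// Tool 1: return only the ids and summaries of matching operations.
function search(query: string): Array<{ id: string; summary: string }> {
  const terms = query.toLowerCase().split(/\s+/).filter(Boolean);
  return catalogue
    .filter((op) => terms.every((term) => `${op.id} ${op.summary}`.toLowerCase().includes(term)))
    .map(({ id, summary }) => ({ id, summary }));
}

// Tool 2: run one operation by id (simplified; no sandbox in this sketch).
function execute(id: string, params: Record<string, string> = {}): string {
  const op = catalogue.find((candidate) => candidate.id === id);
  if (!op) throw new Error(`Unknown operation: ${id}`);
  return op.run(params);
}

console.log(search("dns record"));
console.log(execute("dns.create", { recordName: "www" }));
```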

As major platforms like GitHub Copilot move toward token-based billing, and with reports that even global players like Microsoft are moving away from frontier models like Claude and adjusting their pricing tiers, token optimization is becoming a core discipline of software engineering.

We have to ask ourselves: are all the local MCP servers running in our background truly necessary? Or could a meticulously written SKILL.md paired with a native CLI tool or a direct API key with a cheaper AI model deliver identical execution accuracy at a fraction of the token cost? In the age of agentic engineering, keeping your context lean is the ultimate competitive advantage.
