HeadGym PABLO
Tags: #ai_and_agents #software_engineering

What Anthropic Did Not Tell You About Claude Code

The Reverse Engineering Report That Changes Everything

I reverse-engineered Claude Code’s leaked source code against billions of tokens from my own agent logs. What I discovered is stark: Anthropic is acutely aware of Claude Code’s hallucination and reliability problems—and the fixes exist. They’re just gated behind employee verification.

The internal documentation reveals a 29-30% false-claims rate on the current model. Anthropic built the solutions. Then they kept them for themselves.

Here’s what you need to know to work around it.


1. The Employee-Only Verification Gate: Success Without Verification

The Problem

You ask Claude Code to edit three files. It completes the task and reports “Done!” with the enthusiasm of a fresh intern desperate for approval. You open the project to find 40 errors.

Here’s why: In services/tools/toolExecution.ts, the agent’s success metric is brutally simple—did the write operation complete? Not “does the code compile.” Not “did I introduce type errors.” Just: did bytes hit disk? If yes, the task is marked complete.

The Hidden Fix

The source contains explicit post-edit verification instructions. The agent is designed to check that all tests pass, run the script, confirm output. These instructions exist in the code. But they’re gated behind a single condition:

process.env.USER_TYPE === 'ant'

Anthropic employees get verification. Everyone else doesn’t.

The Workaround

You must inject the verification loop manually. In your CLAUDE.md system prompt, make post-edit verification non-negotiable:

After every file modification, before reporting success, you must run npx tsc --noEmit and npx eslint . --quiet. If either check fails, fix the errors and re-run. Only report completion when both checks pass.

This single override eliminates the false-claims problem. It’s not optional—it’s the difference between usable output and technical debt.
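The verification loop can also be scripted instead of trusted entirely to the prompt. A minimal sketch of the gate logic, with the real checks (`npx tsc --noEmit`, `npx eslint . --quiet`) shown in a comment and harmless placeholders used so the sketch runs anywhere:

```shell
# verify: run each check in order; report success only if every one passes.
verify() {
  for cmd in "$@"; do
    if ! $cmd >/dev/null 2>&1; then
      echo "FAILED: $cmd"
      return 1
    fi
  done
  echo "VERIFIED"
}

# In a real project, substitute the actual checks:
#   verify "npx tsc --noEmit" "npx eslint . --quiet"
verify "true" "true"
```

Wiring this into a pre-commit hook or wrapper script means the gate holds even when the agent ignores the prompt.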


2. Context Death Spiral: The 167K Token Cliff

The Problem

You start a long refactor. The first 10 messages are surgical, precise, intelligent. By message 15, the agent hallucinates variable names, references functions that don’t exist, and breaks things it understood perfectly five minutes ago. It feels like degradation. It’s not.

It’s amputation.

The Root Cause

services/compact/autoCompact.ts runs a compaction routine when context pressure crosses approximately 167,000 tokens. When it fires, the system keeps 5 files (capped at 5K tokens each), compresses everything else into a single 50,000-token summary, and discards every file read, every reasoning chain, every intermediate decision.

All of it. Gone.

The system doesn’t warn you. The agent doesn’t know what was lost. It continues working from a lossy representation.

Why It Accelerates

Dirty, sloppy, “vibecoded” codebases accelerate this collapse. Every dead import, every unused export, every orphaned prop is eating tokens that contribute nothing to the task but everything to triggering compaction earlier.

The Workaround

Step 0 of any refactor must be deletion—not restructuring, but aggressive pruning of dead weight.

  1. Strip dead props, unused exports, orphaned imports, debug logs
  2. Commit this separately
  3. Only then start the real work with a clean token budget
  4. Keep each phase under 5 files to prevent compaction mid-task

This isn’t cleanup—it’s token budget management. Treat it as a prerequisite, not a nice-to-have.
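To make the budget concrete before pruning, you can rank files by a rough per-file token estimate. The ~4 characters-per-token ratio below is a common heuristic for English-heavy code, not Claude’s actual tokenizer, so treat the numbers as relative, not exact:

```shell
# estimate_tokens: rough token count for a file (~4 chars per token heuristic).
estimate_tokens() {
  chars=$(wc -c < "$1")
  echo $(( chars / 4 ))
}

# Rank candidates for deletion by estimated token cost, largest first:
#   for f in src/*.ts; do echo "$(estimate_tokens "$f") $f"; done | sort -rn
printf 'export const x = 1;\n' > /tmp/sample.ts
estimate_tokens /tmp/sample.ts
```

Files near the top of that ranking are the ones dragging you toward the compaction threshold fastest.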


3. The Brevity Mandate: System Prompt vs. Your Intent

The Problem

You ask Claude Code to fix a complex architectural bug. Instead of addressing the root cause, it adds a messy if/else band-aid and moves on. You assume it’s being lazy. It’s not. It’s being obedient.

The Hidden Instructions

constants/prompts.ts contains explicit system-level directives that actively fight your intent:

  • “Try the simplest approach first.”
  • “Don’t refactor code beyond what was asked.”
  • “Three similar lines of code is better than a premature abstraction.”

These aren’t suggestions. They’re system-level instructions that define what “done” means. Your prompt says “fix the architecture.” The system prompt says “do the minimum amount of work you can.” System prompt wins unless you override it.

The Workaround

You must redefine what “minimum” and “simple” mean. Override the brevity mandate explicitly:

What would a senior, experienced, perfectionist developer reject in code review? Fix all of it. Don’t optimize for brevity—optimize for correctness and maintainability. Assume you’re building for a team that will maintain this code for years.

You’re not adding requirements. You’re reframing what constitutes acceptable work.


4. The Agent Swarm Nobody Told You About: Unused Parallelism

The Problem

You ask the agent to refactor 20 files. By file 12, it’s lost coherence on file 3. Obvious context decay. What’s less obvious—and deeply frustrating—is that Anthropic built the solution and never surfaced it.

The Hidden Architecture

utils/agentContext.ts shows that each sub-agent runs in its own isolated AsyncLocalStorage with:

  • Own memory
  • Own compaction cycle
  • Own token budget

There is no hardcoded MAX_WORKERS limit in the codebase. Anthropic built a multi-agent orchestration system with no ceiling—and left you using a single agent like it’s 2023.

One agent has approximately 167K tokens of working memory. Five parallel agents = 835K tokens of total capacity.

The Workaround

Force sub-agent deployment. Batch files into groups of 5-8 and launch them in parallel. Each gets its own context window:

Batch 1 (files 1-5): Agent A
Batch 2 (files 6-10): Agent B
Batch 3 (files 11-15): Agent C
Batch 4 (files 16-20): Agent D

This isn’t a performance optimization—it’s a correctness requirement for large refactors.
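The batching pattern can be driven from the shell. In this sketch the actual `claude -p` headless invocation is commented out (flag support may differ across CLI versions, so verify against your install) and an echo stands in so the orchestration logic is runnable:

```shell
# run_batch: one agent per batch, each with its own isolated context window.
run_batch() {
  # Assumed real invocation (headless/print mode; check your CLI's flags):
  #   claude -p "Refactor the following files: $*"
  echo "batch done: $*"
}

run_batch src/a.ts src/b.ts &
run_batch src/c.ts src/d.ts &
wait   # block until every batch finishes before integrating results
echo "all batches complete"
```

The `wait` barrier matters: integration (resolving cross-batch imports, running the type checker) should only start after every sub-agent has finished its slice.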


5. The 2,000-Line Blind Spot: Silent Truncation

The Problem

The agent “reads” a 3,000-line file, then makes edits referencing code from line 2,400 that it clearly never processed. The edits break because the agent was working from incomplete information.

The Root Cause

tools/FileReadTool/limits.ts hard-caps each file read at 2,000 lines / 25,000 tokens. Everything past that is silently truncated. The agent doesn’t know what it didn’t see. It doesn’t warn you. It hallucinates the rest and keeps going.

The Workaround

Any file over 500 LOC gets read in chunks using offset and limit parameters. Never let the agent assume a single read captured the full file:

Read file with offset=0, limit=500
Read file with offset=500, limit=500
Read file with offset=1000, limit=500
...continue until EOF

If you don’t enforce this, you’re trusting edits against code the agent literally cannot see.
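The chunking loop is easy to emulate locally to see what the agent should be doing. A sketch using `sed` as a stand-in for the Read tool’s offset/limit parameters:

```shell
# chunked_read: print a file in windows of $limit lines until EOF,
# mimicking repeated Read calls with an increasing offset.
chunked_read() {
  file=$1; limit=$2; offset=1
  total=$(( $(wc -l < "$file") ))
  while [ "$offset" -le "$total" ]; do
    end=$(( offset + limit - 1 ))
    sed -n "${offset},${end}p" "$file"
    offset=$(( end + 1 ))
  done
}

seq 1 1200 > /tmp/big_file.txt
chunked_read /tmp/big_file.txt 500 | wc -l   # all 1200 lines covered
```

Three reads of 500 lines cover what a single capped read would silently truncate.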


6. Tool Result Blindness: The 50K Character Trap

The Problem

You ask for a codebase-wide grep. It returns “3 results.” You check manually—there are 47.

The Mechanism

utils/toolResultStorage.ts shows that tool results exceeding 50,000 characters get persisted to disk and replaced with a 2,000-byte preview. The agent works from the preview. It doesn’t know results were truncated. It reports 3 because that’s all that fit in the preview window.

The Workaround

Scope narrowly. If results look suspiciously small, re-run directory by directory. When in doubt, assume truncation happened and verify manually:

grep pattern src/
grep pattern src/components/
grep pattern src/utils/
grep pattern src/services/

This is tedious. It’s also necessary.
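The per-directory pass can be scripted so the counts are compared mechanically instead of eyeballed. A sketch against a throwaway fixture (the paths and pattern are illustrative):

```shell
# Fixture: two directories, one match each.
mkdir -p /tmp/demo/src/components /tmp/demo/src/utils
echo 'usePattern()' > /tmp/demo/src/components/a.ts
echo 'usePattern()' > /tmp/demo/src/utils/b.ts

# Sum per-directory counts, then compare against the single top-level run.
total=0
for d in /tmp/demo/src/*/; do
  n=$(grep -r "usePattern" "$d" | wc -l)
  echo "$d -> $n matches"
  total=$(( total + n ))
done
echo "per-directory total: $total"
echo "top-level total: $(grep -r "usePattern" /tmp/demo/src | wc -l)"
```

If the agent’s reported count and the summed per-directory count diverge, the tool result was truncated and the agent is working from the preview.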


7. Grep Is Not an AST: Text Matching vs. Semantic Understanding

The Problem

You rename a function. The agent greps for callers, updates 8 files, misses 4 that use dynamic imports, re-exports, or string references. The code compiles in the files it touched. Of course, it breaks everywhere else.

Why It Happens

Claude Code has no semantic code understanding. GrepTool is raw text pattern matching. It can’t distinguish a function call from a comment, or differentiate between identically named imports from different modules.

The Workaround

On any rename or signature change, force separate searches for:

  • Direct calls: functionName(
  • Type references: typeof functionName
  • String literals: "functionName"
  • Dynamic imports: import( patterns
  • Require calls: require() patterns
  • Re-exports: export { functionName }
  • Barrel files: index.ts files
  • Test mocks: jest.mock() patterns

Assume grep missed something. Verify manually or accept the regression.
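The checklist translates directly into a loop of fixed-string searches. A sketch using a hypothetical doWork rename against a throwaway fixture containing exactly the reference styles a naive caller-grep misses:

```shell
# Fixture: references that grepping for "doWork(" alone would never find.
mkdir -p /tmp/proj
cat > /tmp/proj/a.ts <<'EOF'
export { doWork } from "./impl";
const handler = registry["doWork"];
EOF

# One search per reference style; -F treats each pattern as a fixed string,
# -o prints one line per match so wc -l counts hits.
for pat in 'doWork(' 'typeof doWork' '"doWork"' 'export { doWork'; do
  hits=$(grep -roF -- "$pat" /tmp/proj | wc -l)
  echo "$pat -> $hits"
done
```

Here the direct-call search finds nothing while the string-literal and re-export searches each find a reference, which is exactly the blind spot the checklist exists to cover.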


The Enterprise Implication

Anthropic has demonstrated that Claude Code can be reliable for their employees. The gap between internal capabilities and public product reveals a strategic choice: optimize for internal productivity, ship a capable-but-flawed product externally, and document the gap internally.

This isn’t incompetence. It’s triage.

For enterprises betting on Claude Code for mission-critical development, the question is whether these workarounds are sufficient, or whether the architectural limitations are too fundamental to overcome through prompt engineering alone.

The answer depends on your tolerance for friction, your team’s discipline, and whether you can enforce these patterns consistently across a large codebase.

What Anthropic didn’t tell you is that you can make Claude Code reliable. You just have to do the work they reserved for themselves.
