HeadGym PABLO

Why Agents Brought Grep Back—and Why Cursor Had to Rethink It

For decades, grep felt like a solved problem.

Editors grew smarter. IDEs learned syntax, types, symbols, and references. LSP standardized code navigation. Searching raw text became something you only reached for when everything else failed.

Then agents showed up.

Modern coding agents don’t “navigate” codebases the way humans do. They don’t jump to definitions or follow symbol graphs unless explicitly guided. What they do—constantly—is search for text. Literal strings. Config keys. Error messages. Macro names. Edge-case patterns buried in millions of lines of code.

And they do it with regex.

That immediately exposed a problem: even the fastest grep in the world still has to read every file.

In small repos, that’s fine. In enterprise monorepos, it’s catastrophic. Cursor routinely sees ripgrep calls taking 10–15 seconds. For an interactive agent loop, that’s an eternity.

So Cursor did something that feels obvious in hindsight but surprisingly rare in practice:
they treated regex search as a first‑class indexed operation, the same way IDEs treat “Go to Definition”.


The Core Insight: Regex Search Is Now Infrastructure

Agentic development changes the cost model of tooling.

Regex search is no longer an occasional developer action. It’s a hot path in the agent’s reasoning loop. Every delay compounds: slower searches mean longer agent chains, wasted tokens, and degraded reasoning quality.

Semantic search helps—but it doesn’t replace regex. Some questions are inherently textual:

  • Where is this constant defined?
  • What code checks this env var?
  • Where does this error string originate?
  • What writes to MAX_FILE_SIZE?

If the agent can’t answer those precisely, it thrashes.

The goal isn’t to replace grep—but to stop running grep against the entire repository every time.


Why Naive Indexing Doesn’t Work for Code + Regex

Classic search engines rely on inverted indexes over tokens—words, identifiers, terms. That breaks down fast for source code and regex:

  • Tokenization becomes language-specific and brittle
  • Regex doesn’t respect token boundaries
  • Identifiers, punctuation, and formatting matter

GitHub learned this the hard way with early code search. Token-based indexing couldn’t answer regex queries reliably, and results were noisy or incomplete.

The breakthrough that does generalize is character‑level indexing.


Trigrams: The Workhorse You’ve Probably Used Without Noticing

At the heart of most fast regex search systems is a simple idea: trigrams.

Every file is broken into overlapping three‑character sequences:

MAX_FILE_SIZE → MAX, AX_, X_F, _FI, FIL, ILE, LE_, E_S, _SI, SIZ, IZE

Those trigrams become keys in an inverted index. Each key points to the files where it appears.

When a regex comes in, you don’t run it everywhere. You decompose it into the trigrams it must contain, intersect the posting lists, and only scan the surviving candidates.

This already turns “scan the entire repo” into “scan a few dozen files”.
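To make the mechanics concrete, here is a minimal sketch of a trigram inverted index in Python. It is not Cursor's implementation: the `TrigramIndex` class and its method names are invented for illustration, and it handles only a literal substring rather than decomposing a full regex into required trigrams, which is what a real engine does.

```python
from collections import defaultdict

def trigrams(text):
    """All overlapping 3-character substrings of text."""
    return {text[i:i + 3] for i in range(len(text) - 2)}

class TrigramIndex:
    def __init__(self):
        # trigram -> set of files containing it (the "posting list")
        self.postings = defaultdict(set)

    def add(self, path, content):
        for t in trigrams(content):
            self.postings[t].add(path)

    def candidates(self, literal):
        """Files that could contain `literal`: intersect posting lists.
        A real engine derives the required trigrams from the regex itself."""
        needed = trigrams(literal)
        if not needed:
            return None  # too short to filter; caller must scan everything
        sets = [self.postings.get(t, set()) for t in needed]
        return set.intersection(*sets)
```

Only the surviving candidate files are then scanned with the actual regex, so the intersection can over-approximate but never miss a match.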

But trigrams alone have limits:

  • Indexes get large
  • Common trigrams explode posting lists
  • Query decomposition becomes a trade‑off between precision and speed

Cursor didn’t stop here.


Making Trigrams Smarter Instead of Bigger

The key design decision Cursor made was to optimize selectivity, not completeness.

Instead of indexing every trigram equally, Cursor applies ideas that have emerged in large-scale systems over the last decade.

Sparse n‑grams

Rather than extracting every overlapping trigram, Cursor uses sparse n‑grams—substrings of varying length chosen deterministically based on character‑pair weights.

Think of it as letting the data decide what’s interesting:

  • Rare character sequences get higher weight
  • Common sequences fade into the background
  • Only substrings that stand out relative to their neighbors get indexed

Indexing is more expensive up front. Querying becomes dramatically cheaper.

At query time, the engine extracts only a minimal covering set of n‑grams—often far fewer lookups than classic trigram approaches—while still narrowing the search space aggressively.

Frequency‑aware weighting

Instead of random weights, Cursor biases the system using real‑world statistics: character‑pair frequencies learned from massive open‑source corpora.

Unusual sequences—things that actually disambiguate code—become the strongest anchors in the index.

The result: fewer candidate files, faster searches, and less wasted work.
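One way to picture both ideas together is the sketch below. It is an illustrative heuristic, not Cursor's algorithm: character-pair weights are derived from a corpus by inverse frequency (rare pairs score high), and an n-gram is anchored wherever a pair outweighs both its neighbors. The function names and the local-maximum rule are assumptions made for the example; the important property is that selection is deterministic, so index time and query time make identical choices.

```python
from collections import Counter

def pair_weights(corpus):
    """Weight each character pair by inverse frequency: rare pairs score
    high. (Illustrative; Cursor learns statistics from large corpora.)"""
    counts = Counter(corpus[i:i + 2] for i in range(len(corpus) - 1))
    total = sum(counts.values())
    return {p: total / c for p, c in counts.items()}

def sparse_ngrams(text, weights, max_len=6):
    """Emit an n-gram wherever a character pair outweighs both neighbors.
    Deterministic, so indexer and query engine pick the same substrings."""
    grams = []
    for i in range(len(text) - 1):
        w = weights.get(text[i:i + 2], 0.0)
        left = weights.get(text[i - 1:i + 1], 0.0) if i > 0 else 0.0
        right = weights.get(text[i + 1:i + 3], 0.0) if i + 2 < len(text) else 0.0
        if w >= left and w >= right:  # locally most selective pair
            grams.append(text[i:i + max_len])
    return grams
```

Because only locally "heavy" substrings are emitted, a query needs far fewer index lookups than exhaustive trigram extraction, while the rare sequences that actually disambiguate code still anchor the search.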


Why Everything Runs Locally (and Why That Matters)

One of the more counterintuitive choices Cursor made is to keep this entire system on the client.

Regex indexes aren’t just lookup structures—they’re part of a pipeline that still ends with deterministic scanning of files. Shipping file contents to a server or syncing search state remotely introduces latency, complexity, and privacy risks.

Local indexing gives Cursor several advantages:

  • Zero network round‑trips in the agent loop
  • Immediate consistency when the agent edits code and searches again
  • No server‑side storage of proprietary repositories

To make this practical, the index is:

  • Disk‑backed
  • Memory‑mapped
  • Incrementally updated using a Git‑commit baseline plus a local write layer

Only a compact lookup table is memory‑resident. Posting lists are read directly from disk. Hash collisions may slightly widen results—but never produce incorrect matches.

False positives are acceptable. False negatives are not.
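That one-sided error guarantee falls out of the two-phase design, sketched below under assumptions of my own: n-grams are hashed into a small fixed bucket table (the only part that must stay in memory), collisions merge posting lists, and a final regex scan over the candidates filters out every false positive. `HashedIndex` and its layout are invented for illustration, not Cursor's on-disk format; Python's built-in `hash` is stable within a process, which is all the example needs.

```python
import re
from collections import defaultdict

N_BUCKETS = 1 << 12  # small fixed table: the only structure kept in memory

def buckets(text):
    # Hash each trigram into a fixed bucket space. A collision merges two
    # posting lists, which can only *add* candidate files, never drop one.
    return {hash(text[i:i + 3]) % N_BUCKETS for i in range(len(text) - 2)}

class HashedIndex:
    def __init__(self, files):
        self.files = files                  # path -> content
        self.postings = defaultdict(set)    # bucket -> paths
        for path, content in files.items():
            for b in buckets(content):
                self.postings[b].add(path)

    def search(self, pattern, literal):
        """Phase 1: intersect posting lists for a literal the regex must
        contain. Phase 2: run the real regex over the survivors, so any
        collision-induced false positive is filtered out."""
        cands = set(self.files)
        for b in buckets(literal):
            cands &= self.postings.get(b, set())
        rx = re.compile(pattern)
        return {p for p in cands if rx.search(self.files[p])}
```

Shrinking `N_BUCKETS` trades memory for extra candidate scans; correctness never depends on the table size, only speed does.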


Why This Changes Agent Behavior (Not Just Performance)

The most interesting outcome isn’t that searches are faster.

It’s that agents behave differently when search is cheap.

When regex queries return instantly:

  • Agents verify assumptions instead of guessing
  • Exploration becomes cheaper than speculation
  • Backtracking decreases
  • Debugging chains become shorter and more deterministic

In large repositories, this creates a qualitative shift. The agent stops hallucinating structure and starts interrogating reality.

Cursor’s internal benchmarks show large reductions in end‑to‑end task time—not because models think faster, but because they stop waiting on grep.


Indexes as Cognitive Infrastructure

This work points to something bigger.

As agents become primary users of developer tools, we need to rethink what we index, where we index it, and why. Semantic embeddings help with fuzzy recall. Graphs help with structure. But textual certainty still matters.

Regex indexing isn’t glamorous. It isn’t ML‑heavy. It’s systems work.

And that’s exactly why it works.

Cursor didn’t invent these algorithms—but it assembled them into something purpose‑built for agentic workflows, embedded directly into the editor, optimized for the worst case: massive repositories, constant mutation, and zero tolerance for latency.

In the age of agents, grep didn’t die.

It got an index—and finally caught up with the rest of the IDE.
