AI coding agents using grep/ripgrep waste thousands of tokens and context on false positives. CodeGrok MCP uses AST-based semantic search with local vector embeddings.

CodeGrok MCP: Semantic Code Search That Saves AI Agents 10x in Context Usage

When you ask Claude Code, Cursor, or Windsurf "how does authentication work in this project?", here's what actually happens behind the scenes:

    $ grep -r "authentication" src/
    src/auth/login.py:42:def verify_user(username, password):
    src/models.py:10:user_email = "user@example.com"
    src/config.py:5:# authentication settings
    src/utils.py:150:verify_user_input()
    ... 30+ more results, mostly noise

The agent then reads entire files to understand context. For a 10,000-file codebase, this means burning thousands of tokens of context per query, tokens that could be answering your actual question.

I built CodeGrok MCP to fix this.

What CodeGrok Actually Does

CodeGrok MCP takes a fundamentally different approach: AST-based semantic indexing that runs entirely on your machine. No cloud. No API calls. Your code never leaves your device.

Instead of searching text, CodeGrok parses code into Abstract Syntax Trees using Tree-sitter. It extracts semantic symbols (functions, classes, methods, variables) from 9 languages and 30+ file extensions:

  • Python (.py, .pyi, .pyw)
  • JavaScript (.js, .jsx, .mjs, .cjs)
  • TypeScript (.ts, .tsx, .mts, .cts)
  • C/C++ (.c, .cpp, .h, .hpp)
  • Go, Java, Kotlin, Bash

Each symbol becomes a single chunk with rich metadata. Not arbitrary line splits. Not entire files. Just the code you need.
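
To make "one symbol, one chunk" concrete, here is a minimal sketch of the idea. It uses Python's built-in ast module as a stand-in for Tree-sitter, so it only handles Python source, and the SymbolChunk structure and extract_symbols name are made up for illustration. It is not CodeGrok's actual code.

```python
import ast
from dataclasses import dataclass
from typing import Optional


@dataclass
class SymbolChunk:
    """One indexed symbol (hypothetical structure, not CodeGrok's own)."""
    name: str
    kind: str                # "function", "class", or "method"
    start_line: int
    end_line: int
    source: Optional[str]    # exact source text of the symbol


def extract_symbols(code: str) -> list[SymbolChunk]:
    """Return one chunk per function, class, and method found in `code`."""
    tree = ast.parse(code)
    chunks: list[SymbolChunk] = []

    def visit(node: ast.AST, inside_class: bool) -> None:
        for child in ast.iter_child_nodes(node):
            if isinstance(child, (ast.FunctionDef, ast.AsyncFunctionDef)):
                kind = "method" if inside_class else "function"
            elif isinstance(child, ast.ClassDef):
                kind = "class"
            else:
                # Not a symbol itself; keep looking inside it.
                visit(child, inside_class)
                continue
            chunks.append(SymbolChunk(
                name=child.name,
                kind=kind,
                start_line=child.lineno,
                end_line=child.end_lineno,
                source=ast.get_source_segment(code, child),
            ))
            # Descend so methods and nested definitions get their own chunks.
            visit(child, inside_class=isinstance(child, ast.ClassDef))

    visit(tree, inside_class=False)
    return chunks


SAMPLE = '''
def verify_user(username, password):
    return True

class Session:
    def refresh(self):
        pass
'''

if __name__ == "__main__":
    for chunk in extract_symbols(SAMPLE):
        print(chunk.kind, chunk.name, chunk.start_line, chunk.end_line)
```

The point of the design is that chunk boundaries follow the syntax tree rather than a fixed line count, so a search hit is always a complete definition.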

The Embedding Pipeline

Here's where it gets interesting. CodeGrok uses nomic-ai/CodeRankEmbed, a model specifically trained for code retrieval, to generate 768-dimensional vectors for each symbol:

    'coderankembed': {
        'hf_name': 'nomic-ai/CodeRankEmbed',
        'dimensions': 768,
        'max_seq_length': 8192,
        'query_prefix': 'Represent this query for searching relevant code: ',
    }

Performance characteristics:

  • ~50 embeddings/second on CPU (faster with GPU)
  • LRU cache with 1000 entries for repeated queries
  • Incremental reindexing via mtime comparison: only changed files get re-processed
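
The mtime comparison in the last bullet can be sketched roughly as follows. This is an illustration under assumptions, not CodeGrok's implementation: the find_changed_files function and the in-memory known_mtimes dictionary are hypothetical, and a real index would persist that state to disk and also handle deleted files.

```python
import os
import tempfile
from pathlib import Path


def find_changed_files(root: Path, known_mtimes: dict[str, float],
                       extensions: tuple[str, ...] = (".py",)) -> list[Path]:
    """Return files that are new or modified since `known_mtimes` was recorded.

    `known_mtimes` maps file path -> modification time at last indexing.
    It is updated in place, so a second call with no edits returns [].
    """
    changed = []
    for path in sorted(root.rglob("*")):
        if not path.is_file() or path.suffix not in extensions:
            continue
        mtime = path.stat().st_mtime
        key = str(path)
        if known_mtimes.get(key) != mtime:
            changed.append(path)
            known_mtimes[key] = mtime
    return changed


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "a.py").write_text("x = 1\n")
        (root / "b.py").write_text("y = 2\n")
        state: dict[str, float] = {}
        print(len(find_changed_files(root, state)))   # first run sees both files
        print(len(find_changed_files(root, state)))   # nothing changed since
        # Simulate an edit by moving one file's modification time forward.
        later = (root / "a.py").stat().st_mtime + 10
        os.utime(root / "a.py", (later, later))
        print([p.name for p in find_changed_files(root, state)])
```

Only the paths returned by the scan would need to be re-parsed and re-embedded; everything else keeps its existing vectors.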

Each symbol gets formatted with everything an AI agent needs:

    # src/auth/login.py:42
    function: verify_user
    def verify_user(username: str, password: str) -> bool:
    Verifies user credentials against the database.
    def verify_user(username: str, password: str) -> bool:
        user = db.query(User).filter_by(username=username).first()
        return check_password(password, user.password_hash)
    Imports: db, check_password
    Calls: db.query, check_password

File location, symbol type, signature, docstring, implementation, and dependencies, all in one indexed chunk.
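
As a rough illustration of how such a chunk could be assembled, the sketch below renders the fields listed above into one text block. The Symbol dataclass and format_chunk function are hypothetical names chosen for this example; the real formatter may order or label fields differently.

```python
from dataclasses import dataclass, field


@dataclass
class Symbol:
    """Hypothetical container for the metadata attached to one symbol."""
    path: str
    line: int
    kind: str
    name: str
    signature: str
    docstring: str
    body: str
    imports: list[str] = field(default_factory=list)
    calls: list[str] = field(default_factory=list)


def format_chunk(symbol: Symbol) -> str:
    """Render one symbol as a single block of text ready for embedding."""
    parts = [
        f"# {symbol.path}:{symbol.line}",
        f"{symbol.kind}: {symbol.name}",
        symbol.signature,
        symbol.docstring,
        symbol.body,
    ]
    if symbol.imports:
        parts.append("Imports: " + ", ".join(symbol.imports))
    if symbol.calls:
        parts.append("Calls: " + ", ".join(symbol.calls))
    # Skip empty fields, for example a function without a docstring.
    return "\n".join(part for part in parts if part)


if __name__ == "__main__":
    example = Symbol(
        path="src/auth/login.py",
        line=42,
        kind="function",
        name="verify_user",
        signature="def verify_user(username: str, password: str) -> bool:",
        docstring="Verifies user credentials against the database.",
        body=(
            "def verify_user(username: str, password: str) -> bool:\n"
            "    user = db.query(User).filter_by(username=username).first()\n"
            "    return check_password(password, user.password_hash)"
        ),
        imports=["db", "check_password"],
        calls=["db.query", "check_password"],
    )
    print(format_chunk(example))
```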

How AI Agents Connect

CodeGrok exposes semantic search through the Model Context Protocol (MCP). If you're using Claude Desktop, Cursor, or any MCP-compatible client, integration is straightforward.

Four tools handle everything:

| Tool | Purpose |
|----|----|
| learn | Index a codebase (auto/full/load_only modes) |
| get_sources | Semantic search with language/symbol filters |
| get_stats | Return index statistics |
| list_supported_languages | List supported languages |

The get_sources tool is where the magic happens:

    @mcp.tool(name="get_sources")
    def get_sources(
        question: str,            # "How does user authentication work?"
        n_results: int = 10,      # Top-k results
        language: str = None,     # Filter: "python", "javascript"
        symbol_type: str = None   # Filter: "function", "class", "method"
    ) -> Dict[str, Any]:

Query "How does authentication work?" and get:

  • src/auth/login.py:42 - verify_user()
  • src/auth/mfa.py:78 - validate_mfa_token()

No comment matches. No string literals. No config files mentioning the word "authentication." Just the functions that actually handle authentication.
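
Conceptually, a filtered top-k semantic search looks something like the sketch below. It assumes cosine similarity as the ranking metric, which is a common choice but an assumption here, and it uses hand-written 3-dimensional toy vectors in place of real 768-dimensional embeddings. The IndexedSymbol and search names, and the JavaScript entry in the toy index, are invented for the example.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass
class IndexedSymbol:
    location: str
    language: str
    symbol_type: str
    vector: list[float]


def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(query_vector: list[float], index: list[IndexedSymbol],
           n_results: int = 10, language: Optional[str] = None,
           symbol_type: Optional[str] = None) -> list[IndexedSymbol]:
    """Filter by metadata first, then rank the survivors by similarity."""
    candidates = [
        s for s in index
        if (language is None or s.language == language)
        and (symbol_type is None or s.symbol_type == symbol_type)
    ]
    ranked = sorted(candidates,
                    key=lambda s: cosine_similarity(query_vector, s.vector),
                    reverse=True)
    return ranked[:n_results]


# Toy index: the vectors are made up so that auth-related code points
# in roughly the same direction as the query.
TOY_INDEX = [
    IndexedSymbol("src/auth/login.py:42 - verify_user()", "python", "function", [0.9, 0.1, 0.0]),
    IndexedSymbol("src/auth/mfa.py:78 - validate_mfa_token()", "python", "function", [0.8, 0.3, 0.1]),
    IndexedSymbol("src/utils.py:150 - verify_user_input()", "python", "function", [0.1, 0.9, 0.2]),
    # Hypothetical entry, included only to exercise the language filter.
    IndexedSymbol("web/login.js:12 - renderLoginForm()", "javascript", "function", [0.7, 0.2, 0.6]),
]
TOY_QUERY = [1.0, 0.2, 0.0]  # stand-in for the embedded question

if __name__ == "__main__":
    for hit in search(TOY_QUERY, TOY_INDEX, n_results=2, language="python"):
        print(hit.location)
```

Because ranking happens in vector space, a symbol named verify_user can rank highly for a question about authentication even though the two share no keyword.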

The Numbers That Matter

| Aspect | Grep | CodeGrok MCP |
|----|----|----|
| Matching | Keyword/regex | Semantic similarity |
| False positives | High | Very low |
| Synonyms | ❌ "authenticate" ≠ "verify" | ✅ Understands intent |
| Metadata | None | Line #, signature, type, language |
| Token usage | Read entire files | Returns exact functions |
| Persistence | Scan every time | Pre-indexed, instant search |

For enterprises, this means code stays on-premises. For solo developers, it means no API keys, no subscriptions, and it works offline after the initial model download.

Getting Started

    pip install codegrok-mcp
    codegrok-mcp  # Starts MCP server on stdio

Configure your MCP client to connect. Then:

  1. learn your codebase
  2. get_sources with natural language queries
  3. Get precise code references instead of grep noise
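
For the client configuration step, many MCP clients read a JSON file with an mcpServers section (Claude Desktop's claude_desktop_config.json follows this layout). The snippet below generates such an entry as a sketch: the "codegrok" key is an arbitrary label picked for this example, and the exact file location and any extra arguments depend on your client, so check its documentation.

```python
import json

# "codegrok" is just a label for this server entry; "codegrok-mcp" is the
# command from the install step above, which starts the server on stdio.
mcp_config = {
    "mcpServers": {
        "codegrok": {
            "command": "codegrok-mcp",
            "args": [],
        }
    }
}

if __name__ == "__main__":
    # Print the JSON so it can be merged into the client's config file.
    print(json.dumps(mcp_config, indent=2))
```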

Embeddings persist in .codegrok/ within your project directory. Subsequent indexing runs are near-instant because only changed files get re-processed.

GitHub: github.com/dondetir/CodeGrok_mcp


I'm an engineer who builds open-source AI tools through DS APPS Inc. CodeGrok MCP came from frustration with watching AI agents burn context windows on irrelevant grep results. The source is MIT licensed; contributions welcome.

