Semantic Chunk Compression

Semantic Chunk Compression (SCC) is a code intelligence engine that parses repositories into AST-aware semantic units, deduplicates them using canonical fingerprinting, and stores a single compressed chunk per unique piece of logic. It powers both semantic version control diffs and cheaper RAG retrieval.

Year: 2026
Client: Personal R&D
Category: AI Infrastructure / Systems Design
Product Duration: In Progress

The Problem

Modern codebases are stored as repeated file snapshots: every commit that touches a file stores a fresh copy of it, unchanged functions included, and every RAG pipeline re-embeds the same logic. The result is bloated vector stores, noisy retrieval, and line-based diffs that cannot reason about what actually changed. SCC was built to fix the foundation.

The Core Engine

SCC runs every repository through a multi-stage ingestion pipeline: language detection, AST parsing via Tree-sitter, semantic chunk extraction by type (functions, classes, imports, interfaces, config blocks), and dual-form normalization. Normalization produces both an exact source for lossless reconstruction and a canonical form for stable identity hashing. Each chunk gets a unique ID, a canonical hash, a compressed exact source, and an embedding.
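
The dual-form idea can be illustrated with a minimal sketch. It is not SCC's actual normalizer: the regex-based `canonicalize` below is a stand-in for AST-level normalization, and the names `ChunkForms`, `toChunkForms`, and `restoreExact`, as well as the choice of SHA-256 and gzip, are assumptions made for the example.

```typescript
import { createHash } from "node:crypto";
import { gzipSync, gunzipSync } from "node:zlib";

// Hypothetical record holding the two forms of one chunk.
interface ChunkForms {
  canonicalHash: string;   // identity: stable across formatting changes
  compressedExact: Buffer; // lossless: the original text, gzip-compressed
}

// Stand-in canonicalizer: drops /* */ and // comments, collapses whitespace.
// Naive on purpose: it would also alter comment-like text inside string
// literals. A real implementation would normalize the Tree-sitter AST.
function canonicalize(source: string): string {
  return source
    .replace(/\/\*[\s\S]*?\*\//g, " ")
    .replace(/\/\/[^\n]*/g, " ")
    .replace(/\s+/g, " ")
    .trim();
}

function toChunkForms(exactSource: string): ChunkForms {
  return {
    canonicalHash: createHash("sha256")
      .update(canonicalize(exactSource))
      .digest("hex"),
    compressedExact: gzipSync(Buffer.from(exactSource, "utf8")),
  };
}

// Recovers the exact original text from the compressed form.
function restoreExact(forms: ChunkForms): string {
  return gunzipSync(forms.compressedExact).toString("utf8");
}

// Two spellings of the same function: different bytes, same canonical form.
const original = "function add(a, b) {\n  // sum\n  return a + b;\n}";
const a = toChunkForms(original);
const b = toChunkForms("function add(a, b) { return a + b; }");
```

Here `a` and `b` share a canonical hash, so they would count as one unique chunk, while `restoreExact(a)` still returns the original text byte for byte.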

Chunk Record & Repo Dictionary

Every unique chunk is stored in a per-repo dictionary keyed by canonical hash. The dictionary holds: chunk ID, language, type, symbol name, path hints, AST summary, dependency edges in and out, and the embedding vector. Files are not stored whole; they are represented as ordered sequences of chunk IDs plus raw glue segments, which keeps reconstruction fully lossless.
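
A file represented as chunk references plus glue can be rebuilt by simple concatenation. The sketch below assumes hypothetical shapes (`ChunkRecord`, `ManifestEntry`, `reconstructFile`) and, for brevity, looks records up by chunk ID and keeps the exact source uncompressed; the real dictionary is keyed by canonical hash and holds more fields.

```typescript
// Trimmed-down, hypothetical chunk record (the real one carries more fields).
interface ChunkRecord {
  chunkId: string;
  language: string;
  type: "function" | "class" | "import" | "interface" | "config";
  symbolName: string;
  exactSource: string; // kept uncompressed here for brevity
}

// A file is an ordered list of chunk references and raw glue segments.
type ManifestEntry =
  | { kind: "chunk"; chunkId: string }
  | { kind: "glue"; text: string };

function reconstructFile(
  entries: ManifestEntry[],
  dictionary: Map<string, ChunkRecord>
): string {
  return entries
    .map((entry) => {
      if (entry.kind === "glue") return entry.text;
      const record = dictionary.get(entry.chunkId);
      if (!record) throw new Error(`Unknown chunk ID: ${entry.chunkId}`);
      return record.exactSource;
    })
    .join("");
}

const dictionary = new Map<string, ChunkRecord>([
  ["c1", { chunkId: "c1", language: "ts", type: "import", symbolName: "fs",
           exactSource: 'import fs from "fs";' }],
  ["c2", { chunkId: "c2", language: "ts", type: "function", symbolName: "f",
           exactSource: "function f() {}" }],
]);

const rebuilt = reconstructFile(
  [
    { kind: "chunk", chunkId: "c1" },
    { kind: "glue", text: "\n\n" },
    { kind: "chunk", chunkId: "c2" },
    { kind: "glue", text: "\n" },
  ],
  dictionary
);
```

The glue segments carry everything that is not a chunk (blank lines, stray comments, trailing newlines), which is what makes the round trip exact.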

Version Control Layer

On top of the chunk store, SCC enables semantic commits. A commit is a file manifest where each file is a sequence of chunk references and raw glue. Diffs surface at the semantic unit level first: unchanged functions are trivially skipped, changed functions are flagged by chunk ID, and merge conflicts are scoped to the exact function rather than a raw line range. The result is a semantic diff across chunks plus a textual diff within each changed chunk.
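
A chunk-level diff between two versions of a file might look like the sketch below. Pairing old and new chunks by symbol name is an assumption made for the example, and `ChunkRef`, `SemanticChange`, and `semanticDiff` are hypothetical names; glue segments are ignored here.

```typescript
// One chunk reference in a file manifest (glue omitted for this sketch).
interface ChunkRef {
  symbol: string;
  chunkId: string;
}

interface SemanticChange {
  symbol: string;
  kind: "added" | "removed" | "changed";
  before?: string; // chunk ID in the old manifest
  after?: string;  // chunk ID in the new manifest
}

function semanticDiff(oldFile: ChunkRef[], newFile: ChunkRef[]): SemanticChange[] {
  const oldBySymbol = new Map<string, string>();
  for (const ref of oldFile) oldBySymbol.set(ref.symbol, ref.chunkId);
  const newBySymbol = new Map<string, string>();
  for (const ref of newFile) newBySymbol.set(ref.symbol, ref.chunkId);

  const changes: SemanticChange[] = [];
  for (const ref of oldFile) {
    const after = newBySymbol.get(ref.symbol);
    if (after === undefined) {
      changes.push({ symbol: ref.symbol, kind: "removed", before: ref.chunkId });
    } else if (after !== ref.chunkId) {
      changes.push({ symbol: ref.symbol, kind: "changed", before: ref.chunkId, after });
    }
    // Same chunk ID on both sides: unchanged, so nothing is emitted.
  }
  for (const ref of newFile) {
    if (!oldBySymbol.has(ref.symbol)) {
      changes.push({ symbol: ref.symbol, kind: "added", after: ref.chunkId });
    }
  }
  return changes;
}

const changes = semanticDiff(
  [
    { symbol: "parse", chunkId: "c1" },
    { symbol: "render", chunkId: "c2" },
    { symbol: "legacy", chunkId: "c3" },
  ],
  [
    { symbol: "parse", chunkId: "c1" },
    { symbol: "render", chunkId: "c9" },
    { symbol: "save", chunkId: "c4" },
  ]
);
```

In this example `parse` is skipped entirely, `render` is reported as changed (c2 to c9), `legacy` as removed, and `save` as added. A textual diff would then only need to run between the exact sources of c2 and c9.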

RAG Layer

For retrieval, SCC acts as a deduplication and indexing layer beneath vector search. If the same helper function appears 15 times across a repo, it gets one embedding, not 15. Retrieval returns chunk IDs, which are then materialized into exact source on demand. The retrieval pipeline is hybrid: vector search, symbol search, path search, and dependency graph expansion are combined before ranking.
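
The "one embedding per unique chunk" behavior can be sketched as an index keyed by canonical hash. Everything named here (`Occurrence`, `IndexedChunk`, `buildIndex`, and the counting `fakeEmbed` stand-in for a real embedding model) is hypothetical and only illustrates the deduplication step, not the hybrid ranking.

```typescript
// One place where a chunk appears in the repo.
interface Occurrence {
  path: string;
  canonicalHash: string;
  source: string;
}

// One index entry per unique chunk, remembering every path it appears at.
interface IndexedChunk {
  canonicalHash: string;
  embedding: number[];
  paths: string[];
}

type Embedder = (text: string) => number[];

function buildIndex(occurrences: Occurrence[], embed: Embedder): Map<string, IndexedChunk> {
  const index = new Map<string, IndexedChunk>();
  for (const occ of occurrences) {
    const existing = index.get(occ.canonicalHash);
    if (existing) {
      existing.paths.push(occ.path); // duplicate: record the path, skip embedding
      continue;
    }
    index.set(occ.canonicalHash, {
      canonicalHash: occ.canonicalHash,
      embedding: embed(occ.source), // embedded exactly once per unique chunk
      paths: [occ.path],
    });
  }
  return index;
}

// Stand-in embedder that counts calls instead of calling a model.
let embedCalls = 0;
const fakeEmbed: Embedder = (text) => {
  embedCalls += 1;
  return [text.length];
};

// The same helper appears in 15 modules, plus one unrelated function.
const occurrences: Occurrence[] = [];
for (let i = 0; i < 15; i++) {
  occurrences.push({
    path: `src/module${i}/util.ts`,
    canonicalHash: "h-helper",
    source: "function helper() {}",
  });
}
occurrences.push({ path: "src/main.ts", canonicalHash: "h-main", source: "function main() {}" });

const chunkIndex = buildIndex(occurrences, fakeEmbed);
```

With 16 occurrences and 2 unique hashes, the embedder runs twice, and the `paths` list on the helper entry still records all 15 locations for path search.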

Architecture & Stack

Backend: Node.js and TypeScript, Express, PostgreSQL with pgvector for chunk embeddings, Redis for queuing and caching, and object storage for compressed chunk blobs.
Parsing: Tree-sitter for multi-language AST extraction.
Schema: chunks, chunk_occurrences, dependency_edges, manifests, and manifest_entries, fully normalized, with separate metadata and blob storage layers.

Analytics & Performance

On a representative mid-size repo: 18,904 semantic chunks found across 4,281 files, with 11,622 unique chunks after deduplication. That is a 38.5% reduction in embeddings before any queries are run. Analytics and summary cache entries are keyed by chunk ID, giving near-zero recomputation cost for repeated retrievals. These figures are still somewhat theoretical.
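
The reduction figure follows directly from the two chunk counts. A small check, with `embeddingReduction` as a hypothetical helper name:

```typescript
// Fraction of embeddings avoided when only unique chunks are embedded.
function embeddingReduction(totalChunks: number, uniqueChunks: number): number {
  if (totalChunks <= 0) throw new RangeError("totalChunks must be positive");
  return (totalChunks - uniqueChunks) / totalChunks;
}

// (18,904 - 11,622) / 18,904 = 7,282 / 18,904
const reduction = embeddingReduction(18904, 11622);
const reductionPercent = (reduction * 100).toFixed(1);
console.log(`${reductionPercent}%`); // prints "38.5%"
```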

Scale & Scope

Built entirely solo as a research and infrastructure project. SCC is designed in three build phases: Phase 1 is the RAG-focused chunk store (current); Phase 2 adds semantic diff and function-level history on top of Git; Phase 3 replaces Git's file snapshot store with a native chunk manifest store. The engine is also incorporated as the core retrieval layer in Gity, a semantic developer tool in development.

Based in Abuja, Nigeria
Product Engineer + Interface Architect

Product Engineer focused on AI-powered tools, scalable systems, and interface architecture.
