Year
2026
Client
Personal R&D
Category
AI Infrastructure / Systems Design
Product Duration
In Progress
Modern codebases are stored as repeated file snapshots: every commit duplicates unchanged functions, and every RAG pipeline re-embeds the same logic. The result is bloated vector stores, noisy retrieval, and line-based diffs that cannot reason about what actually changed. SCC was built to fix that foundation.
SCC runs every repository through a multi-stage ingestion pipeline: language detection, AST parsing via Tree-sitter, semantic chunk extraction by type (functions, classes, imports, interfaces, config blocks), and dual-form normalization. Normalization produces both an exact source for lossless reconstruction and a canonical form for stable identity hashing. Each chunk gets a unique ID, a canonical hash, a compressed exact source, and an embedding.
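
The dual-form idea can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `ChunkForms` shape, the naive comment-and-whitespace canonicalization, and the 32-bit FNV-1a hash are stand-ins, not SCC's actual normalization rules or hash function.

```typescript
// Hypothetical sketch of dual-form normalization. The rules and the hash
// below are illustrative stand-ins, not SCC's real implementation.

interface ChunkForms {
  exact: string;         // untouched source, kept for lossless reconstruction
  canonical: string;     // normalized text, used only for identity
  canonicalHash: string; // hash of the canonical form
}

// Naive canonicalization: drop comments and collapse whitespace.
// It does not understand string literals, so it is only a toy.
function canonicalize(source: string): string {
  return source
    .replace(/\/\*[\s\S]*?\*\//g, "") // block comments
    .replace(/\/\/[^\n]*/g, "")       // line comments
    .replace(/\s+/g, " ")             // runs of whitespace
    .trim();
}

// 32-bit FNV-1a over UTF-16 code units, used here in place of a
// cryptographic hash so the example has no dependencies.
function fnv1a(text: string): string {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  const hex = (hash >>> 0).toString(16);
  return ("0000000" + hex).slice(-8);
}

function toChunkForms(source: string): ChunkForms {
  const canonical = canonicalize(source);
  return { exact: source, canonical, canonicalHash: fnv1a(canonical) };
}

// Two differently formatted copies of the same function share one identity.
const formatted = toChunkForms("function add(a, b) {\n  // sum\n  return a + b;\n}");
const compact = toChunkForms("function add(a, b) { return a + b; }");
console.log(formatted.canonicalHash === compact.canonicalHash); // true
```

The exact form is never altered, so formatting differences survive reconstruction even though both copies resolve to the same canonical hash.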

Every unique chunk is stored in a per-repo dictionary keyed by canonical hash. The dictionary holds the chunk ID, language, type, symbol name, path hints, AST summary, dependency edges in and out, and the embedding vector. Files are not stored whole; they are represented as ordered sequences of chunk IDs plus raw glue segments, which keeps reconstruction fully lossless.
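
As a rough sketch of how a file could be rebuilt from chunk references and glue, consider the following. The `Segment`, `FileManifest`, and `ChunkStore` names and fields are hypothetical, and the store is simplified to a map from chunk ID to exact source. The real dictionary carries far more metadata.

```typescript
// Hypothetical shapes for illustration only, not SCC's actual schema.
type Segment =
  | { kind: "chunk"; chunkId: string } // reference into the chunk store
  | { kind: "glue"; text: string };    // raw text between chunks

interface FileManifest {
  path: string;
  segments: Segment[];
}

// Simplified store: chunk ID -> exact (already decompressed) source.
type ChunkStore = Map<string, string>;

// Concatenate exact chunk sources and glue in manifest order.
function reconstructFile(manifest: FileManifest, store: ChunkStore): string {
  let out = "";
  for (const seg of manifest.segments) {
    if (seg.kind === "glue") {
      out += seg.text;
    } else {
      const exact = store.get(seg.chunkId);
      if (exact === undefined) {
        throw new Error("missing chunk " + seg.chunkId);
      }
      out += exact;
    }
  }
  return out;
}
```

Because glue segments hold every byte that is not part of a chunk (blank lines, stray comments, trailing newlines), concatenating the segments in order reproduces the original file exactly.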

On top of the chunk store, SCC enables semantic commits. A commit is a file manifest where each file is a sequence of chunk references and raw glue. Diffs surface at the semantic unit level first: unchanged functions are trivially skipped, changed functions are flagged by chunk ID, and merge conflicts are scoped to the exact function rather than a raw line range. This gives both a semantic diff and a textual diff within the changed chunk.
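
A minimal sketch of the semantic-level pass might look like this. It assumes each file version is reduced to a list of hypothetical `ChunkRef` entries (symbol name plus chunk ID) and that symbol names are unique within a file. Neither assumption comes from SCC itself, and the textual diff inside a changed chunk is left out.

```typescript
// Hypothetical sketch: classify symbols between two versions of one file
// by comparing chunk IDs. Assumes symbol names are unique within the file.
interface ChunkRef {
  symbol: string;
  chunkId: string;
}

interface SemanticDiff {
  unchanged: string[];
  changed: string[];
  added: string[];
  removed: string[];
}

function diffFile(before: ChunkRef[], after: ChunkRef[]): SemanticDiff {
  const oldIds = new Map<string, string>();
  for (const ref of before) {
    oldIds.set(ref.symbol, ref.chunkId);
  }

  const diff: SemanticDiff = { unchanged: [], changed: [], added: [], removed: [] };
  const seen = new Set<string>();

  for (const ref of after) {
    seen.add(ref.symbol);
    const oldId = oldIds.get(ref.symbol);
    if (oldId === undefined) {
      diff.added.push(ref.symbol);
    } else if (oldId === ref.chunkId) {
      diff.unchanged.push(ref.symbol); // same chunk ID: skip without reading text
    } else {
      diff.changed.push(ref.symbol);   // only these need a textual diff
    }
  }

  for (const ref of before) {
    if (!seen.has(ref.symbol)) {
      diff.removed.push(ref.symbol);
    }
  }
  return diff;
}
```

Only the symbols in `changed` would need a line-level diff, which is what keeps a merge conflict scoped to a single function.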

For retrieval, SCC acts as a deduplication and indexing layer beneath vector search. If the same helper function appears 15 times across a repo, it gets one embedding, not 15. Retrieval returns chunk IDs, which are then materialized into exact source on demand. The retrieval pipeline is hybrid: vector search, symbol search, path search, and dependency graph expansion are combined before ranking.
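
The one-embedding-per-chunk behaviour can be sketched as below. The `Occurrence` shape and the injected `embed` callback are hypothetical. A real embedder would call a model and would almost certainly be asynchronous.

```typescript
// Hypothetical sketch: embed each canonical hash once, however many
// occurrences point at it. `embed` stands in for a real embedding call.
interface Occurrence {
  path: string;
  canonicalHash: string;
  canonicalText: string;
}

function embedUnique(
  occurrences: Occurrence[],
  embed: (text: string) => number[]
): Map<string, number[]> {
  const byHash = new Map<string, number[]>();
  for (const occ of occurrences) {
    if (!byHash.has(occ.canonicalHash)) {
      byHash.set(occ.canonicalHash, embed(occ.canonicalText));
    }
  }
  return byHash;
}
```

With 15 occurrences of one helper and a single occurrence of another function, `embed` runs twice and the vector store holds two entries. The occurrence records, not the embeddings, remember every path where the helper appears.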

Backend: Node.js and TypeScript, Express, PostgreSQL with pgvector for chunk embeddings, Redis for queuing and caching, and object storage for compressed chunk blobs. Parsing: Tree-sitter for multi-language AST extraction. Schema: chunks, chunk_occurrences, dependency_edges, manifests, and manifest_entries, fully normalized, with separate metadata and blob storage layers.

On a representative mid-size repo, SCC found 18,904 semantic chunks across 4,281 files, with 11,622 unique chunks after deduplication: a 38.5% embedding reduction before any queries are run. Analytics and summary cache entries are keyed by chunk ID, giving near-zero recomputation cost for repeated retrievals. This is still somewhat theoretical.
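
The reduction figure follows directly from the two chunk counts. The helper below is a hypothetical illustration of that arithmetic, not part of SCC.

```typescript
// Fraction of embeddings avoided by storing one vector per unique chunk.
function embeddingReduction(totalChunks: number, uniqueChunks: number): number {
  if (totalChunks <= 0) {
    throw new RangeError("totalChunks must be positive");
  }
  return (totalChunks - uniqueChunks) / totalChunks;
}

// (18,904 - 11,622) / 18,904 = 7,282 / 18,904
const reduction = embeddingReduction(18904, 11622);
console.log((reduction * 100).toFixed(1) + "%"); // 38.5%
```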

Built entirely solo as a research and infrastructure project. SCC is designed in three build phases: Phase 1 is the RAG-focused chunk store (current); Phase 2 adds semantic diff and function-level history on top of Git; Phase 3 replaces Git's file snapshot store with a native chunk manifest store. The engine is also incorporated as the core retrieval layer in Gity, a semantic developer tool in development.


