Architecture

One retrieval core. Two interfaces.

Dig should ship as a modular monolith for a 1–2 person team. The same domain modules power a mobile-first web app and an MCP/API layer, which keeps agent answers aligned with what users see in the UI.

```mermaid
flowchart TB
  subgraph Clients["Clients"]
    Web["Mobile-First Web App"]
    Apps["Apps / Partners"]
    LLM["LLM Agent Runtime"]
  end
  subgraph Surface["Interfaces"]
    REST["REST API"]
    MCP["MCP Server"]
    Admin["Editorial/Admin"]
  end
  subgraph Core["Modular Monolith"]
    Catalog["Catalog"]
    Search["Search (Postgres FTS + pg_trgm)"]
    Curation["Curation"]
    Media["Media Links"]
    Jobs["Ingest / Workers"]
    Match["Matching / Export (later)"]
  end
  subgraph Data["Data"]
    PG["Postgres"]
    Redis["Redis"]
  end
  Web --> REST
  Apps --> REST
  LLM --> MCP
  MCP --> REST
  Admin --> REST
  REST --> Catalog
  REST --> Search
  REST --> Curation
  REST --> Media
  REST --> Match
  REST --> Jobs
  Catalog --> PG
  Search --> PG
  Curation --> PG
  Media --> PG
  Match --> PG
  Jobs --> PG
  Jobs --> Redis
```

Data strategy

Normalization decisions before importer code.
No exceptions.

The risky part of this project is not the landing page or the API framework. It is the normalization logic for Discogs dumps. The importer should be driven by a profiling-backed normalization dictionary, not by assumptions.

01

Raw payload staging is non-negotiable

Store parsed entities in ingest.raw_entities before normalization so canonical tables can be re-derived as the schema evolves.
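A minimal sketch of what a staging row might look like. The column names (`batch_id`, `entity_type`, `source_id`, `payload`, `checksum`) are assumptions for illustration, not a fixed schema; the point is that the payload is stored verbatim and a content checksum lets re-runs skip unchanged entities.

```python
import hashlib
import json

def stage_raw_entity(batch_id: int, entity_type: str, source_id: str, payload: dict) -> dict:
    """Build a row for ingest.raw_entities, keeping the parsed payload verbatim.

    Nothing is normalized here; canonical tables are derived later and can be
    re-derived as the schema evolves.
    """
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return {
        "batch_id": batch_id,
        "entity_type": entity_type,   # e.g. "artist", "label", "master", "release"
        "source_id": source_id,       # Discogs ID kept as text, unparsed
        "payload": body,              # raw JSON, the source of truth for re-derivation
        "checksum": hashlib.sha256(body.encode()).hexdigest(),
    }

row = stage_raw_entity(1, "release", "249504", {"title": "Nevermind", "year": 1991})
```

Because the checksum is computed over key-sorted JSON, re-parsing the same dump produces identical checksums even if dict ordering differs between runs.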

02

Messy fields stay raw in v1

Preserve raw credit role strings, identifier values, and track positions. Parse conservatively, add normalized helpers later.
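"Parse conservatively" could look like the following sketch for track positions: the raw string is always kept, and structured fields are attached only when the shape is unambiguous. The regex and field names are illustrative assumptions.

```python
import re

# Matches only simple vinyl-style positions like "A1", "B12", or plain "7".
# Anything else ("1-2", "CD1-3", "Video 4", bare "A") stays unparsed in v1.
SIMPLE_POSITION = re.compile(r"^(?P<side>[A-Z])?(?P<number>\d+)$")

def parse_position(raw: str) -> dict:
    """Keep the raw string; attach side/number helpers only when unambiguous."""
    result = {"raw": raw, "side": None, "number": None}
    m = SIMPLE_POSITION.match(raw.strip())
    if m:
        result["side"] = m.group("side")
        result["number"] = int(m.group("number"))
    return result
```

Ambiguous positions lose nothing: the raw value survives, and a richer parser can be layered on once profiling says which patterns actually occur.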

03

Load-bearing edge cases

ANV handling, track position encoding, credits nesting, and master/release linkage should all be profiled before importer logic is locked.
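One cheap way to profile an edge case before locking parser logic is to tally the "shapes" of a field across staged payloads. The payload structure below (a `tracklist` of dicts with a `position` key) mirrors Discogs dump data but is simplified for illustration.

```python
from collections import Counter

def profile_positions(payloads: list) -> Counter:
    """Tally raw track-position shapes: letters -> 'A', digits -> '9'.

    "A1" and "B2" share the shape "A9", so the counter shows which
    position grammars actually occur and how often.
    """
    shapes = Counter()
    for release in payloads:
        for track in release.get("tracklist", []):
            pos = track.get("position", "")
            shape = "".join("A" if c.isalpha() else "9" if c.isdigit() else c for c in pos)
            shapes[shape] += 1
    return shapes
```

Running this over a dump sample turns "we think positions look like A1" into a ranked frequency table that the normalization dictionary can cite.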

04

Search starts in Postgres

Use tsvector + pg_trgm first. Upgrade to OpenSearch only when measured relevance/latency needs justify the ops overhead.
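A sketch of the two-stage query this implies, held as parameterized SQL strings: exact-ish full-text matching first, trigram similarity as a misspelling fallback. The `search_documents` table and its columns are assumptions; `%%` is the pg_trgm similarity operator escaped for psycopg-style parameter binding.

```python
# Hypothetical table: search_documents(entity_id, entity_type, title, document tsvector)
FTS_QUERY = """
SELECT entity_id, entity_type, ts_rank(document, q) AS rank
FROM search_documents, websearch_to_tsquery('simple', %(terms)s) AS q
WHERE document @@ q
ORDER BY rank DESC
LIMIT %(limit)s
"""

# Trigram fallback for typos when FTS returns nothing (requires the pg_trgm
# extension and a GIN/GiST trigram index on title to stay fast).
TRGM_FALLBACK = """
SELECT entity_id, entity_type, similarity(title, %(terms)s) AS score
FROM search_documents
WHERE title %% %(terms)s
ORDER BY score DESC
LIMIT %(limit)s
"""

def search_sql(use_fallback: bool) -> str:
    return TRGM_FALLBACK if use_fallback else FTS_QUERY
```

Keeping both paths behind one service function means the later OpenSearch decision gate only swaps the SQL, not the API surface.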

```mermaid
flowchart LR
  A["Discogs XML Dumps"] --> B["dump_batches"]
  B --> C["XML Stream Parse"]
  C --> D["raw_entities (JSON payloads)"]
  D --> E["Normalize / Validate"]
  E --> F["catalog.* canonical tables"]
  F --> G["search_documents (Postgres FTS)"]
  F --> H["curation + media links"]
  F --> I["matching/export (later)"]
```

API + MCP
Deterministic tools

LLMs orchestrate, Dig retrieves. Tool outputs should be structured, reproducible, and explainable.

Shared domain logic

MCP and REST should call the same retrieval services so the UI and agent answers stay aligned.

Provenance + confidence

Return why a record matched, where link data came from, and where the system is uncertain.
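One possible response envelope for this, as a sketch. Every field name here is an assumption about shape, not a spec; the point is that match reasons, data sources, and uncertainty travel with the result rather than being implied.

```python
from dataclasses import dataclass, field

@dataclass
class MatchExplanation:
    """Illustrative provenance envelope attached to each retrieval result."""
    entity_id: str
    matched_on: list       # which fields hit the query, e.g. ["title", "label"]
    sources: list          # where the data came from, e.g. ["discogs_dump_2025-01"]
    confidence: float      # 0.0-1.0; below a threshold, the UI/agent should hedge
    caveats: list = field(default_factory=list)  # known uncertainty, stated outright

    def is_uncertain(self, threshold: float = 0.7) -> bool:
        return self.confidence < threshold
```

An agent consuming this can quote `matched_on` and `caveats` verbatim instead of inventing its own justification for why a record surfaced.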

No hidden writes

Agent tools should read, rank, explain, or draft by default. Writes require explicit intent.
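A minimal sketch of that gate: the dispatcher refuses any write-capable tool unless the caller states intent explicitly. The tool sets and the `confirmed_write` flag name are illustrative.

```python
READ_TOOLS = {"search_catalog", "get_release", "get_artist", "get_media_links"}
WRITE_TOOLS = {"create_crate_draft"}  # drafts only, and still gated

def dispatch(tool: str, args: dict, *, confirmed_write: bool = False) -> dict:
    """Route a tool call; writes require an explicit opt-in from the caller."""
    if tool in WRITE_TOOLS and not confirmed_write:
        raise PermissionError(f"{tool} writes data; pass confirmed_write=True")
    if tool not in READ_TOOLS | WRITE_TOOLS:
        raise ValueError(f"unknown tool: {tool}")
    return {"tool": tool, "args": args}  # placeholder for the real service call
```

Making the default path read-only means a hallucinated or over-eager tool call fails loudly instead of silently mutating state.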

Core retrieval API (REST)

  • GET /search (catalog entities + filters)
  • GET /artists/:id
  • GET /labels/:id
  • GET /masters/:id
  • GET /releases/:id
  • GET /releases/:id/media-links
  • GET /curators/:slug, /lists/:slug

LLM-facing MCP tools

  • search_catalog
  • get_release, get_master_release
  • get_artist, get_label
  • get_related_releases
  • get_media_links
  • create_crate_draft
  • explain_relationships
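As a sketch of how one of these tools might be declared, following the MCP convention of JSON-Schema tool inputs. The field choices (`entity_type` enum, `limit` default) are assumptions for illustration:

```python
# Hypothetical MCP tool declaration for search_catalog; the description ties
# the tool to the same retrieval service that backs GET /search.
SEARCH_CATALOG_TOOL = {
    "name": "search_catalog",
    "description": "Deterministic catalog search; same service as GET /search.",
    "inputSchema": {
        "type": "object",
        "properties": {
            "query": {"type": "string"},
            "entity_type": {
                "type": "string",
                "enum": ["artist", "label", "master", "release"],
            },
            "limit": {"type": "integer", "default": 20},
        },
        "required": ["query"],
    },
}
```

Because the schema constrains inputs and the service is deterministic, the same arguments always return the same ranked results the web UI would show.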
Implementation plan

Risk-first build plan for a 1–2 person team

The project gets expensive and chaotic when teams start with infrastructure ambition instead of data decisions. This sequence keeps the build grounded and reversible.

Pre-M0

Profile real Discogs dump samples

Write the normalization dictionary from real records, not assumptions. Validate ANV handling, track positions, credits nesting, identifiers, and master/release linkage before importer code.

M1

Ingestion foundation

Schema migrations, dump_batches, raw_entities, parser pipeline, canonical upserts, and QA/reconciliation reports.

M2

Search + read API

Postgres FTS + pg_trgm, core entity endpoints, query profiling/indexes, and a mobile-first retrieval UI.

M3

Curation + media links + editorial tooling

Curators, crates, link provenance, validation jobs, and the first useful human-facing output loops.

M4

MCP layer

Agent tools backed by the same domain services as the REST API. Deterministic retrieval, explicit side effects, and confidence/provenance in responses.

M5

Matching / export beta (optional)

External catalog candidate store, confidence scoring, export jobs, and a review path for ambiguous matches.

M6

Search cluster decision gate

Only move to OpenSearch/Elasticsearch when measured relevance or latency makes the migration worth the overhead.

Current build defaults (the useful ones)

  • Architecture: modular monolith
  • Search v1: Postgres FTS + pg_trgm
  • Ingest: raw payload staging in ingest.raw_entities
  • Auth v1: editorial-only
  • LLM strategy: no proxying; expose MCP/API and let users bring models
  • Non-goals: marketplace, compliance stack, Discogs write-back dependency

Feedback

This is a working paper.
Tear it apart.

The most useful feedback right now is on the thesis, scope boundaries, and implementation order — especially whether the data and MCP decisions feel tight enough for a small team build.

Product

Is the AI-layer positioning clear enough, or does it still read like a better Discogs UI?

Data

Are the normalization risks and raw-first staging approach called out clearly enough before implementation starts?

Technical

Does the modular monolith + Postgres search-first plan feel credible for a 1–2 person team?

Commercial

Does the API/MCP + curation path sound monetizable before any marketplace/compliance overhead?
