Dig should ship as a modular monolith for a 1–2 person team. The same domain modules power a mobile-first web app and an MCP/API layer, which keeps agent answers aligned with what users see in the UI.
The risky part of this project is not the landing page or the API framework. It is the normalization logic for Discogs dumps. The importer should be driven by a profiling-backed normalization dictionary, not by assumptions.
Store parsed entities in ingest.raw_entities before normalization so canonical tables can be re-derived as the schema evolves.
Preserve raw credit role strings, identifier values, and track positions. Parse conservatively, add normalized helpers later.
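The raw-first staging pattern above can be sketched as follows. This is a minimal demo using sqlite3 in place of Postgres so it runs anywhere; the column names beyond those named in this plan (`payload`, `dump_batch`) are assumptions, not a fixed schema.

```python
import json
import sqlite3

# Minimal raw-first staging demo. sqlite3 stands in for Postgres;
# the real tables would live under the ingest schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE raw_entities (
    id INTEGER PRIMARY KEY,
    entity_type TEXT NOT NULL,   -- 'artist' | 'label' | 'master' | 'release'
    source_id TEXT NOT NULL,     -- Discogs ID from the dump
    payload TEXT NOT NULL,       -- parsed record, stored verbatim as JSON
    dump_batch TEXT NOT NULL     -- which dump batch this row came from
);
CREATE TABLE releases (          -- canonical table, re-derivable at any time
    source_id TEXT PRIMARY KEY,
    title TEXT
);
""")

# Stage the parsed record untouched: roles, positions, identifiers intact.
raw = {"id": "1", "title": "Blue Train", "tracklist": [{"position": "A1"}]}
conn.execute(
    "INSERT INTO raw_entities (entity_type, source_id, payload, dump_batch) "
    "VALUES (?, ?, ?, ?)",
    ("release", raw["id"], json.dumps(raw), "2024-06"),
)

# Canonical rows are derived from staged payloads, so a schema change
# means re-running this step, not re-parsing the dump.
for (payload,) in conn.execute(
    "SELECT payload FROM raw_entities WHERE entity_type = 'release'"
):
    rec = json.loads(payload)
    conn.execute(
        "INSERT OR REPLACE INTO releases (source_id, title) VALUES (?, ?)",
        (rec["id"], rec["title"]),
    )

print(conn.execute("SELECT title FROM releases").fetchone()[0])  # Blue Train
```

The point of the demo is the direction of dependency: canonical tables depend on `raw_entities`, never the other way around.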
ANV handling, track position encoding, credits nesting, and master/release linkage should all be profiled before importer logic is locked.
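One way to run that profiling pass, shown here for track positions. The pattern names and regexes below are illustrative assumptions, not an inventory of Discogs conventions; the value of the exercise is the `unknown` bucket it surfaces.

```python
import re
from collections import Counter

# Hypothetical profiling pass: before writing importer logic, count the
# position formats that actually occur in the dump.
PATTERNS = {
    "vinyl_side":  re.compile(r"^[A-Z]\d*$"),     # A, A1, B2
    "plain_index": re.compile(r"^\d+$"),          # 1, 2, 10
    "disc_track":  re.compile(r"^\d+[-.]\d+$"),   # 1-01, 2.10
    "cassette":    re.compile(r"^[A-Z]\.\d+$"),   # A.1
}

def classify(position: str) -> str:
    for name, pattern in PATTERNS.items():
        if pattern.match(position):
            return name
    return "unknown"  # the unknowns are exactly what profiling should surface

# Stand-in for positions pulled out of ingest.raw_entities payloads.
sample = ["A1", "A2", "B1", "1", "2", "1-01", "2.10", "A.1", "Video-1"]
profile = Counter(classify(p) for p in sample)
print(dict(profile))
```

The same shape of script applies to ANV strings, credit roles, and identifier types: classify, count, and only then decide what the importer normalizes versus preserves.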
Use tsvector + pg_trgm first. Upgrade to OpenSearch only when measured relevance/latency needs justify the ops overhead.
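A sketch of what that search v1 migration could look like, held here as SQL strings. The `releases` table and its `title`/`artist_credit` columns are assumptions for illustration, not a committed schema.

```python
# Sketch of a Postgres search-v1 migration (assumed table/column names).
SEARCH_V1_MIGRATION = """
CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- A generated tsvector column keeps FTS in sync with the row automatically.
ALTER TABLE releases
    ADD COLUMN search_doc tsvector
    GENERATED ALWAYS AS (
        setweight(to_tsvector('simple', coalesce(title, '')), 'A') ||
        setweight(to_tsvector('simple', coalesce(artist_credit, '')), 'B')
    ) STORED;

CREATE INDEX releases_search_doc_idx ON releases USING gin (search_doc);

-- Trigram index covers fuzzy/substring matching that FTS misses.
CREATE INDEX releases_title_trgm_idx ON releases USING gin (title gin_trgm_ops);
"""

# Query shape: rank FTS hits, fall back to trigram similarity for typos.
SEARCH_QUERY = """
SELECT source_id, title,
       ts_rank(search_doc, websearch_to_tsquery('simple', %(q)s)) AS rank
FROM releases
WHERE search_doc @@ websearch_to_tsquery('simple', %(q)s)
   OR similarity(title, %(q)s) > 0.3
ORDER BY rank DESC
LIMIT 20;
"""
```

Because both pieces live in one migration and one query, the later OpenSearch decision stays a swap of this retrieval path, not a rewrite of the API.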
LLMs orchestrate, Dig retrieves. Tool outputs should be structured, reproducible, and explainable.
MCP and REST should call the same retrieval services so the UI and agent answers stay aligned.
Return why a record matched, where link data came from, and where the system is uncertain.
Agent tools should read, rank, explain, or draft by default. Writes require explicit intent.
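The contract above (shared retrieval, explainability, reads by default) can be sketched as one service with two thin adapters. `SearchHit`, `search_catalog`, and the adapter names are hypothetical, and the canned result stands in for a real Postgres query.

```python
from dataclasses import dataclass

@dataclass
class SearchHit:
    source_id: str
    title: str
    matched_on: list      # why this record matched (explainability)
    provenance: str       # where the data came from
    confidence: float     # surfaced uncertainty, never hidden

def search_catalog(query: str) -> list:
    """Single retrieval path used by both the REST endpoint and the MCP tool."""
    # A real implementation would hit Postgres FTS; this is a canned result.
    return [SearchHit(
        source_id="r1",
        title="Blue Train",
        matched_on=["title:fts"],
        provenance="discogs_dump:2024-06",
        confidence=0.92,
    )]

def rest_search(query: str) -> dict:         # REST adapter
    return {"results": [vars(h) for h in search_catalog(query)]}

def mcp_search_catalog(query: str) -> dict:  # MCP tool adapter: read-only
    hits = search_catalog(query)
    return {"results": [vars(h) for h in hits], "side_effects": None}

# The two surfaces cannot drift: they share one retrieval function.
assert rest_search("blue train")["results"] == mcp_search_catalog("blue train")["results"]
```

A write-capable tool like `create_crate_draft` would sit behind the same services but return a draft object rather than committing anything, keeping side effects explicit.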
REST endpoints:

- /search (catalog entities + filters)
- /artists/:id
- /labels/:id
- /masters/:id
- /releases/:id
- /releases/:id/media-links
- /curators/:slug
- /lists/:slug

MCP tools:

- search_catalog
- get_release, get_master_release
- get_artist, get_label
- get_related_releases
- get_media_links
- create_crate_draft
- explain_relationships

The project gets expensive and chaotic when teams start with infrastructure ambition instead of data decisions. This sequence keeps the build grounded and reversible.

1. Write the normalization dictionary from real records, not assumptions. Validate ANV handling, track positions, credits nesting, identifiers, and master/release linkage before importer code.
2. Schema migrations, dump_batches, raw_entities, parser pipeline, canonical upserts, and QA/reconciliation reports.
3. Postgres FTS + pg_trgm, core entity endpoints, query profiling/indexes, and a mobile-first retrieval UI.
4. Curators, crates, link provenance, validation jobs, and the first useful human-facing output loops.
5. Agent tools backed by the same domain services as the REST API. Deterministic retrieval, explicit side effects, and confidence/provenance in responses.
6. External catalog candidate store, confidence scoring, export jobs, and a review path for ambiguous matches.
7. Only move to OpenSearch/Elasticsearch when measured relevance or latency makes the migration worth the overhead.
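The external-catalog confidence scoring mentioned above can be sketched like this. The weights, threshold semantics, and field names are placeholders meant to be tuned against human-reviewed matches, not production values.

```python
from difflib import SequenceMatcher

def match_confidence(release: dict, candidate: dict) -> float:
    """Illustrative score for matching a Dig release to an external candidate."""
    title_sim = SequenceMatcher(None, release["title"].lower(),
                                candidate["title"].lower()).ratio()
    artist_sim = SequenceMatcher(None, release["artist"].lower(),
                                 candidate["artist"].lower()).ratio()
    # An exact catalog-number match is a strong signal; weight it heavily.
    catno_match = 1.0 if (release.get("catno") and
                          release.get("catno") == candidate.get("catno")) else 0.0
    return round(0.4 * title_sim + 0.3 * artist_sim + 0.3 * catno_match, 3)

release = {"title": "Blue Train", "artist": "John Coltrane", "catno": "BLP 1577"}
exact = {"title": "Blue Train", "artist": "John Coltrane", "catno": "BLP 1577"}
fuzzy = {"title": "Blue Trane", "artist": "J. Coltrane", "catno": None}

print(match_confidence(release, exact))  # 1.0 -> auto-accept
print(match_confidence(release, fuzzy))  # lower -> route to the review queue
```

High scores can flow straight into export jobs; anything below a tuned threshold lands in the review path for ambiguous matches.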
- Architecture: modular monolith
- Search v1: Postgres FTS + pg_trgm
- Ingest: raw payload staging in ingest.raw_entities
- Auth v1: editorial-only
- LLM strategy: no proxying; expose MCP/API and let users bring models
- Non-goals: marketplace, compliance stack, Discogs write-back dependency
The most useful feedback right now is on the thesis, scope boundaries, and implementation order — especially whether the data and MCP decisions feel tight enough for a small team build.
Is the AI-layer positioning clear enough, or does it still read like a better Discogs UI?
Are the normalization risks and raw-first staging approach called out clearly enough before implementation starts?
Does the modular monolith + Postgres search-first plan feel credible for a 1–2 person team?
Does the API/MCP + curation path sound monetizable before any marketplace/compliance overhead?