# Embedding & Retrieval Pipeline Checklist: Enhanced Embedding with Image Parsing and Metadata
**Purpose**: Author self-review of embedding pipeline and retrieval requirements quality — validates completeness, clarity, and measurability before implementation tasks are written
**Created**: 2026-04-03
**Feature**: [spec.md](../spec.md) | [research.md](../research.md) | [data-model.md](../data-model.md)
**Focus**: A (Embedding pipeline) + B (Retrieval & ranking) | Depth: Standard | Audience: Author

---
## Requirement Completeness — Embedding Pipeline
- [X] CHK001 - Is the definition of "inspect every page" complete: does the spec cover pages that have no extractable content layer (fully scanned/rasterised pages)? [Completeness, Spec §FR-001, Assumption §6] - Yes.
- [X] CHK002 - Does FR-002 define what "independently searchable" means in practice; specifically, is it clear that image chunks must be retrievable without a co-located text chunk? [Clarity, Spec §FR-002] - No image should be retrieved along with linked text.
- [X] CHK003 - Is the minimum acceptable quality of the "descriptive textual representation" (FR-003) specified (e.g., must it include structural relationships, labelled regions, or clinical terms), or is any non-empty description sufficient? [Clarity, Spec §FR-003, Gap] - Any non-empty description is sufficient. The text just below the image should contain the correct clinical term.
- [X] CHK004 - Are the caption-detection rules defined at spec level; specifically, what pattern or signal determines that a piece of text is a caption vs. body text adjacent to an image? [Clarity, Spec §FR-004, Gap] - We assume that text starting with "Fig." followed by a number is the textual description of the given image.
- [X] CHK005 - Does FR-004 specify what metadata is stored when a caption is absent — is the caption field omitted, left empty, or populated with a generated substitute? [Completeness, Spec §FR-004] - generated substitute
- [X] CHK006 - Is the "minimum meaningful-content threshold" (FR-007) quantified in the spec, or is it deferred entirely to implementation? The assumption section says "size threshold determined during implementation" — is this intentional and acceptable at the spec level? [Ambiguity, Spec §FR-007, Assumption §3] - Deferred to implementation
- [X] CHK007 - Does FR-008 specify the observable outcome of per-page image failures; specifically, is there a requirement that the book's processing status or error log is accessible to the user or admin after partial failure? [Completeness, Spec §FR-008, Gap] - Online logs.
- [X] CHK008 - Is FR-010 ("MUST NOT degrade accuracy or completeness of text-only embedding") measurable: does the spec define a baseline or acceptance criterion against which degradation can be detected? [Measurability, Spec §FR-010, Gap] - No baseline or acceptance criterion is defined.
- [X] CHK009 - Are re-embedding requirements complete — does the spec cover what happens to in-progress queries and cached results while a book is being re-embedded? [Coverage, Assumption §8, Gap] - No need to take that into account.
---
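The caption rule recorded for CHK004, together with the generated-substitute fallback from CHK005, can be sketched as follows. This is a minimal illustration under stated assumptions, not the spec's implementation: the regular expression, the function names `detect_caption` and `caption_or_substitute`, and the idea that the substitute arrives as a ready-made string are all hypothetical.

```python
import re
from typing import Optional

# Assumed pattern: "Fig." followed by optional whitespace and a number,
# e.g. "Fig. 3" or "Fig.12". The spec itself does not fix the exact syntax.
CAPTION_PATTERN = re.compile(r"^\s*Fig\.\s*\d+")


def detect_caption(text_block: str) -> Optional[str]:
    """Return the stripped text if it looks like a figure caption, else None."""
    if CAPTION_PATTERN.match(text_block):
        return text_block.strip()
    return None


def caption_or_substitute(text_block: Optional[str], generated: str) -> str:
    """Use the detected caption, or fall back to a generated substitute."""
    caption = detect_caption(text_block) if text_block else None
    return caption if caption is not None else generated
```

Under this sketch, text such as "Figure 3" or "As shown in Fig. 3" is not treated as a caption, because the rule only covers text that starts with "Fig." and a number.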
## Requirement Completeness — Retrieval & Ranking
- [X] CHK010 - Does FR-006 define how image and text chunks are ranked relative to each other: is ranking unified (single score), or are the two modalities ranked independently with separate topK controls? [Clarity, Spec §FR-006, Gap] - Ranked independently, with separate topK controls.
- [X] CHK011 - Is the relevance threshold for figure retrieval specified, i.e., at what similarity score (or other criterion) should a figure be excluded from results? [Clarity, Spec §FR-006, Gap] - Not specified.
- [X] CHK012 - Are deduplication rules defined for the case where the same figure appears both in the semantic figure search and the chunk-to-figure reference lookup: which representation wins, or are both included? [Completeness, data-model.md §RetrievalResult, Gap] - Not specified.
- [X] CHK013 - Is the requirement for parent section context expansion in the spec; specifically, is there a requirement that the LLM receives the full section text (not just the chunk) when a text chunk is retrieved? [Gap, research.md §Decision 1] - The LLM should receive the full section so that it has maximum context.
- [X] CHK014 - Does the spec define the required structure of the LLM prompt when both text context and figures are present — or is prompt design left entirely to implementation? [Completeness, Gap] - Left to implementation
- [X] CHK015 - Is SC-002 ("70% recall on image queries") sufficient as a measurability criterion — is the test set composition (10 queries) and evaluation method documented, or does it rely on an undefined manual process? [Measurability, Spec §SC-002] - Manual process.
---
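The CHK010 resolution (modalities ranked independently, each with its own topK) could look roughly like the sketch below. The `ScoredChunk` type, the `"text"` and `"image"` labels, and the function name are hypothetical, and the scores are assumed to be similarity values where higher is better.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class ScoredChunk:
    chunk_id: str
    modality: str  # assumed labels: "text" or "image"
    score: float   # assumed similarity score; higher is better


def rank_independently(
    candidates: List[ScoredChunk], top_k_text: int, top_k_image: int
) -> Dict[str, List[ScoredChunk]]:
    """Rank each modality on its own and cut each list at its own topK."""
    ranked: Dict[str, List[ScoredChunk]] = {}
    for modality, k in (("text", top_k_text), ("image", top_k_image)):
        pool = [c for c in candidates if c.modality == modality]
        pool.sort(key=lambda c: c.score, reverse=True)
        ranked[modality] = pool[:k]
    return ranked
```

Because the two lists are never merged, a low-scoring figure is not displaced by high-scoring text chunks, which is the practical difference from a unified single-score ranking.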
## Scenario Coverage — Edge & Exception Cases
- [X] CHK016 - Does the spec address the scenario where a query is relevant to a book section that has figures but none of those figures rank above the retrieval threshold: is the expected fallback behaviour defined? [Coverage, Edge Case, Gap] - In this case the figure should still be retrieved and shown to the user.
- [X] CHK017 - Is the scenario of a figure retrieved in search results but whose image file is missing from the file store covered: what should the system return to the user in that case? [Coverage, Exception Flow, Gap] - A missing-image error, shown in the frontend as a broken image link.
- [X] CHK018 - Are requirements defined for multi-image pages where images have conflicting captions or share a single composite caption: which image gets the caption, or is it duplicated? [Coverage, Spec §FR-004, Edge Case] - This case does not occur.
---
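One possible reading of the CHK016 fallback is sketched below. Several parts are assumptions rather than spec content: the threshold value is not specified (see CHK011), the function name `select_figures` is hypothetical, and falling back to every figure in the matched section is only one interpretation of "the figure should still be retrieved".

```python
from typing import List, Tuple


def select_figures(
    section_figures: List[Tuple[str, float]], threshold: float
) -> List[str]:
    """Return the figure ids to show for a section matched by the query.

    section_figures holds (figure_id, similarity_score) pairs.
    """
    above = [fig_id for fig_id, score in section_figures if score >= threshold]
    if above:
        return above
    # Assumed fallback: no figure passed the threshold, so return all
    # figures of the section instead of returning nothing.
    return [fig_id for fig_id, _ in section_figures]
```

If the intended behaviour is to show only the best-scoring figure, the fallback branch would instead return the single id with the highest score.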
## Consistency & Alignment
- [X] CHK019 - Are the metadata fields required by FR-004 and FR-005 fully consistent with the metadata schema defined in data-model.md — specifically, do the mandatory fields in the spec match the `type`, `section_id`, and `section_title` fields in the data model? [Consistency, Spec §FR-004, data-model.md §Vector Store Documents] - Left to implementation
- [X] CHK020 - Is SC-003 ("processing time ≤ 3× baseline") consistent with FR-003 — if description generation requires a vision model call per image, is the 3× cap realistic for a 500-page book with dense figures, and is this assumption documented? [Consistency, Spec §SC-003, Assumption §3, Gap] - not documented
- [X] CHK021 - Does the spec's description of citation display (FR-009) align with the `sources` format change documented in contracts/api.md: are the fields the spec says must be "distinct" actually represented distinctly in the API response? [Consistency, Spec §FR-009, contracts/api.md §4] - A section with an image source should be displayed in the frontend. Text sources and image sources are distinct.
---
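The realism question in CHK020 can be turned into a simple budget calculation. The sketch below assumes that image description runs sequentially and adds its cost on top of the text-only baseline; the function name and the example figures (a 600-second baseline and 1,000 images) are hypothetical and do not come from the spec.

```python
def max_seconds_per_image(
    baseline_seconds: float, image_count: int, cap_multiplier: float = 3.0
) -> float:
    """Largest average per-image cost that keeps total time within the cap.

    Assumes total time = baseline + image_count * per_image_cost.
    """
    if image_count <= 0:
        raise ValueError("image_count must be positive")
    extra_budget = (cap_multiplier - 1.0) * baseline_seconds
    return extra_budget / image_count


# Hypothetical example: 600 s text-only baseline, 1,000 figures in the book.
budget = max_seconds_per_image(baseline_seconds=600.0, image_count=1000)
print(f"{budget:.2f} s per image")  # prints "1.20 s per image"
```

If a vision-model call takes longer on average than this budget, the 3× cap in SC-003 can only hold with batching or parallel calls, which is the assumption that CHK020 notes as undocumented.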
## Notes
- Items marked `[Gap]` indicate requirements that appear absent or deferred; resolve before generating tasks
- Items marked `[Ambiguity]` require a clearer definition in the spec before implementation starts
- Items marked `[Consistency]` should be cross-checked between spec.md, data-model.md, and contracts/api.md
- Mark items `[x]` when resolved; add inline notes with the resolution for traceability