Files
ai-teacher/specs/002-image-aware-embedding/checklists/embedding-retrieval.md

7.0 KiB
Raw Permalink Blame History

Embedding & Retrieval Pipeline Checklist: Enhanced Embedding with Image Parsing and Metadata

Purpose: Author self-review of embedding pipeline and retrieval requirements quality — validates completeness, clarity, and measurability before implementation tasks are written Created: 2026-04-03 Feature: spec.md | research.md | data-model.md Focus: A (Embedding pipeline) + B (Retrieval & ranking) | Depth: Standard | Audience: Author


Requirement Completeness — Embedding Pipeline

  • CHK001 - Is the definition of "inspect every page" complete — does the spec cover pages that have no extractable content layer (fully scanned/rasterised pages)? Yes [Completeness, Spec §FR-001, Assumption §6]

  • CHK002 - Does FR-002 define what "independently searchable" means in practice — specifically, is it clear that image chunks must be retrievable without a co-located text chunk? [Clarity, Spec §FR-002] - No image should be retrieved along linked text.

  • CHK003 - Is the minimum acceptable quality of the "descriptive textual representation" (FR-003) specified — e.g., must it include structural relationships, labelled regions, or clinical terms — or is any non-empty description sufficient? [Clarity, Spec §FR-003, Gap] - any non-empty description sufficient. Text just below the image should have the correct clinical term.

  • [C] CHK004 - Are the caption-detection rules defined at spec level — specifically, what pattern or signal determines that a piece of text is a caption vs. body text adjacent to an image? [Clarity, Spec §FR-004, Gap] - We assume a text starting with Fig. follewed by number is a text description of a give image.

  • CHK005 - Does FR-004 specify what metadata is stored when a caption is absent — is the caption field omitted, left empty, or populated with a generated substitute? [Completeness, Spec §FR-004] - generated substitute

  • CHK006 - Is the "minimum meaningful-content threshold" (FR-007) quantified in the spec, or is it deferred entirely to implementation? The assumption section says "size threshold determined during implementation" — is this intentional and acceptable at the spec level? [Ambiguity, Spec §FR-007, Assumption §3] - Deferred to implementation

  • CHK007 - Does FR-008 specify the observable outcome of per-page image failures — specifically, is there a requirement that the book's processing status or error log is accessible to the user or admin after partial failure? [Completeness, Spec §FR-008, Gap] online logs

  • CHK008 - Is FR-010 ("MUST NOT degrade accuracy or completeness of text-only embedding") measurable — does the spec define a baseline or acceptance criterion against which degradation can be detected? [Measurability, Spec §FR-010, Gap] no definition

  • CHK009 - Are re-embedding requirements complete — does the spec cover what happens to in-progress queries and cached results while a book is being re-embedded? [Coverage, Assumption §8, Gap] - No need to take that into account.


Requirement Completeness — Retrieval & Ranking

  • CHK010 - Does FR-006 define how image and text chunks are ranked relative to each other — is ranking unified (single score), or are the two modalities ranked independently with separate topK controls? [Clarity, Spec §FR-006, Gap] - independent separated topK

  • CHK011 - Is the relevance threshold for figure retrieval specified — i.e., at what similarity score (or other criterion) should a figure be excluded from results? [Clarity, Spec §FR-006, Gap] not specified

  • CHK012 - Are deduplication rules defined for the case where the same figure appears both in the semantic figure search and the chunk-to-figure reference lookup — which representation wins, or are both included? [Completeness, data-model.md §RetrievalResult, Gap] not specified

  • CHK013 - Is the requirement for parent section context expansion in the spec — specifically, is there a requirement that the LLM receives the full section text (not just the chunk) when a text chunk is retrieved? [Gap, research.md §Decision 1] - the LLM should receive the full section to have maximum context.

  • CHK014 - Does the spec define the required structure of the LLM prompt when both text context and figures are present — or is prompt design left entirely to implementation? [Completeness, Gap] - Left to implementation

  • CHK015 - Is SC-002 ("70% recall on image queries") sufficient as a measurability criterion — is the test set composition (10 queries) and evaluation method documented, or does it rely on an undefined manual process? [Measurability, Spec §SC-002] - Manual process.


Scenario Coverage — Edge & Exception Cases

  • CHK016 - Does the spec address the scenario where a query is relevant to a book section that has figures but none of those figures rank above the retrieval threshold — is the expected fallback behaviour defined? [Coverage, Edge Case, Gap] - The figure should in this case be retrieved and shon to the user.

  • CHK017 - Is the scenario of a figure retrieved in search results but whose image file is missing from the file store covered — what should the system return to the user in that case? [Coverage, Exception Flow, Gap] - missing image error, shown in the front as a broken image link.

  • CHK018 - Are requirements defined for multi-image pages where images have conflicting captions or share a single composite caption — which image gets the caption, or is it duplicated? [Coverage, Spec §FR-004, Edge Case] - this case not exist.


Consistency & Alignment

  • CHK019 - Are the metadata fields required by FR-004 and FR-005 fully consistent with the metadata schema defined in data-model.md — specifically, do the mandatory fields in the spec match the type, section_id, and section_title fields in the data model? [Consistency, Spec §FR-004, data-model.md §Vector Store Documents] - Left to implementation

  • CHK020 - Is SC-003 ("processing time ≤ 3× baseline") consistent with FR-003 — if description generation requires a vision model call per image, is the 3× cap realistic for a 500-page book with dense figures, and is this assumption documented? [Consistency, Spec §SC-003, Assumption §3, Gap] - not documented

  • CHK021 - Does the spec's description of citation display (FR-009) align with the sources format change documented in contracts/api.md — are the fields the spec says must be "distinct" actually represented distinctly in the API response? [Consistency, Spec §FR-009, contracts/api.md §4] - A section with image-source should be displayed in the front. Text source and image-source are distinct


Notes

  • Items marked [Gap] indicate requirements that appear absent or deferred; resolve before generating tasks
  • Items marked [Ambiguity] require a clearer definition in the spec before implementation starts
  • Items marked [Consistency] should be cross-checked between spec.md, data-model.md, and contracts/api.md
  • Mark items [x] when resolved; add inline notes with the resolution for traceability