Feature Specification: Enhanced Embedding with Image Parsing and Metadata
Feature Branch: 002-image-aware-embedding
Created: 2026-04-03
Status: Draft
Input: User description: "I want to enhance the embedding process. I want also parse image from each pages if any and add proper metadata so that it can match the retrieved chunk/vector that match what user are querying."
User Scenarios & Testing (mandatory)
User Story 1 - Image Content Surfaced in Query Results (Priority: P1)
A neurosurgeon asks a question in the chat (e.g., "Show me the anatomy of the Circle of Willis") that is best answered by a diagram or figure in an uploaded book. The system retrieves the image content — its description and surrounding context — and uses it to construct a grounded answer, citing the page and book where the image appeared.
Why this priority: This is the direct, user-visible payoff of the feature. Without it, the enhancement has no observable benefit. All other stories support this outcome.
Independent Test: Upload a book containing a labelled anatomical diagram. Ask a query whose answer is conveyed by that diagram (not in the surrounding text). Confirm the system returns an answer that references the diagram's content and cites the correct book and page.
Acceptance Scenarios:
- Given a book with an anatomical diagram on page 42, When a user asks a question whose answer is only depicted in that diagram, Then the system returns a response that draws on the diagram's content and cites "Page 42, [Book Title]".
- Given a page with both text and an image, When the system retrieves that page's content, Then the image-derived content and the surrounding text are each independently retrievable and independently citable.
- Given a query that has no relevant image in any uploaded book, When the system searches, Then it does not fabricate image-derived content and falls back to text-only results (or states no relevant content was found).
User Story 2 - All Pages Scanned for Images During Embedding (Priority: P1)
When a book is uploaded and processed, every page is inspected for images. Any image found is extracted and represented as a searchable content chunk enriched with metadata (page number, book title, position on page, caption if present). Pages without images are processed as text-only chunks, unchanged from the existing behaviour.
Why this priority: This is the prerequisite for User Story 1. Without systematic per-page image detection, image content cannot be retrieved.
Independent Test: Upload a book whose pages include a mix of text-only and image-containing pages. After processing completes, verify that chunks exist for each image page and that each image chunk carries the correct metadata (page number, source book, caption).
Acceptance Scenarios:
- Given a book being processed, When the embedding pipeline runs, Then every page is evaluated for images and each detected image generates at least one content chunk.
- Given an image with a caption or label, When the chunk is created, Then the caption or label text is included in the chunk's content and metadata.
- Given a page with multiple images, When processing completes, Then each image is represented as a separate chunk with its own metadata, not merged into a single chunk.
- Given a page with no images, When processing completes, Then no image chunk is created for that page and text processing is unaffected.
User Story 3 - Rich Metadata Enables Precise Source Attribution (Priority: P2)
When the system returns a result based on image content, the user can see exactly where that image appeared: which book, which page, and what type of content (diagram, table, photograph, etc.). This gives the user confidence in the source and lets them locate the original image in their physical or digital copy of the book.
Why this priority: Metadata quality directly impacts user trust. Neurosurgeons require traceable, citable evidence. Richer metadata also improves retrieval accuracy by giving the search engine more signals to match against a query.
Independent Test: Retrieve a result sourced from an image chunk. Inspect the displayed citation and verify it includes: book title, page number, content type (e.g., "diagram"), and caption (if present in the original).
Acceptance Scenarios:
- Given a retrieved image chunk, When the system displays the source citation, Then the citation includes at minimum: book title, page number, and a content-type label (e.g., diagram, table, figure).
- Given an image chunk with a detected caption, When the citation is displayed, Then the caption text is shown alongside the other metadata fields.
- Given a topic summary that draws on both text and image chunks, When the user inspects citations, Then image-sourced and text-sourced claims are distinguishable from each other.
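The citation scenarios above can be sketched as a small formatter that builds the citation from chunk metadata and labels image-sourced results with their content type. This is only an illustration: the function name `format_citation`, its parameters, and the exact output layout are assumptions, not part of this specification.

```python
from typing import Optional


def format_citation(book_title: str, page_number: int, content_type: str,
                    caption: Optional[str] = None) -> str:
    """Render a citation string from chunk metadata (hypothetical layout)."""
    base = f"Page {page_number}, {book_title}"
    if content_type == "text":
        return base
    # Image-sourced citations carry a content-type label so that they are
    # distinguishable from text-sourced ones.
    citation = f"[{content_type}] {base}"
    if caption:
        # Show the caption alongside the other metadata when one was detected.
        citation += f' ("{caption}")'
    return citation
```

Under these assumptions, a diagram on page 42 with the caption "Circle of Willis" would be cited with a `[diagram]` prefix and the caption in parentheses, while a text chunk from the same page would show only the page and book title.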
Edge Cases
- What happens when an image is too small to contain meaningful content (e.g., a decorative bullet icon or a publisher logo)?
- How does the system handle a page that is entirely an image (scanned page with no digital text)?
- What if an image spans multiple pages (e.g., a fold-out diagram)?
- How does the system behave when an image has no caption and its surrounding text provides no useful context?
- What happens if image processing fails for a specific page — does it abort the whole book or continue with the remaining pages?
Requirements (mandatory)
Functional Requirements
- FR-001: System MUST inspect every page of an uploaded book for the presence of images during the embedding process.
- FR-002: System MUST extract each detected image and create a dedicated, independently searchable content chunk for it.
- FR-003: System MUST generate a descriptive textual representation of each extracted image so its content is semantically searchable by the retrieval system.
- FR-004: System MUST associate the following metadata with every image chunk: book title, page number, content type (e.g., diagram, table, figure, photograph), and caption text (where present).
- FR-005: System MUST include the same base metadata (book title, page number) on text chunks so that all retrieved content — image or text — carries consistent, comparable source attribution.
- FR-006: System MUST treat image chunks as first-class retrievable units: they must be ranked and returned alongside text chunks when they are relevant to a user query.
- FR-007: System MUST skip images that fall below a minimum meaningful-content threshold (e.g., decorative icons, page separators) and MUST NOT create chunks for them.
- FR-008: If image processing fails for a specific page, the system MUST log the failure, skip that page's images, and continue processing the remaining pages and text content of the book.
- FR-009: System MUST display image-sourced content citations distinctly from text-sourced citations so users can identify when a result originates from a visual element.
- FR-010: Processing a book that contains images MUST NOT degrade the accuracy or completeness of the existing text-only embedding for that book.
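A minimal sketch of how FR-007 (skip decorative images) and FR-008 (log and continue on failure) might combine in the image pass. Every name here is hypothetical: `PageImage`, `describe_images`, the `describe` callback that stands in for whatever produces the FR-003 description, and the `MIN_PAGE_AREA_FRACTION` value, since the specification leaves the real threshold to implementation.

```python
import logging
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List

logger = logging.getLogger("embedding")

# Hypothetical threshold; the real value is decided during implementation.
MIN_PAGE_AREA_FRACTION = 0.05


@dataclass
class PageImage:
    page_number: int
    area_fraction: float  # share of the page the image covers, 0.0 to 1.0
    data: bytes


def describe_images(images: Iterable[PageImage],
                    describe: Callable[[PageImage], str]) -> List[Dict[str, object]]:
    """Return one chunk dict per meaningful image, skipping failures."""
    chunks: List[Dict[str, object]] = []
    for image in images:
        if image.area_fraction < MIN_PAGE_AREA_FRACTION:
            continue  # FR-007: decorative image, no chunk is created
        try:
            description = describe(image)
        except Exception:
            # FR-008: log the failure, skip this image, keep going.
            logger.exception("Image processing failed on page %d",
                             image.page_number)
            continue
        chunks.append({"page_number": image.page_number,
                       "description": description})
    return chunks
```

Because this pass only produces image chunks and never touches text chunks, a failure in it cannot alter the text-only output, which is one way to keep FR-010 easy to verify.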
Key Entities
- Image Chunk: A searchable content unit derived from a page image. Attributes: generated description, source book title, page number, content type, caption (optional), embedding vector.
- Text Chunk: Existing unit; extended to carry explicit metadata: source book title, page number, section heading (if detectable), content type ("text").
- Chunk Metadata: Structured attributes attached to every chunk regardless of type, enabling consistent filtering and citation. Mandatory fields: book title, page number, content type. Optional fields: caption, section heading.
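The entities above could be modelled roughly as follows. This is a sketch under stated assumptions: the class and field names (`ChunkMetadata`, `Chunk`, `is_image`) are illustrative, and the specification does not prescribe a storage format or an embedding dimension.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class ChunkMetadata:
    # Mandatory fields, present on every chunk regardless of type.
    book_title: str
    page_number: int
    content_type: str  # "text", "diagram", "table", "figure", "photograph", ...
    # Optional fields.
    caption: Optional[str] = None
    section_heading: Optional[str] = None


@dataclass
class Chunk:
    content: str  # chunk text, or the generated description for an image
    metadata: ChunkMetadata
    embedding: List[float] = field(default_factory=list)

    @property
    def is_image(self) -> bool:
        # Any content type other than "text" is treated as image-derived.
        return self.metadata.content_type != "text"
```

Sharing one metadata type between text and image chunks keeps filtering and citation code identical for both, which is the consistency the Chunk Metadata entity calls for.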
Success Criteria (mandatory)
Measurable Outcomes
- SC-001: At least 90% of pages containing images in a test book result in a retrievable image chunk after processing completes.
- SC-002: A controlled set of 10 queries whose answers are conveyed by diagrams in an uploaded book returns at least 7 correct image-sourced answers (70% recall on image queries).
- SC-003: Embedding a book with images takes no more than 3× as long as embedding the same book as text-only, for books up to 500 pages.
- SC-004: Every retrieved result — text or image — includes a citation that identifies at minimum the source book title and page number, with 100% coverage across a test result set.
- SC-005: In a user evaluation with 5 representative queries that previously returned no useful results (because the answer was only in a diagram), at least 4 now return a useful, grounded answer.
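The numeric thresholds in SC-001 to SC-003 can be expressed as simple checks, which may help when scripting an evaluation run. The function names are hypothetical, and the SC-003 check assumes the criterion means the total processing time is at most three times the text-only time.

```python
def meets_sc001(image_pages: int, pages_with_chunk: int) -> bool:
    """SC-001: at least 90% of image pages yield a retrievable image chunk."""
    # Integer comparison avoids floating-point rounding at the boundary.
    return image_pages > 0 and pages_with_chunk * 10 >= image_pages * 9


def meets_sc002(correct_answers: int, total_queries: int = 10) -> bool:
    """SC-002: at least 7 of 10 diagram queries answered correctly (70%)."""
    return total_queries > 0 and correct_answers * 10 >= total_queries * 7


def meets_sc003(text_only_seconds: float, with_images_seconds: float) -> bool:
    """SC-003, read here as: total time is at most 3x the text-only time."""
    return with_images_seconds <= 3 * text_only_seconds
```

For example, a test book with 50 image pages passes SC-001 when at least 45 of them produce an image chunk, since 45 / 50 = 90%.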
Assumptions
- Books are still uploaded exclusively as PDFs; image parsing applies to PDF pages only.
- The platform already has a working text-only embedding pipeline (from feature 001); this feature enhances it without replacing or rewriting the text processing logic.
- Images worth processing are those that occupy a meaningful portion of the page; small decorative or structural images (logos, dividers, icons) are excluded based on a size threshold determined during implementation.
- The descriptive representation of an image (FR-003) is generated at embedding time, not at query time; query latency is not affected by image interpretation.
- The shared global book library model from feature 001 is retained; image chunks from a processed book are available to all users immediately upon completion.
- Scanned pages (fully rasterised pages with no digital text layer) are treated as a single full-page image; the system attempts to extract content from them but does not guarantee the same fidelity as pages with digital text.
- Per-chunk metadata is stored alongside the vector so it can be used for both retrieval filtering and source citation display without a separate lookup.
- Books already processed under feature 001 (text-only) are not automatically re-processed; re-embedding must be triggered explicitly by the user or an administrator.
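To illustrate the assumption that metadata is stored alongside each vector, the sketch below ranks text and image chunks together by cosine similarity and returns their metadata directly, with an optional content-type filter. It uses a plain in-memory list as a stand-in for the real vector store. The `search` function, the record layout, and the dictionary keys are all assumptions made for illustration.

```python
import math
from typing import Dict, List, Optional, Sequence, Tuple

# Each record pairs an embedding vector with its chunk metadata.
Record = Tuple[Sequence[float], Dict[str, object]]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(records: List[Record], query: Sequence[float], top_k: int = 3,
           content_type: Optional[str] = None) -> List[Dict[str, object]]:
    """Rank all chunks together; optionally filter by content type first."""
    candidates = [r for r in records
                  if content_type is None or r[1]["content_type"] == content_type]
    ranked = sorted(candidates, key=lambda r: cosine(r[0], query), reverse=True)
    # Metadata travels with the vector, so no separate lookup is needed.
    return [metadata for _, metadata in ranked[:top_k]]
```

Because image and text chunks share one ranking, an image chunk is returned whenever it scores higher than competing text chunks, which is the first-class treatment FR-006 asks for.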