# Feature Specification: Enhanced Embedding with Image Parsing and Metadata

**Feature Branch**: `002-image-aware-embedding`
**Created**: 2026-04-03
**Status**: Draft
**Input**: User description: "I want to enhance the embedding process. I want also parse image from each pages if any and add proper metadata so that it can match the retrieved chunk/vector that match what user are querying."

## User Scenarios & Testing *(mandatory)*

### User Story 1 - Image Content Surfaced in Query Results (Priority: P1)

A neurosurgeon asks a question in the chat (e.g., "Show me the anatomy of the Circle of Willis")
that is best answered by a diagram or figure in an uploaded book. The system retrieves the image
content (its description and surrounding context) and uses it to construct a grounded answer,
citing the page and book where the image appeared.

**Why this priority**: This is the direct, user-visible payoff of the feature. Without it, the
enhancement has no observable benefit. All other stories support this outcome.

**Independent Test**: Upload a book containing a labelled anatomical diagram. Submit a query whose
answer is conveyed by that diagram and not by the surrounding text. Confirm the system returns an
answer that references the diagram's content and cites the correct book and page.

**Acceptance Scenarios**:

1. **Given** a book with an anatomical diagram on page 42, **When** a user asks a question whose
   answer is only depicted in that diagram, **Then** the system returns a response that draws on
   the diagram's content and cites "Page 42, [Book Title]".
2. **Given** a page with both text and an image, **When** the system retrieves that page's content,
   **Then** the image-derived content and the surrounding text are each independently retrievable
   and independently citable.
3. **Given** a query that has no relevant image in any uploaded book, **When** the system searches,
   **Then** it does not fabricate image-derived content and falls back to text-only results (or
   states no relevant content was found).

---

### User Story 2 - All Pages Scanned for Images During Embedding (Priority: P1)

When a book is uploaded and processed, every page is inspected for images. Any image found is
extracted and represented as a searchable content chunk enriched with metadata (page number,
book title, position on page, caption if present). Pages without images are processed as
text-only chunks, unchanged from the existing behaviour.

**Why this priority**: This is the prerequisite for User Story 1. Without systematic per-page
image detection, image content cannot be retrieved.

**Independent Test**: Upload a book whose pages include a mix of text-only and image-containing
pages. After processing completes, verify that chunks exist for each image page and that each
image chunk carries the correct metadata (page number, source book, caption).

**Acceptance Scenarios**:

1. **Given** a book being processed, **When** the embedding pipeline runs, **Then** every page
   is evaluated for images and each detected image generates at least one content chunk.
2. **Given** an image with a caption or label, **When** the chunk is created, **Then** the
   caption or label text is included in the chunk's content and metadata.
3. **Given** a page with multiple images, **When** processing completes, **Then** each image is
   represented as a separate chunk with its own metadata, not merged into a single chunk.
4. **Given** a page with no images, **When** processing completes, **Then** no image chunk is
   created for that page and text processing is unaffected.

---

### User Story 3 - Rich Metadata Enables Precise Source Attribution (Priority: P2)

When the system returns a result based on image content, the user can see exactly where that
image appeared: which book, which page, and what type of content (diagram, table, photograph,
etc.). This gives the user confidence in the source and lets them locate the original image
in their physical or digital copy of the book.

**Why this priority**: Metadata quality directly impacts user trust. Neurosurgeons require
traceable, citable evidence. Richer metadata also improves retrieval accuracy by giving the
search engine more signals to match against a query.

**Independent Test**: Retrieve a result sourced from an image chunk. Inspect the displayed
citation and verify it includes: book title, page number, content type (e.g., "diagram"),
and caption (if present in the original).

**Acceptance Scenarios**:

1. **Given** a retrieved image chunk, **When** the system displays the source citation,
   **Then** the citation includes at minimum: book title, page number, and a content-type
   label (e.g., diagram, table, figure).
2. **Given** an image chunk with a detected caption, **When** the citation is displayed,
   **Then** the caption text is shown alongside the other metadata fields.
3. **Given** a topic summary that draws on both text and image chunks, **When** the user
   inspects citations, **Then** image-sourced and text-sourced claims are distinguishable
   from each other.

---

### Edge Cases

- What happens when an image is too small to contain meaningful content (e.g., a decorative
  bullet icon or a publisher logo)?
- How does the system handle a page that is entirely an image (scanned page with no digital text)?
- What if an image spans multiple pages (e.g., a fold-out diagram)?
- How does the system behave when an image has no caption and its surrounding text provides
  no useful context?
- What happens if image processing fails for a specific page: does the system abort the whole
  book or continue with the remaining pages?

## Requirements *(mandatory)*

### Functional Requirements

- **FR-001**: System MUST inspect every page of an uploaded book for the presence of images
  during the embedding process.
- **FR-002**: System MUST extract each detected image and create a dedicated, independently
  searchable content chunk for it.
- **FR-003**: System MUST generate a descriptive textual representation of each extracted
  image so its content is semantically searchable by the retrieval system.
- **FR-004**: System MUST associate the following metadata with every image chunk: book title,
  page number, content type (e.g., diagram, table, figure, photograph), and caption text
  (where present).
- **FR-005**: System MUST include the same base metadata (book title, page number) on text
  chunks so that all retrieved content, whether image or text, carries consistent, comparable
  source attribution.
- **FR-006**: System MUST treat image chunks as first-class retrievable units: they must be
  ranked and returned alongside text chunks when they are relevant to a user query.
- **FR-007**: System MUST skip images that fall below a minimum meaningful-content threshold
  (e.g., decorative icons, page separators) and MUST NOT create chunks for them.
- **FR-008**: If image processing fails for a specific page, the system MUST log the failure,
  skip the images on that page, and continue processing the remaining pages and text content
  of the book.
- **FR-009**: System MUST display image-sourced content citations distinctly from text-sourced
  citations so users can identify when a result originates from a visual element.
- **FR-010**: Processing a book that contains images MUST NOT degrade the accuracy or
  completeness of the existing text-only embedding for that book.

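As a rough illustration of how FR-002, FR-007, and FR-008 interact, the following is a minimal sketch and not a prescribed implementation. Every name in it (`PageImage`, `ImageChunk`, `build_image_chunks`, `MIN_PAGE_AREA_FRACTION`, and the `describe` callback standing in for FR-003) is hypothetical, and the 5% area threshold is a placeholder, since this specification leaves the actual value to implementation.

```python
import logging
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional

logger = logging.getLogger("image_embedding")

# Placeholder value: the spec defers the real threshold to implementation.
MIN_PAGE_AREA_FRACTION = 0.05


@dataclass
class PageImage:
    """An image detected on a page (hypothetical structure)."""
    page_number: int
    area_fraction: float  # share of the page area covered, 0.0 to 1.0
    content_type: str = "figure"
    caption: Optional[str] = None


@dataclass
class ImageChunk:
    """A searchable chunk derived from one image (FR-002, FR-004)."""
    book_title: str
    page_number: int
    content_type: str
    description: str
    caption: Optional[str]


def build_image_chunks(
    book_title: str,
    images: Iterable[PageImage],
    describe: Callable[[PageImage], str],
) -> List[ImageChunk]:
    """Create one chunk per meaningful image, tolerating per-image failures."""
    chunks: List[ImageChunk] = []
    for image in images:
        # FR-007: skip decorative or structural images below the threshold.
        if image.area_fraction < MIN_PAGE_AREA_FRACTION:
            continue
        try:
            # FR-003: `describe` stands in for whatever generates the text.
            description = describe(image)
        except Exception:
            # FR-008: log the failure and carry on with the rest of the book.
            logger.exception("Image processing failed on page %d", image.page_number)
            continue
        chunks.append(
            ImageChunk(book_title, image.page_number, image.content_type,
                       description, image.caption)
        )
    return chunks
```

Passing the description step in as a callback keeps the sketch independent of any particular image-interpretation approach, which this specification deliberately does not choose.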
### Key Entities

- **Image Chunk**: A searchable content unit derived from a page image. Attributes: generated
  description, source book title, page number, content type, caption (optional), embedding vector.
- **Text Chunk**: Existing unit; extended to carry explicit metadata: source book title,
  page number, section heading (if detectable), content type ("text").
- **Chunk Metadata**: Structured attributes attached to every chunk regardless of type,
  enabling consistent filtering and citation. Mandatory fields: book title, page number,
  content type. Optional fields: caption, section heading.

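The entities above could be modelled roughly as follows. This is a sketch under stated assumptions, not a committed schema: the class names, field names, and `format_citation` helper are all hypothetical, and the citation layout simply extends the "Page 42, [Book Title]" form used in the acceptance scenarios with a content-type label so that image-sourced results stand out (FR-009).

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class ChunkMetadata:
    """Attributes attached to every chunk, regardless of type."""
    book_title: str                        # mandatory
    page_number: int                       # mandatory
    content_type: str                      # mandatory: "text", "diagram", "table", ...
    caption: Optional[str] = None          # optional, image chunks only
    section_heading: Optional[str] = None  # optional, if detectable


@dataclass
class Chunk:
    """A retrievable unit: page text, or a generated image description."""
    content: str
    metadata: ChunkMetadata
    embedding: List[float] = field(default_factory=list)

    @property
    def is_image(self) -> bool:
        # Assumption: any content type other than "text" marks an image chunk.
        return self.metadata.content_type != "text"


def format_citation(chunk: Chunk) -> str:
    """Render a citation; image chunks add a type label and any caption."""
    meta = chunk.metadata
    citation = f"Page {meta.page_number}, {meta.book_title}"
    if chunk.is_image:
        citation += f" [{meta.content_type}]"
        if meta.caption:
            citation += f": {meta.caption}"
    return citation
```

With a made-up book title, `format_citation` would yield `Page 42, Neuro Atlas [diagram]: Circle of Willis` for a captioned diagram chunk and `Page 7, Neuro Atlas` for a text chunk, keeping the two kinds of source distinguishable at a glance.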
## Success Criteria *(mandatory)*

### Measurable Outcomes

- **SC-001**: At least 90% of pages containing images in a test book result in a retrievable
  image chunk after processing completes.
- **SC-002**: A controlled set of 10 queries whose answers are conveyed by diagrams in an
  uploaded book returns at least 7 correct image-sourced answers (a 70% success rate on image
  queries).
- **SC-003**: Embedding processing time for a book with images is no more than 3× the time
  needed to process the same book as text-only, for books up to 500 pages.
- **SC-004**: Every retrieved result, whether text or image, includes a citation that identifies
  at minimum the source book title and page number, with 100% coverage across a test result set.
- **SC-005**: In a user evaluation with 5 representative queries that previously returned
  no useful results (because the answer was only in a diagram), at least 4 now return a
  useful, grounded answer.

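The numeric thresholds in SC-001 to SC-003 can be checked mechanically once a test run has been measured. The sketch below is one possible way to do that; the `EvaluationRun` structure and `check_success_criteria` function are hypothetical, and SC-003 is read here as "total time with images is at most 3× the text-only time".

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class EvaluationRun:
    """Measurements from one test run (hypothetical structure)."""
    image_pages_total: int
    image_pages_with_chunk: int
    image_queries_total: int
    image_queries_correct: int
    text_only_seconds: float
    with_images_seconds: float


def check_success_criteria(run: EvaluationRun) -> Dict[str, bool]:
    """Return pass/fail per criterion; integer maths avoids float rounding."""
    return {
        # SC-001: at least 90% of image pages yield a retrievable image chunk.
        "SC-001": run.image_pages_with_chunk * 100 >= 90 * run.image_pages_total,
        # SC-002: at least 70% of diagram queries are answered correctly.
        "SC-002": run.image_queries_correct * 100 >= 70 * run.image_queries_total,
        # SC-003: processing with images takes at most 3x the text-only time.
        "SC-003": run.with_images_seconds <= 3 * run.text_only_seconds,
    }
```

SC-004 and SC-005 are left out because they depend on inspecting citations and on human judgement of usefulness, which a simple ratio check does not capture.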
## Assumptions

- Books are still uploaded exclusively as PDFs; image parsing applies to PDF pages only.
- The platform already has a working text-only embedding pipeline (from feature 001); this
  feature enhances it without replacing or rewriting the text processing logic.
- Images worth processing are those that occupy a meaningful portion of the page; small
  decorative or structural images (logos, dividers, icons) are excluded based on a size
  threshold determined during implementation.
- The descriptive representation of an image (FR-003) is generated at embedding time, not
  at query time; query latency is not affected by image interpretation.
- The shared global book library model from feature 001 is retained; image chunks from a
  processed book are available to all users immediately upon completion.
- Scanned pages (fully rasterised pages with no digital text layer) are treated as a single
  full-page image; the system attempts to extract content from them but does not guarantee
  the same fidelity as pages with digital text.
- Per-chunk metadata is stored alongside the vector so it can be used for both retrieval
  filtering and source citation display without a separate lookup.
- Books already processed under feature 001 (text-only) are not automatically re-processed;
  re-embedding must be triggered explicitly by the user or an administrator.