ai-teacher/specs/002-image-aware-embedding/spec.md

# Feature Specification: Enhanced Embedding with Image Parsing and Metadata

**Feature Branch**: `002-image-aware-embedding`
**Created**: 2026-04-03
**Status**: Draft
**Input**: User description: "I want to enhance the embedding process. I want also parse image from each pages if any and add proper metadata so that it can match the retrieved chunk/vector that match what user are querying."

## User Scenarios & Testing *(mandatory)*

### User Story 1 - Image Content Surfaced in Query Results (Priority: P1)

A neurosurgeon asks a question in the chat (e.g., "Show me the anatomy of the Circle of Willis")
that is best answered by a diagram or figure in an uploaded book. The system retrieves the image
content — its description and surrounding context — and uses it to construct a grounded answer,
citing the page and book where the image appeared.

**Why this priority**: This is the direct, user-visible payoff of the feature. Without it, the
enhancement has no observable benefit. All other stories support this outcome.

**Independent Test**: Upload a book containing a labelled anatomical diagram. Ask a query whose
answer is conveyed by that diagram (not in the surrounding text). Confirm the system returns an
answer that references the diagram's content and cites the correct book and page.

**Acceptance Scenarios**:

1. **Given** a book with an anatomical diagram on page 42, **When** a user asks a question whose
   answer is only depicted in that diagram, **Then** the system returns a response that draws on
   the diagram's content and cites "Page 42, [Book Title]".
2. **Given** a page with both text and an image, **When** the system retrieves that page's content,
   **Then** the image-derived content and the surrounding text are each independently retrievable
   and independently citable.
3. **Given** a query that has no relevant image in any uploaded book, **When** the system searches,
   **Then** it does not fabricate image-derived content and falls back to text-only results (or
   states no relevant content was found).

---

### User Story 2 - All Pages Scanned for Images During Embedding (Priority: P1)

When a book is uploaded and processed, every page is inspected for images. Any image found is
extracted and represented as a searchable content chunk enriched with metadata (page number,
book title, position on page, caption if present). Pages without images are processed as
text-only chunks, unchanged from the existing behaviour.

**Why this priority**: This is the prerequisite for User Story 1. Without systematic per-page
image detection, image content cannot be retrieved.

**Independent Test**: Upload a book whose pages include a mix of text-only and image-containing
pages. After processing completes, verify that chunks exist for each image page and that each
image chunk carries the correct metadata (page number, source book, caption).

**Acceptance Scenarios**:

1. **Given** a book being processed, **When** the embedding pipeline runs, **Then** every page
   is evaluated for images and each detected image that meets the meaningful-content threshold
   (FR-007) generates at least one content chunk.
2. **Given** an image with a caption or label, **When** the chunk is created, **Then** the
   caption or label text is included in the chunk's content and metadata.
3. **Given** a page with multiple images, **When** processing completes, **Then** each image is
   represented as a separate chunk with its own metadata, not merged into a single chunk.
4. **Given** a page with no images, **When** processing completes, **Then** no image chunk is
   created for that page and text processing is unaffected.
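
The scenarios above can be sketched as a per-page loop. This is a minimal illustration, not the
pipeline's actual design: `Page`, `PageImage`, and `build_chunks` are hypothetical names, image
detection and description are assumed to have happened upstream, and one text chunk per page is a
simplification of whatever chunking feature 001 already performs.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class PageImage:
    """Hypothetical record for one image already detected on a page."""
    description: str                 # generated textual description (FR-003)
    content_type: str = "figure"     # e.g. diagram, table, figure, photograph
    caption: Optional[str] = None


@dataclass
class Page:
    """Hypothetical record for one parsed PDF page."""
    number: int
    text: str
    images: list[PageImage] = field(default_factory=list)


def build_chunks(book_title: str, pages: list[Page]) -> list[dict]:
    """Return one text chunk per non-empty page plus one chunk per image."""
    chunks: list[dict] = []
    for page in pages:
        base = {"book_title": book_title, "page_number": page.number}
        if page.text.strip():
            chunks.append({**base, "content_type": "text", "content": page.text})
        # A page without images adds nothing here, so text handling is unaffected.
        for image in page.images:
            # Caption text goes into both the chunk content and its metadata.
            parts = [image.caption, image.description] if image.caption else [image.description]
            chunks.append({
                **base,
                "content_type": image.content_type,
                "caption": image.caption,
                "content": "\n".join(parts),
            })
    return chunks
```

Each image becomes its own dictionary, so a page with several images yields several
independently retrievable chunks rather than one merged chunk.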

---

### User Story 3 - Rich Metadata Enables Precise Source Attribution (Priority: P2)

When the system returns a result based on image content, the user can see exactly where that
image appeared: which book, which page, and what type of content (diagram, table, photograph,
etc.). This gives the user confidence in the source and lets them locate the original image
in their physical or digital copy of the book.

**Why this priority**: Metadata quality directly impacts user trust. Neurosurgeons require
traceable, citable evidence. Richer metadata also improves retrieval accuracy by giving the
search engine more signals to match against a query.

**Independent Test**: Retrieve a result sourced from an image chunk. Inspect the displayed
citation and verify it includes: book title, page number, content type (e.g., "diagram"),
and caption (if present in the original).

**Acceptance Scenarios**:

1. **Given** a retrieved image chunk, **When** the system displays the source citation,
   **Then** the citation includes at minimum: book title, page number, and a content-type
   label (e.g., diagram, table, figure).
2. **Given** an image chunk with a detected caption, **When** the citation is displayed,
   **Then** the caption text is shown alongside the other metadata fields.
3. **Given** a topic summary that draws on both text and image chunks, **When** the user
   inspects citations, **Then** image-sourced and text-sourced claims are distinguishable
   from each other.
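
One possible way to render the citation described in these scenarios is shown below. The
function name `format_citation` and the exact output layout are assumptions made for
illustration; the spec only fixes which fields must appear, not how they are displayed.

```python
from typing import Optional


def format_citation(
    book_title: str,
    page_number: int,
    content_type: str = "text",
    caption: Optional[str] = None,
) -> str:
    """Build a citation string carrying page, book, content type, and optional caption."""
    # The bracketed content-type label is what lets a reader tell
    # image-sourced claims apart from text-sourced ones.
    citation = f"Page {page_number}, {book_title} [{content_type}]"
    if caption:
        citation += f': "{caption}"'
    return citation
```

Always printing the content-type label, including for text chunks, keeps the format uniform
while still making the two kinds of source distinguishable.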

---

### Edge Cases

- What happens when an image is too small to contain meaningful content (e.g., a decorative
  bullet icon or a publisher logo)?
- How does the system handle a page that is entirely an image (scanned page with no digital text)?
- What if an image spans multiple pages (e.g., a fold-out diagram)?
- How does the system behave when an image has no caption and its surrounding text provides
  no useful context?
- What happens if image processing fails for a specific page — does it abort the whole book
  or continue with the remaining pages?

## Requirements *(mandatory)*

### Functional Requirements

- **FR-001**: System MUST inspect every page of an uploaded book for the presence of images
  during the embedding process.
- **FR-002**: System MUST extract each detected image that is not excluded under FR-007 and
  create a dedicated, independently searchable content chunk for it.
- **FR-003**: System MUST generate a descriptive textual representation of each extracted
  image so its content is semantically searchable by the retrieval system.
- **FR-004**: System MUST associate the following metadata with every image chunk: book title,
  page number, content type (e.g., diagram, table, figure, photograph), and caption text
  (where present).
- **FR-005**: System MUST include the same base metadata (book title, page number) on text
  chunks so that all retrieved content — image or text — carries consistent, comparable
  source attribution.
- **FR-006**: System MUST treat image chunks as first-class retrievable units: they must be
  ranked and returned alongside text chunks when they are relevant to a user query.
- **FR-007**: System MUST skip images that fall below a minimum meaningful-content threshold
  (e.g., decorative icons, page separators) and MUST NOT create chunks for them.
- **FR-008**: If image processing fails for a specific page, the system MUST log the failure,
  skip that page's images, and continue processing the remaining pages and the text content
  of the book.
- **FR-009**: System MUST display image-sourced content citations distinctly from text-sourced
  citations so users can identify when a result originates from a visual element.
- **FR-010**: Processing a book that contains images MUST NOT degrade the accuracy or
  completeness of the existing text-only embedding for that book.
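
FR-007 and FR-008 together describe a filter plus per-page failure isolation, which might look
roughly like the sketch below. Everything here is assumed for illustration: the 5% area
threshold, the `(page_number, page_area, images)` tuple shape, and the `describe` callback
standing in for whatever generates the image description.

```python
import logging

logger = logging.getLogger("embedding")

# Assumed threshold: the spec leaves the real value to implementation.
MIN_AREA_FRACTION = 0.05


def is_meaningful(image_area: float, page_area: float,
                  min_fraction: float = MIN_AREA_FRACTION) -> bool:
    """FR-007: keep only images that cover a meaningful share of the page."""
    return page_area > 0 and image_area / page_area >= min_fraction


def process_book_images(pages, describe):
    """Return (chunks, failed_pages) for an iterable of (page_number, page_area, images).

    `describe` is a hypothetical callable that turns an image dict into a
    text description and may raise on corrupt input.
    """
    chunks, failed_pages = [], []
    for page_number, page_area, images in pages:
        page_chunks = []
        try:
            for image in images:
                if not is_meaningful(image["area"], page_area):
                    continue  # decorative icon, logo, divider, etc.
                page_chunks.append(
                    {"page_number": page_number, "content": describe(image)}
                )
        except Exception:
            # FR-008: log, skip this page's images, keep going with the book.
            logger.exception("Image processing failed on page %d", page_number)
            failed_pages.append(page_number)
            continue
        chunks.extend(page_chunks)
    return chunks, failed_pages
```

Collecting chunks per page before extending the main list means a failure partway through a
page leaves no partial image output for that page, while text processing (handled elsewhere)
is untouched.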

### Key Entities

- **Image Chunk**: A searchable content unit derived from a page image. Attributes: generated
  description, source book title, page number, content type, caption (optional), embedding vector.
- **Text Chunk**: Existing unit; extended to carry explicit metadata: source book title,
  page number, section heading (if detectable), content type ("text").
- **Chunk Metadata**: Structured attributes attached to every chunk regardless of type,
  enabling consistent filtering and citation. Mandatory fields: book title, page number,
  content type. Optional fields: caption, section heading, position on page.
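
As a rough illustration of the Chunk Metadata entity, the mandatory and optional fields could be
modelled as a small validated record. The class name, field names, 1-based page numbering, and
the closed set of content types are all assumptions; the spec lists content types only by
example.

```python
from dataclasses import dataclass, asdict
from typing import Optional

# Assumed vocabulary, taken from the examples given in FR-004 plus "text".
CONTENT_TYPES = {"text", "diagram", "table", "figure", "photograph"}


@dataclass(frozen=True)
class ChunkMetadata:
    # Mandatory fields
    book_title: str
    page_number: int
    content_type: str
    # Optional fields
    caption: Optional[str] = None
    section_heading: Optional[str] = None

    def __post_init__(self) -> None:
        if not self.book_title:
            raise ValueError("book_title is mandatory")
        if self.page_number < 1:
            raise ValueError("page_number must be 1 or greater")
        if self.content_type not in CONTENT_TYPES:
            raise ValueError(f"unknown content_type: {self.content_type!r}")


def to_payload(metadata: ChunkMetadata) -> dict:
    """Flatten metadata into a plain dict suitable for storing next to a vector."""
    return asdict(metadata)
```

Using the same record for text and image chunks is what makes attribution consistent across
both, as FR-005 requires.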

## Success Criteria *(mandatory)*

### Measurable Outcomes

- **SC-001**: At least 90% of pages containing one or more non-decorative images in a test
  book result in a retrievable image chunk after processing completes.
- **SC-002**: A controlled set of 10 queries whose answers are conveyed by diagrams in an
  uploaded book returns at least 7 correct image-sourced answers (70% recall on image queries).
- **SC-003**: Embedding a book with images takes no more than 3× as long as processing the
  same book as text-only, for books of up to 500 pages.
- **SC-004**: Every retrieved result — text or image — includes a citation that identifies
  at minimum the source book title and page number, with 100% coverage across a test result set.
- **SC-005**: In a user evaluation with 5 representative queries that previously returned
  no useful results (because the answer was only in a diagram), at least 4 now return a
  useful, grounded answer.
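
The first three criteria reduce to simple ratios, so an evaluation script might check them as
sketched below. The function names and inputs are hypothetical, and SC-003 is read here as
"total time with images is at most 3× the text-only time", which is one interpretation of the
criterion.

```python
def image_page_coverage(pages_with_images: set, pages_with_chunks: set) -> float:
    """SC-001: share of image-bearing pages that produced a retrievable image chunk."""
    if not pages_with_images:
        return 1.0  # nothing to cover
    return len(pages_with_images & pages_with_chunks) / len(pages_with_images)


def check_success_criteria(
    coverage: float,
    correct_answers: int,
    total_queries: int,
    seconds_with_images: float,
    seconds_text_only: float,
) -> dict:
    """Return a pass/fail flag per criterion."""
    return {
        "SC-001": coverage >= 0.90,
        "SC-002": correct_answers / total_queries >= 0.70,
        # Assumed reading: total time, not added time, is bounded by 3x.
        "SC-003": seconds_with_images <= 3 * seconds_text_only,
    }
```

SC-004 and SC-005 depend on inspecting citations and on human judgement of answer quality, so
they are left out of this sketch.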

## Assumptions

- Books are still uploaded exclusively as PDFs; image parsing applies to PDF pages only.
- The platform already has a working text-only embedding pipeline (from feature 001); this
  feature enhances it without replacing or rewriting the text processing logic.
- Images worth processing are those that occupy a meaningful portion of the page; small
  decorative or structural images (logos, dividers, icons) are excluded based on a size
  threshold determined during implementation.
- The descriptive representation of an image (FR-003) is generated at embedding time, not
  at query time; query latency is not affected by image interpretation.
- The shared global book library model from feature 001 is retained; image chunks from a
  processed book are available to all users immediately upon completion.
- Scanned pages (fully rasterised pages with no digital text layer) are treated as a single
  full-page image; the system attempts to extract content from them but does not guarantee
  the same fidelity as pages with digital text.
- Per-chunk metadata is stored alongside the vector so it can be used for both retrieval
  filtering and source citation display without a separate lookup.
- Books already processed under feature 001 (text-only) are not automatically re-processed;
  re-embedding must be triggered explicitly by the user or an administrator.
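
The assumption that metadata is stored alongside each vector can be illustrated with a toy
in-memory search. This is only a sketch: the record layout, the `search` function, and the use
of brute-force cosine similarity are assumptions, not a description of the platform's vector
store.

```python
import math
from typing import Optional


def cosine(a: list, b: list) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(records: list, query_vector: list, top_k: int = 3,
           content_type: Optional[str] = None) -> list:
    """Rank records by similarity, optionally filtering on stored metadata.

    Each record is assumed to look like {"vector": [...], "metadata": {...}}.
    """
    candidates = [
        r for r in records
        if content_type is None or r["metadata"]["content_type"] == content_type
    ]
    ranked = sorted(
        candidates,
        key=lambda r: cosine(r["vector"], query_vector),
        reverse=True,
    )
    return ranked[:top_k]
```

Because every hit already carries its metadata, the caller can filter by content type and
build the citation from the same record without a second lookup. Image and text chunks share
one ranking, in line with FR-006.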