giteaadmin/ai-teacher

Adrien 5acfdd33c1 first implementation - image/drawing integration

2026-04-04 12:56:56 +02:00

Research: Enhanced Embedding with Image Parsing and Metadata

Branch: 002-image-aware-embedding | Date: 2026-04-03

This document resolves all technical unknowns identified during planning. The primary source for decisions is the detailed architecture provided directly by the project owner, supplemented by Spring AI 2.0.0-M4 API specifics.

Decision 1: Document Hierarchy Model

Decision: Adopt a four-level hierarchy — BookNode → ChapterNode → SectionNode → TextChunkNode + FigureNode. The SectionNode is the pivotal unit: it holds the full section text in Postgres and is used for parent-child context expansion at retrieval time.

Rationale: A flat page-per-document model (current implementation) loses structural context. When a user asks a multi-faceted clinical question, the LLM needs the surrounding section text, not just the matching fragment. Parent-child retrieval — where chunks point to their parent section — is the established pattern for RAG precision. The hierarchy also makes figure-to-section association explicit and queryable.
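A minimal sketch of the hierarchy and of parent-child expansion, assuming plain Java records. The node names come from the decision above; every field name and the ParentExpansion helper are hypothetical and only illustrate how retrieved chunks could be mapped back to their parent section text.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical record shapes: node names follow the decision,
// field names are illustrative assumptions.
record TextChunkNode(String chunkId, String sectionId, int chunkIndex, String text) {}

record FigureNode(String figureId, String sectionId, int page, String caption) {}

record SectionNode(String sectionId, String title, String fullText,
                   List<TextChunkNode> chunks, List<FigureNode> figures) {}

record ChapterNode(String chapterId, String title, List<SectionNode> sections) {}

record BookNode(String bookId, String title, List<ChapterNode> chapters) {}

final class ParentExpansion {
    private ParentExpansion() {}

    // Maps retrieved chunks to the full text of their parent sections.
    // Each section appears once, in the order its first chunk was retrieved.
    static List<String> expandToSectionText(List<TextChunkNode> hits,
                                            Map<String, SectionNode> sectionsById) {
        Set<String> sectionIds = new LinkedHashSet<>();
        for (TextChunkNode hit : hits) {
            sectionIds.add(hit.sectionId());
        }
        List<String> expanded = new ArrayList<>();
        for (String id : sectionIds) {
            SectionNode section = sectionsById.get(id);
            if (section != null) {
                expanded.add(section.fullText());
            }
        }
        return expanded;
    }
}
```

Two chunks from the same section expand to a single copy of that section's text, which keeps the prompt from repeating context.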

Alternatives considered:

Keep flat page model, add metadata only → rejected: insufficient for precise citation and context expansion
Chapter-level retrieval (coarser than section) → rejected: too much irrelevant context sent to LLM; cost and latency increase

Decision 2: Image Extraction Strategy

Decision: Use PDFBox (already on classpath via spring-ai-pdf-document-reader) to extract images per page. Each image is tagged with page, figure_id (derived from caption, e.g. "Fig. 12-4"), and the parent sectionId. Images are saved to local disk under the configured figure storage path, by default ./uploads/figures/{bookId}/ (see Decision 6).

Rationale: PDFBox is already present (Spring AI bundles it). No new dependency needed. Per-page extraction ensures every image is captured regardless of PDF structure.

Alternatives considered:

iText / iText7 → additional dependency with AGPL or commercial licensing; overkill for extraction
Screenshot each page as PNG, then OCR → far slower; loses vector quality

Decision 3: Figure Content Representation

Decision: Generate a textual description of each extracted image using the OpenAI vision model (GPT-4o). This description becomes the content field of the figure's vector store document. The figure caption (parsed from the surrounding text) is also included to maximise retrieval signal.

Rationale: Caption-only embedding would miss figures with no caption or with sparse labels. Vision-generated descriptions produce richer semantic content (anatomy terms, structural relationships) that matches clinical queries. The OpenAI client already in use supports image inputs; no additional dependency is required.

Alternatives considered:

Caption-only embedding → insufficient when captions are absent or terse (common in textbooks)
Local vision model (LLaVA) → requires self-hosting; out of scope for POC
OCR only → extracts text visible in image but misses non-text visual content (diagrams, MRI)

Decision 4: Dual Vector Search

Decision: At query time, run two parallel similarity searches:

Text chunk search (filtered by type = "TEXT" and book_id)
Figure search (filtered by type = "FIGURE" and book_id), matched against the figure documents from Decision 3 (vision description plus caption)

Results are merged and deduplicated. The LLM prompt receives the expanded parent section text plus a structured figure reference list.

Rationale: A single search would rank text and figures against each other; figures with terse captions would systematically lose to text chunks. Separate searches with independent topK allow tuning each modality independently.
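The dual search and merge step could look roughly like the sketch below. It is not the project's implementation: SearchHit, DualSearch, and the two Supplier arguments are hypothetical stand-ins for the real vector store calls, and the rule "keep the highest-scoring copy of a duplicate id" is an assumption about how deduplication should behave.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.function.Supplier;

// Hypothetical result type; the real code would carry the vector store document.
record SearchHit(String id, String type, double score) {}

final class DualSearch {
    private DualSearch() {}

    // Runs both searches concurrently, then merges the two result lists.
    static List<SearchHit> run(Supplier<List<SearchHit>> textSearch,
                               Supplier<List<SearchHit>> figureSearch) {
        CompletableFuture<List<SearchHit>> text = CompletableFuture.supplyAsync(textSearch);
        CompletableFuture<List<SearchHit>> figures = CompletableFuture.supplyAsync(figureSearch);
        return merge(text.join(), figures.join());
    }

    // Text hits first, then figure hits. The two modalities are not re-ranked
    // against each other; duplicates keep the higher-scoring copy.
    static List<SearchHit> merge(List<SearchHit> textHits, List<SearchHit> figureHits) {
        Map<String, SearchHit> byId = new LinkedHashMap<>();
        for (SearchHit hit : textHits) {
            byId.merge(hit.id(), hit, DualSearch::best);
        }
        for (SearchHit hit : figureHits) {
            byId.merge(hit.id(), hit, DualSearch::best);
        }
        return new ArrayList<>(byId.values());
    }

    private static SearchHit best(SearchHit a, SearchHit b) {
        return a.score() >= b.score() ? a : b;
    }
}
```

Because each supplier applies its own topK and filter, the number of text and figure results can be tuned separately before the merge.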

Alternatives considered:

Single search, filter by relevance score → figure captions score lower than text; figures are systematically under-retrieved
Post-process text results to look up linked figures only → misses figures that are relevant to the query but not explicitly referenced in the retrieved text chunks

Decision 5: Chunk-to-Figure Linking

Decision: During text parsing, whenever a pattern matching Fig\.\s+\d+[\-\.]\d+ or Figure\s+\d+[\-\.]\d+ is found in a chunk, insert a row into the chunk_figure_refs table linking chunkId → figureId. At retrieval time, after text chunks are retrieved, their associated figures are fetched from this table and added to the LLM prompt.

Rationale: Explicit linking ensures that when a text chunk is retrieved, its referenced figures are always surfaced — even if the figure's caption did not score highly in the vector search. This is the higher-recall path; dual search (Decision 4) is the higher-precision path.
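As an illustration of the reference detection, the sketch below combines the two patterns into one java.util.regex expression. The FigureRefExtractor class is hypothetical, the dot after "Fig" is escaped so that it matches a literal period, and normalising every match to the "Fig. 12-4" form is an assumption about how figure_id values are written.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class FigureRefExtractor {
    private FigureRefExtractor() {}

    // "Fig." or "Figure", whitespace, chapter number, "-" or ".", figure number.
    private static final Pattern FIGURE_REF =
            Pattern.compile("(?:Fig\\.|Figure)\\s+(\\d+)[\\-.](\\d+)");

    // Returns the distinct figure ids referenced in a chunk, in order of first
    // appearance, normalised to the assumed "Fig. <chapter>-<number>" form.
    static Set<String> extractFigureIds(String chunkText) {
        Set<String> ids = new LinkedHashSet<>();
        Matcher matcher = FIGURE_REF.matcher(chunkText);
        while (matcher.find()) {
            ids.add("Fig. " + matcher.group(1) + "-" + matcher.group(2));
        }
        return ids;
    }
}
```

Each returned id would then produce one chunk_figure_refs row for the chunk being parsed.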

Alternatives considered:

Rely entirely on dual vector search → may miss figures referenced in retrieved text but scoring below the topK threshold in the figure search

Decision 6: Image Storage

Decision: Extracted images are saved as PNG files to a local directory (${app.figure-storage.base-path}, defaults to ./uploads/figures/{bookId}/). The path is stored in figure.image_path in Postgres. A FigureStorageService interface wraps all disk I/O so the implementation can be swapped to S3 or another object store without changing callers.

Rationale: Local disk is the simplest viable option for a POC with fewer than 10 users. The interface boundary satisfies Constitution Principle II (Easy to Change).
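One possible shape for the storage boundary is sketched below. Only the FigureStorageService name comes from this document; the method signatures and the LocalDiskFigureStorage class are assumptions, and the base path would be injected from app.figure-storage.base-path in the real application.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical method set; callers depend only on this interface.
interface FigureStorageService {
    // Stores PNG bytes and returns the value to persist in figure.image_path.
    String store(String bookId, String fileName, byte[] pngBytes);

    byte[] load(String imagePath);
}

// Local-disk implementation writing to {basePath}/{bookId}/{fileName}.
final class LocalDiskFigureStorage implements FigureStorageService {
    private final Path basePath;

    LocalDiskFigureStorage(Path basePath) {
        this.basePath = basePath;
    }

    @Override
    public String store(String bookId, String fileName, byte[] pngBytes) {
        try {
            Path bookDir = Files.createDirectories(basePath.resolve(bookId));
            Path target = bookDir.resolve(fileName);
            Files.write(target, pngBytes);
            return target.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    @Override
    public byte[] load(String imagePath) {
        try {
            return Files.readAllBytes(Path.of(imagePath));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

An S3-backed class could implement the same two methods later, returning an object key instead of a file path, without any change to callers.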

Alternatives considered:

S3 from day 1 → operational overhead not justified at POC scale
Base64 in Postgres JSONB → bloats DB; complicates backup; query performance degrades

Decision 7: Figure Type Classification

Decision: Use the enum FigureType { ANATOMICAL_DIAGRAM, SURGICAL_PHOTOGRAPH, MRI_CT_SCAN, TABLE, CHART, INTRAOPERATIVE_IMAGE }. Classification is derived from:

Caption keywords ("MRI", "CT", "Fig.", "Table") — heuristic, no model needed
Fall back to ANATOMICAL_DIAGRAM if unclassifiable

Rationale: Allows the frontend to render a different icon or label per type (e.g., "MRI" badge). Heuristic classification avoids a separate model call per image at extraction time.
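A minimal sketch of the caption heuristic follows. The enum values are those listed in the decision; the FigureTypeClassifier class and the exact keyword-to-type mapping (including "chart", "graph", "intraoperative", and "photograph") are assumptions, since the document names only a few example keywords.

```java
import java.util.regex.Pattern;

enum FigureType {
    ANATOMICAL_DIAGRAM, SURGICAL_PHOTOGRAPH, MRI_CT_SCAN, TABLE, CHART, INTRAOPERATIVE_IMAGE
}

final class FigureTypeClassifier {
    private FigureTypeClassifier() {}

    // Word boundaries and case sensitivity keep "CT" from matching inside
    // words such as "section" or "fracture".
    private static final Pattern SCAN = Pattern.compile("\\b(MRI|CT)\\b");
    private static final Pattern TABLE =
            Pattern.compile("\\btable\\b", Pattern.CASE_INSENSITIVE);
    private static final Pattern CHART =
            Pattern.compile("\\b(chart|graph)\\b", Pattern.CASE_INSENSITIVE);
    private static final Pattern INTRAOPERATIVE =
            Pattern.compile("\\bintraoperative\\b", Pattern.CASE_INSENSITIVE);
    private static final Pattern PHOTOGRAPH =
            Pattern.compile("\\bphotograph\\b", Pattern.CASE_INSENSITIVE);

    // First matching rule wins; anything unclassifiable falls back to
    // ANATOMICAL_DIAGRAM, as stated in the decision.
    static FigureType classify(String caption) {
        if (caption == null || caption.isBlank()) {
            return FigureType.ANATOMICAL_DIAGRAM;
        }
        if (SCAN.matcher(caption).find()) return FigureType.MRI_CT_SCAN;
        if (TABLE.matcher(caption).find()) return FigureType.TABLE;
        if (CHART.matcher(caption).find()) return FigureType.CHART;
        if (INTRAOPERATIVE.matcher(caption).find()) return FigureType.INTRAOPERATIVE_IMAGE;
        if (PHOTOGRAPH.matcher(caption).find()) return FigureType.SURGICAL_PHOTOGRAPH;
        return FigureType.ANATOMICAL_DIAGRAM;
    }
}
```

The rule order is a design choice: a caption mentioning both a scan and a table is treated as a scan here, which may need adjusting once real captions are reviewed.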

Alternatives considered:

Vision model classification → accurate but adds latency and cost per figure; deferrable
Single FIGURE type → loses citation granularity required by spec FR-004

Decision 8: Metadata Schema for Vector Store Documents

Decision: All vector store documents carry a flat Map<String, Object> metadata for Spring AI filtering. Schema:

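| Field | Text Chunk | Figure Chunk |
|---|---|---|
| `type` | `"TEXT"` | `"FIGURE"` |
| `book_id` | ✓ | ✓ |
| `book_title` | ✓ | ✓ |
| `chapter_id` | ✓ | ✓ |
| `section_id` | ✓ | ✓ |
| `section_title` | ✓ | ✓ |
| `page_start` | ✓ | - |
| `page_end` | ✓ | - |
| `chunk_index` | ✓ | - |
| `total_chunks` | ✓ | - |
| `figure_id` | - | ✓ |
| `figure_type` | - | ✓ |
| `image_path` | - | ✓ |
| `label` | - | ✓ |
| `page` | - | ✓ |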

Rationale: Flat map is required by Spring AI FilterExpressionBuilder. Separation by type allows independent filtering in dual search.
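The schema above can be illustrated with plain java.util.Map builders. The keys are those listed in the schema; the SectionRef record, the ChunkMetadata class, and the parameter types are hypothetical and chosen only to keep the example short.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical carrier for the fields shared by both document types.
record SectionRef(String bookId, String bookTitle, String chapterId,
                  String sectionId, String sectionTitle) {}

final class ChunkMetadata {
    private ChunkMetadata() {}

    // Fields common to text and figure documents.
    private static Map<String, Object> common(String type, SectionRef ref) {
        Map<String, Object> metadata = new LinkedHashMap<>();
        metadata.put("type", type);
        metadata.put("book_id", ref.bookId());
        metadata.put("book_title", ref.bookTitle());
        metadata.put("chapter_id", ref.chapterId());
        metadata.put("section_id", ref.sectionId());
        metadata.put("section_title", ref.sectionTitle());
        return metadata;
    }

    static Map<String, Object> forTextChunk(SectionRef ref, int pageStart, int pageEnd,
                                            int chunkIndex, int totalChunks) {
        Map<String, Object> metadata = common("TEXT", ref);
        metadata.put("page_start", pageStart);
        metadata.put("page_end", pageEnd);
        metadata.put("chunk_index", chunkIndex);
        metadata.put("total_chunks", totalChunks);
        return metadata;
    }

    static Map<String, Object> forFigure(SectionRef ref, String figureId, String figureType,
                                         String imagePath, String label, int page) {
        Map<String, Object> metadata = common("FIGURE", ref);
        metadata.put("figure_id", figureId);
        metadata.put("figure_type", figureType);
        metadata.put("image_path", imagePath);
        metadata.put("label", label);
        metadata.put("page", page);
        return metadata;
    }
}
```

Every value is a string or a number at the top level of the map, so each key can be used directly in a metadata filter.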

Decision 9: Re-embedding Existing Books

Decision: Books already processed under feature 001 (text-only) are NOT automatically re-embedded. An explicit re-embed action is exposed via POST /api/v1/books/{id}/reembed (admin-triggered). The existing chunks remain valid for text queries until re-embedding completes.

Rationale: Automatic re-embedding on deploy would block the system and risk data loss if the process fails mid-way. An explicit, idempotent trigger is safer and more observable.

Decision 10: Minimum Image Size Threshold

Decision: Images smaller than 100×100 pixels are discarded and no chunk is created. This threshold filters out decorative elements (bullets, dividers, publisher logos) without a classification model.

Rationale: Neurosurgery textbook diagrams and MRI scans are never smaller than 100×100 px. The threshold is configurable via app.figure-storage.min-image-size-px in application.properties.
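The threshold check itself is small; the sketch below assumes that "smaller than 100×100" means either dimension falls below the configured minimum, which the document does not state explicitly. ImageSizeFilter and its method name are hypothetical, and the minimum would come from app.figure-storage.min-image-size-px.

```java
final class ImageSizeFilter {
    private ImageSizeFilter() {}

    // Default taken from the decision; the configured property overrides it.
    static final int DEFAULT_MIN_IMAGE_SIZE_PX = 100;

    // Assumed interpretation: an image is kept only when both its width and
    // its height reach the minimum, so thin dividers are dropped as well.
    static boolean isLargeEnough(int widthPx, int heightPx, int minSizePx) {
        return widthPx >= minSizePx && heightPx >= minSizePx;
    }
}
```

Under this reading a 99×400 px divider is discarded while a 100×100 px image is kept.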

Alternatives considered:

No threshold → decorative icons pollute the figure index
ML-based classification → accurate but adds model dependency; not needed at POC scale
