first implementation - image/drawing integration

This commit is contained in:
Adrien
2026-04-04 12:56:56 +02:00
parent fc5b22fba1
commit 5acfdd33c1
42 changed files with 2854 additions and 151 deletions
+188
View File
@@ -0,0 +1,188 @@
# Research: Enhanced Embedding with Image Parsing and Metadata
**Branch**: `002-image-aware-embedding` | **Date**: 2026-04-03
This document resolves all technical unknowns identified during planning. The primary source for
decisions is the detailed architecture provided directly by the project owner, supplemented by
Spring AI 2.0.0-M4 API specifics.
---
## Decision 1: Document Hierarchy Model
**Decision**: Adopt a four-level hierarchy — `BookNode``ChapterNode``SectionNode`
`TextChunkNode` + `FigureNode`. The `SectionNode` is the pivotal unit: it holds the full section
text in Postgres and is used for parent-child context expansion at retrieval time.
**Rationale**: A flat page-per-document model (current implementation) loses structural context.
When a user asks a multi-faceted clinical question, the LLM needs the surrounding section text,
not just the matching fragment. Parent-child retrieval — where chunks point to their parent
section — is the established pattern for RAG precision. The hierarchy also makes figure-to-section
association explicit and queryable.
**Alternatives considered**:
- Keep flat page model, add metadata only → rejected: insufficient for precise citation and
context expansion
- Chapter-level retrieval (coarser than section) → rejected: too much irrelevant context sent
to LLM; cost and latency increase
---
## Decision 2: Image Extraction Strategy
**Decision**: Use PDFBox (already on classpath via `spring-ai-pdf-document-reader`) to extract
images per page. Each image is tagged with `page`, `figure_id` (derived from caption, e.g.
"Fig. 12-4"), and the parent `sectionId`. Images are saved to local disk under
`/uploads/figures/{bookId}/`.
**Rationale**: PDFBox is already present (Spring AI bundles it). No new dependency needed.
Per-page extraction ensures every image is captured regardless of PDF structure.
**Alternatives considered**:
- iText / iText7 → additional commercial dependency; overkill for extraction
- Screenshot each page as PNG, then OCR → far slower; loses vector quality
---
## Decision 3: Figure Content Representation
**Decision**: Generate a textual description of each extracted image using the OpenAI vision
model (GPT-4o). This description becomes the `content` field of the figure's vector store
document. The figure caption (parsed from the surrounding text) is also included to maximise
retrieval signal.
**Rationale**: Caption-only embedding would miss figures with no caption or with sparse labels.
Vision-generated descriptions produce richer semantic content (anatomy terms, structural
relationships) that matches clinical queries. The OpenAI client already in use supports image
inputs; no additional dependency is required.
**Alternatives considered**:
- Caption-only embedding → insufficient when captions are absent or terse (common in textbooks)
- Local vision model (LLaVA) → requires self-hosting; out of scope for POC
- OCR only → extracts text visible in image but misses non-text visual content (diagrams, MRI)
---
## Decision 4: Dual Vector Search
**Decision**: At query time, run two parallel similarity searches:
1. Text chunk search (filtered by `type = "TEXT"` and `book_id`)
2. Figure caption search (filtered by `type = "FIGURE"` and `book_id`)
Results are merged and deduplicated. The LLM prompt receives the expanded parent section text
plus a structured figure reference list.
**Rationale**: A single search would rank text and figures against each other; figures with
terse captions would systematically lose to text chunks. Separate searches with independent
`topK` allow tuning each modality independently.
**Alternatives considered**:
- Single search, filter by relevance score → figure captions score lower than text; figures
are systematically under-retrieved
- Post-process text results to look up linked figures only → misses figures that are relevant
to the query but not explicitly referenced in the retrieved text chunks
---
## Decision 5: Chunk-to-Figure Linking
**Decision**: During text parsing, whenever a pattern matching `Fig.\s+\d+[\-\.]\d+` or
`Figure\s+\d+[\-\.]\d+` is found in a chunk, insert a row into the `chunk_figure_refs` table
linking `chunkId``figureId`. At retrieval time, after text chunks are retrieved, their
associated figures are fetched from this table and added to the LLM prompt.
**Rationale**: Explicit linking ensures that when a text chunk is retrieved, its referenced
figures are always surfaced — even if the figure's caption did not score highly in the vector
search. This is the higher-recall path; dual search (Decision 4) is the higher-precision path.
**Alternatives considered**:
- Rely entirely on dual vector search → may miss figures referenced in retrieved text but
scoring below the topK threshold in the figure search
---
## Decision 6: Image Storage
**Decision**: Extracted images are saved as PNG files to a local directory
(`${app.figure-storage.base-path}`, defaults to `./uploads/figures/{bookId}/`). The path is
stored in `figure.image_path` in Postgres. A `FigureStorageService` interface wraps all disk
I/O so the implementation can be swapped to S3 or another object store without changing
callers.
**Rationale**: Local disk is the simplest viable option for a POC with <10 users. The interface
boundary satisfies Constitution Principle II (Easy to Change).
**Alternatives considered**:
- S3 from day 1 → operational overhead not justified at POC scale
- Base64 in Postgres JSONB → bloats DB; complicates backup; query performance degrades
---
## Decision 7: Figure Type Classification
**Decision**: Use the enum `FigureType { ANATOMICAL_DIAGRAM, SURGICAL_PHOTOGRAPH, MRI_CT_SCAN,
TABLE, CHART, INTRAOPERATIVE_IMAGE }`. Classification is derived from:
1. Caption keywords ("MRI", "CT", "Fig.", "Table") — heuristic, no model needed
2. Fall back to `ANATOMICAL_DIAGRAM` if unclassifiable
**Rationale**: Allows the frontend to render different icon/label per type (e.g., "MRI" badge).
Heuristic classification avoids a separate model call per image at extraction time.
**Alternatives considered**:
- Vision model classification → accurate but adds latency and cost per figure; deferrable
- Single `FIGURE` type → loses citation granularity required by spec FR-004
---
## Decision 8: Metadata Schema for Vector Store Documents
**Decision**: All vector store documents carry a flat `Map<String, Object>` metadata for Spring
AI filtering. Schema:
| Field | Text Chunk | Figure Chunk |
|-------|-----------|-------------|
| `type` | `"TEXT"` | `"FIGURE"` |
| `book_id` | ✓ | ✓ |
| `book_title` | ✓ | ✓ |
| `chapter_id` | ✓ | ✓ |
| `section_id` | ✓ | ✓ |
| `section_title` | ✓ | ✓ |
| `page_start` | ✓ | — |
| `page_end` | ✓ | — |
| `chunk_index` | ✓ | — |
| `total_chunks` | ✓ | — |
| `figure_id` | — | ✓ |
| `figure_type` | — | ✓ |
| `image_path` | — | ✓ |
| `label` | — | ✓ |
| `page` | — | ✓ |
**Rationale**: Flat map is required by Spring AI `FilterExpressionBuilder`. Separation by `type`
allows independent filtering in dual search.
---
## Decision 9: Re-embedding Existing Books
**Decision**: Books already processed under feature 001 (text-only) are NOT automatically
re-embedded. An explicit re-embed action is exposed via `POST /api/v1/books/{id}/reembed`
(admin-triggered). The existing chunks remain valid for text queries until re-embedding completes.
**Rationale**: Automatic re-embedding on deploy would block the system and risk data loss if
the process fails mid-way. An explicit, idempotent trigger is safer and more observable.
---
## Decision 10: Minimum Image Size Threshold
**Decision**: Images smaller than 100×100 pixels are discarded and no chunk is created. This
threshold filters out decorative elements (bullets, dividers, publisher logos) without a
classification model.
**Rationale**: Neurosurgery textbook diagrams and MRI scans are never smaller than 100×100 px.
The threshold is configurable via `app.figure-storage.min-image-size-px` in
`application.properties`.
**Alternatives considered**:
- No threshold → decorative icons pollute the figure index
- ML-based classification → accurate but adds model dependency; not needed at POC scale