first implementation - image/drawing integration
@@ -0,0 +1,188 @@
# Research: Enhanced Embedding with Image Parsing and Metadata

**Branch**: `002-image-aware-embedding` | **Date**: 2026-04-03

This document resolves all technical unknowns identified during planning. The primary source for
decisions is the detailed architecture provided directly by the project owner, supplemented by
Spring AI 2.0.0-M4 API specifics.

---

## Decision 1: Document Hierarchy Model

**Decision**: Adopt a four-level hierarchy: `BookNode` → `ChapterNode` → `SectionNode` → leaf
nodes (`TextChunkNode` and `FigureNode`). The `SectionNode` is the pivotal unit: it holds the full
section text in Postgres and is used for parent-child context expansion at retrieval time.

**Rationale**: A flat page-per-document model (current implementation) loses structural context.
When a user asks a multi-faceted clinical question, the LLM needs the surrounding section text,
not just the matching fragment. Parent-child retrieval, where chunks point to their parent
section, is a well-established pattern for improving RAG precision. The hierarchy also makes
figure-to-section association explicit and queryable.

**Alternatives considered**:
- Keep flat page model, add metadata only → rejected: insufficient for precise citation and
context expansion
- Chapter-level retrieval (coarser than section) → rejected: too much irrelevant context sent
to LLM; cost and latency increase
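
The parent-child expansion described in this decision can be sketched as follows. This is a minimal illustration that assumes in-memory records in place of the Postgres-backed nodes; the record fields and the `expand` helper are hypothetical names, not the project's actual classes. Calling `expand` with the retrieved chunks and a section lookup yields each parent section once, in the order its first chunk was hit.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.Set;

final class ParentExpansion {
    // Hypothetical shapes; the real nodes are persisted in Postgres.
    record SectionNode(String id, String title, String fullText) {}
    record TextChunkNode(String id, String sectionId, String text) {}

    /** Maps retrieved chunks to their distinct parent sections, keeping first-hit order. */
    static List<SectionNode> expand(List<TextChunkNode> hits,
                                    Map<String, SectionNode> sectionsById) {
        Set<String> sectionIds = new LinkedHashSet<>();
        for (TextChunkNode hit : hits) {
            sectionIds.add(hit.sectionId());
        }
        return sectionIds.stream()
                .map(sectionsById::get)
                .filter(Objects::nonNull) // skip chunks whose section is missing
                .toList();
    }
}
```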

---

## Decision 2: Image Extraction Strategy

**Decision**: Use PDFBox (already on classpath via `spring-ai-pdf-document-reader`) to extract
images per page. Each image is tagged with `page`, `figure_id` (derived from caption, e.g.
"Fig. 12-4"), and the parent `sectionId`. Images are saved to local disk under
`./uploads/figures/{bookId}/` (see Decision 6).

**Rationale**: PDFBox is already present (Spring AI bundles it), so no new dependency is needed.
Per-page extraction captures every embedded image regardless of how the PDF is structured.

**Alternatives considered**:
- iText / iText7 → additional commercial dependency; overkill for extraction
- Screenshot each page as PNG, then OCR → far slower; loses vector quality

---

## Decision 3: Figure Content Representation

**Decision**: Generate a textual description of each extracted image using the OpenAI vision
model (GPT-4o). This description becomes the `content` field of the figure's vector store
document. The figure caption (parsed from the surrounding text) is also included to maximise
retrieval signal.

**Rationale**: Caption-only embedding would miss figures with no caption or with sparse labels.
Vision-generated descriptions produce richer semantic content (anatomy terms, structural
relationships) that matches clinical queries. The OpenAI client already in use supports image
inputs; no additional dependency is required.

**Alternatives considered**:
- Caption-only embedding → insufficient when captions are absent or terse (common in textbooks)
- Local vision model (LLaVA) → requires self-hosting; out of scope for POC
- OCR only → extracts text visible in image but misses non-text visual content (diagrams, MRI)

---

## Decision 4: Dual Vector Search

**Decision**: At query time, run two parallel similarity searches:
1. Text chunk search (filtered by `type = "TEXT"` and `book_id`)
2. Figure search over the description-plus-caption documents from Decision 3 (filtered by
`type = "FIGURE"` and `book_id`)

Results are merged and deduplicated. The LLM prompt receives the expanded parent section text
plus a structured figure reference list.

**Rationale**: A single search would rank text and figures against each other; figures with
terse captions would systematically lose to text chunks. Separate searches, each with its own
`topK`, allow each modality to be tuned independently.

**Alternatives considered**:
- Single search, filter by relevance score → figure captions score lower than text; figures
are systematically under-retrieved
- Post-process text results to look up linked figures only → misses figures that are relevant
to the query but not explicitly referenced in the retrieved text chunks
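
The dual search and merge step can be sketched as below. This is an illustration only: the two searches are passed in as plain `Supplier`s standing in for the filtered vector store queries, and the `Hit` record and `DualSearch` class are hypothetical names. Each supplier is expected to apply its own `type` filter and `topK` before returning; duplicates are dropped by document id, keeping the first occurrence.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.function.Supplier;

final class DualSearch {
    // Hypothetical hit shape: document id, metadata "type", similarity score.
    record Hit(String id, String type, double score) {}

    /** Runs both searches concurrently, then merges them, dropping duplicate ids. */
    static List<Hit> search(Supplier<List<Hit>> textSearch,
                            Supplier<List<Hit>> figureSearch) {
        CompletableFuture<List<Hit>> text = CompletableFuture.supplyAsync(textSearch);
        CompletableFuture<List<Hit>> figures = CompletableFuture.supplyAsync(figureSearch);

        // LinkedHashMap keeps text hits first, then figure hits, in arrival order.
        Map<String, Hit> merged = new LinkedHashMap<>();
        for (Hit hit : text.join()) {
            merged.putIfAbsent(hit.id(), hit);
        }
        for (Hit hit : figures.join()) {
            merged.putIfAbsent(hit.id(), hit);
        }
        return new ArrayList<>(merged.values());
    }
}
```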

---

## Decision 5: Chunk-to-Figure Linking

**Decision**: During text parsing, whenever a pattern matching `Fig\.\s+\d+[-.]\d+` or
`Figure\s+\d+[-.]\d+` is found in a chunk, insert a row into the `chunk_figure_refs` table
linking `chunkId` → `figureId`. At retrieval time, after text chunks are retrieved, their
associated figures are fetched from this table and added to the LLM prompt.

**Rationale**: Explicit linking ensures that when a text chunk is retrieved, its referenced
figures are always surfaced, even if the figure's caption did not score highly in the vector
search. This is the higher-recall path; dual search (Decision 4) is the higher-precision path.

**Alternatives considered**:
- Rely entirely on dual vector search → may miss figures referenced in retrieved text but
scoring below the topK threshold in the figure search
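
A minimal sketch of the reference scan, combining the two patterns from this decision into one regular expression with the dot in `Fig.` escaped so it matches only a literal period. Normalising both separators to a `chapter-number` key (so that "Fig. 12-4" and "Figure 12.4" resolve to the same figure) is an assumption made here for illustration, as are the class and method names.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class FigureRefs {
    // "Fig. 12-4", "Figure 12.4": group 1 = chapter number, group 2 = figure number.
    private static final Pattern REF =
            Pattern.compile("(?:Fig\\.|Figure)\\s+(\\d+)[-.](\\d+)");

    /** Returns the distinct figure keys referenced in a chunk, in order of appearance. */
    static Set<String> extract(String chunkText) {
        Set<String> keys = new LinkedHashSet<>();
        Matcher m = REF.matcher(chunkText);
        while (m.find()) {
            // Assumed canonical key: "<chapter>-<number>", whatever separator was used.
            keys.add(m.group(1) + "-" + m.group(2));
        }
        return keys;
    }
}
```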

---

## Decision 6: Image Storage

**Decision**: Extracted images are saved as PNG files to a local directory
(`${app.figure-storage.base-path}`, defaults to `./uploads/figures/{bookId}/`). The path is
stored in `figure.image_path` in Postgres. A `FigureStorageService` interface wraps all disk
I/O so the implementation can be swapped to S3 or another object store without changing
callers.

**Rationale**: Local disk is the simplest viable option for a POC with <10 users. The interface
boundary satisfies Constitution Principle II (Easy to Change).

**Alternatives considered**:
- S3 from day 1 → operational overhead not justified at POC scale
- Base64 in Postgres JSONB → bloats DB; complicates backup; query performance degrades
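
One possible shape for the storage boundary is sketched below. Only the interface name `FigureStorageService` comes from this decision; the method signatures, the `LocalDiskFigureStorage` class, and the file-name sanitising rule are assumptions for illustration. Callers depend on the interface alone, so an object-store implementation could replace the local one without touching them.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

interface FigureStorageService {
    /** Stores PNG bytes and returns the path to record in figure.image_path. */
    String save(String bookId, String figureId, byte[] pngBytes);

    byte[] load(String imagePath);
}

final class LocalDiskFigureStorage implements FigureStorageService {
    private final Path basePath;

    LocalDiskFigureStorage(Path basePath) {
        this.basePath = basePath;
    }

    @Override
    public String save(String bookId, String figureId, byte[] pngBytes) {
        try {
            Path dir = Files.createDirectories(basePath.resolve(bookId));
            // Assumed rule: keep letters, digits, '_' and '-'; replace the rest.
            String fileName = figureId.replaceAll("[^A-Za-z0-9_-]", "_") + ".png";
            Path file = dir.resolve(fileName);
            Files.write(file, pngBytes);
            return file.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    @Override
    public byte[] load(String imagePath) {
        try {
            return Files.readAllBytes(Path.of(imagePath));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```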

---

## Decision 7: Figure Type Classification

**Decision**: Use the enum `FigureType { ANATOMICAL_DIAGRAM, SURGICAL_PHOTOGRAPH, MRI_CT_SCAN,
TABLE, CHART, INTRAOPERATIVE_IMAGE }`. Classification proceeds in two steps:
1. Match caption keywords ("MRI", "CT", "Fig.", "Table"); this is a heuristic, no model needed
2. Fall back to `ANATOMICAL_DIAGRAM` if unclassifiable

**Rationale**: Allows the frontend to render a different icon or label per type (e.g., "MRI"
badge). Heuristic classification avoids a separate model call per image at extraction time.

**Alternatives considered**:
- Vision model classification → accurate but adds latency and cost per figure; deferrable
- Single `FIGURE` type → loses citation granularity required by spec FR-004
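
The keyword heuristic could look roughly like this. The sketch maps only the keywords named in this decision ("MRI", "CT", "Table") and falls back to `ANATOMICAL_DIAGRAM` for everything else; how the remaining enum values would be detected is not specified in the source, so they are left unassigned here. The `FigureTypeClassifier` class name and the whole-word matching rule are assumptions.

```java
import java.util.regex.Pattern;

enum FigureType {
    ANATOMICAL_DIAGRAM, SURGICAL_PHOTOGRAPH, MRI_CT_SCAN, TABLE, CHART, INTRAOPERATIVE_IMAGE
}

final class FigureTypeClassifier {
    // Whole-word, case-insensitive matching so "CT" does not fire on "structure".
    private static final Pattern SCAN_WORD =
            Pattern.compile("\\b(MRI|CT)\\b", Pattern.CASE_INSENSITIVE);
    private static final Pattern TABLE_WORD =
            Pattern.compile("\\bTable\\b", Pattern.CASE_INSENSITIVE);

    static FigureType classify(String caption) {
        if (caption == null || caption.isBlank()) {
            return FigureType.ANATOMICAL_DIAGRAM; // no caption: use the fallback
        }
        if (SCAN_WORD.matcher(caption).find()) {
            return FigureType.MRI_CT_SCAN;
        }
        if (TABLE_WORD.matcher(caption).find()) {
            return FigureType.TABLE;
        }
        return FigureType.ANATOMICAL_DIAGRAM; // unclassifiable: use the fallback
    }
}
```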

---

## Decision 8: Metadata Schema for Vector Store Documents

**Decision**: All vector store documents carry flat `Map<String, Object>` metadata for Spring
AI filtering. Schema:

| Field | Text Chunk | Figure Chunk |
|-------|------------|--------------|
| `type` | `"TEXT"` | `"FIGURE"` |
| `book_id` | ✓ | ✓ |
| `book_title` | ✓ | ✓ |
| `chapter_id` | ✓ | ✓ |
| `section_id` | ✓ | ✓ |
| `section_title` | ✓ | ✓ |
| `page_start` | ✓ | - |
| `page_end` | ✓ | - |
| `chunk_index` | ✓ | - |
| `total_chunks` | ✓ | - |
| `figure_id` | - | ✓ |
| `figure_type` | - | ✓ |
| `image_path` | - | ✓ |
| `label` | - | ✓ |
| `page` | - | ✓ |

**Rationale**: A flat map is required by Spring AI's `FilterExpressionBuilder`. Separation by `type`
allows independent filtering in dual search.
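
As an illustration of the schema, the two metadata maps could be assembled as follows. The keys are taken from the table above; the `ChunkMetadata` class, the `SectionRef` record, and the parameter types (for example `int` for page numbers) are hypothetical choices for this sketch.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class ChunkMetadata {
    // Hypothetical carrier for the fields shared by both document types.
    record SectionRef(String bookId, String bookTitle, String chapterId,
                      String sectionId, String sectionTitle) {}

    private static Map<String, Object> common(String type, SectionRef ref) {
        Map<String, Object> m = new LinkedHashMap<>();
        m.put("type", type);
        m.put("book_id", ref.bookId());
        m.put("book_title", ref.bookTitle());
        m.put("chapter_id", ref.chapterId());
        m.put("section_id", ref.sectionId());
        m.put("section_title", ref.sectionTitle());
        return m;
    }

    static Map<String, Object> forText(SectionRef ref, int pageStart, int pageEnd,
                                       int chunkIndex, int totalChunks) {
        Map<String, Object> m = common("TEXT", ref);
        m.put("page_start", pageStart);
        m.put("page_end", pageEnd);
        m.put("chunk_index", chunkIndex);
        m.put("total_chunks", totalChunks);
        return m;
    }

    static Map<String, Object> forFigure(SectionRef ref, String figureId, String figureType,
                                         String imagePath, String label, int page) {
        Map<String, Object> m = common("FIGURE", ref);
        m.put("figure_id", figureId);
        m.put("figure_type", figureType);
        m.put("image_path", imagePath);
        m.put("label", label);
        m.put("page", page);
        return m;
    }
}
```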

---

## Decision 9: Re-embedding Existing Books

**Decision**: Books already processed under feature 001 (text-only) are NOT automatically
re-embedded. An explicit re-embed action is exposed via `POST /api/v1/books/{id}/reembed`
(admin-triggered). The existing chunks remain valid for text queries until re-embedding completes.

**Rationale**: Automatic re-embedding on deploy would block the system and risk data loss if
the process fails mid-way. An explicit, idempotent trigger is safer and more observable.
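
The source does not say how the trigger is made idempotent. One simple option, shown here purely as an assumption, is an in-flight guard that lets the first request for a book start the job and turns repeated requests into no-ops until it finishes. `ReembedGuard` and its methods are hypothetical names.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

final class ReembedGuard {
    // Book ids with a re-embed job currently running.
    private final Set<String> inFlight = ConcurrentHashMap.newKeySet();

    /** Returns true if the caller should start the job, false if one is already running. */
    boolean tryStart(String bookId) {
        return inFlight.add(bookId);
    }

    /** Marks the job as done (success or failure) so a later request can start again. */
    void finish(String bookId) {
        inFlight.remove(bookId);
    }
}
```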

---

## Decision 10: Minimum Image Size Threshold

**Decision**: Images smaller than 100×100 pixels are discarded and no chunk is created. This
threshold filters out decorative elements (bullets, dividers, publisher logos) without a
classification model.

**Rationale**: Neurosurgery textbook diagrams and MRI scans are not expected to be smaller than
100×100 px. The threshold is configurable via `app.figure-storage.min-image-size-px` in
`application.properties`.

**Alternatives considered**:
- No threshold → decorative icons pollute the figure index
- ML-based classification → accurate but adds model dependency; not needed at POC scale
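
The size check itself is small enough to sketch directly. The decision does not say whether an image is dropped when only one dimension is under the threshold; this sketch assumes it is, so both width and height must reach the configured minimum. `ImageSizeFilter` is a hypothetical name, and the constructor argument stands in for the value of `app.figure-storage.min-image-size-px`.

```java
final class ImageSizeFilter {
    private final int minSizePx;

    ImageSizeFilter(int minSizePx) {
        this.minSizePx = minSizePx;
    }

    /** Assumed rule: keep the image only if both dimensions reach the minimum. */
    boolean accepts(int widthPx, int heightPx) {
        return widthPx >= minSizePx && heightPx >= minSizePx;
    }
}
```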