# Research: Enhanced Embedding with Image Parsing and Metadata
**Branch**: `002-image-aware-embedding` | **Date**: 2026-04-04 (updated: Marker replaces Google Document AI)
This document resolves all technical unknowns identified during planning. Decisions 1–10 cover
the core pipeline. The **Marker Study** section at the bottom explains why Marker was chosen
over Google Document AI to drive PDF parsing and figure extraction.
---
## Decision 1: Document Hierarchy Model
**Decision**: Adopt a four-level hierarchy: `BookNode` → `ChapterNode` → `SectionNode` →
`TextChunkNode` + `FigureNode`. The `SectionNode` is the pivotal unit: it holds the full section
text in Postgres and is used for parent-child context expansion at retrieval time.
**Rationale**: A flat page-per-document model (current implementation) loses structural context.
When a user asks a multi-faceted clinical question, the LLM needs the surrounding section text,
not just the matching fragment. Parent-child retrieval — where chunks point to their parent
section — is the established pattern for RAG precision. The hierarchy also makes figure-to-section
association explicit and queryable.
**Alternatives considered**:
- Keep flat page model, add metadata only → rejected: insufficient for precise citation and
context expansion
- Chapter-level retrieval (coarser than section) → rejected: too much irrelevant context sent
to LLM; cost and latency increase
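
To make the parent-child expansion concrete, the following is a minimal sketch of the hierarchy as Java records, with a helper that maps retrieved chunks back to their parent section text. Only the node names come from the decision above; every field, the `ParentExpansion` class, and the in-memory `Map` lookup (standing in for the Postgres query) are assumptions for illustration. Sections are deduplicated so that several hits from one section send its text to the LLM only once.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.Set;

// Hypothetical field sets; only the node names are taken from the decision.
record BookNode(String id, String title) {}
record ChapterNode(String id, String bookId, String title) {}
record SectionNode(String id, String chapterId, String title, String fullText) {}
record TextChunkNode(String id, String sectionId, String text) {}
record FigureNode(String id, String sectionId, String label) {}

final class ParentExpansion {
    /** Returns the full text of each distinct parent section, in first-hit order. */
    static List<String> expand(List<TextChunkNode> hits, Map<String, SectionNode> sectionsById) {
        // LinkedHashSet removes duplicate section ids while keeping retrieval order.
        Set<String> sectionIds = new LinkedHashSet<>();
        for (TextChunkNode chunk : hits) {
            sectionIds.add(chunk.sectionId());
        }
        return sectionIds.stream()
                .map(sectionsById::get)
                .filter(Objects::nonNull) // skip chunks whose section is missing
                .map(SectionNode::fullText)
                .toList();
    }
}
```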
---
## Decision 2: Document Parsing Strategy
**Decision**: Use **Marker** (local HTTP server, `http://localhost:8000/marker/upload`) as the
single entry point for PDF parsing. A single `POST` with `output_format=json` returns:
- Reading-order text blocks (headings, paragraphs) — no column-split heuristic needed
- Pre-cropped figure images as base64-encoded PNG in the `images` map of each `Figure` block
- Table, equation, and code blocks as structured HTML
`MarkerPageParser` translates the Marker JSON response into `List<PageResult>`, which is the
same internal DTO used by the rest of the pipeline.
**Rationale**: Marker handles column reordering, scanned-page OCR, and figure cropping in one
call, eliminating the PDFBox column heuristic (`PdfStructureParser`) and the PDFBox
render+crop loop in `FigureExtractionService`. Net result: fewer classes, no cloud dependency,
no GCP credentials.
**Alternatives considered**:
- PDFBox column heuristic (previous approach) → rejected: 50/50 split fails on asymmetric
columns and scanned pages
- Google Document AI Layout Parser → rejected: adds GCP credentials, per-page billing, 15-page
batch limit, and still requires PDFBox to render+crop figure regions from bounding boxes.
See Marker Study below for detailed comparison.
- Screenshot each page + OCR → far slower; loses digital text quality
---
## Decision 3: Figure Content Representation
**Decision**: Generate a textual description of each extracted image using the OpenAI vision
model (GPT-4o). This description becomes the `content` field of the figure's vector store
document. The figure caption (parsed from the surrounding text) is also included to maximise
retrieval signal.
**Rationale**: Caption-only embedding would miss figures with no caption or with sparse labels.
Vision-generated descriptions produce richer semantic content (anatomy terms, structural
relationships) that matches clinical queries. The OpenAI client already in use supports image
inputs; no additional dependency is required.
**Alternatives considered**:
- Caption-only embedding → insufficient when captions are absent or terse (common in textbooks)
- Local vision model (LLaVA) → requires self-hosting; out of scope for POC
- OCR only → extracts text visible in image but misses non-text visual content (diagrams, MRI)
---
## Decision 4: Dual Vector Search
**Decision**: At query time, run two parallel similarity searches:
1. Text chunk search (filtered by `type = "TEXT"` and `book_id`)
2. Figure caption search (filtered by `type = "FIGURE"` and `book_id`)
Results are merged and deduplicated. The LLM prompt receives the expanded parent section text
plus a structured figure reference list.
**Rationale**: A single search would rank text and figures against each other; figures with
terse captions would systematically lose to text chunks. Separate searches with independent
`topK` allow tuning each modality independently.
**Alternatives considered**:
- Single search, filter by relevance score → figure captions score lower than text; figures
are systematically under-retrieved
- Post-process text results to look up linked figures only → misses figures that are relevant
to the query but not explicitly referenced in the retrieved text chunks
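
The merge-and-deduplicate step can be sketched as follows. The two similarity searches themselves are not shown; `SearchHit` and `DualSearchMerger` are hypothetical names, and keeping the first occurrence of each id is an assumed policy rather than something the decision specifies. Text hits are placed first so that the order within each modality is preserved.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical result type; "type" mirrors the metadata values "TEXT" and "FIGURE".
record SearchHit(String id, String type, double score) {}

final class DualSearchMerger {
    /** Concatenates both result lists and drops repeated ids, keeping the first occurrence. */
    static List<SearchHit> merge(List<SearchHit> textHits, List<SearchHit> figureHits) {
        Map<String, SearchHit> byId = new LinkedHashMap<>();
        for (SearchHit hit : textHits) {
            byId.putIfAbsent(hit.id(), hit);
        }
        for (SearchHit hit : figureHits) {
            byId.putIfAbsent(hit.id(), hit);
        }
        return new ArrayList<>(byId.values());
    }
}
```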
---
## Decision 5: Chunk-to-Figure Linking
**Decision**: During text parsing, whenever a pattern matching `Fig\.\s+\d+[\-\.]\d+` or
`Figure\s+\d+[\-\.]\d+` is found in a chunk, insert a row into the `chunk_figure_refs` table
linking `chunkId` → `figureId`. At retrieval time, after text chunks are retrieved, their
associated figures are fetched from this table and added to the LLM prompt.
**Rationale**: Explicit linking ensures that when a text chunk is retrieved, its referenced
figures are always surfaced — even if the figure's caption did not score highly in the vector
search. This is the higher-recall path; dual search (Decision 4) is the higher-precision path.
**Alternatives considered**:
- Rely entirely on dual vector search → may miss figures referenced in retrieved text but
scoring below the topK threshold in the figure search
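
As an illustration of the reference detection, here is a small sketch that combines the two patterns into one regular expression and returns the figure labels found in a chunk. `FigureRefExtractor` is a hypothetical name, and the dot after `Fig` is escaped in this sketch so that it matches a literal period. Resolving each label to a `figureId` and inserting the `chunk_figure_refs` row are not shown.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class FigureRefExtractor {
    // "Fig." or "Figure", whitespace, then a label such as 12-4 or 3.2 (captured).
    private static final Pattern FIGURE_REF =
            Pattern.compile("(?:Fig\\.|Figure)\\s+(\\d+[\\-.]\\d+)");

    /** Returns the distinct figure labels referenced in a chunk, in order of appearance. */
    static Set<String> extractLabels(String chunkText) {
        Set<String> labels = new LinkedHashSet<>();
        Matcher m = FIGURE_REF.matcher(chunkText);
        while (m.find()) {
            labels.add(m.group(1));
        }
        return labels;
    }
}
```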
---
## Decision 6: Image Storage
**Decision**: Marker returns figure images as base64-encoded PNG bytes in the JSON response.
`FigureExtractionService` decodes these bytes and passes them to `FigureStorageService`, which
persists them to an S3-compatible bucket (`${app.figure-storage.bucket}`). The image path/URL
is stored in `figure.image_path` in Postgres.
The `FigureStorageService` interface is unchanged; only the caller changes (from PDFBox crop
to base64 decode).
**Rationale**: Marker's pre-cropped images remove the need for PDFBox rendering. Keeping the
`FigureStorageService` interface boundary intact satisfies Constitution Principle II (Easy to Change).
**Alternatives considered**:
- Store base64 in Postgres JSONB → bloats DB; complicates backup; query performance degrades
---
## Decision 7: Figure Type Classification
**Decision**: Use the enum `FigureType { ANATOMICAL_DIAGRAM, SURGICAL_PHOTOGRAPH, MRI_CT_SCAN,
TABLE, CHART, INTRAOPERATIVE_IMAGE }`. Classification is derived from:
1. Caption keywords ("MRI", "CT", "Fig.", "Table") — heuristic, no model needed
2. Marker `block_type` hint (`"Table"` → TABLE, `"Figure"` / `"Picture"` → ANATOMICAL_DIAGRAM default)
3. Fall back to `ANATOMICAL_DIAGRAM` if unclassifiable
**Rationale**: Allows the frontend to render different icon/label per type (e.g., "MRI" badge).
Heuristic classification avoids a separate model call per image at extraction time.
**Alternatives considered**:
- Vision model classification → accurate but adds latency and cost per figure; deferrable
- Single `FIGURE` type → loses citation granularity required by spec FR-004
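
The three-step classification can be sketched as a pure function. The enum values are the ones listed in the decision; `FigureTypeClassifier`, the exact regular expressions, and the mapping of the "MRI"/"CT" keywords to `MRI_CT_SCAN` and of a leading "Table" to `TABLE` are assumptions. The remaining enum values are not reachable from the heuristics described and are therefore never returned here.

```java
import java.util.regex.Pattern;

enum FigureType {
    ANATOMICAL_DIAGRAM, SURGICAL_PHOTOGRAPH, MRI_CT_SCAN, TABLE, CHART, INTRAOPERATIVE_IMAGE
}

final class FigureTypeClassifier {
    // Whole-word, case-sensitive match so that "CT" only fires as a standalone token.
    private static final Pattern SCAN = Pattern.compile("\\b(MRI|CT)\\b");
    // A caption that begins with the word "Table".
    private static final Pattern TABLE_CAPTION = Pattern.compile("^\\s*Table\\b");

    static FigureType classify(String caption, String blockType) {
        String text = caption == null ? "" : caption;
        // 1. Caption keywords.
        if (SCAN.matcher(text).find()) return FigureType.MRI_CT_SCAN;
        if (TABLE_CAPTION.matcher(text).find()) return FigureType.TABLE;
        // 2. Marker block_type hint.
        if ("Table".equals(blockType)) return FigureType.TABLE;
        // 3. "Figure", "Picture", and anything unclassifiable use the default.
        return FigureType.ANATOMICAL_DIAGRAM;
    }
}
```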
---
## Decision 8: Metadata Schema for Vector Store Documents
**Decision**: All vector store documents carry a flat `Map<String, Object>` metadata for Spring
AI filtering. Schema:
| Field | Text Chunk | Figure Chunk |
|-------|-----------|-------------|
| `type` | `"TEXT"` | `"FIGURE"` |
| `book_id` | ✓ | ✓ |
| `book_title` | ✓ | ✓ |
| `chapter_id` | ✓ | ✓ |
| `section_id` | ✓ | ✓ |
| `section_title` | ✓ | ✓ |
| `page_start` | ✓ | — |
| `page_end` | ✓ | — |
| `chunk_index` | ✓ | — |
| `total_chunks` | ✓ | — |
| `figure_id` | — | ✓ |
| `figure_type` | — | ✓ |
| `image_path` | — | ✓ |
| `label` | — | ✓ |
| `page` | — | ✓ |
**Rationale**: Flat map is required by Spring AI `FilterExpressionBuilder`. Separation by `type`
allows independent filtering in dual search.
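
A minimal sketch of how the two metadata shapes in the table could be assembled follows. The keys are the ones listed above; `VectorMetadata`, the method names, and the value types (`String` ids, `int` page numbers) are assumptions. All values are scalars, which keeps the map flat.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class VectorMetadata {
    /** Fields shared by text and figure documents. */
    private static Map<String, Object> common(String type, String bookId, String bookTitle,
                                              String chapterId, String sectionId,
                                              String sectionTitle) {
        Map<String, Object> m = new LinkedHashMap<>();
        m.put("type", type);
        m.put("book_id", bookId);
        m.put("book_title", bookTitle);
        m.put("chapter_id", chapterId);
        m.put("section_id", sectionId);
        m.put("section_title", sectionTitle);
        return m;
    }

    static Map<String, Object> forTextChunk(String bookId, String bookTitle, String chapterId,
                                            String sectionId, String sectionTitle,
                                            int pageStart, int pageEnd,
                                            int chunkIndex, int totalChunks) {
        Map<String, Object> m = common("TEXT", bookId, bookTitle, chapterId, sectionId, sectionTitle);
        m.put("page_start", pageStart);
        m.put("page_end", pageEnd);
        m.put("chunk_index", chunkIndex);
        m.put("total_chunks", totalChunks);
        return m;
    }

    static Map<String, Object> forFigure(String bookId, String bookTitle, String chapterId,
                                         String sectionId, String sectionTitle,
                                         String figureId, String figureType,
                                         String imagePath, String label, int page) {
        Map<String, Object> m = common("FIGURE", bookId, bookTitle, chapterId, sectionId, sectionTitle);
        m.put("figure_id", figureId);
        m.put("figure_type", figureType);
        m.put("image_path", imagePath);
        m.put("label", label);
        m.put("page", page);
        return m;
    }
}
```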
---
## Decision 9: Re-embedding Existing Books
**Decision**: Books already processed under feature 001 (text-only) are NOT automatically
re-embedded. An explicit re-embed action is exposed via `POST /api/v1/books/{id}/reembed`
(admin-triggered). The existing chunks remain valid for text queries until re-embedding completes.
**Rationale**: Automatic re-embedding on deploy would block the system and risk data loss if
the process fails mid-way. An explicit, idempotent trigger is safer and more observable.
---
## Decision 10: Minimum Image Size Threshold
**Decision**: Images smaller than 100×100 pixels are discarded and no chunk is created. Marker
returns PNG bytes; `FigureExtractionService` decodes to `BufferedImage` solely to check
dimensions. This threshold filters out decorative elements without a classification model.
**Rationale**: Neurosurgery textbook diagrams and MRI scans are never smaller than 100×100 px.
The threshold is configurable via `app.figure-storage.min-image-size-px`.
**Alternatives considered**:
- No threshold → decorative icons pollute the figure index
- ML-based classification → accurate but adds model dependency; not needed at POC scale
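
The dimension check could be sketched as below. `ImageSizeFilter` is a hypothetical helper; treating an image as too small when either side falls below the threshold, and rejecting bytes that `ImageIO` cannot decode, are assumptions about how the rule is applied. The threshold is passed in as a parameter so that it can come from `app.figure-storage.min-image-size-px`.

```java
import java.awt.image.BufferedImage;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import javax.imageio.ImageIO;

final class ImageSizeFilter {
    /** True when the bytes decode to an image whose width and height are both >= minSizePx. */
    static boolean isLargeEnough(byte[] imageBytes, int minSizePx) {
        try {
            // Decode only to read the dimensions; the pixels are not used further.
            BufferedImage image = ImageIO.read(new ByteArrayInputStream(imageBytes));
            if (image == null) {
                return false; // ImageIO found no reader for these bytes
            }
            return image.getWidth() >= minSizePx && image.getHeight() >= minSizePx;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```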
---
# Marker Study — Why Marker Replaces Google Document AI
*Added 2026-04-04.*
## What Marker Offers
Marker is an open-source, locally-runnable PDF-to-structured-content converter that uses a
pipeline of deep-learning models (surya for OCR + layout detection, texify for equations).
Key capabilities relevant to this project:
| Capability | Marker | Google Document AI |
|-----------|--------|--------------------|
| Multi-column reading order | ✅ | ✅ |
| OCR on scanned pages | ✅ | ✅ |
| Figure detection | ✅ returns pre-cropped images | ⚠️ returns bbox only; PDFBox still needed |
| Table extraction | ✅ HTML tables | ✅ |
| JSON output with image bytes | ✅ base64 in `images` map | ❌ |
| No cloud credentials | ✅ | ❌ GCP service account required |
| No per-page billing | ✅ | ❌ ~$10/1,000 pages |
| Batch size limits | None (local) | 15 pages / 20 MB per sync call |
| Setup | `pip install marker-pdf && marker_server` | GCP project + processor + IAM |
---
## Does Marker Solve the Current Pain Points?
### Pain Point 1: Naive 50/50 Column Split
**Answer: Yes, Marker fixes this completely.**
`PdfStructureParser.extractPageText()` splits pages at the horizontal midpoint with a 20%
threshold. This fails on asymmetric columns and scanned pages. Marker's surya layout model
returns blocks in natural reading order — no heuristic needed.
### Pain Point 2: Figure Detection Misses Rasterized Figures
**Answer: Yes, Marker fixes this for most cases.**
`FigureExtractionService` previously iterated PDF XObjects (only finds embedded XObject images,
misses rasterized figures and vector-path drawings). Marker's layout model detects visual
elements by type and returns the cropped image bytes directly — no PDFBox page rendering needed.
### Pain Point 3: OCR on Scanned Pages
**Answer: Yes, Marker handles scanned pages transparently via surya OCR.**
### Pain Point 4: Caption Detection
**Answer: Improved — Marker groups caption blocks with their figure block.**
The `block_type = "Caption"` block appears as a sibling or child adjacent to the `"Figure"`
block in the Marker JSON, making caption association structural rather than regex-based.
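
A structural caption lookup can be sketched as a single pass over a page's child blocks. This covers only the case where the caption is the sibling immediately after the figure; `Block`, `FigureWithCaption`, and `CaptionPairing` are hypothetical types that hold just the fields needed here, not Marker's own classes.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical flattened view of a Marker block: only the fields used below.
record Block(String blockType, String id, String html) {}
record FigureWithCaption(String figureId, String captionHtml) {}

final class CaptionPairing {
    /** Pairs each Figure block with the Caption block that immediately follows it, if any. */
    static List<FigureWithCaption> pair(List<Block> pageChildren) {
        List<FigureWithCaption> result = new ArrayList<>();
        for (int i = 0; i < pageChildren.size(); i++) {
            Block block = pageChildren.get(i);
            if (!"Figure".equals(block.blockType())) {
                continue;
            }
            String caption = null; // stays null when no caption follows the figure
            if (i + 1 < pageChildren.size()
                    && "Caption".equals(pageChildren.get(i + 1).blockType())) {
                caption = pageChildren.get(i + 1).html();
            }
            result.add(new FigureWithCaption(block.id(), caption));
        }
        return result;
    }
}
```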
---
## Marker API Integration
### Local Server Setup
```bash
pip install marker-pdf
marker_server --port 8000
```
The server exposes `POST /marker/upload`, the endpoint this project is configured to call.
### Request
```
POST http://localhost:8000/marker/upload
Content-Type: multipart/form-data
file=@document.pdf
output_format=json
```
### Response (abbreviated)
```json
{
  "output_format": "json",
  "output": {
    "block_type": "Document",
    "children": [
      {
        "block_type": "Page",
        "id": "/page/0/Page/0",
        "children": [
          {
            "block_type": "SectionHeader",
            "id": "/page/0/SectionHeader/0",
            "html": "<h1>Cavernous Sinus Anatomy</h1>"
          },
          {
            "block_type": "Text",
            "id": "/page/0/Text/1",
            "html": "<p>The cavernous sinus contains...</p>"
          },
          {
            "block_type": "Figure",
            "id": "/page/0/Figure/2",
            "html": "<figure><img src='/page/0/Figure/2'/></figure>",
            "images": {
              "/page/0/Figure/2": "iVBORw0KGgo..."
            }
          },
          {
            "block_type": "Caption",
            "id": "/page/0/Caption/3",
            "html": "<p>Fig. 12-4. Coronal cross-section...</p>"
          }
        ]
      }
    ],
    "metadata": { "page_stats": [...] }
  }
}
```
### Java Integration Pattern
```java
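// MarkerPageParser: core call. restClient, baseUrl, and pdfPath are assumed to be
// provided by the enclosing class.
import com.fasterxml.jackson.databind.JsonNode;
import org.springframework.core.io.FileSystemResource;
import org.springframework.http.MediaType;
import org.springframework.util.LinkedMultiValueMap;
import org.springframework.util.MultiValueMap;

// Multipart form: the PDF itself plus the requested output format.
MultiValueMap<String, Object> body = new LinkedMultiValueMap<>();
body.add("file", new FileSystemResource(pdfPath));
body.add("output_format", "json");
// POST the form to Marker and parse the response body as a JSON tree.
JsonNode response = restClient.post()
        .uri(baseUrl + "/marker/upload")
        .contentType(MediaType.MULTIPART_FORM_DATA)
        .body(body)
        .retrieve()
        .body(JsonNode.class);
// The block tree (Document, then Page, then content blocks) sits under "output".
JsonNode document = response.get("output");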
```
### Mapping Marker Blocks to PageResult
```
Page block (id "/page/N/Page/M")    → PageResult(pageNumber = N+1)
  SectionHeader children            → headingTitle (first match)
  Text, TextInlineMath children     → orderedText (HTML stripped, joined \n\n)
  Figure children with images map   → FigureData(imageBytes = base64decode(images[id]))
  Caption sibling of Figure         → FigureData.nearestCaption
```
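
Two of the mapping rules above, the 0-based page index to 1-based `pageNumber` conversion and the HTML stripping, can be sketched with the JDK alone. `MarkerBlockMapping` is a hypothetical helper. The tag stripper is deliberately naive: it suits short fragments like those in the `html` fields, does not decode entities, and is not a general HTML parser.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class MarkerBlockMapping {
    // Captures N from ids of the form "/page/N/...".
    private static final Pattern PAGE_ID = Pattern.compile("^/page/(\\d+)/");

    /** Converts a Marker block id such as "/page/0/Page/0" to a 1-based page number. */
    static int pageNumber(String blockId) {
        Matcher m = PAGE_ID.matcher(blockId);
        if (!m.find()) {
            throw new IllegalArgumentException("Unexpected block id: " + blockId);
        }
        return Integer.parseInt(m.group(1)) + 1;
    }

    /** Removes anything that looks like a tag and trims surrounding whitespace. */
    static String stripHtml(String html) {
        return html.replaceAll("<[^>]+>", "").trim();
    }
}
```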
---
## Architecture Change
```
Before (Document AI, removed):
  DocumentAiPageParser
    → Google Document AI API (GCP, 15-page batches, credentials)
    → returns text blocks + figure bboxes
  PdfStructureParser (PDFBox column heuristic)
  FigureExtractionService
    → renders page via PDFBox at 150 DPI
    → crops bbox region

After (Marker):
  MarkerPageParser
    → POST PDF to http://localhost:8000/marker/upload (output_format=json)
    → returns text blocks (correct reading order) + Figure blocks with base64 images
    → produces List<PageResult> (same DTO, FigureData carries bytes not bbox)
  FigureExtractionService (simplified)
    → base64-decodes image bytes from PageResult.FigureData
    → checks min size (ImageIO.read → getWidth/getHeight)
    → saves to S3 via FigureStorageService (UNCHANGED)
  VisionDescriptionService (UNCHANGED)
  BookEmbeddingService orchestration (MINOR: inject MarkerPageParser)
```
**What is removed**:
- `DocumentAiPageParser` — replaced by `MarkerPageParser`
- `DocumentAiConfig` — replaced by `MarkerConfig`
- `PdfStructureParser` — Marker handles reading order
- `google-cloud-document-ai` Maven dependency
- `app.document-ai.*` configuration properties
**What stays the same**:
- `PageResult` DTO structure (fields renamed, not restructured)
- `FigureExtractionService` public interface
- `TextChunkingService`, `VisionDescriptionService`, `BookEmbeddingService` orchestration
- All JPA entities, repositories, vector store, S3 storage
---
## Constitution Compliance
| Principle | Assessment |
|-----------|------------|
| **I. KISS** | ✅ Simpler than Document AI — one HTTP call replaces GCP SDK + PDFBox render loop. No new dependency beyond an HTTP client (Spring RestClient, already available). |
| **II. Easy to Change** | ✅ `MarkerPageParser` is the only Marker-aware class. Swap it to use any other parser. `PageResult` DTO unchanged in contract. |
| **III. Web-First** | ✅ Internal pipeline change; no API contract change. |
| **IV. Documentation** | ✅ README must show Marker as a local external service dependency. |
---
## Risks & Mitigations
| Risk | Likelihood | Mitigation |
|------|-----------|------------|
| Marker server not running when book is uploaded | Medium | `BookEmbeddingService` catches exception from `MarkerPageParser`, marks book as `FAILED`, logs full error. |
| Marker misses some figures (complex PDFs) | Medium | `app.figure-storage.min-image-size-px` threshold can be tuned. Add fallback: if Marker returns 0 figures for a page with known images, log a warning. |
| SC-003 (≤ 3× processing time) violated | Low | Marker runs locally (no network latency to cloud). Benchmark with a real 500-page book early. |
| Large PDF upload to Marker (>100MB) | Low | Marker server handles the full file; no batching needed. Multipart upload limit configurable. |
| Marker image quality vs PDFBox crop | Low | Marker crops at native resolution; quality is equivalent or better than 150 DPI PDFBox render. |