# Research: Enhanced Embedding with Image Parsing and Metadata

**Branch**: `002-image-aware-embedding` | **Date**: 2026-04-04 (updated: Marker replaces Google Document AI)

This document resolves all technical unknowns identified during planning. Decisions 1–10 cover
the core pipeline. The **Marker Study** section at the bottom explains why Marker was chosen
over Google Document AI to drive PDF parsing and figure extraction.

---

## Decision 1: Document Hierarchy Model

**Decision**: Adopt a four-level hierarchy — `BookNode` → `ChapterNode` → `SectionNode` →
`TextChunkNode` + `FigureNode`. The `SectionNode` is the pivotal unit: it holds the full section
text in Postgres and is used for parent-child context expansion at retrieval time.

**Rationale**: A flat page-per-document model (current implementation) loses structural context.
When a user asks a multi-faceted clinical question, the LLM needs the surrounding section text,
not just the matching fragment. Parent-child retrieval — where chunks point to their parent
section — is the established pattern for RAG precision. The hierarchy also makes figure-to-section
association explicit and queryable.

**Alternatives considered**:
- Keep flat page model, add metadata only → rejected: insufficient for precise citation and
  context expansion
- Chapter-level retrieval (coarser than section) → rejected: too much irrelevant context sent
  to LLM; cost and latency increase

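
The parent-child expansion described in this decision can be sketched as follows. This is a minimal illustration, not the project's implementation: the record shapes, field names, and the `ParentExpansionSketch` class are hypothetical stand-ins for the real JPA entities, and the section lookup is modelled as an in-memory map rather than a Postgres query.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical shapes for illustration only; the real nodes are JPA entities.
record SectionNode(String id, String chapterId, String title, String fullText) {}
record TextChunkNode(String id, String sectionId, String text) {}

final class ParentExpansionSketch {
    /** Maps retrieved chunks to their distinct parent sections, keeping hit order. */
    static List<SectionNode> expand(List<TextChunkNode> hits,
                                    Map<String, SectionNode> sectionsById) {
        Map<String, SectionNode> parents = new LinkedHashMap<>();
        for (TextChunkNode hit : hits) {
            SectionNode parent = sectionsById.get(hit.sectionId());
            if (parent != null) {
                // Several chunks of one section expand to that section only once.
                parents.putIfAbsent(parent.id(), parent);
            }
        }
        return List.copyOf(parents.values());
    }
}
```

Two chunks from the same section therefore contribute that section's full text to the prompt once, which keeps the expanded context from growing with the number of matching fragments.
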
---

## Decision 2: Document Parsing Strategy

**Decision**: Use **Marker** (local HTTP server, `http://localhost:8000/marker/upload`) as the
single entry point for PDF parsing. A single `POST` with `output_format=json` returns:
- Reading-order text blocks (headings, paragraphs) — no column-split heuristic needed
- Pre-cropped figure images as base64-encoded PNG in the `images` map of each `Figure` block
- Table, equation, and code blocks as structured HTML

`MarkerPageParser` translates the Marker JSON response into `List<PageResult>`, which is the
same internal DTO used by the rest of the pipeline.

**Rationale**: Marker handles column reordering, scanned-page OCR, and figure cropping in one
call, eliminating the PDFBox column heuristic (`PdfStructureParser`) and the PDFBox
render+crop loop in `FigureExtractionService`. Net result: fewer classes, no cloud dependency,
no GCP credentials.

**Alternatives considered**:
- PDFBox column heuristic (previous approach) → rejected: 50/50 split fails on asymmetric
  columns and scanned pages
- Google Document AI Layout Parser → rejected: adds GCP credentials, per-page billing, 15-page
  batch limit, and still requires PDFBox to render+crop figure regions from bounding boxes.
  See Marker Study below for detailed comparison.
- Screenshot each page + OCR → far slower; loses digital text quality

---

## Decision 3: Figure Content Representation

**Decision**: Generate a textual description of each extracted image using the OpenAI vision
model (GPT-4o). This description becomes the `content` field of the figure's vector store
document. The figure caption (parsed from the surrounding text) is also included to maximise
retrieval signal.

**Rationale**: Caption-only embedding would miss figures with no caption or with sparse labels.
Vision-generated descriptions produce richer semantic content (anatomy terms, structural
relationships) that matches clinical queries. The OpenAI client already in use supports image
inputs; no additional dependency is required.

**Alternatives considered**:
- Caption-only embedding → insufficient when captions are absent or terse (common in textbooks)
- Local vision model (LLaVA) → requires self-hosting; out of scope for POC
- OCR only → extracts text visible in image but misses non-text visual content (diagrams, MRI)

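
One way the caption and the vision description might be combined into the embedded `content` string is sketched below. The `FigureContentSketch` class, the `compose` method, and the `Caption:` / `Description:` prefixes are assumptions made for illustration; the source only states that both pieces are included.

```java
final class FigureContentSketch {
    /** Joins caption and vision description; either part may be missing. */
    static String compose(String caption, String visionDescription) {
        StringBuilder content = new StringBuilder();
        if (caption != null && !caption.isBlank()) {
            content.append("Caption: ").append(caption.strip());
        }
        if (visionDescription != null && !visionDescription.isBlank()) {
            if (content.length() > 0) {
                content.append("\n\n"); // blank line between the two parts
            }
            content.append("Description: ").append(visionDescription.strip());
        }
        return content.toString();
    }
}
```

A figure without a caption still yields a non-empty document as long as the vision model returned a description, which is the case this decision is meant to cover.
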
---

## Decision 4: Dual Vector Search

**Decision**: At query time, run two parallel similarity searches:
1. Text chunk search (filtered by `type = "TEXT"` and `book_id`)
2. Figure caption search (filtered by `type = "FIGURE"` and `book_id`)

Results are merged and deduplicated. The LLM prompt receives the expanded parent section text
plus a structured figure reference list.

**Rationale**: A single search would rank text and figures against each other; figures with
terse captions would systematically lose to text chunks. Separate searches with independent
`topK` allow tuning each modality independently.

**Alternatives considered**:
- Single search, filter by relevance score → figure captions score lower than text; figures
  are systematically under-retrieved
- Post-process text results to look up linked figures only → misses figures that are relevant
  to the query but not explicitly referenced in the retrieved text chunks

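
The two-search-then-merge flow can be sketched in plain Java as below. This is a hedged illustration under stated assumptions: the `Hit` record and `DualSearchSketch` class are hypothetical, and the `query` function stands in for a filtered vector store call (it receives the `type` filter value and the `topK` for that modality); it is not the Spring AI API.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.function.BiFunction;

// Hypothetical result shape: document id, metadata type, similarity score.
record Hit(String docId, String type, double score) {}

final class DualSearchSketch {
    /** query: (type filter, topK) -> hits. Stands in for the real vector store search. */
    static List<Hit> run(BiFunction<String, Integer, List<Hit>> query,
                         int textTopK, int figureTopK) {
        // Start both searches before waiting on either, so they run in parallel.
        CompletableFuture<List<Hit>> text =
                CompletableFuture.supplyAsync(() -> query.apply("TEXT", textTopK));
        CompletableFuture<List<Hit>> figures =
                CompletableFuture.supplyAsync(() -> query.apply("FIGURE", figureTopK));

        // Merge with text hits first; a repeated docId keeps its first occurrence.
        Map<String, Hit> merged = new LinkedHashMap<>();
        for (Hit h : text.join()) merged.putIfAbsent(h.docId(), h);
        for (Hit h : figures.join()) merged.putIfAbsent(h.docId(), h);
        return new ArrayList<>(merged.values());
    }
}
```

Because each modality has its own `topK`, figure hits are never ranked against text hits; they are only concatenated after each list has been cut to size.
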
---

## Decision 5: Chunk-to-Figure Linking

**Decision**: During text parsing, whenever a pattern matching `Fig\.\s+\d+[\-\.]\d+` or
`Figure\s+\d+[\-\.]\d+` is found in a chunk, insert a row into the `chunk_figure_refs` table
linking `chunkId` → `figureId`. At retrieval time, after text chunks are retrieved, their
associated figures are fetched from this table and added to the LLM prompt.

**Rationale**: Explicit linking ensures that when a text chunk is retrieved, its referenced
figures are always surfaced — even if the figure's caption did not score highly in the vector
search. This is the higher-recall path; dual search (Decision 4) is the higher-precision path.

**Alternatives considered**:
- Rely entirely on dual vector search → may miss figures referenced in retrieved text but
  scoring below the topK threshold in the figure search

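
The reference detection can be sketched with `java.util.regex` as follows. The sketch merges the two patterns into one alternation and escapes the literal dot after `Fig` so that it matches only a period; the `FigureRefSketch` class and the idea of returning the bare label (for example `12-4`) for a later lookup of `figureId` are assumptions, not the project's code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class FigureRefSketch {
    // "Fig." or "Figure", whitespace, then a label such as 12-4 or 3.2 (group 1).
    private static final Pattern FIGURE_REF =
            Pattern.compile("(?:Fig\\.|Figure)\\s+(\\d+[\\-\\.]\\d+)");

    /** Returns the distinct figure labels referenced in a chunk, in order of appearance. */
    static List<String> extractLabels(String chunkText) {
        List<String> labels = new ArrayList<>();
        Matcher m = FIGURE_REF.matcher(chunkText);
        while (m.find()) {
            String label = m.group(1);
            if (!labels.contains(label)) {
                labels.add(label);
            }
        }
        return labels;
    }
}
```

Each returned label would then be resolved to a figure of the same book and written as one `chunk_figure_refs` row; references without a chapter-figure pair (such as `Figure 7`) are not matched by the patterns in this decision.
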
---

## Decision 6: Image Storage

**Decision**: Marker returns figure images as base64-encoded PNG bytes in the JSON response.
`FigureExtractionService` decodes these bytes and passes them to `FigureStorageService`, which
persists them to an S3-compatible bucket (`${app.figure-storage.bucket}`). The image path/URL
is stored in `figure.image_path` in Postgres.

The `FigureStorageService` interface is unchanged; only the caller changes (from PDFBox crop
to base64 decode).

**Rationale**: Marker's pre-cropped images remove the need for PDFBox rendering. The
`FigureStorageService` interface boundary satisfies Constitution Principle II (Easy to Change).

**Alternatives considered**:
- Store base64 in Postgres JSONB → bloats DB; complicates backup; query performance degrades

---

## Decision 7: Figure Type Classification

**Decision**: Use the enum `FigureType { ANATOMICAL_DIAGRAM, SURGICAL_PHOTOGRAPH, MRI_CT_SCAN,
TABLE, CHART, INTRAOPERATIVE_IMAGE }`. Classification is derived from:
1. Caption keywords ("MRI", "CT", "Fig.", "Table") — heuristic, no model needed
2. Marker `block_type` hint (`"Table"` → TABLE, `"Figure"` / `"Picture"` → ANATOMICAL_DIAGRAM default)
3. Fall back to `ANATOMICAL_DIAGRAM` if unclassifiable

**Rationale**: Allows the frontend to render a different icon or label per type (e.g., "MRI" badge).
Heuristic classification avoids a separate model call per image at extraction time.

**Alternatives considered**:
- Vision model classification → accurate but adds latency and cost per figure; deferrable
- Single `FIGURE` type → loses citation granularity required by spec FR-004

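
The three-step heuristic might look like the sketch below. The enum values come from this decision; the `FigureTypeSketch` class, the whole-word matching, and the exact keyword-to-type mapping are assumptions. Only the `MRI`, `CT`, and `Table` keywords are mapped here because the source does not say which keywords select the remaining types.

```java
import java.util.Locale;

enum FigureType {
    ANATOMICAL_DIAGRAM, SURGICAL_PHOTOGRAPH, MRI_CT_SCAN, TABLE, CHART, INTRAOPERATIVE_IMAGE
}

final class FigureTypeSketch {
    /** caption may be null; blockType is Marker's block_type, e.g. "Figure" or "Table". */
    static FigureType classify(String caption, String blockType) {
        String c = caption == null ? "" : caption.toUpperCase(Locale.ROOT);
        // 1. Caption keywords, matched as whole words so "CT" does not fire inside "SECTION".
        if (c.matches("(?s).*\\b(MRI|CT)\\b.*")) return FigureType.MRI_CT_SCAN;
        if (c.matches("(?s).*\\bTABLE\\b.*")) return FigureType.TABLE;
        // 2. Marker block_type hint.
        if ("Table".equals(blockType)) return FigureType.TABLE;
        // 3. Default for "Figure", "Picture", and anything unclassifiable.
        return FigureType.ANATOMICAL_DIAGRAM;
    }
}
```

Keeping the rules in one pure function makes it straightforward to swap in a vision-model classifier later without touching callers.
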
---

## Decision 8: Metadata Schema for Vector Store Documents

**Decision**: All vector store documents carry a flat `Map<String, Object>` metadata for Spring
AI filtering. Schema:

| Field | Text Chunk | Figure Chunk |
|-------|-----------|-------------|
| `type` | `"TEXT"` | `"FIGURE"` |
| `book_id` | ✓ | ✓ |
| `book_title` | ✓ | ✓ |
| `chapter_id` | ✓ | ✓ |
| `section_id` | ✓ | ✓ |
| `section_title` | ✓ | ✓ |
| `page_start` | ✓ | — |
| `page_end` | ✓ | — |
| `chunk_index` | ✓ | — |
| `total_chunks` | ✓ | — |
| `figure_id` | — | ✓ |
| `figure_type` | — | ✓ |
| `image_path` | — | ✓ |
| `label` | — | ✓ |
| `page` | — | ✓ |

**Rationale**: Flat map is required by Spring AI `FilterExpressionBuilder`. Separation by `type`
allows independent filtering in dual search.

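
Building the flat metadata map for each document type could be sketched as below. The key names follow the schema table; the `ChunkMetadataSketch` class, its method names, and the parameter types (string ids, integer pages) are assumptions for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class ChunkMetadataSketch {
    /** Fields present on both text and figure documents. */
    static Map<String, Object> shared(String type, String bookId, String bookTitle,
                                      String chapterId, String sectionId, String sectionTitle) {
        Map<String, Object> m = new LinkedHashMap<>();
        m.put("type", type);
        m.put("book_id", bookId);
        m.put("book_title", bookTitle);
        m.put("chapter_id", chapterId);
        m.put("section_id", sectionId);
        m.put("section_title", sectionTitle);
        return m;
    }

    /** Adds the text-only fields to a copy of the shared map. */
    static Map<String, Object> forTextChunk(Map<String, Object> shared, int pageStart,
                                            int pageEnd, int chunkIndex, int totalChunks) {
        Map<String, Object> m = new LinkedHashMap<>(shared);
        m.put("page_start", pageStart);
        m.put("page_end", pageEnd);
        m.put("chunk_index", chunkIndex);
        m.put("total_chunks", totalChunks);
        return m;
    }

    /** Adds the figure-only fields to a copy of the shared map. */
    static Map<String, Object> forFigure(Map<String, Object> shared, String figureId,
                                         String figureType, String imagePath,
                                         String label, int page) {
        Map<String, Object> m = new LinkedHashMap<>(shared);
        m.put("figure_id", figureId);
        m.put("figure_type", figureType);
        m.put("image_path", imagePath);
        m.put("label", label);
        m.put("page", page);
        return m;
    }
}
```

Every value is a string or an integer, so the map stays flat and each key can be used directly in a metadata filter.
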
---

## Decision 9: Re-embedding Existing Books

**Decision**: Books already processed under feature 001 (text-only) are NOT automatically
re-embedded. An explicit re-embed action is exposed via `POST /api/v1/books/{id}/reembed`
(admin-triggered). The existing chunks remain valid for text queries until re-embedding completes.

**Rationale**: Automatic re-embedding on deploy would block the system and risk data loss if
the process fails mid-way. An explicit, idempotent trigger is safer and more observable.

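
One possible reading of "idempotent trigger" is that repeated requests for the same book while a job is in flight do not start a second job. The sketch below illustrates only that guard; the `ReembedTriggerSketch` class and its methods are hypothetical, the state is held in memory, and the HTTP layer and the actual embedding work are left out.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

final class ReembedTriggerSketch {
    // Ids of books whose re-embedding is currently in flight.
    private final Set<String> running = ConcurrentHashMap.newKeySet();

    /** Returns true if this call started a job, false if one was already running. */
    boolean trigger(String bookId) {
        return running.add(bookId); // atomic: only one caller wins per book
    }

    /** Called when the job finishes or fails, so the book can be triggered again. */
    void complete(String bookId) {
        running.remove(bookId);
    }
}
```

A controller could map a `false` result to a response that reports the job already in progress, rather than queuing duplicate work.
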
---

## Decision 10: Minimum Image Size Threshold

**Decision**: Images smaller than 100×100 pixels are discarded and no chunk is created. Marker
returns PNG bytes; `FigureExtractionService` decodes to `BufferedImage` solely to check
dimensions. This threshold filters out decorative elements without a classification model.

**Rationale**: Neurosurgery textbook diagrams and MRI scans are never smaller than 100×100 px.
The threshold is configurable via `app.figure-storage.min-image-size-px`.

**Alternatives considered**:
- No threshold → decorative icons pollute the figure index
- ML-based classification → accurate but adds model dependency; not needed at POC scale

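
The dimension check can be sketched with the JDK's `Base64` and `ImageIO` classes. The sketch assumes that an image is discarded when either side is below the threshold, which is one reading of "smaller than 100×100"; the `MinImageSizeSketch` class and both method names are hypothetical, and `blankPngBase64` exists only to make the usage example self-contained.

```java
import java.awt.image.BufferedImage;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Base64;
import javax.imageio.ImageIO;

final class MinImageSizeSketch {
    /** True when the decoded image is at least minSizePx on both sides. */
    static boolean isLargeEnough(String base64Png, int minSizePx) {
        byte[] bytes = Base64.getDecoder().decode(base64Png);
        try {
            BufferedImage image = ImageIO.read(new ByteArrayInputStream(bytes));
            // ImageIO.read returns null when no registered reader understands the bytes.
            return image != null
                    && image.getWidth() >= minSizePx
                    && image.getHeight() >= minSizePx;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Helper for the usage example only: encodes a blank PNG of the given size. */
    static String blankPngBase64(int width, int height) {
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            BufferedImage blank = new BufferedImage(width, height, BufferedImage.TYPE_INT_RGB);
            ImageIO.write(blank, "png", out);
            return Base64.getEncoder().encodeToString(out.toByteArray());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

For example, `isLargeEnough(blankPngBase64(50, 200), 100)` is false under the either-side rule, so a tall, narrow decorative rule would be dropped.
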
---

# Marker Study — Why Marker Replaces Google Document AI

*Added 2026-04-04.*

## What Marker Offers

Marker is an open-source, locally runnable PDF-to-structured-content converter that uses a
pipeline of deep-learning models (surya for OCR + layout detection, texify for equations).
Key capabilities relevant to this project:

| Capability | Marker | Google Document AI |
|-----------|--------|--------------------|
| Multi-column reading order | ✅ | ✅ |
| OCR on scanned pages | ✅ | ✅ |
| Figure detection | ✅ returns pre-cropped images | ⚠️ returns bbox only; PDFBox still needed |
| Table extraction | ✅ HTML tables | ✅ |
| JSON output with image bytes | ✅ base64 in `images` map | ❌ |
| No cloud credentials | ✅ | ❌ GCP service account required |
| No per-page billing | ✅ | ❌ ~$10/1,000 pages |
| Batch size limits | None (local) | 15 pages / 20 MB per sync call |
| Setup | `pip install marker-pdf && marker_server` | GCP project + processor + IAM |

---

## Does Marker Solve the Current Pain Points?

### Pain Point 1: Naive 50/50 Column Split

**Answer: Yes, Marker fixes this completely.**

`PdfStructureParser.extractPageText()` splits pages at the horizontal midpoint with a 20%
threshold. This fails on asymmetric columns and scanned pages. Marker's surya layout model
returns blocks in natural reading order — no heuristic needed.

### Pain Point 2: Figure Detection Misses Rasterized Figures

**Answer: Yes, Marker fixes this for most cases.**

`FigureExtractionService` previously iterated PDF XObjects, which finds only embedded XObject
images and misses rasterized figures and vector-path drawings. Marker's layout model detects
visual elements by type and returns the cropped image bytes directly — no PDFBox page rendering
needed.

### Pain Point 3: OCR on Scanned Pages

**Answer: Yes, Marker handles scanned pages transparently via surya OCR.**

### Pain Point 4: Caption Detection

**Answer: Improved — Marker groups caption blocks with their figure block.**

The `block_type = "Caption"` block appears as a sibling or child adjacent to the `"Figure"`
block in the Marker JSON, making caption association structural rather than regex-based.

---

## Marker API Integration

### Local Server Setup

```bash
pip install marker-pdf
marker_server --port 8000
```

The server exposes `POST /marker/upload`, the endpoint this pipeline is configured to call.

### Request

```
POST http://localhost:8000/marker/upload
Content-Type: multipart/form-data

file=@document.pdf
output_format=json
```

### Response (abbreviated)

```json
{
  "output_format": "json",
  "output": {
    "block_type": "Document",
    "children": [
      {
        "block_type": "Page",
        "id": "/page/0/Page/0",
        "children": [
          {
            "block_type": "SectionHeader",
            "id": "/page/0/SectionHeader/0",
            "html": "<h1>Cavernous Sinus Anatomy</h1>"
          },
          {
            "block_type": "Text",
            "id": "/page/0/Text/1",
            "html": "<p>The cavernous sinus contains...</p>"
          },
          {
            "block_type": "Figure",
            "id": "/page/0/Figure/2",
            "html": "<figure><img src='/page/0/Figure/2'/></figure>",
            "images": {
              "/page/0/Figure/2": "iVBORw0KGgo..."
            }
          },
          {
            "block_type": "Caption",
            "id": "/page/0/Caption/3",
            "html": "<p>Fig. 12-4. Coronal cross-section...</p>"
          }
        ]
      }
    ],
    "metadata": { "page_stats": [...] }
  }
}
```

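### Java Integration Pattern

```java
// MarkerPageParser — core call
import com.fasterxml.jackson.databind.JsonNode;
import org.springframework.core.io.FileSystemResource;
import org.springframework.http.MediaType;
import org.springframework.util.LinkedMultiValueMap;
import org.springframework.util.MultiValueMap;

// Multipart form: the PDF itself plus the requested output format.
MultiValueMap<String, Object> body = new LinkedMultiValueMap<>();
body.add("file", new FileSystemResource(pdfPath));
body.add("output_format", "json");

// POST to the local Marker server and read the whole response as a JSON tree.
JsonNode response = restClient.post()
        .uri(baseUrl + "/marker/upload")
        .contentType(MediaType.MULTIPART_FORM_DATA)
        .body(body)
        .retrieve()
        .body(JsonNode.class);

// The block tree (Document → Page → blocks) sits under "output".
JsonNode document = response.get("output");
```
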
### Mapping Marker Blocks to PageResult

```
Page block (id "/page/N/Page/M")   → PageResult(pageNumber = N+1)
  SectionHeader children           → headingTitle (first match)
  Text, TextInlineMath children    → orderedText (HTML stripped, joined \n\n)
  Figure children with images map  → FigureData(imageBytes = base64decode(images[id]))
  Caption sibling of Figure        → FigureData.nearestCaption
```

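
Two of the mapping rules above, deriving the page number from a block id and stripping HTML from a text block, can be sketched with the JDK alone. The `MarkerBlockMappingSketch` class and its method names are hypothetical, and the tag stripper is deliberately naive: it handles simple wrappers such as `<p>` and `<h1>` but does not decode HTML entities, so a real parser would likely use an HTML library instead.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class MarkerBlockMappingSketch {
    // Block ids look like "/page/N/<BlockType>/M"; N is the zero-based page index.
    private static final Pattern PAGE_ID = Pattern.compile("^/page/(\\d+)/");

    /** "/page/0/Figure/2" -> 1, following the pageNumber = N+1 rule. */
    static int pageNumber(String blockId) {
        Matcher m = PAGE_ID.matcher(blockId);
        if (!m.find()) {
            throw new IllegalArgumentException("Not a Marker block id: " + blockId);
        }
        return Integer.parseInt(m.group(1)) + 1;
    }

    /** Removes tags and trims; entities such as &amp; are left as-is. */
    static String stripHtml(String html) {
        return html.replaceAll("<[^>]+>", "").trim();
    }
}
```

With the sample response above, `pageNumber("/page/0/Figure/2")` gives page 1 and `stripHtml` applied to the `SectionHeader` block's `html` gives the heading title text.
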
---

## Architecture Change

```
Before (Document AI — removed):
  DocumentAiPageParser
    → Google Document AI API (GCP, 15-page batches, credentials)
    → returns text blocks + figure bboxes
  PdfStructureParser (PDFBox column heuristic)
  FigureExtractionService
    → renders page via PDFBox at 150 DPI
    → crops bbox region

After (Marker):
  MarkerPageParser
    → POST PDF to http://localhost:8000/marker/upload (output_format=json)
    → returns text blocks (correct reading order) + Figure blocks with base64 images
    → produces List<PageResult> (same DTO, FigureData carries bytes not bbox)
  FigureExtractionService (simplified)
    → base64-decodes image bytes from PageResult.FigureData
    → checks min size (ImageIO.read → getWidth/getHeight)
    → saves to S3 via FigureStorageService (UNCHANGED)
  VisionDescriptionService (UNCHANGED)
  BookEmbeddingService orchestration (MINOR: inject MarkerPageParser)
```

**What is removed**:
- `DocumentAiPageParser` — replaced by `MarkerPageParser`
- `DocumentAiConfig` — replaced by `MarkerConfig`
- `PdfStructureParser` — Marker handles reading order
- `google-cloud-document-ai` Maven dependency
- `app.document-ai.*` configuration properties

**What stays the same**:
- `PageResult` DTO structure (fields renamed, not restructured)
- `FigureExtractionService` public interface
- `TextChunkingService`, `VisionDescriptionService`, `BookEmbeddingService` orchestration
- All JPA entities, repositories, vector store, S3 storage

---

## Constitution Compliance

| Principle | Assessment |
|-----------|------------|
| **I. KISS** | ✅ Simpler than Document AI — one HTTP call replaces GCP SDK + PDFBox render loop. No new dependency beyond an HTTP client (Spring RestClient, already available). |
| **II. Easy to Change** | ✅ `MarkerPageParser` is the only Marker-aware class. Swap it to use any other parser. `PageResult` DTO unchanged in contract. |
| **III. Web-First** | ✅ Internal pipeline change; no API contract change. |
| **IV. Documentation** | ✅ README must show Marker as a local external service dependency. |

---

## Risks & Mitigations

| Risk | Likelihood | Mitigation |
|------|-----------|------------|
| Marker server not running when book is uploaded | Medium | `BookEmbeddingService` catches exception from `MarkerPageParser`, marks book as `FAILED`, logs full error. |
| Marker misses some figures (complex PDFs) | Medium | `app.figure-storage.min-image-size-px` threshold can be tuned. Add fallback: if Marker returns 0 figures for a page with known images, log a warning. |
| SC-003 (≤ 3× processing time) violated | Low | Marker runs locally (no network latency to cloud). Benchmark with a real 500-page book early. |
| Large PDF upload to Marker (>100MB) | Low | Marker server handles the full file; no batching needed. Multipart upload limit configurable. |
| Marker image quality vs PDFBox crop | Low | Marker crops at native resolution; quality is equivalent or better than 150 DPI PDFBox render. |