Research: Enhanced Embedding with Image Parsing and Metadata
Branch: 002-image-aware-embedding | Date: 2026-04-04 (updated: Marker replaces Google Document AI)
This document resolves all technical unknowns identified during planning. Decisions 1–10 cover the core pipeline. The Marker Study section at the bottom explains why Marker was chosen over Google Document AI to drive PDF parsing and figure extraction.
Decision 1: Document Hierarchy Model
Decision: Adopt a four-level hierarchy — BookNode → ChapterNode → SectionNode →
TextChunkNode + FigureNode. The SectionNode is the pivotal unit: it holds the full section
text in Postgres and is used for parent-child context expansion at retrieval time.
Rationale: A flat page-per-document model (current implementation) loses structural context. When a user asks a multi-faceted clinical question, the LLM needs the surrounding section text, not just the matching fragment. Parent-child retrieval — where chunks point to their parent section — is the established pattern for RAG precision. The hierarchy also makes figure-to-section association explicit and queryable.
Alternatives considered:
- Keep flat page model, add metadata only → rejected: insufficient for precise citation and context expansion
- Chapter-level retrieval (coarser than section) → rejected: too much irrelevant context sent to LLM; cost and latency increase
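The parent-child expansion described in this decision can be sketched roughly as follows. The record shapes and field names are hypothetical, not the project's actual entities, and an in-memory traversal stands in for the Postgres lookup of the section text.

```java
import java.util.List;

// Hypothetical record shapes; field names are illustrative only.
record TextChunkNode(String id, String sectionId, int chunkIndex, String text) {}
record FigureNode(String id, String sectionId, String label, String caption) {}
record SectionNode(String id, String title, String fullText,
                   List<TextChunkNode> chunks, List<FigureNode> figures) {}
record ChapterNode(String id, String title, List<SectionNode> sections) {}
record BookNode(String id, String title, List<ChapterNode> chapters) {}

final class ParentExpansion {
    private ParentExpansion() {}

    /** Finds the section that owns the matched chunk and returns its full text. */
    static String expandToSection(BookNode book, TextChunkNode hit) {
        return book.chapters().stream()
                .flatMap(chapter -> chapter.sections().stream())
                .filter(section -> section.id().equals(hit.sectionId()))
                .map(SectionNode::fullText)
                .findFirst()
                .orElse(hit.text()); // no parent found: fall back to the fragment itself
    }
}
```

A retrieved chunk is thus only a pointer; what reaches the LLM is the full text of its parent section.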
Decision 2: Document Parsing Strategy
Decision: Use Marker (local HTTP server, http://localhost:8000/marker/upload) as the
single entry point for PDF parsing. A single POST with output_format=json returns:
- Reading-order text blocks (headings, paragraphs) — no column-split heuristic needed
- Pre-cropped figure images as base64-encoded PNG in the images map of each Figure block
- Table, equation, and code blocks as structured HTML
MarkerPageParser translates the Marker JSON response into List<PageResult>, which is the
same internal DTO used by the rest of the pipeline.
Rationale: Marker handles column reordering, scanned-page OCR, and figure cropping in one
call, eliminating the PDFBox column heuristic (PdfStructureParser) and the PDFBox
render+crop loop in FigureExtractionService. Net result: fewer classes, no cloud dependency,
no GCP credentials.
Alternatives considered:
- PDFBox column heuristic (previous approach) → rejected: 50/50 split fails on asymmetric columns and scanned pages
- Google Document AI Layout Parser → rejected: adds GCP credentials, per-page billing, 15-page batch limit, and still requires PDFBox to render+crop figure regions from bounding boxes. See Marker Study below for detailed comparison.
- Screenshot each page + OCR → far slower; loses digital text quality
Decision 3: Figure Content Representation
Decision: Generate a textual description of each extracted image using the OpenAI vision
model (GPT-4o). This description becomes the content field of the figure's vector store
document. The figure caption (parsed from the surrounding text) is also included to maximise
retrieval signal.
Rationale: Caption-only embedding would miss figures with no caption or with sparse labels. Vision-generated descriptions produce richer semantic content (anatomy terms, structural relationships) that matches clinical queries. The OpenAI client already in use supports image inputs; no additional dependency is required.
Alternatives considered:
- Caption-only embedding → insufficient when captions are absent or terse (common in textbooks)
- Local vision model (LLaVA) → requires self-hosting; out of scope for POC
- OCR only → extracts text visible in image but misses non-text visual content (diagrams, MRI)
Decision 4: Dual Vector Search
Decision: At query time, run two parallel similarity searches:
- Text chunk search (filtered by type = "TEXT" and book_id)
- Figure caption search (filtered by type = "FIGURE" and book_id)
Results are merged and deduplicated. The LLM prompt receives the expanded parent section text plus a structured figure reference list.
Rationale: A single search would rank text and figures against each other; figures with
terse captions would systematically lose to text chunks. Separate searches with independent
topK allow tuning each modality independently.
Alternatives considered:
- Single search, filter by relevance score → figure captions score lower than text; figures are systematically under-retrieved
- Post-process text results to look up linked figures only → misses figures that are relevant to the query but not explicitly referenced in the retrieved text chunks
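A minimal sketch of the two parallel searches and the merge step, assuming each search is supplied as a function that returns scored hits. The Hit record, the Supplier-based signature, and the rule of keeping the higher-scoring duplicate are assumptions for illustration; the real searches would go through the vector store with the type and book_id filters.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.function.Supplier;

// Hypothetical hit shape: document id, type ("TEXT" or "FIGURE"), similarity score.
record Hit(String id, String type, double score) {}

final class DualSearch {
    private DualSearch() {}

    /** Runs both searches concurrently, then merges their results. */
    static List<Hit> searchBoth(Supplier<List<Hit>> textSearch,
                                Supplier<List<Hit>> figureSearch) {
        CompletableFuture<List<Hit>> text = CompletableFuture.supplyAsync(textSearch);
        CompletableFuture<List<Hit>> figures = CompletableFuture.supplyAsync(figureSearch);
        return merge(text.join(), figures.join());
    }

    /** Keeps one hit per id (the higher-scoring one), in first-seen order. */
    static List<Hit> merge(List<Hit> textHits, List<Hit> figureHits) {
        Map<String, Hit> byId = new LinkedHashMap<>();
        for (List<Hit> hits : List.of(textHits, figureHits)) {
            for (Hit h : hits) {
                byId.merge(h.id(), h, (a, b) -> a.score() >= b.score() ? a : b);
            }
        }
        return new ArrayList<>(byId.values());
    }
}
```

Because each search has its own topK, a figure with a terse caption only competes against other figures, never against text chunks.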
Decision 5: Chunk-to-Figure Linking
Decision: During text parsing, whenever a pattern matching Fig.\s+\d+[\-\.]\d+ or
Figure\s+\d+[\-\.]\d+ is found in a chunk, insert a row into the chunk_figure_refs table
linking chunkId → figureId. At retrieval time, after text chunks are retrieved, their
associated figures are fetched from this table and added to the LLM prompt.
Rationale: Explicit linking ensures that when a text chunk is retrieved, its referenced figures are always surfaced — even if the figure's caption did not score highly in the vector search. This is the higher-recall path; dual search (Decision 4) is the higher-precision path.
Alternatives considered:
- Rely entirely on dual vector search → may miss figures referenced in retrieved text but scoring below the topK threshold in the figure search
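The reference detection could look roughly like this. The sketch combines the two patterns from the decision into one expression and escapes the dot after "Fig" so it matches a literal period; the class and method names are hypothetical, and writing the rows to chunk_figure_refs is left out.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class FigureRefExtractor {
    // "Fig. 12-4" or "Figure 12.4"; group 1 captures the label ("12-4").
    private static final Pattern FIGURE_REF =
            Pattern.compile("\\b(?:Fig\\.|Figure)\\s+(\\d+[-.]\\d+)");

    private FigureRefExtractor() {}

    /** Returns the distinct figure labels in a chunk, in order of first appearance. */
    static Set<String> extractLabels(String chunkText) {
        Set<String> labels = new LinkedHashSet<>();
        Matcher matcher = FIGURE_REF.matcher(chunkText);
        while (matcher.find()) {
            labels.add(matcher.group(1));
        }
        return labels;
    }
}
```

Each returned label would then be resolved to a figureId and stored as one chunkId → figureId row.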
Decision 6: Image Storage
Decision: Marker returns figure images as base64-encoded PNG bytes in the JSON response.
FigureExtractionService decodes these bytes and passes them to FigureStorageService, which
persists them to an S3-compatible bucket (${app.figure-storage.bucket}). The image path/URL
is stored in figure.image_path in Postgres.
The FigureStorageService interface is unchanged; only the caller changes (from PDFBox crop
to base64 decode).
Rationale: Marker's pre-cropped images remove the need for PDFBox rendering. Keeping the
FigureStorageService interface boundary intact satisfies Constitution Principle II (Easy to Change).
Alternatives considered:
- Store base64 in Postgres JSONB → bloats DB; complicates backup; query performance degrades
Decision 7: Figure Type Classification
Decision: Use the enum FigureType { ANATOMICAL_DIAGRAM, SURGICAL_PHOTOGRAPH, MRI_CT_SCAN, TABLE, CHART, INTRAOPERATIVE_IMAGE }. Classification is derived from:
- Caption keywords ("MRI", "CT", "Fig.", "Table"): heuristic, no model needed
- Marker block_type hint ("Table" → TABLE, "Figure"/"Picture" → ANATOMICAL_DIAGRAM default)
- Fall back to ANATOMICAL_DIAGRAM if unclassifiable
Rationale: Allows the frontend to render different icon/label per type (e.g., "MRI" badge). Heuristic classification avoids a separate model call per image at extraction time.
Alternatives considered:
- Vision model classification → accurate but adds latency and cost per figure; deferrable
- Single FIGURE type → loses citation granularity required by spec FR-004
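One possible shape for the heuristic, using only the signals named in this decision. The order in which the checks run, the whole-word matching, and the class name are assumptions; the remaining enum values have no keywords listed in this document, so the sketch never returns them.

```java
import java.util.regex.Pattern;

enum FigureType {
    ANATOMICAL_DIAGRAM, SURGICAL_PHOTOGRAPH, MRI_CT_SCAN, TABLE, CHART, INTRAOPERATIVE_IMAGE
}

final class FigureTypeClassifier {
    // Whole-word, case-sensitive match so "CT" does not fire inside "SECTION".
    private static final Pattern SCAN_KEYWORD = Pattern.compile("\\b(MRI|CT)\\b");
    // A caption that starts with the word "Table".
    private static final Pattern TABLE_CAPTION = Pattern.compile("^\\s*Table\\b");

    private FigureTypeClassifier() {}

    /** Caption keywords first, then the Marker block_type hint, then the default. */
    static FigureType classify(String caption, String markerBlockType) {
        String text = caption == null ? "" : caption;
        if (SCAN_KEYWORD.matcher(text).find()) {
            return FigureType.MRI_CT_SCAN;
        }
        if (TABLE_CAPTION.matcher(text).find() || "Table".equals(markerBlockType)) {
            return FigureType.TABLE;
        }
        return FigureType.ANATOMICAL_DIAGRAM; // "Figure"/"Picture" and anything unclassifiable
    }
}
```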
Decision 8: Metadata Schema for Vector Store Documents
Decision: All vector store documents carry a flat Map<String, Object> metadata for Spring
AI filtering. Schema:
| Field | Text Chunk | Figure Chunk |
|---|---|---|
| type | "TEXT" | "FIGURE" |
| book_id | ✓ | ✓ |
| book_title | ✓ | ✓ |
| chapter_id | ✓ | ✓ |
| section_id | ✓ | ✓ |
| section_title | ✓ | ✓ |
| page_start | ✓ | - |
| page_end | ✓ | - |
| chunk_index | ✓ | - |
| total_chunks | ✓ | - |
| figure_id | - | ✓ |
| figure_type | - | ✓ |
| image_path | - | ✓ |
| label | - | ✓ |
| page | - | ✓ |
Rationale: Flat map is required by Spring AI FilterExpressionBuilder. Separation by type
allows independent filtering in dual search.
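As an illustration of the schema, the two metadata maps could be assembled like this. The SectionRef record and the builder class are hypothetical helpers; only the map keys come from the table above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical carrier for the fields both document types share.
record SectionRef(String bookId, String bookTitle, String chapterId,
                  String sectionId, String sectionTitle) {}

final class ChunkMetadata {
    private ChunkMetadata() {}

    /** Fields present on every vector store document. */
    private static Map<String, Object> common(String type, SectionRef ref) {
        Map<String, Object> m = new LinkedHashMap<>();
        m.put("type", type);
        m.put("book_id", ref.bookId());
        m.put("book_title", ref.bookTitle());
        m.put("chapter_id", ref.chapterId());
        m.put("section_id", ref.sectionId());
        m.put("section_title", ref.sectionTitle());
        return m;
    }

    static Map<String, Object> forTextChunk(SectionRef ref, int pageStart, int pageEnd,
                                            int chunkIndex, int totalChunks) {
        Map<String, Object> m = common("TEXT", ref);
        m.put("page_start", pageStart);
        m.put("page_end", pageEnd);
        m.put("chunk_index", chunkIndex);
        m.put("total_chunks", totalChunks);
        return m;
    }

    static Map<String, Object> forFigure(SectionRef ref, String figureId, String figureType,
                                         String imagePath, String label, int page) {
        Map<String, Object> m = common("FIGURE", ref);
        m.put("figure_id", figureId);
        m.put("figure_type", figureType);
        m.put("image_path", imagePath);
        m.put("label", label);
        m.put("page", page);
        return m;
    }
}
```

Every value is a string or an integer, so the map stays flat and filterable.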
Decision 9: Re-embedding Existing Books
Decision: Books already processed under feature 001 (text-only) are NOT automatically
re-embedded. An explicit re-embed action is exposed via POST /api/v1/books/{id}/reembed
(admin-triggered). The existing chunks remain valid for text queries until re-embedding completes.
Rationale: Automatic re-embedding on deploy would block the system and risk data loss if the process fails mid-way. An explicit, idempotent trigger is safer and more observable.
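One way the trigger might be made safe to call repeatedly is an in-progress guard, sketched below. The class and its methods are hypothetical; a real implementation would more likely persist this state on the book record so that it survives restarts.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

final class ReembedCoordinator {
    // Book ids with a re-embed run currently in flight.
    private final Set<String> inProgress = ConcurrentHashMap.newKeySet();

    /** Returns true if a new run may start, false if one is already running. */
    boolean tryStart(String bookId) {
        return inProgress.add(bookId);
    }

    /** Marks the run as finished (success or failure) so a later request can start again. */
    void finish(String bookId) {
        inProgress.remove(bookId);
    }
}
```

The endpoint handler would call tryStart before scheduling work and finish in a finally block, so a repeated POST while a run is active has no additional effect.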
Decision 10: Minimum Image Size Threshold
Decision: Images smaller than 100×100 pixels are discarded and no chunk is created. Marker
returns PNG bytes; FigureExtractionService decodes to BufferedImage solely to check
dimensions. This threshold filters out decorative elements without a classification model.
Rationale: Neurosurgery textbook diagrams and MRI scans are never smaller than 100×100 px.
The threshold is configurable via app.figure-storage.min-image-size-px.
Alternatives considered:
- No threshold → decorative icons pollute the figure index
- ML-based classification → accurate but adds model dependency; not needed at POC scale
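The decode-and-check step could be sketched as below. It assumes "smaller than 100×100" means that either side falls below the threshold, which is one reading of the rule; the class and method names are hypothetical, and blankPngBase64 exists only to produce sample input.

```java
import java.awt.image.BufferedImage;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Base64;
import javax.imageio.ImageIO;

final class FigureSizeFilter {
    private FigureSizeFilter() {}

    /** Decodes a base64 PNG and reports whether both sides reach minSidePx. */
    static boolean isLargeEnough(String base64Png, int minSidePx) {
        byte[] bytes = Base64.getDecoder().decode(base64Png);
        try {
            BufferedImage image = ImageIO.read(new ByteArrayInputStream(bytes));
            if (image == null) {
                return false; // ImageIO found no reader for these bytes
            }
            return image.getWidth() >= minSidePx && image.getHeight() >= minSidePx;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Sample-data helper: encodes a blank PNG of the given size as base64. */
    static String blankPngBase64(int width, int height) {
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            ImageIO.write(new BufferedImage(width, height, BufferedImage.TYPE_INT_RGB), "png", out);
            return Base64.getEncoder().encodeToString(out.toByteArray());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

In the pipeline, minSidePx would come from app.figure-storage.min-image-size-px, and only images that pass would be handed to FigureStorageService.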
Marker Study — Why Marker Replaces Google Document AI
Added 2026-04-04.
What Marker Offers
Marker is an open-source, locally-runnable PDF-to-structured-content converter that uses a pipeline of deep-learning models (surya for OCR + layout detection, texify for equations). Key capabilities relevant to this project:
| Capability | Marker | Google Document AI |
|---|---|---|
| Multi-column reading order | ✅ | ✅ |
| OCR on scanned pages | ✅ | ✅ |
| Figure detection | ✅ returns pre-cropped images | ⚠️ returns bbox only; PDFBox still needed |
| Table extraction | ✅ HTML tables | ✅ |
| JSON output with image bytes | ✅ base64 in images map | ❌ |
| No cloud credentials | ✅ | ❌ GCP service account required |
| No per-page billing | ✅ | ❌ ~$10/1,000 pages |
| Batch size limits | None (local) | 15 pages / 20 MB per sync call |
| Setup | pip install marker-pdf && marker_server | GCP project + processor + IAM |
Does Marker Solve the Current Pain Points?
Pain Point 1: Naive 50/50 Column Split
Answer: Yes, Marker fixes this completely.
PdfStructureParser.extractPageText() splits pages at the horizontal midpoint with a 20%
threshold. This fails on asymmetric columns and scanned pages. Marker's surya layout model
returns blocks in natural reading order — no heuristic needed.
Pain Point 2: Figure Detection Misses Rasterized Figures
Answer: Yes, Marker fixes this for most cases.
FigureExtractionService previously iterated PDF XObjects, which finds only embedded XObject
images and misses rasterized figures and vector-path drawings. Marker's layout model detects
visual elements by type and returns the cropped image bytes directly, so no PDFBox page
rendering is needed.
Pain Point 3: OCR on Scanned Pages
Answer: Yes, Marker handles scanned pages transparently via surya OCR.
Pain Point 4: Caption Detection
Answer: Improved — Marker groups caption blocks with their figure block.
The block_type = "Caption" block appears as a sibling or child adjacent to the "Figure"
block in the Marker JSON, making caption association structural rather than regex-based.
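A structural caption lookup for the sibling case might look like this. The Block record is a hypothetical flattened view of one Marker block, and the preference for the block after the figure over the one before it is an assumption; the child case is not covered.

```java
import java.util.List;
import java.util.Optional;

// Hypothetical flattened view of one Marker block: type, id, raw HTML.
record Block(String blockType, String id, String html) {}

final class CaptionLinker {
    private CaptionLinker() {}

    /** Returns the Caption directly after the figure, else the one directly before it. */
    static Optional<Block> nearestCaption(List<Block> siblings, int figureIndex) {
        for (int i : new int[] {figureIndex + 1, figureIndex - 1}) {
            if (i >= 0 && i < siblings.size()
                    && "Caption".equals(siblings.get(i).blockType())) {
                return Optional.of(siblings.get(i));
            }
        }
        return Optional.empty();
    }
}
```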
Marker API Integration
Local Server Setup
pip install marker-pdf
marker_server --port 8000
The server exposes POST /marker/upload (the user's configured endpoint).
Request
POST http://localhost:8000/marker/upload
Content-Type: multipart/form-data
file=@document.pdf
output_format=json
Response (abbreviated)
{
  "output_format": "json",
  "output": {
    "block_type": "Document",
    "children": [
      {
        "block_type": "Page",
        "id": "/page/0/Page/0",
        "children": [
          {
            "block_type": "SectionHeader",
            "id": "/page/0/SectionHeader/0",
            "html": "<h1>Cavernous Sinus Anatomy</h1>"
          },
          {
            "block_type": "Text",
            "id": "/page/0/Text/1",
            "html": "<p>The cavernous sinus contains...</p>"
          },
          {
            "block_type": "Figure",
            "id": "/page/0/Figure/2",
            "html": "<figure><img src='/page/0/Figure/2'/></figure>",
            "images": {
              "/page/0/Figure/2": "iVBORw0KGgo..."
            }
          },
          {
            "block_type": "Caption",
            "id": "/page/0/Caption/3",
            "html": "<p>Fig. 12-4. Coronal cross-section...</p>"
          }
        ]
      }
    ],
    "metadata": { "page_stats": [...] }
  }
}
Java Integration Pattern
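// MarkerPageParser: core call (fragment; restClient, baseUrl and pdfPath are fields of the parser)
import com.fasterxml.jackson.databind.JsonNode;
import org.springframework.core.io.FileSystemResource;
import org.springframework.http.MediaType;
import org.springframework.util.LinkedMultiValueMap;
import org.springframework.util.MultiValueMap;

// Multipart form: the PDF as a file part plus the requested output format
MultiValueMap<String, Object> body = new LinkedMultiValueMap<>();
body.add("file", new FileSystemResource(pdfPath));
body.add("output_format", "json");

JsonNode response = restClient.post()
        .uri(baseUrl + "/marker/upload")
        .contentType(MediaType.MULTIPART_FORM_DATA)
        .body(body)
        .retrieve()
        .body(JsonNode.class);

// Root "Document" block; its children are the Page blocks
JsonNode document = response.get("output");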
Mapping Marker Blocks to PageResult
Page block (id "/page/N/Page/M") → PageResult(pageNumber = N+1)
SectionHeader children → headingTitle (first match)
Text, TextInlineMath children → orderedText (HTML stripped, joined \n\n)
Figure children with images map → FigureData(imageBytes = base64decode(images[id]))
Caption sibling of Figure → FigureData.nearestCaption
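Two of the mapping rules above, page number from block id and HTML stripping, can be sketched with plain regular expressions. The class and method names are hypothetical, and the tag stripping is deliberately naive: it suits simple fragments like those in the sample response and does not decode HTML entities.

```java
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

final class MarkerBlockMapping {
    private static final Pattern PAGE_ID = Pattern.compile("^/page/(\\d+)/");
    private static final Pattern TAG = Pattern.compile("<[^>]+>");

    private MarkerBlockMapping() {}

    /** Converts a block id such as "/page/0/Page/0" to a 1-based page number. */
    static int pageNumber(String blockId) {
        Matcher matcher = PAGE_ID.matcher(blockId);
        if (!matcher.find()) {
            throw new IllegalArgumentException("Unexpected block id: " + blockId);
        }
        return Integer.parseInt(matcher.group(1)) + 1;
    }

    /** Removes tags from a simple HTML fragment and trims the result. */
    static String stripHtml(String html) {
        return TAG.matcher(html).replaceAll("").trim();
    }

    /** Strips each text block and joins them with a blank line, as orderedText. */
    static String orderedText(List<String> htmlBlocks) {
        return htmlBlocks.stream()
                .map(MarkerBlockMapping::stripHtml)
                .collect(Collectors.joining("\n\n"));
    }
}
```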
Architecture Change
Before (Document AI, removed):
DocumentAiPageParser
  → Google Document AI API (GCP, 15-page batches, credentials)
  → returns text blocks + figure bboxes
PdfStructureParser (PDFBox column heuristic)
FigureExtractionService
  → renders page via PDFBox at 150 DPI
  → crops bbox region

After (Marker):
MarkerPageParser
  → POST PDF to http://localhost:8000/marker/upload (output_format=json)
  → returns text blocks (correct reading order) + Figure blocks with base64 images
  → produces List<PageResult> (same DTO, FigureData carries bytes not bbox)
FigureExtractionService (simplified)
  → base64-decodes image bytes from PageResult.FigureData
  → checks min size (ImageIO.read → getWidth/getHeight)
  → saves to S3 via FigureStorageService (UNCHANGED)
VisionDescriptionService (UNCHANGED)
BookEmbeddingService orchestration (MINOR: inject MarkerPageParser)
What is removed:
- DocumentAiPageParser: replaced by MarkerPageParser
- DocumentAiConfig: replaced by MarkerConfig
- PdfStructureParser: Marker handles reading order
- google-cloud-document-ai Maven dependency
- app.document-ai.* configuration properties
What stays the same:
- PageResult DTO structure (fields renamed, not restructured)
- FigureExtractionService public interface
- TextChunkingService, VisionDescriptionService, BookEmbeddingService orchestration
- All JPA entities, repositories, vector store, S3 storage
Constitution Compliance
| Principle | Assessment |
|---|---|
| I. KISS | ✅ Simpler than Document AI — one HTTP call replaces GCP SDK + PDFBox render loop. No new dependency beyond an HTTP client (Spring RestClient, already available). |
| II. Easy to Change | ✅ MarkerPageParser is the only Marker-aware class. Swap it to use any other parser. PageResult DTO unchanged in contract. |
| III. Web-First | ✅ Internal pipeline change; no API contract change. |
| IV. Documentation | ✅ README must show Marker as a local external service dependency. |
Risks & Mitigations
| Risk | Likelihood | Mitigation |
|---|---|---|
| Marker server not running when book is uploaded | Medium | BookEmbeddingService catches exception from MarkerPageParser, marks book as FAILED, logs full error. |
| Marker misses some figures (complex PDFs) | Medium | app.figure-storage.min-image-size-px threshold can be tuned. Add fallback: if Marker returns 0 figures for a page with known images, log a warning. |
| SC-003 (≤ 3× processing time) violated | Low | Marker runs locally (no network latency to cloud). Benchmark with a real 500-page book early. |
| Large PDF upload to Marker (>100MB) | Low | Marker server handles the full file; no batching needed. Multipart upload limit configurable. |
| Marker image quality vs PDFBox crop | Low | Marker crops at native resolution; quality is equivalent or better than 150 DPI PDFBox render. |