# Research: Enhanced Embedding with Image Parsing and Metadata

**Branch**: `002-image-aware-embedding` | **Date**: 2026-04-04 (updated: Marker replaces Google Document AI)

This document resolves all technical unknowns identified during planning. Decisions 1–10 cover the core pipeline. The **Marker Study** section at the bottom explains why Marker was chosen over Google Document AI to drive PDF parsing and figure extraction.

---

## Decision 1: Document Hierarchy Model

**Decision**: Adopt a four-level hierarchy: `BookNode` → `ChapterNode` → `SectionNode` → `TextChunkNode` + `FigureNode`. The `SectionNode` is the pivotal unit: it holds the full section text in Postgres and is used for parent-child context expansion at retrieval time.

**Rationale**: A flat page-per-document model (current implementation) loses structural context. When a user asks a multi-faceted clinical question, the LLM needs the surrounding section text, not just the matching fragment. Parent-child retrieval, where chunks point to their parent section, is the established pattern for RAG precision. The hierarchy also makes figure-to-section association explicit and queryable.

**Alternatives considered**:
- Keep flat page model, add metadata only → rejected: insufficient for precise citation and context expansion
- Chapter-level retrieval (coarser than section) → rejected: too much irrelevant context sent to LLM; cost and latency increase

---

## Decision 2: Document Parsing Strategy

**Decision**: Use **Marker** (local HTTP server, `http://localhost:8000/marker/upload`) as the single entry point for PDF parsing.
A single `POST` with `output_format=json` returns:
- Reading-order text blocks (headings, paragraphs); no column-split heuristic needed
- Pre-cropped figure images as base64-encoded PNG in the `images` map of each `Figure` block
- Table, equation, and code blocks as structured HTML

`MarkerPageParser` translates the Marker JSON response into `List<PageResult>`, the same internal DTO used by the rest of the pipeline.

**Rationale**: Marker handles column reordering, scanned-page OCR, and figure cropping in one call, eliminating the PDFBox column heuristic (`PdfStructureParser`) and the PDFBox render+crop loop in `FigureExtractionService`. Net result: fewer classes, no cloud dependency, no GCP credentials.

**Alternatives considered**:
- PDFBox column heuristic (previous approach) → rejected: 50/50 split fails on asymmetric columns and scanned pages
- Google Document AI Layout Parser → rejected: adds GCP credentials, per-page billing, 15-page batch limit, and still requires PDFBox to render+crop figure regions from bounding boxes. See Marker Study below for detailed comparison.
- Screenshot each page + OCR → rejected: far slower; loses digital text quality

---

## Decision 3: Figure Content Representation

**Decision**: Generate a textual description of each extracted image using the OpenAI vision model (GPT-4o). This description becomes the `content` field of the figure's vector store document. The figure caption (parsed from the surrounding text) is also included to maximise retrieval signal.

**Rationale**: Caption-only embedding would miss figures with no caption or with sparse labels. Vision-generated descriptions produce richer semantic content (anatomy terms, structural relationships) that matches clinical queries. The OpenAI client already in use supports image inputs; no additional dependency is required.
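As a minimal sketch of how the figure document's `content` could be assembled from the caption and the vision description: the class name `FigureContent`, the `build` method, and the `Caption:` / `Description:` prefixes are illustrative assumptions, not names taken from the codebase. A missing caption is simply omitted, so the description alone carries the retrieval signal.

```java
// Hypothetical helper: combines the parsed caption and the vision-generated
// description into the text that would be embedded for a figure.
final class FigureContent {

    private FigureContent() {}

    /**
     * Either argument may be null or blank. The prefixes are an assumed
     * convention chosen for this sketch.
     */
    static String build(String caption, String visionDescription) {
        StringBuilder content = new StringBuilder();
        if (caption != null && !caption.isBlank()) {
            content.append("Caption: ").append(caption.strip());
        }
        if (visionDescription != null && !visionDescription.isBlank()) {
            if (content.length() > 0) {
                content.append('\n');
            }
            content.append("Description: ").append(visionDescription.strip());
        }
        return content.toString();
    }
}
```

Keeping the two parts labelled would let a later change (for example, weighting the caption differently) stay local to this one method.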
**Alternatives considered**:
- Caption-only embedding → insufficient when captions are absent or terse (common in textbooks)
- Local vision model (LLaVA) → requires self-hosting; out of scope for POC
- OCR only → extracts text visible in image but misses non-text visual content (diagrams, MRI)

---

## Decision 4: Dual Vector Search

**Decision**: At query time, run two parallel similarity searches:
1. Text chunk search (filtered by `type = "TEXT"` and `book_id`)
2. Figure caption search (filtered by `type = "FIGURE"` and `book_id`)

Results are merged and deduplicated. The LLM prompt receives the expanded parent section text plus a structured figure reference list.

**Rationale**: A single search would rank text and figures against each other; figures with terse captions would systematically lose to text chunks. Separate searches with independent `topK` allow tuning each modality independently.

**Alternatives considered**:
- Single search, filter by relevance score → figure captions score lower than text; figures are systematically under-retrieved
- Post-process text results to look up linked figures only → misses figures that are relevant to the query but not explicitly referenced in the retrieved text chunks

---

## Decision 5: Chunk-to-Figure Linking

**Decision**: During text parsing, whenever a pattern matching `Fig.\s+\d+[\-\.]\d+` or `Figure\s+\d+[\-\.]\d+` is found in a chunk, insert a row into the `chunk_figure_refs` table linking `chunkId` → `figureId`. At retrieval time, after text chunks are retrieved, their associated figures are fetched from this table and added to the LLM prompt.

**Rationale**: Explicit linking ensures that when a text chunk is retrieved, its referenced figures are always surfaced, even if the figure's caption did not score highly in the vector search. This is the higher-recall path; dual search (Decision 4) is the higher-precision path.
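The reference detection in Decision 5 can be sketched with `java.util.regex`. The class and method names below are hypothetical. The sketch merges the two patterns into one and escapes the dot after `Fig` (the pattern as written above leaves it unescaped, so it would also match strings such as `Figs 12-4`). Resolving a label to a `figureId` and inserting into `chunk_figure_refs` are left out.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: finds figure labels such as "12-4" or "3.1" in a text chunk.
final class FigureReferenceExtractor {

    // "Fig." (literal dot) or "Figure", whitespace, then <digits><- or .><digits>.
    private static final Pattern FIGURE_REF =
            Pattern.compile("\\b(?:Fig\\.|Figure)\\s+(\\d+[-.]\\d+)");

    private FigureReferenceExtractor() {}

    /** Returns the distinct labels referenced in the chunk, in order of first appearance. */
    static Set<String> extractLabels(String chunkText) {
        Set<String> labels = new LinkedHashSet<>();
        Matcher matcher = FIGURE_REF.matcher(chunkText);
        while (matcher.find()) {
            labels.add(matcher.group(1));
        }
        return labels;
    }
}
```

Returning a set keeps a chunk that mentions the same figure twice from producing duplicate link rows.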
**Alternatives considered**:
- Rely entirely on dual vector search → may miss figures referenced in retrieved text but scoring below the topK threshold in the figure search

---

## Decision 6: Image Storage

**Decision**: Marker returns figure images as base64-encoded PNG bytes in the JSON response. `FigureExtractionService` decodes these bytes and passes them to `FigureStorageService`, which persists them to an S3-compatible bucket (`${app.figure-storage.bucket}`). The image path/URL is stored in `figure.image_path` in Postgres. The `FigureStorageService` interface is unchanged; only the caller changes (from PDFBox crop to base64 decode).

**Rationale**: Marker's pre-cropped images remove the need for PDFBox rendering. The `FigureStorageService` interface boundary satisfies Constitution Principle II (Easy to Change).

**Alternatives considered**:
- Store base64 in Postgres JSONB → bloats DB; complicates backup; query performance degrades

---

## Decision 7: Figure Type Classification

**Decision**: Use the enum `FigureType { ANATOMICAL_DIAGRAM, SURGICAL_PHOTOGRAPH, MRI_CT_SCAN, TABLE, CHART, INTRAOPERATIVE_IMAGE }`. Classification is derived from:
1. Caption keywords ("MRI", "CT", "Fig.", "Table"); heuristic, no model needed
2. Marker `block_type` hint (`"Table"` → TABLE, `"Figure"` / `"Picture"` → ANATOMICAL_DIAGRAM default)
3. Fall back to `ANATOMICAL_DIAGRAM` if unclassifiable

**Rationale**: Allows the frontend to render a different icon/label per type (e.g., "MRI" badge). Heuristic classification avoids a separate model call per image at extraction time.

**Alternatives considered**:
- Vision model classification → accurate but adds latency and cost per figure; deferrable
- Single `FIGURE` type → loses citation granularity required by spec FR-004

---

## Decision 8: Metadata Schema for Vector Store Documents

**Decision**: All vector store documents carry a flat `Map<String, Object>` metadata for Spring AI filtering.
Schema:

| Field | Text Chunk | Figure Chunk |
|-------|-----------|-------------|
| `type` | `"TEXT"` | `"FIGURE"` |
| `book_id` | ✓ | ✓ |
| `book_title` | ✓ | ✓ |
| `chapter_id` | ✓ | ✓ |
| `section_id` | ✓ | ✓ |
| `section_title` | ✓ | ✓ |
| `page_start` | ✓ | – |
| `page_end` | ✓ | – |
| `chunk_index` | ✓ | – |
| `total_chunks` | ✓ | – |
| `figure_id` | – | ✓ |
| `figure_type` | – | ✓ |
| `image_path` | – | ✓ |
| `label` | – | ✓ |
| `page` | – | ✓ |

**Rationale**: Flat map is required by Spring AI `FilterExpressionBuilder`. Separation by `type` allows independent filtering in dual search.

---

## Decision 9: Re-embedding Existing Books

**Decision**: Books already processed under feature 001 (text-only) are NOT automatically re-embedded. An explicit re-embed action is exposed via `POST /api/v1/books/{id}/reembed` (admin-triggered). The existing chunks remain valid for text queries until re-embedding completes.

**Rationale**: Automatic re-embedding on deploy would block the system and risk data loss if the process fails mid-way. An explicit, idempotent trigger is safer and more observable.

---

## Decision 10: Minimum Image Size Threshold

**Decision**: Images smaller than 100×100 pixels are discarded and no chunk is created. Marker returns PNG bytes; `FigureExtractionService` decodes to `BufferedImage` solely to check dimensions. This threshold filters out decorative elements without a classification model.

**Rationale**: Neurosurgery textbook diagrams and MRI scans are never smaller than 100×100 px. The threshold is configurable via `app.figure-storage.min-image-size-px`.
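The size check in Decision 10 could look roughly like the sketch below, which uses only JDK classes. `FigureSizeFilter` and `meetsMinimumSize` are hypothetical names, and the sketch assumes an image is discarded when *either* dimension falls below the threshold; the decision text does not say whether one or both dimensions must be small.

```java
import java.awt.image.BufferedImage;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Base64;
import javax.imageio.ImageIO;

// Hypothetical helper: decides whether a base64-encoded image is large enough to index.
final class FigureSizeFilter {

    private FigureSizeFilter() {}

    /**
     * Decodes the base64 payload and reads it only to inspect its dimensions.
     * Returns false when the bytes are not a readable image or when either
     * side is smaller than minSizePx (assumed interpretation of "100×100").
     */
    static boolean meetsMinimumSize(String base64Image, int minSizePx) {
        byte[] bytes = Base64.getDecoder().decode(base64Image);
        try {
            BufferedImage image = ImageIO.read(new ByteArrayInputStream(bytes));
            if (image == null) {
                return false; // ImageIO found no reader for these bytes
            }
            return image.getWidth() >= minSizePx && image.getHeight() >= minSizePx;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

In the pipeline, `minSizePx` would presumably be bound to `app.figure-storage.min-image-size-px`.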
**Alternatives considered**:
- No threshold → decorative icons pollute the figure index
- ML-based classification → accurate but adds model dependency; not needed at POC scale

---

# Marker Study: Why Marker Replaces Google Document AI

*Added 2026-04-04.*

## What Marker Offers

Marker is an open-source, locally-runnable PDF-to-structured-content converter that uses a pipeline of deep-learning models (surya for OCR + layout detection, texify for equations). Key capabilities relevant to this project:

| Capability | Marker | Google Document AI |
|-----------|--------|--------------------|
| Multi-column reading order | ✅ | ✅ |
| OCR on scanned pages | ✅ | ✅ |
| Figure detection | ✅ returns pre-cropped images | ⚠️ returns bbox only; PDFBox still needed |
| Table extraction | ✅ HTML tables | ✅ |
| JSON output with image bytes | ✅ base64 in `images` map | ❌ |
| No cloud credentials | ✅ | ❌ GCP service account required |
| No per-page billing | ✅ | ❌ ~$10/1,000 pages |
| Batch size limits | None (local) | 15 pages / 20 MB per sync call |
| Setup | `pip install marker-pdf && marker_server` | GCP project + processor + IAM |

---

## Does Marker Solve the Current Pain Points?

### Pain Point 1: Naive 50/50 Column Split

**Answer: Yes, Marker fixes this completely.**

`PdfStructureParser.extractPageText()` splits pages at the horizontal midpoint with a 20% threshold. This fails on asymmetric columns and scanned pages. Marker's surya layout model returns blocks in natural reading order; no heuristic is needed.

### Pain Point 2: Figure Detection Misses Rasterized Figures

**Answer: Yes, Marker fixes this for most cases.**

`FigureExtractionService` previously iterated PDF XObjects (only finds embedded XObject images, misses rasterized figures and vector-path drawings). Marker's layout model detects visual elements by type and returns the cropped image bytes directly, with no PDFBox page rendering needed.
### Pain Point 3: OCR on Scanned Pages

**Answer: Yes, Marker handles scanned pages transparently via surya OCR.**

### Pain Point 4: Caption Detection

**Answer: Improved. Marker groups caption blocks with their figure block.**

The `block_type = "Caption"` block appears as a sibling or child adjacent to the `"Figure"` block in the Marker JSON, making caption association structural rather than regex-based.

---

## Marker API Integration

### Local Server Setup

```bash
pip install marker-pdf
marker_server --port 8000
```

The server exposes `POST /marker/upload` (the user's configured endpoint).

### Request

```
POST http://localhost:8000/marker/upload
Content-Type: multipart/form-data

file=@document.pdf
output_format=json
```

### Response (abbreviated)

```json
{
  "output_format": "json",
  "output": {
    "block_type": "Document",
    "children": [
      {
        "block_type": "Page",
        "id": "/page/0/Page/0",
        "children": [
          {
            "block_type": "SectionHeader",
            "id": "/page/0/SectionHeader/0",
            "html": "Cavernous Sinus Anatomy"
          },
          {
            "block_type": "Text",
            "id": "/page/0/Text/1",
            "html": "The cavernous sinus contains..."
          },
          {
            "block_type": "Figure",
            "id": "/page/0/Figure/2",
            "html": "",
            "images": { "/page/0/Figure/2": "iVBORw0KGgo..." }
          },
          {
            "block_type": "Caption",
            "id": "/page/0/Caption/3",
            "html": "Fig. 12-4. Coronal cross-section..."
          }
        ]
      }
    ],
    "metadata": { "page_stats": [...] }
  }
}
```

### Java Integration Pattern

```java
import com.fasterxml.jackson.databind.JsonNode;
import org.springframework.core.io.FileSystemResource;
import org.springframework.http.MediaType;
import org.springframework.util.LinkedMultiValueMap;
import org.springframework.util.MultiValueMap;

// MarkerPageParser: core call (restClient, baseUrl and pdfPath are fields/arguments of the parser)
MultiValueMap<String, Object> body = new LinkedMultiValueMap<>();
body.add("file", new FileSystemResource(pdfPath));
body.add("output_format", "json");

// Upload the PDF as multipart form data and read the whole response as a JSON tree
JsonNode response = restClient.post()
    .uri(baseUrl + "/marker/upload")
    .contentType(MediaType.MULTIPART_FORM_DATA)
    .body(body)
    .retrieve()
    .body(JsonNode.class);

// The block tree (Document → Page → blocks) lives under "output"
JsonNode document = response.get("output");
```

### Mapping Marker Blocks to PageResult

```
Page block (id "/page/N/Page/M")   → PageResult(pageNumber = N+1)
SectionHeader children             → headingTitle (first match)
Text, TextInlineMath children      → orderedText (HTML stripped, joined \n\n)
Figure children with images map    → FigureData(imageBytes = base64decode(images[id]))
Caption sibling of Figure          → FigureData.nearestCaption
```

---

## Architecture Change

```
Before (Document AI, removed):
  DocumentAiPageParser
    → Google Document AI API (GCP, 15-page batches, credentials)
    → returns text blocks + figure bboxes
  PdfStructureParser (PDFBox column heuristic)
  FigureExtractionService
    → renders page via PDFBox at 150 DPI
    → crops bbox region

After (Marker):
  MarkerPageParser
    → POST PDF to http://localhost:8000/marker/upload (output_format=json)
    → returns text blocks (correct reading order) + Figure blocks with base64 images
    → produces List<PageResult> (same DTO, FigureData carries bytes not bbox)
  FigureExtractionService (simplified)
    → base64-decodes image bytes from PageResult.FigureData
    → checks min size (ImageIO.read → getWidth/getHeight)
    → saves to S3 via FigureStorageService (UNCHANGED)
  VisionDescriptionService (UNCHANGED)
  BookEmbeddingService orchestration (MINOR: inject MarkerPageParser)
```

**What is removed**:
- `DocumentAiPageParser`, replaced by `MarkerPageParser`
- `DocumentAiConfig`, replaced by `MarkerConfig`
- `PdfStructureParser`; Marker handles reading order
- `google-cloud-document-ai` Maven dependency
- `app.document-ai.*` configuration properties

**What stays the same**:
- `PageResult` DTO structure (fields renamed, not restructured)
- `FigureExtractionService` public interface
- `TextChunkingService`, `VisionDescriptionService`, `BookEmbeddingService` orchestration
- All JPA entities, repositories, vector store, S3 storage

---

## Constitution Compliance

| Principle | Assessment |
|-----------|------------|
| **I. KISS** | ✅ Simpler than Document AI: one HTTP call replaces GCP SDK + PDFBox render loop. No new dependency beyond an HTTP client (Spring RestClient, already available). |
| **II. Easy to Change** | ✅ `MarkerPageParser` is the only Marker-aware class. Swap it to use any other parser. `PageResult` DTO unchanged in contract. |
| **III. Web-First** | ✅ Internal pipeline change; no API contract change. |
| **IV. Documentation** | ✅ README must show Marker as a local external service dependency. |

---

## Risks & Mitigations

| Risk | Likelihood | Mitigation |
|------|-----------|------------|
| Marker server not running when book is uploaded | Medium | `BookEmbeddingService` catches exception from `MarkerPageParser`, marks book as `FAILED`, logs full error. |
| Marker misses some figures (complex PDFs) | Medium | `app.figure-storage.min-image-size-px` threshold can be tuned. Add fallback: if Marker returns 0 figures for a page with known images, log a warning. |
| SC-003 (≤ 3× processing time) violated | Low | Marker runs locally (no network latency to cloud). Benchmark with a real 500-page book early. |
| Large PDF upload to Marker (>100MB) | Low | Marker server handles the full file; no batching needed. Multipart upload limit configurable. |
| Marker image quality vs PDFBox crop | Low | Marker crops at native resolution; quality is equivalent or better than 150 DPI PDFBox render. |
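
The first mitigation in the table (Marker server not running) might be sketched as follows. Everything here is a stand-in written for illustration: `MarkerFailureHandling`, `Book`, `parseOrFail`, and the `PROCESSING` / `READY` states are hypothetical; only the `FAILED` status and the "catch, mark, log" behaviour come from the table.

```java
import java.util.List;
import java.util.function.Supplier;
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical sketch of the "catch, mark FAILED, log" mitigation.
final class MarkerFailureHandling {

    private static final Logger LOG = Logger.getLogger(MarkerFailureHandling.class.getName());

    // Only FAILED is named in the risks table; the other states are placeholders.
    enum BookStatus { PROCESSING, READY, FAILED }

    /** Minimal stand-in for the book entity. */
    static final class Book {
        BookStatus status = BookStatus.PROCESSING;
    }

    private MarkerFailureHandling() {}

    /**
     * Runs the parsing step. If it throws (for example because the Marker
     * server is unreachable), the book is marked FAILED, the full error is
     * logged, and an empty list is returned so the caller can stop early.
     */
    static <T> List<T> parseOrFail(Book book, Supplier<List<T>> parseStep) {
        try {
            return parseStep.get();
        } catch (RuntimeException e) {
            book.status = BookStatus.FAILED;
            LOG.log(Level.SEVERE, "Marker parsing failed; book marked FAILED", e);
            return List.of();
        }
    }
}
```

Passing the parse step as a `Supplier` keeps the sketch independent of how `MarkerPageParser` is actually invoked.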