Add Marker for more effective PDF parsing

2026-04-04 21:30:18 +02:00
parent b154e29f2d
commit ea1276dc2e
25 changed files with 2318 additions and 285 deletions
@@ -1,10 +1,10 @@
 # Research: Enhanced Embedding with Image Parsing and Metadata

-**Branch**: `002-image-aware-embedding` | **Date**: 2026-04-03
+**Branch**: `002-image-aware-embedding` | **Date**: 2026-04-04 (updated: Marker replaces Google Document AI)

-This document resolves all technical unknowns identified during planning. The primary source for
-decisions is the detailed architecture provided directly by the project owner, supplemented by
-Spring AI 2.0.0-M4 API specifics.
+This document resolves all technical unknowns identified during planning. Decisions 1–10 cover
+the core pipeline. The **Marker Study** section at the bottom explains why Marker was chosen
+over Google Document AI to drive PDF parsing and figure extraction.

 ---

@@ -28,19 +28,29 @@ association explicit and queryable.

 ---

-## Decision 2: Image Extraction Strategy
+## Decision 2: Document Parsing Strategy

-**Decision**: Use PDFBox (already on classpath via `spring-ai-pdf-document-reader`) to extract
-images per page. Each image is tagged with `page`, `figure_id` (derived from caption, e.g.
-"Fig. 12-4"), and the parent `sectionId`. Images are saved to local disk under
-`/uploads/figures/{bookId}/`.
+**Decision**: Use **Marker** (local HTTP server, `http://localhost:8000/marker/upload`) as the
+single entry point for PDF parsing. A single `POST` with `output_format=json` returns:
+- Reading-order text blocks (headings, paragraphs) — no column-split heuristic needed
+- Pre-cropped figure images as base64-encoded PNG in the `images` map of each `Figure` block
+- Table, equation, and code blocks as structured HTML

-**Rationale**: PDFBox is already present (Spring AI bundles it). No new dependency needed.
-Per-page extraction ensures every image is captured regardless of PDF structure.
+`MarkerPageParser` translates the Marker JSON response into `List<PageResult>`, which is the
+same internal DTO used by the rest of the pipeline.
+
+**Rationale**: Marker handles column reordering, scanned-page OCR, and figure cropping in one
+call, eliminating the PDFBox column heuristic (`PdfStructureParser`) and the PDFBox
+render+crop loop in `FigureExtractionService`. Net result: fewer classes, no cloud dependency,
+no GCP credentials.

 **Alternatives considered**:
-- iText / iText7 → additional commercial dependency; overkill for extraction
-- Screenshot each page as PNG, then OCR → far slower; loses vector quality
+- PDFBox column heuristic (previous approach) → rejected: 50/50 split fails on asymmetric
+  columns and scanned pages
+- Google Document AI Layout Parser → rejected: adds GCP credentials, per-page billing, 15-page
+  batch limit, and still requires PDFBox to render+crop figure regions from bounding boxes.
+  See Marker Study below for detailed comparison.
+- Screenshot each page + OCR → far slower; loses digital text quality

 ---

@@ -103,18 +113,19 @@ search. This is the higher-recall path; dual search (Decision 4) is the higher-p

 ## Decision 6: Image Storage

-**Decision**: Extracted images are saved as PNG files to a local directory
-(`${app.figure-storage.base-path}`, defaults to `./uploads/figures/{bookId}/`). The path is
-stored in `figure.image_path` in Postgres. A `FigureStorageService` interface wraps all disk
-I/O so the implementation can be swapped to S3 or another object store without changing
-callers.
+**Decision**: Marker returns figure images as base64-encoded PNG bytes in the JSON response.
+`FigureExtractionService` decodes these bytes and passes them to `FigureStorageService`, which
+persists them to an S3-compatible bucket (`${app.figure-storage.bucket}`). The image path/URL
+is stored in `figure.image_path` in Postgres.

-**Rationale**: Local disk is the simplest viable option for a POC with <10 users. The interface
-boundary satisfies Constitution Principle II (Easy to Change).
+The `FigureStorageService` interface is unchanged; only the caller changes (from PDFBox crop
+to base64 decode).
+
+**Rationale**: Marker's pre-cropped images remove the need for PDFBox rendering.
+The `FigureStorageService` interface boundary satisfies Constitution Principle II (Easy to Change).

 **Alternatives considered**:
-- S3 from day 1 → operational overhead not justified at POC scale
-- Base64 in Postgres JSONB → bloats DB; complicates backup; query performance degrades
+- Store base64 in Postgres JSONB → bloats DB; complicates backup; query performance degrades

 ---

@@ -123,7 +134,8 @@ boundary satisfies Constitution Principle II (Easy to Change).
 **Decision**: Use the enum `FigureType { ANATOMICAL_DIAGRAM, SURGICAL_PHOTOGRAPH, MRI_CT_SCAN,
 TABLE, CHART, INTRAOPERATIVE_IMAGE }`. Classification is derived from:
 1. Caption keywords ("MRI", "CT", "Fig.", "Table") — heuristic, no model needed
-2. Fall back to `ANATOMICAL_DIAGRAM` if unclassifiable
+2. Marker `block_type` hint (`"Table"` → TABLE, `"Figure"` / `"Picture"` → ANATOMICAL_DIAGRAM default)
+3. Fall back to `ANATOMICAL_DIAGRAM` if unclassifiable

 **Rationale**: Allows the frontend to render different icon/label per type (e.g., "MRI" badge).
 Heuristic classification avoids a separate model call per image at extraction time.
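
The classification order in Decision 7 can be sketched as follows. This is a minimal sketch, not the project's classifier: the class name `FigureTypeClassifier`, the `classify` signature, and the exact keyword-to-type mapping ("MRI" or "CT" to `MRI_CT_SCAN`, a caption starting with "Table" to `TABLE`) are assumptions, since the text lists the keywords but not how each one maps. "Fig." is treated here as carrying no type information.

```java
import java.util.regex.Pattern;

final class FigureTypeClassifier {

    enum FigureType {
        ANATOMICAL_DIAGRAM, SURGICAL_PHOTOGRAPH, MRI_CT_SCAN,
        TABLE, CHART, INTRAOPERATIVE_IMAGE
    }

    // Assumption: whole-word, case-insensitive matching, so "CT" does not match inside "section".
    private static final Pattern SCAN_KEYWORD =
            Pattern.compile("\\b(MRI|CT)\\b", Pattern.CASE_INSENSITIVE);
    // Assumption: only captions that start with "Table" count as table captions.
    private static final Pattern TABLE_CAPTION =
            Pattern.compile("^\\s*Table\\b", Pattern.CASE_INSENSITIVE);

    static FigureType classify(String caption, String blockType) {
        // 1. Caption keywords take precedence.
        if (caption != null) {
            if (SCAN_KEYWORD.matcher(caption).find()) return FigureType.MRI_CT_SCAN;
            if (TABLE_CAPTION.matcher(caption).find()) return FigureType.TABLE;
        }
        // 2. Marker block_type hint.
        if ("Table".equals(blockType)) return FigureType.TABLE;
        // 3. "Figure", "Picture", and anything unclassifiable fall back to the default.
        return FigureType.ANATOMICAL_DIAGRAM;
    }

    public static void main(String[] args) {
        System.out.println(classify("Fig. 12-4. Coronal MRI of the cavernous sinus", "Figure")); // MRI_CT_SCAN
        System.out.println(classify("Fig. 3-1. Coronal cross-section", "Figure"));               // ANATOMICAL_DIAGRAM
        System.out.println(classify(null, "Table"));                                              // TABLE
    }
}
```

Keeping the heuristic in one pure function makes it cheap to unit-test and to replace later if a model-based classifier becomes worthwhile.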
@@ -175,14 +187,225 @@ the process fails mid-way. An explicit, idempotent trigger is safer and more obs

 ## Decision 10: Minimum Image Size Threshold

-**Decision**: Images smaller than 100×100 pixels are discarded and no chunk is created. This
-threshold filters out decorative elements (bullets, dividers, publisher logos) without a
-classification model.
+**Decision**: Images smaller than 100×100 pixels are discarded and no chunk is created. Marker
+returns PNG bytes; `FigureExtractionService` decodes to `BufferedImage` solely to check
+dimensions. This threshold filters out decorative elements without a classification model.

 **Rationale**: Neurosurgery textbook diagrams and MRI scans are never smaller than 100×100 px.
-The threshold is configurable via `app.figure-storage.min-image-size-px` in
-`application.properties`.
+The threshold is configurable via `app.figure-storage.min-image-size-px`.

 **Alternatives considered**:
 - No threshold → decorative icons pollute the figure index
 - ML-based classification → accurate but adds model dependency; not needed at POC scale
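
The size check in Decision 10 could look roughly like the sketch below. It assumes an image is discarded when either side is below the threshold (the text says "smaller than 100×100" without spelling this out), and the class and method names (`FigureSizeFilter`, `meetsMinimumSize`, `blankPngBase64`) are hypothetical. Only JDK classes are used.

```java
import java.awt.image.BufferedImage;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Base64;
import javax.imageio.ImageIO;

final class FigureSizeFilter {

    static {
        ImageIO.setUseCache(false); // keep decoding in memory; no temp files
    }

    /** Returns true when the decoded image is at least minSizePx on both sides. */
    static boolean meetsMinimumSize(String base64Png, int minSizePx) {
        byte[] bytes = Base64.getDecoder().decode(base64Png);
        try {
            BufferedImage image = ImageIO.read(new ByteArrayInputStream(bytes));
            if (image == null) {
                return false; // bytes are not a readable image: treat as discardable
            }
            return image.getWidth() >= minSizePx && image.getHeight() >= minSizePx;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Demo helper: builds a blank PNG of the given size and base64-encodes it. */
    static String blankPngBase64(int width, int height) {
        BufferedImage image = new BufferedImage(width, height, BufferedImage.TYPE_INT_RGB);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try {
            ImageIO.write(image, "png", out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return Base64.getEncoder().encodeToString(out.toByteArray());
    }

    public static void main(String[] args) {
        System.out.println(meetsMinimumSize(blankPngBase64(640, 480), 100)); // true
        System.out.println(meetsMinimumSize(blankPngBase64(32, 32), 100));   // false
    }
}
```

In the real service the threshold would come from `app.figure-storage.min-image-size-px` rather than a literal.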
+
+---
+
+# Marker Study — Why Marker Replaces Google Document AI
+
+*Added 2026-04-04.*
+
+## What Marker Offers
+
+Marker is an open-source, locally runnable PDF-to-structured-content converter that uses a
+pipeline of deep-learning models (surya for OCR + layout detection, texify for equations).
+Key capabilities relevant to this project:
+
+| Capability | Marker | Google Document AI |
+|-----------|--------|--------------------|
+| Multi-column reading order | ✅ | ✅ |
+| OCR on scanned pages | ✅ | ✅ |
+| Figure detection | ✅ returns pre-cropped images | ⚠️ returns bbox only; PDFBox still needed |
+| Table extraction | ✅ HTML tables | ✅ |
+| JSON output with image bytes | ✅ base64 in `images` map | ❌ |
+| No cloud credentials | ✅ | ❌ GCP service account required |
+| No per-page billing | ✅ | ❌ ~$10/1,000 pages |
+| Batch size limits | None (local) | 15 pages / 20 MB per sync call |
+| Setup | `pip install marker-pdf && marker_server` | GCP project + processor + IAM |
+
+---
+
+## Does Marker Solve the Current Pain Points?
+
+### Pain Point 1: Naive 50/50 Column Split
+
+**Answer: Yes, Marker fixes this completely.**
+
+`PdfStructureParser.extractPageText()` splits pages at the horizontal midpoint with a 20%
+threshold. This fails on asymmetric columns and scanned pages. Marker's surya layout model
+returns blocks in natural reading order — no heuristic needed.
+
+### Pain Point 2: Figure Detection Misses Rasterized Figures
+
+**Answer: Yes, Marker fixes this for most cases.**
+
+`FigureExtractionService` previously iterated PDF XObjects, which finds only embedded XObject
+images and misses rasterized figures and vector-path drawings. Marker's layout model detects
+visual elements by type and returns the cropped image bytes directly, so no PDFBox page
+rendering is needed.
+
+### Pain Point 3: OCR on Scanned Pages
+
+**Answer: Yes, Marker handles scanned pages transparently via surya OCR.**
+
+### Pain Point 4: Caption Detection
+
+**Answer: Improved — Marker groups caption blocks with their figure block.**
+
+The `block_type = "Caption"` block appears as a sibling or child adjacent to the `"Figure"`
+block in the Marker JSON, making caption association structural rather than regex-based.
+
+---
+
+## Marker API Integration
+
+### Local Server Setup
+
+```bash
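+pip install marker-pdf
+# Depending on the Marker version, the API server may also need:
+#   pip install -U uvicorn fastapi python-multipart
+marker_server --port 8000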
+```
+
+The server exposes `POST /marker/upload` (the endpoint this project is configured to use).
+
+### Request
+
+```
+# -F sends the fields as multipart/form-data
+curl -X POST http://localhost:8000/marker/upload \
+  -F "file=@document.pdf" \
+  -F "output_format=json"
+```
+
+### Response (abbreviated)
+
+```json
+{
+  "output_format": "json",
+  "output": {
+    "block_type": "Document",
+    "children": [
+      {
+        "block_type": "Page",
+        "id": "/page/0/Page/0",
+        "children": [
+          {
+            "block_type": "SectionHeader",
+            "id": "/page/0/SectionHeader/0",
+            "html": "<h1>Cavernous Sinus Anatomy</h1>"
+          },
+          {
+            "block_type": "Text",
+            "id": "/page/0/Text/1",
+            "html": "<p>The cavernous sinus contains...</p>"
+          },
+          {
+            "block_type": "Figure",
+            "id": "/page/0/Figure/2",
+            "html": "<figure><img src='/page/0/Figure/2'/></figure>",
+            "images": {
+              "/page/0/Figure/2": "iVBORw0KGgo..."
+            }
+          },
+          {
+            "block_type": "Caption",
+            "id": "/page/0/Caption/3",
+            "html": "<p>Fig. 12-4. Coronal cross-section...</p>"
+          }
+        ]
+      }
+    ],
+    "metadata": { "page_stats": [...] }
+  }
+}
+```
+
+### Java Integration Pattern
+
+```java
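+// MarkerPageParser: core call. restClient, baseUrl, and pdfPath come from the enclosing class.
+// Spring imports: org.springframework.core.io.FileSystemResource, org.springframework.http.MediaType,
+// org.springframework.util.LinkedMultiValueMap, org.springframework.util.MultiValueMap.
+MultiValueMap<String, Object> body = new LinkedMultiValueMap<>();
+body.add("file", new FileSystemResource(pdfPath));   // sent as the multipart "file" part
+body.add("output_format", "json");
+
+JsonNode response = restClient.post()
+    .uri(baseUrl + "/marker/upload")
+    .contentType(MediaType.MULTIPART_FORM_DATA)
+    .body(body)
+    .retrieve()
+    .body(JsonNode.class);
+
+// body(...) may return null for an empty response; fail with a clear message instead of an NPE.
+if (response == null || !response.has("output")) {
+    throw new IllegalStateException("Marker response has no \"output\" field");
+}
+JsonNode document = response.get("output");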
+```
+
+### Mapping Marker Blocks to PageResult
+
+```
+Page block (id "/page/N/Page/M") → PageResult(pageNumber = N+1)
+  SectionHeader children           → headingTitle (first match)
+  Text, TextInlineMath children    → orderedText (HTML stripped, joined \n\n)
+  Figure children with images map  → FigureData(imageBytes = base64decode(images[id]))
+  Caption sibling of Figure        → FigureData.nearestCaption
+```
+
+---
+
+## Architecture Change
+
+```
+Before (Document AI — removed):
+  DocumentAiPageParser
+      → Google Document AI API (GCP, 15-page batches, credentials)
+      → returns text blocks + figure bboxes
+  PdfStructureParser (PDFBox column heuristic)
+  FigureExtractionService
+      → renders page via PDFBox at 150 DPI
+      → crops bbox region
+
+After (Marker):
+  MarkerPageParser
+      → POST PDF to http://localhost:8000/marker/upload (output_format=json)
+      → returns text blocks (correct reading order) + Figure blocks with base64 images
+      → produces List<PageResult> (same DTO, FigureData carries bytes not bbox)
+  FigureExtractionService (simplified)
+      → base64-decodes image bytes from PageResult.FigureData
+      → checks min size (ImageIO.read → getWidth/getHeight)
+      → saves to S3 via FigureStorageService (UNCHANGED)
+  VisionDescriptionService (UNCHANGED)
+  BookEmbeddingService orchestration (MINOR: inject MarkerPageParser)
+```
+
+**What is removed**:
+- `DocumentAiPageParser` — replaced by `MarkerPageParser`
+- `DocumentAiConfig` — replaced by `MarkerConfig`
+- `PdfStructureParser` — Marker handles reading order
+- `google-cloud-document-ai` Maven dependency
+- `app.document-ai.*` configuration properties
+
+**What stays the same**:
+- `PageResult` DTO structure (fields renamed, not restructured)
+- `FigureExtractionService` public interface
+- `TextChunkingService`, `VisionDescriptionService`, `BookEmbeddingService` orchestration
+- All JPA entities, repositories, vector store, S3 storage
+
+---
+
+## Constitution Compliance
+
+| Principle | Assessment |
+|-----------|------------|
+| **I. KISS** | ✅ Simpler than Document AI — one HTTP call replaces GCP SDK + PDFBox render loop. No new dependency beyond an HTTP client (Spring RestClient, already available). |
+| **II. Easy to Change** | ✅ `MarkerPageParser` is the only Marker-aware class. Swap it to use any other parser. `PageResult` DTO unchanged in contract. |
+| **III. Web-First** | ✅ Internal pipeline change; no API contract change. |
+| **IV. Documentation** | ✅ README must show Marker as a local external service dependency. |
+
+---
+
+## Risks & Mitigations
+
+| Risk | Likelihood | Mitigation |
+|------|-----------|------------|
+| Marker server not running when book is uploaded | Medium | `BookEmbeddingService` catches exception from `MarkerPageParser`, marks book as `FAILED`, logs full error. |
+| Marker misses some figures (complex PDFs) | Medium | `app.figure-storage.min-image-size-px` threshold can be tuned. Add fallback: if Marker returns 0 figures for a page with known images, log a warning. |
+| SC-003 (≤ 3× processing time) violated | Low | Marker runs locally (no network latency to cloud). Benchmark with a real 500-page book early. |
+| Large PDF upload to Marker (> 100 MB) | Low | Marker server handles the full file; no batching needed. The multipart upload limit is configurable. |
+| Marker image quality vs PDFBox crop | Low | Marker crops at native resolution; quality is equivalent or better than 150 DPI PDFBox render. |
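
The first mitigation in the table (mark the book as `FAILED` when Marker is unreachable) might be structured like this. It is a sketch under stated assumptions: `PageParser`, `BookStatus`, and `embed` are hypothetical stand-ins for `MarkerPageParser`, the book status field, and the `BookEmbeddingService` entry point. It also assumes the parser surfaces connection failures as unchecked exceptions, which is how Spring's `RestClient` reports I/O errors.

```java
import java.util.List;
import java.util.logging.Level;
import java.util.logging.Logger;

final class EmbeddingFailureHandling {

    private static final Logger LOG =
            Logger.getLogger(EmbeddingFailureHandling.class.getName());

    enum BookStatus { COMPLETED, FAILED }

    /** Hypothetical stand-in for MarkerPageParser; returns one entry per parsed page. */
    interface PageParser {
        List<String> parse(String pdfPath);
    }

    static BookStatus embed(String pdfPath, PageParser parser) {
        try {
            List<String> pages = parser.parse(pdfPath);
            LOG.info(() -> "Parsed " + pages.size() + " pages from " + pdfPath);
            // Chunking, figure extraction, and embedding would run here.
            return BookStatus.COMPLETED;
        } catch (RuntimeException e) {
            // Log the full stack trace and report failure instead of propagating.
            LOG.log(Level.SEVERE, "Marker parsing failed for " + pdfPath, e);
            return BookStatus.FAILED;
        }
    }

    public static void main(String[] args) {
        System.out.println(embed("book.pdf", path -> List.of("page 1", "page 2")));
        System.out.println(embed("book.pdf", path -> {
            throw new IllegalStateException("connection refused");
        }));
    }
}
```

Returning a status rather than rethrowing keeps the upload request from failing with an opaque 500 while still leaving the full error in the logs.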