Add Marker for effective PDF parsing

Adrien
2026-04-04 21:30:18 +02:00
parent b154e29f2d
commit ea1276dc2e
25 changed files with 2318 additions and 285 deletions
+251 -28
@@ -1,10 +1,10 @@
# Research: Enhanced Embedding with Image Parsing and Metadata
**Branch**: `002-image-aware-embedding` | **Date**: 2026-04-03
**Branch**: `002-image-aware-embedding` | **Date**: 2026-04-04 (updated: Marker replaces Google Document AI)
This document resolves all technical unknowns identified during planning. The primary source for
decisions is the detailed architecture provided directly by the project owner, supplemented by
Spring AI 2.0.0-M4 API specifics.
This document resolves all technical unknowns identified during planning. Decisions 1-10 cover
the core pipeline. The **Marker Study** section at the bottom explains why Marker was chosen
over Google Document AI to drive PDF parsing and figure extraction.
---
@@ -28,19 +28,29 @@ association explicit and queryable.
---
## Decision 2: Image Extraction Strategy
## Decision 2: Document Parsing Strategy
**Decision**: Use PDFBox (already on classpath via `spring-ai-pdf-document-reader`) to extract
images per page. Each image is tagged with `page`, `figure_id` (derived from caption, e.g.
"Fig. 12-4"), and the parent `sectionId`. Images are saved to local disk under
`/uploads/figures/{bookId}/`.
**Decision**: Use **Marker** (local HTTP server, `http://localhost:8000/marker/upload`) as the
single entry point for PDF parsing. A single `POST` with `output_format=json` returns:
- Reading-order text blocks (headings, paragraphs) — no column-split heuristic needed
- Pre-cropped figure images as base64-encoded PNG in the `images` map of each `Figure` block
- Table, equation, and code blocks as structured HTML
**Rationale**: PDFBox is already present (Spring AI bundles it). No new dependency needed.
Per-page extraction ensures every image is captured regardless of PDF structure.
`MarkerPageParser` translates the Marker JSON response into `List<PageResult>`, which is the
same internal DTO used by the rest of the pipeline.
**Rationale**: Marker handles column reordering, scanned-page OCR, and figure cropping in one
call, eliminating the PDFBox column heuristic (`PdfStructureParser`) and the PDFBox
render+crop loop in `FigureExtractionService`. Net result: fewer classes, no cloud dependency,
no GCP credentials.
**Alternatives considered**:
- iText / iText7 → additional commercial dependency; overkill for extraction
- Screenshot each page as PNG, then OCR → far slower; loses vector quality
- PDFBox column heuristic (previous approach) → rejected: 50/50 split fails on asymmetric
columns and scanned pages
- Google Document AI Layout Parser → rejected: adds GCP credentials, per-page billing, 15-page
batch limit, and still requires PDFBox to render+crop figure regions from bounding boxes.
See Marker Study below for detailed comparison.
- Screenshot each page + OCR → far slower; loses digital text quality
---
@@ -103,18 +113,19 @@ search. This is the higher-recall path; dual search (Decision 4) is the higher-p
## Decision 6: Image Storage
**Decision**: Extracted images are saved as PNG files to a local directory
(`${app.figure-storage.base-path}`, defaults to `./uploads/figures/{bookId}/`). The path is
stored in `figure.image_path` in Postgres. A `FigureStorageService` interface wraps all disk
I/O so the implementation can be swapped to S3 or another object store without changing
callers.
**Decision**: Marker returns figure images as base64-encoded PNG bytes in the JSON response.
`FigureExtractionService` decodes these bytes and passes them to `FigureStorageService`, which
persists them to an S3-compatible bucket (`${app.figure-storage.bucket}`). The image path/URL
is stored in `figure.image_path` in Postgres.
**Rationale**: Local disk is the simplest viable option for a POC with <10 users. The interface
boundary satisfies Constitution Principle II (Easy to Change).
The `FigureStorageService` interface is unchanged; only the caller changes (from PDFBox crop
to base64 decode).
**Rationale**: Marker's pre-cropped images remove the need for PDFBox rendering.
`FigureStorageService` interface boundary satisfies Constitution Principle II (Easy to Change).
**Alternatives considered**:
- S3 from day 1 → operational overhead not justified at POC scale
- Base64 in Postgres JSONB → bloats DB; complicates backup; query performance degrades
- Store base64 in Postgres JSONB → bloats DB; complicates backup; query performance degrades
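The decode-then-store handoff in Decision 6 could look roughly like the sketch below. The `store` signature, the object key layout, and the `InMemoryFigureStorage` and `FigureDecoder` classes are assumptions made for illustration; the actual `FigureStorageService` contract is not shown in this document and may differ.

```java
import java.util.Base64;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical shape of the storage boundary; the real interface may differ. */
interface FigureStorageService {
    /** Persists PNG bytes and returns the path or URL to record in figure.image_path. */
    String store(String bookId, String figureId, byte[] pngBytes);
}

/** In-memory stand-in for illustration; a real implementation would write to the bucket. */
final class InMemoryFigureStorage implements FigureStorageService {
    final Map<String, byte[]> objects = new HashMap<>();

    @Override
    public String store(String bookId, String figureId, byte[] pngBytes) {
        // Assumed key layout, chosen only for this example.
        String key = "figures/" + bookId + "/" + figureId + ".png";
        objects.put(key, pngBytes.clone());
        return key;
    }
}

/** Hypothetical caller: decodes Marker's base64 payload and hands raw bytes to storage. */
final class FigureDecoder {
    private FigureDecoder() {}

    static String decodeAndStore(FigureStorageService storage, String bookId,
                                 String figureId, String base64Png) {
        byte[] bytes = Base64.getDecoder().decode(base64Png);
        return storage.store(bookId, figureId, bytes);
    }
}
```

Because the caller depends only on the interface, swapping the in-memory stand-in for a bucket-backed implementation would not touch `FigureDecoder`.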
---
@@ -123,7 +134,8 @@ boundary satisfies Constitution Principle II (Easy to Change).
**Decision**: Use the enum `FigureType { ANATOMICAL_DIAGRAM, SURGICAL_PHOTOGRAPH, MRI_CT_SCAN,
TABLE, CHART, INTRAOPERATIVE_IMAGE }`. Classification is derived from:
1. Caption keywords ("MRI", "CT", "Fig.", "Table") — heuristic, no model needed
2. Fall back to `ANATOMICAL_DIAGRAM` if unclassifiable
2. Marker `block_type` hint (`"Table"` → TABLE, `"Figure"` / `"Picture"` → ANATOMICAL_DIAGRAM default)
3. Fall back to `ANATOMICAL_DIAGRAM` if unclassifiable
**Rationale**: Allows the frontend to render different icon/label per type (e.g., "MRI" badge).
Heuristic classification avoids a separate model call per image at extraction time.
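A minimal sketch of the three-step classification order above, assuming a hypothetical `FigureTypeClassifier` helper. The keyword-to-type mapping ("MRI"/"CT" to `MRI_CT_SCAN`, "Table" to `TABLE`) is an assumption; the document lists the keywords but not which type each one selects, and "Fig." is left unmapped here because it does not identify a type on its own.

```java
import java.util.regex.Pattern;

/** Figure categories from Decision 7. */
enum FigureType {
    ANATOMICAL_DIAGRAM, SURGICAL_PHOTOGRAPH, MRI_CT_SCAN, TABLE, CHART, INTRAOPERATIVE_IMAGE
}

/** Hypothetical helper: caption keywords first, then the Marker block_type hint, then the default. */
final class FigureTypeClassifier {
    // Case-sensitive with word boundaries so "CT" does not match inside words like "section".
    private static final Pattern SCAN = Pattern.compile("\\b(MRI|CT)\\b");
    private static final Pattern TABLE = Pattern.compile("\\bTable\\b", Pattern.CASE_INSENSITIVE);

    private FigureTypeClassifier() {}

    static FigureType classify(String caption, String blockType) {
        String text = caption == null ? "" : caption;
        // 1. Caption keywords (assumed mapping).
        if (SCAN.matcher(text).find()) return FigureType.MRI_CT_SCAN;
        if (TABLE.matcher(text).find()) return FigureType.TABLE;
        // 2. Marker block_type hint.
        if ("Table".equals(blockType)) return FigureType.TABLE;
        // 3. "Figure", "Picture", and anything unrecognised share the default.
        return FigureType.ANATOMICAL_DIAGRAM;
    }
}
```

With this ordering a caption keyword always wins over the block type, so a `"Picture"` block captioned "Table 3" would still be classified as `TABLE`.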
@@ -175,14 +187,225 @@ the process fails mid-way. An explicit, idempotent trigger is safer and more obs
## Decision 10: Minimum Image Size Threshold
**Decision**: Images smaller than 100×100 pixels are discarded and no chunk is created. This
threshold filters out decorative elements (bullets, dividers, publisher logos) without a
classification model.
**Decision**: Images smaller than 100×100 pixels are discarded and no chunk is created. Marker
returns PNG bytes; `FigureExtractionService` decodes to `BufferedImage` solely to check
dimensions. This threshold filters out decorative elements without a classification model.
**Rationale**: Neurosurgery textbook diagrams and MRI scans are never smaller than 100×100 px.
The threshold is configurable via `app.figure-storage.min-image-size-px` in
`application.properties`.
The threshold is configurable via `app.figure-storage.min-image-size-px`.
**Alternatives considered**:
- No threshold → decorative icons pollute the figure index
- ML-based classification → accurate but adds model dependency; not needed at POC scale
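The size check can be sketched with the JDK alone. `FigureSizeFilter` is a hypothetical name, and the sketch assumes an image is discarded when either dimension falls below the threshold; the document does not say whether both dimensions must be small.

```java
import java.awt.image.BufferedImage;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Base64;
import javax.imageio.ImageIO;

/** Hypothetical helper for the Decision 10 threshold. */
final class FigureSizeFilter {
    private FigureSizeFilter() {}

    /** Returns true when the base64 PNG decodes to an image at least minSizePx wide and tall. */
    static boolean meetsMinimumSize(String base64Png, int minSizePx) {
        byte[] bytes = Base64.getDecoder().decode(base64Png);
        try {
            BufferedImage image = ImageIO.read(new ByteArrayInputStream(bytes));
            if (image == null) {
                return false; // ImageIO.read returns null when no reader understands the data
            }
            return image.getWidth() >= minSizePx && image.getHeight() >= minSizePx;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The caller would pass the value of `app.figure-storage.min-image-size-px` as `minSizePx` and skip chunk creation when the method returns `false`.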
---
# Marker Study — Why Marker Replaces Google Document AI
*Added 2026-04-04.*
## What Marker Offers
Marker is an open-source, locally-runnable PDF-to-structured-content converter that uses a
pipeline of deep-learning models (surya for OCR + layout detection, texify for equations).
Key capabilities relevant to this project:
| Capability | Marker | Google Document AI |
|-----------|--------|--------------------|
| Multi-column reading order | ✅ | ✅ |
| OCR on scanned pages | ✅ | ✅ |
| Figure detection | ✅ returns pre-cropped images | ⚠️ returns bbox only; PDFBox still needed |
| Table extraction | ✅ HTML tables | ✅ |
| JSON output with image bytes | ✅ base64 in `images` map | ❌ |
| No cloud credentials | ✅ | ❌ GCP service account required |
| No per-page billing | ✅ | ❌ ~$10/1,000 pages |
| Batch size limits | None (local) | 15 pages / 20 MB per sync call |
| Setup | `pip install marker-pdf && marker_server` | GCP project + processor + IAM |
---
## Does Marker Solve the Current Pain Points?
### Pain Point 1: Naive 50/50 Column Split
**Answer: Yes, Marker fixes this completely.**
`PdfStructureParser.extractPageText()` splits pages at the horizontal midpoint with a 20%
threshold. This fails on asymmetric columns and scanned pages. Marker's surya layout model
returns blocks in natural reading order — no heuristic needed.
### Pain Point 2: Figure Detection Misses Rasterized Figures
**Answer: Yes, Marker fixes this for most cases.**
`FigureExtractionService` previously iterated PDF XObjects (only finds embedded XObject images,
misses rasterized figures and vector-path drawings). Marker's layout model detects visual
elements by type and returns the cropped image bytes directly — no PDFBox page rendering needed.
### Pain Point 3: OCR on Scanned Pages
**Answer: Yes, Marker handles scanned pages transparently via surya OCR.**
### Pain Point 4: Caption Detection
**Answer: Improved — Marker groups caption blocks with their figure block.**
The `block_type = "Caption"` block appears as a sibling or child adjacent to the `"Figure"`
block in the Marker JSON, making caption association structural rather than regex-based.
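A sketch of the structural pairing, covering only the sibling case. `Block`, `FigureWithCaption`, and `CaptionAssociator` are hypothetical stand-ins, not the project's DTOs, and the rule "prefer the block after the figure, else the one before" is an assumption about how adjacency would be resolved.

```java
import java.util.ArrayList;
import java.util.List;

/** Minimal stand-in for a Marker block; only the fields used here. */
record Block(String blockType, String id, String text) {}

/** Hypothetical pairing result; caption is null when no adjacent Caption block exists. */
record FigureWithCaption(String figureId, String caption) {}

final class CaptionAssociator {
    private CaptionAssociator() {}

    /** Pairs each Figure block with the Caption block directly after it, else directly before it. */
    static List<FigureWithCaption> pair(List<Block> siblings) {
        List<FigureWithCaption> result = new ArrayList<>();
        for (int i = 0; i < siblings.size(); i++) {
            Block block = siblings.get(i);
            if (!"Figure".equals(block.blockType())) {
                continue;
            }
            String caption = captionAt(siblings, i + 1);
            if (caption == null) {
                caption = captionAt(siblings, i - 1);
            }
            result.add(new FigureWithCaption(block.id(), caption));
        }
        return result;
    }

    private static String captionAt(List<Block> siblings, int index) {
        if (index < 0 || index >= siblings.size()) {
            return null;
        }
        Block candidate = siblings.get(index);
        return "Caption".equals(candidate.blockType()) ? candidate.text() : null;
    }
}
```

One known limitation of this sketch: a caption sitting between two figures would be attached to both, so a real implementation would likely need to mark captions as consumed.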
---
## Marker API Integration
### Local Server Setup
```bash
pip install marker-pdf
marker_server --port 8000
```
The server exposes `POST /marker/upload` (the user's configured endpoint).
### Request
```
POST http://localhost:8000/marker/upload
Content-Type: multipart/form-data
file=@document.pdf
output_format=json
```
### Response (abbreviated)
```json
{
  "output_format": "json",
  "output": {
    "block_type": "Document",
    "children": [
      {
        "block_type": "Page",
        "id": "/page/0/Page/0",
        "children": [
          {
            "block_type": "SectionHeader",
            "id": "/page/0/SectionHeader/0",
            "html": "<h1>Cavernous Sinus Anatomy</h1>"
          },
          {
            "block_type": "Text",
            "id": "/page/0/Text/1",
            "html": "<p>The cavernous sinus contains...</p>"
          },
          {
            "block_type": "Figure",
            "id": "/page/0/Figure/2",
            "html": "<figure><img src='/page/0/Figure/2'/></figure>",
            "images": {
              "/page/0/Figure/2": "iVBORw0KGgo..."
            }
          },
          {
            "block_type": "Caption",
            "id": "/page/0/Caption/3",
            "html": "<p>Fig. 12-4. Coronal cross-section...</p>"
          }
        ]
      }
    ],
    "metadata": { "page_stats": [...] }
  }
}
```
### Java Integration Pattern
```java
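// MarkerPageParser: core call (fragment; restClient, baseUrl and pdfPath come from the enclosing class)
// Needs: org.springframework.core.io.FileSystemResource, org.springframework.http.MediaType,
//        org.springframework.util.LinkedMultiValueMap, org.springframework.util.MultiValueMap,
//        com.fasterxml.jackson.databind.JsonNode
MultiValueMap<String, Object> body = new LinkedMultiValueMap<>();
body.add("file", new FileSystemResource(pdfPath)); // multipart part named "file"
body.add("output_format", "json");
JsonNode response = restClient.post()
        .uri(baseUrl + "/marker/upload")
        .contentType(MediaType.MULTIPART_FORM_DATA)
        .body(body)
        .retrieve()
        .body(JsonNode.class);
// Fail fast on an empty payload so the caller can mark the book as FAILED.
if (response == null || !response.hasNonNull("output")) {
    throw new IllegalStateException("Marker response has no 'output' field");
}
JsonNode document = response.get("output");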
```
### Mapping Marker Blocks to PageResult
```
Page block (id "/page/N/Page/M") → PageResult(pageNumber = N+1)
SectionHeader children → headingTitle (first match)
Text, TextInlineMath children → orderedText (HTML stripped, joined \n\n)
Figure children with images map → FigureData(imageBytes = base64decode(images[id]))
Caption sibling of Figure → FigureData.nearestCaption
```
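Two of the mapping rules above (page number from the block id, and HTML-stripped text joined by blank lines) can be sketched without any JSON library. `MarkerBlockMapping` is a hypothetical helper, and the regex-based tag stripping is a deliberate simplification: it does not decode HTML entities, so a real implementation might prefer an HTML parser.

```java
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

/** Hypothetical helper illustrating two rows of the block-to-PageResult mapping. */
final class MarkerBlockMapping {
    private static final Pattern PAGE_ID = Pattern.compile("^/page/(\\d+)/");
    private static final Pattern TAG = Pattern.compile("<[^>]+>");

    private MarkerBlockMapping() {}

    /** Converts Marker's zero-based page index in a block id to a one-based page number. */
    static int pageNumber(String blockId) {
        Matcher matcher = PAGE_ID.matcher(blockId);
        if (!matcher.find()) {
            throw new IllegalArgumentException("Not a Marker block id: " + blockId);
        }
        return Integer.parseInt(matcher.group(1)) + 1;
    }

    /** Strips tags from each HTML fragment, drops empty results, and joins the rest with blank lines. */
    static String orderedText(List<String> htmlFragments) {
        return htmlFragments.stream()
                .map(html -> TAG.matcher(html).replaceAll("").trim())
                .filter(text -> !text.isEmpty())
                .collect(Collectors.joining("\n\n"));
    }
}
```

For the sample response shown earlier, `pageNumber("/page/0/Figure/2")` yields page 1, which matches the `N+1` rule in the mapping.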
---
## Architecture Change
```
Before (Document AI — removed):
DocumentAiPageParser
→ Google Document AI API (GCP, 15-page batches, credentials)
→ returns text blocks + figure bboxes
PdfStructureParser (PDFBox column heuristic)
FigureExtractionService
→ renders page via PDFBox at 150 DPI
→ crops bbox region
After (Marker):
MarkerPageParser
→ POST PDF to http://localhost:8000/marker/upload (output_format=json)
→ returns text blocks (correct reading order) + Figure blocks with base64 images
→ produces List<PageResult> (same DTO, FigureData carries bytes not bbox)
FigureExtractionService (simplified)
→ base64-decodes image bytes from PageResult.FigureData
→ checks min size (ImageIO.read → getWidth/getHeight)
→ saves to S3 via FigureStorageService (UNCHANGED)
VisionDescriptionService (UNCHANGED)
BookEmbeddingService orchestration (MINOR: inject MarkerPageParser)
```
**What is removed**:
- `DocumentAiPageParser` — replaced by `MarkerPageParser`
- `DocumentAiConfig` — replaced by `MarkerConfig`
- `PdfStructureParser` — Marker handles reading order
- `google-cloud-document-ai` Maven dependency
- `app.document-ai.*` configuration properties
**What stays the same**:
- `PageResult` DTO structure (fields renamed, not restructured)
- `FigureExtractionService` public interface
- `TextChunkingService`, `VisionDescriptionService`, `BookEmbeddingService` orchestration
- All JPA entities, repositories, vector store, S3 storage
---
## Constitution Compliance
| Principle | Assessment |
|-----------|------------|
| **I. KISS** | ✅ Simpler than Document AI — one HTTP call replaces GCP SDK + PDFBox render loop. No new dependency beyond an HTTP client (Spring RestClient, already available). |
| **II. Easy to Change** | ✅ `MarkerPageParser` is the only Marker-aware class. Swap it to use any other parser. `PageResult` DTO unchanged in contract. |
| **III. Web-First** | ✅ Internal pipeline change; no API contract change. |
| **IV. Documentation** | ✅ README must show Marker as a local external service dependency. |
---
## Risks & Mitigations
| Risk | Likelihood | Mitigation |
|------|-----------|------------|
| Marker server not running when book is uploaded | Medium | `BookEmbeddingService` catches exception from `MarkerPageParser`, marks book as `FAILED`, logs full error. |
| Marker misses some figures (complex PDFs) | Medium | `app.figure-storage.min-image-size-px` threshold can be tuned. Add fallback: if Marker returns 0 figures for a page with known images, log a warning. |
| SC-003 (≤ 3× processing time) violated | Low | Marker runs locally (no network latency to cloud). Benchmark with a real 500-page book early. |
| Large PDF upload to Marker (>100MB) | Low | Marker server handles the full file; no batching needed. Multipart upload limit configurable. |
| Marker image quality vs PDFBox crop | Low | Marker crops at native resolution; quality is equivalent or better than 150 DPI PDFBox render. |