Adding Marker to parse PDFs effectively
@@ -1,10 +1,10 @@
# Research: Enhanced Embedding with Image Parsing and Metadata

**Branch**: `002-image-aware-embedding` | **Date**: 2026-04-03
**Branch**: `002-image-aware-embedding` | **Date**: 2026-04-04 (updated: Marker replaces Google Document AI)

This document resolves all technical unknowns identified during planning. The primary source for
decisions is the detailed architecture provided directly by the project owner, supplemented by
Spring AI 2.0.0-M4 API specifics.
This document resolves all technical unknowns identified during planning. Decisions 1–10 cover
the core pipeline. The **Marker Study** section at the bottom explains why Marker was chosen
over Google Document AI to drive PDF parsing and figure extraction.

---

@@ -28,19 +28,29 @@ association explicit and queryable.

---

## Decision 2: Image Extraction Strategy
## Decision 2: Document Parsing Strategy

**Decision**: Use PDFBox (already on classpath via `spring-ai-pdf-document-reader`) to extract
images per page. Each image is tagged with `page`, `figure_id` (derived from caption, e.g.
"Fig. 12-4"), and the parent `sectionId`. Images are saved to local disk under
`/uploads/figures/{bookId}/`.
**Decision**: Use **Marker** (local HTTP server, `http://localhost:8000/marker/upload`) as the
single entry point for PDF parsing. A single `POST` with `output_format=json` returns:
- Reading-order text blocks (headings, paragraphs) — no column-split heuristic needed
- Pre-cropped figure images as base64-encoded PNG in the `images` map of each `Figure` block
- Table, equation, and code blocks as structured HTML

**Rationale**: PDFBox is already present (Spring AI bundles it). No new dependency needed.
Per-page extraction ensures every image is captured regardless of PDF structure.
`MarkerPageParser` translates the Marker JSON response into `List<PageResult>`, which is the
same internal DTO used by the rest of the pipeline.

**Rationale**: Marker handles column reordering, scanned-page OCR, and figure cropping in one
call, eliminating the PDFBox column heuristic (`PdfStructureParser`) and the PDFBox
render+crop loop in `FigureExtractionService`. Net result: fewer classes, no cloud dependency,
no GCP credentials.

**Alternatives considered**:
- iText / iText7 → additional commercial dependency; overkill for extraction
- Screenshot each page as PNG, then OCR → far slower; loses vector quality
- PDFBox column heuristic (previous approach) → rejected: 50/50 split fails on asymmetric
  columns and scanned pages
- Google Document AI Layout Parser → rejected: adds GCP credentials, per-page billing, 15-page
  batch limit, and still requires PDFBox to render+crop figure regions from bounding boxes.
  See Marker Study below for detailed comparison.
- Screenshot each page + OCR → far slower; loses digital text quality

---

@@ -103,18 +113,19 @@ search. This is the higher-recall path; dual search (Decision 4) is the higher-p

## Decision 6: Image Storage

**Decision**: Extracted images are saved as PNG files to a local directory
(`${app.figure-storage.base-path}`, defaults to `./uploads/figures/{bookId}/`). The path is
stored in `figure.image_path` in Postgres. A `FigureStorageService` interface wraps all disk
I/O so the implementation can be swapped to S3 or another object store without changing
callers.
**Decision**: Marker returns figure images as base64-encoded PNG bytes in the JSON response.
`FigureExtractionService` decodes these bytes and passes them to `FigureStorageService`, which
persists them to an S3-compatible bucket (`${app.figure-storage.bucket}`). The image path/URL
is stored in `figure.image_path` in Postgres.

**Rationale**: Local disk is the simplest viable option for a POC with <10 users. The interface
boundary satisfies Constitution Principle II (Easy to Change).
The `FigureStorageService` interface is unchanged; only the caller changes (from PDFBox crop
to base64 decode).

**Rationale**: Marker's pre-cropped images remove the need for PDFBox rendering.
`FigureStorageService` interface boundary satisfies Constitution Principle II (Easy to Change).

**Alternatives considered**:
- S3 from day 1 → operational overhead not justified at POC scale
- Base64 in Postgres JSONB → bloats DB; complicates backup; query performance degrades
- Store base64 in Postgres JSONB → bloats DB; complicates backup; query performance degrades
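
The decode-then-store hand-off described in this decision could look roughly like the sketch below. The document names `FigureStorageService` but does not show its methods, so the `store(...)` signature, the `InMemoryFigureStorage` stand-in, the `FigureDecoder` helper, and the object-key layout are all assumptions made for illustration. Only `java.util.Base64` is a real API here.

```java
import java.util.Base64;
import java.util.HashMap;
import java.util.Map;

// Hypothetical signature: the real FigureStorageService methods are not shown in this document.
interface FigureStorageService {
    String store(String bookId, String figureId, byte[] pngBytes);
}

// In-memory stand-in for the S3-backed implementation, so the sketch runs without a bucket.
final class InMemoryFigureStorage implements FigureStorageService {
    final Map<String, byte[]> objects = new HashMap<>();

    @Override
    public String store(String bookId, String figureId, byte[] pngBytes) {
        // Assumed key layout; the value returned here is what would go into figure.image_path.
        String key = "figures/" + bookId + "/" + figureId + ".png";
        objects.put(key, pngBytes);
        return key;
    }
}

// Hypothetical helper showing the only step that changes: base64 decode instead of a PDFBox crop.
final class FigureDecoder {
    private FigureDecoder() {}

    static String decodeAndStore(FigureStorageService storage, String bookId,
                                 String figureId, String base64Png) {
        byte[] pngBytes = Base64.getDecoder().decode(base64Png);
        return storage.store(bookId, figureId, pngBytes);
    }
}
```

Because callers depend only on the interface, replacing the in-memory stand-in with an S3 client would not touch `FigureDecoder`.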

---

@@ -123,7 +134,8 @@ boundary satisfies Constitution Principle II (Easy to Change).
**Decision**: Use the enum `FigureType { ANATOMICAL_DIAGRAM, SURGICAL_PHOTOGRAPH, MRI_CT_SCAN,
TABLE, CHART, INTRAOPERATIVE_IMAGE }`. Classification is derived from:
1. Caption keywords ("MRI", "CT", "Fig.", "Table") — heuristic, no model needed
2. Fall back to `ANATOMICAL_DIAGRAM` if unclassifiable
2. Marker `block_type` hint (`"Table"` → TABLE, `"Figure"` / `"Picture"` → ANATOMICAL_DIAGRAM default)
3. Fall back to `ANATOMICAL_DIAGRAM` if unclassifiable

**Rationale**: Allows the frontend to render different icon/label per type (e.g., "MRI" badge).
Heuristic classification avoids a separate model call per image at extraction time.
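
A minimal sketch of the three-step classification order above, for illustration only. The enum values come from the decision text. The `FigureTypeClassifier` class, its `classify` method, and the exact keyword patterns are assumptions, and only the `MRI_CT_SCAN` and `TABLE` keywords are wired up here.

```java
import java.util.Locale;
import java.util.regex.Pattern;

// Enum values are copied from the decision above.
enum FigureType {
    ANATOMICAL_DIAGRAM, SURGICAL_PHOTOGRAPH, MRI_CT_SCAN, TABLE, CHART, INTRAOPERATIVE_IMAGE
}

// Hypothetical helper: class name, method name, and keyword patterns are illustrative assumptions.
final class FigureTypeClassifier {
    private static final Pattern SCAN = Pattern.compile("\\b(mri|ct)\\b");
    private static final Pattern TABLE_CAPTION = Pattern.compile("^table\\b");

    private FigureTypeClassifier() {}

    static FigureType classify(String caption, String markerBlockType) {
        String text = caption == null ? "" : caption.trim().toLowerCase(Locale.ROOT);
        // 1. Caption keywords, matched as whole words so "cross-section" is not read as "CT".
        if (SCAN.matcher(text).find()) return FigureType.MRI_CT_SCAN;
        if (TABLE_CAPTION.matcher(text).find()) return FigureType.TABLE;
        // 2. Marker block_type hint.
        if ("Table".equals(markerBlockType)) return FigureType.TABLE;
        // 3. Fallback, which is also the default for "Figure" and "Picture" blocks.
        return FigureType.ANATOMICAL_DIAGRAM;
    }
}
```

Whole-word matching is a deliberate choice in this sketch: a plain substring test for "ct" would misclassify any caption containing "section".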
@@ -175,14 +187,225 @@ the process fails mid-way. An explicit, idempotent trigger is safer and more obs

## Decision 10: Minimum Image Size Threshold

**Decision**: Images smaller than 100×100 pixels are discarded and no chunk is created. This
threshold filters out decorative elements (bullets, dividers, publisher logos) without a
classification model.
**Decision**: Images smaller than 100×100 pixels are discarded and no chunk is created. Marker
returns PNG bytes; `FigureExtractionService` decodes to `BufferedImage` solely to check
dimensions. This threshold filters out decorative elements without a classification model.

**Rationale**: Neurosurgery textbook diagrams and MRI scans are never smaller than 100×100 px.
The threshold is configurable via `app.figure-storage.min-image-size-px` in
`application.properties`.
The threshold is configurable via `app.figure-storage.min-image-size-px`.

**Alternatives considered**:
- No threshold → decorative icons pollute the figure index
- ML-based classification → accurate but adds model dependency; not needed at POC scale
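
The size check in this decision can be sketched with the standard `ImageIO` API. The `MinImageSizeFilter` class is a hypothetical name. The sketch assumes an image is kept only when both its width and its height reach the threshold, which the decision text does not spell out, and it treats undecodable bytes as rejected.

```java
import java.awt.image.BufferedImage;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import javax.imageio.ImageIO;

// Hypothetical helper; the threshold would come from app.figure-storage.min-image-size-px.
final class MinImageSizeFilter {
    private final int minSizePx;

    MinImageSizeFilter(int minSizePx) {
        this.minSizePx = minSizePx;
    }

    // Returns true when the image should be kept.
    boolean accepts(byte[] imageBytes) {
        try {
            // Decoded solely to read the dimensions; the original bytes are what gets stored.
            BufferedImage image = ImageIO.read(new ByteArrayInputStream(imageBytes));
            if (image == null) {
                return false; // ImageIO found no reader for these bytes
            }
            // Assumption: both sides must reach the threshold.
            return image.getWidth() >= minSizePx && image.getHeight() >= minSizePx;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With a threshold of 100, a 120×150 figure would pass and a 300×40 divider strip would be dropped under this reading of the rule.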

---

# Marker Study — Why Marker Replaces Google Document AI

*Added 2026-04-04.*

## What Marker Offers

Marker is an open-source, locally-runnable PDF-to-structured-content converter that uses a
pipeline of deep-learning models (surya for OCR + layout detection, texify for equations).
Key capabilities relevant to this project:

| Capability | Marker | Google Document AI |
|-----------|--------|--------------------|
| Multi-column reading order | ✅ | ✅ |
| OCR on scanned pages | ✅ | ✅ |
| Figure detection | ✅ returns pre-cropped images | ⚠️ returns bbox only; PDFBox still needed |
| Table extraction | ✅ HTML tables | ✅ |
| JSON output with image bytes | ✅ base64 in `images` map | ❌ |
| No cloud credentials | ✅ | ❌ GCP service account required |
| No per-page billing | ✅ | ❌ ~$10/1,000 pages |
| Batch size limits | None (local) | 15 pages / 20 MB per sync call |
| Setup | `pip install marker-pdf && marker_server` | GCP project + processor + IAM |

---

## Does Marker Solve the Current Pain Points?

### Pain Point 1: Naive 50/50 Column Split

**Answer: Yes, Marker fixes this completely.**

`PdfStructureParser.extractPageText()` splits pages at the horizontal midpoint with a 20%
threshold. This fails on asymmetric columns and scanned pages. Marker's surya layout model
returns blocks in natural reading order — no heuristic needed.

### Pain Point 2: Figure Detection Misses Rasterized Figures

**Answer: Yes, Marker fixes this for most cases.**

`FigureExtractionService` previously iterated PDF XObjects (only finds embedded XObject images,
misses rasterized figures and vector-path drawings). Marker's layout model detects visual
elements by type and returns the cropped image bytes directly — no PDFBox page rendering needed.

### Pain Point 3: OCR on Scanned Pages

**Answer: Yes, Marker handles scanned pages transparently via surya OCR.**

### Pain Point 4: Caption Detection

**Answer: Improved — Marker groups caption blocks with their figure block.**

The `block_type = "Caption"` block appears as a sibling or child adjacent to the `"Figure"`
block in the Marker JSON, making caption association structural rather than regex-based.
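
To illustrate what structural caption association could mean in practice, here is a small sketch that pairs each `Figure` block with the `Caption` block immediately following it in the same list of siblings. `MarkerBlock`, `CaptionedFigure`, and `CaptionPairing` are hypothetical types, and the sketch covers only the following-sibling case, not captions nested as children or placed before the figure.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified view of one block from the Marker JSON tree.
final class MarkerBlock {
    final String blockType;
    final String id;
    final String html;

    MarkerBlock(String blockType, String id, String html) {
        this.blockType = blockType;
        this.id = id;
        this.html = html;
    }
}

// Hypothetical result type: a figure id plus the caption HTML, or null when none was found.
final class CaptionedFigure {
    final String figureId;
    final String captionHtml;

    CaptionedFigure(String figureId, String captionHtml) {
        this.figureId = figureId;
        this.captionHtml = captionHtml;
    }
}

final class CaptionPairing {
    private CaptionPairing() {}

    // Assumption: a caption belongs to the Figure block directly before it in sibling order.
    static List<CaptionedFigure> pair(List<MarkerBlock> siblings) {
        List<CaptionedFigure> result = new ArrayList<>();
        for (int i = 0; i < siblings.size(); i++) {
            MarkerBlock block = siblings.get(i);
            if (!"Figure".equals(block.blockType)) {
                continue;
            }
            String caption = null;
            if (i + 1 < siblings.size() && "Caption".equals(siblings.get(i + 1).blockType)) {
                caption = siblings.get(i + 1).html;
            }
            result.add(new CaptionedFigure(block.id, caption));
        }
        return result;
    }
}
```

No regular expression over the page text is involved: the pairing relies only on block types and their order.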

---

## Marker API Integration

### Local Server Setup

```bash
pip install marker-pdf
marker_server --port 8000
```

The server exposes `POST /marker/upload` (the user's configured endpoint).

### Request

```
POST http://localhost:8000/marker/upload
Content-Type: multipart/form-data

file=@document.pdf
output_format=json
```

### Response (abbreviated)

```json
{
  "output_format": "json",
  "output": {
    "block_type": "Document",
    "children": [
      {
        "block_type": "Page",
        "id": "/page/0/Page/0",
        "children": [
          {
            "block_type": "SectionHeader",
            "id": "/page/0/SectionHeader/0",
            "html": "<h1>Cavernous Sinus Anatomy</h1>"
          },
          {
            "block_type": "Text",
            "id": "/page/0/Text/1",
            "html": "<p>The cavernous sinus contains...</p>"
          },
          {
            "block_type": "Figure",
            "id": "/page/0/Figure/2",
            "html": "<figure><img src='/page/0/Figure/2'/></figure>",
            "images": {
              "/page/0/Figure/2": "iVBORw0KGgo..."
            }
          },
          {
            "block_type": "Caption",
            "id": "/page/0/Caption/3",
            "html": "<p>Fig. 12-4. Coronal cross-section...</p>"
          }
        ]
      }
    ],
    "metadata": { "page_stats": [...] }
  }
}
```
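
### Java Integration Pattern

```java
import org.springframework.core.io.FileSystemResource;
import org.springframework.http.MediaType;
import org.springframework.util.LinkedMultiValueMap;
import org.springframework.util.MultiValueMap;
// JsonNode is Jackson's tree model; its package depends on the Jackson version on the classpath.

// MarkerPageParser — core call
// restClient, baseUrl, and pdfPath are fields/arguments of the surrounding class.
MultiValueMap<String, Object> body = new LinkedMultiValueMap<>();
body.add("file", new FileSystemResource(pdfPath));
body.add("output_format", "json");

JsonNode response = restClient.post()
        .uri(baseUrl + "/marker/upload")
        .contentType(MediaType.MULTIPART_FORM_DATA)
        .body(body)
        .retrieve()
        .body(JsonNode.class);

// The parsed document tree sits under the top-level "output" key.
JsonNode document = response.get("output");
```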

### Mapping Marker Blocks to PageResult

```
Page block (id "/page/N/Page/M")   → PageResult(pageNumber = N+1)
SectionHeader children             → headingTitle (first match)
Text, TextInlineMath children      → orderedText (HTML stripped, joined \n\n)
Figure children with images map    → FigureData(imageBytes = base64decode(images[id]))
Caption sibling of Figure          → FigureData.nearestCaption
```
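
Three of the mapping rules above (page number from the block id, HTML stripping, and joining text blocks with a blank line) can be sketched with the standard library alone. `MarkerMapping` and its method names are hypothetical. The regex-based tag stripping is a deliberately naive assumption that does not decode HTML entities, so a real parser might prefer an HTML library.

```java
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical helper illustrating the mapping table; not the actual MarkerPageParser.
final class MarkerMapping {
    private static final Pattern PAGE_ID = Pattern.compile("^/page/(\\d+)/");
    private static final Pattern TAG = Pattern.compile("<[^>]+>");

    private MarkerMapping() {}

    // "/page/N/..." uses a zero-based N; PageResult.pageNumber is N + 1.
    static int pageNumber(String blockId) {
        Matcher matcher = PAGE_ID.matcher(blockId);
        if (!matcher.find()) {
            throw new IllegalArgumentException("Unexpected block id: " + blockId);
        }
        return Integer.parseInt(matcher.group(1)) + 1;
    }

    // Naive tag removal; entities such as &amp; are left untouched.
    static String stripHtml(String html) {
        return TAG.matcher(html).replaceAll("").trim();
    }

    // Joins the stripped text of each block with a blank line, skipping empty blocks.
    static String orderedText(List<String> htmlBlocks) {
        return htmlBlocks.stream()
                .map(MarkerMapping::stripHtml)
                .filter(text -> !text.isEmpty())
                .collect(Collectors.joining("\n\n"));
    }
}
```

For the abbreviated response shown earlier, `pageNumber("/page/0/Page/0")` yields 1 and `stripHtml("<h1>Cavernous Sinus Anatomy</h1>")` yields the bare heading text.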

---

## Architecture Change

```
Before (Document AI — removed):
  DocumentAiPageParser
    → Google Document AI API (GCP, 15-page batches, credentials)
    → returns text blocks + figure bboxes
  PdfStructureParser (PDFBox column heuristic)
  FigureExtractionService
    → renders page via PDFBox at 150 DPI
    → crops bbox region

After (Marker):
  MarkerPageParser
    → POST PDF to http://localhost:8000/marker/upload (output_format=json)
    → returns text blocks (correct reading order) + Figure blocks with base64 images
    → produces List<PageResult> (same DTO, FigureData carries bytes not bbox)
  FigureExtractionService (simplified)
    → base64-decodes image bytes from PageResult.FigureData
    → checks min size (ImageIO.read → getWidth/getHeight)
    → saves to S3 via FigureStorageService (UNCHANGED)
  VisionDescriptionService (UNCHANGED)
  BookEmbeddingService orchestration (MINOR: inject MarkerPageParser)
```

**What is removed**:
- `DocumentAiPageParser` — replaced by `MarkerPageParser`
- `DocumentAiConfig` — replaced by `MarkerConfig`
- `PdfStructureParser` — Marker handles reading order
- `google-cloud-document-ai` Maven dependency
- `app.document-ai.*` configuration properties

**What stays the same**:
- `PageResult` DTO structure (fields renamed, not restructured)
- `FigureExtractionService` public interface
- `TextChunkingService`, `VisionDescriptionService`, `BookEmbeddingService` orchestration
- All JPA entities, repositories, vector store, S3 storage

---

## Constitution Compliance

| Principle | Assessment |
|-----------|------------|
| **I. KISS** | ✅ Simpler than Document AI — one HTTP call replaces GCP SDK + PDFBox render loop. No new dependency beyond an HTTP client (Spring RestClient, already available). |
| **II. Easy to Change** | ✅ `MarkerPageParser` is the only Marker-aware class. Swap it to use any other parser. `PageResult` DTO unchanged in contract. |
| **III. Web-First** | ✅ Internal pipeline change; no API contract change. |
| **IV. Documentation** | ✅ README must show Marker as a local external service dependency. |

---

## Risks & Mitigations

| Risk | Likelihood | Mitigation |
|------|-----------|------------|
| Marker server not running when book is uploaded | Medium | `BookEmbeddingService` catches exception from `MarkerPageParser`, marks book as `FAILED`, logs full error. |
| Marker misses some figures (complex PDFs) | Medium | `app.figure-storage.min-image-size-px` threshold can be tuned. Add fallback: if Marker returns 0 figures for a page with known images, log a warning. |
| SC-003 (≤ 3× processing time) violated | Low | Marker runs locally (no network latency to cloud). Benchmark with a real 500-page book early. |
| Large PDF upload to Marker (>100MB) | Low | Marker server handles the full file; no batching needed. Multipart upload limit configurable. |
| Marker image quality vs PDFBox crop | Low | Marker crops at native resolution; quality is equivalent or better than 150 DPI PDFBox render. |
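
The first mitigation in the table (catch the parser failure, mark the book as `FAILED`, keep the error) might be shaped like the sketch below. `BookStatus`, `ParseGuard`, and the `PROCESSING` value are hypothetical names, since only the `FAILED` status is mentioned in this document, and the `Supplier` stands in for the call into `MarkerPageParser`.

```java
import java.util.Optional;
import java.util.function.Supplier;

// Only FAILED is named in the document; PROCESSING is an assumed initial state.
enum BookStatus { PROCESSING, FAILED }

// Hypothetical wrapper around the parse step inside the embedding orchestration.
final class ParseGuard {
    BookStatus status = BookStatus.PROCESSING;
    String lastError;

    // Runs the parse call; on any runtime failure records the error and flips the status.
    <T> Optional<T> run(Supplier<T> parseCall) {
        try {
            return Optional.ofNullable(parseCall.get());
        } catch (RuntimeException e) {
            status = BookStatus.FAILED;
            lastError = e.toString(); // a real service would also log the full stack trace
            return Optional.empty();
        }
    }
}
```

Keeping the failure handling in one place means a stopped Marker server yields a visible `FAILED` book rather than a half-processed one.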