Adrien ea1276dc2e adding Marker to parse effectively pdf

2026-04-04 21:30:18 +02:00


Research: Enhanced Embedding with Image Parsing and Metadata

Branch: 002-image-aware-embedding | Date: 2026-04-04 (updated: Marker replaces Google Document AI)

This document resolves all technical unknowns identified during planning. Decisions 1–10 cover the core pipeline. The Marker Study section at the bottom explains why Marker was chosen over Google Document AI to drive PDF parsing and figure extraction.

Decision 1: Document Hierarchy Model

Decision: Adopt a four-level hierarchy — BookNode → ChapterNode → SectionNode → TextChunkNode + FigureNode. The SectionNode is the pivotal unit: it holds the full section text in Postgres and is used for parent-child context expansion at retrieval time.

Rationale: A flat page-per-document model (current implementation) loses structural context. When a user asks a multi-faceted clinical question, the LLM needs the surrounding section text, not just the matching fragment. Parent-child retrieval, where chunks point to their parent section, is a widely used pattern for improving RAG precision. The hierarchy also makes figure-to-section association explicit and queryable.

Alternatives considered:

Keep flat page model, add metadata only → rejected: insufficient for precise citation and context expansion
Chapter-level retrieval (coarser than section) → rejected: too much irrelevant context sent to LLM; cost and latency increase
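
The parent-child expansion described in this decision can be sketched as follows. This is a minimal illustration, not the project's implementation: `ParentSectionExpander`, `ChunkHit`, and the in-memory `sectionTextById` map (standing in for the Postgres lookup) are hypothetical names introduced here.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of parent-child context expansion; all names are illustrative.
final class ParentSectionExpander {

    // A retrieved chunk only needs to know which section it belongs to.
    record ChunkHit(String chunkId, String sectionId) {}

    // Returns the full text of each distinct parent section, in first-hit order.
    static List<String> expand(List<ChunkHit> hits, Map<String, String> sectionTextById) {
        Set<String> sectionIds = new LinkedHashSet<>();
        for (ChunkHit hit : hits) {
            sectionIds.add(hit.sectionId()); // several chunks may share one parent
        }
        List<String> sections = new ArrayList<>();
        for (String sectionId : sectionIds) {
            String text = sectionTextById.get(sectionId);
            if (text != null) {
                sections.add(text);
            }
        }
        return sections;
    }
}
```

Because several chunks often come from the same section, collapsing hits to distinct section ids keeps the prompt from carrying the same section text twice.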

Decision 2: Document Parsing Strategy

Decision: Use Marker (local HTTP server, http://localhost:8000/marker/upload) as the single entry point for PDF parsing. A single POST with output_format=json returns:

Reading-order text blocks (headings, paragraphs) — no column-split heuristic needed
Pre-cropped figure images as base64-encoded PNG in the images map of each Figure block
Table, equation, and code blocks as structured HTML

MarkerPageParser translates the Marker JSON response into List<PageResult>, which is the same internal DTO used by the rest of the pipeline.

Rationale: Marker handles column reordering, scanned-page OCR, and figure cropping in one call, eliminating the PDFBox column heuristic (PdfStructureParser) and the PDFBox render+crop loop in FigureExtractionService. Net result: fewer classes, no cloud dependency, no GCP credentials.

Alternatives considered:

PDFBox column heuristic (previous approach) → rejected: 50/50 split fails on asymmetric columns and scanned pages
Google Document AI Layout Parser → rejected: adds GCP credentials, per-page billing, 15-page batch limit, and still requires PDFBox to render+crop figure regions from bounding boxes. See Marker Study below for detailed comparison.
Screenshot each page + OCR → far slower; loses digital text quality

Decision 3: Figure Content Representation

Decision: Generate a textual description of each extracted image using the OpenAI vision model (GPT-4o). This description becomes the content field of the figure's vector store document. The figure caption (parsed from the surrounding text) is also included to maximise retrieval signal.

Rationale: Caption-only embedding would miss figures with no caption or with sparse labels. Vision-generated descriptions produce richer semantic content (anatomy terms, structural relationships) that matches clinical queries. The OpenAI client already in use supports image inputs; no additional dependency is required.

Alternatives considered:

Caption-only embedding → insufficient when captions are absent or terse (common in textbooks)
Local vision model (LLaVA) → requires self-hosting; out of scope for POC
OCR only → extracts text visible in image but misses non-text visual content (diagrams, MRI)

Decision 4: Dual Vector Search

Decision: At query time, run two parallel similarity searches:

Text chunk search (filtered by type = "TEXT" and book_id)
Figure search (filtered by type = "FIGURE" and book_id), run against the embedded figure content from Decision 3 (vision description plus caption)

Results are merged and deduplicated. The LLM prompt receives the expanded parent section text plus a structured figure reference list.

Rationale: A single search would rank text and figures against each other; figures with terse captions would systematically lose to text chunks. Separate searches with independent topK allow tuning each modality independently.

Alternatives considered:

Single search, filter by relevance score → figure captions score lower than text; figures are systematically under-retrieved
Post-process text results to look up linked figures only → misses figures that are relevant to the query but not explicitly referenced in the retrieved text chunks
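
The merge step can be sketched without any vector-store dependency. The snippet below assumes each search has already returned a ranked list; `DualSearchMerger`, `Hit`, and the two `topK` parameters are hypothetical names, and deduplication by document id is an assumption about what "deduplicated" means here.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: merge two independently ranked result lists.
final class DualSearchMerger {

    record Hit(String id, String type, double score) {}

    // Keeps at most topK hits from each ranked list, then merges them by id.
    // Text hits come first; the first occurrence of an id wins.
    static List<Hit> merge(List<Hit> textHits, int textTopK,
                           List<Hit> figureHits, int figureTopK) {
        Map<String, Hit> byId = new LinkedHashMap<>();
        for (Hit hit : textHits.subList(0, Math.min(textTopK, textHits.size()))) {
            byId.putIfAbsent(hit.id(), hit);
        }
        for (Hit hit : figureHits.subList(0, Math.min(figureTopK, figureHits.size()))) {
            byId.putIfAbsent(hit.id(), hit);
        }
        return new ArrayList<>(byId.values());
    }
}
```

Figure hits are never compared against text scores in this sketch, which is the point of running the two searches separately.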

Decision 5: Chunk-to-Figure Linking

Decision: During text parsing, whenever a chunk contains a match for Fig\.\s+\d+[\-.]\d+ or Figure\s+\d+[\-.]\d+ (for example "Fig. 12-4" or "Figure 12.4"), insert a row into the chunk_figure_refs table linking chunkId → figureId. The dot after "Fig" is escaped so that it matches a literal period rather than any character. At retrieval time, after text chunks are retrieved, their associated figures are fetched from this table and added to the LLM prompt.

Rationale: Explicit linking ensures that when a text chunk is retrieved, its referenced figures are always surfaced — even if the figure's caption did not score highly in the vector search. This is the higher-recall path; dual search (Decision 4) is the higher-precision path.

Alternatives considered:

Rely entirely on dual vector search → may miss figures referenced in retrieved text but scoring below the topK threshold in the figure search
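
A minimal sketch of the reference extraction, using a single pattern that covers both the "Fig." and "Figure" forms, with the period after "Fig" escaped so it matches literally. `FigureReferenceExtractor` is a hypothetical name, and resolving a label such as "12-4" to a figureId is left out because that lookup is not described in this document.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: collect figure labels referenced in a text chunk.
final class FigureReferenceExtractor {

    // "Fig." or "Figure", whitespace, then a label such as 12-4 or 12.4.
    private static final Pattern FIGURE_REF =
            Pattern.compile("\\b(?:Fig\\.|Figure)\\s+(\\d+[-.]\\d+)");

    // Returns distinct labels in order of first appearance.
    static List<String> extractLabels(String chunkText) {
        List<String> labels = new ArrayList<>();
        Matcher matcher = FIGURE_REF.matcher(chunkText);
        while (matcher.find()) {
            String label = matcher.group(1);
            if (!labels.contains(label)) {
                labels.add(label);
            }
        }
        return labels;
    }
}
```

Each returned label would then be mapped to a figure row and written to chunk_figure_refs alongside the chunk id.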

Decision 6: Image Storage

Decision: Marker returns figure images as base64-encoded PNG bytes in the JSON response. FigureExtractionService decodes these bytes and passes them to FigureStorageService, which persists them to an S3-compatible bucket (${app.figure-storage.bucket}). The image path/URL is stored in figure.image_path in Postgres.

The FigureStorageService interface is unchanged; only the caller changes (from PDFBox crop to base64 decode).

Rationale: Marker's pre-cropped images remove the need for PDFBox rendering. FigureStorageService interface boundary satisfies Constitution Principle II (Easy to Change).

Alternatives considered:

Store base64 in Postgres JSONB → bloats DB; complicates backup; query performance degrades
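
The decode-then-store path can be sketched as below. The real FigureStorageService interface is not shown in this document, so the `store` signature, the `InMemoryStorage` stand-in, the `s3://figures/` prefix, and the object-key layout are all assumptions made for illustration.

```java
import java.util.Base64;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the base64 -> storage path; signatures are assumed.
final class FigureStorageSketch {

    // Assumed shape of the storage boundary.
    interface FigureStorageService {
        String store(String objectKey, byte[] pngBytes); // returns the stored path/URL
    }

    // In-memory stand-in for the S3-compatible bucket.
    static final class InMemoryStorage implements FigureStorageService {
        final Map<String, byte[]> objects = new HashMap<>();

        @Override
        public String store(String objectKey, byte[] pngBytes) {
            objects.put(objectKey, pngBytes);
            return "s3://figures/" + objectKey; // bucket name assumed
        }
    }

    // Decodes Marker's base64 payload and hands the raw bytes to storage.
    static String persist(FigureStorageService storage, String bookId,
                          String figureId, String base64Png) {
        byte[] pngBytes = Base64.getDecoder().decode(base64Png);
        String objectKey = bookId + "/" + figureId + ".png"; // key layout assumed
        return storage.store(objectKey, pngBytes);
    }
}
```

The returned path is what would be written to figure.image_path; only the caller knows the bytes came from base64 rather than a PDFBox crop.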

Decision 7: Figure Type Classification

Decision: Use the enum FigureType { ANATOMICAL_DIAGRAM, SURGICAL_PHOTOGRAPH, MRI_CT_SCAN, TABLE, CHART, INTRAOPERATIVE_IMAGE }. Classification is derived from:

Caption keywords ("MRI", "CT", "Fig.", "Table") — heuristic, no model needed
Marker block_type hint ("Table" → TABLE, "Figure" / "Picture" → ANATOMICAL_DIAGRAM default)
Fall back to ANATOMICAL_DIAGRAM if unclassifiable

Rationale: Allows the frontend to render different icon/label per type (e.g., "MRI" badge). Heuristic classification avoids a separate model call per image at extraction time.

Alternatives considered:

Vision model classification → accurate but adds latency and cost per figure; deferrable
Single FIGURE type → loses citation granularity required by spec FR-004
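
A minimal sketch of the heuristic, assuming caption keywords are checked before the Marker block_type hint. The rule order, the case-sensitive whole-word match for "MRI"/"CT", and the requirement that "Table" start the caption are assumptions; `FigureTypeClassifier` is a hypothetical name. Only the types for which this document names a rule are produced.

```java
import java.util.regex.Pattern;

// Hypothetical sketch of caption/block-type based figure classification.
final class FigureTypeClassifier {

    enum FigureType {
        ANATOMICAL_DIAGRAM, SURGICAL_PHOTOGRAPH, MRI_CT_SCAN, TABLE, CHART, INTRAOPERATIVE_IMAGE
    }

    // Whole-word, case-sensitive so that "section" does not match "CT".
    private static final Pattern SCAN = Pattern.compile("\\b(MRI|CT)\\b");
    // Caption that begins with "Table" (assumed convention).
    private static final Pattern TABLE_CAPTION = Pattern.compile("^\\s*Table\\b");

    static FigureType classify(String caption, String markerBlockType) {
        String text = caption == null ? "" : caption;
        if (SCAN.matcher(text).find()) {
            return FigureType.MRI_CT_SCAN;
        }
        if (TABLE_CAPTION.matcher(text).find() || "Table".equals(markerBlockType)) {
            return FigureType.TABLE;
        }
        // "Figure" / "Picture" blocks and anything unclassifiable fall back here.
        return FigureType.ANATOMICAL_DIAGRAM;
    }
}
```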

Decision 8: Metadata Schema for Vector Store Documents

Decision: All vector store documents carry a flat Map<String, Object> metadata for Spring AI filtering. Schema:

Field	Text Chunk	Figure Chunk
`type`	`"TEXT"`	`"FIGURE"`
`book_id`	✓	✓
`book_title`	✓	✓
`chapter_id`	✓	✓
`section_id`	✓	✓
`section_title`	✓	✓
`page_start`	✓	—
`page_end`	✓	—
`chunk_index`	✓	—
`total_chunks`	✓	—
`figure_id`	—	✓
`figure_type`	—	✓
`image_path`	—	✓
`label`	—	✓
`page`	—	✓

Rationale: Flat map is required by Spring AI FilterExpressionBuilder. Separation by type allows independent filtering in dual search.
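
The schema above can be sketched as two small factory methods that produce the flat maps. `ChunkMetadata` and its method names are hypothetical; the keys are taken from the table, and the value types (strings for ids, integers for pages and indexes) are assumptions.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: build the flat metadata maps for both document types.
final class ChunkMetadata {

    // Fields shared by text and figure documents.
    private static Map<String, Object> common(String type, String bookId, String bookTitle,
                                              String chapterId, String sectionId, String sectionTitle) {
        Map<String, Object> m = new LinkedHashMap<>();
        m.put("type", type);
        m.put("book_id", bookId);
        m.put("book_title", bookTitle);
        m.put("chapter_id", chapterId);
        m.put("section_id", sectionId);
        m.put("section_title", sectionTitle);
        return m;
    }

    static Map<String, Object> forTextChunk(String bookId, String bookTitle, String chapterId,
                                            String sectionId, String sectionTitle,
                                            int pageStart, int pageEnd, int chunkIndex, int totalChunks) {
        Map<String, Object> m = common("TEXT", bookId, bookTitle, chapterId, sectionId, sectionTitle);
        m.put("page_start", pageStart);
        m.put("page_end", pageEnd);
        m.put("chunk_index", chunkIndex);
        m.put("total_chunks", totalChunks);
        return m;
    }

    static Map<String, Object> forFigure(String bookId, String bookTitle, String chapterId,
                                         String sectionId, String sectionTitle, String figureId,
                                         String figureType, String imagePath, String label, int page) {
        Map<String, Object> m = common("FIGURE", bookId, bookTitle, chapterId, sectionId, sectionTitle);
        m.put("figure_id", figureId);
        m.put("figure_type", figureType);
        m.put("image_path", imagePath);
        m.put("label", label);
        m.put("page", page);
        return m;
    }
}
```

Keeping every value a scalar is what lets the type and book_id filters from Decision 4 apply to either document kind.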

Decision 9: Re-embedding Existing Books

Decision: Books already processed under feature 001 (text-only) are NOT automatically re-embedded. An explicit re-embed action is exposed via POST /api/v1/books/{id}/reembed (admin-triggered). The existing chunks remain valid for text queries until re-embedding completes.

Rationale: Automatic re-embedding on deploy would block the system and risk data loss if the process fails mid-way. An explicit, idempotent trigger is safer and more observable.
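
One way to make the trigger idempotent is to refuse a second start while a run for the same book is in progress. The sketch below is an assumption about how that could look, not the project's implementation: `ReembedTracker`, its `Status` values, and the choice to allow a fresh run after completion or failure are all hypothetical.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical sketch: per-book guard so repeated triggers do not start duplicate runs.
final class ReembedTracker {

    enum Status { RUNNING, COMPLETED, FAILED }

    private final ConcurrentMap<String, Status> jobs = new ConcurrentHashMap<>();

    // Returns true if this call started a run, false if one is already running.
    boolean tryStart(String bookId) {
        boolean[] started = {false};
        // compute() runs atomically per key, so two concurrent callers cannot both start.
        jobs.compute(bookId, (id, current) -> {
            if (current == Status.RUNNING) {
                return current;
            }
            started[0] = true;
            return Status.RUNNING;
        });
        return started[0];
    }

    void finish(String bookId, boolean success) {
        jobs.put(bookId, success ? Status.COMPLETED : Status.FAILED);
    }

    Status status(String bookId) {
        return jobs.get(bookId);
    }
}
```

A controller for POST /api/v1/books/{id}/reembed could call tryStart and return the current status when it yields false, which also gives the admin something to observe.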

Decision 10: Minimum Image Size Threshold

Decision: Images smaller than 100×100 pixels are discarded and no chunk is created. Marker returns PNG bytes; FigureExtractionService decodes to BufferedImage solely to check dimensions. This threshold filters out decorative elements without a classification model.

Rationale: Diagrams and MRI scans in the neurosurgery textbooks targeted here are expected to be larger than 100×100 px, so the threshold should not drop clinically relevant figures. The threshold is configurable via app.figure-storage.min-image-size-px.

Alternatives considered:

No threshold → decorative icons pollute the figure index
ML-based classification → accurate but adds model dependency; not needed at POC scale
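
The dimension check can be sketched with the standard ImageIO API. `MinImageSizeFilter` is a hypothetical name, and treating an image as too small when either its width or its height is below the threshold is an assumption about how the single min-image-size-px value is applied.

```java
import java.awt.image.BufferedImage;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import javax.imageio.ImageIO;

// Hypothetical sketch: decode PNG bytes only to read their dimensions.
final class MinImageSizeFilter {

    // True when the bytes decode as an image and both sides reach minSizePx.
    static boolean isLargeEnough(byte[] pngBytes, int minSizePx) {
        try {
            BufferedImage image = ImageIO.read(new ByteArrayInputStream(pngBytes));
            if (image == null) {
                return false; // ImageIO found no reader for these bytes
            }
            return image.getWidth() >= minSizePx && image.getHeight() >= minSizePx;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Undecodable payloads are rejected rather than raised, so a single corrupt figure would not abort the whole book in this sketch.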

Marker Study — Why Marker Replaces Google Document AI

Added 2026-04-04.

What Marker Offers

Marker is an open-source, locally-runnable PDF-to-structured-content converter that uses a pipeline of deep-learning models (surya for OCR + layout detection, texify for equations). Key capabilities relevant to this project:

Capability	Marker	Google Document AI
Multi-column reading order	✅	✅
OCR on scanned pages	✅	✅
Figure detection	✅ returns pre-cropped images	⚠️ returns bbox only; PDFBox still needed
Table extraction	✅ HTML tables	✅
JSON output with image bytes	✅ base64 in `images` map	❌
No cloud credentials	✅	❌ GCP service account required
No per-page billing	✅	❌ ~$10/1,000 pages
Batch size limits	None (local)	15 pages / 20 MB per sync call
Setup	`pip install marker-pdf && marker_server`	GCP project + processor + IAM

Does Marker Solve the Current Pain Points?

Pain Point 1: Naive 50/50 Column Split

Answer: Yes, Marker fixes this completely.

PdfStructureParser.extractPageText() splits pages at the horizontal midpoint with a 20% threshold. This fails on asymmetric columns and scanned pages. Marker's surya layout model returns blocks in natural reading order — no heuristic needed.

Pain Point 2: Figure Detection Misses Rasterized Figures

Answer: Yes, Marker fixes this for most cases.

FigureExtractionService previously iterated PDF XObjects, which finds only embedded XObject images and misses figures that exist only as part of a scanned page raster or as vector-path drawings. Marker's layout model detects visual elements by type and returns the cropped image bytes directly, so no PDFBox page rendering is needed.

Pain Point 3: OCR on Scanned Pages

Answer: Yes, Marker handles scanned pages transparently via surya OCR.

Pain Point 4: Caption Detection

Answer: Improved — Marker groups caption blocks with their figure block.

The block_type = "Caption" block appears as a sibling or child adjacent to the "Figure" block in the Marker JSON, making caption association structural rather than regex-based.

Marker API Integration

Local Server Setup

pip install marker-pdf
marker_server --port 8000

The server exposes POST /marker/upload, which is the endpoint this project is configured to call.

Request

POST http://localhost:8000/marker/upload
Content-Type: multipart/form-data

file=@document.pdf
output_format=json

Response (abbreviated)

{
  "output_format": "json",
  "output": {
    "block_type": "Document",
    "children": [
      {
        "block_type": "Page",
        "id": "/page/0/Page/0",
        "children": [
          {
            "block_type": "SectionHeader",
            "id": "/page/0/SectionHeader/0",
            "html": "<h1>Cavernous Sinus Anatomy</h1>"
          },
          {
            "block_type": "Text",
            "id": "/page/0/Text/1",
            "html": "<p>The cavernous sinus contains...</p>"
          },
          {
            "block_type": "Figure",
            "id": "/page/0/Figure/2",
            "html": "<figure><img src='/page/0/Figure/2'/></figure>",
            "images": {
              "/page/0/Figure/2": "iVBORw0KGgo..."
            }
          },
          {
            "block_type": "Caption",
            "id": "/page/0/Caption/3",
            "html": "<p>Fig. 12-4. Coronal cross-section...</p>"
          }
        ]
      }
    ],
    "metadata": { "page_stats": [...] }
  }
}

Java Integration Pattern

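import com.fasterxml.jackson.databind.JsonNode;
import org.springframework.core.io.FileSystemResource;
import org.springframework.http.MediaType;
import org.springframework.util.LinkedMultiValueMap;
import org.springframework.util.MultiValueMap;

// MarkerPageParser: core call. restClient is a Spring RestClient; baseUrl is e.g. http://localhost:8000
MultiValueMap<String, Object> body = new LinkedMultiValueMap<>();
body.add("file", new FileSystemResource(pdfPath));   // the PDF, sent as a multipart file part
body.add("output_format", "json");

JsonNode response = restClient.post()
    .uri(baseUrl + "/marker/upload")
    .contentType(MediaType.MULTIPART_FORM_DATA)
    .body(body)
    .retrieve()
    .body(JsonNode.class);

// Fail fast on an empty body or a missing "output" node instead of a later NullPointerException
if (response == null || !response.hasNonNull("output")) {
    throw new IllegalStateException("Marker response has no \"output\" node");
}
JsonNode document = response.get("output");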

Mapping Marker Blocks to PageResult

Page block (id "/page/N/Page/M") → PageResult(pageNumber = N+1)
  SectionHeader children           → headingTitle (first match)
  Text, TextInlineMath children    → orderedText (HTML stripped, joined \n\n)
  Figure children with images map  → FigureData(imageBytes = base64decode(images[id]))
  Caption sibling of Figure        → FigureData.nearestCaption
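
The mapping above can be sketched with plain records standing in for the parsed JSON and the pipeline DTOs. Everything here is illustrative: `MarkerBlockMapper`, the `Block` record, and the field lists of `PageResult` and `FigureData` are assumptions, the caption is taken only from the block immediately following a Figure, and the tag stripping is a crude regex rather than a real HTML parser.

```java
import java.util.ArrayList;
import java.util.Base64;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of the Marker block -> PageResult mapping.
final class MarkerBlockMapper {

    // Simplified stand-ins for the parsed Marker JSON and the pipeline DTOs.
    record Block(String blockType, String id, String html,
                 Map<String, String> images, List<Block> children) {}
    record FigureData(byte[] imageBytes, String nearestCaption) {}
    record PageResult(int pageNumber, String headingTitle,
                      String orderedText, List<FigureData> figures) {}

    private static final Pattern PAGE_ID = Pattern.compile("^/page/(\\d+)/");

    static PageResult toPageResult(Block page) {
        Matcher m = PAGE_ID.matcher(page.id());
        if (!m.find()) {
            throw new IllegalArgumentException("Unexpected page id: " + page.id());
        }
        int pageNumber = Integer.parseInt(m.group(1)) + 1; // ids are 0-based, pages 1-based

        String heading = null;
        List<String> paragraphs = new ArrayList<>();
        List<FigureData> figures = new ArrayList<>();
        List<Block> blocks = page.children();
        for (int i = 0; i < blocks.size(); i++) {
            Block b = blocks.get(i);
            switch (b.blockType()) {
                case "SectionHeader" -> {
                    if (heading == null) heading = stripHtml(b.html()); // first match only
                }
                case "Text", "TextInlineMath" -> paragraphs.add(stripHtml(b.html()));
                case "Figure" -> {
                    String base64 = b.images() == null ? null : b.images().get(b.id());
                    if (base64 != null) {
                        Block next = i + 1 < blocks.size() ? blocks.get(i + 1) : null;
                        String caption = next != null && "Caption".equals(next.blockType())
                                ? stripHtml(next.html()) : null;
                        figures.add(new FigureData(Base64.getDecoder().decode(base64), caption));
                    }
                }
                default -> { } // captions, tables, etc. are not mapped in this sketch
            }
        }
        return new PageResult(pageNumber, heading, String.join("\n\n", paragraphs), figures);
    }

    // Crude tag removal, sufficient for the simple <h1>/<p> wrappers shown above.
    static String stripHtml(String html) {
        return html.replaceAll("<[^>]+>", "").trim();
    }
}
```

A production mapper would also need to handle captions nested as children of the Figure block, which this sketch does not attempt.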

Architecture Change

Before (Document AI — removed):
  DocumentAiPageParser
      → Google Document AI API (GCP, 15-page batches, credentials)
      → returns text blocks + figure bboxes
  PdfStructureParser (PDFBox column heuristic)
  FigureExtractionService
      → renders page via PDFBox at 150 DPI
      → crops bbox region

After (Marker):
  MarkerPageParser
      → POST PDF to http://localhost:8000/marker/upload (output_format=json)
      → returns text blocks (correct reading order) + Figure blocks with base64 images
      → produces List<PageResult> (same DTO, FigureData carries bytes not bbox)
  FigureExtractionService (simplified)
      → base64-decodes image bytes from PageResult.FigureData
      → checks min size (ImageIO.read → getWidth/getHeight)
      → saves to S3 via FigureStorageService (UNCHANGED)
  VisionDescriptionService (UNCHANGED)
  BookEmbeddingService orchestration (MINOR: inject MarkerPageParser)

What is removed:

DocumentAiPageParser — replaced by MarkerPageParser
DocumentAiConfig — replaced by MarkerConfig
PdfStructureParser — Marker handles reading order
google-cloud-document-ai Maven dependency
app.document-ai.* configuration properties

What stays the same:

PageResult DTO structure (fields renamed, not restructured; FigureData now carries image bytes instead of a bounding box)
FigureExtractionService public interface
TextChunkingService, VisionDescriptionService, BookEmbeddingService orchestration
All JPA entities, repositories, vector store, S3 storage

Constitution Compliance

Principle	Assessment
I. KISS	✅ Simpler than Document AI — one HTTP call replaces GCP SDK + PDFBox render loop. No new dependency beyond an HTTP client (Spring RestClient, already available).
II. Easy to Change	✅ `MarkerPageParser` is the only Marker-aware class. Swap it to use any other parser. `PageResult` DTO unchanged in contract.
III. Web-First	✅ Internal pipeline change; no API contract change.
IV. Documentation	✅ README must show Marker as a local external service dependency.

Risks & Mitigations

Risk	Likelihood	Mitigation
Marker server not running when book is uploaded	Medium	`BookEmbeddingService` catches exception from `MarkerPageParser`, marks book as `FAILED`, logs full error.
Marker misses some figures (complex PDFs)	Medium	`app.figure-storage.min-image-size-px` threshold can be tuned. Add fallback: if Marker returns 0 figures for a page with known images, log a warning.
SC-003 (≤ 3× processing time) violated	Low	Marker runs locally (no network latency to cloud). Benchmark with a real 500-page book early.
Large PDF upload to Marker (>100MB)	Low	Marker server handles the full file; no batching needed. Multipart upload limit configurable.
Marker image quality vs PDFBox crop	Low	Marker crops at native resolution; quality is equivalent or better than 150 DPI PDFBox render.
