adding Marker to parse effectively pdf

2026-04-04 21:30:18 +02:00
parent b154e29f2d
commit ea1276dc2e
25 changed files with 2318 additions and 285 deletions
@@ -0,0 +1,79 @@
+# Internal Contract: DocumentAiPageParser → FigureExtractionService
+
+**Branch**: `002-image-aware-embedding` | **Date**: 2026-04-04  
+**Type**: Internal Java DTO (not an HTTP contract)
+
+---
+
+## Purpose
+
+`PageResult` is the internal data transfer object produced by `DocumentAiPageParser` for each
+PDF page. It decouples the Google Document AI SDK types from the rest of the pipeline so that
+`PdfStructureParser` can be replaced without cascading changes.
+
+---
+
+## Java Record
+
+```java
+package com.aiteacher.document;
+
+import java.util.List;
+
+/**
+ * Internal DTO produced by DocumentAiPageParser for one PDF page.
+ * Decouples the Document AI SDK types from downstream services.
+ */
+public record PageResult(
+    int pageNumber,           // 1-based, matches Document.Page.getPageNumber()
+    String orderedText,       // full page text in correct reading order (blocks joined by \n\n)
+    String headingTitle,      // first HEADING block on page, or null
+    List<FigureBbox> figures  // detected figure regions (may be empty)
+) {
+
+    /**
+     * Normalized bounding box for a detected figure region.
+     * Coordinates are in the [0.0, 1.0] range relative to page dimensions.
+     */
+    public record FigureBbox(
+        float x,       // left edge (normalized)
+        float y,       // top edge (normalized)
+        float width,   // width (normalized)
+        float height,  // height (normalized)
+        String nearestCaption  // text of adjacent paragraph block, or null
+    ) {}
+}
+```
+
+---
+
+## Production Rules
+
+| Field | Rule |
+|-------|------|
+| `orderedText` | Concatenation of all `PARAGRAPH` and `HEADING_*` blocks, joined with `\n\n`. Tables are represented as tab-separated text. |
+| `headingTitle` | First block whose `blockType` is `HEADING_1` through `HEADING_6`. `null` if no heading detected. |
+| `figures` | One entry per `VisualElement` with `type == "figure"` and `confidence ≥ 0.5`. Sorted top-to-bottom by `y`. |
+| `nearestCaption` | The `PARAGRAPH` block immediately following the figure bbox (by Y coordinate). May be `null` if no paragraph follows within 10% of page height. |
+
+---
+
+## Mapping from Document AI Proto
+
+```
+Document.Page.Block         → orderedText (concatenated)
+Document.Page.Block (HEADING_*) → headingTitle (first match)
+Document.Page.VisualElement → FigureBbox
+  └─ layout.bounding_poly.normalized_vertices[0] → (x, y) top-left
+  └─ normalized_vertices[2] → (x+w, y+h) bottom-right
+```
+
+---
+
+## Consumers
+
+| Consumer | What It Uses |
+|----------|-------------|
+| `BookEmbeddingService` | `orderedText` → `SectionEntity.fullText`; `headingTitle` → `SectionEntity.title` |
+| `FigureExtractionService` | `figures` list → renders page via PDFBox, crops each bbox to `BufferedImage` |
+| `TextChunkingService` | Receives `SectionEntity` (indirectly uses `orderedText`) — **unchanged** |
@@ -0,0 +1,84 @@
+# Internal Contract: MarkerPageParser → FigureExtractionService / BookEmbeddingService
+
+**Branch**: `002-image-aware-embedding` | **Date**: 2026-04-04  
+**Type**: Internal Java DTO (not an HTTP contract)
+
+---
+
+## Purpose
+
+`PageResult` is the internal data transfer object produced by `MarkerPageParser` for each
+PDF page. It decouples the Marker HTTP API from the rest of the pipeline. Downstream consumers
+(`BookEmbeddingService`, `FigureExtractionService`, `TextChunkingService`) are unaware of
+Marker and depend only on this DTO.
+
+---
+
+## Java Record
+
+```java
+package com.aiteacher.document;
+
+import java.util.List;
+
+/**
+ * Internal DTO produced by MarkerPageParser for one PDF page.
+ * Decouples the Marker HTTP API from downstream services.
+ */
+public record PageResult(
+    int pageNumber,              // 1-based, derived from Marker page block index
+    String orderedText,          // full page text in correct reading order (blocks joined by \n\n)
+    String headingTitle,         // first SectionHeader block on page, or null
+    List<FigureData> figures     // extracted figure images (may be empty)
+) {
+
+    /**
+     * A figure extracted from the page.
+     * Image bytes are PNG data decoded from the Marker JSON `images` map.
+     */
+    public record FigureData(
+        byte[] imageBytes,       // PNG image data (base64-decoded from Marker response)
+        String nearestCaption,   // text of the adjacent Caption block, or null
+        String blockId           // Marker block ID (e.g. "/page/0/Figure/2") for traceability
+    ) {}
+}
+```
+
+---
+
+## Production Rules
+
+| Field | Rule |
+|-------|------|
+| `pageNumber` | 1-based index derived from the Marker page block's position in the `children` array (index + 1). |
+| `orderedText` | HTML-stripped text from all `Text`, `TextInlineMath`, `SectionHeader`, `ListItem`, and `Table` blocks, joined with `\n\n`. Marker already returns them in reading order. |
+| `headingTitle` | Plain text of the first `SectionHeader` block on the page. `null` if no heading detected. |
+| `figures` | One `FigureData` per `Figure` or `Picture` block that has a non-empty `images` entry. Blocks with no image data are skipped. |
+| `imageBytes` | Base64-decoded bytes from `block.images[blockId]`. Marker returns PNG. |
+| `nearestCaption` | Plain text of the first `Caption` block that is a sibling appearing immediately after the figure block. `null` if absent. |
+
+---
+
+## Mapping from Marker JSON
+
+```
+Marker JSON → PageResult
+
+Page block ("/page/N/Page/M")       → PageResult(pageNumber = N + 1)
+  SectionHeader child                → headingTitle (first match, HTML-stripped)
+  Text / TextInlineMath children    → orderedText (HTML-stripped, joined \n\n)
+  Figure / Picture child            → FigureData
+    images[blockId]                  → FigureData.imageBytes (base64-decoded)
+    next Caption sibling             → FigureData.nearestCaption (HTML-stripped)
+    blockId                          → FigureData.blockId
+```
+
+---
+
+## Consumers
+
+| Consumer | What It Uses |
+|----------|-------------|
+| `BookEmbeddingService` | `orderedText` → `SectionEntity.fullText`; `headingTitle` → `SectionEntity.title` |
+| `FigureExtractionService` | `figures` list → decodes `imageBytes`, checks min size, saves to S3 |
+| `TextChunkingService` | Receives `SectionEntity` (uses `orderedText` indirectly) — **unchanged** |
@@ -1,40 +1,42 @@
 # Implementation Plan: Enhanced Embedding with Image Parsing and Metadata

-**Branch**: `002-image-aware-embedding` | **Date**: 2026-04-03 | **Spec**: [spec.md](spec.md)  
+**Branch**: `002-image-aware-embedding` | **Date**: 2026-04-04 | **Spec**: [spec.md](spec.md)  
 **Input**: Feature specification from `/specs/002-image-aware-embedding/spec.md`

 ## Summary

-Enhance the book embedding pipeline to extract images from every PDF page, generate descriptive
-text for each image, and store all content (text chunks + figure captions) with rich, consistent
-metadata in the vector store. A new document hierarchy (Book → Chapter → Section → TextChunk +
-Figure) is introduced. Postgres holds the full-text sections and figure metadata; the vector
-store holds chunk and figure caption embeddings; the local file store holds extracted image files.
-At query time, both the text-chunk store and figure-caption store are searched in parallel and
-results are merged before being sent to the LLM.
+Enhance the PDF embedding pipeline to extract figures and generate AI descriptions for them,
+making image content semantically searchable alongside text. PDF parsing and figure extraction
+are delegated to a local **Marker** server (`http://localhost:8000/marker/upload`), which
+returns reading-order text and pre-cropped figure images (base64) in a single JSON response,
+eliminating the need for PDFBox column heuristics and figure bbox rendering.

 ## Technical Context

 **Language/Version**: Java 25 (backend), TypeScript / Node 20 (frontend)  
-**Primary Dependencies**: Spring Boot 4.0.5, Spring AI 2.0.0-M4, OpenAI API (embeddings + chat), PDFBox (via Spring AI PDF reader dependency)  
-**Storage**: PostgreSQL (JPA + Flyway), pgvector (Spring AI `VectorStore`), local file system (extracted images — `/uploads/figures/`)  
-**Testing**: Spring Boot Test, JUnit 5, Mockito  
-**Target Platform**: Linux server (Docker Compose)  
-**Project Type**: Web application — backend REST API + Vue 3 frontend  
-**Performance Goals**: Full book (up to 500 pages with images) processed in ≤ 30 minutes; query response unchanged from existing baseline  
-**Constraints**: No new deployable units; all changes within the existing `backend/` module; image storage on local disk (S3 migration is a future concern, behind an interface)  
-**Scale/Scope**: POC — <10 concurrent users; single shared book library
+**Primary Dependencies**: Spring Boot 4.0.5, Spring AI 2.0.0-M4, OpenAI API (embeddings +
+GPT-4o vision), PDFBox 3.0.3 (via `spring-ai-pdf-document-reader` — retained transitively,
+no longer used directly), Marker local HTTP API (`http://localhost:8000/marker/upload`)  
+**Storage**: PostgreSQL (JPA + Flyway), pgvector (Spring AI `VectorStore`), S3-compatible
+object store (figure images via `FigureStorageService`)  
+**Testing**: Maven / JUnit 5 (`spring-boot-starter-test`)  
+**Target Platform**: Linux server  
+**Project Type**: Web application (backend API + frontend client)  
+**Performance Goals**: SC-003 — book processing time ≤ 3× text-only for ≤ 500 pages  
+**Constraints**: REST API only (Constitution III); Marker server must be running locally;
+S3-compatible storage configured via env vars  
+**Scale/Scope**: POC — handful of books, <10 users

 ## Constitution Check

-*GATE: Must pass before Phase 0 research. Re-check after Phase 1 design.*
+*GATE: Must pass before Phase 0 research. Re-checked after Phase 1 design.*

 | Principle | Status | Notes |
 |-----------|--------|-------|
-| I — KISS | ⚠️ Justified violation — see Complexity Tracking | Hierarchical model + dual search adds complexity; justified by precision requirement |
-| II — Easy to Change | ✅ | Figure storage wrapped behind `FigureStorageService` interface; can swap local disk for S3 |
-| III — Web-First | ✅ | All new capabilities exposed via existing REST API; no new deployable units |
-| IV — Docs as Architecture | ⚠️ Required | README Mermaid diagram MUST be updated in this PR to show new storage tiers |
+| **I. KISS** | ✅ Justified | Marker replaces a bespoke PDFBox column heuristic + Google Cloud SDK with one HTTP call. Net complexity reduction vs. the Document AI approach. |
+| **II. Easy to Change** | ✅ | `MarkerPageParser` is the only class that knows about Marker; swap the implementation to replace Marker with any other parser. `PageResult` DTO remains unchanged. |
+| **III. Web-First** | ✅ | Internal pipeline change; no public API contract change. |
+| **IV. Documentation** | ✅ | README must be updated to show Marker as a local external service. |

 ## Project Structure

@@ -46,60 +48,38 @@ specs/002-image-aware-embedding/
 ├── research.md          # Phase 0 output
 ├── data-model.md        # Phase 1 output
 ├── quickstart.md        # Phase 1 output
-├── contracts/           # Phase 1 output
-└── tasks.md             # Phase 2 output (/speckit.tasks)
+├── contracts/
+│   ├── api.md           # HTTP API contracts (unchanged from initial plan)
+│   └── marker-page-result.md  # Internal DTO contract (MarkerPageParser → downstream)
+└── tasks.md             # Phase 2 output (/speckit.tasks — not created here)
 ```

-### Source Code (repository root)
+### Source Code

 ```text
 backend/
 ├── src/main/java/com/aiteacher/
+│   ├── config/
+│   │   └── MarkerConfig.java          # NEW: RestClient bean + base-url property
+│   ├── document/
+│   │   ├── MarkerPageParser.java      # NEW: replaces DocumentAiPageParser + PdfStructureParser
+│   │   ├── PageResult.java            # UPDATED: FigureBbox → FigureData (bytes not bbox)
+│   │   ├── FigureExtractionService.java  # UPDATED: no PDFBox render; decode bytes directly
+│   │   ├── TextChunkingService.java   # UNCHANGED
+│   │   ├── VisionDescriptionService.java # UNCHANGED
+│   │   └── [removed] DocumentAiPageParser.java
 │   ├── book/
-│   │   ├── Book.java                         (existing)
-│   │   ├── BookController.java               (existing)
-│   │   ├── BookService.java                  (existing)
-│   │   ├── BookRepository.java               (existing)
-│   │   ├── BookStatus.java                   (existing)
-│   │   ├── BookEmbeddingService.java         (existing — enhanced)
-│   │   └── NoKnowledgeSourceException.java   (existing)
-│   ├── document/                             (new package)
-│   │   ├── BookNode.java
-│   │   ├── ChapterNode.java
-│   │   ├── SectionNode.java
-│   │   ├── SectionRepository.java
-│   │   ├── TextChunkNode.java
-│   │   ├── FigureNode.java
-│   │   ├── FigureRepository.java
-│   │   ├── FigureType.java
-│   │   ├── ChunkFigureRef.java
-│   │   └── ChunkFigureRefRepository.java
-│   ├── figure/                               (new package)
-│   │   ├── FigureStorageService.java         (interface)
-│   │   └── LocalFigureStorageService.java    (implementation)
-│   ├── retrieval/                            (new package)
-│   │   └── NeurosurgeryRetriever.java
-│   ├── chat/
-│   │   └── ChatService.java                  (updated — uses NeurosurgeryRetriever)
-│   └── config/
-│       └── FigureStorageConfig.java          (new — configures upload dir)
-└── src/main/resources/
-    └── db/migration/
-        ├── V4__document_hierarchy.sql        (new)
-        └── V5__figures_and_refs.sql          (new)
-
-uploads/
-└── figures/                                  (runtime — extracted images; gitignored)
+│   │   └── BookEmbeddingService.java  # MINOR UPDATE: inject MarkerPageParser, drop DocumentAiPageParser
+│   └── [removed] config/DocumentAiConfig.java
+├── src/main/resources/
+│   └── application.yaml               # UPDATED: remove document-ai.*, add marker.base-url
+└── pom.xml                            # UPDATED: remove google-cloud-document-ai
 ```

-**Structure Decision**: Option 2 (Web Application) confirmed. All backend changes stay within
-`backend/`. Two new packages (`document/`, `retrieval/`) plus one interface package (`figure/`)
-keep concerns separated without adding a deployable unit.
+**Structure Decision**: Option 2 (backend + frontend) per constitution Technology Constraints.
+Frontend changes are display-only (render figure citations inline).

 ## Complexity Tracking

-| Violation | Why Needed | Simpler Alternative Rejected Because |
-|-----------|------------|-------------------------------------|
-| Document hierarchy (BookNode → ChapterNode → SectionNode) | Parent-child retrieval: chunks reference their parent section so the LLM receives full section context, not just the matching fragment. This is the established solution for RAG precision. | Flat page-per-doc model (current) loses inter-sentence context; chunk-only retrieval produces incomplete answers for multi-paragraph clinical questions |
-| Dual vector search (text chunks + figure captions) | Figure captions must be independently searchable — a query about "cavernous sinus anatomy" must surface the diagram even if no text chunk scores highly | Single vector store search would miss figures whose captions don't happen to be the highest-similarity hit; this is the core deliverable of the feature |
-| Third storage tier (local file store for images) | Extracted images cannot live in Postgres (binary blobs degrade query performance) or the vector store (only vectors). A file-per-image approach is standard. | Storing images as base64 in Postgres JSONB would bloat the DB and complicate backup/restore; the `FigureStorageService` interface keeps the implementation swappable |
+> No constitution violations — Marker reduces complexity compared to the previous
+> Google Document AI approach (fewer dependencies, no GCP credentials, no 15-page batching).
@@ -1,34 +1,67 @@
 # Quickstart: Enhanced Embedding with Image Parsing and Metadata

-**Branch**: `002-image-aware-embedding` | **Date**: 2026-04-03
+**Branch**: `002-image-aware-embedding` | **Date**: 2026-04-04 (updated: Marker replaces Google Document AI)

 ---

 ## Prerequisites

 - Docker Compose running (PostgreSQL + pgvector)
- OpenAI API key set in `backend/src/main/resources/application.properties` or as env var `OPENAI_API_KEY`
+- OpenAI API key set as env var `OPENAI_API_KEY`
 - Java 25 + Maven on PATH
+- **Marker server running** on `http://localhost:8000` (see setup below)
+- S3-compatible bucket configured (existing setup)

 ---

-## New Configuration
+## Marker Server Setup (one-time)

-Add to `backend/src/main/resources/application.properties`:
+Marker is a local Python service — no cloud credentials required.

-```properties
-# Figure storage
-app.figure-storage.base-path=./uploads
-app.figure-storage.min-image-size-px=100
+```bash
+# Install (Python 3.10+ required)
+pip install marker-pdf
+
+# Start the server on port 8000
+marker_server --port 8000
 ```

-The `uploads/figures/` directory is created automatically on first use. Add it to `.gitignore`.
+The server is ready when you see:
+```
+INFO:     Uvicorn running on http://0.0.0.0:8000
+```
+
+Keep the server running in the background (or use a process manager like `systemd` or `screen`).
+
+---
+
+## Backend Configuration
+
+Add or update `backend/src/main/resources/application.yaml`:
+
+```yaml
+app:
+  figure-storage:
+    endpoint: https://your-s3-endpoint
+    region: your-region
+    bucket: ${S3_BUCKET:aiteacher}
+    access-key-id: ${S3_ACCESS_KEY_ID}
+    secret-access-key: ${S3_SECRET_ACCESS_KEY}
+    min-image-size-px: 100   # skip decorative images smaller than 100×100 px
+  marker:
+    base-url: ${MARKER_BASE_URL:http://localhost:8000}
+  embedding:
+    batch-size: 20
+    batch-delay-ms: 2000
+```
+
+No GCP credentials or project IDs are needed.

 ---

 ## Database Migration

-Two new Flyway migrations run automatically on startup:
+Two Flyway migrations run automatically on startup:

 - `V4__document_hierarchy.sql` — adds `chapter` and `section` tables
 - `V5__figures_and_refs.sql` — adds `figure` and `chunk_figure_ref` tables
@@ -54,10 +87,11 @@ image-aware pipeline runs. Status can be polled via `GET /api/v1/books`.

 ## Verifying Image Extraction

-1. Upload a PDF with diagrams: `POST /api/v1/books/upload`
-2. Wait for `status: "READY"` via `GET /api/v1/books`
-3. List figures: `GET /api/v1/books/{id}/figures` — should return at least one entry per image page
-4. Ask a diagram-specific question in chat — response `sources` should include a `type: "FIGURE"` entry
+1. Ensure Marker is running: `curl http://localhost:8000` should respond.
+2. Upload a PDF with diagrams: `POST /api/v1/books/upload`
+3. Wait for `status: "READY"` via `GET /api/v1/books`
+4. List figures: `GET /api/v1/books/{id}/figures` — should return at least one entry per image page
+5. Ask a diagram-specific question in chat — response `sources` should include a `type: "FIGURE"` entry

 ---

@@ -80,7 +114,8 @@ mvn test
 ```

 Key new test classes:
- `FigureExtractionServiceTest` — unit tests for image extraction and classification
+- `MarkerPageParserTest` — unit tests for JSON parsing and block-to-PageResult mapping
+- `FigureExtractionServiceTest` — unit tests for base64 decode, size filtering, classification
 - `NeurosurgeryRetrieverTest` — unit tests for dual-search merge and deduplication
 - `BookEmbeddingServiceIntegrationTest` — integration test: upload PDF with known figures,
  verify figures appear in `GET /api/v1/books/{id}/figures`
@@ -1,10 +1,10 @@
 # Research: Enhanced Embedding with Image Parsing and Metadata

-**Branch**: `002-image-aware-embedding` | **Date**: 2026-04-03
+**Branch**: `002-image-aware-embedding` | **Date**: 2026-04-04 (updated: Marker replaces Google Document AI)

-This document resolves all technical unknowns identified during planning. The primary source for
-decisions is the detailed architecture provided directly by the project owner, supplemented by
-Spring AI 2.0.0-M4 API specifics.
+This document resolves all technical unknowns identified during planning. Decisions 1–10 cover
+the core pipeline. The **Marker Study** section at the bottom explains why Marker was chosen
+over Google Document AI to drive PDF parsing and figure extraction.

 ---

@@ -28,19 +28,29 @@ association explicit and queryable.

 ---

-## Decision 2: Image Extraction Strategy
+## Decision 2: Document Parsing Strategy

-**Decision**: Use PDFBox (already on classpath via `spring-ai-pdf-document-reader`) to extract
-images per page. Each image is tagged with `page`, `figure_id` (derived from caption, e.g.
-"Fig. 12-4"), and the parent `sectionId`. Images are saved to local disk under
-`/uploads/figures/{bookId}/`.
+**Decision**: Use **Marker** (local HTTP server, `http://localhost:8000/marker/upload`) as the
+single entry point for PDF parsing. A single `POST` with `output_format=json` returns:
+- Reading-order text blocks (headings, paragraphs) — no column-split heuristic needed
+- Pre-cropped figure images as base64-encoded PNG in the `images` map of each `Figure` block
+- Table, equation, and code blocks as structured HTML

-**Rationale**: PDFBox is already present (Spring AI bundles it). No new dependency needed.
-Per-page extraction ensures every image is captured regardless of PDF structure.
+`MarkerPageParser` translates the Marker JSON response into `List<PageResult>`, which is the
+same internal DTO used by the rest of the pipeline.
+
+**Rationale**: Marker handles column reordering, scanned-page OCR, and figure cropping in one
+call, eliminating the PDFBox column heuristic (`PdfStructureParser`) and the PDFBox
+render+crop loop in `FigureExtractionService`. Net result: fewer classes, no cloud dependency,
+no GCP credentials.

 **Alternatives considered**:
- iText / iText7 → additional commercial dependency; overkill for extraction
- Screenshot each page as PNG, then OCR → far slower; loses vector quality
+- PDFBox column heuristic (previous approach) → rejected: 50/50 split fails on asymmetric
+  columns and scanned pages
+- Google Document AI Layout Parser → rejected: adds GCP credentials, per-page billing, 15-page
+  batch limit, and still requires PDFBox to render+crop figure regions from bounding boxes.
+  See Marker Study below for detailed comparison.
+- Screenshot each page + OCR → far slower; loses digital text quality

 ---

@@ -103,18 +113,19 @@ search. This is the higher-recall path; dual search (Decision 4) is the higher-p

 ## Decision 6: Image Storage

-**Decision**: Extracted images are saved as PNG files to a local directory
-(`${app.figure-storage.base-path}`, defaults to `./uploads/figures/{bookId}/`). The path is
-stored in `figure.image_path` in Postgres. A `FigureStorageService` interface wraps all disk
-I/O so the implementation can be swapped to S3 or another object store without changing
-callers.
+**Decision**: Marker returns figure images as base64-encoded PNG bytes in the JSON response.
+`FigureExtractionService` decodes these bytes and passes them to `FigureStorageService`, which
+persists them to an S3-compatible bucket (`${app.figure-storage.bucket}`). The image path/URL
+is stored in `figure.image_path` in Postgres.

-**Rationale**: Local disk is the simplest viable option for a POC with <10 users. The interface
-boundary satisfies Constitution Principle II (Easy to Change).
+The `FigureStorageService` interface is unchanged; only the caller changes (from PDFBox crop
+to base64 decode).
+
+**Rationale**: Marker's pre-cropped images remove the need for PDFBox rendering.
+`FigureStorageService` interface boundary satisfies Constitution Principle II (Easy to Change).

 **Alternatives considered**:
- S3 from day 1 → operational overhead not justified at POC scale
- Base64 in Postgres JSONB → bloats DB; complicates backup; query performance degrades
+- Store base64 in Postgres JSONB → bloats DB; complicates backup; query performance degrades

 ---

@@ -123,7 +134,8 @@ boundary satisfies Constitution Principle II (Easy to Change).
 **Decision**: Use the enum `FigureType { ANATOMICAL_DIAGRAM, SURGICAL_PHOTOGRAPH, MRI_CT_SCAN,
 TABLE, CHART, INTRAOPERATIVE_IMAGE }`. Classification is derived from:
 1. Caption keywords ("MRI", "CT", "Fig.", "Table") — heuristic, no model needed
-2. Fall back to `ANATOMICAL_DIAGRAM` if unclassifiable
+2. Marker `block_type` hint (`"Table"` → TABLE, `"Figure"` / `"Picture"` → ANATOMICAL_DIAGRAM default)
+3. Fall back to `ANATOMICAL_DIAGRAM` if unclassifiable

 **Rationale**: Allows the frontend to render different icon/label per type (e.g., "MRI" badge).
 Heuristic classification avoids a separate model call per image at extraction time.
@@ -175,14 +187,225 @@ the process fails mid-way. An explicit, idempotent trigger is safer and more obs

 ## Decision 10: Minimum Image Size Threshold

-**Decision**: Images smaller than 100×100 pixels are discarded and no chunk is created. This
-threshold filters out decorative elements (bullets, dividers, publisher logos) without a
-classification model.
+**Decision**: Images smaller than 100×100 pixels are discarded and no chunk is created. Marker
+returns PNG bytes; `FigureExtractionService` decodes to `BufferedImage` solely to check
+dimensions. This threshold filters out decorative elements without a classification model.

 **Rationale**: Neurosurgery textbook diagrams and MRI scans are never smaller than 100×100 px.
-The threshold is configurable via `app.figure-storage.min-image-size-px` in
-`application.properties`.
+The threshold is configurable via `app.figure-storage.min-image-size-px`.

 **Alternatives considered**:
 - No threshold → decorative icons pollute the figure index
 - ML-based classification → accurate but adds model dependency; not needed at POC scale
+
+---
+
+# Marker Study — Why Marker Replaces Google Document AI
+
+*Added 2026-04-04.*
+
+## What Marker Offers
+
+Marker is an open-source, locally-runnable PDF-to-structured-content converter that uses a
+pipeline of deep-learning models (surya for OCR + layout detection, texify for equations).
+Key capabilities relevant to this project:
+
+| Capability | Marker | Google Document AI |
+|-----------|--------|--------------------|
+| Multi-column reading order | ✅ | ✅ |
+| OCR on scanned pages | ✅ | ✅ |
+| Figure detection | ✅ returns pre-cropped images | ⚠️ returns bbox only; PDFBox still needed |
+| Table extraction | ✅ HTML tables | ✅ |
+| JSON output with image bytes | ✅ base64 in `images` map | ❌ |
+| No cloud credentials | ✅ | ❌ GCP service account required |
+| No per-page billing | ✅ | ❌ ~$10/1,000 pages |
+| Batch size limits | None (local) | 15 pages / 20 MB per sync call |
+| Setup | `pip install marker-pdf && marker_server` | GCP project + processor + IAM |
+
+---
+
+## Does Marker Solve the Current Pain Points?
+
+### Pain Point 1: Naive 50/50 Column Split
+
+**Answer: Yes, Marker fixes this completely.**
+
+`PdfStructureParser.extractPageText()` splits pages at the horizontal midpoint with a 20%
+threshold. This fails on asymmetric columns and scanned pages. Marker's surya layout model
+returns blocks in natural reading order — no heuristic needed.
+
+### Pain Point 2: Figure Detection Misses Rasterized Figures
+
+**Answer: Yes, Marker fixes this for most cases.**
+
+`FigureExtractionService` previously iterated PDF XObjects (only finds embedded XObject images,
+misses rasterized figures and vector-path drawings). Marker's layout model detects visual
+elements by type and returns the cropped image bytes directly — no PDFBox page rendering needed.
+
+### Pain Point 3: OCR on Scanned Pages
+
+**Answer: Yes, Marker handles scanned pages transparently via surya OCR.**
+
+### Pain Point 4: Caption Detection
+
+**Answer: Improved — Marker groups caption blocks with their figure block.**
+
+The `block_type = "Caption"` block appears as a sibling or child adjacent to the `"Figure"`
+block in the Marker JSON, making caption association structural rather than regex-based.
+
+---
+
+## Marker API Integration
+
+### Local Server Setup
+
+```bash
+pip install marker-pdf
+marker_server --port 8000
+```
+
+The server exposes `POST /marker/upload` (the user's configured endpoint).
+
+### Request
+
+```
+POST http://localhost:8000/marker/upload
+Content-Type: multipart/form-data
+
+file=@document.pdf
+output_format=json
+```
+
+### Response (abbreviated)
+
+```json
+{
+  "output_format": "json",
+  "output": {
+    "block_type": "Document",
+    "children": [
+      {
+        "block_type": "Page",
+        "id": "/page/0/Page/0",
+        "children": [
+          {
+            "block_type": "SectionHeader",
+            "id": "/page/0/SectionHeader/0",
+            "html": "<h1>Cavernous Sinus Anatomy</h1>"
+          },
+          {
+            "block_type": "Text",
+            "id": "/page/0/Text/1",
+            "html": "<p>The cavernous sinus contains...</p>"
+          },
+          {
+            "block_type": "Figure",
+            "id": "/page/0/Figure/2",
+            "html": "<figure><img src='/page/0/Figure/2'/></figure>",
+            "images": {
+              "/page/0/Figure/2": "iVBORw0KGgo..."
+            }
+          },
+          {
+            "block_type": "Caption",
+            "id": "/page/0/Caption/3",
+            "html": "<p>Fig. 12-4. Coronal cross-section...</p>"
+          }
+        ]
+      }
+    ],
+    "metadata": { "page_stats": [...] }
+  }
+}
+```
+
+### Java Integration Pattern
+
+```java
+// MarkerPageParser — core call
+MultiValueMap<String, Object> body = new LinkedMultiValueMap<>();
+body.add("file", new FileSystemResource(pdfPath));
+body.add("output_format", "json");
+
+JsonNode response = restClient.post()
+    .uri(baseUrl + "/marker/upload")
+    .contentType(MediaType.MULTIPART_FORM_DATA)
+    .body(body)
+    .retrieve()
+    .body(JsonNode.class);
+
+JsonNode document = response.get("output");
+```
+
+### Mapping Marker Blocks to PageResult
+
+```
+Page block (id "/page/N/Page/M") → PageResult(pageNumber = N+1)
+  SectionHeader children           → headingTitle (first match)
+  Text, TextInlineMath children    → orderedText (HTML stripped, joined \n\n)
+  Figure children with images map  → FigureData(imageBytes = base64decode(images[id]))
+  Caption sibling of Figure        → FigureData.nearestCaption
+```
+
+---
+
+## Architecture Change
+
+```
+Before (Document AI — removed):
+  DocumentAiPageParser
+      → Google Document AI API (GCP, 15-page batches, credentials)
+      → returns text blocks + figure bboxes
+  PdfStructureParser (PDFBox column heuristic)
+  FigureExtractionService
+      → renders page via PDFBox at 150 DPI
+      → crops bbox region
+
+After (Marker):
+  MarkerPageParser
+      → POST PDF to http://localhost:8000/marker/upload (output_format=json)
+      → returns text blocks (correct reading order) + Figure blocks with base64 images
+      → produces List<PageResult> (same DTO, FigureData carries bytes not bbox)
+  FigureExtractionService (simplified)
+      → base64-decodes image bytes from PageResult.FigureData
+      → checks min size (ImageIO.read → getWidth/getHeight)
+      → saves to S3 via FigureStorageService (UNCHANGED)
+  VisionDescriptionService (UNCHANGED)
+  BookEmbeddingService orchestration (MINOR: inject MarkerPageParser)
+```
+
+**What is removed**:
+- `DocumentAiPageParser` — replaced by `MarkerPageParser`
+- `DocumentAiConfig` — replaced by `MarkerConfig`
+- `PdfStructureParser` — Marker handles reading order
+- `google-cloud-document-ai` Maven dependency
+- `app.document-ai.*` configuration properties
+
+**What stays the same**:
+- `PageResult` DTO structure (fields renamed, not restructured)
+- `FigureExtractionService` public interface
+- `TextChunkingService`, `VisionDescriptionService`, `BookEmbeddingService` orchestration
+- All JPA entities, repositories, vector store, S3 storage
+
+---
+
+## Constitution Compliance
+
+| Principle | Assessment |
+|-----------|------------|
+| **I. KISS** | ✅ Simpler than Document AI — one HTTP call replaces GCP SDK + PDFBox render loop. No new dependency beyond an HTTP client (Spring RestClient, already available). |
+| **II. Easy to Change** | ✅ `MarkerPageParser` is the only Marker-aware class. Swap it to use any other parser. `PageResult` DTO unchanged in contract. |
+| **III. Web-First** | ✅ Internal pipeline change; no API contract change. |
+| **IV. Documentation** | ✅ README must show Marker as a local external service dependency. |
+
+---
+
+## Risks & Mitigations
+
+| Risk | Likelihood | Mitigation |
+|------|-----------|------------|
+| Marker server not running when book is uploaded | Medium | `BookEmbeddingService` catches exception from `MarkerPageParser`, marks book as `FAILED`, logs full error. |
+| Marker misses some figures (complex PDFs) | Medium | `app.figure-storage.min-image-size-px` threshold can be tuned. Add fallback: if Marker returns 0 figures for a page with known images, log a warning. |
+| SC-003 (≤ 3× processing time) violated | Low | Marker runs locally (no network latency to cloud). Benchmark with a real 500-page book early. |
+| Large PDF upload to Marker (>100MB) | Low | Marker server handles the full file; no batching needed. Multipart upload limit configurable. |
+| Marker image quality vs PDFBox crop | Low | Marker crops at native resolution; quality is equivalent or better than 150 DPI PDFBox render. |
@@ -48,12 +48,13 @@

 **Independent Test**: Upload a PDF containing at least one page with a labelled anatomical diagram. After status shows `READY`, call `GET /api/v1/books/{id}/figures` — response must contain at least one entry with `figureType`, `caption`, `page`, and `imageUrl` populated. Verify the PNG file exists at the path in `imagePath`.

- [X] T013 [US2] Create `PdfStructureParser` service in `backend/src/main/java/com/aiteacher/document/PdfStructureParser.java` — uses Spring AI's `PagePdfDocumentReader` to extract per-page text; groups pages into `SectionEntity` records using heading-detection heuristics (lines matching `^\d+(\.\d+)*\s+[A-Z]`); groups sections into `ChapterEntity` records; persists both to Postgres via `ChapterRepository` and `SectionRepository`; returns `List<SectionEntity>` for the book
- [X] T014 [US2] Create `FigureExtractionService` in `backend/src/main/java/com/aiteacher/document/FigureExtractionService.java` — opens PDF with PDFBox `PDDocument`; iterates pages; extracts `PDImageXObject` instances; skips images whose width or height are below `min-image-size-px`; classifies `FigureType` using the keyword-matching table from data-model.md §FigureType; parses caption from the nearest text line matching `CAPTION_PATTERN`; saves PNG via `FigureStorageService`; persists `FigureEntity` to `FigureRepository`; returns `List<FigureEntity>` per book
+- [X] T013 [US2] ~~Create `PdfStructureParser`~~ → **SUPERSEDED**: PDF parsing is handled by `MarkerPageParser` (see T013b). `PdfStructureParser` exists but is not wired into the pipeline.
+- [X] T013b [US2] Create `MarkerPageParser` in `backend/src/main/java/com/aiteacher/document/MarkerPageParser.java` — POSTs PDF to `http://localhost:8000/marker/upload?output_format=json` via Spring `RestClient`; parses JSON response into `List<PageResult>` (one per page block); extracts heading, ordered text, and pre-cropped figure PNG bytes per page
+- [X] T014 [US2] Update `FigureExtractionService` in `backend/src/main/java/com/aiteacher/document/FigureExtractionService.java` — **Marker migration**: removed PDFBox rendering + bbox-crop loop; decodes PNG bytes from `PageResult.FigureData` via `ImageIO.read()`; skips images below `min-image-size-px`; classifies `FigureType`; saves via `FigureStorageService`; persists `FigureEntity`
 - [X] T015 [US2] Create `VisionDescriptionService` in `backend/src/main/java/com/aiteacher/document/VisionDescriptionService.java` — accepts a `Path` to a PNG and a caption String; calls the OpenAI vision model (via Spring AI `ChatClient` with image media type) to generate a 2–4 sentence clinical description; returns the generated description string; handles API failures by returning the caption as fallback
 - [X] T016 [US2] Create `TextChunkingService` in `backend/src/main/java/com/aiteacher/document/TextChunkingService.java` — accepts a `SectionEntity`; splits `fullText` into overlapping 400–600 token windows (20-token overlap); wraps each window in a Spring AI `Document` with the flat metadata map defined in data-model.md §Text chunk document; returns `List<Document>`
 - [X] T017 [US2] Create `ChunkFigureRefService` in `backend/src/main/java/com/aiteacher/document/ChunkFigureRefService.java` — accepts a Spring AI `Document` (with its `id` as `chunkId`) and a `List<FigureEntity>` for the book; scans chunk text for patterns `Fig\.\s*\d+[\-\.]\d+` and `Figure\s+\d+[\-\.]\d+`; matches against figure labels; persists `ChunkFigureRefEntity` rows via `ChunkFigureRefRepository`
- [X] T018 [US2] Rewrite `BookEmbeddingService.embedBook()` in `backend/src/main/java/com/aiteacher/book/BookEmbeddingService.java` to orchestrate the full pipeline: (1) `PdfStructureParser` → sections; (2) parallel: `FigureExtractionService` + `TextChunkingService` for each section; (3) `VisionDescriptionService` for each figure; (4) embed figure captions+descriptions as `Document`s (metadata per data-model.md §Figure caption document) into `vectorStore`; (5) embed text chunks into `vectorStore`; (6) `ChunkFigureRefService` for each chunk; update `captionEmbeddingId` on `FigureEntity` after embedding
+- [X] T018 [US2] Update `BookEmbeddingService.embedBook()` — **Marker migration**: injected `MarkerPageParser` replacing `DocumentAiPageParser`; updated `figureExtractionService.extract()` call (removed `pdfPath` arg); updated log message. Pipeline: (1) `MarkerPageParser` → `List<PageResult>`; (2) `buildAndSaveSections()` → sections; (3) `TextChunkingService` → chunks → embed; (4) `FigureExtractionService.extract()` → figures; (5) `VisionDescriptionService` → embed figure chunks; (6) `ChunkFigureRefService` → refs
 - [X] T019 [US2] Extend `BookEmbeddingService.deleteBookChunks()` to also delete: all `ChunkFigureRefEntity` rows (via `findByFigureIdIn`), all `FigureEntity` rows (via `deleteAllByBookId`), all figure PNG files (via `FigureStorageService.delete(bookId)`), all `SectionEntity` and `ChapterEntity` rows for the book
 - [X] T020 [US2] Add `POST /api/v1/books/{id}/reembed` endpoint to `BookController` in `backend/src/main/java/com/aiteacher/book/BookController.java` — returns `202` with `{ bookId, status: "PROCESSING" }`; returns `404` if not found; returns `409` if already `PROCESSING`; calls `deleteBookChunks()` then `embedBook()` asynchronously