Add Marker for effective PDF parsing
@@ -0,0 +1,79 @@
# Internal Contract: DocumentAiPageParser → FigureExtractionService

**Branch**: `002-image-aware-embedding` | **Date**: 2026-04-04
**Type**: Internal Java DTO (not an HTTP contract)

---

## Purpose

`PageResult` is the internal data transfer object produced by `DocumentAiPageParser` for each
PDF page. It decouples the Google Document AI SDK types from the rest of the pipeline so that
`PdfStructureParser` can be replaced without cascading changes.

---

## Java Record

```java
package com.aiteacher.document;

import java.util.List;

/**
 * Internal DTO produced by DocumentAiPageParser for one PDF page.
 * Decouples the Document AI SDK types from downstream services.
 */
public record PageResult(
    int pageNumber,           // 1-based, matches Document.Page.getPageNumber()
    String orderedText,       // full page text in correct reading order (blocks joined by \n\n)
    String headingTitle,      // first HEADING block on page, or null
    List<FigureBbox> figures  // detected figure regions (may be empty)
) {

    /**
     * Normalized bounding box for a detected figure region.
     * Coordinates are in the [0.0, 1.0] range relative to page dimensions.
     */
    public record FigureBbox(
        float x,               // left edge (normalized)
        float y,               // top edge (normalized)
        float width,           // width (normalized)
        float height,          // height (normalized)
        String nearestCaption  // text of adjacent paragraph block, or null
    ) {}
}
```

---

## Production Rules

| Field | Rule |
|-------|------|
| `orderedText` | Concatenation of all `PARAGRAPH` and `HEADING_*` blocks, joined with `\n\n`. Tables are represented as tab-separated text. |
| `headingTitle` | First block whose `blockType` is `HEADING_1` through `HEADING_6`. `null` if no heading detected. |
| `figures` | One entry per `VisualElement` with `type == "figure"` and `confidence ≥ 0.5`. Sorted top-to-bottom by `y`. |
| `nearestCaption` | The `PARAGRAPH` block immediately following the figure bbox (by Y coordinate). May be `null` if no paragraph follows within 10% of page height. |
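
The `figures` rule in the table above (keep elements of type `figure` with confidence of at least 0.5, then sort top-to-bottom) could be sketched roughly as follows. `Candidate` is a hypothetical stand-in introduced only for this illustration; it is not a Document AI SDK type, and the real parser would read these values from the SDK objects.

```java
import java.util.Comparator;
import java.util.List;

final class FigureSelection {

    /** Hypothetical stand-in for a detected visual element (not an SDK type). */
    record Candidate(String type, float confidence, float y) {}

    /** Threshold taken from the production rule: confidence >= 0.5. */
    static final float MIN_CONFIDENCE = 0.5f;

    /** Keeps confident "figure" elements and orders them top-to-bottom by y. */
    static List<Candidate> select(List<Candidate> candidates) {
        return candidates.stream()
                .filter(c -> "figure".equals(c.type()))
                .filter(c -> c.confidence() >= MIN_CONFIDENCE)
                .sorted(Comparator.comparingDouble(Candidate::y))
                .toList();
    }
}
```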

---

## Mapping from Document AI Proto

```
Document.Page.Block                  → orderedText (concatenated)
Document.Page.Block (HEADING_*)      → headingTitle (first match)
Document.Page.VisualElement          → FigureBbox
  └─ layout.bounding_poly.normalized_vertices[0]  → (x, y) top-left
  └─ normalized_vertices[2]                       → (x+w, y+h) bottom-right
```
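
As a rough illustration of the vertex mapping above, the sketch below turns a top-left and a bottom-right normalized vertex into the `x`, `y`, `width`, `height` form used by `FigureBbox`. The `Vertex` record and the clamping to `[0.0, 1.0]` are assumptions made for this example; the record is redeclared locally so the snippet compiles on its own.

```java
final class BboxMapping {

    /** Hypothetical stand-in for one normalized vertex. */
    record Vertex(float x, float y) {}

    /** Local copy of the DTO shape described in this contract. */
    record FigureBbox(float x, float y, float width, float height, String nearestCaption) {}

    /** Keeps a coordinate inside the normalized [0, 1] range (an assumption of this sketch). */
    static float clamp(float v) {
        return Math.max(0f, Math.min(1f, v));
    }

    /** vertices[0] is the top-left corner, vertices[2] the bottom-right corner. */
    static FigureBbox fromVertices(Vertex topLeft, Vertex bottomRight, String caption) {
        float x = clamp(topLeft.x());
        float y = clamp(topLeft.y());
        float width = Math.max(0f, clamp(bottomRight.x()) - x);
        float height = Math.max(0f, clamp(bottomRight.y()) - y);
        return new FigureBbox(x, y, width, height, caption);
    }
}
```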

---

## Consumers

| Consumer | What It Uses |
|----------|-------------|
| `BookEmbeddingService` | `orderedText` → `SectionEntity.fullText`; `headingTitle` → `SectionEntity.title` |
| `FigureExtractionService` | `figures` list → renders page via PDFBox, crops each bbox to `BufferedImage` |
| `TextChunkingService` | Receives `SectionEntity` (indirectly uses `orderedText`) — **unchanged** |
@@ -0,0 +1,84 @@
# Internal Contract: MarkerPageParser → FigureExtractionService / BookEmbeddingService

**Branch**: `002-image-aware-embedding` | **Date**: 2026-04-04
**Type**: Internal Java DTO (not an HTTP contract)

---

## Purpose

`PageResult` is the internal data transfer object produced by `MarkerPageParser` for each
PDF page. It decouples the Marker HTTP API from the rest of the pipeline. Downstream consumers
(`BookEmbeddingService`, `FigureExtractionService`, `TextChunkingService`) are unaware of
Marker and depend only on this DTO.

---

## Java Record

```java
package com.aiteacher.document;

import java.util.List;

/**
 * Internal DTO produced by MarkerPageParser for one PDF page.
 * Decouples the Marker HTTP API from downstream services.
 */
public record PageResult(
    int pageNumber,           // 1-based, derived from Marker page block index
    String orderedText,       // full page text in correct reading order (blocks joined by \n\n)
    String headingTitle,      // first SectionHeader block on page, or null
    List<FigureData> figures  // extracted figure images (may be empty)
) {

    /**
     * A figure extracted from the page.
     * Image bytes are PNG data decoded from the Marker JSON `images` map.
     */
    public record FigureData(
        byte[] imageBytes,      // PNG image data (base64-decoded from Marker response)
        String nearestCaption,  // text of the adjacent Caption block, or null
        String blockId          // Marker block ID (e.g. "/page/0/Figure/2") for traceability
    ) {}
}
```

---

## Production Rules

| Field | Rule |
|-------|------|
| `pageNumber` | 1-based index derived from the Marker page block's position in the `children` array (index + 1). |
| `orderedText` | HTML-stripped text from all `Text`, `TextInlineMath`, `SectionHeader`, `ListItem`, and `Table` blocks, joined with `\n\n`. Marker already returns them in reading order. |
| `headingTitle` | Plain text of the first `SectionHeader` block on the page. `null` if no heading detected. |
| `figures` | One `FigureData` per `Figure` or `Picture` block that has a non-empty `images` entry. Blocks with no image data are skipped. |
| `imageBytes` | Base64-decoded bytes from `block.images[blockId]`. Marker returns PNG. |
| `nearestCaption` | Plain text of the first `Caption` block that is a sibling appearing immediately after the figure block. `null` if absent. |
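
The `orderedText` rule above (strip HTML from each block, then join with `\n\n`) might look roughly like the sketch below. The regex-based stripping is a deliberately naive assumption for illustration; it handles only a few entities, and because each tag becomes a space, inline markup such as `H<sub>2</sub>O` would come out as `H 2 O`. A real implementation may prefer an HTML parser.

```java
import java.util.List;
import java.util.stream.Collectors;

final class OrderedTextBuilder {

    /** Naive tag removal: replaces each tag with a space, then collapses whitespace. */
    static String stripHtml(String html) {
        String noTags = html.replaceAll("<[^>]*>", " ");
        String unescaped = noTags
                .replace("&lt;", "<")
                .replace("&gt;", ">")
                .replace("&quot;", "\"")
                .replace("&amp;", "&");
        return unescaped.replaceAll("\\s+", " ").trim();
    }

    /** Joins the stripped blocks in the order given, skipping blocks that end up empty. */
    static String join(List<String> htmlBlocks) {
        return htmlBlocks.stream()
                .map(OrderedTextBuilder::stripHtml)
                .filter(text -> !text.isEmpty())
                .collect(Collectors.joining("\n\n"));
    }
}
```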

---

## Mapping from Marker JSON

```
Marker JSON → PageResult

Page block ("/page/N/Page/M")        → PageResult(pageNumber = N + 1)
  SectionHeader child                → headingTitle (first match, HTML-stripped)
  Text / TextInlineMath children     → orderedText (HTML-stripped, joined \n\n)
  Figure / Picture child             → FigureData
    images[blockId]                  → FigureData.imageBytes (base64-decoded)
    next Caption sibling             → FigureData.nearestCaption (HTML-stripped)
    blockId                          → FigureData.blockId
```
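
One possible way to derive the 1-based page number from a block ID of the form `/page/N/...` is sketched below. The class and method names are hypothetical, and the sketch assumes IDs always start with `/page/<digits>/` as in the mapping above. The production rules instead describe using the position in the `children` array, which is an equally valid source if both agree.

```java
import java.util.OptionalInt;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class MarkerBlockIds {

    /** Matches the leading "/page/N/" segment of a Marker block ID. */
    private static final Pattern PAGE_PREFIX = Pattern.compile("^/page/(\\d+)/");

    /** Returns N + 1 for an ID such as "/page/N/Figure/2", or empty if the ID has another shape. */
    static OptionalInt pageNumber(String blockId) {
        Matcher matcher = PAGE_PREFIX.matcher(blockId);
        if (!matcher.find()) {
            return OptionalInt.empty();
        }
        return OptionalInt.of(Integer.parseInt(matcher.group(1)) + 1);
    }
}
```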

---

## Consumers

| Consumer | What It Uses |
|----------|-------------|
| `BookEmbeddingService` | `orderedText` → `SectionEntity.fullText`; `headingTitle` → `SectionEntity.title` |
| `FigureExtractionService` | `figures` list → decodes `imageBytes`, checks min size, saves to S3 |
| `TextChunkingService` | Receives `SectionEntity` (uses `orderedText` indirectly) — **unchanged** |
@@ -1,40 +1,42 @@
# Implementation Plan: Enhanced Embedding with Image Parsing and Metadata

**Branch**: `002-image-aware-embedding` | **Date**: 2026-04-03 | **Spec**: [spec.md](spec.md)
**Branch**: `002-image-aware-embedding` | **Date**: 2026-04-04 | **Spec**: [spec.md](spec.md)
**Input**: Feature specification from `/specs/002-image-aware-embedding/spec.md`

## Summary

Enhance the book embedding pipeline to extract images from every PDF page, generate descriptive
text for each image, and store all content (text chunks + figure captions) with rich, consistent
metadata in the vector store. A new document hierarchy (Book → Chapter → Section → TextChunk +
Figure) is introduced. Postgres holds the full-text sections and figure metadata; the vector
store holds chunk and figure caption embeddings; the local file store holds extracted image files.
At query time, both the text-chunk store and figure-caption store are searched in parallel and
results are merged before being sent to the LLM.
Enhance the PDF embedding pipeline to extract figures and generate AI descriptions for them,
making image content semantically searchable alongside text. PDF parsing and figure extraction
are delegated to a local **Marker** server (`http://localhost:8000/marker/upload`), which
returns reading-order text and pre-cropped figure images (base64) in a single JSON response,
eliminating the need for PDFBox column heuristics and figure bbox rendering.

## Technical Context

**Language/Version**: Java 25 (backend), TypeScript / Node 20 (frontend)
**Primary Dependencies**: Spring Boot 4.0.5, Spring AI 2.0.0-M4, OpenAI API (embeddings + chat), PDFBox (via Spring AI PDF reader dependency)
**Storage**: PostgreSQL (JPA + Flyway), pgvector (Spring AI `VectorStore`), local file system (extracted images — `/uploads/figures/`)
**Testing**: Spring Boot Test, JUnit 5, Mockito
**Target Platform**: Linux server (Docker Compose)
**Project Type**: Web application — backend REST API + Vue 3 frontend
**Performance Goals**: Full book (up to 500 pages with images) processed in ≤ 30 minutes; query response unchanged from existing baseline
**Constraints**: No new deployable units; all changes within the existing `backend/` module; image storage on local disk (S3 migration is a future concern, behind an interface)
**Scale/Scope**: POC — <10 concurrent users; single shared book library
**Primary Dependencies**: Spring Boot 4.0.5, Spring AI 2.0.0-M4, OpenAI API (embeddings +
GPT-4o vision), PDFBox 3.0.3 (via `spring-ai-pdf-document-reader` — retained transitively,
no longer used directly), Marker local HTTP API (`http://localhost:8000/marker/upload`)
**Storage**: PostgreSQL (JPA + Flyway), pgvector (Spring AI `VectorStore`), S3-compatible
object store (figure images via `FigureStorageService`)
**Testing**: Maven / JUnit 5 (`spring-boot-starter-test`)
**Target Platform**: Linux server
**Project Type**: Web application (backend API + frontend client)
**Performance Goals**: SC-003 — book processing time ≤ 3× text-only for ≤ 500 pages
**Constraints**: REST API only (Constitution III); Marker server must be running locally;
S3-compatible storage configured via env vars
**Scale/Scope**: POC — handful of books, <10 users

## Constitution Check

*GATE: Must pass before Phase 0 research. Re-check after Phase 1 design.*
*GATE: Must pass before Phase 0 research. Re-checked after Phase 1 design.*

| Principle | Status | Notes |
|-----------|--------|-------|
| I — KISS | ⚠️ Justified violation — see Complexity Tracking | Hierarchical model + dual search adds complexity; justified by precision requirement |
| II — Easy to Change | ✅ | Figure storage wrapped behind `FigureStorageService` interface; can swap local disk for S3 |
| III — Web-First | ✅ | All new capabilities exposed via existing REST API; no new deployable units |
| IV — Docs as Architecture | ⚠️ Required | README Mermaid diagram MUST be updated in this PR to show new storage tiers |
| **I. KISS** | ✅ Justified | Marker replaces a bespoke PDFBox column heuristic + Google Cloud SDK with one HTTP call. Net complexity reduction vs. the Document AI approach. |
| **II. Easy to Change** | ✅ | `MarkerPageParser` is the only class that knows about Marker; swap the implementation to replace Marker with any other parser. `PageResult` DTO remains unchanged. |
| **III. Web-First** | ✅ | Internal pipeline change; no public API contract change. |
| **IV. Documentation** | ✅ | README must be updated to show Marker as a local external service. |

## Project Structure

@@ -46,60 +48,38 @@ specs/002-image-aware-embedding/
├── research.md          # Phase 0 output
├── data-model.md        # Phase 1 output
├── quickstart.md        # Phase 1 output
├── contracts/           # Phase 1 output
└── tasks.md             # Phase 2 output (/speckit.tasks)
├── contracts/
│   ├── api.md                   # HTTP API contracts (unchanged from initial plan)
│   └── marker-page-result.md    # Internal DTO contract (MarkerPageParser → downstream)
└── tasks.md             # Phase 2 output (/speckit.tasks — not created here)
```

### Source Code (repository root)
### Source Code

```text
backend/
├── src/main/java/com/aiteacher/
│   ├── config/
│   │   └── MarkerConfig.java              # NEW: RestClient bean + base-url property
│   ├── document/
│   │   ├── MarkerPageParser.java          # NEW: replaces DocumentAiPageParser + PdfStructureParser
│   │   ├── PageResult.java                # UPDATED: FigureBbox → FigureData (bytes not bbox)
│   │   ├── FigureExtractionService.java   # UPDATED: no PDFBox render; decode bytes directly
│   │   ├── TextChunkingService.java       # UNCHANGED
│   │   ├── VisionDescriptionService.java  # UNCHANGED
│   │   └── [removed] DocumentAiPageParser.java
│   ├── book/
│   │   ├── Book.java (existing)
│   │   ├── BookController.java (existing)
│   │   ├── BookService.java (existing)
│   │   ├── BookRepository.java (existing)
│   │   ├── BookStatus.java (existing)
│   │   ├── BookEmbeddingService.java (existing — enhanced)
│   │   └── NoKnowledgeSourceException.java (existing)
│   ├── document/ (new package)
│   │   ├── BookNode.java
│   │   ├── ChapterNode.java
│   │   ├── SectionNode.java
│   │   ├── SectionRepository.java
│   │   ├── TextChunkNode.java
│   │   ├── FigureNode.java
│   │   ├── FigureRepository.java
│   │   ├── FigureType.java
│   │   ├── ChunkFigureRef.java
│   │   └── ChunkFigureRefRepository.java
│   ├── figure/ (new package)
│   │   ├── FigureStorageService.java (interface)
│   │   └── LocalFigureStorageService.java (implementation)
│   ├── retrieval/ (new package)
│   │   └── NeurosurgeryRetriever.java
│   ├── chat/
│   │   └── ChatService.java (updated — uses NeurosurgeryRetriever)
│   └── config/
│       └── FigureStorageConfig.java (new — configures upload dir)
└── src/main/resources/
    └── db/migration/
        ├── V4__document_hierarchy.sql (new)
        └── V5__figures_and_refs.sql (new)

uploads/
└── figures/ (runtime — extracted images; gitignored)
│   │   └── BookEmbeddingService.java      # MINOR UPDATE: inject MarkerPageParser, drop DocumentAiPageParser
│   └── [removed] config/DocumentAiConfig.java
├── src/main/resources/
│   └── application.yaml                   # UPDATED: remove document-ai.*, add marker.base-url
└── pom.xml                                # UPDATED: remove google-cloud-document-ai
```

**Structure Decision**: Option 2 (Web Application) confirmed. All backend changes stay within
`backend/`. Two new packages (`document/`, `retrieval/`) plus one interface package (`figure/`)
keep concerns separated without adding a deployable unit.
**Structure Decision**: Option 2 (backend + frontend) per constitution Technology Constraints.
Frontend changes are display-only (render figure citations inline).

## Complexity Tracking

| Violation | Why Needed | Simpler Alternative Rejected Because |
|-----------|------------|-------------------------------------|
| Document hierarchy (BookNode → ChapterNode → SectionNode) | Parent-child retrieval: chunks reference their parent section so the LLM receives full section context, not just the matching fragment. This is the established solution for RAG precision. | Flat page-per-doc model (current) loses inter-sentence context; chunk-only retrieval produces incomplete answers for multi-paragraph clinical questions |
| Dual vector search (text chunks + figure captions) | Figure captions must be independently searchable — a query about "cavernous sinus anatomy" must surface the diagram even if no text chunk scores highly | Single vector store search would miss figures whose captions don't happen to be the highest-similarity hit; this is the core deliverable of the feature |
| Third storage tier (local file store for images) | Extracted images cannot live in Postgres (binary blobs degrade query performance) or the vector store (only vectors). A file-per-image approach is standard. | Storing images as base64 in Postgres JSONB would bloat the DB and complicate backup/restore; the `FigureStorageService` interface keeps the implementation swappable |
> No constitution violations — Marker reduces complexity compared to the previous
> Google Document AI approach (fewer dependencies, no GCP credentials, no 15-page batching).

@@ -1,34 +1,67 @@
# Quickstart: Enhanced Embedding with Image Parsing and Metadata

**Branch**: `002-image-aware-embedding` | **Date**: 2026-04-03
**Branch**: `002-image-aware-embedding` | **Date**: 2026-04-04 (updated: Marker replaces Google Document AI)

---

## Prerequisites

- Docker Compose running (PostgreSQL + pgvector)
- OpenAI API key set in `backend/src/main/resources/application.properties` or as env var `OPENAI_API_KEY`
- OpenAI API key set as env var `OPENAI_API_KEY`
- Java 25 + Maven on PATH
- **Marker server running** on `http://localhost:8000` (see setup below)
- S3-compatible bucket configured (existing setup)

---

## New Configuration
## Marker Server Setup (one-time)

Add to `backend/src/main/resources/application.properties`:
Marker is a local Python service — no cloud credentials required.

```properties
# Figure storage
app.figure-storage.base-path=./uploads
app.figure-storage.min-image-size-px=100
```bash
# Install (Python 3.10+ required)
pip install marker-pdf

# Start the server on port 8000
marker_server --port 8000
```

The `uploads/figures/` directory is created automatically on first use. Add it to `.gitignore`.
The server is ready when you see:
```
INFO: Uvicorn running on http://0.0.0.0:8000
```

Keep the server running in the background, for example as a `systemd` service or inside a `screen` session.

---

## Backend Configuration

Add or update `backend/src/main/resources/application.yaml`:

```yaml
app:
  figure-storage:
    endpoint: https://your-s3-endpoint
    region: your-region
    bucket: ${S3_BUCKET:aiteacher}
    access-key-id: ${S3_ACCESS_KEY_ID}
    secret-access-key: ${S3_SECRET_ACCESS_KEY}
    min-image-size-px: 100  # skip decorative images smaller than 100×100 px
marker:
  base-url: ${MARKER_BASE_URL:http://localhost:8000}
embedding:
  batch-size: 20
  batch-delay-ms: 2000
```
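
The `${MARKER_BASE_URL:http://localhost:8000}` placeholder means "use the environment variable if set, otherwise the local default". Outside of Spring, the same fallback plus the `/marker/upload` path could be resolved roughly as below. `MarkerEndpoint` is a hypothetical helper written for this illustration; in the application itself the value would normally be injected from `marker.base-url`.

```java
import java.net.URI;
import java.util.Map;

final class MarkerEndpoint {

    /** Default taken from the YAML placeholder above. */
    static final String DEFAULT_BASE_URL = "http://localhost:8000";

    /** Resolves the upload URI from an environment map, tolerating a trailing slash. */
    static URI uploadUri(Map<String, String> env) {
        String base = env.getOrDefault("MARKER_BASE_URL", DEFAULT_BASE_URL);
        String trimmed = base.endsWith("/") ? base.substring(0, base.length() - 1) : base;
        return URI.create(trimmed + "/marker/upload");
    }
}
```

Calling `MarkerEndpoint.uploadUri(System.getenv())` would use the real process environment.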

No GCP credentials or project IDs are needed.

---

## Database Migration

Two new Flyway migrations run automatically on startup:
Two Flyway migrations run automatically on startup:

- `V4__document_hierarchy.sql` — adds `chapter` and `section` tables
- `V5__figures_and_refs.sql` — adds `figure` and `chunk_figure_ref` tables
@@ -54,10 +87,11 @@ image-aware pipeline runs. Status can be polled via `GET /api/v1/books`.

## Verifying Image Extraction

1. Upload a PDF with diagrams: `POST /api/v1/books/upload`
2. Wait for `status: "READY"` via `GET /api/v1/books`
3. List figures: `GET /api/v1/books/{id}/figures` — should return at least one entry per image page
4. Ask a diagram-specific question in chat — response `sources` should include a `type: "FIGURE"` entry
1. Ensure Marker is running: `curl http://localhost:8000` should respond.
2. Upload a PDF with diagrams: `POST /api/v1/books/upload`
3. Wait for `status: "READY"` via `GET /api/v1/books`
4. List figures: `GET /api/v1/books/{id}/figures` — should return at least one entry per image page
5. Ask a diagram-specific question in chat — response `sources` should include a `type: "FIGURE"` entry

---

@@ -80,7 +114,8 @@ mvn test
```

Key new test classes:
- `FigureExtractionServiceTest` — unit tests for image extraction and classification
- `MarkerPageParserTest` — unit tests for JSON parsing and block-to-PageResult mapping
- `FigureExtractionServiceTest` — unit tests for base64 decode, size filtering, classification
- `NeurosurgeryRetrieverTest` — unit tests for dual-search merge and deduplication
- `BookEmbeddingServiceIntegrationTest` — integration test: upload PDF with known figures,
  verify figures appear in `GET /api/v1/books/{id}/figures`

@@ -1,10 +1,10 @@
# Research: Enhanced Embedding with Image Parsing and Metadata

**Branch**: `002-image-aware-embedding` | **Date**: 2026-04-03
**Branch**: `002-image-aware-embedding` | **Date**: 2026-04-04 (updated: Marker replaces Google Document AI)

This document resolves all technical unknowns identified during planning. The primary source for
decisions is the detailed architecture provided directly by the project owner, supplemented by
Spring AI 2.0.0-M4 API specifics.
This document resolves all technical unknowns identified during planning. Decisions 1–10 cover
the core pipeline. The **Marker Study** section at the bottom explains why Marker was chosen
over Google Document AI to drive PDF parsing and figure extraction.

---

@@ -28,19 +28,29 @@ association explicit and queryable.

---

## Decision 2: Image Extraction Strategy
## Decision 2: Document Parsing Strategy

**Decision**: Use PDFBox (already on classpath via `spring-ai-pdf-document-reader`) to extract
images per page. Each image is tagged with `page`, `figure_id` (derived from caption, e.g.
"Fig. 12-4"), and the parent `sectionId`. Images are saved to local disk under
`/uploads/figures/{bookId}/`.
**Decision**: Use **Marker** (local HTTP server, `http://localhost:8000/marker/upload`) as the
single entry point for PDF parsing. A single `POST` with `output_format=json` returns:
- Reading-order text blocks (headings, paragraphs) — no column-split heuristic needed
- Pre-cropped figure images as base64-encoded PNG in the `images` map of each `Figure` block
- Table, equation, and code blocks as structured HTML

**Rationale**: PDFBox is already present (Spring AI bundles it). No new dependency needed.
Per-page extraction ensures every image is captured regardless of PDF structure.
`MarkerPageParser` translates the Marker JSON response into `List<PageResult>`, which is the
same internal DTO used by the rest of the pipeline.

**Rationale**: Marker handles column reordering, scanned-page OCR, and figure cropping in one
call, eliminating the PDFBox column heuristic (`PdfStructureParser`) and the PDFBox
render+crop loop in `FigureExtractionService`. Net result: fewer classes, no cloud dependency,
no GCP credentials.

**Alternatives considered**:
- iText / iText7 → additional commercial dependency; overkill for extraction
- Screenshot each page as PNG, then OCR → far slower; loses vector quality
- PDFBox column heuristic (previous approach) → rejected: 50/50 split fails on asymmetric
  columns and scanned pages
- Google Document AI Layout Parser → rejected: adds GCP credentials, per-page billing, 15-page
  batch limit, and still requires PDFBox to render+crop figure regions from bounding boxes.
  See Marker Study below for detailed comparison.
- Screenshot each page + OCR → far slower; loses digital text quality

---

@@ -103,18 +113,19 @@ search. This is the higher-recall path; dual search (Decision 4) is the higher-p

## Decision 6: Image Storage

**Decision**: Extracted images are saved as PNG files to a local directory
(`${app.figure-storage.base-path}`, defaults to `./uploads/figures/{bookId}/`). The path is
stored in `figure.image_path` in Postgres. A `FigureStorageService` interface wraps all disk
I/O so the implementation can be swapped to S3 or another object store without changing
callers.
**Decision**: Marker returns figure images as base64-encoded PNG bytes in the JSON response.
`FigureExtractionService` decodes these bytes and passes them to `FigureStorageService`, which
persists them to an S3-compatible bucket (`${app.figure-storage.bucket}`). The image path/URL
is stored in `figure.image_path` in Postgres.

**Rationale**: Local disk is the simplest viable option for a POC with <10 users. The interface
boundary satisfies Constitution Principle II (Easy to Change).
The `FigureStorageService` interface is unchanged; only the caller changes (from PDFBox crop
to base64 decode).

**Rationale**: Marker's pre-cropped images remove the need for PDFBox rendering.
`FigureStorageService` interface boundary satisfies Constitution Principle II (Easy to Change).

**Alternatives considered**:
- S3 from day 1 → operational overhead not justified at POC scale
- Base64 in Postgres JSONB → bloats DB; complicates backup; query performance degrades
- Store base64 in Postgres JSONB → bloats DB; complicates backup; query performance degrades

---

@@ -123,7 +134,8 @@ boundary satisfies Constitution Principle II (Easy to Change).
**Decision**: Use the enum `FigureType { ANATOMICAL_DIAGRAM, SURGICAL_PHOTOGRAPH, MRI_CT_SCAN,
TABLE, CHART, INTRAOPERATIVE_IMAGE }`. Classification is derived from:
1. Caption keywords ("MRI", "CT", "Fig.", "Table") — heuristic, no model needed
2. Fall back to `ANATOMICAL_DIAGRAM` if unclassifiable
2. Marker `block_type` hint (`"Table"` → TABLE, `"Figure"` / `"Picture"` → ANATOMICAL_DIAGRAM default)
3. Fall back to `ANATOMICAL_DIAGRAM` if unclassifiable

**Rationale**: Allows the frontend to render different icon/label per type (e.g., "MRI" badge).
Heuristic classification avoids a separate model call per image at extraction time.
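
A minimal sketch of this classification order (caption keywords first, then the Marker `block_type` hint, then the default) is shown below. The exact keyword-to-type mapping is an assumption made for illustration: the text above lists the keywords but does not say which type each one selects, so here "MRI"/"CT" map to `MRI_CT_SCAN` and a caption starting with "Table" maps to `TABLE`.

```java
import java.util.Locale;
import java.util.regex.Pattern;

final class FigureClassifier {

    enum FigureType {
        ANATOMICAL_DIAGRAM, SURGICAL_PHOTOGRAPH, MRI_CT_SCAN, TABLE, CHART, INTRAOPERATIVE_IMAGE
    }

    /** Whole-word match so that "ct" inside words like "section" does not count. */
    private static final Pattern SCAN_KEYWORDS = Pattern.compile("\\b(mri|ct)\\b");

    static FigureType classify(String caption, String blockType) {
        String text = caption == null ? "" : caption.toLowerCase(Locale.ROOT).trim();
        // Step 1: caption keywords (assumed mapping, see lead-in).
        if (SCAN_KEYWORDS.matcher(text).find()) {
            return FigureType.MRI_CT_SCAN;
        }
        if (text.startsWith("table")) {
            return FigureType.TABLE;
        }
        // Step 2: Marker block_type hint.
        if ("Table".equals(blockType)) {
            return FigureType.TABLE;
        }
        // Step 3: default when nothing else applies.
        return FigureType.ANATOMICAL_DIAGRAM;
    }
}
```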
@@ -175,14 +187,225 @@ the process fails mid-way. An explicit, idempotent trigger is safer and more obs

## Decision 10: Minimum Image Size Threshold

**Decision**: Images smaller than 100×100 pixels are discarded and no chunk is created. This
threshold filters out decorative elements (bullets, dividers, publisher logos) without a
classification model.
**Decision**: Images smaller than 100×100 pixels are discarded and no chunk is created. Marker
returns PNG bytes; `FigureExtractionService` decodes to `BufferedImage` solely to check
dimensions. This threshold filters out decorative elements without a classification model.

**Rationale**: Neurosurgery textbook diagrams and MRI scans are never smaller than 100×100 px.
The threshold is configurable via `app.figure-storage.min-image-size-px` in
`application.properties`.
The threshold is configurable via `app.figure-storage.min-image-size-px`.

**Alternatives considered**:
- No threshold → decorative icons pollute the figure index
- ML-based classification → accurate but adds model dependency; not needed at POC scale
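
The dimension check described in this decision could be sketched as follows, using only `javax.imageio`. This version treats an image as too small when either side is below the threshold, which is one reading of "smaller than 100×100" and therefore an assumption. `blankPng` is a helper added purely so the sketch can be tried without a real figure.

```java
import java.awt.image.BufferedImage;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import javax.imageio.ImageIO;

final class FigureSizeFilter {

    /** Returns true when the bytes decode to an image at least minSizePx wide and tall. */
    static boolean isLargeEnough(byte[] imageBytes, int minSizePx) {
        try {
            BufferedImage image = ImageIO.read(new ByteArrayInputStream(imageBytes));
            if (image == null) {
                return false; // ImageIO found no reader for these bytes
            }
            return image.getWidth() >= minSizePx && image.getHeight() >= minSizePx;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Test helper: encodes an empty RGB image of the given size as PNG. */
    static byte[] blankPng(int width, int height) {
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            ImageIO.write(new BufferedImage(width, height, BufferedImage.TYPE_INT_RGB), "png", out);
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```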

---

# Marker Study — Why Marker Replaces Google Document AI

*Added 2026-04-04.*

## What Marker Offers

Marker is an open-source, locally-runnable PDF-to-structured-content converter that uses a
pipeline of deep-learning models (surya for OCR + layout detection, texify for equations).
Key capabilities relevant to this project:

| Capability | Marker | Google Document AI |
|-----------|--------|--------------------|
| Multi-column reading order | ✅ | ✅ |
| OCR on scanned pages | ✅ | ✅ |
| Figure detection | ✅ returns pre-cropped images | ⚠️ returns bbox only; PDFBox still needed |
| Table extraction | ✅ HTML tables | ✅ |
| JSON output with image bytes | ✅ base64 in `images` map | ❌ |
| No cloud credentials | ✅ | ❌ GCP service account required |
| No per-page billing | ✅ | ❌ ~$10/1,000 pages |
| Batch size limits | None (local) | 15 pages / 20 MB per sync call |
| Setup | `pip install marker-pdf && marker_server` | GCP project + processor + IAM |

---

## Does Marker Solve the Current Pain Points?

### Pain Point 1: Naive 50/50 Column Split

**Answer: Yes, Marker fixes this completely.**

`PdfStructureParser.extractPageText()` splits pages at the horizontal midpoint with a 20%
threshold. This fails on asymmetric columns and scanned pages. Marker's surya layout model
returns blocks in natural reading order — no heuristic needed.

### Pain Point 2: Figure Detection Misses Rasterized Figures

**Answer: Yes, Marker fixes this for most cases.**

`FigureExtractionService` previously iterated PDF XObjects (only finds embedded XObject images,
misses rasterized figures and vector-path drawings). Marker's layout model detects visual
elements by type and returns the cropped image bytes directly — no PDFBox page rendering needed.

### Pain Point 3: OCR on Scanned Pages

**Answer: Yes, Marker handles scanned pages transparently via surya OCR.**

### Pain Point 4: Caption Detection

**Answer: Improved — Marker groups caption blocks with their figure block.**

The `block_type = "Caption"` block appears as a sibling or child adjacent to the `"Figure"`
block in the Marker JSON, making caption association structural rather than regex-based.
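
To make the structural association concrete, the sketch below pairs each `Figure` or `Picture` block with the `Caption` block that immediately follows it among its siblings. `Block` is a hypothetical flattened view of a Marker child block introduced for this example; the sketch covers only the sibling case, not a caption nested as a child.

```java
import java.util.ArrayList;
import java.util.List;

final class CaptionPairing {

    /** Hypothetical flattened view of one Marker child block (text already HTML-stripped). */
    record Block(String blockType, String id, String text) {}

    /** A figure block ID with its caption text, or null when no caption follows. */
    record FigureWithCaption(String figureId, String caption) {}

    static boolean isFigure(Block block) {
        return "Figure".equals(block.blockType()) || "Picture".equals(block.blockType());
    }

    /** Walks siblings in reading order and attaches the directly following Caption, if any. */
    static List<FigureWithCaption> pair(List<Block> siblings) {
        List<FigureWithCaption> result = new ArrayList<>();
        for (int i = 0; i < siblings.size(); i++) {
            Block current = siblings.get(i);
            if (!isFigure(current)) {
                continue;
            }
            Block next = i + 1 < siblings.size() ? siblings.get(i + 1) : null;
            boolean hasCaption = next != null && "Caption".equals(next.blockType());
            result.add(new FigureWithCaption(current.id(), hasCaption ? next.text() : null));
        }
        return result;
    }
}
```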

---

## Marker API Integration

### Local Server Setup

```bash
pip install marker-pdf
marker_server --port 8000
```

The server exposes `POST /marker/upload` (the user's configured endpoint).

### Request

```
POST http://localhost:8000/marker/upload
Content-Type: multipart/form-data

file=@document.pdf
output_format=json
```

### Response (abbreviated)

```json
{
  "output_format": "json",
  "output": {
    "block_type": "Document",
    "children": [
      {
        "block_type": "Page",
        "id": "/page/0/Page/0",
        "children": [
          {
            "block_type": "SectionHeader",
            "id": "/page/0/SectionHeader/0",
            "html": "<h1>Cavernous Sinus Anatomy</h1>"
          },
          {
            "block_type": "Text",
            "id": "/page/0/Text/1",
            "html": "<p>The cavernous sinus contains...</p>"
          },
          {
            "block_type": "Figure",
            "id": "/page/0/Figure/2",
            "html": "<figure><img src='/page/0/Figure/2'/></figure>",
            "images": {
              "/page/0/Figure/2": "iVBORw0KGgo..."
            }
          },
          {
            "block_type": "Caption",
            "id": "/page/0/Caption/3",
            "html": "<p>Fig. 12-4. Coronal cross-section...</p>"
          }
        ]
      }
    ],
    "metadata": { "page_stats": [...] }
  }
}
```
|
||||
|
||||
### Java Integration Pattern

```java
import com.fasterxml.jackson.databind.JsonNode;
import org.springframework.core.io.FileSystemResource;
import org.springframework.http.MediaType;
import org.springframework.util.LinkedMultiValueMap;
import org.springframework.util.MultiValueMap;

// MarkerPageParser: core call
// Build the multipart form: the PDF itself plus the requested output format.
MultiValueMap<String, Object> body = new LinkedMultiValueMap<>();
body.add("file", new FileSystemResource(pdfPath));
body.add("output_format", "json");

JsonNode response = restClient.post()
        .uri(baseUrl + "/marker/upload")
        .contentType(MediaType.MULTIPART_FORM_DATA)
        .body(body)
        .retrieve()
        .body(JsonNode.class);

// "output" holds the root Document block; its children are the Page blocks.
JsonNode document = response.get("output");
```

### Mapping Marker Blocks to PageResult

```
Page block (id "/page/N/Page/M")   → PageResult(pageNumber = N+1)
SectionHeader children             → headingTitle (first match)
Text, TextInlineMath children      → orderedText (HTML stripped, joined \n\n)
Figure children with images map    → FigureData(imageBytes = base64decode(images[id]))
Caption sibling of Figure          → FigureData.nearestCaption
```
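
A minimal sketch of the first and third mapping rules follows, using only the JDK. It assumes block ids always start with `/page/N/` and that block HTML is simple enough for naive tag stripping; a real parser would likely use an HTML library and decode entities. `MarkerMappingSketch` and its method names are hypothetical and are not taken from `MarkerPageParser`.

```java
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

/** Hypothetical helper illustrating the mapping rules; not the actual MarkerPageParser. */
final class MarkerMappingSketch {

    private static final Pattern PAGE_ID = Pattern.compile("^/page/(\\d+)/");
    private static final Pattern TAG = Pattern.compile("<[^>]+>");

    /** "/page/N/..." maps to N + 1 (ids are 0-based, PageResult is 1-based). */
    static int pageNumber(String blockId) {
        Matcher m = PAGE_ID.matcher(blockId);
        if (!m.find()) {
            throw new IllegalArgumentException("Unexpected block id: " + blockId);
        }
        return Integer.parseInt(m.group(1)) + 1;
    }

    /** Naive tag stripping; assumes simple HTML and does not decode entities. */
    static String stripHtml(String html) {
        return TAG.matcher(html).replaceAll("").trim();
    }

    /** Joins the stripped text of each block with a blank line, in document order. */
    static String orderedText(List<String> textBlockHtml) {
        return textBlockHtml.stream()
                .map(MarkerMappingSketch::stripHtml)
                .filter(s -> !s.isEmpty())
                .collect(Collectors.joining("\n\n"));
    }
}
```

For the abbreviated response shown earlier, `pageNumber("/page/0/Page/0")` gives page 1, and the `SectionHeader` HTML strips down to `Cavernous Sinus Anatomy`.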

---

## Architecture Change

```
Before (Document AI, removed):
  DocumentAiPageParser
    → Google Document AI API (GCP, 15-page batches, credentials)
    → returns text blocks + figure bboxes
  PdfStructureParser (PDFBox column heuristic)
  FigureExtractionService
    → renders page via PDFBox at 150 DPI
    → crops bbox region

After (Marker):
  MarkerPageParser
    → POST PDF to http://localhost:8000/marker/upload (output_format=json)
    → returns text blocks (correct reading order) + Figure blocks with base64 images
    → produces List<PageResult> (same DTO, FigureData carries bytes not bbox)
  FigureExtractionService (simplified)
    → base64-decodes image bytes from PageResult.FigureData
    → checks min size (ImageIO.read → getWidth/getHeight)
    → saves to S3 via FigureStorageService (UNCHANGED)
  VisionDescriptionService (UNCHANGED)
  BookEmbeddingService orchestration (MINOR: inject MarkerPageParser)
```
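
The simplified decode-and-size-check step can be sketched as below, using only JDK classes. It assumes the image arrives as a base64 string, as in Marker's `images` map, and that a figure is skipped when either side is below the minimum. `FigureDecodeSketch`, its method names, and the `minSizePx` parameter are hypothetical; the project's `FigureExtractionService` may differ.

```java
import java.awt.image.BufferedImage;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Base64;
import java.util.Optional;
import javax.imageio.ImageIO;

/** Hypothetical sketch of the decode + minimum-size check; not the project's service. */
final class FigureDecodeSketch {

    /** Decodes the image; empty if it is unreadable or smaller than minSizePx on either side. */
    static Optional<BufferedImage> decodeIfLargeEnough(String base64Image, int minSizePx) {
        // Throws IllegalArgumentException if the input is not valid base64.
        byte[] bytes = Base64.getDecoder().decode(base64Image);
        try {
            // ImageIO.read returns null when no registered reader understands the bytes.
            BufferedImage image = ImageIO.read(new ByteArrayInputStream(bytes));
            if (image == null
                    || image.getWidth() < minSizePx
                    || image.getHeight() < minSizePx) {
                return Optional.empty();
            }
            return Optional.of(image);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Demo helper: a blank PNG of the given size, base64-encoded. */
    static String blankPngBase64(int width, int height) {
        BufferedImage image = new BufferedImage(width, height, BufferedImage.TYPE_INT_RGB);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try {
            ImageIO.write(image, "png", out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return Base64.getEncoder().encodeToString(out.toByteArray());
    }
}
```

With this shape, no `PDDocument` or page render is needed: a 200×120 image passes a 100 px threshold and is skipped at 150 px.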

**What is removed**:
- `DocumentAiPageParser` (replaced by `MarkerPageParser`)
- `DocumentAiConfig` (replaced by `MarkerConfig`)
- `PdfStructureParser` (Marker handles reading order)
- `google-cloud-document-ai` Maven dependency
- `app.document-ai.*` configuration properties

**What stays the same**:
- `PageResult` DTO structure (fields renamed, not restructured)
- `FigureExtractionService` public interface
- `TextChunkingService`, `VisionDescriptionService`, `BookEmbeddingService` orchestration
- All JPA entities, repositories, vector store, S3 storage

---

## Constitution Compliance

| Principle | Assessment |
|-----------|------------|
| **I. KISS** | ✅ Simpler than Document AI: one HTTP call replaces the GCP SDK + PDFBox render loop. No new dependency beyond an HTTP client (Spring RestClient, already available). |
| **II. Easy to Change** | ✅ `MarkerPageParser` is the only Marker-aware class; swap it out to use any other parser. `PageResult` DTO unchanged in contract. |
| **III. Web-First** | ✅ Internal pipeline change; no API contract change. |
| **IV. Documentation** | ✅ README must show Marker as a local external service dependency. |

---

## Risks & Mitigations

| Risk | Likelihood | Mitigation |
|------|-----------|------------|
| Marker server not running when book is uploaded | Medium | `BookEmbeddingService` catches exception from `MarkerPageParser`, marks book as `FAILED`, logs full error. |
| Marker misses some figures (complex PDFs) | Medium | `app.figure-storage.min-image-size-px` threshold can be tuned. Add fallback: if Marker returns 0 figures for a page with known images, log a warning. |
| SC-003 (≤ 3× processing time) violated | Low | Marker runs locally (no network latency to cloud). Benchmark with a real 500-page book early. |
| Large PDF upload to Marker (>100MB) | Low | Marker server handles the full file; no batching needed. Multipart upload limit configurable. |
| Marker image quality vs PDFBox crop | Low | Marker crops at native resolution; quality is equivalent or better than 150 DPI PDFBox render. |
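
The first mitigation (catch the parser failure, mark the book `FAILED`, log the error) might look roughly like the sketch below. It stands in a plain `Supplier` for the parser call and a `StringBuilder` for the logger; `EmbeddingFailureSketch`, `runParse`, and the `BookStatus` enum are hypothetical names, with status values taken from those mentioned in the task list.

```java
import java.util.List;
import java.util.function.Supplier;

/** Hypothetical sketch of the failure path; not the real BookEmbeddingService. */
final class EmbeddingFailureSketch {

    enum BookStatus { PROCESSING, READY, FAILED }

    /**
     * Runs the parse step. Any runtime failure (for example, a refused
     * connection to the Marker server) is recorded and mapped to FAILED.
     */
    static BookStatus runParse(Supplier<List<String>> parser, StringBuilder errorLog) {
        try {
            parser.get();
            return BookStatus.READY;
        } catch (RuntimeException e) {
            errorLog.append(e); // stand-in for logging the full error
            return BookStatus.FAILED;
        }
    }
}
```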

@@ -48,12 +48,13 @@

**Independent Test**: Upload a PDF containing at least one page with a labelled anatomical diagram. After status shows `READY`, call `GET /api/v1/books/{id}/figures`; the response must contain at least one entry with `figureType`, `caption`, `page`, and `imageUrl` populated. Verify the PNG file exists at the path in `imagePath`.

- [X] T013 [US2] Create `PdfStructureParser` service in `backend/src/main/java/com/aiteacher/document/PdfStructureParser.java` — uses Spring AI's `PagePdfDocumentReader` to extract per-page text; groups pages into `SectionEntity` records using heading-detection heuristics (lines matching `^\d+(\.\d+)*\s+[A-Z]`); groups sections into `ChapterEntity` records; persists both to Postgres via `ChapterRepository` and `SectionRepository`; returns `List<SectionEntity>` for the book
- [X] T014 [US2] Create `FigureExtractionService` in `backend/src/main/java/com/aiteacher/document/FigureExtractionService.java` — opens PDF with PDFBox `PDDocument`; iterates pages; extracts `PDImageXObject` instances; skips images whose width or height are below `min-image-size-px`; classifies `FigureType` using the keyword-matching table from data-model.md §FigureType; parses caption from the nearest text line matching `CAPTION_PATTERN`; saves PNG via `FigureStorageService`; persists `FigureEntity` to `FigureRepository`; returns `List<FigureEntity>` per book
- [X] T013 [US2] ~~Create `PdfStructureParser`~~ → **SUPERSEDED**: PDF parsing is handled by `MarkerPageParser` (see T013b). `PdfStructureParser` exists but is not wired into the pipeline.
- [X] T013b [US2] Create `MarkerPageParser` in `backend/src/main/java/com/aiteacher/document/MarkerPageParser.java` — POSTs PDF to `http://localhost:8000/marker/upload?output_format=json` via Spring `RestClient`; parses JSON response into `List<PageResult>` (one per page block); extracts heading, ordered text, and pre-cropped figure PNG bytes per page
- [X] T014 [US2] Update `FigureExtractionService` in `backend/src/main/java/com/aiteacher/document/FigureExtractionService.java` — **Marker migration**: removed PDFBox rendering + bbox-crop loop; decodes PNG bytes from `PageResult.FigureData` via `ImageIO.read()`; skips images below `min-image-size-px`; classifies `FigureType`; saves via `FigureStorageService`; persists `FigureEntity`
- [X] T015 [US2] Create `VisionDescriptionService` in `backend/src/main/java/com/aiteacher/document/VisionDescriptionService.java` — accepts a `Path` to a PNG and a caption String; calls the OpenAI vision model (via Spring AI `ChatClient` with image media type) to generate a 2–4 sentence clinical description; returns the generated description string; handles API failures by returning the caption as fallback
- [X] T016 [US2] Create `TextChunkingService` in `backend/src/main/java/com/aiteacher/document/TextChunkingService.java` — accepts a `SectionEntity`; splits `fullText` into overlapping 400–600 token windows (20-token overlap); wraps each window in a Spring AI `Document` with the flat metadata map defined in data-model.md §Text chunk document; returns `List<Document>`
- [X] T017 [US2] Create `ChunkFigureRefService` in `backend/src/main/java/com/aiteacher/document/ChunkFigureRefService.java` — accepts a Spring AI `Document` (with its `id` as `chunkId`) and a `List<FigureEntity>` for the book; scans chunk text for patterns `Fig\.\s*\d+[\-\.]\d+` and `Figure\s+\d+[\-\.]\d+`; matches against figure labels; persists `ChunkFigureRefEntity` rows via `ChunkFigureRefRepository`
- [X] T018 [US2] Rewrite `BookEmbeddingService.embedBook()` in `backend/src/main/java/com/aiteacher/book/BookEmbeddingService.java` to orchestrate the full pipeline: (1) `PdfStructureParser` → sections; (2) parallel: `FigureExtractionService` + `TextChunkingService` for each section; (3) `VisionDescriptionService` for each figure; (4) embed figure captions+descriptions as `Document`s (metadata per data-model.md §Figure caption document) into `vectorStore`; (5) embed text chunks into `vectorStore`; (6) `ChunkFigureRefService` for each chunk; update `captionEmbeddingId` on `FigureEntity` after embedding
- [X] T018 [US2] Update `BookEmbeddingService.embedBook()` — **Marker migration**: injected `MarkerPageParser` replacing `DocumentAiPageParser`; updated `figureExtractionService.extract()` call (removed `pdfPath` arg); updated log message. Pipeline: (1) `MarkerPageParser` → `List<PageResult>`; (2) `buildAndSaveSections()` → sections; (3) `TextChunkingService` → chunks → embed; (4) `FigureExtractionService.extract()` → figures; (5) `VisionDescriptionService` → embed figure chunks; (6) `ChunkFigureRefService` → refs
- [X] T019 [US2] Extend `BookEmbeddingService.deleteBookChunks()` to also delete: all `ChunkFigureRefEntity` rows (via `findByFigureIdIn`), all `FigureEntity` rows (via `deleteAllByBookId`), all figure PNG files (via `FigureStorageService.delete(bookId)`), all `SectionEntity` and `ChapterEntity` rows for the book
- [X] T020 [US2] Add `POST /api/v1/books/{id}/reembed` endpoint to `BookController` in `backend/src/main/java/com/aiteacher/book/BookController.java` — returns `202` with `{ bookId, status: "PROCESSING" }`; returns `404` if not found; returns `409` if already `PROCESSING`; calls `deleteBookChunks()` then `embedBook()` asynchronously