Add Marker for more effective PDF parsing

2026-04-04 21:30:18 +02:00
parent b154e29f2d
commit ea1276dc2e
25 changed files with 2318 additions and 285 deletions
@@ -1,34 +1,67 @@
 # Quickstart: Enhanced Embedding with Image Parsing and Metadata

-**Branch**: `002-image-aware-embedding` | **Date**: 2026-04-03
+**Branch**: `002-image-aware-embedding` | **Date**: 2026-04-04 (updated: Marker replaces Google Document AI)

 ---

 ## Prerequisites

 - Docker Compose running (PostgreSQL + pgvector)
- OpenAI API key set in `backend/src/main/resources/application.properties` or as env var `OPENAI_API_KEY`
+- OpenAI API key set as env var `OPENAI_API_KEY`
 - Java 25 + Maven on PATH
+- **Marker server running** on `http://localhost:8000` (see setup below)
+- S3-compatible bucket configured (existing setup)

 ---

-## New Configuration
+## Marker Server Setup (one-time)

-Add to `backend/src/main/resources/application.properties`:
+Marker is a local Python service — no cloud credentials required.

-```properties
-# Figure storage
-app.figure-storage.base-path=./uploads
-app.figure-storage.min-image-size-px=100
+```bash
+# Install (Python 3.10+ required)
+pip install marker-pdf
+
+# Start the server on port 8000
+marker_server --port 8000
 ```

-The `uploads/figures/` directory is created automatically on first use. Add it to `.gitignore`.
+The server is ready when you see:
+```
+INFO:     Uvicorn running on http://0.0.0.0:8000
+```
+
+Keep the server running in the background, for example as a `systemd` service or inside a `screen` session.
+
+---
+
+## Backend Configuration
+
+Add or update the following settings in `backend/src/main/resources/application.yaml`:
+
+```yaml
+app:
+  figure-storage:
+    endpoint: https://your-s3-endpoint
+    region: your-region
+    bucket: ${S3_BUCKET:aiteacher}
+    access-key-id: ${S3_ACCESS_KEY_ID}
+    secret-access-key: ${S3_SECRET_ACCESS_KEY}
+    min-image-size-px: 100   # skip decorative images smaller than 100×100 px
+  marker:
+    base-url: ${MARKER_BASE_URL:http://localhost:8000}
+  embedding:
+    batch-size: 20
+    batch-delay-ms: 2000
+```
+
+No GCP credentials or project IDs are needed.

 ---

 ## Database Migration

-Two new Flyway migrations run automatically on startup:
+Two Flyway migrations run automatically on startup:

 - `V4__document_hierarchy.sql` — adds `chapter` and `section` tables
 - `V5__figures_and_refs.sql` — adds `figure` and `chunk_figure_ref` tables
@@ -54,10 +87,11 @@ image-aware pipeline runs. Status can be polled via `GET /api/v1/books`.

 ## Verifying Image Extraction

-1. Upload a PDF with diagrams: `POST /api/v1/books/upload`
-2. Wait for `status: "READY"` via `GET /api/v1/books`
-3. List figures: `GET /api/v1/books/{id}/figures` — should return at least one entry per image page
-4. Ask a diagram-specific question in chat — response `sources` should include a `type: "FIGURE"` entry
+1. Ensure Marker is running: `curl http://localhost:8000` should return an HTTP response rather than a connection error.
+2. Upload a PDF with diagrams: `POST /api/v1/books/upload`
+3. Wait for `status: "READY"` via `GET /api/v1/books`
+4. List figures: `GET /api/v1/books/{id}/figures` — should return at least one entry for each page that contains an image
+5. Ask a diagram-specific question in chat — response `sources` should include a `type: "FIGURE"` entry

 ---

@@ -80,7 +114,8 @@ mvn test
 ```

 Key new test classes:
- `FigureExtractionServiceTest` — unit tests for image extraction and classification
+- `MarkerPageParserTest` — unit tests for JSON parsing and block-to-PageResult mapping
+- `FigureExtractionServiceTest` — unit tests for base64 decoding, size filtering, and classification
 - `NeurosurgeryRetrieverTest` — unit tests for dual-search merge and deduplication
 - `BookEmbeddingServiceIntegrationTest` — integration test: upload PDF with known figures,
  verify figures appear in `GET /api/v1/books/{id}/figures`
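
As a rough illustration of the "base64 decode, size filtering" behaviour that `FigureExtractionServiceTest` is described as covering, the sketch below decodes a base64 image and drops it when it falls under the `min-image-size-px` threshold. It is not the project's actual `FigureExtractionService`: the class name `FigureFilterSketch` and the method `decodeIfLargeEnough` are hypothetical, and it assumes that an image is kept only when both its width and its height reach the threshold.

```java
import java.awt.image.BufferedImage;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.util.Base64;
import java.util.Optional;
import javax.imageio.ImageIO;

// Hypothetical helper for illustration; names and exact filtering rule are assumptions.
public final class FigureFilterSketch {

    private FigureFilterSketch() {}

    /**
     * Decodes a base64-encoded image and returns it only if both sides are
     * at least minSizePx pixels. Returns empty for invalid base64, for bytes
     * that are not a readable image, and for images below the threshold.
     */
    public static Optional<BufferedImage> decodeIfLargeEnough(String base64, int minSizePx) {
        final byte[] bytes;
        try {
            bytes = Base64.getDecoder().decode(base64);
        } catch (IllegalArgumentException e) {
            return Optional.empty(); // input was not valid base64
        }
        try {
            BufferedImage image = ImageIO.read(new ByteArrayInputStream(bytes));
            if (image == null) {
                return Optional.empty(); // no ImageIO reader recognised the bytes
            }
            boolean largeEnough =
                    image.getWidth() >= minSizePx && image.getHeight() >= minSizePx;
            return largeEnough ? Optional.of(image) : Optional.empty();
        } catch (IOException e) {
            return Optional.empty(); // image data was corrupt or truncated
        }
    }
}
```

With `min-image-size-px: 100`, a call such as `FigureFilterSketch.decodeIfLargeEnough(payload, 100)` would keep a 120×150 diagram and skip a 40×40 icon under the assumed rule.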