Add Marker to parse PDFs effectively

Adrien
2026-04-04 21:30:18 +02:00
parent b154e29f2d
commit ea1276dc2e
25 changed files with 2318 additions and 285 deletions
+50 -15
@@ -1,34 +1,67 @@
# Quickstart: Enhanced Embedding with Image Parsing and Metadata
**Branch**: `002-image-aware-embedding` | **Date**: 2026-04-03
**Branch**: `002-image-aware-embedding` | **Date**: 2026-04-04 (updated: Marker replaces Google Document AI)
---
## Prerequisites
- Docker Compose running (PostgreSQL + pgvector)
- OpenAI API key set in `backend/src/main/resources/application.properties` or as env var `OPENAI_API_KEY`
- OpenAI API key set as env var `OPENAI_API_KEY`
- Java 25 + Maven on PATH
- **Marker server running** on `http://localhost:8000` (see setup below)
- S3-compatible bucket configured (existing setup)
---
## New Configuration
## Marker Server Setup (one-time)
Add to `backend/src/main/resources/application.properties`:
Marker is a local Python service — no cloud credentials required.
```properties
# Figure storage
app.figure-storage.base-path=./uploads
app.figure-storage.min-image-size-px=100
```bash
# Install (Python 3.10+ required)
pip install marker-pdf
# Start the server on port 8000
marker_server --port 8000
```
The `uploads/figures/` directory is created automatically on first use. Add it to `.gitignore`.
The server is ready when you see:
```
INFO: Uvicorn running on http://0.0.0.0:8000
```
Keep the server running in the background, for example as a `systemd` service or inside a `screen` session.
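Before starting the backend, it can help to confirm that something is listening at the configured Marker URL. The following is a minimal sketch, not part of this commit: `MarkerProbe`, `portOf` and `isReachable` are hypothetical names, and the check only opens a TCP connection, so it cannot tell whether the process behind the port is actually Marker.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.net.URI;

// Hypothetical helper (not part of the commit): checks that a TCP port accepts connections.
final class MarkerProbe {

    // Resolve the port from a base URL, falling back to the scheme default when none is given.
    static int portOf(String baseUrl) {
        URI uri = URI.create(baseUrl);
        if (uri.getPort() != -1) {
            return uri.getPort();
        }
        return "https".equalsIgnoreCase(uri.getScheme()) ? 443 : 80;
    }

    // True if a TCP connection to the URL's host and port succeeds within timeoutMs.
    static boolean isReachable(String baseUrl, int timeoutMs) {
        String host = URI.create(baseUrl).getHost();
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, portOf(baseUrl)), timeoutMs);
            return true;
        } catch (IOException e) {
            return false; // refused, timed out, or unresolvable host
        }
    }

    public static void main(String[] args) {
        // Falls back to the port used in the setup commands above.
        String baseUrl = System.getenv().getOrDefault("MARKER_BASE_URL", "http://localhost:8000");
        System.out.println(baseUrl + (isReachable(baseUrl, 2000) ? " is reachable" : " is NOT reachable"));
    }
}
```

A successful connection only shows that the port is open; the upload flow remains the real end-to-end check.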
---
## Backend Configuration
Add or update `backend/src/main/resources/application.yaml`:
```yaml
app:
  figure-storage:
    endpoint: https://your-s3-endpoint
    region: your-region
    bucket: ${S3_BUCKET:aiteacher}
    access-key-id: ${S3_ACCESS_KEY_ID}
    secret-access-key: ${S3_SECRET_ACCESS_KEY}
    min-image-size-px: 100   # skip decorative images smaller than 100×100 px
  marker:
    base-url: ${MARKER_BASE_URL:http://localhost:8000}
  embedding:
    batch-size: 20
    batch-delay-ms: 2000
```
No GCP credentials or project IDs are needed.
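The `embedding` keys suggest that chunks are sent to the embedding API in groups, with a pause between groups. The commit page does not show that code, so the following is only a sketch of one way the two values could be used; `EmbeddingBatcher`, `partition` and `forEachBatch` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical sketch: how batch-size and batch-delay-ms might drive embedding calls.
final class EmbeddingBatcher {

    // Split items into consecutive batches of at most batchSize elements.
    static <T> List<List<T>> partition(List<T> items, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int start = 0; start < items.size(); start += batchSize) {
            int end = Math.min(start + batchSize, items.size());
            batches.add(new ArrayList<>(items.subList(start, end)));
        }
        return batches;
    }

    // Hand each batch to the sender, sleeping delayMs between batches (not after the last one).
    static <T> void forEachBatch(List<T> items, int batchSize, long delayMs, Consumer<List<T>> sender) {
        List<List<T>> batches = partition(items, batchSize);
        for (int i = 0; i < batches.size(); i++) {
            sender.accept(batches.get(i));
            if (i < batches.size() - 1) {
                try {
                    Thread.sleep(delayMs);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return; // stop early if the thread is interrupted
                }
            }
        }
    }

    public static void main(String[] args) {
        List<String> chunks = new ArrayList<>();
        for (int i = 0; i < 45; i++) {
            chunks.add("chunk-" + i);
        }
        // Values from the configuration: batch-size 20, batch-delay-ms 2000.
        forEachBatch(chunks, 20, 2000, batch -> System.out.println("sending " + batch.size() + " chunks"));
    }
}
```

With 45 chunks and a batch size of 20, this produces batches of 20, 20 and 5, separated by two 2-second pauses.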
---
## Database Migration
Two new Flyway migrations run automatically on startup:
Two Flyway migrations run automatically on startup:
- `V4__document_hierarchy.sql` — adds `chapter` and `section` tables
- `V5__figures_and_refs.sql` — adds `figure` and `chunk_figure_ref` tables
@@ -54,10 +87,11 @@ image-aware pipeline runs. Status can be polled via `GET /api/v1/books`.
## Verifying Image Extraction
1. Upload a PDF with diagrams: `POST /api/v1/books/upload`
2. Wait for `status: "READY"` via `GET /api/v1/books`
3. List figures: `GET /api/v1/books/{id}/figures` — should return at least one entry per image page
4. Ask a diagram-specific question in chat — response `sources` should include a `type: "FIGURE"` entry
1. Ensure Marker is running: `curl http://localhost:8000` should return a response rather than a connection error.
2. Upload a PDF with diagrams: `POST /api/v1/books/upload`
3. Wait for `status: "READY"` via `GET /api/v1/books`
4. List figures with `GET /api/v1/books/{id}/figures`; the response should contain at least one entry for each page that has an image
5. Ask a diagram-specific question in chat — response `sources` should include a `type: "FIGURE"` entry
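The wait for `status: "READY"` can be automated with a small polling loop. This is a sketch under assumptions: `StatusPoller` and `waitForReady` are hypothetical, and the status comes from a caller-supplied function because the JSON shape of `GET /api/v1/books` is not shown here.

```java
import java.util.function.Supplier;

// Hypothetical helper: polls a status source until it reports READY or attempts run out.
final class StatusPoller {

    // Returns true once fetchStatus yields "READY"; false after maxAttempts tries or on interrupt.
    static boolean waitForReady(Supplier<String> fetchStatus, int maxAttempts, long delayMs) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if ("READY".equals(fetchStatus.get())) {
                return true;
            }
            if (attempt < maxAttempts) {
                try {
                    Thread.sleep(delayMs);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return false;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Stand-in for a real call to GET /api/v1/books: reports READY on the third poll.
        // "PROCESSING" is a placeholder for any non-READY status.
        int[] calls = {0};
        Supplier<String> fakeStatus = () -> ++calls[0] < 3 ? "PROCESSING" : "READY";
        System.out.println(waitForReady(fakeStatus, 10, 100) ? "book is ready" : "gave up waiting");
    }
}
```

In real use, the supplier would call the books endpoint and extract the status field of the uploaded book.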
---
@@ -80,7 +114,8 @@ mvn test
```
Key new test classes:
- `FigureExtractionServiceTest` — unit tests for image extraction and classification
- `MarkerPageParserTest` — unit tests for JSON parsing and block-to-PageResult mapping
- `FigureExtractionServiceTest` — unit tests for base64 decode, size filtering, classification
- `NeurosurgeryRetrieverTest` — unit tests for dual-search merge and deduplication
- `BookEmbeddingServiceIntegrationTest` — integration test: upload PDF with known figures,
verify figures appear in `GET /api/v1/books/{id}/figures`
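To make the "base64 decode, size filtering" cases concrete, here is a minimal sketch of such a check. It is not the project's `FigureExtractionService`: `FigureSizeCheck` and its methods are hypothetical, and it assumes an image is dropped when either side is shorter than the threshold, which the quickstart does not spell out.

```java
import java.awt.image.BufferedImage;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Base64;
import javax.imageio.ImageIO;

// Hypothetical sketch, not the project's FigureExtractionService.
final class FigureSizeCheck {

    // Decode a base64 image and keep it only if both sides are at least minSizePx.
    // Assumption: an image is "too small" when either dimension is below the threshold.
    static boolean isLargeEnough(String base64Image, int minSizePx) {
        byte[] bytes = Base64.getDecoder().decode(base64Image);
        try {
            BufferedImage image = ImageIO.read(new ByteArrayInputStream(bytes));
            if (image == null) {
                return false; // bytes are not in a format ImageIO can read
            }
            return image.getWidth() >= minSizePx && image.getHeight() >= minSizePx;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Test helper: build a blank PNG of the given size and return it base64-encoded.
    static String blankPngBase64(int width, int height) {
        BufferedImage image = new BufferedImage(width, height, BufferedImage.TYPE_INT_RGB);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try {
            ImageIO.write(image, "png", out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return Base64.getEncoder().encodeToString(out.toByteArray());
    }

    public static void main(String[] args) {
        // 100 matches min-image-size-px in the backend configuration.
        System.out.println(isLargeEnough(blankPngBase64(120, 150), 100)); // true
        System.out.println(isLargeEnough(blankPngBase64(40, 300), 100));  // false: width below 100
    }
}
```

Generating the fixture images in code keeps such tests free of binary files in the repository.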