Quickstart: Enhanced Embedding with Image Parsing and Metadata
Branch: 002-image-aware-embedding | Date: 2026-04-04 (updated: Marker replaces Google Document AI)
Prerequisites
- Docker Compose running (PostgreSQL + pgvector)
- OpenAI API key set as env var OPENAI_API_KEY
- Java 25 + Maven on PATH
- Marker server running on http://localhost:8000 (see setup below)
- S3-compatible bucket configured (existing setup)
Marker Server Setup (one-time)
Marker is a local Python service — no cloud credentials required.
# Install (Python 3.10+ required)
pip install marker-pdf
# Start the server on port 8000
marker_server --port 8000
The server is ready when you see:
INFO: Uvicorn running on http://0.0.0.0:8000
Keep the server running in the background (or use a process manager like systemd or screen).
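When the server is started from a script, the log line above is not easy to watch for. A minimal Python sketch of a readiness wait follows, assuming it is enough to check that port 8000 accepts TCP connections; the helper names port_is_open and wait_until_ready are hypothetical and are not part of Marker.

```python
import socket
import time
from typing import Callable


def port_is_open(host: str = "localhost", port: int = 8000,
                 timeout_s: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        return False


def wait_until_ready(probe: Callable[[], bool] = port_is_open,
                     timeout_s: float = 120.0,
                     interval_s: float = 2.0,
                     sleep: Callable[[float], None] = time.sleep,
                     clock: Callable[[], float] = time.monotonic) -> bool:
    """Call probe() until it returns True or timeout_s has elapsed."""
    deadline = clock() + timeout_s
    while True:
        if probe():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)


if __name__ == "__main__":
    print("Marker reachable" if wait_until_ready() else "Marker not reachable")
```

This only confirms that something is listening on the port; it does not prove that the service can already convert documents.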
Backend Configuration
Add or update backend/src/main/resources/application.yaml:
app:
  figure-storage:
    endpoint: https://your-s3-endpoint
    region: your-region
    bucket: ${S3_BUCKET:aiteacher}
    access-key-id: ${S3_ACCESS_KEY_ID}
    secret-access-key: ${S3_SECRET_ACCESS_KEY}
    min-image-size-px: 100 # skip decorative images smaller than 100×100 px
  marker:
    base-url: ${MARKER_BASE_URL:http://localhost:8000}
  embedding:
    batch-size: 20
    batch-delay-ms: 2000
No GCP credentials or project IDs are needed.
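The batch-size and batch-delay-ms settings suggest that chunks are sent to the embedding API in fixed-size groups with a pause between groups. The Python sketch below illustrates that pattern only; the real backend is Java, and the names batched, embed_in_batches, and embed_batch are hypothetical stand-ins rather than the project's API.

```python
import time
from typing import Callable, Iterator, List, Sequence


def batched(items: Sequence[str], batch_size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def embed_in_batches(chunks: Sequence[str],
                     embed_batch: Callable[[Sequence[str]], List],
                     batch_size: int = 20,        # mirrors embedding.batch-size
                     batch_delay_ms: int = 2000,  # mirrors embedding.batch-delay-ms
                     sleep: Callable[[float], None] = time.sleep) -> List:
    """Embed chunks group by group, pausing between groups (not before the first)."""
    vectors: List = []
    for index, batch in enumerate(batched(chunks, batch_size)):
        if index > 0:
            sleep(batch_delay_ms / 1000)
        vectors.extend(embed_batch(batch))
    return vectors
```

With the defaults shown, 45 chunks would go out as groups of 20, 20, and 5 with two 2-second pauses, which keeps the request rate predictable at the cost of a longer total embedding time.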
Database Migration
Two Flyway migrations run automatically on startup:
- V4__document_hierarchy.sql: adds the chapter and section tables
- V5__figures_and_refs.sql: adds the figure and chunk_figure_ref tables
No manual DB setup needed.
Re-embedding Existing Books
Books embedded by feature 001 (text-only) remain functional for text queries. To add image support, trigger a re-embed:
curl -X POST http://localhost:8080/api/v1/books/{bookId}/reembed \
-u admin:password
The book transitions to PROCESSING, old chunks and figures are deleted, and the new
image-aware pipeline runs. Status can be polled via GET /api/v1/books.
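The trigger-then-poll flow can be scripted. The sketch below is a minimal Python version under stated assumptions: GET /api/v1/books is assumed to return a JSON array of objects with id and status fields (the response shape is not documented here), the book is assumed to be in PROCESSING by the time the POST returns, and the helper names are hypothetical.

```python
import base64
import json
import time
import urllib.request

BASE_URL = "http://localhost:8080"


def basic_auth_header(user: str, password: str) -> dict:
    """Build the HTTP Basic header that `curl -u user:password` would send."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return {"Authorization": f"Basic {token}"}


def find_status(books, book_id):
    """Return the status of book_id from the list response, or None if absent."""
    for book in books:
        if str(book.get("id")) == str(book_id):  # "id"/"status" are assumed field names
            return book.get("status")
    return None


def request_json(method: str, path: str, headers: dict):
    """Send a request and decode a JSON body, if the response has one."""
    req = urllib.request.Request(BASE_URL + path, method=method, headers=headers)
    with urllib.request.urlopen(req) as resp:
        body = resp.read()
    return json.loads(body) if body else None


def reembed_and_wait(book_id, headers, fetch=request_json, sleep=time.sleep,
                     interval_s: float = 5.0, max_polls: int = 120):
    """Trigger a re-embed, then poll until the book leaves PROCESSING."""
    fetch("POST", f"/api/v1/books/{book_id}/reembed", headers)
    for _ in range(max_polls):
        status = find_status(fetch("GET", "/api/v1/books", headers) or [], book_id)
        if status is not None and status != "PROCESSING":
            return status
        sleep(interval_s)
    raise TimeoutError(f"book {book_id} still processing after {max_polls} polls")
```

A call such as reembed_and_wait(42, basic_auth_header("admin", "password")) would return the first non-PROCESSING status it sees; 42 is a placeholder book ID.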
Verifying Image Extraction
- Ensure Marker is running: curl http://localhost:8000 should respond.
- Upload a PDF with diagrams: POST /api/v1/books/upload
- Wait for status: "READY" via GET /api/v1/books
- List figures: GET /api/v1/books/{id}/figures (should return at least one entry per image page)
- Ask a diagram-specific question in chat; the response sources should include a type: "FIGURE" entry
Frontend: Rendering Inline Figures
The assistant message content field will contain figure references in the format
[Fig. 12-4, p.184]. The frontend should:
- Parse [Fig. X, p.N] patterns in assistant message text
- Look up the matching entry in sources where type === "FIGURE"
- Render the figure inline using the imageUrl field
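The parse-and-lookup steps can be sketched as follows. This is an illustration in Python, not the frontend's implementation; the same regular expression should port to the frontend's language with little change. The type and imageUrl fields come from the text above, while figureLabel and page are hypothetical names for whatever fields the sources entries actually use to identify a figure.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

# Matches references such as "[Fig. 12-4, p.184]"; whitespace after "Fig." and "p." is optional.
FIGURE_REF = re.compile(r"\[Fig\.\s*([^,\]]+?),\s*p\.\s*(\d+)\]")


@dataclass(frozen=True)
class FigureRef:
    label: str   # e.g. "12-4"
    page: int    # e.g. 184
    start: int   # offset of "[" in the message text
    end: int     # offset just past "]"


def parse_figure_refs(text: str) -> List[FigureRef]:
    """Find every [Fig. X, p.N] reference together with its position in the text."""
    return [
        FigureRef(m.group(1).strip(), int(m.group(2)), m.start(), m.end())
        for m in FIGURE_REF.finditer(text)
    ]


def find_figure_source(ref: FigureRef, sources: List[dict]) -> Optional[dict]:
    """Return the FIGURE source matching ref, or None if there is none."""
    for source in sources:
        if source.get("type") != "FIGURE":
            continue
        # "figureLabel" and "page" are assumed field names, not confirmed by the API.
        if source.get("figureLabel") == ref.label and source.get("page") == ref.page:
            return source
    return None
```

Keeping the start and end offsets lets a renderer split the message text around each reference and place the image from imageUrl at that position. References with no matching source can be left as plain text.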
Running Tests
cd backend
mvn test
Key new test classes:
- MarkerPageParserTest: unit tests for JSON parsing and block-to-PageResult mapping
- FigureExtractionServiceTest: unit tests for base64 decode, size filtering, classification
- NeurosurgeryRetrieverTest: unit tests for dual-search merge and deduplication
- BookEmbeddingServiceIntegrationTest: integration test that uploads a PDF with known figures and verifies they appear in GET /api/v1/books/{id}/figures
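To illustrate the base64 decode and size filtering behavior that FigureExtractionServiceTest covers, here is a minimal Python sketch. It rests on several assumptions: images arrive as base64-encoded PNG data, "smaller than 100×100 px" means either dimension is below the min-image-size-px threshold, and the function names are hypothetical. The actual Java service may decode and filter differently.

```python
import base64
import struct
from typing import Tuple

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_dimensions(data: bytes) -> Tuple[int, int]:
    """Read (width, height) from the IHDR chunk at the start of a PNG file."""
    if len(data) < 24 or data[:8] != PNG_SIGNATURE or data[12:16] != b"IHDR":
        raise ValueError("not a PNG image")
    # IHDR stores width and height as big-endian 32-bit integers at bytes 16..23.
    width, height = struct.unpack(">II", data[16:24])
    return width, height


def keep_figure(image_b64: str, min_size_px: int = 100) -> bool:
    """Return False for images with either dimension below min_size_px (assumed rule)."""
    width, height = png_dimensions(base64.b64decode(image_b64))
    return width >= min_size_px and height >= min_size_px
```

Reading only the header avoids decoding the full image just to discard small decorative graphics.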