# Quickstart: Enhanced Embedding with Image Parsing and Metadata

**Branch**: `002-image-aware-embedding` | **Date**: 2026-04-04 (updated: Marker replaces Google Document AI)

---

## Prerequisites

- Docker Compose running (PostgreSQL + pgvector)
- OpenAI API key set as env var `OPENAI_API_KEY`
- Java 25 + Maven on PATH
- **Marker server running** on `http://localhost:8000` (see setup below)
- S3-compatible bucket configured (existing setup)

---

## Marker Server Setup (one-time)

Marker is a local Python service; no cloud credentials are required.

```bash
# Install (Python 3.10+ required)
pip install marker-pdf

# Start the server on port 8000
marker_server --port 8000
```

The server is ready when you see:

```
INFO: Uvicorn running on http://0.0.0.0:8000
```

Keep the server running in the background (or use a process manager like `systemd` or `screen`).

---

## Backend Configuration

Add or update `backend/src/main/resources/application.yaml`:

```yaml
app:
  figure-storage:
    endpoint: https://your-s3-endpoint
    region: your-region
    bucket: ${S3_BUCKET:aiteacher}
    access-key-id: ${S3_ACCESS_KEY_ID}
    secret-access-key: ${S3_SECRET_ACCESS_KEY}
    min-image-size-px: 100  # skip decorative images smaller than 100×100 px
  marker:
    base-url: ${MARKER_BASE_URL:http://localhost:8000}
  embedding:
    batch-size: 20
    batch-delay-ms: 2000
```

No GCP credentials or project IDs are needed.

---

## Database Migration

Two Flyway migrations run automatically on startup:

- `V4__document_hierarchy.sql`: adds `chapter` and `section` tables
- `V5__figures_and_refs.sql`: adds `figure` and `chunk_figure_ref` tables

No manual DB setup needed.

---

## Re-embedding Existing Books

Books embedded by feature 001 (text-only) remain functional for text queries. To add image support, trigger a re-embed:

```bash
curl -X POST http://localhost:8080/api/v1/books/{bookId}/reembed \
  -u admin:password
```

The book transitions to `PROCESSING`, old chunks and figures are deleted, and the new image-aware pipeline runs.
Status can be polled via `GET /api/v1/books`.

---

## Verifying Image Extraction

1. Ensure Marker is running: `curl http://localhost:8000` should respond.
2. Upload a PDF with diagrams: `POST /api/v1/books/upload`
3. Wait for `status: "READY"` via `GET /api/v1/books`
4. List figures: `GET /api/v1/books/{id}/figures` should return at least one entry per image page
5. Ask a diagram-specific question in chat; the response `sources` should include a `type: "FIGURE"` entry

---

## Frontend: Rendering Inline Figures

The assistant message `content` field will contain figure references in the format `[Fig. 12-4, p.184]`. The frontend should:

1. Parse `[Fig. X, p.N]` patterns in assistant message text
2. Look up the matching entry in `sources` where `type === "FIGURE"`
3. Render the figure inline using the `imageUrl` field

---

## Running Tests

```bash
cd backend
mvn test
```

Key new test classes:

- `MarkerPageParserTest`: unit tests for JSON parsing and block-to-PageResult mapping
- `FigureExtractionServiceTest`: unit tests for base64 decode, size filtering, classification
- `NeurosurgeryRetrieverTest`: unit tests for dual-search merge and deduplication
- `BookEmbeddingServiceIntegrationTest`: integration test that uploads a PDF with known figures and verifies they appear in `GET /api/v1/books/{id}/figures`
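
---

## Appendix: Figure Reference Parsing Sketch

The three frontend steps under "Frontend: Rendering Inline Figures" could be sketched roughly as follows. This is a minimal illustration, not the project's actual implementation. Only `type` and `imageUrl` are named in this quickstart; the `figureLabel` and `page` fields used for matching, the `Segment` type, and the `splitFigureRefs` function are hypothetical names introduced here. The real `sources` payload may identify figures differently.

```typescript
// Shape of a source entry. `type` and `imageUrl` come from this quickstart;
// `figureLabel` and `page` are ASSUMED fields used only for this illustration.
interface Source {
  type: string;
  imageUrl?: string;
  figureLabel?: string; // assumed, e.g. "12-4"
  page?: number;        // assumed, e.g. 184
}

// A message is split into plain-text runs and figure references.
type Segment =
  | { kind: "text"; text: string }
  | { kind: "figure"; label: string; page: number; imageUrl?: string };

function splitFigureRefs(content: string, sources: Source[]): Segment[] {
  // Matches references such as "[Fig. 12-4, p.184]".
  // Created per call so the global regex's lastIndex is never shared.
  const figureRef = /\[Fig\.\s*([^,\]]+?)\s*,\s*p\.\s*(\d+)\]/g;
  const segments: Segment[] = [];
  let cursor = 0;
  let match: RegExpExecArray | null;

  while ((match = figureRef.exec(content)) !== null) {
    // Step 1: emit any text that precedes the reference.
    if (match.index > cursor) {
      segments.push({ kind: "text", text: content.slice(cursor, match.index) });
    }
    const label = match[1];
    const page = Number(match[2]);

    // Step 2: look up the matching FIGURE source (matching rule is an assumption).
    const source = sources.find(
      (s) => s.type === "FIGURE" && s.figureLabel === label && s.page === page
    );

    // Step 3: carry the imageUrl so the renderer can show the figure inline.
    // imageUrl stays undefined when no source matches.
    segments.push({ kind: "figure", label, page, imageUrl: source?.imageUrl });
    cursor = match.index + match[0].length;
  }

  // Emit any trailing text after the last reference.
  if (cursor < content.length) {
    segments.push({ kind: "text", text: content.slice(cursor) });
  }
  return segments;
}
```

A renderer could then map `text` segments to plain text and `figure` segments to an image element, falling back to the original bracketed reference when `imageUrl` is undefined.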