Quickstart: Enhanced Embedding with Image Parsing and Metadata

Branch: 002-image-aware-embedding | Date: 2026-04-04 (updated: Marker replaces Google Document AI)


Prerequisites

  • Docker Compose running (PostgreSQL + pgvector)
  • OpenAI API key set as env var OPENAI_API_KEY
  • Java 25 + Maven on PATH
  • Marker server running on http://localhost:8000 (see setup below)
  • S3-compatible bucket configured (existing setup)

Marker Server Setup (one-time)

Marker is a local Python service — no cloud credentials required.

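# Install (Python 3.10+ required)
pip install marker-pdf
# The API server may also need these packages (per the Marker README)
pip install -U uvicorn fastapi python-multipart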

# Start the server on port 8000
marker_server --port 8000

The server is ready when you see:

INFO:     Uvicorn running on http://0.0.0.0:8000

Keep the server running in the background, for example as a systemd service or inside a screen session.


Backend Configuration

Add or update backend/src/main/resources/application.yaml:

app:
  figure-storage:
    endpoint: https://your-s3-endpoint
    region: your-region
    bucket: ${S3_BUCKET:aiteacher}
    access-key-id: ${S3_ACCESS_KEY_ID}
    secret-access-key: ${S3_SECRET_ACCESS_KEY}
    min-image-size-px: 100   # skip decorative images smaller than 100×100 px
  marker:
    base-url: ${MARKER_BASE_URL:http://localhost:8000}
  embedding:
    batch-size: 20
    batch-delay-ms: 2000

No GCP credentials or project IDs are needed.


Database Migration

Two Flyway migrations run automatically on startup:

  • V4__document_hierarchy.sql — adds chapter and section tables
  • V5__figures_and_refs.sql — adds figure and chunk_figure_ref tables

No manual DB setup needed.


Re-embedding Existing Books

Books embedded by feature 001 (text-only) remain functional for text queries. To add image support, trigger a re-embed:

curl -X POST http://localhost:8080/api/v1/books/{bookId}/reembed \
  -u admin:password

The book transitions to PROCESSING, old chunks and figures are deleted, and the new image-aware pipeline runs. Status can be polled via GET /api/v1/books.
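The polling step could be sketched as follows. This is a minimal sketch, not the project's client code: the `BookSummary` shape (an `id` and a `status` field per entry), the string type of `id`, and the names `findStatus`, `waitUntilReady`, and `listBooks` are assumptions made for illustration. Only the `PROCESSING` and `READY` status values and the `GET /api/v1/books` endpoint come from this guide.

```typescript
// Assumed shape of one entry returned by GET /api/v1/books (hypothetical).
interface BookSummary {
  id: string;
  status: string; // this guide names "PROCESSING" and "READY"
}

// Returns the status of the given book, or undefined if it is not listed.
function findStatus(books: BookSummary[], bookId: string): string | undefined {
  const book = books.find((b) => b.id === bookId);
  return book ? book.status : undefined;
}

// Polls until the book reports READY or the attempt budget runs out.
// `listBooks` is injected by the caller; it is expected to wrap
// GET /api/v1/books and return the parsed list.
async function waitUntilReady(
  listBooks: () => Promise<BookSummary[]>,
  bookId: string,
  intervalMs: number = 5000,
  maxAttempts: number = 120,
): Promise<boolean> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const books = await listBooks();
    if (findStatus(books, bookId) === "READY") return true;
    // Pause before the next poll.
    await new Promise<void>((resolve) => setTimeout(resolve, intervalMs));
  }
  return false; // still not READY after maxAttempts polls
}
```

In practice `listBooks` would issue the `GET /api/v1/books` request with the same credentials as the curl call above. Any status other than `READY` is treated as "keep waiting", since this guide does not name a failure status.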


Verifying Image Extraction

  1. Ensure Marker is running: curl http://localhost:8000 should respond.
  2. Upload a PDF with diagrams: POST /api/v1/books/upload
  3. Wait for status: "READY" via GET /api/v1/books
  4. List figures: GET /api/v1/books/{id}/figures should return at least one entry for each page that contains an image
  5. Ask a diagram-specific question in chat — response sources should include a type: "FIGURE" entry

Frontend: Rendering Inline Figures

The assistant message content field will contain figure references in the format [Fig. 12-4, p.184]. The frontend should:

  1. Parse [Fig. X, p.N] patterns in assistant message text
  2. Look up the matching entry in sources where type === "FIGURE"
  3. Render the figure inline using the imageUrl field
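The three steps above can be sketched as a pair of pure functions that a UI component could call. This is a sketch under stated assumptions rather than the actual frontend code: `type` and `imageUrl` are the only source fields named in this guide, so the `figureLabel` and `page` fields used for matching, and all function and type names, are hypothetical.

```typescript
// `type` and `imageUrl` come from this guide; `figureLabel` and `page`
// are hypothetical field names used here to match a reference to a source.
interface Source {
  type: string;
  imageUrl?: string;
  figureLabel?: string; // assumed, e.g. "12-4"
  page?: number;        // assumed, e.g. 184
}

interface FigureRef {
  raw: string;   // full matched text, e.g. "[Fig. 12-4, p.184]"
  label: string; // "12-4"
  page: number;  // 184
  index: number; // offset of the match in the message text
}

type Segment =
  | { kind: "text"; text: string }
  | { kind: "figure"; ref: FigureRef; imageUrl: string };

// Step 1: find every "[Fig. X, p.N]" pattern in the message text.
function parseFigureRefs(text: string): FigureRef[] {
  // A fresh regex per call, so lastIndex always starts at 0.
  const pattern = /\[Fig\.\s*([^,\]]+?)\s*,\s*p\.\s*(\d+)\]/g;
  const refs: FigureRef[] = [];
  let m: RegExpExecArray | null;
  while ((m = pattern.exec(text)) !== null) {
    refs.push({ raw: m[0], label: m[1], page: Number(m[2]), index: m.index });
  }
  return refs;
}

// Step 2: look up the matching FIGURE entry in the sources.
function findFigureSource(ref: FigureRef, sources: Source[]): Source | undefined {
  return sources.find(
    (s) => s.type === "FIGURE" && s.figureLabel === ref.label && s.page === ref.page,
  );
}

// Step 3: split the message into text and figure segments. Each resolved
// figure is placed right after the text that ends with its reference.
function toSegments(text: string, sources: Source[]): Segment[] {
  const segments: Segment[] = [];
  let cursor = 0;
  for (const ref of parseFigureRefs(text)) {
    const source = findFigureSource(ref, sources);
    if (!source || !source.imageUrl) continue; // unresolved refs stay plain text
    const end = ref.index + ref.raw.length;
    segments.push({ kind: "text", text: text.slice(cursor, end) });
    segments.push({ kind: "figure", ref, imageUrl: source.imageUrl });
    cursor = end;
  }
  if (cursor < text.length) segments.push({ kind: "text", text: text.slice(cursor) });
  return segments;
}
```

A renderer can then map `text` segments to paragraphs and `figure` segments to image elements. References without a matching source are left in the text unchanged, so a missing figure never breaks the message.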

Running Tests

cd backend
mvn test

Key new test classes:

  • MarkerPageParserTest — unit tests for JSON parsing and block-to-PageResult mapping
  • FigureExtractionServiceTest — unit tests for base64 decode, size filtering, classification
  • NeurosurgeryRetrieverTest — unit tests for dual-search merge and deduplication
  • BookEmbeddingServiceIntegrationTest — integration test: upload PDF with known figures, verify figures appear in GET /api/v1/books/{id}/figures