# Quickstart: Enhanced Embedding with Image Parsing and Metadata

**Branch**: `002-image-aware-embedding` | **Date**: 2026-04-04 (updated: Marker replaces Google Document AI)

---

## Prerequisites

- Docker Compose running (PostgreSQL + pgvector)
- OpenAI API key set as env var `OPENAI_API_KEY`
- Java 25 + Maven on PATH
- **Marker server running** on `http://localhost:8000` (see setup below)
- S3-compatible bucket configured (existing setup)

---

## Marker Server Setup (one-time)

Marker is a local Python service — no cloud credentials required.

```bash
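# Install (Python 3.10+ required)
pip install marker-pdf

# The API server relies on FastAPI and Uvicorn; install them if
# marker_server fails with a missing-module error
pip install -U uvicorn fastapi python-multipart

# Start the server on port 8000
marker_server --port 8000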
```

The server is ready when you see:
```
INFO:     Uvicorn running on http://0.0.0.0:8000
```

Keep the server running in the background, either under a process manager such as `systemd` or in a detached `screen` session.

---

## Backend Configuration

Add or update the following keys in `backend/src/main/resources/application.yaml`:

```yaml
app:
  figure-storage:
    endpoint: https://your-s3-endpoint
    region: your-region
    bucket: ${S3_BUCKET:aiteacher}
    access-key-id: ${S3_ACCESS_KEY_ID}
    secret-access-key: ${S3_SECRET_ACCESS_KEY}
    min-image-size-px: 100   # skip decorative images smaller than 100×100 px
  marker:
    base-url: ${MARKER_BASE_URL:http://localhost:8000}
  embedding:
    batch-size: 20
    batch-delay-ms: 2000
```

No GCP credentials or project IDs are needed.

---

## Database Migration

Two Flyway migrations run automatically on startup:

- `V4__document_hierarchy.sql` — adds `chapter` and `section` tables
- `V5__figures_and_refs.sql` — adds `figure` and `chunk_figure_ref` tables

No manual DB setup needed.

---

## Re-embedding Existing Books

Books embedded by feature 001 (text-only) remain functional for text queries. To add image
support, trigger a re-embed:

```bash
# Replace {bookId} with the target book's ID and adjust the credentials for your environment
curl -X POST http://localhost:8080/api/v1/books/{bookId}/reembed \
  -u admin:password
```

The book transitions to `PROCESSING`, its old chunks and figures are deleted, and the new
image-aware pipeline runs. Poll `GET /api/v1/books` until the book's status returns to `READY`.
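
The polling step could be sketched in TypeScript as follows. This is only an illustration: it assumes `GET /api/v1/books` returns a JSON array whose entries carry `id` and `status` fields, which this guide does not specify, and `BookSummary`, `decidePoll`, and `waitUntilReady` are hypothetical names.

```typescript
// Assumed response entry for GET /api/v1/books. Only the status values
// "PROCESSING" and "READY" come from this guide; the field names are guesses.
interface BookSummary {
  id: string;
  status: string;
}

type PollDecision = "ready" | "wait" | "missing";

// Classify one poll result for the given book.
function decidePoll(books: BookSummary[], bookId: string): PollDecision {
  const book = books.find((b) => b.id === bookId);
  if (!book) return "missing";
  return book.status === "READY" ? "ready" : "wait";
}

// Poll until the book is READY. Failure statuses are not documented here,
// so any non-READY status is treated as "still working" and the loop is
// bounded by maxAttempts instead.
async function waitUntilReady(
  bookId: string,
  listBooks: () => Promise<BookSummary[]>,
  intervalMs = 5000,
  maxAttempts = 120,
): Promise<void> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const decision = decidePoll(await listBooks(), bookId);
    if (decision === "ready") return;
    if (decision === "missing") throw new Error(`Book ${bookId} not found`);
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  throw new Error(`Book ${bookId} not READY after ${maxAttempts} polls`);
}
```

`listBooks` is injected so the caller decides how to reach the API, for example a `fetch` call against `http://localhost:8080/api/v1/books` with the same basic-auth credentials used above.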

---

## Verifying Image Extraction

1. Ensure Marker is running: `curl http://localhost:8000` should return an HTTP response rather than a connection error.
2. Upload a PDF with diagrams: `POST /api/v1/books/upload`
3. Wait for `status: "READY"` via `GET /api/v1/books`
4. List figures: `GET /api/v1/books/{id}/figures` should return at least one entry for each page that contains an image at or above the `min-image-size-px` threshold
5. Ask a diagram-specific question in chat — response `sources` should include a `type: "FIGURE"` entry

---

## Frontend: Rendering Inline Figures

The assistant message `content` field will contain figure references in the format
`[Fig. 12-4, p.184]`. The frontend should:

1. Parse `[Fig. X, p.N]` patterns in assistant message text
2. Look up the matching entry in `sources` where `type === "FIGURE"`
3. Render the figure inline using the `imageUrl` field
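
The three steps above could be sketched in TypeScript like this. It is a minimal sketch under stated assumptions: the guide only confirms the `type` and `imageUrl` fields on a source, so the `label` and `page` fields used for matching, and the names `FigureRef`, `Source`, `parseFigureRefs`, and `findFigureSource`, are hypothetical.

```typescript
// One `[Fig. X, p.N]` reference found in an assistant message.
interface FigureRef {
  raw: string;   // full matched text, e.g. "[Fig. 12-4, p.184]"
  label: string; // figure label, e.g. "12-4"
  page: number;  // page number, e.g. 184
  index: number; // character offset of the match in the message
}

// `type` and `imageUrl` come from this guide; `label` and `page` are
// assumed field names used here only to illustrate the lookup.
interface Source {
  type: string;
  imageUrl?: string;
  label?: string;
  page?: number;
}

// Step 1: collect every `[Fig. X, p.N]` pattern in the message text.
function parseFigureRefs(text: string): FigureRef[] {
  // A fresh regex per call keeps the global flag's lastIndex at 0.
  const pattern = /\[Fig\.\s*([^,\]]+),\s*p\.\s*(\d+)\]/g;
  const refs: FigureRef[] = [];
  let m: RegExpExecArray | null;
  while ((m = pattern.exec(text)) !== null) {
    refs.push({
      raw: m[0],
      label: m[1].trim(),
      page: Number(m[2]),
      index: m.index,
    });
  }
  return refs;
}

// Step 2: find the FIGURE source that matches a parsed reference.
function findFigureSource(ref: FigureRef, sources: Source[]): Source | undefined {
  return sources.find(
    (s) => s.type === "FIGURE" && s.label === ref.label && s.page === ref.page,
  );
}
```

For step 3, a renderer can split the message at each `index` and `raw.length`, then emit an image element using the matched source's `imageUrl`, falling back to the plain reference text when no source matches.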

---

## Running Tests

```bash
cd backend
mvn test
```

Key new test classes:
- `MarkerPageParserTest` — unit tests for JSON parsing and block-to-PageResult mapping
- `FigureExtractionServiceTest` — unit tests for base64 decode, size filtering, classification
- `NeurosurgeryRetrieverTest` — unit tests for dual-search merge and deduplication
- `BookEmbeddingServiceIntegrationTest` — integration test: upload PDF with known figures,
  verify figures appear in `GET /api/v1/books/{id}/figures`