
# Quickstart: Enhanced Embedding with Image Parsing and Metadata
**Branch**: `002-image-aware-embedding` | **Date**: 2026-04-04 (updated: Marker replaces Google Document AI)
---
## Prerequisites
- Docker Compose running (PostgreSQL + pgvector)
- OpenAI API key set as env var `OPENAI_API_KEY`
- Java 25 + Maven on PATH
- **Marker server running** on `http://localhost:8000` (see setup below)
- S3-compatible bucket configured (existing setup)
---
## Marker Server Setup (one-time)
Marker is a local Python service — no cloud credentials required.
```bash
# Install (Python 3.10+ required)
pip install marker-pdf
# Start the server on port 8000
marker_server --port 8000
```
The server is ready when you see:
```
INFO: Uvicorn running on http://0.0.0.0:8000
```
Keep the server running in the background, for example as a `systemd` service or inside a `screen` session.
---
## Backend Configuration
Add or update `backend/src/main/resources/application.yaml`:
```yaml
app:
  figure-storage:
    endpoint: https://your-s3-endpoint
    region: your-region
    bucket: ${S3_BUCKET:aiteacher}
    access-key-id: ${S3_ACCESS_KEY_ID}
    secret-access-key: ${S3_SECRET_ACCESS_KEY}
    min-image-size-px: 100 # skip decorative images smaller than 100×100 px
  marker:
    base-url: ${MARKER_BASE_URL:http://localhost:8000}
  embedding:
    batch-size: 20
    batch-delay-ms: 2000
```
No GCP credentials or project IDs are needed.
---
## Database Migration
Two Flyway migrations run automatically on startup:
- `V4__document_hierarchy.sql` — adds `chapter` and `section` tables
- `V5__figures_and_refs.sql` — adds `figure` and `chunk_figure_ref` tables
No manual DB setup needed.
---
## Re-embedding Existing Books
Books embedded by feature 001 (text-only) remain functional for text queries. To add image
support, trigger a re-embed:
```bash
# Replace {bookId} with the ID of the book to re-embed
curl -X POST http://localhost:8080/api/v1/books/{bookId}/reembed \
  -u admin:password
```
The book transitions to `PROCESSING`, old chunks and figures are deleted, and the new
image-aware pipeline runs. Status can be polled via `GET /api/v1/books`.
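
The polling step could be sketched roughly as follows. This is a minimal sketch, not the project's client code: it assumes `GET /api/v1/books` returns an array of objects with `id` and `status` fields, and the names `BookSummary`, `ListBooks`, and `waitForReady` are hypothetical. The HTTP call is injected so the loop can be exercised without a running backend.

```typescript
// Assumed shape of one entry returned by GET /api/v1/books (not confirmed by this guide).
interface BookSummary {
  id: string;
  status: string; // "PROCESSING" and "READY" are the values named in this guide
}

// Hypothetical abstraction over the HTTP call, e.g. a thin wrapper around fetch().
type ListBooks = () => Promise<BookSummary[]>;

const sleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Polls until the given book reports READY, or gives up after maxAttempts polls.
async function waitForReady(
  listBooks: ListBooks,
  bookId: string,
  intervalMs: number = 5000,
  maxAttempts: number = 120
): Promise<boolean> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const books = await listBooks();
    const book = books.find((b) => b.id === bookId);
    if (book !== undefined && book.status === "READY") {
      return true;
    }
    // Wait between polls, but not after the final attempt.
    if (attempt < maxAttempts - 1) {
      await sleep(intervalMs);
    }
  }
  return false; // still not READY after maxAttempts polls
}
```

In practice `listBooks` would wrap a request to `http://localhost:8080/api/v1/books` with the same credentials as the `curl` call above. The interval and attempt limit are arbitrary defaults and should be tuned to how long embedding takes for a typical book.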
---
## Verifying Image Extraction
1. Ensure Marker is running: `curl http://localhost:8000` should respond.
2. Upload a PDF with diagrams: `POST /api/v1/books/upload`
3. Wait for `status: "READY"` via `GET /api/v1/books`
4. List figures: `GET /api/v1/books/{id}/figures` should return at least one entry for each page that contains an image
5. Ask a diagram-specific question in chat — response `sources` should include a `type: "FIGURE"` entry
---
## Frontend: Rendering Inline Figures
The assistant message `content` field will contain figure references in the format
`[Fig. 12-4, p.184]`. The frontend should:
1. Parse `[Fig. X, p.N]` patterns in assistant message text
2. Look up the matching entry in `sources` where `type === "FIGURE"`
3. Render the figure inline using the `imageUrl` field
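
The three steps above can be sketched as two small helpers. This is a minimal sketch under stated assumptions: `type` and `imageUrl` come from this guide, while the `label` and `page` fields used to match a reference to its source entry are hypothetical, as are the names `findFigureRefs` and `resolveFigure`. Adjust the matching to whatever fields the API actually returns.

```typescript
// One entry of the `sources` array. Only `type` and `imageUrl` are named in
// this guide; `label` and `page` are assumed fields used for matching.
interface Source {
  type: string;
  imageUrl?: string;
  label?: string; // e.g. "12-4" (assumed)
  page?: number;  // e.g. 184 (assumed)
}

interface FigureRef {
  raw: string;   // full matched text, e.g. "[Fig. 12-4, p.184]"
  label: string; // the "X" part of "[Fig. X, p.N]"
  page: number;  // the "N" part
  index: number; // offset of the match within the message text
}

// Matches "[Fig. X, p.N]" and tolerates optional whitespace after "p.".
const FIGURE_REF_PATTERN = "\\[Fig\\.\\s*([^,\\]]+),\\s*p\\.\\s*(\\d+)\\]";

// Step 1: find every figure reference in the assistant message text.
function findFigureRefs(content: string): FigureRef[] {
  const refs: FigureRef[] = [];
  const pattern = new RegExp(FIGURE_REF_PATTERN, "g"); // fresh state per call
  let m: RegExpExecArray | null;
  while ((m = pattern.exec(content)) !== null) {
    refs.push({
      raw: m[0],
      label: m[1].trim(),
      page: Number(m[2]),
      index: m.index,
    });
  }
  return refs;
}

// Step 2: look up the matching FIGURE entry in `sources`, if any.
function resolveFigure(ref: FigureRef, sources: Source[]): Source | undefined {
  return sources.find(
    (s) => s.type === "FIGURE" && s.label === ref.label && s.page === ref.page
  );
}
```

For step 3, the renderer can replace each `raw` match with an image element whose source is the resolved entry's `imageUrl`, and leave the reference as plain text when `resolveFigure` returns `undefined`.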
---
## Running Tests
```bash
cd backend
mvn test
```
Key new test classes:
- `MarkerPageParserTest` — unit tests for JSON parsing and block-to-PageResult mapping
- `FigureExtractionServiceTest` — unit tests for base64 decode, size filtering, classification
- `NeurosurgeryRetrieverTest` — unit tests for dual-search merge and deduplication
- `BookEmbeddingServiceIntegrationTest` — integration test: upload PDF with known figures,
verify figures appear in `GET /api/v1/books/{id}/figures`