# Quickstart: Enhanced Embedding with Image Parsing and Metadata

**Branch**: `002-image-aware-embedding` | **Date**: 2026-04-04 (updated: Marker replaces Google Document AI)

---

## Prerequisites

- Docker Compose running (PostgreSQL + pgvector)
- OpenAI API key set as env var `OPENAI_API_KEY`
- Java 25 + Maven on PATH
- **Marker server running** on `http://localhost:8000` (see setup below)
- S3-compatible bucket configured (existing setup)

---

## Marker Server Setup (one-time)

Marker is a local Python service; no cloud credentials are required.

```bash
# Install (Python 3.10+ required)
pip install marker-pdf

# Start the server on port 8000
marker_server --port 8000
```

The server is ready when you see:
```
INFO: Uvicorn running on http://0.0.0.0:8000
```
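
When the server is started from a script, readiness can be checked by probing the base URL instead of watching the log. The sketch below retries a caller-supplied probe until it succeeds. `waitForMarker`, `Probe`, and the retry defaults are hypothetical names and values, not part of Marker or of this project. In Node 18+ the probe could be `() => fetch("http://localhost:8000").then(() => undefined)`, which resolves on any HTTP response and rejects while the port is still closed.

```typescript
// Caller-supplied probe (hypothetical): resolves if the Marker base URL answers,
// rejects on a network error such as "connection refused".
type Probe = () => Promise<void>;

// Retry the probe up to `retries` times, waiting `delayMs` between failures.
// Returns true as soon as one probe succeeds, false if every attempt fails.
async function waitForMarker(
  probe: Probe,
  retries = 30,
  delayMs = 1000,
): Promise<boolean> {
  for (let attempt = 0; attempt < retries; attempt++) {
    try {
      await probe();
      return true;
    } catch {
      // Server not reachable yet: pause, then try again.
      await new Promise<void>((resolve) => setTimeout(resolve, delayMs));
    }
  }
  return false;
}
```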

Keep the server running in the background, for example under `systemd` or in a `screen` session.

---

## Backend Configuration

Add or update `backend/src/main/resources/application.yaml`:

```yaml
app:
  figure-storage:
    endpoint: https://your-s3-endpoint
    region: your-region
    bucket: ${S3_BUCKET:aiteacher}
    access-key-id: ${S3_ACCESS_KEY_ID}
    secret-access-key: ${S3_SECRET_ACCESS_KEY}
    min-image-size-px: 100  # skip decorative images smaller than 100×100 px
  marker:
    base-url: ${MARKER_BASE_URL:http://localhost:8000}
  embedding:
    batch-size: 20
    batch-delay-ms: 2000
```

No GCP credentials or project IDs are needed.

---

## Database Migration

Two Flyway migrations run automatically on startup:

- `V4__document_hierarchy.sql`: adds `chapter` and `section` tables
- `V5__figures_and_refs.sql`: adds `figure` and `chunk_figure_ref` tables

No manual database setup is needed.

---

## Re-embedding Existing Books

Books embedded by feature 001 (text-only) remain functional for text queries. To add image
support, trigger a re-embed:

```bash
curl -X POST http://localhost:8080/api/v1/books/{bookId}/reembed \
  -u admin:password
```

The book transitions to `PROCESSING`, old chunks and figures are deleted, and the new
image-aware pipeline runs. Status can be polled via `GET /api/v1/books`.
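
A client that triggers a re-embed usually needs to wait for it to finish. The sketch below polls until the book leaves `PROCESSING`. It assumes that `GET /api/v1/books` returns a JSON array whose entries carry `id` and `status` fields, and that the status is already `PROCESSING` by the time the `POST` returns. That response shape, the `BookSummary` type, and the function names are assumptions, not the documented API. The HTTP call is passed in as `fetchBooks`, so authentication stays with the caller.

```typescript
// ASSUMED shape of one entry returned by GET /api/v1/books. The status values
// PROCESSING and READY appear in this quickstart; the `id` field name is a guess.
interface BookSummary {
  id: string;
  status: string;
}

// Caller-supplied function that performs GET /api/v1/books (with credentials)
// and returns the parsed list.
type FetchBooks = () => Promise<BookSummary[]>;

// Look up one book's status in the list, or undefined if the book is absent.
function statusOf(books: BookSummary[], bookId: string): string | undefined {
  for (const book of books) {
    if (book.id === bookId) return book.status;
  }
  return undefined;
}

// Poll until the book is no longer PROCESSING and return its final status.
async function waitForReembed(
  fetchBooks: FetchBooks,
  bookId: string,
  intervalMs = 5000,
  maxPolls = 120,
): Promise<string> {
  for (let poll = 0; poll < maxPolls; poll++) {
    const status = statusOf(await fetchBooks(), bookId);
    if (status === undefined) throw new Error(`Book ${bookId} not found`);
    if (status !== "PROCESSING") return status; // READY, or some failure status
    await new Promise<void>((resolve) => setTimeout(resolve, intervalMs));
  }
  throw new Error(`Book ${bookId} still PROCESSING after ${maxPolls} polls`);
}
```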

---

## Verifying Image Extraction

1. Ensure Marker is running: `curl http://localhost:8000` should respond.
2. Upload a PDF with diagrams: `POST /api/v1/books/upload`
3. Wait for `status: "READY"` via `GET /api/v1/books`
4. List figures: `GET /api/v1/books/{id}/figures` should return at least one entry per page that contains an image
5. Ask a diagram-specific question in chat: the response `sources` should include a `type: "FIGURE"` entry

---

## Frontend: Rendering Inline Figures

The assistant message `content` field will contain figure references in the format
`[Fig. 12-4, p.184]`. The frontend should:

1. Parse `[Fig. X, p.N]` patterns in assistant message text
2. Look up the matching entry in `sources` where `type === "FIGURE"`
3. Render the figure inline using the `imageUrl` field
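
The three steps above can be sketched as one function that splits message text into plain-text and figure segments. `type` and `imageUrl` come from the `sources` description. The `label` and `page` fields used to match a reference to its source are hypothetical, since this quickstart does not say how a `FIGURE` source identifies its figure. Rendering is left to the UI framework: a figure segment whose `imageUrl` is `null` had no matching source and can fall back to its original text.

```typescript
// One entry of the response `sources` array. `type` and `imageUrl` are named in
// this quickstart; `label` and `page` are ASSUMED fields used only for matching.
interface Source {
  type: string;
  imageUrl?: string;
  label?: string; // assumed, e.g. "12-4"
  page?: number;  // assumed, e.g. 184
}

type Segment =
  | { kind: "text"; text: string }
  | { kind: "figure"; text: string; label: string; page: number; imageUrl: string | null };

// Matches references such as "[Fig. 12-4, p.184]": group 1 is the label, group 2 the page.
const FIGURE_REF = /\[Fig\.\s*([^,\]]+),\s*p\.\s*(\d+)\]/;

// Step 2: find the FIGURE source for a parsed reference, or null if none matches.
function findFigureUrl(sources: Source[], label: string, page: number): string | null {
  for (const source of sources) {
    if (source.type === "FIGURE" && source.label === label && source.page === page) {
      return source.imageUrl ?? null;
    }
  }
  return null;
}

// Steps 1 and 3: split the message into segments a renderer can map to text or an image.
function splitFigureRefs(content: string, sources: Source[]): Segment[] {
  const segments: Segment[] = [];
  const pattern = new RegExp(FIGURE_REF.source, "g"); // fresh global regex per call
  let cursor = 0;
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(content)) !== null) {
    if (match.index > cursor) {
      segments.push({ kind: "text", text: content.slice(cursor, match.index) });
    }
    const label = match[1].trim();
    const page = Number(match[2]);
    segments.push({
      kind: "figure",
      text: match[0],
      label,
      page,
      imageUrl: findFigureUrl(sources, label, page),
    });
    cursor = match.index + match[0].length;
  }
  if (cursor < content.length) {
    segments.push({ kind: "text", text: content.slice(cursor) });
  }
  return segments;
}
```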

---

## Running Tests

```bash
cd backend
mvn test
```

Key new test classes:
- `MarkerPageParserTest`: unit tests for JSON parsing and block-to-PageResult mapping
- `FigureExtractionServiceTest`: unit tests for base64 decode, size filtering, classification
- `NeurosurgeryRetrieverTest`: unit tests for dual-search merge and deduplication
- `BookEmbeddingServiceIntegrationTest`: integration test that uploads a PDF with known figures and
  verifies that the figures appear in `GET /api/v1/books/{id}/figures`