ai-teacher/specs/002-image-aware-embedding/tasks.md

# Tasks: Enhanced Embedding with Image Parsing and Metadata

**Input**: Design documents from `/specs/002-image-aware-embedding/`
**Prerequisites**: plan.md ✓ | spec.md ✓ | research.md ✓ | data-model.md ✓ | contracts/ ✓

**Organization**: Tasks grouped by user story to enable independent implementation and testing.

## Format: `[ID] [P?] [Story] Description`

- **[P]**: Can run in parallel (different files, no shared dependencies)
- **[US1/US2/US3]**: Which user story this task belongs to

---

## Phase 1: Setup (Shared Infrastructure)

**Purpose**: Database migrations and configuration that establish the foundation for all new code

- [X] T001 Create Flyway migration `V4__document_hierarchy.sql` — add `chapter` and `section` tables per data-model.md §Postgres Schema in `backend/src/main/resources/db/migration/V4__document_hierarchy.sql`
- [X] T002 Create Flyway migration `V5__figures_and_refs.sql` — add `figure` and `chunk_figure_ref` tables per data-model.md §Postgres Schema in `backend/src/main/resources/db/migration/V5__figures_and_refs.sql`
- [X] T003 Add figure-storage configuration keys to `backend/src/main/resources/application.properties`: `app.figure-storage.base-path=./uploads` and `app.figure-storage.min-image-size-px=100`
- [X] T004 Add `uploads/` directory to `.gitignore` at repo root; create `uploads/figures/.gitkeep` to preserve directory structure

---

## Phase 2: Foundational (Blocking Prerequisites)

**Purpose**: Core types and infrastructure that ALL user stories depend on

**⚠️ CRITICAL**: No user story work can begin until this phase is complete

- [X] T005 [P] Create `FigureType` enum in `backend/src/main/java/com/aiteacher/document/FigureType.java` — values: `ANATOMICAL_DIAGRAM`, `SURGICAL_PHOTOGRAPH`, `MRI_CT_SCAN`, `TABLE`, `CHART`, `INTRAOPERATIVE_IMAGE`
- [X] T006 [P] Create `FigureStorageService` interface in `backend/src/main/java/com/aiteacher/figure/FigureStorageService.java` — declare `Path save(UUID bookId, String figureId, BufferedImage image)`, `Path resolve(UUID bookId, String filename)`, and `void delete(UUID bookId)`
- [X] T007 Create `LocalFigureStorageService` implementation in `backend/src/main/java/com/aiteacher/figure/LocalFigureStorageService.java` — writes PNG files under `${app.figure-storage.base-path}/figures/{bookId}/`; implements `FigureStorageService`; depends on T006
- [X] T008 Create `FigureStorageConfig` bean in `backend/src/main/java/com/aiteacher/config/FigureStorageConfig.java` — reads `app.figure-storage.base-path` and `app.figure-storage.min-image-size-px` as `@ConfigurationProperties`; registers `LocalFigureStorageService` as `@Bean`; adds `ResourceHandler` mapping `GET /api/v1/figures/**` to the base-path directory
- [X] T009 [P] Create `ChapterEntity` JPA entity and `ChapterRepository` in `backend/src/main/java/com/aiteacher/document/` — `@Entity(name="chapter")`, fields: `id` (String PK), `bookId` (UUID FK → book), `number` (int), `title` (String), `pageStart` (int), `createdAt` (Instant); `ChapterRepository extends JpaRepository<ChapterEntity, String>`
- [X] T010 [P] Create `SectionEntity` JPA entity and `SectionRepository` in `backend/src/main/java/com/aiteacher/document/` — `@Entity(name="section")`, fields: `id` (String PK), `chapterId` (String FK → chapter), `bookId` (UUID FK → book), `number` (String), `title` (String), `pageStart`/`pageEnd` (int), `fullText` (TEXT column), `createdAt` (Instant); `SectionRepository extends JpaRepository<SectionEntity, String>` with `findAllByBookId(UUID)`
- [X] T011 [P] Create `FigureEntity` JPA entity and `FigureRepository` in `backend/src/main/java/com/aiteacher/document/` — `@Entity(name="figure")`, fields: `id` (String PK), `bookId` (UUID), `sectionId` (String, nullable), `chapterId` (String, nullable), `label` (String), `caption` (TEXT), `figureType` (`@Enumerated` FigureType), `page` (int), `imagePath` (String), `captionEmbeddingId` (UUID, nullable), `createdAt` (Instant); `FigureRepository` with `findAllByBookId(UUID)`, `deleteAllByBookId(UUID)`
- [X] T012 Create `ChunkFigureRefEntity` JPA entity and `ChunkFigureRefRepository` in `backend/src/main/java/com/aiteacher/document/` — composite PK `(chunkId UUID, figureId String)`, `mentionPage` (int); `ChunkFigureRefRepository` with `findByChunkIdIn(List<UUID>)`, `deleteByFigureIdIn(List<String>)`
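
The storage contract from T006 and T007 can be sketched as below. This is a minimal sketch under stated assumptions, not the project's actual code: the class name `LocalFigureStorageSketch`, the `figureId + ".png"` naming rule, and passing the base path through the constructor (instead of reading `app.figure-storage.base-path` via Spring) are all hypothetical.

```java
import java.awt.image.BufferedImage;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.UUID;
import java.util.stream.Stream;
import javax.imageio.ImageIO;

// Contract as declared in T006.
interface FigureStorageService {
    Path save(UUID bookId, String figureId, BufferedImage image);
    Path resolve(UUID bookId, String filename);
    void delete(UUID bookId);
}

// Hypothetical local implementation: PNGs live under {basePath}/figures/{bookId}/.
final class LocalFigureStorageSketch implements FigureStorageService {
    private final Path basePath;

    LocalFigureStorageSketch(Path basePath) {
        this.basePath = basePath;
    }

    private Path bookDir(UUID bookId) {
        return basePath.resolve("figures").resolve(bookId.toString());
    }

    @Override
    public Path save(UUID bookId, String figureId, BufferedImage image) {
        try {
            // Creates missing parent directories, so callers need no setup step.
            Path dir = Files.createDirectories(bookDir(bookId));
            Path target = dir.resolve(figureId + ".png"); // assumed naming rule
            ImageIO.write(image, "png", target.toFile());
            return target;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    @Override
    public Path resolve(UUID bookId, String filename) {
        return bookDir(bookId).resolve(filename);
    }

    @Override
    public void delete(UUID bookId) {
        Path dir = bookDir(bookId);
        if (!Files.exists(dir)) {
            return;
        }
        // Reverse order visits files before the directory that contains them.
        try (Stream<Path> paths = Files.walk(dir)) {
            paths.sorted(Comparator.reverseOrder()).forEach(p -> p.toFile().delete());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The sketch does not validate `filename` in `resolve`; a real implementation serving files over HTTP would likely need to reject path-traversal input.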

**Checkpoint**: Migrations will run on next startup; all JPA entities are wired; figure storage reads config correctly

---

## Phase 3: User Story 2 — All Pages Scanned for Images During Embedding (Priority: P1)

**Goal**: When a book is uploaded, every page is inspected for images; each found image is extracted, persisted, described, and embedded as a searchable chunk alongside its metadata

**Independent Test**: Upload a PDF containing at least one page with a labelled anatomical diagram. After status shows `READY`, call `GET /api/v1/books/{id}/figures` — response must contain at least one entry with `figureType`, `caption`, `page`, and `imageUrl` populated. Verify the PNG file exists at the path in `imagePath`.

- [X] T013 [US2] ~~Create `PdfStructureParser`~~ → **SUPERSEDED**: PDF parsing is handled by `MarkerPageParser` (see T013b). `PdfStructureParser` exists but is not wired into the pipeline.
- [X] T013b [US2] Create `MarkerPageParser` in `backend/src/main/java/com/aiteacher/document/MarkerPageParser.java` — POSTs PDF to `http://localhost:8000/marker/upload?output_format=json` via Spring `RestClient`; parses JSON response into `List<PageResult>` (one per page block); extracts heading, ordered text, and pre-cropped figure PNG bytes per page
- [X] T014 [US2] Update `FigureExtractionService` in `backend/src/main/java/com/aiteacher/document/FigureExtractionService.java` — **Marker migration**: removed PDFBox rendering + bbox-crop loop; decodes PNG bytes from `PageResult.FigureData` via `ImageIO.read()`; skips images below `min-image-size-px`; classifies `FigureType`; saves via `FigureStorageService`; persists `FigureEntity`
- [X] T015 [US2] Create `VisionDescriptionService` in `backend/src/main/java/com/aiteacher/document/VisionDescriptionService.java` — accepts a `Path` to a PNG and a caption String; calls the OpenAI vision model (via Spring AI `ChatClient` with image media type) to generate a 2–4 sentence clinical description; returns the generated description string; handles API failures by returning the caption as fallback
- [X] T016 [US2] Create `TextChunkingService` in `backend/src/main/java/com/aiteacher/document/TextChunkingService.java` — accepts a `SectionEntity`; splits `fullText` into overlapping 400–600 token windows (20-token overlap); wraps each window in a Spring AI `Document` with the flat metadata map defined in data-model.md §Text chunk document; returns `List<Document>`
- [X] T017 [US2] Create `ChunkFigureRefService` in `backend/src/main/java/com/aiteacher/document/ChunkFigureRefService.java` — accepts a Spring AI `Document` (with its `id` as `chunkId`) and a `List<FigureEntity>` for the book; scans chunk text for patterns `Fig\.\s*\d+[\-\.]\d+` and `Figure\s+\d+[\-\.]\d+`; matches against figure labels; persists `ChunkFigureRefEntity` rows via `ChunkFigureRefRepository`
- [X] T018 [US2] Update `BookEmbeddingService.embedBook()` — **Marker migration**: injected `MarkerPageParser` replacing `DocumentAiPageParser`; updated `figureExtractionService.extract()` call (removed `pdfPath` arg); updated log message. Pipeline: (1) `MarkerPageParser` → `List<PageResult>`; (2) `buildAndSaveSections()` → sections; (3) `TextChunkingService` → chunks → embed; (4) `FigureExtractionService.extract()` → figures; (5) `VisionDescriptionService` → embed figure chunks; (6) `ChunkFigureRefService` → refs
- [X] T019 [US2] Extend `BookEmbeddingService.deleteBookChunks()` to also delete: all `ChunkFigureRefEntity` rows (via `deleteByFigureIdIn`), all `FigureEntity` rows (via `deleteAllByBookId`), all figure PNG files (via `FigureStorageService.delete(bookId)`), and all `SectionEntity` and `ChapterEntity` rows for the book
- [X] T020 [US2] Add `POST /api/v1/books/{id}/reembed` endpoint to `BookController` in `backend/src/main/java/com/aiteacher/book/BookController.java` — returns `202` with `{ bookId, status: "PROCESSING" }`; returns `404` if not found; returns `409` if already `PROCESSING`; calls `deleteBookChunks()` then `embedBook()` asynchronously
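
The overlapping-window split described in T016 can be illustrated as follows. This is a rough sketch, not the actual `TextChunkingService`: the class name `SlidingWindowChunker` is hypothetical, and tokens are approximated by whitespace-separated words, which will not match a model tokenizer's counts.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class SlidingWindowChunker {
    // Splits text into windows of at most windowSize tokens. Consecutive
    // windows share `overlap` tokens. Tokens are whitespace-separated words
    // here (an assumption that only approximates model tokens).
    static List<String> chunk(String text, int windowSize, int overlap) {
        if (windowSize <= 0 || overlap < 0 || overlap >= windowSize) {
            throw new IllegalArgumentException("need 0 <= overlap < windowSize");
        }
        List<String> chunks = new ArrayList<>();
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return chunks;
        }
        String[] tokens = trimmed.split("\\s+");
        int step = windowSize - overlap;
        for (int start = 0; start < tokens.length; start += step) {
            int end = Math.min(start + windowSize, tokens.length);
            chunks.add(String.join(" ", Arrays.copyOfRange(tokens, start, end)));
            if (end == tokens.length) {
                break; // the last window already reached the end of the text
            }
        }
        return chunks;
    }
}
```

With the values from T016, a call such as `SlidingWindowChunker.chunk(fullText, 500, 20)` would produce windows of up to 500 words, each repeating the last 20 words of the previous one.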

**Checkpoint**: Upload a PDF with figures → poll `GET /api/v1/books` for `READY` → `GET /api/v1/books/{id}/figures` returns figure list → PNG accessible at `GET /api/v1/figures/{bookId}/{filename}`
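
The figure-mention scan from T017 might look like the sketch below. It combines the two patterns from the task into one regex with capture groups. The class name `FigureMentionScanner` and the matching rule (compare the numeric part, treating `12-3` and `12.3` as the same figure) are assumptions; the real label format in `FigureEntity.label` may require a different normalization.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class FigureMentionScanner {
    // The two patterns from T017 merged, with both number parts captured.
    private static final Pattern MENTION =
            Pattern.compile("(?:Fig\\.\\s*|Figure\\s+)(\\d+)[\\-\\.](\\d+)");

    // Returns normalized keys such as "12.3" for every mention found in the text.
    static Set<String> mentionedKeys(String text) {
        Set<String> keys = new LinkedHashSet<>();
        Matcher m = MENTION.matcher(text);
        while (m.find()) {
            keys.add(m.group(1) + "." + m.group(2));
        }
        return keys;
    }

    // Returns the labels whose own figure number is mentioned in the chunk text.
    static List<String> matchLabels(String chunkText, List<String> labels) {
        Set<String> mentioned = mentionedKeys(chunkText);
        return labels.stream()
                .filter(label -> mentionedKeys(label).stream().anyMatch(mentioned::contains))
                .toList();
    }
}
```

Each matched label would then become one `ChunkFigureRefEntity` row keyed by the chunk's id and the figure's id.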

---

## Phase 4: User Story 1 — Image Content Surfaced in Query Results (Priority: P1)

**Goal**: User asks a question answered by a diagram — the system retrieves that diagram's content and surfaces it in the chat response with a citation

**Independent Test**: With a book embedded (Phase 3 checkpoint passed), ask a chat question whose answer is depicted only in a diagram. The response `sources` array must contain at least one entry with `type: "FIGURE"` and a non-empty `imageUrl`.

- [X] T021 [US1] Create `NeurosurgeryRetriever` service in `backend/src/main/java/com/aiteacher/retrieval/NeurosurgeryRetriever.java` — (1) text chunk search: `vectorStore.similaritySearch` with filter `type == TEXT AND book_id == bookId`, topK=5; (2) figure search: same store, filter `type == FIGURE AND book_id == bookId`, topK=3; (3) expand text chunk results to parent sections via `SectionRepository.findAllById(sectionIds)`; (4) fetch explicitly linked figures via `ChunkFigureRefRepository.findByChunkIdIn(chunkIds)` + `FigureRepository.findAllById`; (5) deduplicate figures across lists by `figureId`; return `RetrievalResult(parentSections, figureVectorHits, linkedFigures)` — add `RetrievalResult` record in same package
- [X] T022 [US1] Refactor `ChatService.sendMessage()` in `backend/src/main/java/com/aiteacher/chat/ChatService.java` — replace `QuestionAnswerAdvisor` with a manual call to `NeurosurgeryRetriever`; build the LLM user message from: section full texts as `[Section X.Y — Title, pp.A-B]\n{fullText}` blocks, followed by `AVAILABLE FIGURES FOR THIS SECTION:` list with `- {label} (p.{page}): {caption} [image: {filename}]` lines per figure; append the instruction `When referencing diagrams, cite them as [Fig. X, p.N].`; send via `chatClient.prompt().system(SYSTEM_PROMPT).user(prompt).call()`
- [X] T023 [US1] Add `GET /api/v1/books/{id}/figures` endpoint to `BookController` — returns `200` with `List<FigureResponse>`; `FigureResponse` is a new record in `backend/src/main/java/com/aiteacher/book/FigureResponse.java` with fields `figureId`, `label`, `caption`, `figureType`, `page`, `imageUrl` (assembled as `/api/v1/figures/{bookId}/{filename}`), `sectionId`, `sectionTitle`; returns `404` if book not found
- [X] T024 [US1] Update `extractSources()` in `ChatService` to build both TEXT and FIGURE source entries: TEXT entries keep existing fields plus `"type": "TEXT"`; FIGURE entries add `"type": "FIGURE"`, `"figureId"`, `"label"`, `"caption"`, `"figureType"`, `"imageUrl"` — source data comes from `RetrievalResult` (text chunk Documents and merged FigureEntity list)
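
Step (5) of T021, deduplicating figures across the vector hits and the explicitly linked list, can be sketched as follows. `FigureRef` is a hypothetical stand-in for `FigureEntity`, and keeping the first occurrence in encounter order is an assumption about the desired ordering.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class FigureMerge {
    // Stand-in for FigureEntity with only the fields needed here.
    record FigureRef(String figureId, String label) {}

    // Merges both lists, keeping the first occurrence of each figureId
    // and preserving encounter order (vector hits first).
    static List<FigureRef> mergeDistinct(List<FigureRef> vectorHits, List<FigureRef> linked) {
        Map<String, FigureRef> byId = new LinkedHashMap<>();
        for (FigureRef f : vectorHits) {
            byId.putIfAbsent(f.figureId(), f);
        }
        for (FigureRef f : linked) {
            byId.putIfAbsent(f.figureId(), f);
        }
        return new ArrayList<>(byId.values());
    }
}
```

A `LinkedHashMap` keeps the result deterministic, which makes the behaviour straightforward to assert in a test such as T030.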

**Checkpoint**: Chat question answered by a diagram → response body contains `sources[n].type == "FIGURE"` with populated `imageUrl`; image loads from the returned URL
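
The context layout that T022 describes (section blocks, then the figure list, then the citation instruction) can be sketched in plain Java. The records `Section` and `Figure` and the class `PromptSketch` are hypothetical, and emitting a single figure list after all sections is an assumption; the task text does not say whether the list repeats per section or where the user's question is placed, so the sketch builds only the context part.

```java
import java.util.List;

final class PromptSketch {
    record Section(String number, String title, int pageStart, int pageEnd, String fullText) {}
    record Figure(String label, int page, String caption, String filename) {}

    // Builds the retrieval context in the layout described in T022.
    static String buildContext(List<Section> sections, List<Figure> figures) {
        StringBuilder sb = new StringBuilder();
        for (Section s : sections) {
            // \u2014 is the dash used in the header format from T022.
            sb.append("[Section ").append(s.number()).append(" \u2014 ").append(s.title())
              .append(", pp.").append(s.pageStart()).append('-').append(s.pageEnd()).append("]\n")
              .append(s.fullText()).append("\n\n");
        }
        if (!figures.isEmpty()) {
            sb.append("AVAILABLE FIGURES FOR THIS SECTION:\n");
            for (Figure f : figures) {
                sb.append("- ").append(f.label()).append(" (p.").append(f.page()).append("): ")
                  .append(f.caption()).append(" [image: ").append(f.filename()).append("]\n");
            }
            sb.append('\n');
        }
        sb.append("When referencing diagrams, cite them as [Fig. X, p.N].");
        return sb.toString();
    }
}
```

The returned string would be passed as the `user(...)` argument of the `chatClient.prompt()` call named in T022, combined with the user's question in whatever way the service chooses.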

---

## Phase 5: User Story 3 — Rich Metadata Enables Precise Source Attribution (Priority: P2)

**Goal**: Users see distinct, informative citations for text vs. image sources; image sources render inline in the chat UI

**Independent Test**: After triggering a response with figure sources, inspect the chat message in the UI — text sources and figure sources are visually distinguishable; figure sources render the actual image inline using the `imageUrl`

- [X] T025 [P] [US3] Update API response types in `frontend/src/services/api.ts` — extend the `Source` type to include `type: 'TEXT' | 'FIGURE'`, `figureId?: string`, `label?: string`, `caption?: string`, `figureType?: string`, `imageUrl?: string`
- [X] T026 [P] [US3] Update the chat source/citation display in the frontend (wherever sources are currently rendered, e.g. `frontend/src/components/` or `frontend/src/views/`) — render TEXT sources with a document icon and page number; render FIGURE sources with the image (`<img :src="source.imageUrl">`) below the label and caption text
- [X] T027 [US3] Add figure-type badge rendering in the frontend figure display: show a label derived from `figureType` (e.g. "MRI / CT", "Anatomical Diagram", "Table") alongside the figure caption so users can identify content type without opening the image
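
The badge mapping in T027 is language-independent; it is shown here in Java against the `FigureType` values from T005 to stay consistent with the backend enum. This is only a sketch: the enum name `FigureTypeSketch` is hypothetical, "MRI / CT", "Anatomical Diagram", and "Table" come from T027, and the title-casing rule for the remaining values is an assumption. The frontend would implement the equivalent lookup in TypeScript.

```java
import java.util.Arrays;
import java.util.Locale;
import java.util.stream.Collectors;

enum FigureTypeSketch {
    ANATOMICAL_DIAGRAM, SURGICAL_PHOTOGRAPH, MRI_CT_SCAN, TABLE, CHART, INTRAOPERATIVE_IMAGE;

    // Badge text: a special case for MRI_CT_SCAN, otherwise the constant
    // name converted to Title Case with underscores replaced by spaces.
    String badgeLabel() {
        if (this == MRI_CT_SCAN) {
            return "MRI / CT";
        }
        return Arrays.stream(name().split("_"))
                .map(w -> w.charAt(0) + w.substring(1).toLowerCase(Locale.ROOT))
                .collect(Collectors.joining(" "));
    }
}
```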

---

## Phase 6: Polish & Cross-Cutting Concerns

- [X] T028 Update `README.md` Mermaid architecture diagram to show three storage tiers: pgvector (semantic search), Postgres (source of truth — sections, figures, refs), and file store (extracted PNGs) — **required by Constitution Principle IV in the same PR as the other changes**
- [X] T029 [P] Write `FigureExtractionServiceTest` unit test in `backend/src/test/java/com/aiteacher/document/FigureExtractionServiceTest.java` — test: images below min size are skipped; `FigureType` classification matches keyword table in data-model.md; caption parsed from adjacent text line
- [X] T030 [P] Write `NeurosurgeryRetrieverTest` unit test in `backend/src/test/java/com/aiteacher/retrieval/NeurosurgeryRetrieverTest.java` — test: figure IDs from both vector hits and chunk refs are merged without duplicates; `RetrievalResult` contains the deduplicated set
- [X] T031 Run quickstart.md validation end-to-end: upload a real PDF with a labelled diagram → wait for `READY` → call `GET /api/v1/books/{id}/figures` → send a chat message about the diagram → verify `sources` contains a `FIGURE` entry → verify `imageUrl` resolves to a PNG
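
The "images below min size are skipped" case in T029 exercises logic along the lines of the sketch below. It is not the real `FigureExtractionService`: the class `FigureSizeFilter` and its helper `encodePng` are hypothetical, and skipping an image when either dimension is below the threshold is an assumption about how `min-image-size-px` is applied.

```java
import java.awt.image.BufferedImage;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Optional;
import javax.imageio.ImageIO;

final class FigureSizeFilter {
    // Decodes PNG bytes and returns the image only if both width and
    // height reach minSizePx; undecodable input yields an empty result.
    static Optional<BufferedImage> decodeIfLargeEnough(byte[] pngBytes, int minSizePx) {
        try {
            BufferedImage image = ImageIO.read(new ByteArrayInputStream(pngBytes));
            if (image == null) {
                return Optional.empty(); // no registered reader understood the bytes
            }
            boolean tooSmall = image.getWidth() < minSizePx || image.getHeight() < minSizePx;
            return tooSmall ? Optional.empty() : Optional.of(image);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Test helper: encodes an in-memory image as PNG bytes.
    static byte[] encodePng(BufferedImage image) {
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            ImageIO.write(image, "png", out);
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A unit test can build a 50×200 image with `encodePng`, pass it with the configured threshold of 100, and assert that the result is empty.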

---

## Dependencies & Execution Order

### Phase Dependencies

- **Phase 1 (Setup)**: No dependencies — start immediately
- **Phase 2 (Foundational)**: Requires Phase 1 complete (migrations must run before JPA entities can be wired)
- **Phase 3 (US2)**: Requires Phase 2 complete — all JPA entities + FigureStorageService must exist
- **Phase 4 (US1)**: Requires Phase 3 complete — figures must exist in Postgres + vector store before retrieval can surface them
- **Phase 5 (US3)**: Requires Phase 4 complete — frontend depends on the extended `sources` format from T024
- **Phase 6 (Polish)**: Requires all story phases complete

### Within Phase 3 (Embedding Pipeline)

```
T013b (MarkerPageParser) ───────────────────────────┐
T014 (FigureExtractionService) ─────────────────────┤
T015 (VisionDescriptionService) ────────────────────┤─→ T018 (BookEmbeddingService orchestrator)
T016 (TextChunkingService) ─────────────────────────┤           └─→ T019 (cleanup)
T017 (ChunkFigureRefService) ───────────────────────┘           └─→ T020 (reembed endpoint)
```

T013b and T014–T017 can be implemented in parallel (different files, no shared dependencies). T018 depends on all of them.

### Within Phase 4 (Retrieval)

```
T021 (NeurosurgeryRetriever) ──────────────────────┐
                                                   └─→ T022 (ChatService update)
                                                   └─→ T024 (extractSources update)
T023 (figures endpoint) ── independent [P]
```

### Parallel Opportunities per Phase

**Phase 2**: T005, T006, T009, T010, T011 can all run in parallel. T007 depends on T006. T008 depends on T007 (it registers `LocalFigureStorageService` as a bean). T012 can follow T010/T011.

**Phase 3**: T013b, T014, T015, T016, T017 all in parallel. T018 depends on all.

**Phase 5**: T025 and T026 in parallel; T027 can follow T026.

**Phase 6**: T029 and T030 in parallel.

---

## Implementation Strategy

### MVP: User Story 2 Only (Embedding Pipeline)

1. Phase 1 (Setup) → Phase 2 (Foundational) → Phase 3 (US2, T013–T020)
2. **Validate**: `GET /api/v1/books/{id}/figures` returns figures for a test book
3. **Stop and demo** — the pipeline produces image chunks without any retrieval changes

### Full Feature Delivery

1. Phase 1 + 2 → Foundation ready
2. Phase 3 (US2) → Embedding pipeline produces image chunks ← **demo point**
3. Phase 4 (US1) → Chat surfaces image content in responses ← **core payoff**
4. Phase 5 (US3) → Frontend renders inline figures with type badges
5. Phase 6 (Polish) → README, tests, end-to-end validation

---

## Notes

- [P] tasks = different files, no dependencies on each other within the same phase
- [US1/US2/US3] label maps each task to a user story for traceability
- Phase 3 (US2) must be fully complete before beginning Phase 4 (US1) — retrieval cannot surface figures that do not yet exist
- The `uploads/figures/` directory must exist and be writable at runtime; `FigureStorageService` creates subdirectories automatically
- Re-embedding (T020) deletes all existing chunks and figures for the book before re-running — safe to call on books processed by feature 001
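
The response rules for the re-embed endpoint in T020 reduce to a small decision, sketched here without Spring. The class `ReembedDecision` and the `BookStatus` enum are hypothetical; only `PROCESSING` and `READY` appear in this document, and the real status enum may have more values.

```java
import java.util.Optional;

final class ReembedDecision {
    // Assumed subset of book statuses; the real enum may differ.
    enum BookStatus { PROCESSING, READY }

    // Maps the current book state to the HTTP status described in T020.
    static int httpStatusFor(Optional<BookStatus> current) {
        if (current.isEmpty()) {
            return 404; // book not found
        }
        if (current.get() == BookStatus.PROCESSING) {
            return 409; // an embedding run is already in progress
        }
        return 202; // accepted: delete existing chunks, then re-embed asynchronously
    }
}
```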