Files
ai-teacher/specs/002-image-aware-embedding/tasks.md
T
2026-04-04 21:30:18 +02:00

170 lines
17 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# Tasks: Enhanced Embedding with Image Parsing and Metadata
**Input**: Design documents from `/specs/002-image-aware-embedding/`
**Prerequisites**: plan.md ✓ | spec.md ✓ | research.md ✓ | data-model.md ✓ | contracts/ ✓
**Organization**: Tasks grouped by user story to enable independent implementation and testing.
## Format: `[ID] [P?] [Story] Description`
- **[P]**: Can run in parallel (different files, no shared dependencies)
- **[US1/US2/US3]**: Which user story this task belongs to
---
## Phase 1: Setup (Shared Infrastructure)
**Purpose**: Database migrations and configuration that establish the foundation for all new code
- [X] T001 Create Flyway migration `V4__document_hierarchy.sql` — add `chapter` and `section` tables per data-model.md §Postgres Schema in `backend/src/main/resources/db/migration/V4__document_hierarchy.sql`
- [X] T002 Create Flyway migration `V5__figures_and_refs.sql` — add `figure` and `chunk_figure_ref` tables per data-model.md §Postgres Schema in `backend/src/main/resources/db/migration/V5__figures_and_refs.sql`
- [X] T003 Add figure-storage configuration keys to `backend/src/main/resources/application.properties`: `app.figure-storage.base-path=./uploads` and `app.figure-storage.min-image-size-px=100`
- [X] T004 Add `uploads/` directory to `.gitignore` at repo root; create `uploads/figures/.gitkeep` to preserve directory structure
---
## Phase 2: Foundational (Blocking Prerequisites)
**Purpose**: Core types and infrastructure that ALL user stories depend on — nothing in Phase 3+ can start until this phase is complete
**⚠️ CRITICAL**: No user story work can begin until this phase is complete
- [X] T005 [P] Create `FigureType` enum in `backend/src/main/java/com/aiteacher/document/FigureType.java` — values: `ANATOMICAL_DIAGRAM`, `SURGICAL_PHOTOGRAPH`, `MRI_CT_SCAN`, `TABLE`, `CHART`, `INTRAOPERATIVE_IMAGE`
- [X] T006 [P] Create `FigureStorageService` interface in `backend/src/main/java/com/aiteacher/figure/FigureStorageService.java` — declare `Path save(UUID bookId, String figureId, BufferedImage image)`, `Path resolve(UUID bookId, String filename)`, and `void delete(UUID bookId)`
- [X] T007 Create `LocalFigureStorageService` implementation in `backend/src/main/java/com/aiteacher/figure/LocalFigureStorageService.java` — writes PNG files under `${app.figure-storage.base-path}/figures/{bookId}/`; implements `FigureStorageService`; depends on T006
- [X] T008 Create `FigureStorageConfig` bean in `backend/src/main/java/com/aiteacher/config/FigureStorageConfig.java` — reads `app.figure-storage.base-path` and `app.figure-storage.min-image-size-px` as `@ConfigurationProperties`; registers `LocalFigureStorageService` as `@Bean`; adds `ResourceHandler` mapping `GET /api/v1/figures/**` to the base-path directory
- [X] T009 [P] Create `ChapterEntity` JPA entity and `ChapterRepository` in `backend/src/main/java/com/aiteacher/document/``@Entity(name="chapter")`, fields: `id` (String PK), `bookId` (UUID FK → book), `number` (int), `title` (String), `pageStart` (int), `createdAt` (Instant); `ChapterRepository extends JpaRepository<ChapterEntity, String>`
- [X] T010 [P] Create `SectionEntity` JPA entity and `SectionRepository` in `backend/src/main/java/com/aiteacher/document/``@Entity(name="section")`, fields: `id` (String PK), `chapterId` (String FK → chapter), `bookId` (UUID FK → book), `number` (String), `title` (String), `pageStart`/`pageEnd` (int), `fullText` (TEXT column), `createdAt` (Instant); `SectionRepository extends JpaRepository<SectionEntity, String>` with `findAllByBookId(UUID)`
- [X] T011 [P] Create `FigureEntity` JPA entity and `FigureRepository` in `backend/src/main/java/com/aiteacher/document/``@Entity(name="figure")`, fields: `id` (String PK), `bookId` (UUID), `sectionId` (String, nullable), `chapterId` (String, nullable), `label` (String), `caption` (TEXT), `figureType` (`@Enumerated` FigureType), `page` (int), `imagePath` (String), `captionEmbeddingId` (UUID, nullable), `createdAt` (Instant); `FigureRepository` with `findAllByBookId(UUID)`, `deleteAllByBookId(UUID)`
- [X] T012 Create `ChunkFigureRefEntity` JPA entity and `ChunkFigureRefRepository` in `backend/src/main/java/com/aiteacher/document/` — composite PK `(chunkId UUID, figureId String)`, `mentionPage` (int); `ChunkFigureRefRepository` with `findByChunkIdIn(List<UUID>)`, `deleteByFigureIdIn(List<String>)`
**Checkpoint**: Migrations will run on next startup; all JPA entities are wired; figure storage reads config correctly
---
## Phase 3: User Story 2 — All Pages Scanned for Images During Embedding (Priority: P1)
**Goal**: When a book is uploaded, every page is inspected for images; each found image is extracted, persisted, described, and embedded as a searchable chunk alongside its metadata
**Independent Test**: Upload a PDF containing at least one page with a labelled anatomical diagram. After status shows `READY`, call `GET /api/v1/books/{id}/figures` — response must contain at least one entry with `figureType`, `caption`, `page`, and `imageUrl` populated. Verify the PNG file exists at the path in `imagePath`.
- [X] T013 [US2] ~~Create `PdfStructureParser`~~**SUPERSEDED**: PDF parsing is handled by `MarkerPageParser` (see T013b). `PdfStructureParser` exists but is not wired into the pipeline.
- [X] T013b [US2] Create `MarkerPageParser` in `backend/src/main/java/com/aiteacher/document/MarkerPageParser.java` — POSTs PDF to `http://localhost:8000/marker/upload?output_format=json` via Spring `RestClient`; parses JSON response into `List<PageResult>` (one per page block); extracts heading, ordered text, and pre-cropped figure PNG bytes per page
- [X] T014 [US2] Update `FigureExtractionService` in `backend/src/main/java/com/aiteacher/document/FigureExtractionService.java`**Marker migration**: removed PDFBox rendering + bbox-crop loop; decodes PNG bytes from `PageResult.FigureData` via `ImageIO.read()`; skips images below `min-image-size-px`; classifies `FigureType`; saves via `FigureStorageService`; persists `FigureEntity`
- [X] T015 [US2] Create `VisionDescriptionService` in `backend/src/main/java/com/aiteacher/document/VisionDescriptionService.java` — accepts a `Path` to a PNG and a caption String; calls the OpenAI vision model (via Spring AI `ChatClient` with image media type) to generate a 24 sentence clinical description; returns the generated description string; handles API failures by returning the caption as fallback
- [X] T016 [US2] Create `TextChunkingService` in `backend/src/main/java/com/aiteacher/document/TextChunkingService.java` — accepts a `SectionEntity`; splits `fullText` into overlapping 400600 token windows (20-token overlap); wraps each window in a Spring AI `Document` with the flat metadata map defined in data-model.md §Text chunk document; returns `List<Document>`
- [X] T017 [US2] Create `ChunkFigureRefService` in `backend/src/main/java/com/aiteacher/document/ChunkFigureRefService.java` — accepts a Spring AI `Document` (with its `id` as `chunkId`) and a `List<FigureEntity>` for the book; scans chunk text for patterns `Fig\.\s*\d+[\-\.]\d+` and `Figure\s+\d+[\-\.]\d+`; matches against figure labels; persists `ChunkFigureRefEntity` rows via `ChunkFigureRefRepository`
- [X] T018 [US2] Update `BookEmbeddingService.embedBook()`**Marker migration**: injected `MarkerPageParser` replacing `DocumentAiPageParser`; updated `figureExtractionService.extract()` call (removed `pdfPath` arg); updated log message. Pipeline: (1) `MarkerPageParser``List<PageResult>`; (2) `buildAndSaveSections()` → sections; (3) `TextChunkingService` → chunks → embed; (4) `FigureExtractionService.extract()` → figures; (5) `VisionDescriptionService` → embed figure chunks; (6) `ChunkFigureRefService` → refs
- [X] T019 [US2] Extend `BookEmbeddingService.deleteBookChunks()` to also delete: all `ChunkFigureRefEntity` rows (via `findByFigureIdIn`), all `FigureEntity` rows (via `deleteAllByBookId`), all figure PNG files (via `FigureStorageService.delete(bookId)`), all `SectionEntity` and `ChapterEntity` rows for the book
- [X] T020 [US2] Add `POST /api/v1/books/{id}/reembed` endpoint to `BookController` in `backend/src/main/java/com/aiteacher/book/BookController.java` — returns `202` with `{ bookId, status: "PROCESSING" }`; returns `404` if not found; returns `409` if already `PROCESSING`; calls `deleteBookChunks()` then `embedBook()` asynchronously
**Checkpoint**: Upload a PDF with figures → poll `GET /api/v1/books` for `READY``GET /api/v1/books/{id}/figures` returns figure list → PNG accessible at `GET /api/v1/figures/{bookId}/{filename}`
---
## Phase 4: User Story 1 — Image Content Surfaced in Query Results (Priority: P1)
**Goal**: User asks a question answered by a diagram — the system retrieves that diagram's content and surfaces it in the chat response with a citation
**Independent Test**: With a book embedded (Phase 3 checkpoint passed), ask a chat question whose answer is depicted only in a diagram. The response `sources` array must contain at least one entry with `type: "FIGURE"` and a non-empty `imageUrl`.
- [X] T021 [US1] Create `NeurosurgeryRetriever` service in `backend/src/main/java/com/aiteacher/retrieval/NeurosurgeryRetriever.java` — (1) text chunk search: `vectorStore.similaritySearch` with filter `type == TEXT AND book_id == bookId`, topK=5; (2) figure search: same store, filter `type == FIGURE AND book_id == bookId`, topK=3; (3) expand text chunk results to parent sections via `SectionRepository.findAllById(sectionIds)`; (4) fetch explicitly linked figures via `ChunkFigureRefRepository.findByChunkIdIn(chunkIds)` + `FigureRepository.findAllById`; (5) deduplicate figures across lists by `figureId`; return `RetrievalResult(parentSections, figureVectorHits, linkedFigures)` — add `RetrievalResult` record in same package
- [X] T022 [US1] Refactor `ChatService.sendMessage()` in `backend/src/main/java/com/aiteacher/chat/ChatService.java` — replace `QuestionAnswerAdvisor` with a manual call to `NeurosurgeryRetriever`; build the LLM user message from: section full texts as `[Section X.Y — Title, pp.A-B]\n{fullText}` blocks, followed by `AVAILABLE FIGURES FOR THIS SECTION:` list with `- {label} (p.{page}): {caption} [image: {filename}]` lines per figure; append the instruction `When referencing diagrams, cite them as [Fig. X, p.N].`; send via `chatClient.prompt().system(SYSTEM_PROMPT).user(prompt).call()`
- [X] T023 [US1] Add `GET /api/v1/books/{id}/figures` endpoint to `BookController` — returns `200` with `List<FigureResponse>`; `FigureResponse` is a new record in `backend/src/main/java/com/aiteacher/book/FigureResponse.java` with fields `figureId`, `label`, `caption`, `figureType`, `page`, `imageUrl` (assembled as `/api/v1/figures/{bookId}/{filename}`), `sectionId`, `sectionTitle`; returns `404` if book not found
- [X] T024 [US1] Update `extractSources()` in `ChatService` to build both TEXT and FIGURE source entries: TEXT entries keep existing fields plus `"type": "TEXT"`; FIGURE entries add `"type": "FIGURE"`, `"figureId"`, `"label"`, `"caption"`, `"figureType"`, `"imageUrl"` — source data comes from `RetrievalResult` (text chunk Documents and merged FigureEntity list)
**Checkpoint**: Chat question answered by a diagram → response body contains `sources[n].type == "FIGURE"` with populated `imageUrl`; image loads from the returned URL
---
## Phase 5: User Story 3 — Rich Metadata Enables Precise Source Attribution (Priority: P2)
**Goal**: Users see distinct, informative citations for text vs. image sources; image sources render inline in the chat UI
**Independent Test**: After triggering a response with figure sources, inspect the chat message in the UI — text sources and figure sources are visually distinguishable; figure sources render the actual image inline using the `imageUrl`
- [X] T025 [P] [US3] Update API response types in `frontend/src/services/api.ts` — extend the `Source` type to include `type: 'TEXT' | 'FIGURE'`, `figureId?: string`, `label?: string`, `caption?: string`, `figureType?: string`, `imageUrl?: string`
- [X] T026 [P] [US3] Update the chat source/citation display in the frontend (wherever sources are currently rendered, e.g. `frontend/src/components/` or `frontend/src/views/`) — render TEXT sources with a document icon and page number; render FIGURE sources with the image (`<img :src="source.imageUrl">`) below the label and caption text
- [X] T027 [US3] Add figure-type badge rendering in the frontend figure display: show a label derived from `figureType` (e.g. "MRI / CT", "Anatomical Diagram", "Table") alongside the figure caption so users can identify content type without opening the image
---
## Phase 6: Polish & Cross-Cutting Concerns
- [X] T028 Update `README.md` Mermaid architecture diagram to show three storage tiers: pgvector (semantic search), Postgres (source of truth — sections, figures, refs), and file store (extracted PNGs) — **required by Constitution Principle IV in the same PR as the other changes**
- [X] T029 [P] Write `FigureExtractionServiceTest` unit test in `backend/src/test/java/com/aiteacher/document/FigureExtractionServiceTest.java` — test: images below min size are skipped; `FigureType` classification matches keyword table in data-model.md; caption parsed from adjacent text line
- [X] T030 [P] Write `NeurosurgeryRetrieverTest` unit test in `backend/src/test/java/com/aiteacher/retrieval/NeurosurgeryRetrieverTest.java` — test: figure IDs from both vector hits and chunk refs are merged without duplicates; `RetrievalResult` contains the deduplicated set
- [X] T031 Run quickstart.md validation end-to-end: upload a real PDF with a labelled diagram → wait for `READY` → call `GET /api/v1/books/{id}/figures` → send a chat message about the diagram → verify `sources` contains a `FIGURE` entry → verify `imageUrl` resolves to a PNG
---
## Dependencies & Execution Order
### Phase Dependencies
- **Phase 1 (Setup)**: No dependencies — start immediately
- **Phase 2 (Foundational)**: Requires Phase 1 complete (migrations must run before JPA entities can be wired)
- **Phase 3 (US2)**: Requires Phase 2 complete — all JPA entities + FigureStorageService must exist
- **Phase 4 (US1)**: Requires Phase 3 complete — figures must exist in Postgres + vector store before retrieval can surface them
- **Phase 5 (US3)**: Requires Phase 4 complete — frontend depends on the extended `sources` format from T024
- **Phase 6 (Polish)**: Requires all story phases complete
### Within Phase 3 (Embedding Pipeline)
```
T013 (PdfStructureParser) ──────────────────────────┐
T014 (FigureExtractionService) ─────────────────────┤
T015 (VisionDescriptionService) ────────────────────┤─→ T018 (BookEmbeddingService orchestrator)
T016 (TextChunkingService) ─────────────────────────┤ └─→ T019 (cleanup)
T017 (ChunkFigureRefService) ───────────────────────┘ └─→ T020 (reembed endpoint)
```
T013T017 can be implemented in parallel (different files, no shared dependencies). T018 depends on all of them.
### Within Phase 4 (Retrieval)
```
T021 (NeurosurgeryRetriever) ──────────────────────┐
└─→ T022 (ChatService update)
└─→ T024 (extractSources update)
T023 (figures endpoint) ── independent [P]
```
### Parallel Opportunities per Phase
**Phase 2**: T005, T006, T009, T010, T011 can all run in parallel. T007 depends on T006. T012 can follow T010/T011.
**Phase 3**: T013, T014, T015, T016, T017 all in parallel. T018 depends on all.
**Phase 5**: T025 and T026 in parallel; T027 can follow T026.
**Phase 6**: T029 and T030 in parallel.
---
## Implementation Strategy
### MVP: User Story 2 Only (Embedding Pipeline)
1. Phase 1 (Setup) → Phase 2 (Foundational) → Phase 3 (US2, T013T020)
2. **Validate**: `GET /api/v1/books/{id}/figures` returns figures for a test book
3. **Stop and demo** — the pipeline produces image chunks without any retrieval changes
### Full Feature Delivery
1. Phase 1 + 2 → Foundation ready
2. Phase 3 (US2) → Embedding pipeline produces image chunks ← **demo point**
3. Phase 4 (US1) → Chat surfaces image content in responses ← **core payoff**
4. Phase 5 (US3) → Frontend renders inline figures with type badges
5. Phase 6 (Polish) → README, tests, end-to-end validation
---
## Notes
- [P] tasks = different files, no dependencies on each other within the same phase
- [US1/US2/US3] label maps each task to a user story for traceability
- Phase 3 (US2) must be fully complete before beginning Phase 4 (US1) — retrieval cannot surface figures that do not yet exist
- The `uploads/figures/` directory must exist and be writable at runtime; `FigureStorageService` creates subdirectories automatically
- Re-embedding (T020) deletes all existing chunks and figures for the book before re-running — safe to call on books processed by feature 001