Adrien ea1276dc2e adding Marker to parse effectively pdf

2026-04-04 21:30:18 +02:00

Tasks: Enhanced Embedding with Image Parsing and Metadata

Input: Design documents from /specs/002-image-aware-embedding/
Prerequisites: plan.md ✓ | spec.md ✓ | research.md ✓ | data-model.md ✓ | contracts/ ✓

Organization: Tasks grouped by user story to enable independent implementation and testing.

Format: `[ID] [P?] [Story] Description`

[P]: Can run in parallel (different files, no shared dependencies)
[US1/US2/US3]: Which user story this task belongs to

Phase 1: Setup (Shared Infrastructure)

Purpose: Database migrations and configuration that establish the foundation for all new code

T001 Create Flyway migration V4__document_hierarchy.sql — add chapter and section tables per data-model.md §Postgres Schema in backend/src/main/resources/db/migration/V4__document_hierarchy.sql
T002 Create Flyway migration V5__figures_and_refs.sql — add figure and chunk_figure_ref tables per data-model.md §Postgres Schema in backend/src/main/resources/db/migration/V5__figures_and_refs.sql
T003 Add figure-storage configuration keys to backend/src/main/resources/application.properties: app.figure-storage.base-path=./uploads and app.figure-storage.min-image-size-px=100
T004 Add uploads/ directory to .gitignore at repo root; create uploads/figures/.gitkeep to preserve directory structure

Phase 2: Foundational (Blocking Prerequisites)

Purpose: Core types and infrastructure that ALL user stories depend on — nothing in Phase 3+ can start until this phase is complete

⚠️ CRITICAL: No user story work can begin until this phase is complete

T005 [P] Create FigureType enum in backend/src/main/java/com/aiteacher/document/FigureType.java — values: ANATOMICAL_DIAGRAM, SURGICAL_PHOTOGRAPH, MRI_CT_SCAN, TABLE, CHART, INTRAOPERATIVE_IMAGE
T006 [P] Create FigureStorageService interface in backend/src/main/java/com/aiteacher/figure/FigureStorageService.java — declare Path save(UUID bookId, String figureId, BufferedImage image), Path resolve(UUID bookId, String filename), and void delete(UUID bookId)
T007 Create LocalFigureStorageService implementation in backend/src/main/java/com/aiteacher/figure/LocalFigureStorageService.java — writes PNG files under ${app.figure-storage.base-path}/figures/{bookId}/; implements FigureStorageService; depends on T006
T008 Create FigureStorageConfig bean in backend/src/main/java/com/aiteacher/config/FigureStorageConfig.java — reads app.figure-storage.base-path and app.figure-storage.min-image-size-px as @ConfigurationProperties; registers LocalFigureStorageService as @Bean; adds ResourceHandler mapping GET /api/v1/figures/** to the base-path directory
T009 [P] Create ChapterEntity JPA entity and ChapterRepository in backend/src/main/java/com/aiteacher/document/ — @Entity(name="chapter"), fields: id (String PK), bookId (UUID FK → book), number (int), title (String), pageStart (int), createdAt (Instant); ChapterRepository extends JpaRepository<ChapterEntity, String>
T010 [P] Create SectionEntity JPA entity and SectionRepository in backend/src/main/java/com/aiteacher/document/ — @Entity(name="section"), fields: id (String PK), chapterId (String FK → chapter), bookId (UUID FK → book), number (String), title (String), pageStart/pageEnd (int), fullText (TEXT column), createdAt (Instant); SectionRepository extends JpaRepository<SectionEntity, String> with findAllByBookId(UUID)
T011 [P] Create FigureEntity JPA entity and FigureRepository in backend/src/main/java/com/aiteacher/document/ — @Entity(name="figure"), fields: id (String PK), bookId (UUID), sectionId (String, nullable), chapterId (String, nullable), label (String), caption (TEXT), figureType (@Enumerated FigureType), page (int), imagePath (String), captionEmbeddingId (UUID, nullable), createdAt (Instant); FigureRepository with findAllByBookId(UUID), deleteAllByBookId(UUID)
T012 Create ChunkFigureRefEntity JPA entity and ChunkFigureRefRepository in backend/src/main/java/com/aiteacher/document/ — composite PK (chunkId UUID, figureId String), mentionPage (int); ChunkFigureRefRepository with findByChunkIdIn(List<UUID>), deleteByFigureIdIn(List<String>)

Checkpoint: Migrations will run on next startup; all JPA entities are wired; figure storage reads config correctly
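
The storage contract from T006 and the local implementation from T007 might look roughly like the sketch below. The three method signatures come from T006. Everything else is an assumption made for illustration: the `<figureId>.png` file name, wrapping `IOException` in `UncheckedIOException`, and the path-traversal check in `resolve`. The Spring wiring from T008 is left out so the sketch stays plain Java.

```java
import java.awt.image.BufferedImage;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.UUID;
import java.util.stream.Stream;
import javax.imageio.ImageIO;

/** Contract from T006: save, resolve, and delete figure images per book. */
interface FigureStorageService {
    Path save(UUID bookId, String figureId, BufferedImage image);
    Path resolve(UUID bookId, String filename);
    void delete(UUID bookId);
}

/** Sketch of T007: stores PNG files under {basePath}/figures/{bookId}/. */
final class LocalFigureStorageService implements FigureStorageService {
    private final Path basePath;

    LocalFigureStorageService(Path basePath) {
        this.basePath = basePath;
    }

    private Path bookDir(UUID bookId) {
        return basePath.resolve("figures").resolve(bookId.toString());
    }

    @Override
    public Path save(UUID bookId, String figureId, BufferedImage image) {
        // Assumption: the stored file name is "<figureId>.png".
        Path target = resolve(bookId, figureId + ".png");
        try {
            // Create the per-book subdirectory on demand.
            Files.createDirectories(target.getParent());
            if (!ImageIO.write(image, "png", target.toFile())) {
                throw new IOException("No PNG writer available for " + target);
            }
            return target;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    @Override
    public Path resolve(UUID bookId, String filename) {
        Path dir = bookDir(bookId).normalize();
        Path resolved = dir.resolve(filename).normalize();
        // Reject names such as "../x.png" that would escape the book directory.
        if (!resolved.startsWith(dir)) {
            throw new IllegalArgumentException("Invalid figure filename: " + filename);
        }
        return resolved;
    }

    @Override
    public void delete(UUID bookId) {
        Path dir = bookDir(bookId);
        if (!Files.exists(dir)) {
            return; // nothing stored for this book
        }
        try (Stream<Path> paths = Files.walk(dir)) {
            // Reverse order visits files before the directories that contain them.
            paths.sorted(Comparator.reverseOrder()).forEach(p -> {
                try {
                    Files.delete(p);
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Constructed with `Path.of("./uploads")`, a call such as `save(bookId, "fig-3-2", image)` would write `uploads/figures/{bookId}/fig-3-2.png` and create the book directory on first use.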

Phase 3: User Story 2 — All Pages Scanned for Images During Embedding (Priority: P1)

Goal: When a book is uploaded, every page is inspected for images; each found image is extracted, persisted, described, and embedded as a searchable chunk alongside its metadata

Independent Test: Upload a PDF containing at least one page with a labelled anatomical diagram. After status shows READY, call GET /api/v1/books/{id}/figures — response must contain at least one entry with figureType, caption, page, and imageUrl populated. Verify the PNG file exists at the path in imagePath.

T013 [US2] ~~Create PdfStructureParser~~ → SUPERSEDED: PDF parsing is handled by MarkerPageParser (see T013b). PdfStructureParser exists but is not wired into the pipeline.
T013b [US2] Create MarkerPageParser in backend/src/main/java/com/aiteacher/document/MarkerPageParser.java — POSTs PDF to http://localhost:8000/marker/upload?output_format=json via Spring RestClient; parses JSON response into List<PageResult> (one per page block); extracts heading, ordered text, and pre-cropped figure PNG bytes per page
T014 [US2] Update FigureExtractionService in backend/src/main/java/com/aiteacher/document/FigureExtractionService.java — Marker migration: removed PDFBox rendering + bbox-crop loop; decodes PNG bytes from PageResult.FigureData via ImageIO.read(); skips images below min-image-size-px; classifies FigureType; saves via FigureStorageService; persists FigureEntity
T015 [US2] Create VisionDescriptionService in backend/src/main/java/com/aiteacher/document/VisionDescriptionService.java — accepts a Path to a PNG and a caption String; calls the OpenAI vision model (via Spring AI ChatClient with image media type) to generate a 2–4 sentence clinical description; returns the generated description string; handles API failures by returning the caption as fallback
T016 [US2] Create TextChunkingService in backend/src/main/java/com/aiteacher/document/TextChunkingService.java — accepts a SectionEntity; splits fullText into overlapping 400–600 token windows (20-token overlap); wraps each window in a Spring AI Document with the flat metadata map defined in data-model.md §Text chunk document; returns List<Document>
T017 [US2] Create ChunkFigureRefService in backend/src/main/java/com/aiteacher/document/ChunkFigureRefService.java — accepts a Spring AI Document (with its id as chunkId) and a List<FigureEntity> for the book; scans chunk text for patterns Fig\.\s*\d+[\-\.]\d+ and Figure\s+\d+[\-\.]\d+; matches against figure labels; persists ChunkFigureRefEntity rows via ChunkFigureRefRepository
T018 [US2] Update BookEmbeddingService.embedBook() — Marker migration: injected MarkerPageParser replacing DocumentAiPageParser; updated figureExtractionService.extract() call (removed pdfPath arg); updated log message. Pipeline: (1) MarkerPageParser → List<PageResult>; (2) buildAndSaveSections() → sections; (3) TextChunkingService → chunks → embed; (4) FigureExtractionService.extract() → figures; (5) VisionDescriptionService → embed figure chunks; (6) ChunkFigureRefService → refs
T019 [US2] Extend BookEmbeddingService.deleteBookChunks() to also delete: all ChunkFigureRefEntity rows (via deleteByFigureIdIn, using the figure IDs returned by findAllByBookId), all FigureEntity rows (via deleteAllByBookId), all figure PNG files (via FigureStorageService.delete(bookId)), all SectionEntity and ChapterEntity rows for the book
T020 [US2] Add POST /api/v1/books/{id}/reembed endpoint to BookController in backend/src/main/java/com/aiteacher/book/BookController.java — returns 202 with { bookId, status: "PROCESSING" }; returns 404 if not found; returns 409 if already PROCESSING; calls deleteBookChunks() then embedBook() asynchronously
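
The reference scan described in T017 could be sketched as follows. The two regular expressions are the ones listed in T017, with capture groups added around the numbers. The class name `FigureMentionScanner`, the normalized key format (`"3.2"` for both `3-2` and `3.2`), and the assumption that stored figure labels look like `Fig. 3.2` or `Figure 3-2` are all hypothetical; the real label format is defined elsewhere.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class FigureMentionScanner {
    // Patterns from T017, with capture groups for chapter and figure number.
    private static final Pattern FIG = Pattern.compile("Fig\\.\\s*(\\d+)[\\-\\.](\\d+)");
    private static final Pattern FIGURE = Pattern.compile("Figure\\s+(\\d+)[\\-\\.](\\d+)");

    /** Returns one normalized key (for example "3.2") per distinct figure mentioned in the text. */
    static Set<String> mentionedKeys(String text) {
        Set<String> keys = new LinkedHashSet<>();
        for (Pattern pattern : new Pattern[] {FIG, FIGURE}) {
            Matcher matcher = pattern.matcher(text);
            while (matcher.find()) {
                // Normalize "3-2" and "3.2" to the same key.
                keys.add(matcher.group(1) + "." + matcher.group(2));
            }
        }
        return keys;
    }

    /** True if the figure label refers to one of the figures mentioned in a chunk. */
    static boolean labelMatches(String label, Set<String> mentioned) {
        for (String key : mentionedKeys(label)) {
            if (mentioned.contains(key)) {
                return true;
            }
        }
        return false;
    }
}
```

For each chunk, the service would call `mentionedKeys(chunkText)` once and then persist a ChunkFigureRefEntity for every figure whose label passes `labelMatches`.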

Checkpoint: Upload a PDF with figures → poll GET /api/v1/books for READY → GET /api/v1/books/{id}/figures returns figure list → PNG accessible at GET /api/v1/figures/{bookId}/{filename}
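
The overlapping-window split from T016 can be illustrated with a plain sliding window. This is a minimal sketch under a loud assumption: tokens are approximated by whitespace-separated words, whereas the real service would presumably count model tokens. The class name `OverlappingWindows` is hypothetical, and wrapping each window in a Spring AI Document with metadata is omitted.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class OverlappingWindows {
    /**
     * Splits text into windows of at most windowSize tokens. Each window after
     * the first starts with the last `overlap` tokens of the previous window.
     */
    static List<String> split(String fullText, int windowSize, int overlap) {
        if (windowSize <= 0 || overlap < 0 || overlap >= windowSize) {
            throw new IllegalArgumentException("Require 0 <= overlap < windowSize");
        }
        List<String> windows = new ArrayList<>();
        String trimmed = fullText.strip();
        if (trimmed.isEmpty()) {
            return windows;
        }
        // Assumption: whitespace-separated words stand in for tokens.
        String[] tokens = trimmed.split("\\s+");
        int step = windowSize - overlap;
        for (int start = 0; start < tokens.length; start += step) {
            int end = Math.min(start + windowSize, tokens.length);
            windows.add(String.join(" ", Arrays.copyOfRange(tokens, start, end)));
            if (end == tokens.length) {
                break; // the last window reached the end of the text
            }
        }
        return windows;
    }
}
```

With the values from T016, a call such as `OverlappingWindows.split(section.getFullText(), 500, 20)` would produce windows inside the 400–600 token range that share 20 tokens with their predecessor; `getFullText()` is an assumed accessor name.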

Phase 4: User Story 1 — Image Content Surfaced in Query Results (Priority: P1)

Goal: User asks a question answered by a diagram — the system retrieves that diagram's content and surfaces it in the chat response with a citation

Independent Test: With a book embedded (Phase 3 checkpoint passed), ask a chat question whose answer is depicted only in a diagram. The response sources array must contain at least one entry with type: "FIGURE" and a non-empty imageUrl.

T021 [US1] Create NeurosurgeryRetriever service in backend/src/main/java/com/aiteacher/retrieval/NeurosurgeryRetriever.java — (1) text chunk search: vectorStore.similaritySearch with filter type == TEXT AND book_id == bookId, topK=5; (2) figure search: same store, filter type == FIGURE AND book_id == bookId, topK=3; (3) expand text chunk results to parent sections via SectionRepository.findAllById(sectionIds); (4) fetch explicitly linked figures via ChunkFigureRefRepository.findByChunkIdIn(chunkIds) + FigureRepository.findAllById; (5) deduplicate figures across lists by figureId; return RetrievalResult(parentSections, figureVectorHits, linkedFigures) — add RetrievalResult record in same package
T022 [US1] Refactor ChatService.sendMessage() in backend/src/main/java/com/aiteacher/chat/ChatService.java — replace QuestionAnswerAdvisor with a manual call to NeurosurgeryRetriever; build the LLM user message from: section full texts as [Section X.Y — Title, pp.A-B]\n{fullText} blocks, followed by AVAILABLE FIGURES FOR THIS SECTION: list with - {label} (p.{page}): {caption} [image: {filename}] lines per figure; append the instruction When referencing diagrams, cite them as [Fig. X, p.N].; send via chatClient.prompt().system(SYSTEM_PROMPT).user(prompt).call()
T023 [US1] Add GET /api/v1/books/{id}/figures endpoint to BookController — returns 200 with List<FigureResponse>; FigureResponse is a new record in backend/src/main/java/com/aiteacher/book/FigureResponse.java with fields figureId, label, caption, figureType, page, imageUrl (assembled as /api/v1/figures/{bookId}/{filename}), sectionId, sectionTitle; returns 404 if book not found
T024 [US1] Update extractSources() in ChatService to build both TEXT and FIGURE source entries: TEXT entries keep existing fields plus "type": "TEXT"; FIGURE entries add "type": "FIGURE", "figureId", "label", "caption", "figureType", "imageUrl" — source data comes from RetrievalResult (text chunk Documents and merged FigureEntity list)
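
Step (5) of T021, deduplicating figures that arrive from both the vector search and the explicit chunk references, could be done with an insertion-ordered map keyed by figure ID. The sketch below uses a hypothetical `FigureStub` record in place of FigureEntity, and the choice to keep the first occurrence (vector hits before linked figures) is an assumption, not something the task list specifies.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for FigureEntity with only the fields needed here. */
record FigureStub(String id, String label) {}

final class FigureMerger {
    /** Merges both lists, keeping the first occurrence of each figure ID. */
    static List<FigureStub> mergeById(List<FigureStub> vectorHits, List<FigureStub> linked) {
        // LinkedHashMap preserves insertion order, so vector hits come first.
        Map<String, FigureStub> byId = new LinkedHashMap<>();
        for (FigureStub figure : vectorHits) {
            byId.putIfAbsent(figure.id(), figure);
        }
        for (FigureStub figure : linked) {
            byId.putIfAbsent(figure.id(), figure);
        }
        return new ArrayList<>(byId.values());
    }
}
```

This is also the behaviour T030 is meant to verify: a figure found by both routes appears once in the merged result.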

Checkpoint: Chat question answered by a diagram → response body contains sources[n].type == "FIGURE" with populated imageUrl; image loads from the returned URL
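
The user-message layout described in T022 might be assembled as in the sketch below. The header format, the figure-line format, and the closing instruction are taken from T022. The record names `SectionBlock` and `FigureLine`, the blank lines between blocks, and the decision to emit one figure list after all sections are assumptions. T022 does not say where the user's question is placed, so it is left out here.

```java
import java.util.List;

/** Hypothetical view of a section with only the fields the prompt needs. */
record SectionBlock(String number, String title, int pageStart, int pageEnd, String fullText) {}

/** Hypothetical view of a figure with only the fields the prompt needs. */
record FigureLine(String label, int page, String caption, String filename) {}

final class PromptBuilder {
    static final String CITE_INSTRUCTION =
            "When referencing diagrams, cite them as [Fig. X, p.N].";

    static String build(List<SectionBlock> sections, List<FigureLine> figures) {
        StringBuilder sb = new StringBuilder();
        for (SectionBlock s : sections) {
            // The Unicode escape below is the long dash from the T022 header format.
            sb.append("[Section ").append(s.number()).append(" \u2014 ").append(s.title())
              .append(", pp.").append(s.pageStart()).append('-').append(s.pageEnd()).append("]\n")
              .append(s.fullText()).append("\n\n");
        }
        if (!figures.isEmpty()) {
            sb.append("AVAILABLE FIGURES FOR THIS SECTION:\n");
            for (FigureLine f : figures) {
                sb.append("- ").append(f.label()).append(" (p.").append(f.page()).append("): ")
                  .append(f.caption()).append(" [image: ").append(f.filename()).append("]\n");
            }
            sb.append('\n');
        }
        sb.append(CITE_INSTRUCTION);
        return sb.toString();
    }
}
```

The returned string would then be passed as the `prompt` argument in `chatClient.prompt().system(SYSTEM_PROMPT).user(prompt).call()`.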

Phase 5: User Story 3 — Rich Metadata Enables Precise Source Attribution (Priority: P2)

Goal: Users see distinct, informative citations for text vs. image sources; image sources render inline in the chat UI

Independent Test: After triggering a response with figure sources, inspect the chat message in the UI — text sources and figure sources are visually distinguishable; figure sources render the actual image inline using the imageUrl

T025 [P] [US3] Update API response types in frontend/src/services/api.ts — extend the Source type to include type: 'TEXT' | 'FIGURE', figureId?: string, label?: string, caption?: string, figureType?: string, imageUrl?: string
T026 [P] [US3] Update the chat source/citation display in the frontend (wherever sources are currently rendered, e.g. frontend/src/components/ or frontend/src/views/) — render TEXT sources with a document icon and page number; render FIGURE sources with the image (<img :src="source.imageUrl">) below the label and caption text
T027 [US3] Add figure-type badge rendering in the frontend figure display: show a label derived from figureType (e.g. "MRI / CT", "Anatomical Diagram", "Table") alongside the figure caption so users can identify content type without opening the image

Phase 6: Polish & Cross-Cutting Concerns

T028 Update README.md Mermaid architecture diagram to show three storage tiers: pgvector (semantic search), Postgres (source of truth for sections, figures, and refs), and file store (extracted PNGs). Constitution Principle IV requires this update in the same PR as the other changes
T029 [P] Write FigureExtractionServiceTest unit test in backend/src/test/java/com/aiteacher/document/FigureExtractionServiceTest.java — test: images below min size are skipped; FigureType classification matches keyword table in data-model.md; caption parsed from adjacent text line
T030 [P] Write NeurosurgeryRetrieverTest unit test in backend/src/test/java/com/aiteacher/retrieval/NeurosurgeryRetrieverTest.java — test: figure IDs from both vector hits and chunk refs are merged without duplicates; RetrievalResult contains the deduplicated set
T031 Run quickstart.md validation end-to-end: upload a real PDF with a labelled diagram → wait for READY → call GET /api/v1/books/{id}/figures → send a chat message about the diagram → verify sources contains a FIGURE entry → verify imageUrl resolves to a PNG
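
The size check that T014 performs and T029 tests ("images below min size are skipped") could be isolated in a small helper like the one below. This is a sketch, assuming that both width and height must reach `app.figure-storage.min-image-size-px`; the task list does not say whether the threshold applies to one dimension or both. The class name `FigureSizeFilter` and the `encodePng` helper (included only so the example can produce PNG bytes) are hypothetical.

```java
import java.awt.image.BufferedImage;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Optional;
import javax.imageio.ImageIO;

final class FigureSizeFilter {
    /** Decodes PNG bytes and returns the image only if it meets the minimum size. */
    static Optional<BufferedImage> decodeIfLargeEnough(byte[] pngBytes, int minSizePx) {
        try {
            BufferedImage image = ImageIO.read(new ByteArrayInputStream(pngBytes));
            if (image == null) {
                return Optional.empty(); // bytes were not a readable image
            }
            // Assumption: both dimensions must reach the threshold.
            boolean largeEnough = image.getWidth() >= minSizePx && image.getHeight() >= minSizePx;
            return largeEnough ? Optional.of(image) : Optional.empty();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Helper for examples and tests: encodes an image as PNG bytes. */
    static byte[] encodePng(BufferedImage image) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try {
            ImageIO.write(image, "png", out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }
}
```

A unit test in the spirit of T029 would encode one image below the threshold and one above it, then check that only the larger one comes back non-empty.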

Dependencies & Execution Order

Phase Dependencies

Phase 1 (Setup): No dependencies — start immediately
Phase 2 (Foundational): Requires Phase 1 complete (migrations must run before JPA entities can be wired)
Phase 3 (US2): Requires Phase 2 complete — all JPA entities + FigureStorageService must exist
Phase 4 (US1): Requires Phase 3 complete — figures must exist in Postgres + vector store before retrieval can surface them
Phase 5 (US3): Requires Phase 4 complete — frontend depends on the extended sources format from T024
Phase 6 (Polish): Requires all story phases complete

Within Phase 3 (Embedding Pipeline)

T013b (MarkerPageParser) ───────────────────────────┐
T014 (FigureExtractionService) ─────────────────────┤
T015 (VisionDescriptionService) ────────────────────┤─→ T018 (BookEmbeddingService orchestrator)
T016 (TextChunkingService) ─────────────────────────┤           └─→ T019 (cleanup)
T017 (ChunkFigureRefService) ───────────────────────┘           └─→ T020 (reembed endpoint)

T013b–T017 can be implemented in parallel (different files, no shared dependencies). T013 is superseded by T013b and needs no work. T018 depends on all of them.

Within Phase 4 (Retrieval)

T021 (NeurosurgeryRetriever) ──────────────────────┐
                                                   └─→ T022 (ChatService update)
                                                   └─→ T024 (extractSources update)
T023 (figures endpoint) ── independent [P]

Parallel Opportunities per Phase

Phase 2: T005, T006, T009, T010, T011 can all run in parallel. T007 depends on T006, and T008 depends on T007 (it registers LocalFigureStorageService as a bean). T012 can follow T010/T011.

Phase 3: T013b, T014, T015, T016, T017 all in parallel. T018 depends on all.

Phase 5: T025 and T026 in parallel; T027 can follow T026.

Phase 6: T029 and T030 in parallel.

Implementation Strategy

MVP: User Story 2 Only (Embedding Pipeline)

Phase 1 (Setup) → Phase 2 (Foundational) → Phase 3 (US2, T013b–T020)
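Validate: GET /api/v1/books/{id}/figures returns figures for a test book (this endpoint is added by T023 in Phase 4, so either pull T023 forward for the MVP or inspect the figure table directly)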
Stop and demo — the pipeline produces image chunks without any retrieval changes

Full Feature Delivery

Phase 1 + 2 → Foundation ready
Phase 3 (US2) → Embedding pipeline produces image chunks ← demo point
Phase 4 (US1) → Chat surfaces image content in responses ← core payoff
Phase 5 (US3) → Frontend renders inline figures with type badges
Phase 6 (Polish) → README, tests, end-to-end validation

Notes

[P] tasks = different files, no dependencies on each other within the same phase
[US1/US2/US3] label maps each task to a user story for traceability
Phase 3 (US2) must be fully complete before beginning Phase 4 (US1) — retrieval cannot surface figures that do not yet exist
The uploads/figures/ directory must exist and be writable at runtime; FigureStorageService creates subdirectories automatically
Re-embedding (T020) deletes all existing chunks and figures for the book before re-running — safe to call on books processed by feature 001
