Add Marker to parse PDFs effectively

2026-04-04 21:30:18 +02:00
parent b154e29f2d
commit ea1276dc2e
25 changed files with 2318 additions and 285 deletions
@@ -1,40 +1,42 @@
 # Implementation Plan: Enhanced Embedding with Image Parsing and Metadata

-**Branch**: `002-image-aware-embedding` | **Date**: 2026-04-03 | **Spec**: [spec.md](spec.md)  
+**Branch**: `002-image-aware-embedding` | **Date**: 2026-04-04 | **Spec**: [spec.md](spec.md)  
 **Input**: Feature specification from `/specs/002-image-aware-embedding/spec.md`

 ## Summary

-Enhance the book embedding pipeline to extract images from every PDF page, generate descriptive
-text for each image, and store all content (text chunks + figure captions) with rich, consistent
-metadata in the vector store. A new document hierarchy (Book → Chapter → Section → TextChunk +
-Figure) is introduced. Postgres holds the full-text sections and figure metadata; the vector
-store holds chunk and figure caption embeddings; the local file store holds extracted image files.
-At query time, both the text-chunk store and figure-caption store are searched in parallel and
-results are merged before being sent to the LLM.
+Enhance the PDF embedding pipeline to extract figures and generate AI descriptions for them,
+making image content semantically searchable alongside text. PDF parsing and figure extraction
+are delegated to a local **Marker** server (`http://localhost:8000/marker/upload`), which
+returns reading-order text and pre-cropped figure images (base64) in a single JSON response,
+eliminating the need for PDFBox column heuristics and figure bbox rendering.

 ## Technical Context

 **Language/Version**: Java 25 (backend), TypeScript / Node 20 (frontend)  
-**Primary Dependencies**: Spring Boot 4.0.5, Spring AI 2.0.0-M4, OpenAI API (embeddings + chat), PDFBox (via Spring AI PDF reader dependency)  
-**Storage**: PostgreSQL (JPA + Flyway), pgvector (Spring AI `VectorStore`), local file system (extracted images — `/uploads/figures/`)  
-**Testing**: Spring Boot Test, JUnit 5, Mockito  
-**Target Platform**: Linux server (Docker Compose)  
-**Project Type**: Web application — backend REST API + Vue 3 frontend  
-**Performance Goals**: Full book (up to 500 pages with images) processed in ≤ 30 minutes; query response unchanged from existing baseline  
-**Constraints**: No new deployable units; all changes within the existing `backend/` module; image storage on local disk (S3 migration is a future concern, behind an interface)  
-**Scale/Scope**: POC — <10 concurrent users; single shared book library
+**Primary Dependencies**: Spring Boot 4.0.5, Spring AI 2.0.0-M4, OpenAI API (embeddings +
+GPT-4o vision), PDFBox 3.0.3 (via `spring-ai-pdf-document-reader` — retained transitively,
+no longer used directly), Marker local HTTP API (`http://localhost:8000/marker/upload`)  
+**Storage**: PostgreSQL (JPA + Flyway), pgvector (Spring AI `VectorStore`), S3-compatible
+object store (figure images via `FigureStorageService`)  
+**Testing**: Maven / JUnit 5 (`spring-boot-starter-test`)  
+**Target Platform**: Linux server  
+**Project Type**: Web application (backend API + frontend client)  
+**Performance Goals**: SC-003: book processing time ≤ 3× the text-only baseline for books of ≤ 500 pages  
+**Constraints**: REST API only (Constitution III); Marker server must be running locally;
+S3-compatible storage configured via env vars  
+**Scale/Scope**: POC — handful of books, <10 users

 ## Constitution Check

-*GATE: Must pass before Phase 0 research. Re-check after Phase 1 design.*
+*GATE: Must pass before Phase 0 research. Re-checked after Phase 1 design.*

 | Principle | Status | Notes |
 |-----------|--------|-------|
-| I — KISS | ⚠️ Justified violation — see Complexity Tracking | Hierarchical model + dual search adds complexity; justified by precision requirement |
-| II — Easy to Change | ✅ | Figure storage wrapped behind `FigureStorageService` interface; can swap local disk for S3 |
-| III — Web-First | ✅ | All new capabilities exposed via existing REST API; no new deployable units |
-| IV — Docs as Architecture | ⚠️ Required | README Mermaid diagram MUST be updated in this PR to show new storage tiers |
+| **I. KISS** | ✅ Justified | Marker replaces a bespoke PDFBox column heuristic + Google Cloud SDK with one HTTP call. Net complexity reduction vs. the Document AI approach. |
+| **II. Easy to Change** | ✅ | `MarkerPageParser` is the only class that knows about Marker; swap the implementation to replace Marker with any other parser. `PageResult` DTO remains unchanged. |
+| **III. Web-First** | ✅ | Internal pipeline change; no public API contract change. |
+| **IV. Documentation** | ✅ | README must be updated to show Marker as a local external service. |
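
The single HTTP call to Marker mentioned in the Summary could be sketched roughly as follows. This is a minimal illustration, not the plan's `MarkerPageParser`: it uses the JDK's `java.net.http` types instead of the Spring `RestClient` bean named in the plan, so that it stays dependency-free. The class name `MarkerUploadSketch` and the multipart field name `file` are assumptions; only the endpoint URL comes from the plan.

```java
import java.io.ByteArrayOutputStream;
import java.net.URI;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not part of the planned code base.
final class MarkerUploadSketch {

    // Endpoint taken from the plan.
    static final String UPLOAD_URL = "http://localhost:8000/marker/upload";

    /**
     * Builds a multipart/form-data body carrying one PDF.
     * The form-field name "file" is an assumption about the Marker server.
     */
    static byte[] multipartBody(String boundary, String filename, byte[] pdfBytes) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        String partHeader = "--" + boundary + "\r\n"
                + "Content-Disposition: form-data; name=\"file\"; filename=\"" + filename + "\"\r\n"
                + "Content-Type: application/pdf\r\n\r\n";
        out.writeBytes(partHeader.getBytes(StandardCharsets.UTF_8));
        out.writeBytes(pdfBytes);
        // Closing delimiter terminates the multipart message.
        out.writeBytes(("\r\n--" + boundary + "--\r\n").getBytes(StandardCharsets.UTF_8));
        return out.toByteArray();
    }

    /** Wraps the multipart body in a POST request aimed at the local Marker server. */
    static HttpRequest uploadRequest(String boundary, String filename, byte[] pdfBytes) {
        return HttpRequest.newBuilder(URI.create(UPLOAD_URL))
                .header("Content-Type", "multipart/form-data; boundary=" + boundary)
                .POST(HttpRequest.BodyPublishers.ofByteArray(
                        multipartBody(boundary, filename, pdfBytes)))
                .build();
    }
}
```

Sending the request would then be a matter of `HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString())` and parsing the returned JSON; the response schema is not specified in this plan, so it is left out of the sketch.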

 ## Project Structure

@@ -46,60 +48,38 @@ specs/002-image-aware-embedding/
 ├── research.md          # Phase 0 output
 ├── data-model.md        # Phase 1 output
 ├── quickstart.md        # Phase 1 output
-├── contracts/           # Phase 1 output
-└── tasks.md             # Phase 2 output (/speckit.tasks)
+├── contracts/
+│   ├── api.md           # HTTP API contracts (unchanged from initial plan)
+│   └── marker-page-result.md  # Internal DTO contract (MarkerPageParser → downstream)
+└── tasks.md             # Phase 2 output (/speckit.tasks — not created here)
 ```

-### Source Code (repository root)
+### Source Code

 ```text
 backend/
 ├── src/main/java/com/aiteacher/
+│   ├── config/
+│   │   └── MarkerConfig.java          # NEW: RestClient bean + base-url property
+│   ├── document/
+│   │   ├── MarkerPageParser.java      # NEW: replaces DocumentAiPageParser + PdfStructureParser
+│   │   ├── PageResult.java            # UPDATED: FigureBbox → FigureData (bytes not bbox)
+│   │   ├── FigureExtractionService.java  # UPDATED: no PDFBox render; decode bytes directly
+│   │   ├── TextChunkingService.java   # UNCHANGED
+│   │   ├── VisionDescriptionService.java # UNCHANGED
+│   │   └── [removed] DocumentAiPageParser.java
 │   ├── book/
-│   │   ├── Book.java                         (existing)
-│   │   ├── BookController.java               (existing)
-│   │   ├── BookService.java                  (existing)
-│   │   ├── BookRepository.java               (existing)
-│   │   ├── BookStatus.java                   (existing)
-│   │   ├── BookEmbeddingService.java         (existing — enhanced)
-│   │   └── NoKnowledgeSourceException.java   (existing)
-│   ├── document/                             (new package)
-│   │   ├── BookNode.java
-│   │   ├── ChapterNode.java
-│   │   ├── SectionNode.java
-│   │   ├── SectionRepository.java
-│   │   ├── TextChunkNode.java
-│   │   ├── FigureNode.java
-│   │   ├── FigureRepository.java
-│   │   ├── FigureType.java
-│   │   ├── ChunkFigureRef.java
-│   │   └── ChunkFigureRefRepository.java
-│   ├── figure/                               (new package)
-│   │   ├── FigureStorageService.java         (interface)
-│   │   └── LocalFigureStorageService.java    (implementation)
-│   ├── retrieval/                            (new package)
-│   │   └── NeurosurgeryRetriever.java
-│   ├── chat/
-│   │   └── ChatService.java                  (updated — uses NeurosurgeryRetriever)
-│   └── config/
-│       └── FigureStorageConfig.java          (new — configures upload dir)
-└── src/main/resources/
-    └── db/migration/
-        ├── V4__document_hierarchy.sql        (new)
-        └── V5__figures_and_refs.sql          (new)
-
-uploads/
-└── figures/                                  (runtime — extracted images; gitignored)
+│   │   └── BookEmbeddingService.java  # MINOR UPDATE: inject MarkerPageParser, drop DocumentAiPageParser
+│   └── [removed] config/DocumentAiConfig.java
+├── src/main/resources/
+│   └── application.yaml               # UPDATED: remove document-ai.*, add marker.base-url
+└── pom.xml                            # UPDATED: remove google-cloud-document-ai
 ```

-**Structure Decision**: Option 2 (Web Application) confirmed. All backend changes stay within
-`backend/`. Two new packages (`document/`, `retrieval/`) plus one interface package (`figure/`)
-keep concerns separated without adding a deployable unit.
+**Structure Decision**: Option 2 (backend + frontend) per constitution Technology Constraints.
+Frontend changes are display-only (render figure citations inline).

 ## Complexity Tracking

-| Violation | Why Needed | Simpler Alternative Rejected Because |
-|-----------|------------|-------------------------------------|
-| Document hierarchy (BookNode → ChapterNode → SectionNode) | Parent-child retrieval: chunks reference their parent section so the LLM receives full section context, not just the matching fragment. This is the established solution for RAG precision. | Flat page-per-doc model (current) loses inter-sentence context; chunk-only retrieval produces incomplete answers for multi-paragraph clinical questions |
-| Dual vector search (text chunks + figure captions) | Figure captions must be independently searchable — a query about "cavernous sinus anatomy" must surface the diagram even if no text chunk scores highly | Single vector store search would miss figures whose captions don't happen to be the highest-similarity hit; this is the core deliverable of the feature |
-| Third storage tier (local file store for images) | Extracted images cannot live in Postgres (binary blobs degrade query performance) or the vector store (only vectors). A file-per-image approach is standard. | Storing images as base64 in Postgres JSONB would bloat the DB and complicate backup/restore; the `FigureStorageService` interface keeps the implementation swappable |
+> No constitution violations — Marker reduces complexity compared to the previous
+> Google Document AI approach (fewer dependencies, no GCP credentials, no 15-page batching).