Implementation Plan: Enhanced Embedding with Image Parsing and Metadata

Branch: 002-image-aware-embedding | Date: 2026-04-04 | Spec: spec.md
Input: Feature specification from /specs/002-image-aware-embedding/spec.md

Summary

Enhance the PDF embedding pipeline to extract figures and generate AI descriptions for them, so that image content is semantically searchable alongside text. PDF parsing and figure extraction are delegated to a local Marker server (http://localhost:8000/marker/upload), which returns reading-order text and pre-cropped figure images (base64-encoded) in a single JSON response. This eliminates the need for PDFBox column heuristics and for rendering figures from bounding boxes.
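
As a rough illustration of the "pre-cropped figure images (base64)" step, the sketch below decodes base64 payloads straight into image bytes using only the JDK. It assumes the figures arrive as a map from figure name to base64 string. That shape, the class name `FigureDecodingSketch`, and the fields of `FigureData` are illustrative assumptions, not the project's actual classes or Marker's documented response format.

```java
import java.util.ArrayList;
import java.util.Base64;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: class name, record fields, and input shape are assumptions.
final class FigureDecodingSketch {

    // One decoded figure: an identifier plus raw image bytes (no bounding box involved).
    record FigureData(String name, byte[] imageBytes) {}

    // Decodes each base64 payload into raw bytes, preserving the map's iteration order.
    static List<FigureData> decodeFigures(Map<String, String> base64ByName) {
        List<FigureData> figures = new ArrayList<>();
        for (Map.Entry<String, String> entry : base64ByName.entrySet()) {
            // Throws IllegalArgumentException if the payload is not valid base64.
            byte[] bytes = Base64.getDecoder().decode(entry.getValue());
            figures.add(new FigureData(entry.getKey(), bytes));
        }
        return figures;
    }

    public static void main(String[] args) {
        // Stand-in for the image section of a parser response.
        Map<String, String> images = new LinkedHashMap<>();
        images.put("figure_1.png", Base64.getEncoder().encodeToString(new byte[] {1, 2, 3}));

        List<FigureData> figures = decodeFigures(images);
        System.out.println(figures.get(0).name() + " -> " + figures.get(0).imageBytes().length + " bytes");
    }
}
```

Because the bytes are already cropped, a downstream service could hand them directly to storage or to the vision model without any PDF rendering step.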

Technical Context

Language/Version: Java 25 (backend), TypeScript / Node 20 (frontend)
Primary Dependencies: Spring Boot 4.0.5, Spring AI 2.0.0-M4, OpenAI API (embeddings + GPT-4o vision), PDFBox 3.0.3 (via spring-ai-pdf-document-reader — retained transitively, no longer used directly), Marker local HTTP API (http://localhost:8000/marker/upload)
Storage: PostgreSQL (JPA + Flyway), pgvector (Spring AI VectorStore), S3-compatible object store (figure images via FigureStorageService)
Testing: Maven / JUnit 5 (spring-boot-starter-test)
Target Platform: Linux server
Project Type: Web application (backend API + frontend client)
Performance Goals: SC-003 (book processing time ≤ 3× the text-only baseline for books of ≤ 500 pages)
Constraints: REST API only (Constitution III); Marker server must be running locally; S3-compatible storage configured via env vars
Scale/Scope: POC — handful of books, <10 users
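
The SC-003 performance goal listed above can be written as a simple check. This is a minimal sketch: the class and method names are hypothetical, and the durations in `main` are made-up illustration values, not measurements.

```java
import java.time.Duration;

// Hypothetical sketch of the SC-003 check; names and sample timings are assumptions.
final class ProcessingBudgetSketch {

    static final int MAX_PAGES = 500;          // SC-003 applies to books of up to 500 pages
    static final long MAX_SLOWDOWN_FACTOR = 3; // image-aware run may take at most 3x text-only

    // Returns true when the image-aware run stays within 3x the text-only baseline.
    static boolean meetsSc003(int pageCount, Duration textOnly, Duration imageAware) {
        if (pageCount < 1 || pageCount > MAX_PAGES) {
            throw new IllegalArgumentException("SC-003 covers books of 1 to " + MAX_PAGES + " pages");
        }
        Duration budget = textOnly.multipliedBy(MAX_SLOWDOWN_FACTOR);
        return imageAware.compareTo(budget) <= 0;
    }

    public static void main(String[] args) {
        Duration textOnly = Duration.ofMinutes(4); // illustrative baseline
        System.out.println(meetsSc003(320, textOnly, Duration.ofMinutes(11))); // 11 min <= 12 min budget
        System.out.println(meetsSc003(320, textOnly, Duration.ofMinutes(13))); // 13 min > 12 min budget
    }
}
```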

Constitution Check

GATE: Must pass before Phase 0 research. Re-checked after Phase 1 design.

Principle	Status	Notes
I. KISS	✅ Justified	Marker replaces a bespoke PDFBox column heuristic + Google Cloud SDK with one HTTP call. Net complexity reduction vs. the Document AI approach.
II. Easy to Change	✅	`MarkerPageParser` is the only class that knows about Marker; swap the implementation to replace Marker with any other parser. `PageResult` DTO remains unchanged.
III. Web-First	✅	Internal pipeline change; no public API contract change.
IV. Documentation	✅	README must be updated to show Marker as a local external service.
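
The "Easy to Change" row above relies on a single class knowing about Marker. One way to picture that seam is sketched below, assuming a hypothetical `PageParser` interface that a Marker-backed parser would implement. The plan does not say such an interface exists, so the interface, the stub, and the record fields are all illustrative assumptions.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical sketch: PageParser, StubPageParser, and the record fields are assumptions.
final class SwappableParserSketch {

    record FigureData(String name, byte[] imageBytes) {}

    record PageResult(int pageNumber, String text, List<FigureData> figures) {}

    // The seam: downstream code depends on this interface, never on Marker itself.
    interface PageParser {
        List<PageResult> parse(byte[] pdfBytes);
    }

    // Stand-in for a Marker-backed parser; a real one would call the Marker HTTP API here.
    static final class StubPageParser implements PageParser {
        @Override
        public List<PageResult> parse(byte[] pdfBytes) {
            return List.of(new PageResult(1, "Stub page text", List.of()));
        }
    }

    // Downstream consumer that only sees PageParser and PageResult.
    static final class EmbeddingPipeline {
        private final PageParser parser;

        EmbeddingPipeline(PageParser parser) {
            this.parser = parser;
        }

        // Joins the text of all parsed pages, one page per line.
        String collectText(byte[] pdfBytes) {
            return parser.parse(pdfBytes).stream()
                    .map(PageResult::text)
                    .collect(Collectors.joining("\n"));
        }
    }

    public static void main(String[] args) {
        EmbeddingPipeline pipeline = new EmbeddingPipeline(new StubPageParser());
        System.out.println(pipeline.collectText(new byte[0]));
    }
}
```

Swapping parsers then means passing a different `PageParser` to the constructor. The pipeline and the `PageResult` shape stay untouched.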

Project Structure

Documentation (this feature)

specs/002-image-aware-embedding/
├── plan.md              # This file
├── research.md          # Phase 0 output
├── data-model.md        # Phase 1 output
├── quickstart.md        # Phase 1 output
├── contracts/
│   ├── api.md           # HTTP API contracts (unchanged from initial plan)
│   └── marker-page-result.md  # Internal DTO contract (MarkerPageParser → downstream)
└── tasks.md             # Phase 2 output (/speckit.tasks — not created here)

Source Code

backend/
├── src/main/java/com/aiteacher/
│   ├── config/
│   │   └── MarkerConfig.java          # NEW: RestClient bean + base-url property
│   ├── document/
│   │   ├── MarkerPageParser.java      # NEW: replaces DocumentAiPageParser + PdfStructureParser
│   │   ├── PageResult.java            # UPDATED: FigureBbox → FigureData (bytes not bbox)
│   │   ├── FigureExtractionService.java  # UPDATED: no PDFBox render; decode bytes directly
│   │   ├── TextChunkingService.java   # UNCHANGED
│   │   ├── VisionDescriptionService.java # UNCHANGED
│   │   └── [removed] DocumentAiPageParser.java
│   ├── book/
│   │   └── BookEmbeddingService.java  # MINOR UPDATE: inject MarkerPageParser, drop DocumentAiPageParser
│   └── [removed] config/DocumentAiConfig.java
├── src/main/resources/
│   └── application.yaml               # UPDATED: remove document-ai.*, add marker.base-url
└── pom.xml                            # UPDATED: remove google-cloud-document-ai
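
The tree above introduces a `marker.base-url` property. The sketch below shows how a configured base URL might be combined with the `/marker/upload` path. It deliberately uses only `java.net.URI` rather than the Spring `RestClient` bean named in the plan. The class name, the default value, and the trailing-slash handling are assumptions made for illustration.

```java
import java.net.URI;

// Hypothetical sketch: not the project's MarkerConfig; default and names are assumptions.
final class MarkerEndpointSketch {

    // Assumed default, derived from the local URL quoted in the plan.
    static final String DEFAULT_BASE_URL = "http://localhost:8000";
    static final String UPLOAD_PATH = "/marker/upload";

    // Builds the upload endpoint from a configured base URL, falling back to the default.
    static URI uploadUri(String configuredBaseUrl) {
        String base = (configuredBaseUrl == null || configuredBaseUrl.isBlank())
                ? DEFAULT_BASE_URL
                : configuredBaseUrl.strip();
        // Drop trailing slashes so the joined URL never contains "//marker".
        while (base.endsWith("/")) {
            base = base.substring(0, base.length() - 1);
        }
        return URI.create(base + UPLOAD_PATH);
    }

    public static void main(String[] args) {
        System.out.println(uploadUri(null));                  // falls back to the assumed default
        System.out.println(uploadUri("http://marker:9000/")); // trailing slash is normalized
    }
}
```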

Structure Decision: Option 2 (backend + frontend) per constitution Technology Constraints. Frontend changes are display-only (render figure citations inline).

Complexity Tracking

No constitution violations — Marker reduces complexity compared to the previous Google Document AI approach (fewer dependencies, no GCP credentials, no 15-page batching).
