Implementation Plan: Enhanced Embedding with Image Parsing and Metadata
Branch: 002-image-aware-embedding | Date: 2026-04-04 | Spec: spec.md
Input: Feature specification from /specs/002-image-aware-embedding/spec.md
Summary
Enhance the PDF embedding pipeline to extract figures and generate AI descriptions for them,
making image content semantically searchable alongside text. PDF parsing and figure extraction
are delegated to a local Marker server (http://localhost:8000/marker/upload), which
returns reading-order text and pre-cropped figure images (base64) in a single JSON response,
eliminating the need for PDFBox column heuristics and figure bbox rendering.
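As a rough illustration of why no rendering step is needed once figures arrive as base64, the sketch below decodes a set of figures into raw bytes using only the JDK. It assumes the Marker JSON has already been parsed into a name-to-base64 map; the actual response field names are not specified in this plan, and `MarkerFigureDecodingSketch` and `DecodedFigure` are hypothetical names used only for this example.

```java
import java.util.ArrayList;
import java.util.Base64;
import java.util.List;
import java.util.Map;

final class MarkerFigureDecodingSketch {

    /** Hypothetical holder for one decoded figure (not the project's actual type). */
    record DecodedFigure(String name, byte[] imageBytes) {}

    /**
     * Decodes figures that arrive as base64 strings.
     * ASSUMPTION: the response was already parsed into a name -> base64 map;
     * the real Marker JSON field names are not given in this plan.
     */
    static List<DecodedFigure> decodeFigures(Map<String, String> base64ByName) {
        List<DecodedFigure> figures = new ArrayList<>();
        for (Map.Entry<String, String> entry : base64ByName.entrySet()) {
            // Throws IllegalArgumentException if the payload is not valid base64.
            byte[] bytes = Base64.getDecoder().decode(entry.getValue());
            figures.add(new DecodedFigure(entry.getKey(), bytes));
        }
        return figures;
    }
}
```

The decoded bytes can be passed straight to storage or to a vision model, which is the step that previously required cropping a bounding box out of a rendered page.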
Technical Context
Language/Version: Java 25 (backend), TypeScript / Node 20 (frontend)
Primary Dependencies: Spring Boot 4.0.5, Spring AI 2.0.0-M4, OpenAI API (embeddings +
GPT-4o vision), PDFBox 3.0.3 (via spring-ai-pdf-document-reader — retained transitively,
no longer used directly), Marker local HTTP API (http://localhost:8000/marker/upload)
Storage: PostgreSQL (JPA + Flyway), pgvector (Spring AI VectorStore), S3-compatible
object store (figure images via FigureStorageService)
Testing: Maven / JUnit 5 (spring-boot-starter-test)
Target Platform: Linux server
Project Type: Web application (backend API + frontend client)
Performance Goals: SC-003, image-aware processing of a book of up to 500 pages takes at most 3× the text-only processing time
Constraints: REST API only (Constitution III); Marker server must be running locally;
S3-compatible storage configured via env vars
Scale/Scope: POC — handful of books, <10 users
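One possible reading of the SC-003 performance goal can be written as a small check. This is a sketch under assumptions: `ProcessingBudgetSketch` and `meetsGoal` are hypothetical names, and treating books above 500 pages as out of scope (so the check passes) is an interpretation, not something the spec excerpt states.

```java
import java.time.Duration;

final class ProcessingBudgetSketch {

    static final int MAX_PAGES = 500;          // page limit named in SC-003
    static final int MAX_SLOWDOWN_FACTOR = 3;  // "at most 3x text-only"

    /**
     * Returns true when the goal is met, or when it does not apply.
     * ASSUMPTION: books above 500 pages are outside the goal's scope.
     */
    static boolean meetsGoal(int pageCount, Duration textOnly, Duration imageAware) {
        if (pageCount > MAX_PAGES) {
            return true; // goal is scoped to books of up to 500 pages
        }
        Duration budget = textOnly.multipliedBy(MAX_SLOWDOWN_FACTOR);
        return imageAware.compareTo(budget) <= 0;
    }
}
```

For example, a 300-page book that takes 2 minutes text-only has a 6-minute budget for image-aware processing under this reading.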
Constitution Check
GATE: Must pass before Phase 0 research. Re-checked after Phase 1 design.
| Principle | Status | Notes |
|---|---|---|
| I. KISS | ✅ Justified | Marker replaces a bespoke PDFBox column heuristic + Google Cloud SDK with one HTTP call. Net complexity reduction vs. the Document AI approach. |
| II. Easy to Change | ✅ | MarkerPageParser is the only class that knows about Marker; swap the implementation to replace Marker with any other parser. PageResult DTO remains unchanged. |
| III. Web-First | ✅ | Internal pipeline change; no public API contract change. |
| IV. Documentation | ✅ | README must be updated to list Marker as an external service that runs locally. |
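The "Easy to Change" row can be pictured as a single seam between parsing and the rest of the pipeline. The sketch below is illustrative only: the plan names the `MarkerPageParser` class and the `PageResult` DTO, while the `PageParser` interface, the `EmbeddingStep` class, and the simplified `PageResult` fields here are hypothetical.

```java
import java.util.List;

final class ParserSeamSketch {

    /** Simplified stand-in for the plan's PageResult DTO (real fields not shown in this plan). */
    record PageResult(int pageNumber, String text) {}

    /** HYPOTHETICAL interface: the plan itself names only the MarkerPageParser class. */
    interface PageParser {
        List<PageResult> parse(byte[] pdfBytes);
    }

    /** Downstream code depends on the seam only, never on Marker directly. */
    static final class EmbeddingStep {
        private final PageParser parser;

        EmbeddingStep(PageParser parser) {
            this.parser = parser;
        }

        /** Counts pages that carry non-blank text, regardless of which parser produced them. */
        int countPagesWithText(byte[] pdfBytes) {
            int count = 0;
            for (PageResult page : parser.parse(pdfBytes)) {
                if (!page.text().isBlank()) {
                    count++;
                }
            }
            return count;
        }
    }
}
```

Because downstream code sees only the parser's output type, replacing Marker would mean supplying a different implementation of the parsing step and leaving everything after it untouched.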
Project Structure
Documentation (this feature)
specs/002-image-aware-embedding/
├── plan.md # This file
├── research.md # Phase 0 output
├── data-model.md # Phase 1 output
├── quickstart.md # Phase 1 output
├── contracts/
│ ├── api.md # HTTP API contracts (unchanged from initial plan)
│ └── marker-page-result.md # Internal DTO contract (MarkerPageParser → downstream)
└── tasks.md # Phase 2 output (/speckit.tasks — not created here)
Source Code
backend/
├── src/main/java/com/aiteacher/
│ ├── config/
│ │ └── MarkerConfig.java # NEW: RestClient bean + base-url property
│ ├── document/
│ │ ├── MarkerPageParser.java # NEW: replaces DocumentAiPageParser + PdfStructureParser
│ │ ├── PageResult.java # UPDATED: FigureBbox → FigureData (bytes not bbox)
│ │ ├── FigureExtractionService.java # UPDATED: no PDFBox render; decode bytes directly
│ │ ├── TextChunkingService.java # UNCHANGED
│ │ ├── VisionDescriptionService.java # UNCHANGED
│ │ └── [removed] DocumentAiPageParser.java
│ ├── book/
│ │ └── BookEmbeddingService.java # MINOR UPDATE: inject MarkerPageParser, drop DocumentAiPageParser
│ └── [removed] config/DocumentAiConfig.java
├── src/main/resources/
│ └── application.yaml # UPDATED: remove document-ai.*, add marker.base-url
└── pom.xml # UPDATED: remove google-cloud-document-ai
Structure Decision: Option 2 (backend + frontend) per constitution Technology Constraints. Frontend changes are display-only (render figure citations inline).
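The `PageResult` change listed in the source tree (figure bytes instead of a bounding box) might look roughly like the following. This is a sketch, not the project's actual DTO: the field names `figureId`, `imageBytes`, `pageNumber`, `text`, and `figures` are assumptions, and the wrapper class `PageResultSketch` exists only to keep the example self-contained.

```java
import java.util.List;

final class PageResultSketch {

    /** Figure payload as raw image bytes, replacing a bounding box that had to be rendered. */
    record FigureData(String figureId, byte[] imageBytes) {
        FigureData {
            // Defensive copy so later changes to the caller's array do not leak in.
            imageBytes = imageBytes.clone();
        }

        @Override
        public byte[] imageBytes() {
            // Return a copy so callers cannot mutate the stored bytes.
            return imageBytes.clone();
        }
    }

    /** One parsed page: reading-order text plus any figures found on it. */
    record PageResult(int pageNumber, String text, List<FigureData> figures) {
        PageResult {
            figures = List.copyOf(figures); // unmodifiable snapshot
        }
    }
}
```

Carrying bytes rather than coordinates means a consumer such as `FigureExtractionService` no longer needs access to the source PDF to obtain the image.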
Complexity Tracking
No constitution violations — Marker reduces complexity compared to the previous Google Document AI approach (fewer dependencies, no GCP credentials, no 15-page batching).