# Implementation Plan: Enhanced Embedding with Image Parsing and Metadata
**Branch**: `002-image-aware-embedding` | **Date**: 2026-04-04 | **Spec**: [spec.md](spec.md)

**Input**: Feature specification from `/specs/002-image-aware-embedding/spec.md`
## Summary
Enhance the PDF embedding pipeline to extract figures and generate AI descriptions for them,
making image content semantically searchable alongside text. PDF parsing and figure extraction
are delegated to a local **Marker** server (`http://localhost:8000/marker/upload`), which
returns reading-order text and pre-cropped figure images (base64) in a single JSON response,
eliminating the need for PDFBox column heuristics and for rendering figures from bounding boxes.
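
The single HTTP call described above could be prepared roughly as follows. This is a minimal sketch using only the JDK's `java.net.http` types, not the plan's actual `MarkerConfig` / `RestClient` wiring. The class name `MarkerUploadRequest` is hypothetical, and the multipart field name `file` is an assumption that should be checked against the running Marker server.

```java
import java.io.ByteArrayOutputStream;
import java.net.URI;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;
import java.util.UUID;

// Hypothetical helper for illustration; the plan itself wires a Spring RestClient in MarkerConfig.
final class MarkerUploadRequest {
    private MarkerUploadRequest() {}

    /** Builds a single-part multipart/form-data body carrying the PDF bytes. */
    static byte[] multipartBody(String boundary, String filename, byte[] pdf) {
        // ASSUMPTION: the form field is named "file"; verify against the Marker server.
        String head = "--" + boundary + "\r\n"
                + "Content-Disposition: form-data; name=\"file\"; filename=\"" + filename + "\"\r\n"
                + "Content-Type: application/pdf\r\n\r\n";
        String tail = "\r\n--" + boundary + "--\r\n";
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.writeBytes(head.getBytes(StandardCharsets.UTF_8));
        out.writeBytes(pdf);
        out.writeBytes(tail.getBytes(StandardCharsets.UTF_8));
        return out.toByteArray();
    }

    /** Builds (but does not send) the POST request for the upload endpoint. */
    static HttpRequest build(URI endpoint, String filename, byte[] pdf) {
        String boundary = "----marker-" + UUID.randomUUID();
        return HttpRequest.newBuilder(endpoint)
                .header("Content-Type", "multipart/form-data; boundary=" + boundary)
                .POST(HttpRequest.BodyPublishers.ofByteArray(multipartBody(boundary, filename, pdf)))
                .build();
    }
}
```

Sending the request with `HttpClient.send(...)` and parsing the JSON body is left out here, since the exact response field names are defined by the Marker server rather than by this plan.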
## Technical Context
**Language/Version**: Java 25 (backend), TypeScript / Node 20 (frontend)

**Primary Dependencies**: Spring Boot 4.0.5, Spring AI 2.0.0-M4, OpenAI API (embeddings and GPT-4o vision), PDFBox 3.0.3 (retained transitively via `spring-ai-pdf-document-reader`, no longer used directly), Marker local HTTP API (`http://localhost:8000/marker/upload`)

**Storage**: PostgreSQL (JPA + Flyway), pgvector (Spring AI `VectorStore`), S3-compatible object store (figure images via `FigureStorageService`)

**Testing**: Maven / JUnit 5 (`spring-boot-starter-test`)

**Target Platform**: Linux server

**Project Type**: Web application (backend API + frontend client)

**Performance Goals**: processing time for a book of ≤ 500 pages is ≤ 3× the text-only processing time (SC-003)

**Constraints**: REST API only (Constitution III); Marker server must be running locally; S3-compatible storage configured via env vars

**Scale/Scope**: POC with a handful of books and <10 users
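
The SC-003 performance goal reduces to a simple comparison, sketched below. `ProcessingBudget` and its method are hypothetical names, and treating books over 500 pages as out of scope is an assumption about how the criterion is meant to be read.

```java
import java.time.Duration;

// Hypothetical helper; SC-003 itself is defined in spec.md.
final class ProcessingBudget {
    static final int MAX_PAGES = 500;  // page limit named in SC-003
    static final int MAX_FACTOR = 3;   // allowed slowdown relative to text-only processing

    private ProcessingBudget() {}

    /** Returns true if the book is outside SC-003's scope or the image-aware run is within 3x. */
    static boolean meetsSc003(int pages, Duration textOnly, Duration imageAware) {
        if (pages > MAX_PAGES) {
            return true; // ASSUMPTION: larger books are not covered by the criterion
        }
        return imageAware.compareTo(textOnly.multipliedBy(MAX_FACTOR)) <= 0;
    }
}
```

For example, if text-only processing of a 300-page book takes 4 minutes, an image-aware run of 12 minutes meets the goal and a run of 13 minutes does not.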
## Constitution Check
*GATE: Must pass before Phase 0 research. Re-checked after Phase 1 design.*
| Principle | Status | Notes |
|-----------|--------|-------|
| **I. KISS** | ✅ Justified | Marker replaces a bespoke PDFBox column heuristic + Google Cloud SDK with one HTTP call. Net complexity reduction vs. the Document AI approach. |
| **II. Easy to Change** | ✅ | `MarkerPageParser` is the only class that knows about Marker; swap the implementation to replace Marker with any other parser. `PageResult` DTO remains unchanged. |
| **III. Web-First** | ✅ | Internal pipeline change; no public API contract change. |
| **IV. Documentation** | ✅ | README must be updated to show Marker as a local external service. |
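
The "Easy to Change" row can be illustrated with a small seam. This is only a sketch: the plan names `MarkerPageParser` and `PageResult` but does not specify an interface, so `PageParser`, `StubMarkerPageParser`, `EmbeddingPipeline`, and the `PageResult` fields shown here are all assumptions made for illustration.

```java
import java.util.List;

// Field names are assumptions; the real DTO is specified in contracts/marker-page-result.md.
record PageResult(int pageNumber, String text) {}

// Hypothetical seam: the plan names MarkerPageParser but no interface.
interface PageParser {
    List<PageResult> parse(byte[] pdf);
}

// Stand-in for MarkerPageParser; the real class would call the Marker HTTP API here.
final class StubMarkerPageParser implements PageParser {
    @Override
    public List<PageResult> parse(byte[] pdf) {
        return List.of(new PageResult(1, "stub text for " + pdf.length + " bytes"));
    }
}

// Downstream code depends on the interface only, so swapping parsers touches no other class.
final class EmbeddingPipeline {
    private final PageParser parser;

    EmbeddingPipeline(PageParser parser) {
        this.parser = parser;
    }

    int countPages(byte[] pdf) {
        return parser.parse(pdf).size();
    }
}
```

With this shape, replacing Marker means passing a different `PageParser` to the pipeline, while everything that consumes `PageResult` stays unchanged.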
## Project Structure
### Documentation (this feature)
```text
specs/002-image-aware-embedding/
├── plan.md # This file
├── research.md # Phase 0 output
├── data-model.md # Phase 1 output
├── quickstart.md # Phase 1 output
├── contracts/
│ ├── api.md # HTTP API contracts (unchanged from initial plan)
│ └── marker-page-result.md # Internal DTO contract (MarkerPageParser → downstream)
└── tasks.md # Phase 2 output (/speckit.tasks — not created here)
```
### Source Code
```text
backend/
├── src/main/java/com/aiteacher/
│ ├── config/
│ │ └── MarkerConfig.java # NEW: RestClient bean + base-url property
│ ├── document/
│ │ ├── MarkerPageParser.java # NEW: replaces DocumentAiPageParser + PdfStructureParser
│ │ ├── PageResult.java # UPDATED: FigureBbox → FigureData (bytes not bbox)
│ │ ├── FigureExtractionService.java # UPDATED: no PDFBox render; decode bytes directly
│ │ ├── TextChunkingService.java # UNCHANGED
│ │ ├── VisionDescriptionService.java # UNCHANGED
│ │ └── [removed] DocumentAiPageParser.java
│ ├── book/
│ │ └── BookEmbeddingService.java # MINOR UPDATE: inject MarkerPageParser, drop DocumentAiPageParser
│ └── [removed] config/DocumentAiConfig.java
├── src/main/resources/
│ └── application.yaml # UPDATED: remove document-ai.*, add marker.base-url
└── pom.xml # UPDATED: remove google-cloud-document-ai
```
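
The `FigureBbox` → `FigureData` change noted in the tree means figures arrive as bytes and need no PDF rendering. A minimal sketch of that decode step follows, assuming the Marker response has already been parsed into a map of figure name to base64 string. The `FigureData` field names and the `FigureDecoding` helper are hypothetical and not the actual `FigureExtractionService`.

```java
import java.util.ArrayList;
import java.util.Base64;
import java.util.List;
import java.util.Map;

// ASSUMPTION: field names are illustrative; the plan only says FigureData carries bytes, not a bbox.
record FigureData(String name, byte[] imageBytes) {}

// Hypothetical helper showing the "decode bytes directly" step.
final class FigureDecoding {
    private FigureDecoding() {}

    /** Turns a map of figure name -> base64 payload into FigureData, with no PDF rendering. */
    static List<FigureData> decode(Map<String, String> base64ByName) {
        List<FigureData> figures = new ArrayList<>();
        for (Map.Entry<String, String> entry : base64ByName.entrySet()) {
            // Base64.Decoder.decode throws IllegalArgumentException on malformed input
            byte[] bytes = Base64.getDecoder().decode(entry.getValue());
            figures.add(new FigureData(entry.getKey(), bytes));
        }
        return figures;
    }
}
```

The decoded bytes can then be handed to storage and to the vision description step without touching PDFBox.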
**Structure Decision**: Option 2 (backend + frontend) per constitution Technology Constraints.
Frontend changes are display-only (render figure citations inline).
## Complexity Tracking
> No constitution violations — Marker reduces complexity compared to the previous
> Google Document AI approach (fewer dependencies, no GCP credentials, no 15-page batching).