ai-teacher/specs/002-image-aware-embedding/research.md

# Research: Enhanced Embedding with Image Parsing and Metadata

**Branch**: `002-image-aware-embedding` | **Date**: 2026-04-04 (updated: Marker replaces Google Document AI)

This document resolves all technical unknowns identified during planning. Decisions 1–10 cover
the core pipeline. The **Marker Study** section at the bottom explains why Marker was chosen
over Google Document AI to drive PDF parsing and figure extraction.

---

## Decision 1: Document Hierarchy Model

**Decision**: Adopt a four-level hierarchy — `BookNode` → `ChapterNode` → `SectionNode` →
`TextChunkNode` + `FigureNode`. The `SectionNode` is the pivotal unit: it holds the full section
text in Postgres and is used for parent-child context expansion at retrieval time.

**Rationale**: A flat page-per-document model (current implementation) loses structural context.
When a user asks a multi-faceted clinical question, the LLM needs the surrounding section text,
not just the matching fragment. Parent-child retrieval — where chunks point to their parent
section — is the established pattern for RAG precision. The hierarchy also makes figure-to-section
association explicit and queryable.

**Alternatives considered**:
- Keep flat page model, add metadata only → rejected: insufficient for precise citation and
  context expansion
- Chapter-level retrieval (coarser than section) → rejected: too much irrelevant context sent
  to LLM; cost and latency increase
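
The parent-child relationship described above can be sketched with plain Java records. This is a minimal illustration, not the project's entity model: every field name and the `expandToSection` helper are assumptions, and the real nodes are persisted entities whose columns are not specified here.

```java
import java.util.List;

class HierarchySketch {
    // Field names are illustrative assumptions only.
    record BookNode(String id, String title, List<ChapterNode> chapters) {}
    record ChapterNode(String id, String title, List<SectionNode> sections) {}
    // The section keeps its full text so retrieval can expand a chunk to its parent.
    record SectionNode(String id, String title, String fullText,
                       List<TextChunkNode> chunks, List<FigureNode> figures) {}
    record TextChunkNode(String id, String sectionId, int chunkIndex, String text) {}
    record FigureNode(String id, String sectionId, String label, String imagePath) {}

    // Parent-child expansion: given a matched chunk, return the parent section's full text.
    static String expandToSection(TextChunkNode chunk, List<SectionNode> sections) {
        return sections.stream()
                .filter(s -> s.id().equals(chunk.sectionId()))
                .map(SectionNode::fullText)
                .findFirst()
                .orElse(chunk.text()); // fall back to the chunk itself if the parent is missing
    }
}
```

In the real pipeline the lookup would be a Postgres query by `section_id` rather than an in-memory scan; the sketch only shows why each chunk carries a pointer to its section.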

---

## Decision 2: Document Parsing Strategy

**Decision**: Use **Marker** (local HTTP server, `http://localhost:8000/marker/upload`) as the
single entry point for PDF parsing. A single `POST` with `output_format=json` returns:
- Reading-order text blocks (headings, paragraphs) — no column-split heuristic needed
- Pre-cropped figure images as base64-encoded PNG in the `images` map of each `Figure` block
- Table, equation, and code blocks as structured HTML

`MarkerPageParser` translates the Marker JSON response into `List<PageResult>`, which is the
same internal DTO used by the rest of the pipeline.

**Rationale**: Marker handles column reordering, scanned-page OCR, and figure cropping in one
call, eliminating the PDFBox column heuristic (`PdfStructureParser`) and the PDFBox
render+crop loop in `FigureExtractionService`. Net result: fewer classes, no cloud dependency,
no GCP credentials.

**Alternatives considered**:
- PDFBox column heuristic (previous approach) → rejected: 50/50 split fails on asymmetric
  columns and scanned pages
- Google Document AI Layout Parser → rejected: adds GCP credentials, per-page billing, 15-page
  batch limit, and still requires PDFBox to render+crop figure regions from bounding boxes.
  See Marker Study below for detailed comparison.
- Screenshot each page + OCR → far slower; loses digital text quality

---

## Decision 3: Figure Content Representation

**Decision**: Generate a textual description of each extracted image using the OpenAI vision
model (GPT-4o). This description becomes the `content` field of the figure's vector store
document. The figure caption (parsed from the surrounding text) is also included to maximise
retrieval signal.

**Rationale**: Caption-only embedding would miss figures with no caption or with sparse labels.
Vision-generated descriptions produce richer semantic content (anatomy terms, structural
relationships) that matches clinical queries. The OpenAI client already in use supports image
inputs; no additional dependency is required.

**Alternatives considered**:
- Caption-only embedding → insufficient when captions are absent or terse (common in textbooks)
- Local vision model (LLaVA) → requires self-hosting; out of scope for POC
- OCR only → extracts text visible in image but misses non-text visual content (diagrams, MRI)

---

## Decision 4: Dual Vector Search

**Decision**: At query time, run two parallel similarity searches:
1. Text chunk search (filtered by `type = "TEXT"` and `book_id`)
2. Figure search over the description and caption content from Decision 3 (filtered by `type = "FIGURE"` and `book_id`)

Results are merged and deduplicated. The LLM prompt receives the expanded parent section text
plus a structured figure reference list.

**Rationale**: A single search would rank text and figures against each other; figures with
terse captions would systematically lose to text chunks. Separate searches with independent
`topK` allow tuning each modality independently.

**Alternatives considered**:
- Single search, filter by relevance score → figure captions score lower than text; figures
  are systematically under-retrieved
- Post-process text results to look up linked figures only → misses figures that are relevant
  to the query but not explicitly referenced in the retrieved text chunks
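
The merge-and-deduplicate step can be sketched as follows. This is a simplified illustration that assumes each search has already returned scored hits; the `Hit` record and the rule "keep the higher-scoring duplicate" are assumptions, and the two vector store queries themselves are not shown.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class DualSearchMerge {
    // Hypothetical hit type; a real implementation would wrap the vector store's result class.
    record Hit(String id, String type, double score) {}

    // Truncate each modality to its own topK, then merge and drop duplicate ids,
    // keeping the higher-scoring copy of each.
    static List<Hit> merge(List<Hit> textHits, int textTopK,
                           List<Hit> figureHits, int figureTopK) {
        List<Hit> all = new ArrayList<>();
        all.addAll(top(textHits, textTopK));
        all.addAll(top(figureHits, figureTopK));

        Map<String, Hit> byId = new LinkedHashMap<>();
        for (Hit h : all) {
            byId.merge(h.id(), h, (a, b) -> a.score() >= b.score() ? a : b);
        }
        return new ArrayList<>(byId.values());
    }

    private static List<Hit> top(List<Hit> hits, int k) {
        return hits.stream()
                .sorted(Comparator.comparingDouble(Hit::score).reversed())
                .limit(k)
                .toList();
    }
}
```

Because each list is truncated before merging, a low-scoring figure hit is never compared against text scores, which is the point of running the searches separately.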

---

## Decision 5: Chunk-to-Figure Linking

**Decision**: During text parsing, whenever a pattern matching `Fig\.\s+\d+[\-\.]\d+` or
`Figure\s+\d+[\-\.]\d+` is found in a chunk, insert a row into the `chunk_figure_refs` table
linking `chunkId` → `figureId`. At retrieval time, after text chunks are retrieved, their
associated figures are fetched from this table and added to the LLM prompt.

**Rationale**: Explicit linking ensures that when a text chunk is retrieved, its referenced
figures are always surfaced — even if the figure's caption did not score highly in the vector
search. This is the higher-recall path; dual search (Decision 4) is the higher-precision path.

**Alternatives considered**:
- Rely entirely on dual vector search → may miss figures referenced in retrieved text but
  scoring below the topK threshold in the figure search
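
A minimal sketch of the reference detection, using `java.util.regex`. The class name and the normalization of labels to the `12-4` form are assumptions made for illustration; resolving a label to a `figureId` and writing the `chunk_figure_refs` row are left out.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FigureRefExtractor {
    // Matches "Fig. 12-4", "Fig. 12.4", "Figure 12-4" and "Figure 12.4".
    private static final Pattern FIGURE_REF =
            Pattern.compile("(?:Fig\\.|Figure)\\s+(\\d+)[\\-.](\\d+)");

    // Returns normalized labels such as "12-4", in order of first appearance.
    // Normalizing the separator to "-" is an assumption of this sketch.
    static Set<String> extractLabels(String chunkText) {
        Set<String> labels = new LinkedHashSet<>();
        Matcher m = FIGURE_REF.matcher(chunkText);
        while (m.find()) {
            labels.add(m.group(1) + "-" + m.group(2));
        }
        return labels;
    }
}
```

Returning a set means a chunk that mentions the same figure several times still produces a single link row.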

---

## Decision 6: Image Storage

**Decision**: Marker returns figure images as base64-encoded PNG bytes in the JSON response.
`FigureExtractionService` decodes these bytes and passes them to `FigureStorageService`, which
persists them to an S3-compatible bucket (`${app.figure-storage.bucket}`). The image path/URL
is stored in `figure.image_path` in Postgres.

The `FigureStorageService` interface is unchanged; only the caller changes (from PDFBox crop
to base64 decode).

**Rationale**: Marker's pre-cropped images remove the need for PDFBox rendering. Keeping the
`FigureStorageService` interface as the boundary satisfies Constitution Principle II (Easy to Change).

**Alternatives considered**:
- Store base64 in Postgres JSONB → bloats DB; complicates backup; query performance degrades
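
The caller-side change can be sketched as below. The `store` signature, the `figures/<id>.png` key layout, and the in-memory implementation are all hypothetical; the actual `FigureStorageService` interface is not reproduced in this document, and the real implementation writes to the S3-compatible bucket.

```java
import java.util.Base64;
import java.util.HashMap;
import java.util.Map;

class FigureStorageSketch {
    // Hypothetical shape of the storage boundary.
    interface FigureStorageService {
        String store(String figureId, byte[] pngBytes); // returns the stored path or URL
    }

    // In-memory stand-in for the S3-backed implementation, useful in tests.
    static class InMemoryFigureStorage implements FigureStorageService {
        final Map<String, byte[]> objects = new HashMap<>();

        @Override
        public String store(String figureId, byte[] pngBytes) {
            String key = "figures/" + figureId + ".png"; // key layout is an assumption
            objects.put(key, pngBytes);
            return key;
        }
    }

    // Caller side: decode Marker's base64 payload, then hand raw bytes to storage.
    static String decodeAndStore(FigureStorageService storage, String figureId, String base64Png) {
        byte[] bytes = Base64.getDecoder().decode(base64Png);
        return storage.store(figureId, bytes);
    }
}
```

The returned key is what would be written to `figure.image_path`.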

---

## Decision 7: Figure Type Classification

**Decision**: Use the enum `FigureType { ANATOMICAL_DIAGRAM, SURGICAL_PHOTOGRAPH, MRI_CT_SCAN,
TABLE, CHART, INTRAOPERATIVE_IMAGE }`. Classification is derived from:
1. Caption keywords ("MRI", "CT", "Fig.", "Table") — heuristic, no model needed
2. Marker `block_type` hint (`"Table"` → TABLE, `"Figure"` / `"Picture"` → ANATOMICAL_DIAGRAM default)
3. Fall back to `ANATOMICAL_DIAGRAM` if unclassifiable

**Rationale**: Allows the frontend to render different icon/label per type (e.g., "MRI" badge).
Heuristic classification avoids a separate model call per image at extraction time.

**Alternatives considered**:
- Vision model classification → accurate but adds latency and cost per figure; deferrable
- Single `FIGURE` type → loses citation granularity required by spec FR-004
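
The three-step classification could look like the sketch below. The enum values come from the decision above; the specific keyword lists and their ordering are assumptions, since the spec names only a few example keywords.

```java
import java.util.Locale;
import java.util.regex.Pattern;

class FigureTypeClassifier {
    enum FigureType { ANATOMICAL_DIAGRAM, SURGICAL_PHOTOGRAPH, MRI_CT_SCAN,
                      TABLE, CHART, INTRAOPERATIVE_IMAGE }

    // Whole-word match so that "section" does not count as "ct".
    private static final Pattern SCAN = Pattern.compile("\\b(mri|ct)\\b");

    // Caption keywords first, then the Marker block_type hint, then the default.
    // Keyword lists here are illustrative assumptions.
    static FigureType classify(String caption, String markerBlockType) {
        String c = caption == null ? "" : caption.toLowerCase(Locale.ROOT);
        if (SCAN.matcher(c).find()) return FigureType.MRI_CT_SCAN;
        if (c.contains("intraoperative")) return FigureType.INTRAOPERATIVE_IMAGE;
        if (c.contains("photograph")) return FigureType.SURGICAL_PHOTOGRAPH; // before "graph"
        if (c.contains("chart") || c.contains("graph")) return FigureType.CHART;
        if (c.startsWith("table") || "Table".equals(markerBlockType)) return FigureType.TABLE;
        return FigureType.ANATOMICAL_DIAGRAM;
    }
}
```

Order matters in a keyword heuristic like this one: "photograph" has to be tested before "graph", or surgical photographs would be labelled as charts.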

---

## Decision 8: Metadata Schema for Vector Store Documents

**Decision**: All vector store documents carry a flat `Map<String, Object>` metadata for Spring
AI filtering. Schema:

| Field | Text Chunk | Figure Chunk |
|-------|-----------|-------------|
| `type` | `"TEXT"` | `"FIGURE"` |
| `book_id` | ✓ | ✓ |
| `book_title` | ✓ | ✓ |
| `chapter_id` | ✓ | ✓ |
| `section_id` | ✓ | ✓ |
| `section_title` | ✓ | ✓ |
| `page_start` | ✓ | — |
| `page_end` | ✓ | — |
| `chunk_index` | ✓ | — |
| `total_chunks` | ✓ | — |
| `figure_id` | — | ✓ |
| `figure_type` | — | ✓ |
| `image_path` | — | ✓ |
| `label` | — | ✓ |
| `page` | — | ✓ |

**Rationale**: Flat map is required by Spring AI `FilterExpressionBuilder`. Separation by `type`
allows independent filtering in dual search.
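
As an illustration of the schema, the two metadata maps could be assembled as follows. The keys are taken from the table above; the `SectionRef` holder and the factory method names are assumptions of this sketch.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class ChunkMetadata {
    // Hypothetical holder for the keys shared by both document types.
    record SectionRef(String bookId, String bookTitle, String chapterId,
                      String sectionId, String sectionTitle) {}

    private static Map<String, Object> common(String type, SectionRef s) {
        Map<String, Object> m = new LinkedHashMap<>();
        m.put("type", type);
        m.put("book_id", s.bookId());
        m.put("book_title", s.bookTitle());
        m.put("chapter_id", s.chapterId());
        m.put("section_id", s.sectionId());
        m.put("section_title", s.sectionTitle());
        return m;
    }

    static Map<String, Object> forText(SectionRef s, int pageStart, int pageEnd,
                                       int chunkIndex, int totalChunks) {
        Map<String, Object> m = common("TEXT", s);
        m.put("page_start", pageStart);
        m.put("page_end", pageEnd);
        m.put("chunk_index", chunkIndex);
        m.put("total_chunks", totalChunks);
        return m;
    }

    static Map<String, Object> forFigure(SectionRef s, String figureId, String figureType,
                                         String imagePath, String label, int page) {
        Map<String, Object> m = common("FIGURE", s);
        m.put("figure_id", figureId);
        m.put("figure_type", figureType);
        m.put("image_path", imagePath);
        m.put("label", label);
        m.put("page", page);
        return m;
    }
}
```

Every value is a string or an integer, so the map stays flat and each key can be used directly in a metadata filter.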

---

## Decision 9: Re-embedding Existing Books

**Decision**: Books already processed under feature 001 (text-only) are NOT automatically
re-embedded. An explicit re-embed action is exposed via `POST /api/v1/books/{id}/reembed`
(admin-triggered). The existing chunks remain valid for text queries until re-embedding completes.

**Rationale**: Automatic re-embedding on deploy would block the system and risk data loss if
the process fails mid-way. An explicit, idempotent trigger is safer and more observable.

---

## Decision 10: Minimum Image Size Threshold

**Decision**: Images smaller than 100×100 pixels are discarded and no chunk is created. Marker
returns PNG bytes; `FigureExtractionService` decodes to `BufferedImage` solely to check
dimensions. This threshold filters out decorative elements without a classification model.

**Rationale**: Neurosurgery textbook diagrams and MRI scans are expected to be well above 100×100 px, so the threshold should not discard real figures.
The threshold is configurable via `app.figure-storage.min-image-size-px`.

**Alternatives considered**:
- No threshold → decorative icons pollute the figure index
- ML-based classification → accurate but adds model dependency; not needed at POC scale
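
A minimal sketch of the dimension check, using only `javax.imageio`. Two points are assumptions rather than part of the decision: the threshold is applied to width and height separately, and bytes that cannot be decoded as an image are treated as too small.

```java
import java.awt.image.BufferedImage;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import javax.imageio.ImageIO;

class MinImageSizeFilter {
    // Returns true when the bytes decode to an image whose width and height
    // both meet the threshold (per-dimension minimum is an assumption).
    static boolean isLargeEnough(byte[] pngBytes, int minSizePx) {
        try {
            BufferedImage img = ImageIO.read(new ByteArrayInputStream(pngBytes));
            if (img == null) {
                return false; // ImageIO.read returns null when no reader recognizes the data
            }
            return img.getWidth() >= minSizePx && img.getHeight() >= minSizePx;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A caller would pass the value of `app.figure-storage.min-image-size-px` as `minSizePx` and skip storage and embedding when the method returns `false`.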

---

# Marker Study — Why Marker Replaces Google Document AI

*Added 2026-04-04.*

## What Marker Offers

Marker is an open-source, locally-runnable PDF-to-structured-content converter that uses a
pipeline of deep-learning models (surya for OCR + layout detection, texify for equations).
Key capabilities relevant to this project:

| Capability | Marker | Google Document AI |
|-----------|--------|--------------------|
| Multi-column reading order | ✅ | ✅ |
| OCR on scanned pages | ✅ | ✅ |
| Figure detection | ✅ returns pre-cropped images | ⚠️ returns bbox only; PDFBox still needed |
| Table extraction | ✅ HTML tables | ✅ |
| JSON output with image bytes | ✅ base64 in `images` map | ❌ |
| No cloud credentials | ✅ | ❌ GCP service account required |
| No per-page billing | ✅ | ❌ ~$10/1,000 pages |
| Batch size limits | None (local) | 15 pages / 20 MB per sync call |
| Setup | `pip install marker-pdf && marker_server` | GCP project + processor + IAM |

---

## Does Marker Solve the Current Pain Points?

### Pain Point 1: Naive 50/50 Column Split

**Answer: Yes, Marker fixes this completely.**

`PdfStructureParser.extractPageText()` splits pages at the horizontal midpoint with a 20%
threshold. This fails on asymmetric columns and scanned pages. Marker's surya layout model
returns blocks in natural reading order — no heuristic needed.

### Pain Point 2: Figure Detection Misses Rasterized Figures

**Answer: Yes, Marker fixes this for most cases.**

`FigureExtractionService` previously iterated PDF XObjects, an approach that finds only embedded
XObject images and misses rasterized figures and vector-path drawings. Marker's layout model
detects visual elements by type and returns the cropped image bytes directly, so no PDFBox page
rendering is needed.

### Pain Point 3: OCR on Scanned Pages

**Answer: Yes, Marker handles scanned pages transparently via surya OCR.**

### Pain Point 4: Caption Detection

**Answer: Improved — Marker groups caption blocks with their figure block.**

The `block_type = "Caption"` block appears as a sibling or child adjacent to the `"Figure"`
block in the Marker JSON, making caption association structural rather than regex-based.

---

## Marker API Integration

### Local Server Setup

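```bash
pip install marker-pdf
# The API server additionally needs the FastAPI stack
pip install -U uvicorn fastapi python-multipart
marker_server --port 8000
```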

The server exposes `POST /marker/upload`, which is the endpoint this project is configured to call.

### Request

```bash
# Multipart upload; -F makes curl send Content-Type: multipart/form-data
curl -X POST http://localhost:8000/marker/upload \
  -F "file=@document.pdf" \
  -F "output_format=json"
```

### Response (abbreviated)

```json
{
  "output_format": "json",
  "output": {
    "block_type": "Document",
    "children": [
      {
        "block_type": "Page",
        "id": "/page/0/Page/0",
        "children": [
          {
            "block_type": "SectionHeader",
            "id": "/page/0/SectionHeader/0",
            "html": "<h1>Cavernous Sinus Anatomy</h1>"
          },
          {
            "block_type": "Text",
            "id": "/page/0/Text/1",
            "html": "<p>The cavernous sinus contains...</p>"
          },
          {
            "block_type": "Figure",
            "id": "/page/0/Figure/2",
            "html": "<figure><img src='/page/0/Figure/2'/></figure>",
            "images": {
              "/page/0/Figure/2": "iVBORw0KGgo..."
            }
          },
          {
            "block_type": "Caption",
            "id": "/page/0/Caption/3",
            "html": "<p>Fig. 12-4. Coronal cross-section...</p>"
          }
        ]
      }
    ],
    "metadata": { "page_stats": [...] }
  }
}
```

### Java Integration Pattern

```java
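import com.fasterxml.jackson.databind.JsonNode;
import org.springframework.core.io.FileSystemResource;
import org.springframework.http.MediaType;
import org.springframework.util.LinkedMultiValueMap;
import org.springframework.util.MultiValueMap;

// MarkerPageParser: core call
// Multipart form: the PDF as the "file" part plus the requested output format.
MultiValueMap<String, Object> body = new LinkedMultiValueMap<>();
body.add("file", new FileSystemResource(pdfPath));
body.add("output_format", "json");

JsonNode response = restClient.post()
    .uri(baseUrl + "/marker/upload")
    .contentType(MediaType.MULTIPART_FORM_DATA)
    .body(body)
    .retrieve()
    .body(JsonNode.class);

// Fail with a clear message instead of a NullPointerException on an empty response.
if (response == null || !response.hasNonNull("output")) {
    throw new IllegalStateException("Marker response has no 'output' node");
}
JsonNode document = response.get("output");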
```

### Mapping Marker Blocks to PageResult

```
Page block (id "/page/N/Page/M") → PageResult(pageNumber = N+1)
  SectionHeader children           → headingTitle (first match)
  Text, TextInlineMath children    → orderedText (HTML stripped, joined \n\n)
  Figure children with images map  → FigureData(imageBytes = base64decode(images[id]))
  Caption sibling of Figure        → FigureData.nearestCaption
```
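
Two of the mapping rules above, the page-number offset and the HTML stripping, can be sketched with the standard library alone. The class and method names are hypothetical, and the regex-based tag stripping is a simplification that does not decode HTML entities; walking the JSON tree itself is not shown.

```java
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class MarkerBlockMapping {
    private static final Pattern PAGE_ID = Pattern.compile("^/page/(\\d+)/");

    // "/page/0/Figure/2" -> 1: block ids use a zero-based page index,
    // while PageResult.pageNumber is one-based.
    static int pageNumberOf(String blockId) {
        Matcher m = PAGE_ID.matcher(blockId);
        if (!m.find()) {
            throw new IllegalArgumentException("Unexpected block id: " + blockId);
        }
        return Integer.parseInt(m.group(1)) + 1;
    }

    // Crude tag removal for simple fragments; a real HTML parser would be more robust.
    static String stripHtml(String html) {
        return html.replaceAll("<[^>]+>", "").trim();
    }

    // Joins the stripped text of consecutive Text blocks with blank lines.
    static String orderedText(List<String> htmlBlocks) {
        return htmlBlocks.stream()
                .map(MarkerBlockMapping::stripHtml)
                .filter(t -> !t.isEmpty())
                .collect(Collectors.joining("\n\n"));
    }
}
```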

---

## Architecture Change

```
Before (Document AI — removed):
  DocumentAiPageParser
      → Google Document AI API (GCP, 15-page batches, credentials)
      → returns text blocks + figure bboxes
  PdfStructureParser (PDFBox column heuristic)
  FigureExtractionService
      → renders page via PDFBox at 150 DPI
      → crops bbox region

After (Marker):
  MarkerPageParser
      → POST PDF to http://localhost:8000/marker/upload (output_format=json)
      → returns text blocks (correct reading order) + Figure blocks with base64 images
      → produces List<PageResult> (same DTO, FigureData carries bytes not bbox)
  FigureExtractionService (simplified)
      → base64-decodes image bytes from PageResult.FigureData
      → checks min size (ImageIO.read → getWidth/getHeight)
      → saves to S3 via FigureStorageService (UNCHANGED)
  VisionDescriptionService (UNCHANGED)
  BookEmbeddingService orchestration (MINOR: inject MarkerPageParser)
```

**What is removed**:
- `DocumentAiPageParser` — replaced by `MarkerPageParser`
- `DocumentAiConfig` — replaced by `MarkerConfig`
- `PdfStructureParser` — Marker handles reading order
- `google-cloud-document-ai` Maven dependency
- `app.document-ai.*` configuration properties

**What stays the same**:
- `PageResult` DTO structure (fields renamed, not restructured)
- `FigureExtractionService` public interface
- `TextChunkingService`, `VisionDescriptionService`, `BookEmbeddingService` orchestration
- All JPA entities, repositories, vector store, S3 storage

---

## Constitution Compliance

| Principle | Assessment |
|-----------|------------|
| **I. KISS** | ✅ Simpler than Document AI — one HTTP call replaces GCP SDK + PDFBox render loop. No new dependency beyond an HTTP client (Spring RestClient, already available). |
| **II. Easy to Change** | ✅ `MarkerPageParser` is the only Marker-aware class. Swap it to use any other parser. `PageResult` DTO unchanged in contract. |
| **III. Web-First** | ✅ Internal pipeline change; no API contract change. |
| **IV. Documentation** | ✅ README must show Marker as a local external service dependency. |

---

## Risks & Mitigations

| Risk | Likelihood | Mitigation |
|------|-----------|------------|
| Marker server not running when book is uploaded | Medium | `BookEmbeddingService` catches exception from `MarkerPageParser`, marks book as `FAILED`, logs full error. |
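| Marker misses some figures (complex PDFs) | Medium | Tuning `app.figure-storage.min-image-size-px` recovers figures that were detected but discarded as too small; it cannot recover figures Marker never detected. As a fallback, log a warning when Marker returns 0 figures for a page known to contain images. |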
| SC-003 (≤ 3× processing time) violated | Low | Marker runs locally (no network latency to cloud). Benchmark with a real 500-page book early. |
| Large PDF upload to Marker (>100MB) | Low | Marker server handles the full file; no batching needed. Multipart upload limit configurable. |
| Marker image quality vs PDFBox crop | Low | Marker crops at native resolution; quality is equivalent or better than 150 DPI PDFBox render. |