# Internal Contract: MarkerPageParser → FigureExtractionService / BookEmbeddingService
**Branch**: `002-image-aware-embedding` | **Date**: 2026-04-04
**Type**: Internal Java DTO (not an HTTP contract)

---

## Purpose
`PageResult` is the internal data transfer object produced by `MarkerPageParser` for each
PDF page. It decouples the Marker HTTP API from the rest of the pipeline. Downstream consumers
(`BookEmbeddingService`, `FigureExtractionService`, `TextChunkingService`) are unaware of
Marker and depend only on this DTO.

---

## Java Record
```java
package com.aiteacher.document;

import java.util.List;

/**
 * Internal DTO produced by MarkerPageParser for one PDF page.
 * Decouples the Marker HTTP API from downstream services.
 */
public record PageResult(
    int pageNumber,            // 1-based, derived from Marker page block index
    String orderedText,        // full page text in correct reading order (blocks joined by \n\n)
    String headingTitle,       // first SectionHeader block on page, or null
    List<FigureData> figures   // extracted figure images (may be empty)
) {
    /**
     * A figure extracted from the page.
     * Image bytes are PNG data decoded from the Marker JSON `images` map.
     */
    public record FigureData(
        byte[] imageBytes,      // PNG image data (base64-decoded from Marker response)
        String nearestCaption,  // text of the adjacent Caption block, or null
        String blockId          // Marker block ID (e.g. "/page/0/Figure/2") for traceability
    ) {}
}
```
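
As a rough illustration of how a caller might build and read this DTO, the sketch below repeats the record in compact form so that it compiles on its own. `PageResultExample`, `sample()`, and `titleOrDefault()` are hypothetical names, and the `"Page N"` fallback is an assumption chosen only to show that `headingTitle` may be `null`; the contract does not prescribe it.

```java
import java.util.List;

// Repeated from the contract in compact form so this sketch is self-contained.
record PageResult(int pageNumber, String orderedText, String headingTitle,
                  List<FigureData> figures) {
    record FigureData(byte[] imageBytes, String nearestCaption, String blockId) {}
}

// Hypothetical helper class, not part of the contract.
final class PageResultExample {

    // Builds a page with a heading and one captioned figure.
    static PageResult sample() {
        byte[] placeholder = new byte[] {1, 2, 3}; // stand-in bytes, not a real PNG
        var figure = new PageResult.FigureData(
                placeholder, "Figure 1: Pipeline overview", "/page/0/Figure/2");
        return new PageResult(
                1, "Introduction\n\nFirst paragraph.", "Introduction", List.of(figure));
    }

    // headingTitle is nullable, so consumers need a null check.
    // The "Page N" fallback is an assumption made for this example.
    static String titleOrDefault(PageResult page) {
        return page.headingTitle() != null
                ? page.headingTitle()
                : "Page " + page.pageNumber();
    }
}
```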
---
## Production Rules
| Field | Rule |
|-------|------|
| `pageNumber` | 1-based index derived from the Marker page block's position in the `children` array (index + 1). |
| `orderedText` | HTML-stripped text from all `Text`, `TextInlineMath`, `SectionHeader`, `ListItem`, and `Table` blocks, joined with `\n\n`. Marker already returns them in reading order. |
| `headingTitle` | Plain text of the first `SectionHeader` block on the page. `null` if no heading detected. |
| `figures` | One `FigureData` per `Figure` or `Picture` block that has a non-empty `images` entry. Blocks with no image data are skipped. |
| `imageBytes` | Base64-decoded bytes from `block.images[blockId]`. Marker returns PNG. |
| `nearestCaption` | Plain text of the first `Caption` block that is a sibling appearing immediately after the figure block. `null` if absent. |
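
The rules in the table above could be sketched roughly as follows. This is not the actual `MarkerPageParser` code: the `Block` record is a hypothetical, simplified view of a Marker child block (its field names are assumptions), and the regex-based `stripHtml` is a naive stand-in that does not decode HTML entities.

```java
import java.util.Base64;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical, simplified view of one Marker child block.
record Block(String id, String blockType, String html, Map<String, String> images) {}

final class ProductionRules {
    // Block types that contribute to orderedText, per the rules table.
    private static final Set<String> TEXT_TYPES =
            Set.of("Text", "TextInlineMath", "SectionHeader", "ListItem", "Table");

    // Naive tag removal for illustration only.
    static String stripHtml(String html) {
        return html.replaceAll("<[^>]*>", "").strip();
    }

    // Joins text-bearing blocks in the order given (Marker's reading order).
    static String orderedText(List<Block> children) {
        return children.stream()
                .filter(b -> TEXT_TYPES.contains(b.blockType()))
                .map(b -> stripHtml(b.html()))
                .collect(Collectors.joining("\n\n"));
    }

    // First SectionHeader on the page, or null when there is none.
    static String headingTitle(List<Block> children) {
        return children.stream()
                .filter(b -> "SectionHeader".equals(b.blockType()))
                .map(b -> stripHtml(b.html()))
                .findFirst()
                .orElse(null);
    }

    // Caption text only if the sibling right after the figure is a Caption.
    static String nearestCaption(List<Block> children, int figureIndex) {
        int next = figureIndex + 1;
        if (next < children.size() && "Caption".equals(children.get(next).blockType())) {
            return stripHtml(children.get(next).html());
        }
        return null;
    }

    // Decodes images[blockId]; returns null so the caller can skip the block.
    static byte[] imageBytes(Block figure) {
        String encoded = figure.images() == null ? null : figure.images().get(figure.id());
        if (encoded == null || encoded.isEmpty()) {
            return null;
        }
        return Base64.getDecoder().decode(encoded);
    }
}
```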

---

## Mapping from Marker JSON
```
Marker JSON                                   → PageResult
  Page block ("/page/N/Page/M")               → PageResult(pageNumber = N + 1)
    SectionHeader child                       → headingTitle (first match, HTML-stripped)
    Text / TextInlineMath / SectionHeader /
      ListItem / Table children               → orderedText (HTML-stripped, joined \n\n)
    Figure / Picture child                    → FigureData
      images[blockId]                         → FigureData.imageBytes (base64-decoded)
      next Caption sibling                    → FigureData.nearestCaption (HTML-stripped)
      blockId                                 → FigureData.blockId
```
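
The `N + 1` step in the mapping can be illustrated with a small helper that reads the page index out of a block ID such as `/page/0/Figure/2`. This is only a sketch: `BlockIds` and `pageNumber` are hypothetical names, and the regular expression assumes IDs always start with `/page/<digits>/`, which is inferred from the examples in this document rather than from a Marker specification.

```java
import java.util.OptionalInt;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the contract.
final class BlockIds {
    // Assumed ID shape: "/page/<zero-based page index>/<BlockType>/<n>".
    private static final Pattern PAGE_ID = Pattern.compile("^/page/(\\d{1,9})/.+$");

    // Returns the 1-based page number, or empty if the ID has another shape.
    static OptionalInt pageNumber(String blockId) {
        Matcher m = PAGE_ID.matcher(blockId);
        if (!m.matches()) {
            return OptionalInt.empty();
        }
        return OptionalInt.of(Integer.parseInt(m.group(1)) + 1);
    }
}
```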
---
## Consumers
| Consumer | What It Uses |
|----------|-------------|
| `BookEmbeddingService` | `orderedText` → `SectionEntity.fullText`; `headingTitle` → `SectionEntity.title` |
| `FigureExtractionService` | `figures` list → decodes `imageBytes`, checks min size, saves to S3 |
| `TextChunkingService` | Receives `SectionEntity` (uses `orderedText` indirectly) — **unchanged** |
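
The minimum-size check on the consumer side might look like the sketch below. The document does not say whether "min size" means byte length or pixel dimensions, so this example assumes byte length; `FigureFilter`, `keepLargeEnough`, and the 1024-byte default are all hypothetical. `FigureData` is repeated as a top-level record only so that the block compiles on its own.

```java
import java.util.List;

// Repeated from the contract (as a top-level record) so the sketch is self-contained.
record FigureData(byte[] imageBytes, String nearestCaption, String blockId) {}

// Hypothetical helper; the real FigureExtractionService may check size differently.
final class FigureFilter {
    // Assumed default threshold in bytes, chosen for illustration only.
    static final int MIN_IMAGE_BYTES = 1024;

    // Keeps figures whose payload has at least minBytes bytes.
    static List<FigureData> keepLargeEnough(List<FigureData> figures, int minBytes) {
        return figures.stream()
                .filter(f -> f.imageBytes() != null && f.imageBytes().length >= minBytes)
                .toList();
    }
}
```

Filtering before upload keeps tiny decorative images (icons, rules) out of storage, which seems to be the intent of the size check.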