# Internal Contract: MarkerPageParser → FigureExtractionService / BookEmbeddingService

**Branch**: `002-image-aware-embedding` | **Date**: 2026-04-04
**Type**: Internal Java DTO (not an HTTP contract)

---

## Purpose

`PageResult` is the internal data transfer object produced by `MarkerPageParser` for each
PDF page. It decouples the Marker HTTP API from the rest of the pipeline. Downstream consumers
(`BookEmbeddingService`, `FigureExtractionService`, `TextChunkingService`) are unaware of
Marker and depend only on this DTO.

---

## Java Record

```java
package com.aiteacher.document;

import java.util.List;

/**
 * Internal DTO produced by MarkerPageParser for one PDF page.
 * Decouples the Marker HTTP API from downstream services.
 */
public record PageResult(
    int pageNumber,             // 1-based, derived from Marker page block index
    String orderedText,         // full page text in correct reading order (blocks joined by \n\n)
    String headingTitle,        // first SectionHeader block on page, or null
    List<FigureData> figures    // extracted figure images (may be empty)
) {

    /**
     * A figure extracted from the page.
     * Image bytes are PNG data decoded from the Marker JSON `images` map.
     */
    public record FigureData(
        byte[] imageBytes,      // PNG image data (base64-decoded from Marker response)
        String nearestCaption,  // text of the adjacent Caption block, or null
        String blockId          // Marker block ID (e.g. "/page/0/Figure/2") for traceability
    ) {}
}
```
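
As a usage illustration, the sketch below builds a `PageResult` by hand, roughly the way a unit test for a downstream consumer might. The record is re-declared locally (without the package declaration and `public` modifiers) so the snippet compiles on its own. `PageResultExample`, `samplePage`, and all sample values are hypothetical and not part of the contract. One thing worth keeping in mind: a record's generated `equals` compares the `byte[]` component by reference, so two `FigureData` instances holding identical bytes are not equal unless `equals` is overridden.

```java
import java.nio.charset.StandardCharsets;
import java.util.List;

// Local copy of the record above so the example compiles on its own
// (package declaration and `public` modifiers dropped for a single-file sketch).
record PageResult(int pageNumber, String orderedText, String headingTitle, List<FigureData> figures) {
    record FigureData(byte[] imageBytes, String nearestCaption, String blockId) {}
}

final class PageResultExample {

    // Builds a sample page with one figure. All values are made up for illustration.
    static PageResult samplePage() {
        // Placeholder bytes; a real parser would supply decoded PNG data here.
        byte[] fakePng = "not-a-real-png".getBytes(StandardCharsets.UTF_8);
        PageResult.FigureData figure =
            new PageResult.FigureData(fakePng, "Figure 1: Cell structure", "/page/0/Figure/2");
        return new PageResult(
            1,                                             // pageNumber is 1-based
            "Cells\n\nAll living things are made of cells.", // blocks joined by a blank line
            "Cells",                                       // first SectionHeader on the page
            List.of(figure));
    }

    public static void main(String[] args) {
        PageResult page = samplePage();
        System.out.println(page.pageNumber() + " " + page.headingTitle() + " " + page.figures().size());
    }
}
```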

---

## Production Rules

| Field | Rule |
|-------|------|
| `pageNumber` | 1-based index derived from the Marker page block's position in the `children` array (index + 1). |
| `orderedText` | HTML-stripped text from all `Text`, `TextInlineMath`, `SectionHeader`, `ListItem`, and `Table` blocks, joined with `\n\n`. Marker already returns them in reading order. |
| `headingTitle` | Plain text of the first `SectionHeader` block on the page. `null` if no heading detected. |
| `figures` | One `FigureData` per `Figure` or `Picture` block that has a non-empty `images` entry. Blocks with no image data are skipped. |
| `imageBytes` | Base64-decoded bytes from `block.images[blockId]`. Marker returns PNG. |
| `nearestCaption` | Plain text of the first `Caption` block that is a sibling appearing immediately after the figure block. `null` if absent. |

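
The `orderedText` and `headingTitle` rules above can be sketched as follows. This is a minimal illustration under stated assumptions, not the parser's actual code: `Block` is a hypothetical flattened view of a Marker block (type plus HTML payload), `PageTextRules` is a made-up helper class, and the regex-based `stripHtml` is a deliberately naive stand-in that neither decodes HTML entities nor handles malformed markup.

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical flattened view of a Marker block: its block type and HTML payload.
record Block(String blockType, String html) {}

final class PageTextRules {

    // Block types that contribute to orderedText, per the Production Rules table.
    private static final Set<String> TEXT_TYPES =
        Set.of("Text", "TextInlineMath", "SectionHeader", "ListItem", "Table");

    // Naive tag stripper for illustration only; it does not decode HTML entities.
    static String stripHtml(String html) {
        return html.replaceAll("<[^>]+>", "").trim();
    }

    // Keeps the incoming order (assumed to be reading order) and joins with a blank line.
    static String orderedText(List<Block> blocks) {
        return blocks.stream()
            .filter(b -> TEXT_TYPES.contains(b.blockType()))
            .map(b -> stripHtml(b.html()))
            .collect(Collectors.joining("\n\n"));
    }

    // Plain text of the first SectionHeader block, or null when the page has none.
    static String headingTitle(List<Block> blocks) {
        return blocks.stream()
            .filter(b -> "SectionHeader".equals(b.blockType()))
            .map(b -> stripHtml(b.html()))
            .findFirst()
            .orElse(null);
    }

    public static void main(String[] args) {
        List<Block> blocks = List.of(
            new Block("SectionHeader", "<h2>Cells</h2>"),
            new Block("Figure", ""),
            new Block("Text", "<p>All <b>living</b> things.</p>"));
        System.out.println(headingTitle(blocks));
        System.out.println(orderedText(blocks));
    }
}
```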
---

## Mapping from Marker JSON

```
Marker JSON                                    →  PageResult

Page block ("/page/N/Page/M")                  →  PageResult(pageNumber = N + 1)
  SectionHeader child                          →  headingTitle (first match, HTML-stripped)
  Text / TextInlineMath / SectionHeader /
    ListItem / Table children                  →  orderedText (HTML-stripped, joined \n\n)
  Figure / Picture child                       →  FigureData
    images[blockId]                            →  FigureData.imageBytes (base64-decoded)
    next Caption sibling                       →  FigureData.nearestCaption (HTML-stripped)
    blockId                                    →  FigureData.blockId
```
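
The figure branch of this mapping might look roughly like the sketch below. It assumes the page's children arrive as an ordered list and that each block carries its own `images` map keyed by block ID. `MarkerBlock`, `FigureMapping`, and `stripHtml` are hypothetical names introduced for illustration, and the top-level `FigureData` is a local stand-in for `PageResult.FigureData`. `Base64.getDecoder().decode` throws `IllegalArgumentException` on malformed input; how the real parser handles that case is not specified here.

```java
import java.util.ArrayList;
import java.util.Base64;
import java.util.List;
import java.util.Map;

// Hypothetical view of one child block of a Marker page block.
record MarkerBlock(String id, String blockType, String html, Map<String, String> images) {}

// Local stand-in for PageResult.FigureData so the sketch compiles on its own.
record FigureData(byte[] imageBytes, String nearestCaption, String blockId) {}

final class FigureMapping {

    // Naive tag stripper for illustration only.
    static String stripHtml(String html) {
        return html.replaceAll("<[^>]+>", "").trim();
    }

    static List<FigureData> figures(List<MarkerBlock> children) {
        List<FigureData> result = new ArrayList<>();
        for (int i = 0; i < children.size(); i++) {
            MarkerBlock block = children.get(i);
            boolean isFigure = "Figure".equals(block.blockType()) || "Picture".equals(block.blockType());
            if (!isFigure) {
                continue;
            }
            // images[blockId] holds the base64 payload; blocks without one are skipped.
            String base64 = block.images() == null ? null : block.images().get(block.id());
            if (base64 == null || base64.isEmpty()) {
                continue;
            }
            // Only a Caption sibling immediately after the figure counts as its caption.
            String caption = null;
            if (i + 1 < children.size() && "Caption".equals(children.get(i + 1).blockType())) {
                caption = stripHtml(children.get(i + 1).html());
            }
            result.add(new FigureData(Base64.getDecoder().decode(base64), caption, block.id()));
        }
        return result;
    }

    public static void main(String[] args) {
        // Stand-in payload, not a real PNG.
        String payload = Base64.getEncoder().encodeToString(new byte[] {1, 2, 3});
        List<MarkerBlock> children = List.of(
            new MarkerBlock("/page/0/Figure/2", "Figure", "", Map.of("/page/0/Figure/2", payload)),
            new MarkerBlock("/page/0/Caption/3", "Caption", "<p>Figure 1: Cell structure</p>", Map.of()));
        for (FigureData f : figures(children)) {
            System.out.println(f.blockId() + " -> " + f.nearestCaption()
                + " (" + f.imageBytes().length + " bytes)");
        }
    }
}
```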

---

## Consumers

| Consumer | What It Uses |
|----------|-------------|
| `BookEmbeddingService` | `orderedText` → `SectionEntity.fullText`; `headingTitle` → `SectionEntity.title` |
| `FigureExtractionService` | `figures` list → decodes `imageBytes`, checks minimum size, saves to S3 |
| `TextChunkingService` | Receives `SectionEntity` (uses `orderedText` indirectly); **unchanged** |