Internal Contract: MarkerPageParser → FigureExtractionService / BookEmbeddingService
Branch: 002-image-aware-embedding | Date: 2026-04-04
Type: Internal Java DTO (not an HTTP contract)
Purpose
PageResult is the internal data transfer object (DTO) that MarkerPageParser produces for each
PDF page. It decouples the Marker HTTP API from the rest of the pipeline: the downstream consumers
(BookEmbeddingService, FigureExtractionService and, indirectly, TextChunkingService) never see
Marker's response format and depend only on this DTO.
Java Record
package com.aiteacher.document;

import java.util.List;

/**
 * Internal DTO produced by MarkerPageParser for one PDF page.
 * Decouples the Marker HTTP API from downstream services.
 */
public record PageResult(
    int pageNumber,           // 1-based, derived from Marker page block index
    String orderedText,       // full page text in correct reading order (blocks joined by \n\n)
    String headingTitle,      // first SectionHeader block on page, or null
    List<FigureData> figures  // extracted figure images (may be empty)
) {
    /**
     * A figure extracted from the page.
     * Image bytes are PNG data decoded from the Marker JSON `images` map.
     */
    public record FigureData(
        byte[] imageBytes,     // PNG image data (base64-decoded from Marker response)
        String nearestCaption, // text of the adjacent Caption block, or null
        String blockId         // Marker block ID (e.g. "/page/0/Figure/2") for traceability
    ) {}
}
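A short usage sketch may help when writing tests against this DTO. It re-declares trimmed copies of the two records so that it compiles on its own; the class name `PageResultUsage`, the helper `sameFigure`, and the sample values are hypothetical. One thing worth keeping in mind is that the `equals()` generated for a record compares a `byte[]` component by reference, so two `FigureData` instances holding identical bytes are not equal unless they are compared by content.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Objects;

final class PageResultUsage {
    // Trimmed copies of the records above, repeated so the sketch is self-contained.
    record FigureData(byte[] imageBytes, String nearestCaption, String blockId) {}
    record PageResult(int pageNumber, String orderedText, String headingTitle,
                      List<FigureData> figures) {}

    /** Content-based comparison; the generated equals() compares byte[] by reference. */
    static boolean sameFigure(FigureData a, FigureData b) {
        return Arrays.equals(a.imageBytes(), b.imageBytes())
            && Objects.equals(a.nearestCaption(), b.nearestCaption())
            && Objects.equals(a.blockId(), b.blockId());
    }

    /** Builds a sample page; the bytes are placeholders, not a real PNG. */
    static PageResult samplePage() {
        FigureData figure =
            new FigureData(new byte[] {1, 2, 3}, "Figure 1: Example", "/page/0/Figure/2");
        return new PageResult(1, "Intro\n\nFirst paragraph.", "Intro", List.of(figure));
    }
}
```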
Production Rules
| Field | Rule |
|---|---|
| pageNumber | 1-based index derived from the Marker page block's position in the children array (index + 1). |
| orderedText | HTML-stripped text from all Text, TextInlineMath, SectionHeader, ListItem, and Table blocks, joined with \n\n. Marker already returns them in reading order. |
| headingTitle | Plain text of the first SectionHeader block on the page. null if no heading detected. |
| figures | One FigureData per Figure or Picture block that has a non-empty images entry. Blocks with no image data are skipped. |
| imageBytes | Base64-decoded bytes from block.images[blockId]. Marker returns PNG. |
| nearestCaption | Plain text of the first Caption block that is a sibling appearing immediately after the figure block. null if absent. |
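The rules above can be sketched as a single pass over a page's child blocks. This is a minimal illustration, not the actual MarkerPageParser: the `Block` record is a hypothetical, simplified view of a Marker block, `stripHtml` is a naive tag stripper that does not decode HTML entities, and skipping blocks whose stripped text is empty is an assumption the rules do not state.

```java
import java.util.ArrayList;
import java.util.Base64;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class PageAssemblySketch {
    // Hypothetical simplified view of one Marker child block.
    record Block(String id, String blockType, String html, Map<String, String> images) {}

    record FigureData(byte[] imageBytes, String nearestCaption, String blockId) {}
    record PageResult(int pageNumber, String orderedText, String headingTitle,
                      List<FigureData> figures) {}

    private static final Set<String> TEXT_TYPES =
        Set.of("Text", "TextInlineMath", "SectionHeader", "ListItem", "Table");
    private static final Set<String> FIGURE_TYPES = Set.of("Figure", "Picture");

    // Naive tag stripper for illustration only; it does not decode HTML entities.
    static String stripHtml(String html) {
        return html == null ? "" : html.replaceAll("<[^>]*>", "").trim();
    }

    /** Applies the production rules to the children of the page at the given 0-based index. */
    static PageResult assemble(int pageIndex, List<Block> children) {
        List<String> parts = new ArrayList<>();
        List<FigureData> figures = new ArrayList<>();
        String heading = null;

        for (int i = 0; i < children.size(); i++) {
            Block block = children.get(i);
            if (TEXT_TYPES.contains(block.blockType())) {
                String text = stripHtml(block.html());
                if (text.isEmpty()) continue; // assumption: empty blocks add nothing
                parts.add(text);
                if (heading == null && block.blockType().equals("SectionHeader")) {
                    heading = text; // first SectionHeader on the page wins
                }
            } else if (FIGURE_TYPES.contains(block.blockType())) {
                String encoded = block.images() == null ? null : block.images().get(block.id());
                if (encoded == null || encoded.isEmpty()) continue; // no image data: skip

                // Only a Caption that immediately follows the figure counts as its caption.
                String caption = null;
                if (i + 1 < children.size()
                        && children.get(i + 1).blockType().equals("Caption")) {
                    caption = stripHtml(children.get(i + 1).html());
                }
                byte[] bytes = Base64.getDecoder().decode(encoded);
                figures.add(new FigureData(bytes, caption, block.id()));
            }
        }
        return new PageResult(pageIndex + 1, String.join("\n\n", parts), heading,
                              List.copyOf(figures));
    }
}
```

Because Caption is not one of the text block types listed for orderedText, caption text in this sketch reaches consumers only through nearestCaption.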
Mapping from Marker JSON
Marker JSON → PageResult
  Page block ("/page/N/Page/M") → PageResult(pageNumber = N + 1)
    SectionHeader child → headingTitle (first match, HTML-stripped)
    Text / TextInlineMath / SectionHeader / ListItem / Table children → orderedText (HTML-stripped, joined \n\n)
    Figure / Picture child → FigureData
      images[blockId] → FigureData.imageBytes (base64-decoded)
      next Caption sibling → FigureData.nearestCaption (HTML-stripped)
      blockId → FigureData.blockId
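For traceability it can be useful to recover the page number from a stored blockId. The sketch below assumes that every block ID follows the "/page/N/BlockType/M" shape seen in the examples in this document; the class and method names are hypothetical, and the pattern should be checked against real Marker output before relying on it.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class BlockIdSketch {
    // Assumed ID shape: /page/<0-based page index>/<block type>/<block index>
    private static final Pattern BLOCK_ID =
        Pattern.compile("^/page/(\\d+)/([A-Za-z]+)/(\\d+)$");

    private static Matcher match(String blockId) {
        Matcher m = BLOCK_ID.matcher(blockId);
        if (!m.matches()) {
            throw new IllegalArgumentException("Unexpected block ID: " + blockId);
        }
        return m;
    }

    /** Returns the 1-based page number (N + 1) encoded in the block ID. */
    static int pageNumberOf(String blockId) {
        return Integer.parseInt(match(blockId).group(1)) + 1;
    }

    /** Returns the block type segment, for example "Figure". */
    static String blockTypeOf(String blockId) {
        return match(blockId).group(2);
    }
}
```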
Consumers
| Consumer | What It Uses |
|---|---|
| BookEmbeddingService | orderedText → SectionEntity.fullText; headingTitle → SectionEntity.title |
| FigureExtractionService | figures list → decodes imageBytes, checks min size, saves to S3 |
| TextChunkingService | Receives SectionEntity (uses orderedText indirectly); unchanged |
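The FigureExtractionService row can be illustrated with a small filter-and-store loop. This is a sketch under stated assumptions, not the service itself: "min size" is read here as a minimum pixel dimension, the 64-pixel threshold and the key scheme are hypothetical, and `FigureStore` is a hypothetical interface standing in for the S3 upload.

```java
import java.awt.image.BufferedImage;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.imageio.ImageIO;

final class FigureFilterSketch {
    // Trimmed copy of PageResult.FigureData so the sketch is self-contained.
    record FigureData(byte[] imageBytes, String nearestCaption, String blockId) {}

    /** Hypothetical storage port; a real implementation would upload to S3. */
    interface FigureStore {
        void save(String key, byte[] pngBytes);
    }

    static final int MIN_SIDE_PX = 64; // hypothetical threshold

    /** Saves every figure that decodes as an image and meets the minimum size; returns the keys. */
    static List<String> storeLargeFigures(int pageNumber, List<FigureData> figures,
                                          FigureStore store) {
        List<String> savedKeys = new ArrayList<>();
        for (FigureData figure : figures) {
            if (figure.imageBytes() == null) continue;
            BufferedImage image;
            try {
                image = ImageIO.read(new ByteArrayInputStream(figure.imageBytes()));
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
            if (image == null) continue; // bytes are not a readable image
            if (image.getWidth() < MIN_SIDE_PX || image.getHeight() < MIN_SIDE_PX) continue;

            // Hypothetical key scheme: page number plus the block ID with slashes replaced.
            String key = "page-" + pageNumber + figure.blockId().replace('/', '_') + ".png";
            store.save(key, figure.imageBytes());
            savedKeys.add(key);
        }
        return savedKeys;
    }
}
```

Keeping the storage call behind an interface lets the size check be unit-tested without any cloud dependency.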