# Internal Contract: MarkerPageParser → FigureExtractionService / BookEmbeddingService

**Branch**: `002-image-aware-embedding` | **Date**: 2026-04-04
**Type**: Internal Java DTO (not an HTTP contract)

---

## Purpose

`PageResult` is the internal data transfer object produced by `MarkerPageParser` for each PDF page. It decouples the Marker HTTP API from the rest of the pipeline. Downstream consumers (`BookEmbeddingService`, `FigureExtractionService`, `TextChunkingService`) are unaware of Marker and depend only on this DTO.

---

## Java Record

```java
package com.aiteacher.document;

import java.util.List;

/**
 * Internal DTO produced by MarkerPageParser for one PDF page.
 * Decouples the Marker HTTP API from downstream services.
 */
public record PageResult(
    int pageNumber,           // 1-based, derived from Marker page block index
    String orderedText,       // full page text in correct reading order (blocks joined by \n\n)
    String headingTitle,      // first SectionHeader block on page, or null
    List<FigureData> figures  // extracted figure images (may be empty)
) {
    /**
     * A figure extracted from the page.
     * Image bytes are PNG data decoded from the Marker JSON `images` map.
     */
    public record FigureData(
        byte[] imageBytes,      // PNG image data (base64-decoded from Marker response)
        String nearestCaption,  // text of the adjacent Caption block, or null
        String blockId          // Marker block ID (e.g. "/page/0/Figure/2") for traceability
    ) {}
}
```

---

## Production Rules

| Field | Rule |
|-------|------|
| `pageNumber` | 1-based index derived from the Marker page block's position in the `children` array (index + 1). |
| `orderedText` | HTML-stripped text from all `Text`, `TextInlineMath`, `SectionHeader`, `ListItem`, and `Table` blocks, joined with `\n\n`. Marker already returns them in reading order. |
| `headingTitle` | Plain text of the first `SectionHeader` block on the page. `null` if no heading is detected. |
| `figures` | One `FigureData` per `Figure` or `Picture` block that has a non-empty `images` entry. Blocks with no image data are skipped. |
| `imageBytes` | Base64-decoded bytes from `block.images[blockId]`. Marker returns PNG. |
| `nearestCaption` | Plain text of the first `Caption` block that is a sibling appearing immediately after the figure block. `null` if absent. |

---

## Mapping from Marker JSON

```
Marker JSON                         → PageResult

Page block ("/page/N/Page/M")       → PageResult(pageNumber = N + 1)
  SectionHeader child               → headingTitle (first match, HTML-stripped)
  Text / TextInlineMath children    → orderedText (HTML-stripped, joined \n\n)
  Figure / Picture child            → FigureData
    images[blockId]                 → FigureData.imageBytes (base64-decoded)
    next Caption sibling            → FigureData.nearestCaption (HTML-stripped)
    blockId                         → FigureData.blockId
```

---

## Consumers

| Consumer | What It Uses |
|----------|-------------|
| `BookEmbeddingService` | `orderedText` → `SectionEntity.fullText`; `headingTitle` → `SectionEntity.title` |
| `FigureExtractionService` | `figures` list → decodes `imageBytes`, checks min size, saves to S3 |
| `TextChunkingService` | Receives `SectionEntity` (uses `orderedText` indirectly); **unchanged** |
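
---

## Illustrative Sketch (non-normative)

The production rules and the Marker JSON mapping described earlier can be sketched in plain Java as shown here. This is a minimal illustration under stated assumptions, not the actual `MarkerPageParser`: the `Block` record is a hypothetical, simplified stand-in for an already-parsed Marker child block (its `html` and `images` fields are assumed names), the records are redeclared locally so the example is self-contained, HTML stripping uses a naive regex, and the sample image bytes are placeholders rather than real PNG data.

```java
import java.util.ArrayList;
import java.util.Base64;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class PageResultSketch {

    // Hypothetical stand-in for one parsed Marker child block (field names are assumptions).
    record Block(String id, String blockType, String html, Map<String, String> images) {}

    // Local copies of the DTO so this sketch compiles on its own.
    record FigureData(byte[] imageBytes, String nearestCaption, String blockId) {}
    record PageResult(int pageNumber, String orderedText, String headingTitle, List<FigureData> figures) {}

    // Block types that contribute to orderedText, per the production rules.
    static final Set<String> TEXT_TYPES =
            Set.of("Text", "TextInlineMath", "SectionHeader", "ListItem", "Table");

    // Naive tag removal for illustration only; a real parser would likely use an HTML library.
    static String stripHtml(String html) {
        return html == null ? "" : html.replaceAll("<[^>]*>", "").trim();
    }

    // pageIndex is the 0-based position of the page block; children are in reading order.
    static PageResult toPageResult(int pageIndex, List<Block> children) {
        List<String> texts = new ArrayList<>();
        List<FigureData> figures = new ArrayList<>();
        String heading = null;

        for (int i = 0; i < children.size(); i++) {
            Block block = children.get(i);
            String type = block.blockType();

            if (TEXT_TYPES.contains(type)) {
                String text = stripHtml(block.html());
                if (!text.isEmpty()) {
                    texts.add(text);
                }
                if (heading == null && type.equals("SectionHeader")) {
                    heading = text; // first SectionHeader on the page
                }
            } else if (type.equals("Figure") || type.equals("Picture")) {
                String base64 = block.images() == null ? null : block.images().get(block.id());
                if (base64 == null || base64.isEmpty()) {
                    continue; // no image data: skip this block
                }
                // Caption counts only if it is the sibling immediately after the figure.
                String caption = null;
                if (i + 1 < children.size() && children.get(i + 1).blockType().equals("Caption")) {
                    caption = stripHtml(children.get(i + 1).html());
                }
                figures.add(new FigureData(Base64.getDecoder().decode(base64), caption, block.id()));
            }
        }
        return new PageResult(pageIndex + 1, String.join("\n\n", texts), heading, List.copyOf(figures));
    }

    public static void main(String[] args) {
        String figureId = "/page/0/Figure/2";
        // Placeholder bytes, not a real PNG.
        String fakeImage = Base64.getEncoder().encodeToString(new byte[] {1, 2, 3});

        List<Block> children = List.of(
                new Block("/page/0/SectionHeader/0", "SectionHeader", "<h2>Photosynthesis</h2>", Map.of()),
                new Block("/page/0/Text/1", "Text", "<p>Plants convert <b>light</b> into energy.</p>", Map.of()),
                new Block(figureId, "Figure", "", Map.of(figureId, fakeImage)),
                new Block("/page/0/Caption/3", "Caption", "<p>Figure 1: A leaf.</p>", Map.of()),
                new Block("/page/0/Picture/4", "Picture", "", Map.of())); // skipped: no image entry

        PageResult page = toPageResult(0, children);
        System.out.println("pageNumber   = " + page.pageNumber());
        System.out.println("headingTitle = " + page.headingTitle());
        System.out.println("figures      = " + page.figures().size());
        System.out.println("caption      = " + page.figures().get(0).nearestCaption());
    }
}
```

In this sketch the `Picture` block without an `images` entry produces no `FigureData`, and the `Caption` text is attached to the preceding figure rather than added to `orderedText`, which matches the rules in the table as written.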