Internal Contract: MarkerPageParser → FigureExtractionService / BookEmbeddingService

Branch: 002-image-aware-embedding | Date: 2026-04-04
Type: Internal Java DTO (not an HTTP contract)


Purpose

PageResult is the internal data transfer object produced by MarkerPageParser for each PDF page. It decouples the Marker HTTP API from the rest of the pipeline. Downstream consumers (BookEmbeddingService, FigureExtractionService, TextChunkingService) are unaware of Marker and depend only on this DTO.


Java Record

package com.aiteacher.document;

import java.util.List;

/**
 * Internal DTO produced by MarkerPageParser for one PDF page.
 * Decouples the Marker HTTP API from downstream services.
 */
public record PageResult(
    int pageNumber,              // 1-based, derived from Marker page block index
    String orderedText,          // full page text in correct reading order (blocks joined by \n\n)
    String headingTitle,         // first SectionHeader block on page, or null
    List<FigureData> figures     // extracted figure images (may be empty)
) {

    /**
     * A figure extracted from the page.
     * Image bytes are PNG data decoded from the Marker JSON `images` map.
     */
    public record FigureData(
        byte[] imageBytes,       // PNG image data (base64-decoded from Marker response)
        String nearestCaption,   // text of the adjacent Caption block, or null
        String blockId           // Marker block ID (e.g. "/page/0/Figure/2") for traceability
    ) {}
}
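
As a usage illustration, the following minimal sketch builds one PageResult by hand and reads it back. The class name PageResultUsage, the sample values, and the sameFigure helper are hypothetical and not part of the contract; the records are repeated in compact form only so the example compiles on its own. It also shows one Java detail worth keeping in mind: a record's generated equals() compares a byte[] component by reference, so figures should be compared by content explicitly.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Objects;

public class PageResultUsage {

    // Compact copies of the contract records so this sketch is self-contained.
    public record FigureData(byte[] imageBytes, String nearestCaption, String blockId) {}

    public record PageResult(int pageNumber, String orderedText,
                             String headingTitle, List<FigureData> figures) {}

    /** Builds a hypothetical first page with a heading, a paragraph, and one figure. */
    public static PageResult samplePage() {
        byte[] placeholderBytes = {1, 2, 3}; // stand-in bytes, not a real PNG
        FigureData figure =
            new FigureData(placeholderBytes, "Figure 1: Example", "/page/0/Figure/2");
        return new PageResult(1, "Introduction\n\nFirst paragraph.", "Introduction",
                              List.of(figure));
    }

    /** Content-based comparison, since record equals() compares byte[] by reference. */
    public static boolean sameFigure(FigureData a, FigureData b) {
        return Arrays.equals(a.imageBytes(), b.imageBytes())
            && Objects.equals(a.nearestCaption(), b.nearestCaption())
            && Objects.equals(a.blockId(), b.blockId());
    }

    public static void main(String[] args) {
        PageResult page = samplePage();
        System.out.println("Page " + page.pageNumber() + ", heading: " + page.headingTitle());
        for (FigureData figure : page.figures()) {
            System.out.println(figure.blockId() + " (" + figure.imageBytes().length + " bytes)");
        }
    }
}
```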

Production Rules

| Field | Rule |
|---|---|
| pageNumber | 1-based index derived from the Marker page block's position in the children array (index + 1). |
| orderedText | HTML-stripped text from all Text, TextInlineMath, SectionHeader, ListItem, and Table blocks, joined with \n\n. Marker already returns them in reading order. |
| headingTitle | Plain text of the first SectionHeader block on the page. null if no heading is detected. |
| figures | One FigureData per Figure or Picture block that has a non-empty images entry. Blocks with no image data are skipped. |
| imageBytes | Base64-decoded bytes from block.images[blockId]. Marker returns PNG. |
| nearestCaption | Plain text of the first Caption block that is a sibling appearing immediately after the figure block. null if absent. |
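
The rules above can be sketched as a single pass over a page's child blocks. This is an illustration under stated assumptions, not the MarkerPageParser implementation: the Block record is a hypothetical, simplified view of a Marker child block, stripHtml is a naive regex tag stripper (it does not decode HTML entities), and skipping text blocks that are empty after stripping is an assumption the rules do not state.

```java
import java.util.ArrayList;
import java.util.Base64;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class PageAssemblySketch {

    // Hypothetical, simplified view of one Marker child block.
    public record Block(String id, String blockType, String html, Map<String, String> images) {}

    public record FigureData(byte[] imageBytes, String nearestCaption, String blockId) {}

    public record PageResult(int pageNumber, String orderedText,
                             String headingTitle, List<FigureData> figures) {}

    private static final Set<String> TEXT_TYPES =
        Set.of("Text", "TextInlineMath", "SectionHeader", "ListItem", "Table");

    // Naive tag stripper for illustration only; it does not decode HTML entities.
    static String stripHtml(String html) {
        return html == null ? "" : html.replaceAll("<[^>]*>", "").trim();
    }

    /** Applies the production rules to the children of the page at the given 0-based index. */
    public static PageResult toPageResult(int pageIndex, List<Block> children) {
        List<String> parts = new ArrayList<>();
        List<FigureData> figures = new ArrayList<>();
        String headingTitle = null;

        for (int i = 0; i < children.size(); i++) {
            Block block = children.get(i);
            String type = block.blockType();

            if (TEXT_TYPES.contains(type)) {
                String text = stripHtml(block.html());
                if (text.isEmpty()) {
                    continue; // assumption: blocks with no text contribute nothing
                }
                parts.add(text);
                if (headingTitle == null && type.equals("SectionHeader")) {
                    headingTitle = text; // first SectionHeader on the page
                }
            } else if (type.equals("Figure") || type.equals("Picture")) {
                String base64 = block.images() == null ? null : block.images().get(block.id());
                if (base64 == null || base64.isEmpty()) {
                    continue; // no image data: skip the block
                }
                // The caption must be the sibling immediately after the figure block.
                String caption = null;
                if (i + 1 < children.size()
                        && children.get(i + 1).blockType().equals("Caption")) {
                    caption = stripHtml(children.get(i + 1).html());
                }
                figures.add(new FigureData(
                    Base64.getDecoder().decode(base64), caption, block.id()));
            }
        }
        return new PageResult(pageIndex + 1, String.join("\n\n", parts),
                              headingTitle, List.copyOf(figures));
    }
}
```

Caption blocks are not among the text block types listed for orderedText, so in this sketch a caption appears only in FigureData.nearestCaption.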

Mapping from Marker JSON

Marker JSON → PageResult

Page block ("/page/N/Page/M")       → PageResult(pageNumber = N + 1)
  SectionHeader child                → headingTitle (first match, HTML-stripped)
  Text / TextInlineMath / SectionHeader /
  ListItem / Table children          → orderedText (HTML-stripped, joined \n\n)
  Figure / Picture child            → FigureData
    images[blockId]                  → FigureData.imageBytes (base64-decoded)
    next Caption sibling             → FigureData.nearestCaption (HTML-stripped)
    blockId                          → FigureData.blockId
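
Because FigureData.blockId is kept for traceability, it can help to recover the page number and block type from it when reading logs. The sketch below assumes every block ID has the shape /page/N/Type/M shown in the examples above; the class and method names are hypothetical, and the pattern may need adjusting if Marker emits other ID forms.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BlockIdSketch {

    // Assumed ID shape: /page/<0-based page index>/<block type>/<block index>
    private static final Pattern BLOCK_ID =
        Pattern.compile("^/page/(\\d+)/([A-Za-z]+)/(\\d+)$");

    /** Returns the 1-based page number (N + 1) encoded in the block ID. */
    public static int pageNumberOf(String blockId) {
        return Integer.parseInt(match(blockId).group(1)) + 1;
    }

    /** Returns the block type segment, e.g. "Figure". */
    public static String blockTypeOf(String blockId) {
        return match(blockId).group(2);
    }

    private static Matcher match(String blockId) {
        Matcher matcher = BLOCK_ID.matcher(blockId);
        if (!matcher.matches()) {
            throw new IllegalArgumentException("Unexpected block ID: " + blockId);
        }
        return matcher;
    }

    public static void main(String[] args) {
        String id = "/page/0/Figure/2";
        System.out.println(blockTypeOf(id) + " on page " + pageNumberOf(id));
    }
}
```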

Consumers

| Consumer | What It Uses |
|---|---|
| BookEmbeddingService | orderedText → SectionEntity.fullText; headingTitle → SectionEntity.title |
| FigureExtractionService | figures list → decodes imageBytes, checks min size, saves to S3 |
| TextChunkingService | Receives SectionEntity (uses orderedText indirectly); unchanged |
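
The "checks min size" step in FigureExtractionService could look roughly like the sketch below. The contract does not say whether the minimum is measured in pixels or bytes, nor what the threshold is, so the pixel-based check and its minWidth and minHeight parameters are assumptions; the class name and the blankPng helper are hypothetical, and the S3 upload is left out.

```java
import java.awt.image.BufferedImage;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import javax.imageio.ImageIO;

public class FigureSizeCheckSketch {

    /** Returns true if the bytes decode to an image of at least minWidth x minHeight pixels. */
    public static boolean isLargeEnough(byte[] imageBytes, int minWidth, int minHeight) {
        if (imageBytes == null || imageBytes.length == 0) {
            return false;
        }
        try {
            BufferedImage image = ImageIO.read(new ByteArrayInputStream(imageBytes));
            // ImageIO.read returns null when no registered reader recognises the data.
            return image != null
                && image.getWidth() >= minWidth
                && image.getHeight() >= minHeight;
        } catch (IOException e) {
            return false; // treat unreadable data like a missing image
        }
    }

    /** Helper for trying the check out: encodes a blank PNG of the given size. */
    public static byte[] blankPng(int width, int height) {
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            BufferedImage image = new BufferedImage(width, height, BufferedImage.TYPE_INT_RGB);
            ImageIO.write(image, "png", out);
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isLargeEnough(blankPng(120, 80), 50, 50));
        System.out.println(isLargeEnough(blankPng(10, 10), 50, 50));
    }
}
```

Filtering on pixel dimensions rather than byte length avoids rejecting large but highly compressible figures, at the cost of decoding each image once before upload.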