Internal Contract: DocumentAiPageParser → FigureExtractionService

Branch: 002-image-aware-embedding | Date: 2026-04-04
Type: Internal Java DTO (not an HTTP contract)

Purpose

PageResult is the internal data transfer object produced by DocumentAiPageParser for each PDF page. It decouples the Google Document AI SDK types from the rest of the pipeline so that PdfStructureParser can be replaced without cascading changes.

Java Record

package com.aiteacher.document;

import java.util.List;

/**
 * Internal DTO produced by DocumentAiPageParser for one PDF page.
 * Decouples the Document AI SDK types from downstream services.
 */
public record PageResult(
    int pageNumber,           // 1-based, matches Document.Page.getPageNumber()
    String orderedText,       // full page text in correct reading order (blocks joined by \n\n)
    String headingTitle,      // first HEADING block on page, or null
    List<FigureBbox> figures  // detected figure regions (may be empty)
) {

    /**
     * Normalized bounding box for a detected figure region.
     * Coordinates are in the [0.0, 1.0] range relative to page dimensions.
     */
    public record FigureBbox(
        float x,       // left edge (normalized)
        float y,       // top edge (normalized)
        float width,   // width (normalized)
        float height,  // height (normalized)
        String nearestCaption  // text of adjacent paragraph block, or null
    ) {}
}

Production Rules

Field	Rule
`orderedText`	Concatenation of all `PARAGRAPH` and `HEADING_*` blocks, joined with `\n\n`. Tables are represented as tab-separated text.
`headingTitle`	Text of the first block whose `blockType` is `HEADING_1` through `HEADING_6`. `null` if no heading is detected on the page.
`figures`	One entry per `VisualElement` with `type == "figure"` and `confidence ≥ 0.5`. Sorted top-to-bottom by `y`.
`nearestCaption`	Text of the `PARAGRAPH` block immediately following the figure bbox (by Y coordinate). `null` if no paragraph follows within 10% of page height.
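The `figures` and `nearestCaption` rules can be sketched as follows. This is a minimal illustration, not the parser's actual code: `VisualElement` and `Paragraph` here are hypothetical stand-ins for the SDK types, `FigureBbox` is a local copy of the record above, and the sketch assumes the 10% limit is measured from the bottom edge of the figure to the top of the paragraph.

```java
import java.util.Comparator;
import java.util.List;

final class FigureRulesSketch {

    /** Hypothetical stand-in for a detected visual element (normalized coordinates). */
    record VisualElement(String type, float confidence, float x, float y, float width, float height) {}

    /** Hypothetical stand-in for a PARAGRAPH block: top edge (normalized) and its text. */
    record Paragraph(float top, String text) {}

    /** Local copy of PageResult.FigureBbox so the sketch compiles on its own. */
    record FigureBbox(float x, float y, float width, float height, String nearestCaption) {}

    static final float MIN_CONFIDENCE = 0.5f;   // from the `figures` rule
    static final float MAX_CAPTION_GAP = 0.10f; // 10% of page height (assumed to start at the figure's bottom edge)

    /** Keeps figure elements with confidence >= 0.5, sorted top-to-bottom by y. */
    static List<FigureBbox> selectFigures(List<VisualElement> elements, List<Paragraph> paragraphs) {
        return elements.stream()
            .filter(e -> "figure".equals(e.type()) && e.confidence() >= MIN_CONFIDENCE)
            .sorted(Comparator.comparingDouble(VisualElement::y))
            .map(e -> new FigureBbox(e.x(), e.y(), e.width(), e.height(), nearestCaption(e, paragraphs)))
            .toList();
    }

    /** Returns the text of the closest paragraph below the figure, or null if none is near enough. */
    static String nearestCaption(VisualElement figure, List<Paragraph> paragraphs) {
        float bottom = figure.y() + figure.height();
        return paragraphs.stream()
            .filter(p -> p.top() >= bottom && p.top() - bottom <= MAX_CAPTION_GAP)
            .min(Comparator.comparingDouble(Paragraph::top))
            .map(Paragraph::text)
            .orElse(null);
    }
}
```

Low-confidence elements and non-figure types are dropped before sorting, so the resulting list may be empty, which matches the `figures` field comment in the record.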

Mapping from Document AI Proto

Document.Page.Block         → orderedText (concatenated)
Document.Page.Block (HEADING_*) → headingTitle (first match)
Document.Page.VisualElement → FigureBbox
  ├─ layout.bounding_poly.normalized_vertices[0] → (x, y) top-left
  └─ layout.bounding_poly.normalized_vertices[2] → (x + width, y + height) bottom-right
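As a rough sketch of the vertex mapping above: width and height are the differences between the bottom-right and top-left vertices. `Vertex` is a hypothetical stand-in for the SDK's normalized vertex type, `FigureBbox` is a local copy of the record, and clamping to [0.0, 1.0] is an added assumption rather than something this contract specifies.

```java
import java.util.List;

final class BboxMappingSketch {

    /** Hypothetical stand-in for a normalized vertex (x and y relative to page size). */
    record Vertex(float x, float y) {}

    /** Local copy of PageResult.FigureBbox so the sketch compiles on its own. */
    record FigureBbox(float x, float y, float width, float height, String nearestCaption) {}

    /** Builds a bbox from vertex 0 (top-left) and vertex 2 (bottom-right). */
    static FigureBbox toFigureBbox(List<Vertex> vertices, String nearestCaption) {
        if (vertices.size() < 3) {
            throw new IllegalArgumentException("need at least 3 vertices, got " + vertices.size());
        }
        Vertex topLeft = vertices.get(0);
        Vertex bottomRight = vertices.get(2);
        // Clamping is an assumption: it guards against values slightly outside [0, 1].
        float x = clamp01(topLeft.x());
        float y = clamp01(topLeft.y());
        float width = clamp01(bottomRight.x()) - x;
        float height = clamp01(bottomRight.y()) - y;
        return new FigureBbox(x, y, width, height, nearestCaption);
    }

    static float clamp01(float v) {
        return Math.max(0f, Math.min(1f, v));
    }
}
```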

Consumers

Consumer	What It Uses
`BookEmbeddingService`	`orderedText` → `SectionEntity.fullText`; `headingTitle` → `SectionEntity.title`
`FigureExtractionService`	`figures` list → renders page via PDFBox, crops each bbox to `BufferedImage`
`TextChunkingService`	Receives `SectionEntity`, so it uses `orderedText` only indirectly; the service itself is unchanged
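The cropping step in `FigureExtractionService` amounts to scaling the normalized bbox to pixel coordinates. The sketch below assumes the page has already been rendered to a `BufferedImage` (the PDFBox rendering call is left out) and uses only `java.awt.image`. The class and method names are hypothetical, and clamping to the image bounds is an added safeguard, not part of the contract.

```java
import java.awt.image.BufferedImage;

final class FigureCropSketch {

    /** Local copy of PageResult.FigureBbox so the sketch compiles on its own. */
    record FigureBbox(float x, float y, float width, float height, String nearestCaption) {}

    /** Converts the normalized bbox to pixels and returns that region of the page image. */
    static BufferedImage crop(BufferedImage page, FigureBbox bbox) {
        int pageW = page.getWidth();
        int pageH = page.getHeight();
        // Clamp so that rounding or slightly out-of-range boxes cannot exceed the raster.
        int left = clamp(Math.round(bbox.x() * pageW), 0, pageW - 1);
        int top = clamp(Math.round(bbox.y() * pageH), 0, pageH - 1);
        int width = clamp(Math.round(bbox.width() * pageW), 1, pageW - left);
        int height = clamp(Math.round(bbox.height() * pageH), 1, pageH - top);
        // getSubimage shares pixel data with the source image; copy it if the page is reused.
        return page.getSubimage(left, top, width, height);
    }

    static int clamp(int value, int min, int max) {
        return Math.max(min, Math.min(max, value));
    }
}
```

Because the coordinates are normalized, the same `FigureBbox` works at any render resolution; only the pixel rectangle changes.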
