Add Marker to parse PDFs effectively

2026-04-04 21:30:18 +02:00
parent b154e29f2d
commit ea1276dc2e
25 changed files with 2318 additions and 285 deletions
@@ -0,0 +1,79 @@
+# Internal Contract: DocumentAiPageParser → FigureExtractionService
+
+**Branch**: `002-image-aware-embedding` | **Date**: 2026-04-04  
+**Type**: Internal Java DTO (not an HTTP contract)
+
+---
+
+## Purpose
+
+`PageResult` is the internal data transfer object produced by `DocumentAiPageParser` for each
+PDF page. It decouples the Google Document AI SDK types from the rest of the pipeline so that
+`PdfStructureParser` can be replaced without cascading changes.
+
+---
+
+## Java Record
+
+```java
+package com.aiteacher.document;
+
+import java.util.List;
+
+/**
+ * Internal DTO produced by DocumentAiPageParser for one PDF page.
+ * Decouples the Document AI SDK types from downstream services.
+ */
+public record PageResult(
+    int pageNumber,           // 1-based, matches Document.Page.getPageNumber()
+    String orderedText,       // full page text in correct reading order (blocks joined by \n\n)
+    String headingTitle,      // first HEADING block on page, or null
+    List<FigureBbox> figures  // detected figure regions (may be empty)
+) {
+
+    /**
+     * Normalized bounding box for a detected figure region.
+     * Coordinates are in the [0.0, 1.0] range relative to page dimensions.
+     */
+    public record FigureBbox(
+        float x,       // left edge (normalized)
+        float y,       // top edge (normalized)
+        float width,   // width (normalized)
+        float height,  // height (normalized)
+        String nearestCaption  // text of adjacent paragraph block, or null
+    ) {}
+}
+```
+
+---
+
+## Production Rules
+
+| Field | Rule |
+|-------|------|
+| `orderedText` | Concatenation of all `PARAGRAPH` and `HEADING_*` blocks, joined with `\n\n`. Tables are represented as tab-separated text. |
+| `headingTitle` | First block whose `blockType` is `HEADING_1` through `HEADING_6`. `null` if no heading detected. |
+| `figures` | One entry per `VisualElement` with `type == "figure"` and `confidence ≥ 0.5`. Sorted top-to-bottom by `y`. |
+| `nearestCaption` | Text of the `PARAGRAPH` block immediately below the figure bbox (ordered by Y coordinate). `null` if no paragraph starts within 10% of the page height below the figure. |
+
+---
+
+## Mapping from Document AI Proto
+
+```
+Document.Page.Block         → orderedText (concatenated)
+Document.Page.Block (HEADING_*) → headingTitle (first match)
+Document.Page.VisualElement → FigureBbox
+  ├─ layout.bounding_poly.normalized_vertices[0] → (x, y) top-left
+  └─ layout.bounding_poly.normalized_vertices[2] → (x + width, y + height) bottom-right
+```
+
+---
+
+## Consumers
+
+| Consumer | What It Uses |
+|----------|-------------|
+| `BookEmbeddingService` | `orderedText` → `SectionEntity.fullText`; `headingTitle` → `SectionEntity.title` |
+| `FigureExtractionService` | `figures` list → renders page via PDFBox, crops each bbox to `BufferedImage` |
+| `TextChunkingService` | Receives `SectionEntity` (indirectly uses `orderedText`) — **unchanged** |
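The production rules and vertex mapping in the contract above can be sketched in plain Java. This is a minimal illustration, not the project's implementation: `Vertex`, `Detected`, `PixelRect`, and the class `FigureBboxSketch` are hypothetical stand-ins introduced here so the example runs without the Document AI SDK, and the clamping behaviour in `toPixels` is an assumption about how a consumer might guard a crop.

```java
import java.util.ArrayList;
import java.util.List;

public class FigureBboxSketch {

    /** Mirrors PageResult.FigureBbox from the contract. */
    public record FigureBbox(float x, float y, float width, float height, String nearestCaption) {}

    /** Hypothetical stand-in for one normalized vertex (not an SDK type). */
    public record Vertex(float x, float y) {}

    /** Hypothetical stand-in for a detected visual element (not an SDK type). */
    public record Detected(String type, float confidence, List<Vertex> vertices) {}

    /** Pixel rectangle in the (x, y, width, height) form used by image cropping APIs. */
    public record PixelRect(int x, int y, int width, int height) {}

    /** vertices[0] is the top-left corner, vertices[2] the bottom-right corner. */
    public static FigureBbox toBbox(List<Vertex> vertices, String caption) {
        Vertex topLeft = vertices.get(0);
        Vertex bottomRight = vertices.get(2);
        return new FigureBbox(
                topLeft.x(),
                topLeft.y(),
                bottomRight.x() - topLeft.x(),
                bottomRight.y() - topLeft.y(),
                caption);
    }

    /** Keeps elements with type "figure" and confidence >= 0.5, sorted top-to-bottom by y. */
    public static List<FigureBbox> selectFigures(List<Detected> elements) {
        List<FigureBbox> figures = new ArrayList<>();
        for (Detected e : elements) {
            boolean isFigure = "figure".equals(e.type());
            if (isFigure && e.confidence() >= 0.5f && e.vertices().size() >= 3) {
                figures.add(toBbox(e.vertices(), null));
            }
        }
        figures.sort((a, b) -> Float.compare(a.y(), b.y()));
        return figures;
    }

    /** Scales a normalized bbox to pixels and clamps it to the page (clamping is an assumption). */
    public static PixelRect toPixels(FigureBbox b, int pageWidthPx, int pageHeightPx) {
        int x = clamp(Math.round(b.x() * pageWidthPx), 0, pageWidthPx - 1);
        int y = clamp(Math.round(b.y() * pageHeightPx), 0, pageHeightPx - 1);
        int w = clamp(Math.round(b.width() * pageWidthPx), 1, pageWidthPx - x);
        int h = clamp(Math.round(b.height() * pageHeightPx), 1, pageHeightPx - y);
        return new PixelRect(x, y, w, h);
    }

    private static int clamp(int value, int lo, int hi) {
        return Math.max(lo, Math.min(hi, value));
    }
}
```

A consumer such as `FigureExtractionService` could pass the resulting rectangle to `BufferedImage.getSubimage(x, y, width, height)` on the rendered page; clamping first avoids the exception that call raises when the rectangle extends past the image bounds.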
@@ -0,0 +1,84 @@
+# Internal Contract: MarkerPageParser → FigureExtractionService / BookEmbeddingService
+
+**Branch**: `002-image-aware-embedding` | **Date**: 2026-04-04  
+**Type**: Internal Java DTO (not an HTTP contract)
+
+---
+
+## Purpose
+
+`PageResult` is the internal data transfer object produced by `MarkerPageParser` for each
+PDF page. It decouples the Marker HTTP API from the rest of the pipeline. Downstream consumers
+(`BookEmbeddingService`, `FigureExtractionService`, `TextChunkingService`) are unaware of
+Marker and depend only on this DTO.
+
+---
+
+## Java Record
+
+```java
+package com.aiteacher.document;
+
+import java.util.List;
+
+/**
+ * Internal DTO produced by MarkerPageParser for one PDF page.
+ * Decouples the Marker HTTP API from downstream services.
+ */
+public record PageResult(
+    int pageNumber,              // 1-based, derived from Marker page block index
+    String orderedText,          // full page text in correct reading order (blocks joined by \n\n)
+    String headingTitle,         // first SectionHeader block on page, or null
+    List<FigureData> figures     // extracted figure images (may be empty)
+) {
+
+    /**
+     * A figure extracted from the page.
+     * Image bytes are PNG data decoded from the Marker JSON `images` map.
+     */
+    public record FigureData(
+        byte[] imageBytes,       // PNG image data (base64-decoded from Marker response)
+        String nearestCaption,   // text of the adjacent Caption block, or null
+        String blockId           // Marker block ID (e.g. "/page/0/Figure/2") for traceability
+    ) {}
+}
+```
+
+---
+
+## Production Rules
+
+| Field | Rule |
+|-------|------|
+| `pageNumber` | 1-based index derived from the Marker page block's position in the `children` array (index + 1). |
+| `orderedText` | HTML-stripped text from all `Text`, `TextInlineMath`, `SectionHeader`, `ListItem`, and `Table` blocks, joined with `\n\n`. Marker already returns them in reading order. |
+| `headingTitle` | Plain text of the first `SectionHeader` block on the page. `null` if no heading detected. |
+| `figures` | One `FigureData` per `Figure` or `Picture` block that has a non-empty `images` entry. Blocks with no image data are skipped. |
+| `imageBytes` | Base64-decoded bytes from `block.images[blockId]`. Marker returns PNG. |
+| `nearestCaption` | Plain text of the `Caption` block that is the sibling immediately following the figure block. `null` if absent. |
+
+---
+
+## Mapping from Marker JSON
+
+```
+Marker JSON → PageResult
+
+Page block ("/page/N/Page/M")                → PageResult(pageNumber = N + 1)
+  SectionHeader child                        → headingTitle (first match, HTML-stripped)
+  Text / TextInlineMath / SectionHeader /
+  ListItem / Table children                  → orderedText (HTML-stripped, joined \n\n)
+  Figure / Picture child                     → FigureData
+    images[blockId]                          → FigureData.imageBytes (base64-decoded)
+    next Caption sibling                     → FigureData.nearestCaption (HTML-stripped)
+    blockId                                  → FigureData.blockId
+```
+
+---
+
+## Consumers
+
+| Consumer | What It Uses |
+|----------|-------------|
+| `BookEmbeddingService` | `orderedText` → `SectionEntity.fullText`; `headingTitle` → `SectionEntity.title` |
+| `FigureExtractionService` | `figures` list → decodes `imageBytes`, checks min size, saves to S3 |
+| `TextChunkingService` | Receives `SectionEntity` (uses `orderedText` indirectly) — **unchanged** |
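The Marker production rules above can be sketched as follows, using only the Java standard library. This is an illustration under stated assumptions: the `Block` record is a hypothetical, already-deserialized view of one Marker child block (its field names are not taken from the project), `MarkerMappingSketch` is an illustrative class name, and the regex-based `stripHtml` is a deliberately crude substitute for a real HTML parser.

```java
import java.util.ArrayList;
import java.util.Base64;
import java.util.List;
import java.util.Map;

public class MarkerMappingSketch {

    /** Hypothetical view of one Marker child block; field names are assumptions. */
    public record Block(String id, String blockType, String html, Map<String, String> images) {}

    /** Mirrors PageResult.FigureData from the contract. */
    public record FigureData(byte[] imageBytes, String nearestCaption, String blockId) {}

    private static final List<String> TEXT_TYPES =
            List.of("Text", "TextInlineMath", "SectionHeader", "ListItem", "Table");

    /** Crude tag stripper: replaces tags with spaces, then collapses whitespace. */
    public static String stripHtml(String html) {
        if (html == null) {
            return "";
        }
        return html.replaceAll("<[^>]*>", " ").replaceAll("\\s+", " ").trim();
    }

    /** Joins the text-bearing blocks with blank lines, keeping Marker's reading order. */
    public static String orderedText(List<Block> children) {
        List<String> parts = new ArrayList<>();
        for (Block b : children) {
            if (TEXT_TYPES.contains(b.blockType())) {
                String text = stripHtml(b.html());
                if (!text.isEmpty()) {
                    parts.add(text);
                }
            }
        }
        return String.join("\n\n", parts);
    }

    /** Plain text of the first SectionHeader block, or null if the page has none. */
    public static String headingTitle(List<Block> children) {
        for (Block b : children) {
            if ("SectionHeader".equals(b.blockType())) {
                return stripHtml(b.html());
            }
        }
        return null;
    }

    /** One FigureData per Figure/Picture block with image data; others are skipped. */
    public static List<FigureData> extractFigures(List<Block> children) {
        List<FigureData> figures = new ArrayList<>();
        for (int i = 0; i < children.size(); i++) {
            Block b = children.get(i);
            boolean isFigure = "Figure".equals(b.blockType()) || "Picture".equals(b.blockType());
            String encoded = (b.images() == null) ? null : b.images().get(b.id());
            if (!isFigure || encoded == null || encoded.isEmpty()) {
                continue;
            }
            // The caption is taken only from the sibling directly after the figure.
            String caption = null;
            if (i + 1 < children.size() && "Caption".equals(children.get(i + 1).blockType())) {
                caption = stripHtml(children.get(i + 1).html());
            }
            byte[] bytes = Base64.getDecoder().decode(encoded);
            figures.add(new FigureData(bytes, caption, b.id()));
        }
        return figures;
    }
}
```

`Base64.getDecoder().decode` throws `IllegalArgumentException` on malformed input, so a real parser would probably want to catch that and skip the block rather than fail the whole page. Replacing tags with spaces also splits text such as `H<sub>2</sub>O` into separate tokens, which is one reason to prefer a proper HTML parser in production code.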