Add Marker to parse PDFs effectively
@@ -0,0 +1,84 @@
# Internal Contract: MarkerPageParser → FigureExtractionService / BookEmbeddingService

**Branch**: `002-image-aware-embedding` | **Date**: 2026-04-04
**Type**: Internal Java DTO (not an HTTP contract)

---

## Purpose

`PageResult` is the internal data transfer object produced by `MarkerPageParser` for each
PDF page. It decouples the Marker HTTP API from the rest of the pipeline. Downstream consumers
(`BookEmbeddingService`, `FigureExtractionService`, `TextChunkingService`) are unaware of
Marker and depend only on this DTO.

---

## Java Record

```java
package com.aiteacher.document;

import java.util.List;

/**
 * Internal DTO produced by MarkerPageParser for one PDF page.
 * Decouples the Marker HTTP API from downstream services.
 */
public record PageResult(
    int pageNumber,           // 1-based, derived from Marker page block index
    String orderedText,       // full page text in correct reading order (blocks joined by \n\n)
    String headingTitle,      // first SectionHeader block on page, or null
    List<FigureData> figures  // extracted figure images (may be empty)
) {

    /**
     * A figure extracted from the page.
     * Image bytes are PNG data decoded from the Marker JSON {@code images} map.
     */
    public record FigureData(
        byte[] imageBytes,      // PNG image data (base64-decoded from Marker response)
        String nearestCaption,  // text of the adjacent Caption block, or null
        String blockId          // Marker block ID (e.g. "/page/0/Figure/2") for traceability
    ) {}
}
```

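A short usage sketch may help consumers of this DTO. It repeats the record (without the package line) so that it compiles on its own; `PageResultUsage`, `samplePage`, and `sameFigure` are hypothetical names, and all sample values are made up. One detail worth keeping in mind: a record's generated `equals` compares `byte[]` components by reference, so comparing figures by content needs `Arrays.equals`.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Objects;

// Same shape as the record above, repeated so this sketch is self-contained.
record PageResult(int pageNumber, String orderedText, String headingTitle, List<FigureData> figures) {
    record FigureData(byte[] imageBytes, String nearestCaption, String blockId) {}
}

// Hypothetical helper class, for illustration only.
final class PageResultUsage {

    /** Builds a sample page with one captioned figure (all values are invented). */
    static PageResult samplePage() {
        byte[] fakePng = {(byte) 0x89, 'P', 'N', 'G'}; // placeholder bytes, not a real image
        var figure = new PageResult.FigureData(
                fakePng, "Figure 1: Cell structure", "/page/0/Figure/2");
        return new PageResult(
                1, "Cells\n\nCells are the basic unit of life.", "Cells", List.of(figure));
    }

    /** Content-based comparison; the record's own equals() compares byte[] by reference. */
    static boolean sameFigure(PageResult.FigureData a, PageResult.FigureData b) {
        return Arrays.equals(a.imageBytes(), b.imageBytes())
                && Objects.equals(a.nearestCaption(), b.nearestCaption())
                && Objects.equals(a.blockId(), b.blockId());
    }
}
```

Tests that assert on `FigureData` values would likely want a helper like `sameFigure` rather than `equals`.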

---

## Production Rules

| Field | Rule |
|-------|------|
| `pageNumber` | 1-based index derived from the Marker page block's position in the `children` array (index + 1). |
| `orderedText` | HTML-stripped text from all `Text`, `TextInlineMath`, `SectionHeader`, `ListItem`, and `Table` blocks, joined with `\n\n`. Marker already returns them in reading order. |
| `headingTitle` | Plain text of the first `SectionHeader` block on the page. `null` if no heading is detected. |
| `figures` | One `FigureData` per `Figure` or `Picture` block that has a non-empty `images` entry. Blocks with no image data are skipped. |
| `imageBytes` | Base64-decoded bytes from `block.images[blockId]`. Marker returns PNG. |
| `nearestCaption` | Plain text of the `Caption` block that is the sibling immediately following the figure block. `null` if the next sibling is not a `Caption`. |

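The `orderedText` and `headingTitle` rules can be sketched as follows. This is a minimal illustration, not the actual `MarkerPageParser` code: the `Block` record is a hypothetical, simplified view of a Marker child block, and `stripHtml` is a naive regex stripper that ignores HTML entities and does not separate table cells. A real parser would probably use an HTML library.

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

/** Hypothetical, simplified view of one Marker child block (not Marker's own schema). */
record Block(String id, String blockType, String html) {}

// Hypothetical helper class, for illustration only.
final class PageTextAssembler {

    // Block types that contribute to orderedText, per the rules table.
    private static final Set<String> TEXT_TYPES =
            Set.of("Text", "TextInlineMath", "SectionHeader", "ListItem", "Table");

    /** Naive tag removal; assumption: good enough for a sketch, not for production. */
    static String stripHtml(String html) {
        return html.replaceAll("<[^>]+>", "").replaceAll("\\s+", " ").trim();
    }

    /** Joins the stripped text of all text-bearing blocks, keeping Marker's order. */
    static String orderedText(List<Block> children) {
        return children.stream()
                .filter(b -> TEXT_TYPES.contains(b.blockType()))
                .map(b -> stripHtml(b.html()))
                .filter(s -> !s.isEmpty())
                .collect(Collectors.joining("\n\n"));
    }

    /** Text of the first SectionHeader block, or null if the page has none. */
    static String headingTitle(List<Block> children) {
        return children.stream()
                .filter(b -> "SectionHeader".equals(b.blockType()))
                .map(b -> stripHtml(b.html()))
                .findFirst()
                .orElse(null);
    }
}
```

Because the list is filtered rather than reordered, the sketch relies on Marker's reading order exactly as the `orderedText` rule describes.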

---

## Mapping from Marker JSON

```
Marker JSON                                      →  PageResult

Page block ("/page/N/Page/M")                    →  PageResult(pageNumber = N + 1)
  SectionHeader child                            →  headingTitle (first match, HTML-stripped)
  Text / TextInlineMath / SectionHeader /
    ListItem / Table children                    →  orderedText (HTML-stripped, joined \n\n)
  Figure / Picture child                         →  FigureData
    images[blockId]                              →  FigureData.imageBytes (base64-decoded)
    next Caption sibling                         →  FigureData.nearestCaption (HTML-stripped)
    blockId                                      →  FigureData.blockId
```

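The figure part of this mapping could look roughly like the sketch below. It assumes a hypothetical `MarkerBlock` record (ID, block type, HTML, and the block's base64 `images` map) rather than the real Marker response classes, and `FigureMapper` is an illustrative name. `FigureData` is declared top-level here only to keep the example self-contained.

```java
import java.util.ArrayList;
import java.util.Base64;
import java.util.List;
import java.util.Map;

/** Hypothetical, simplified Marker child block; not the actual response schema. */
record MarkerBlock(String id, String blockType, String html, Map<String, String> images) {}

record FigureData(byte[] imageBytes, String nearestCaption, String blockId) {}

// Hypothetical helper class, for illustration only.
final class FigureMapper {

    /** Extracts N from an ID of the form "/page/N/..." and returns N + 1. */
    static int pageNumberOf(String blockId) {
        String[] parts = blockId.split("/"); // ["", "page", "N", ...]
        return Integer.parseInt(parts[2]) + 1;
    }

    /** Maps Figure/Picture children that carry image data to FigureData. */
    static List<FigureData> figures(List<MarkerBlock> children) {
        List<FigureData> out = new ArrayList<>();
        for (int i = 0; i < children.size(); i++) {
            MarkerBlock b = children.get(i);
            boolean isFigure = "Figure".equals(b.blockType()) || "Picture".equals(b.blockType());
            String base64 = b.images() == null ? null : b.images().get(b.id());
            if (!isFigure || base64 == null || base64.isEmpty()) {
                continue; // not a figure, or no image data: skipped
            }
            // Caption only counts if it is the sibling immediately after the figure.
            String caption = null;
            if (i + 1 < children.size() && "Caption".equals(children.get(i + 1).blockType())) {
                caption = children.get(i + 1).html().replaceAll("<[^>]+>", "").trim();
            }
            out.add(new FigureData(Base64.getDecoder().decode(base64), caption, b.id()));
        }
        return out;
    }
}
```

`Base64.getDecoder()` is the standard (non-URL-safe) decoder; whether Marker's payload ever needs the MIME or URL-safe variant is not stated in this contract.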

---

## Consumers

| Consumer | What It Uses |
|----------|-------------|
| `BookEmbeddingService` | `orderedText` → `SectionEntity.fullText`; `headingTitle` → `SectionEntity.title` |
| `FigureExtractionService` | `figures` list → decodes `imageBytes`, checks min size, saves to S3 |
| `TextChunkingService` | Receives `SectionEntity` (uses `orderedText` indirectly); **unchanged** |

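As an illustration of the "checks min size" step, the sketch below decodes `imageBytes` with `javax.imageio.ImageIO` and rejects images under a pixel threshold. This rests on assumptions: the contract does not say whether "size" means pixel dimensions or byte length, the 50 px threshold is invented, and `FigureSizeFilter` is a hypothetical name. The S3 upload is left out.

```java
import java.awt.image.BufferedImage;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import javax.imageio.ImageIO;

// Hypothetical helper class, for illustration only.
final class FigureSizeFilter {

    // Assumed threshold; the contract only states that a minimum size is checked.
    static final int MIN_SIDE_PX = 50;

    /** True if the bytes decode as an image at least MIN_SIDE_PX wide and tall. */
    static boolean isLargeEnough(byte[] imageBytes) {
        try {
            BufferedImage img = ImageIO.read(new ByteArrayInputStream(imageBytes));
            // ImageIO.read returns null when no registered reader understands the data.
            return img != null
                    && img.getWidth() >= MIN_SIDE_PX
                    && img.getHeight() >= MIN_SIDE_PX;
        } catch (IOException e) {
            return false; // undecodable data is treated as "skip this figure"
        }
    }

    /** Demo helper: encodes a blank PNG of the given size for trying out the filter. */
    static byte[] blankPng(int width, int height) {
        try {
            var out = new ByteArrayOutputStream();
            ImageIO.write(new BufferedImage(width, height, BufferedImage.TYPE_INT_RGB), "png", out);
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

For example, `FigureSizeFilter.isLargeEnough(FigureSizeFilter.blankPng(100, 80))` passes the filter, while a 10×10 image or non-image bytes do not.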