Add Marker to parse PDFs effectively
@@ -0,0 +1,79 @@
# Internal Contract: DocumentAiPageParser → FigureExtractionService

**Branch**: `002-image-aware-embedding` | **Date**: 2026-04-04
**Type**: Internal Java DTO (not an HTTP contract)

---

## Purpose

`PageResult` is the internal data transfer object produced by `DocumentAiPageParser` for each
PDF page. It decouples the Google Document AI SDK types from the rest of the pipeline so that
`PdfStructureParser` can be replaced without cascading changes.

---

## Java Record

```java
package com.aiteacher.document;

import java.util.List;

/**
 * Internal DTO produced by DocumentAiPageParser for one PDF page.
 * Decouples the Document AI SDK types from downstream services.
 */
public record PageResult(
    int pageNumber,            // 1-based, matches Document.Page.getPageNumber()
    String orderedText,        // full page text in correct reading order (blocks joined by \n\n)
    String headingTitle,       // first HEADING block on page, or null
    List<FigureBbox> figures   // detected figure regions (may be empty)
) {

    /**
     * Normalized bounding box for a detected figure region.
     * Coordinates are in the [0.0, 1.0] range relative to page dimensions.
     */
    public record FigureBbox(
        float x,               // left edge (normalized)
        float y,               // top edge (normalized)
        float width,           // width (normalized)
        float height,          // height (normalized)
        String nearestCaption  // text of adjacent paragraph block, or null
    ) {}
}
```
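
As a usage illustration, the sketch below builds a `PageResult` and reads it back while handling the nullable `headingTitle` and a possibly empty `figures` list. It assumes Java 16+ (records). The records are compact copies of the contract above so the snippet compiles on its own; `PageResultUsageSketch`, `summarize`, the `(no heading)` placeholder, and the sample values are hypothetical and not part of the contract.

```java
import java.util.List;

// Compact copy of the contract above, repeated only to keep the sketch self-contained.
record PageResult(int pageNumber, String orderedText, String headingTitle,
                  List<FigureBbox> figures) {
    record FigureBbox(float x, float y, float width, float height, String nearestCaption) {}
}

// Hypothetical helper class; not part of the contract.
final class PageResultUsageSketch {

    // One-line description of a page that tolerates a null headingTitle.
    static String summarize(PageResult page) {
        String title = page.headingTitle() != null ? page.headingTitle() : "(no heading)";
        return "page " + page.pageNumber() + ": " + title
                + ", " + page.figures().size() + " figure(s)";
    }

    public static void main(String[] args) {
        // Sample values are invented for illustration.
        PageResult.FigureBbox figure =
                new PageResult.FigureBbox(0.1f, 0.2f, 0.5f, 0.3f, "Figure 1: Leaf cross-section");
        PageResult page = new PageResult(3, "Photosynthesis\n\nPlants convert light...",
                "Photosynthesis", List.of(figure));
        System.out.println(summarize(page)); // prints "page 3: Photosynthesis, 1 figure(s)"
    }
}
```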

---

## Production Rules

| Field | Rule |
|-------|------|
| `orderedText` | Concatenation of all `PARAGRAPH` and `HEADING_*` blocks, joined with `\n\n`. Tables are represented as tab-separated text. |
| `headingTitle` | First block whose `blockType` is `HEADING_1` through `HEADING_6`. `null` if no heading detected. |
| `figures` | One entry per `VisualElement` with `type == "figure"` and `confidence ≥ 0.5`. Sorted top-to-bottom by `y`. |
| `nearestCaption` | Text of the `PARAGRAPH` block immediately following the figure bbox (by Y coordinate). `null` if no paragraph follows within 10% of the page height. |
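
The `figures` and `nearestCaption` rules above could be sketched roughly as follows. This is a minimal illustration, not the parser's actual code: `DetectedElement` and `ParagraphBlock` are hypothetical stand-ins for the Document AI SDK types, and the caption rule is interpreted as "the closest paragraph whose top edge lies at or below the figure's bottom edge, with a gap of at most 0.10 in normalized page height". That interpretation is an assumption.

```java
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-ins for the SDK's visual element and paragraph block types.
record DetectedElement(String type, float confidence, float x, float y, float width, float height) {}
record ParagraphBlock(String text, float top) {}

// Same shape as PageResult.FigureBbox in the contract.
record FigureBbox(float x, float y, float width, float height, String nearestCaption) {}

final class FigureRulesSketch {
    static final float MIN_CONFIDENCE = 0.5f;    // threshold from the rules table
    static final float MAX_CAPTION_GAP = 0.10f;  // 10% of page height, normalized

    // Keeps confident "figure" elements, sorts them top-to-bottom, attaches captions.
    static List<FigureBbox> selectFigures(List<DetectedElement> elements,
                                          List<ParagraphBlock> paragraphs) {
        return elements.stream()
                .filter(e -> "figure".equals(e.type()) && e.confidence() >= MIN_CONFIDENCE)
                .sorted(Comparator.comparingDouble(DetectedElement::y))
                .map(e -> new FigureBbox(e.x(), e.y(), e.width(), e.height(),
                        nearestCaption(e, paragraphs)))
                .toList();
    }

    // Returns the text of the first paragraph below the figure, or null if none is close enough.
    static String nearestCaption(DetectedElement figure, List<ParagraphBlock> paragraphs) {
        float bottom = figure.y() + figure.height();
        return paragraphs.stream()
                .filter(p -> p.top() >= bottom && p.top() - bottom <= MAX_CAPTION_GAP)
                .min(Comparator.comparingDouble(ParagraphBlock::top))
                .map(ParagraphBlock::text)
                .orElse(null);
    }
}
```

Values that sit exactly on the 10% boundary are sensitive to `float` rounding, so a real implementation may want a small tolerance there.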

---

## Mapping from Document AI Proto

```
Document.Page.Block                  → orderedText (concatenated)
Document.Page.Block (HEADING_*)      → headingTitle (first match)
Document.Page.VisualElement          → FigureBbox
  └─ layout.bounding_poly.normalized_vertices[0] → (x, y) top-left
  └─                      normalized_vertices[2] → (x+w, y+h) bottom-right
```
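
The vertex-to-bbox arithmetic in the mapping above can be illustrated with a short sketch. It assumes the polygon's vertices are ordered top-left, top-right, bottom-right, bottom-left, as the mapping implies. `NormalizedVertex` here is a hypothetical stand-in for the SDK type, and clamping to [0.0, 1.0] is an added safety step that the contract does not specify.

```java
import java.util.List;

// Hypothetical stand-in for the SDK's normalized vertex type.
record NormalizedVertex(float x, float y) {}

// Same shape as PageResult.FigureBbox in the contract.
record FigureBbox(float x, float y, float width, float height, String nearestCaption) {}

final class BboxMappingSketch {

    // Derives (x, y, width, height) from vertices[0] (top-left) and vertices[2] (bottom-right).
    static FigureBbox toBbox(List<NormalizedVertex> vertices, String nearestCaption) {
        if (vertices.size() < 3) {
            throw new IllegalArgumentException("need at least 3 vertices, got " + vertices.size());
        }
        float left = clamp01(vertices.get(0).x());
        float top = clamp01(vertices.get(0).y());
        float right = clamp01(vertices.get(2).x());
        float bottom = clamp01(vertices.get(2).y());
        // Width and height are differences because vertices[2] encodes (x+w, y+h).
        return new FigureBbox(left, top, right - left, bottom - top, nearestCaption);
    }

    // Assumption: out-of-range coordinates are clamped rather than rejected.
    static float clamp01(float v) {
        return Math.max(0.0f, Math.min(1.0f, v));
    }
}
```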

---

## Consumers

| Consumer | What It Uses |
|----------|-------------|
| `BookEmbeddingService` | `orderedText` → `SectionEntity.fullText`; `headingTitle` → `SectionEntity.title` |
| `FigureExtractionService` | `figures` list → renders page via PDFBox, crops each bbox to `BufferedImage` |
| `TextChunkingService` | Receives `SectionEntity` (indirectly uses `orderedText`); **unchanged** |
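
The cropping step in `FigureExtractionService` might look roughly like the sketch below. It covers only the conversion from a normalized bbox to a pixel rectangle, and assumes the page has already been rendered to a `BufferedImage` (the PDFBox rendering itself is omitted). `FigureCropSketch` and its rounding and clamping choices are assumptions, not the service's actual implementation.

```java
import java.awt.image.BufferedImage;

// Same shape as PageResult.FigureBbox in the contract.
record FigureBbox(float x, float y, float width, float height, String nearestCaption) {}

final class FigureCropSketch {

    // Converts the normalized bbox to pixels and returns that region of the page image.
    static BufferedImage crop(BufferedImage page, FigureBbox box) {
        int pageW = page.getWidth();
        int pageH = page.getHeight();
        // Clamp so the rectangle stays inside the image and is at least 1x1 pixel.
        int left = clamp(Math.round(box.x() * pageW), 0, pageW - 1);
        int top = clamp(Math.round(box.y() * pageH), 0, pageH - 1);
        int right = clamp(Math.round((box.x() + box.width()) * pageW), left + 1, pageW);
        int bottom = clamp(Math.round((box.y() + box.height()) * pageH), top + 1, pageH);
        return page.getSubimage(left, top, right - left, bottom - top);
    }

    static int clamp(int value, int lo, int hi) {
        return Math.max(lo, Math.min(hi, value));
    }

    public static void main(String[] args) {
        // A blank 1000x2000 image stands in for a rendered page.
        BufferedImage page = new BufferedImage(1000, 2000, BufferedImage.TYPE_INT_RGB);
        BufferedImage figure = crop(page, new FigureBbox(0.1f, 0.25f, 0.5f, 0.25f, null));
        System.out.println(figure.getWidth() + "x" + figure.getHeight());
    }
}
```

Note that `getSubimage` returns a view that shares pixel data with the page image, so a consumer that keeps crops after discarding the page may prefer to copy them into a new `BufferedImage`.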