Adding Marker to parse PDFs effectively
@@ -52,6 +52,76 @@ graph TD
end
```

## Marker API Response Structure

The PDF parsing pipeline calls a local [Marker](https://github.com/VikParuchuri/marker) server (`POST /marker/upload`).

### Top-level envelope

```json
{
  "format": "json",
  "output": "<JSON-encoded string>"
}
```

`output` is a **JSON-encoded string** (not a nested object) and must be parsed a second time to get the document tree.
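
The two-pass decoding can be sketched roughly as follows. This is an illustration in Python, not the backend's actual Java code; the function name `parse_marker_response`, the sample body, and the sample block ID are assumptions made up for the example.

```python
import json


def parse_marker_response(body: str) -> dict:
    """Decode the envelope, then decode the JSON string held in `output`."""
    envelope = json.loads(body)  # first pass: the envelope itself
    if envelope.get("format") != "json":
        raise ValueError(f"unexpected format: {envelope.get('format')!r}")
    output = envelope["output"]
    if not isinstance(output, str):
        raise TypeError("expected `output` to be a JSON-encoded string")
    return json.loads(output)  # second pass: the document tree


# Hypothetical response body, built here only to exercise the function.
sample_body = json.dumps({
    "format": "json",
    "output": json.dumps(
        {"children": [{"id": "/page/0/Page/0", "block_type": "Page"}]}
    ),
})

tree = parse_marker_response(sample_body)
print(tree["children"][0]["block_type"])  # Page
```

Checking the type of `output` before the second pass makes a response that already carries a nested object fail loudly instead of being misread.
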

### Parsed `output` shape

```
{
  "children": [ <Page block>, ... ]
}
```

### Block types

Every block shares these fields:

| Field               | Type            | Notes                                      |
|---------------------|-----------------|--------------------------------------------|
| `id`                | string          | e.g. `/page/0/Picture/2`                   |
| `block_type`        | string          | see table below                            |
| `html`              | string          | rendered HTML; may contain `<content-ref>` |
| `bbox`              | `[x0,y0,x1,y1]` | bounding box in page coordinates           |
| `children`          | array or null   | nested blocks                              |
| `images`            | object or null  | base64 PNG map (leaf image blocks only)    |
| `section_hierarchy` | object          | heading ancestry                           |
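
One way to model these shared fields is a small recursive data class. The sketch below is illustrative Python rather than the backend's Java types; the class name `Block`, the choice to normalize `null` to empty containers, and the sample values are all assumptions.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Block:
    id: str
    block_type: str
    html: str
    bbox: list[float]
    # `children` and `images` may be null in the response; store them as empty.
    children: list["Block"] = field(default_factory=list)
    images: dict[str, str] = field(default_factory=dict)
    section_hierarchy: dict[str, Any] = field(default_factory=dict)

    @classmethod
    def from_dict(cls, raw: dict[str, Any]) -> "Block":
        """Build a Block (and its nested children) from one decoded JSON object."""
        return cls(
            id=raw["id"],
            block_type=raw["block_type"],
            html=raw.get("html") or "",
            bbox=list(raw.get("bbox") or []),
            children=[cls.from_dict(c) for c in raw.get("children") or []],
            images=dict(raw.get("images") or {}),
            section_hierarchy=dict(raw.get("section_hierarchy") or {}),
        )


# Hypothetical block, shaped after the field table above.
raw_group = {
    "id": "/page/0/PictureGroup/1",
    "block_type": "PictureGroup",
    "html": "",
    "bbox": [0, 0, 100, 100],
    "images": None,
    "section_hierarchy": {},
    "children": [{
        "id": "/page/0/Picture/2",
        "block_type": "Picture",
        "html": "",
        "bbox": [0, 0, 100, 80],
        "children": None,
        "images": {"/page/0/Picture/2": "<base64 PNG>"},
        "section_hierarchy": {},
    }],
}

group = Block.from_dict(raw_group)
print(group.children[0].block_type)  # Picture
```
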

#### Known `block_type` values

| block_type       | Category  | Notes                                                  |
|------------------|-----------|--------------------------------------------------------|
| `Page`           | structure | Top-level; direct children are the page content        |
| `SectionHeader`  | text      | Section / chapter heading                              |
| `Text`           | text      |                                                        |
| `TextInlineMath` | text      |                                                        |
| `ListItem`       | text      |                                                        |
| `Table`          | text      |                                                        |
| `Code`           | text      |                                                        |
| `Equation`       | text      |                                                        |
| `Footnote`       | text      |                                                        |
| `Caption`        | text      | Usually a child of a `*Group` block                    |
| `PageHeader`     | text      |                                                        |
| `PageFooter`     | text      |                                                        |
| `Handwriting`    | text      |                                                        |
| `Picture`        | image     | Leaf block; `images` map holds base64 PNG keyed by ID  |
| `Figure`         | image     | Leaf block; same as `Picture`                          |
| `PictureGroup`   | container | Wraps one `Picture` + one `Caption` child              |
| `FigureGroup`    | container | Wraps one `Figure` + one `Caption` child               |
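
The categories in this table can be applied while walking the parsed tree. The following Python sketch is illustrative only: the helper names `category_of` and `iter_blocks`, the `"unknown"` fallback, and the sample tree are assumptions, and the type sets simply mirror the table rows.

```python
from typing import Any, Iterator

TEXT_TYPES = {
    "SectionHeader", "Text", "TextInlineMath", "ListItem", "Table", "Code",
    "Equation", "Footnote", "Caption", "PageHeader", "PageFooter", "Handwriting",
}
IMAGE_TYPES = {"Picture", "Figure"}
CONTAINER_TYPES = {"PictureGroup", "FigureGroup"}


def category_of(block_type: str) -> str:
    """Map a block_type to its category; unlisted types fall back to 'unknown'."""
    if block_type == "Page":
        return "structure"
    if block_type in TEXT_TYPES:
        return "text"
    if block_type in IMAGE_TYPES:
        return "image"
    if block_type in CONTAINER_TYPES:
        return "container"
    return "unknown"


def iter_blocks(node: dict[str, Any]) -> Iterator[dict[str, Any]]:
    """Yield every descendant of `node` depth-first, in document order."""
    for child in node.get("children") or []:  # `children` may be null
        yield child
        yield from iter_blocks(child)


# Hypothetical parsed `output`, reduced to the fields this walk needs.
sample_tree = {"children": [{
    "block_type": "Page",
    "children": [
        {"block_type": "SectionHeader", "children": None},
        {"block_type": "PictureGroup", "children": [
            {"block_type": "Picture", "children": None},
            {"block_type": "Caption", "children": None},
        ]},
    ],
}]}

categories = [category_of(b["block_type"]) for b in iter_blocks(sample_tree)]
print(categories)  # ['structure', 'text', 'container', 'image', 'text']
```

Returning `"unknown"` rather than raising keeps the walk usable if the server emits a block type that is not listed here.
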

### Image extraction

Images are only present on **leaf** image blocks (`Picture`, `Figure`).
Group blocks (`PictureGroup`, `FigureGroup`) have `images: null`; the base64 PNG lives on the child leaf block.

```
PictureGroup
├── Picture   ← images: { "/page/0/Picture/2": "<base64 PNG>" }
└── Caption   ← html: "<p>Figure 1 — ...</p>"
```
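
A minimal Python sketch of that extraction rule, assuming the tree has already been decoded into plain dictionaries. The function name `collect_images` and the sample payload (just the 8-byte PNG signature, standing in for a real image) are made up for illustration.

```python
import base64
from typing import Any

LEAF_IMAGE_TYPES = {"Picture", "Figure"}


def collect_images(node: dict[str, Any]) -> dict[str, bytes]:
    """Return decoded image bytes keyed by image ID, taken from leaf image blocks."""
    found: dict[str, bytes] = {}
    if node.get("block_type") in LEAF_IMAGE_TYPES:
        # `images` is null on non-leaf blocks, so guard with `or {}`.
        for image_id, encoded in (node.get("images") or {}).items():
            found[image_id] = base64.b64decode(encoded)
    for child in node.get("children") or []:
        found.update(collect_images(child))
    return found


# Stand-in payload: only the PNG file signature, not a complete image.
png_signature = b"\x89PNG\r\n\x1a\n"
sample_group = {
    "block_type": "PictureGroup",
    "images": None,
    "children": [
        {
            "block_type": "Picture",
            "images": {
                "/page/0/Picture/2": base64.b64encode(png_signature).decode("ascii")
            },
            "children": None,
        },
        {"block_type": "Caption", "images": None, "children": None},
    ],
}

extracted = collect_images(sample_group)
print(sorted(extracted))  # ['/page/0/Picture/2']
```
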

## Stack

- **Backend**: Spring Boot 4.0.5 + Spring AI 2.0.0-M4, Java 21, Maven