Add Marker for effective PDF parsing

2026-04-04 21:30:18 +02:00
parent b154e29f2d
commit ea1276dc2e
25 changed files with 2318 additions and 285 deletions
@@ -52,6 +52,76 @@ graph TD
    end
 ```

+## Marker API Response Structure
+
+The PDF parsing pipeline calls a local [Marker](https://github.com/VikParuchuri/marker) server (`POST /marker/upload`).
+
+### Top-level envelope
+
+```json
+{
+  "format": "json",
+  "output": "<JSON-encoded string>"
+}
+```
+
+`output` is a **JSON-encoded string** (not a nested object) and must be parsed a second time to get the document tree.
+
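The two-step decode described above can be sketched as follows. This is a minimal illustration in Python rather than the project's own backend code; the helper name `parse_marker_response` and the sample payload are hypothetical, and only the `format` and `output` fields come from the envelope shown above.

```python
import json


def parse_marker_response(body: str) -> dict:
    """Decode the envelope, then decode the JSON string held in `output`."""
    envelope = json.loads(body)  # first pass: the envelope itself
    if envelope.get("format") != "json":
        raise ValueError(f"unexpected format: {envelope.get('format')!r}")
    output = envelope["output"]
    if not isinstance(output, str):
        raise TypeError("expected `output` to be a JSON-encoded string")
    return json.loads(output)  # second pass: the document tree


# Illustrative payload (made up for this sketch, not real Marker output).
inner = {"children": [{"id": "/page/0/Page/0", "block_type": "Page", "children": []}]}
body = json.dumps({"format": "json", "output": json.dumps(inner)})

document = parse_marker_response(body)
print(document["children"][0]["block_type"])
```

Skipping the second `json.loads` leaves `output` as a plain string, so any attempt to index it like an object fails.
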
+### Parsed `output` shape
+
+```
+{
+  "children": [ <Page block>, ... ]
+}
+```
+
+### Block types
+
+Every block shares these fields:
+
+| Field               | Type              | Notes                                      |
+|---------------------|-------------------|--------------------------------------------|
+| `id`                | string            | e.g. `/page/0/Picture/2`                   |
+| `block_type`        | string            | see table below                            |
+| `html`              | string            | rendered HTML; may contain `<content-ref>` |
+| `bbox`              | `[x0,y0,x1,y1]`   | bounding box in page coordinates           |
+| `children`          | array or null     | nested blocks                              |
+| `images`            | object or null    | base64 PNG map (leaf image blocks only)    |
+| `section_hierarchy` | object            | heading ancestry                           |
+
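One way to model the shared fields is a small typed record. The sketch below is illustrative Python, not the project's implementation: the `Block` class and its `from_dict` helper are hypothetical names, and normalising `null` to an empty list or dict is a design choice made here for convenience, not something the Marker response does.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Block:
    id: str
    block_type: str
    html: str
    bbox: list  # [x0, y0, x1, y1] in page coordinates
    children: list["Block"] = field(default_factory=list)
    images: dict[str, str] = field(default_factory=dict)
    section_hierarchy: dict[str, Any] = field(default_factory=dict)

    @classmethod
    def from_dict(cls, raw: dict[str, Any]) -> "Block":
        # `children` and `images` may be null; treat null as empty.
        return cls(
            id=raw["id"],
            block_type=raw["block_type"],
            html=raw.get("html") or "",
            bbox=list(raw.get("bbox") or []),
            children=[cls.from_dict(c) for c in raw.get("children") or []],
            images=dict(raw.get("images") or {}),
            section_hierarchy=dict(raw.get("section_hierarchy") or {}),
        )


# Illustrative block (values are made up for this sketch).
raw = {
    "id": "/page/0/Picture/2",
    "block_type": "Picture",
    "html": "",
    "bbox": [0, 0, 10, 10],
    "children": None,
    "images": {"/page/0/Picture/2": "<base64 PNG>"},
    "section_hierarchy": {},
}
block = Block.from_dict(raw)
print(block.block_type, list(block.images))
```
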
+#### Known `block_type` values
+
+| block_type       | Category  | Notes                                                 |
+|------------------|-----------|-------------------------------------------------------|
+| `Page`           | structure | Top-level; direct children are the page content       |
+| `SectionHeader`  | text      | Section / chapter heading                             |
+| `Text`           | text      |                                                       |
+| `TextInlineMath` | text      |                                                       |
+| `ListItem`       | text      |                                                       |
+| `Table`          | text      |                                                       |
+| `Code`           | text      |                                                       |
+| `Equation`       | text      |                                                       |
+| `Footnote`       | text      |                                                       |
+| `Caption`        | text      | Usually a child of a `*Group` block                   |
+| `PageHeader`     | text      |                                                       |
+| `PageFooter`     | text      |                                                       |
+| `Handwriting`    | text      |                                                       |
+| `Picture`        | image     | Leaf block; `images` map holds base64 PNG keyed by ID |
+| `Figure`         | image     | Leaf block; same as `Picture`                         |
+| `PictureGroup`   | container | Wraps one `Picture` + one `Caption` child             |
+| `FigureGroup`    | container | Wraps one `Figure` + one `Caption` child              |
+
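The categories in the table above lend themselves to a lookup while walking the tree. The following is a hedged Python sketch: `CATEGORY_BY_TYPE`, `iter_blocks`, and `count_categories` are hypothetical names, the sample document is made up, and the `"unknown"` fallback is an assumption for block types that are not listed in the table.

```python
from collections import Counter
from typing import Iterator

# Mirrors the table of known block_type values.
CATEGORY_BY_TYPE = {
    "Page": "structure",
    **{t: "text" for t in (
        "SectionHeader", "Text", "TextInlineMath", "ListItem", "Table", "Code",
        "Equation", "Footnote", "Caption", "PageHeader", "PageFooter", "Handwriting",
    )},
    "Picture": "image",
    "Figure": "image",
    "PictureGroup": "container",
    "FigureGroup": "container",
}


def iter_blocks(block: dict) -> Iterator[dict]:
    """Yield a block and all of its descendants, depth first."""
    yield block
    for child in block.get("children") or []:  # `children` may be null
        yield from iter_blocks(child)


def count_categories(document: dict) -> Counter:
    counts: Counter = Counter()
    for page in document.get("children") or []:
        for block in iter_blocks(page):
            counts[CATEGORY_BY_TYPE.get(block["block_type"], "unknown")] += 1
    return counts


# Illustrative document tree (not real Marker output).
document = {"children": [{"block_type": "Page", "children": [
    {"block_type": "SectionHeader", "children": None},
    {"block_type": "Text", "children": None},
    {"block_type": "PictureGroup", "children": [
        {"block_type": "Picture", "children": None},
        {"block_type": "Caption", "children": None},
    ]},
]}]}
counts = count_categories(document)
print(dict(counts))
```
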
+### Image extraction
+
+Images are only present on **leaf** image blocks (`Picture`, `Figure`).
+Group blocks (`PictureGroup`, `FigureGroup`) have `images: null`; the base64 PNG lives on the child leaf block.
+
+```
+PictureGroup
+├── Picture   ← images: { "/page/0/Picture/2": "<base64 PNG>" }
+└── Caption   ← html: "<p>Figure 1 — ...</p>"
+```
+
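A recursive walk that reads `images` only from leaf blocks, and borrows the caption from the enclosing group, could look like the sketch below. It is illustrative Python under stated assumptions: `iter_images` is a hypothetical helper, the sample tree and its eight-byte payload are made up, and pairing a leaf with its sibling `Caption` is one possible convention rather than documented Marker behaviour.

```python
import base64
from typing import Iterator, Optional

IMAGE_LEAVES = {"Picture", "Figure"}
IMAGE_GROUPS = {"PictureGroup", "FigureGroup"}


def iter_images(
    block: dict, caption_html: Optional[str] = None
) -> Iterator[tuple[str, bytes, Optional[str]]]:
    """Yield (image id, decoded bytes, caption HTML or None) for each leaf image."""
    children = block.get("children") or []
    if block.get("block_type") in IMAGE_GROUPS:
        # The caption is a sibling of the leaf image inside the group.
        caption_html = next(
            (c.get("html") for c in children if c.get("block_type") == "Caption"),
            None,
        )
    if block.get("block_type") in IMAGE_LEAVES:
        for image_id, encoded in (block.get("images") or {}).items():
            yield image_id, base64.b64decode(encoded), caption_html
    for child in children:
        yield from iter_images(child, caption_html)


# Illustrative tree; the payload is only the PNG signature, not a real image.
payload = b"\x89PNG\r\n\x1a\n"
group = {
    "block_type": "PictureGroup",
    "images": None,
    "children": [
        {
            "block_type": "Picture",
            "children": None,
            "images": {"/page/0/Picture/2": base64.b64encode(payload).decode("ascii")},
        },
        {"block_type": "Caption", "children": None, "images": None,
         "html": "<p>Figure 1</p>"},
    ],
}
results = list(iter_images(group))
print([(image_id, len(data), caption) for image_id, data, caption in results])
```
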
 ## Stack

 - **Backend**: Spring Boot 4.0.5 + Spring AI 2.0.0-M4, Java 21, Maven