add possibility to disable delete and upload of books

enhance page parsing using json output and html
adding Marker to parse PDFs effectively
2026-04-06 14:09:17 +02:00 · 2026-04-05 21:55:30 +02:00 · 2026-04-04 21:30:18 +02:00 · 2026-04-04 13:26:55 +02:00
35 changed files with 2591 additions and 381 deletions
@@ -1,10 +1,13 @@
 # ai-teacher Development Guidelines

-Auto-generated from all feature plans. Last updated: 2026-04-03
+Auto-generated from all feature plans. Last updated: 2026-04-04

 ## Active Technologies
 - Java 25 (backend), TypeScript / Node 20 (frontend) + Spring Boot 4.0.5, Spring AI 2.0.0-M4, OpenAI API (embeddings + chat), PDFBox (via Spring AI PDF reader dependency) (002-image-aware-embedding)
 - PostgreSQL (JPA + Flyway), pgvector (Spring AI `VectorStore`), local file system (extracted images — `/uploads/figures/`) (002-image-aware-embedding)
+- Java 25 (backend), TypeScript / Node 20 (frontend) + Spring Boot 4.0.5, Spring AI 2.0.0-M4, OpenAI API, PDFBox (rendering only), `com.google.cloud:google-cloud-documentai` (~2.40.x) (002-image-aware-embedding)
+- PostgreSQL (JPA + Flyway), pgvector (Spring AI VectorStore), S3 / local filesystem (figure images) (002-image-aware-embedding)
+- PostgreSQL (JPA + Flyway), pgvector (Spring AI `VectorStore`), S3-compatible (002-image-aware-embedding)

 - Java 21 (backend), TypeScript / Node 20 (frontend) (001-neuro-rag-learning)

@@ -24,9 +27,10 @@ npm test && npm run lint
 Java 21 (backend), TypeScript / Node 20 (frontend): Follow standard conventions

 ## Recent Changes
+- 002-image-aware-embedding: Added Java 25 (backend), TypeScript / Node 20 (frontend) + Spring Boot 4.0.5, Spring AI 2.0.0-M4, OpenAI API (embeddings +
+- 002-image-aware-embedding: Added Java 25 (backend), TypeScript / Node 20 (frontend) + Spring Boot 4.0.5, Spring AI 2.0.0-M4, OpenAI API, PDFBox (rendering only), `com.google.cloud:google-cloud-documentai` (~2.40.x)
 - 002-image-aware-embedding: Added Java 25 (backend), TypeScript / Node 20 (frontend) + Spring Boot 4.0.5, Spring AI 2.0.0-M4, OpenAI API (embeddings + chat), PDFBox (via Spring AI PDF reader dependency)

- 001-neuro-rag-learning: Added Java 21 (backend), TypeScript / Node 20 (frontend)

 <!-- MANUAL ADDITIONS START -->
 <!-- MANUAL ADDITIONS END -->
@@ -0,0 +1,566 @@
+# Marker
+
+Marker converts documents to markdown, JSON, chunks, and HTML quickly and accurately.
+
+- Converts PDF, image, PPTX, DOCX, XLSX, HTML, EPUB files in all languages
+- Formats tables, forms, equations, inline math, links, references, and code blocks
+- Extracts and saves images
+- Removes headers/footers/other artifacts
+- Extensible with your own formatting and logic
+- Does structured extraction, given a JSON schema (beta)
+- Optionally boost accuracy with LLMs (and your own prompt)
+- Works on GPU, CPU, or MPS
+
+For our managed API or on-prem document intelligence solution, check out [our platform here](https://datalab.to?utm_source=gh-marker).
+
+## Performance
+
+<img src="data/images/overall.png" width="800px"/>
+
+Marker benchmarks favorably compared to cloud services like Llamaparse and Mathpix, as well as other open source tools.
+
+The above results are running single PDF pages serially.  Marker is significantly faster when running in batch mode, with a projected throughput of 25 pages/second on an H100.
+
+See [below](#benchmarks) for detailed speed and accuracy benchmarks, and instructions on how to run your own benchmarks.
+
+## Hybrid Mode
+
+For the highest accuracy, pass the `--use_llm` flag to use an LLM alongside marker.  This will do things like merge tables across pages, handle inline math, format tables properly, and extract values from forms.  It can use any gemini or ollama model.  By default, it uses `gemini-2.0-flash`.  See [below](#llm-services) for details.
+
+Here is a table benchmark comparing marker, gemini flash alone, and marker with use_llm:
+
+<img src="data/images/table.png" width="400px"/>
+
+As you can see, the use_llm mode offers higher accuracy than marker or gemini alone.
+
+## Examples
+
+| PDF | File type | Markdown                                                                                                                     | JSON                                                                                                   |
+|-----|-----------|------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------|
+| [Think Python](https://greenteapress.com/thinkpython/thinkpython.pdf) | Textbook | [View](https://github.com/VikParuchuri/marker/blob/master/data/examples/markdown/thinkpython/thinkpython.md)                 | [View](https://github.com/VikParuchuri/marker/blob/master/data/examples/json/thinkpython.json)         |
+| [Switch Transformers](https://arxiv.org/pdf/2101.03961.pdf) | arXiv paper | [View](https://github.com/VikParuchuri/marker/blob/master/data/examples/markdown/switch_transformers/switch_trans.md) | [View](https://github.com/VikParuchuri/marker/blob/master/data/examples/json/switch_trans.json) |
+| [Multi-column CNN](https://arxiv.org/pdf/1804.07821.pdf) | arXiv paper | [View](https://github.com/VikParuchuri/marker/blob/master/data/examples/markdown/multicolcnn/multicolcnn.md)                 | [View](https://github.com/VikParuchuri/marker/blob/master/data/examples/json/multicolcnn.json)         |
+
+# Commercial usage
+
+Our model weights use a modified AI Pubs Open Rail-M license (free for research, personal use, and startups under $2M funding/revenue) and our code is GPL. For broader commercial licensing or to remove GPL requirements, visit our pricing page [here](https://www.datalab.to/pricing?utm_source=gh-marker).
+
+# Hosted API & On-prem
+
+There's a [hosted API](https://www.datalab.to?utm_source=gh-marker) and [painless on-prem solution](https://www.datalab.to/blog/self-serve-on-prem-licensing) for marker - it's free to sign up, and we'll throw in credits for you to test it out.
+
+The API:
+- Supports PDF, image, PPT, PPTX, DOC, DOCX, XLS, XLSX, HTML, EPUB files
+- Is 1/4th the price of leading cloud-based competitors
+- Fast - ~15s for a 250 page PDF
+- Supports LLM mode
+- High uptime (99.99%)
+
+# Community
+
+[Discord](https://discord.gg//KuZwXNGnfH) is where we discuss future development.
+
+# Installation
+
+You'll need python 3.10+ and [PyTorch](https://pytorch.org/get-started/locally/).
+
+Install with:
+
+```shell
+pip install marker-pdf
+```
+
+If you want to use marker on documents other than PDFs, you will need to install additional dependencies with:
+
+```shell
+pip install marker-pdf[full]
+```
+
+# Usage
+
+First, some configuration:
+
+- Your torch device will be automatically detected, but you can override this.  For example, `TORCH_DEVICE=cuda`.
+- Some PDFs, even digital ones, have bad text in them.  Set `--force_ocr` to force OCR on all lines, or `--strip_existing_ocr` to keep all digital text and strip out any existing OCR text.
+- If you care about inline math, set `force_ocr` to convert inline math to LaTeX.
+
+## Interactive App
+
+I've included a streamlit app that lets you interactively try marker with some basic options.  Run it with:
+
+```shell
+pip install streamlit streamlit-ace
+marker_gui
+```
+
+## Convert a single file
+
+```shell
+marker_single /path/to/file.pdf
+```
+
+You can pass in PDFs or images.
+
+Options:
+- `--page_range TEXT`: Specify which pages to process. Accepts comma-separated page numbers and ranges. Example: `--page_range "0,5-10,20"` will process pages 0, 5 through 10, and page 20.
+- `--output_format [markdown|json|html|chunks]`: Specify the format for the output results.
+- `--output_dir PATH`: Directory where output files will be saved. Defaults to the value specified in settings.OUTPUT_DIR.
+- `--paginate_output`: Paginates the output, using `\n\n{PAGE_NUMBER}` followed by `-` * 48, then `\n\n`
+- `--use_llm`: Uses an LLM to improve accuracy.  You will need to configure the LLM backend - see [below](#llm-services).
+- `--force_ocr`: Force OCR processing on the entire document, even for pages that might contain extractable text.  This will also format inline math properly.
+- `--block_correction_prompt`: if LLM mode is active, an optional prompt that will be used to correct the output of marker.  This is useful for custom formatting or logic that you want to apply to the output.
+- `--strip_existing_ocr`: Remove all existing OCR text in the document and re-OCR with surya.
+- `--redo_inline_math`: If you want the absolute highest quality inline math conversion, use this along with `--use_llm`.
+- `--disable_image_extraction`: Don't extract images from the PDF.  If you also specify `--use_llm`, then images will be replaced with a description.
+- `--debug`: Enable debug mode for additional logging and diagnostic information.
+- `--processors TEXT`: Override the default processors by providing their full module paths, separated by commas. Example: `--processors "module1.processor1,module2.processor2"`
+- `--config_json PATH`: Path to a JSON configuration file containing additional settings.
+- `config --help`: List all available builders, processors, and converters, and their associated configuration.  These values can be used to build a JSON configuration file for additional tweaking of marker defaults.
+- `--converter_cls`: One of `marker.converters.pdf.PdfConverter` (default) or `marker.converters.table.TableConverter`.  The `PdfConverter` will convert the whole PDF, the `TableConverter` will only extract and convert tables.
+- `--llm_service`: Which llm service to use if `--use_llm` is passed.  This defaults to `marker.services.gemini.GoogleGeminiService`.
+- `--help`: see all of the flags that can be passed into marker.  (it supports many more options than are listed above)
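The `--page_range` syntax described in the options above can be illustrated with a small stand-alone parser. This is only a sketch of the documented format (comma-separated page numbers and inclusive ranges), not Marker's own parsing code; `parse_page_range` is a hypothetical name.

```python
def parse_page_range(spec: str) -> list[int]:
    """Expand a spec such as "0,5-10,20" into a sorted list of page numbers."""
    pages: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            start, end = (int(p) for p in part.split("-", 1))
            pages.update(range(start, end + 1))  # ranges are inclusive
        else:
            pages.add(int(part))
    return sorted(pages)


print(parse_page_range("0,5-10,20"))  # [0, 5, 6, 7, 8, 9, 10, 20]
```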
+
+The list of supported languages for surya OCR is [here](https://github.com/VikParuchuri/surya/blob/master/surya/recognition/languages.py).  If you don't need OCR, marker can work with any language.
+
+## Convert multiple files
+
+```shell
+marker /path/to/input/folder
+```
+
+- `marker` supports all the same options from `marker_single` above.
+- `--workers` is the number of conversion workers to run simultaneously.  This is automatically set by default, but you can increase it to increase throughput, at the cost of more CPU/GPU usage.  Marker will use 5GB of VRAM per worker at the peak, and 3.5GB average.
+
+## Convert multiple files on multiple GPUs
+
+```shell
+NUM_DEVICES=4 NUM_WORKERS=15 marker_chunk_convert ../pdf_in ../md_out
+```
+
+- `NUM_DEVICES` is the number of GPUs to use.  Should be `2` or greater.
+- `NUM_WORKERS` is the number of parallel processes to run on each GPU.
+
+## Use from python
+
+See the `PdfConverter` class at `marker/converters/pdf.py` for additional arguments that can be passed.
+
+```python
+from marker.converters.pdf import PdfConverter
+from marker.models import create_model_dict
+from marker.output import text_from_rendered
+
+converter = PdfConverter(
+    artifact_dict=create_model_dict(),
+)
+rendered = converter("FILEPATH")
+text, _, images = text_from_rendered(rendered)
+```
+
+`rendered` will be a Pydantic `BaseModel` with different properties depending on the output type requested.  With markdown output (default), you'll have the properties `markdown`, `metadata`, and `images`.  For json output, you'll have `children`, `block_type`, and `metadata`.
+
+### Custom configuration
+
+You can pass configuration using the `ConfigParser`.  To see all available options, do `marker_single --help`.
+
+```python
+from marker.converters.pdf import PdfConverter
+from marker.models import create_model_dict
+from marker.config.parser import ConfigParser
+
+config = {
+    "output_format": "json",
+    "ADDITIONAL_KEY": "VALUE"
+}
+config_parser = ConfigParser(config)
+
+converter = PdfConverter(
+    config=config_parser.generate_config_dict(),
+    artifact_dict=create_model_dict(),
+    processor_list=config_parser.get_processors(),
+    renderer=config_parser.get_renderer(),
+    llm_service=config_parser.get_llm_service()
+)
+rendered = converter("FILEPATH")
+```
+
+### Extract blocks
+
+Each document consists of one or more pages.  Pages contain blocks, which can themselves contain other blocks.  It's possible to programmatically manipulate these blocks.
+
+Here's an example of extracting all forms from a document:
+
+```python
+from marker.converters.pdf import PdfConverter
+from marker.models import create_model_dict
+from marker.schema import BlockTypes
+
+converter = PdfConverter(
+    artifact_dict=create_model_dict(),
+)
+document = converter.build_document("FILEPATH")
+forms = document.contained_blocks((BlockTypes.Form,))
+```
+
+Look at the processors for more examples of extracting and manipulating blocks.
+
+## Other converters
+
+You can also use other converters that define different conversion pipelines:
+
+### Extract tables
+
+The `TableConverter` will only convert and extract tables:
+
+```python
+from marker.converters.table import TableConverter
+from marker.models import create_model_dict
+from marker.output import text_from_rendered
+
+converter = TableConverter(
+    artifact_dict=create_model_dict(),
+)
+rendered = converter("FILEPATH")
+text, _, images = text_from_rendered(rendered)
+```
+
+This takes all the same configuration as the PdfConverter.  You can specify the configuration `force_layout_block=Table` to avoid layout detection and instead assume every page is a table.  Set `output_format=json` to also get cell bounding boxes.
+
+You can also run this via the CLI with
+```shell
+marker_single FILENAME --use_llm --force_layout_block Table --converter_cls marker.converters.table.TableConverter --output_format json
+```
+
+### OCR Only
+
+If you only want to run OCR, you can also do that through the `OCRConverter`.  Set `--keep_chars` to keep individual characters and bounding boxes.
+
+```python
+from marker.converters.ocr import OCRConverter
+from marker.models import create_model_dict
+
+converter = OCRConverter(
+    artifact_dict=create_model_dict(),
+)
+rendered = converter("FILEPATH")
+```
+
+This takes all the same configuration as the PdfConverter.
+
+You can also run this via the CLI with
+```shell
+marker_single FILENAME --converter_cls marker.converters.ocr.OCRConverter
+```
+
+### Structured Extraction (beta)
+
+You can run structured extraction via the `ExtractionConverter`.  This requires an LLM service to be set up first (see [here](#llm-services) for details).  You'll get a JSON output with the extracted values.
+
+```python
+from marker.converters.extraction import ExtractionConverter
+from marker.models import create_model_dict
+from marker.config.parser import ConfigParser
+from pydantic import BaseModel
+
+class Links(BaseModel):
+    links: list[str]
+
+schema = Links.model_json_schema()
+config_parser = ConfigParser({
+    "page_schema": schema
+})
+
+converter = ExtractionConverter(
+    artifact_dict=create_model_dict(),
+    config=config_parser.generate_config_dict(),
+    llm_service=config_parser.get_llm_service(),
+)
+rendered = converter("FILEPATH")
+```
+
+`rendered` will have an `original_markdown` field.  If you pass this back in the next time you run the converter, as the `existing_markdown` config key, you can skip re-parsing the document.
+
+# Output Formats
+
+## Markdown
+
+Markdown output will include:
+
+- image links (images will be saved in the same folder)
+- formatted tables
+- embedded LaTeX equations (fenced with `$$`)
+- Code is fenced with triple backticks
+- Superscripts for footnotes
+
+## HTML
+
+HTML output is similar to markdown output:
+
+- Images are included via `img` tags
+- equations are fenced with `<math>` tags
+- code is in `pre` tags
+
+## JSON
+
+JSON output will be organized in a tree-like structure, with the leaf nodes being blocks.  Examples of leaf nodes are a single list item, a paragraph of text, or an image.
+
+The output will be a list, with each list item representing a page.  Each page is considered a block in the internal marker schema.  There are different types of blocks to represent different elements.
+
+Pages have the keys:
+
+- `id` - unique id for the block.
+- `block_type` - the type of block. The possible block types can be seen in `marker/schema/__init__.py`.  As of this writing, they are ["Line", "Span", "FigureGroup", "TableGroup", "ListGroup", "PictureGroup", "Page", "Caption", "Code", "Figure", "Footnote", "Form", "Equation", "Handwriting", "TextInlineMath", "ListItem", "PageFooter", "PageHeader", "Picture", "SectionHeader", "Table", "Text", "TableOfContents", "Document"]
+- `html` - the HTML for the page.  Note that this will have recursive references to children.  The `content-ref` tags must be replaced with the child content if you want the full html.  You can see an example of this at `marker/output.py:json_to_html`.  That function will take in a single block from the json output, and turn it into HTML.
+- `polygon` - the 4-corner polygon of the page, in (x1,y1), (x2,y2), (x3, y3), (x4, y4) format.  (x1,y1) is the top left, and coordinates go clockwise.
+- `children` - the child blocks.
+
+The child blocks have two additional keys:
+
+- `section_hierarchy` - indicates the sections that the block is part of.  `1` indicates an h1 tag, `2` an h2, and so on.
+- `images` - base64 encoded images.  The key will be the block id, and the data will be the encoded image.
+
+Note that child blocks of pages can have their own children as well (a tree structure).
+
+```json
+{
+      "id": "/page/10/Page/366",
+      "block_type": "Page",
+      "html": "<content-ref src='/page/10/SectionHeader/0'></content-ref><content-ref src='/page/10/SectionHeader/1'></content-ref><content-ref src='/page/10/Text/2'></content-ref><content-ref src='/page/10/Text/3'></content-ref><content-ref src='/page/10/Figure/4'></content-ref><content-ref src='/page/10/SectionHeader/5'></content-ref><content-ref src='/page/10/SectionHeader/6'></content-ref><content-ref src='/page/10/TextInlineMath/7'></content-ref><content-ref src='/page/10/TextInlineMath/8'></content-ref><content-ref src='/page/10/Table/9'></content-ref><content-ref src='/page/10/SectionHeader/10'></content-ref><content-ref src='/page/10/Text/11'></content-ref>",
+      "polygon": [[0.0, 0.0], [612.0, 0.0], [612.0, 792.0], [0.0, 792.0]],
+      "children": [
+        {
+          "id": "/page/10/SectionHeader/0",
+          "block_type": "SectionHeader",
+          "html": "<h1>Supplementary Material for <i>Subspace Adversarial Training</i> </h1>",
+          "polygon": [
+            [217.845703125, 80.630859375], [374.73046875, 80.630859375],
+            [374.73046875, 107.0],
+            [217.845703125, 107.0]
+          ],
+          "children": null,
+          "section_hierarchy": {
+            "1": "/page/10/SectionHeader/1"
+          },
+          "images": {}
+        },
+        ...
+        ]
+    }
+
+
+```
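The `content-ref` substitution described above can be sketched as a small recursive function. This is an illustrative stand-alone version, not Marker's own `json_to_html`; `resolve_html` is a hypothetical name, and the regex assumes refs are written exactly as `<content-ref src='...'></content-ref>`, as in the example.

```python
import re

# Assumes refs look exactly like the example: <content-ref src='ID'></content-ref>
CONTENT_REF = re.compile(r"<content-ref src='([^']*)'></content-ref>")


def resolve_html(block: dict) -> str:
    """Return the block's HTML with every content-ref replaced by its child's HTML."""
    children = {child["id"]: child for child in (block.get("children") or [])}

    def substitute(match: re.Match) -> str:
        child = children.get(match.group(1))
        # Leave the ref untouched if the child is missing from this block.
        return resolve_html(child) if child else match.group(0)

    return CONTENT_REF.sub(substitute, block["html"])


# Hand-built sample block, shaped like the JSON output above.
page = {
    "id": "/page/0/Page/2",
    "html": "<content-ref src='/page/0/SectionHeader/0'></content-ref>"
            "<content-ref src='/page/0/Text/1'></content-ref>",
    "children": [
        {"id": "/page/0/SectionHeader/0", "html": "<h1>Title</h1>", "children": None},
        {"id": "/page/0/Text/1", "html": "<p>Body</p>", "children": None},
    ],
}
print(resolve_html(page))  # <h1>Title</h1><p>Body</p>
```

Because the function recurses into each child before substituting, nested groups (for example a `FigureGroup` holding a `Figure` and a `Caption`) are expanded the same way.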
+
+## Chunks
+
+Chunks format is similar to JSON, but flattens everything into a single list instead of a tree.  Only the top level blocks from each page show up. It also has the full HTML of each block inside, so you don't need to crawl the tree to reconstruct it.  This enables flexible and easy chunking for RAG.
+
+## Metadata
+
+All output formats will return a metadata dictionary, with the following fields:
+
+```json
+{
+    "table_of_contents": [
+      {
+        "title": "Introduction",
+        "heading_level": 1,
+        "page_id": 0,
+        "polygon": [...]
+      }
+    ], // computed PDF table of contents
+    "page_stats": [
+      {
+        "page_id":  0,
+        "text_extraction_method": "pdftext",
+        "block_counts": [("Span", 200), ...]
+      },
+      ...
+    ]
+}
+```
+
+# LLM Services
+
+When running with the `--use_llm` flag, you have a choice of services you can use:
+
+- `Gemini` - this will use the Gemini developer API by default.  You'll need to pass `--gemini_api_key` to configuration.
+- `Google Vertex` - this will use vertex, which can be more reliable.  You'll need to pass `--vertex_project_id`.  To use it, set `--llm_service=marker.services.vertex.GoogleVertexService`.
+- `Ollama` - this will use local models.  You can configure `--ollama_base_url` and `--ollama_model`. To use it, set `--llm_service=marker.services.ollama.OllamaService`.
+- `Claude` - this will use the anthropic API.  You can configure `--claude_api_key`, and `--claude_model_name`.  To use it, set `--llm_service=marker.services.claude.ClaudeService`.
+- `OpenAI` - this supports any openai-like endpoint. You can configure `--openai_api_key`, `--openai_model`, and `--openai_base_url`. To use it, set `--llm_service=marker.services.openai.OpenAIService`.
+- `Azure OpenAI` - this uses the Azure OpenAI service. You can configure `--azure_endpoint`, `--azure_api_key`, and `--deployment_name`. To use it, set `--llm_service=marker.services.azure_openai.AzureOpenAIService`.
+
+These services may have additional optional configuration as well - you can see it by viewing the classes.
+
+# Internals
+
+Marker is easy to extend.  The core units of marker are:
+
+- `Providers`, at `marker/providers`.  These provide information from a source file, like a PDF.
+- `Builders`, at `marker/builders`.  These generate the initial document blocks and fill in text, using info from the providers.
+- `Processors`, at `marker/processors`.  These process specific blocks, for example the table formatter is a processor.
+- `Renderers`, at `marker/renderers`. These use the blocks to render output.
+- `Schema`, at `marker/schema`.  The classes for all the block types.
+- `Converters`, at `marker/converters`.  They run the whole end to end pipeline.
+
+To customize processing behavior, override the `processors`.  To add new output formats, write a new `renderer`.  For additional input formats, write a new `provider`.
+
+Processors and renderers can be directly passed into the base `PdfConverter`, so you can specify your own custom processing easily.
+
+## API server
+
+There is a very simple API server you can run like this:
+
+```shell
+pip install -U uvicorn fastapi python-multipart
+marker_server --port 8001
+```
+
+This will start a fastapi server that you can access at `localhost:8001`.  You can go to `localhost:8001/docs` to see the endpoint options.
+
+You can send requests like this:
+
+```python
+import requests
+import json
+
+post_data = {
+    'filepath': 'FILEPATH',
+    # Add other params here
+}
+
+requests.post("http://localhost:8001/marker", data=json.dumps(post_data)).json()
+```
+
+Note that this is not a very robust API, and is only intended for small-scale use.  If you want to use this server, but want a more robust conversion option, you can use the hosted [Datalab API](https://www.datalab.to/plans).
+
+# Troubleshooting
+
+There are some settings that you may find useful if things aren't working the way you expect:
+
+- If you have issues with accuracy, try setting `--use_llm` to use an LLM to improve quality.  You must set `GOOGLE_API_KEY` to a Gemini API key for this to work.
+- Make sure to set `force_ocr` if you see garbled text - this will re-OCR the document.
+- `TORCH_DEVICE` - set this to force marker to use a given torch device for inference.
+- If you're getting out of memory errors, decrease worker count.  You can also try splitting up long PDFs into multiple files.
+
+## Debugging
+
+Pass the `debug` option to activate debug mode.  This will save images of each page with detected layout and text, as well as output a json file with additional bounding box information.
+
+# Benchmarks
+
+## Overall PDF Conversion
+
+We created a [benchmark set](https://huggingface.co/datasets/datalab-to/marker_benchmark) by extracting single PDF pages from common crawl.  We scored based on a heuristic that aligns text with ground truth text segments, and an LLM as a judge scoring method.
+
+| Method     | Avg Time | Heuristic Score | LLM Score |
+|------------|----------|-----------------|-----------|
+| marker     | 2.83837  | 95.6709         | 4.23916   |
+| llamaparse | 23.348   | 84.2442         | 3.97619   |
+| mathpix    | 6.36223  | 86.4281         | 4.15626   |
+| docling    | 3.69949  | 86.7073         | 3.70429   |
+
+Benchmarks were run on an H100 for marker and docling - llamaparse and mathpix used their cloud services.  We can also look at it by document type:
+
+<img src="data/images/per_doc.png" width="1000px"/>
+
+| Document Type        | Marker heuristic | Marker LLM | Llamaparse Heuristic | Llamaparse LLM | Mathpix Heuristic | Mathpix LLM | Docling Heuristic | Docling LLM |
+|----------------------|------------------|------------|----------------------|----------------|-------------------|-------------|-------------------|-------------|
+| Scientific paper     | 96.6737          | 4.34899    | 87.1651              | 3.96421        | 91.2267           | 4.46861     | 92.135            | 3.72422     |
+| Book page            | 97.1846          | 4.16168    | 90.9532              | 4.07186        | 93.8886           | 4.35329     | 90.0556           | 3.64671     |
+| Other                | 95.1632          | 4.25076    | 81.1385              | 4.01835        | 79.6231           | 4.00306     | 83.8223           | 3.76147     |
+| Form                 | 88.0147          | 3.84663    | 66.3081              | 3.68712        | 64.7512           | 3.33129     | 68.3857           | 3.40491     |
+| Presentation         | 95.1562          | 4.13669    | 81.2261              | 4              | 83.6737           | 3.95683     | 84.8405           | 3.86331     |
+| Financial document   | 95.3697          | 4.39106    | 82.5812              | 4.16111        | 81.3115           | 4.05556     | 86.3882           | 3.8         |
+| Letter               | 98.4021          | 4.5        | 93.4477              | 4.28125        | 96.0383           | 4.45312     | 92.0952           | 4.09375     |
+| Engineering document | 93.9244          | 4.04412    | 77.4854              | 3.72059        | 80.3319           | 3.88235     | 79.6807           | 3.42647     |
+| Legal document       | 96.689           | 4.27759    | 86.9769              | 3.87584        | 91.601            | 4.20805     | 87.8383           | 3.65552     |
+| Newspaper page       | 98.8733          | 4.25806    | 84.7492              | 3.90323        | 96.9963           | 4.45161     | 92.6496           | 3.51613     |
+| Magazine page        | 98.2145          | 4.38776    | 87.2902              | 3.97959        | 93.5934           | 4.16327     | 93.0892           | 4.02041     |
+
+## Throughput
+
+We benchmarked throughput using a [single long PDF](https://www.greenteapress.com/thinkpython/thinkpython.pdf).
+
+| Method  | Time per page | Time per document | VRAM used |
+|---------|---------------|-------------------|---------- |
+| marker  | 0.18          | 43.42             |  3.17GB   |
+
+The projected throughput is 122 pages per second on an H100 - we can run 22 individual processes given the VRAM used.
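The 122 pages/second projection can be reproduced from the table with simple arithmetic. The sketch below assumes the 22 processes run fully in parallel with no contention, which is an idealization, and simply multiplies out the per-process VRAM figure.

```python
time_per_page_s = 0.18      # "Time per page" from the table above
processes = 22              # parallel processes stated above
vram_per_process_gb = 3.17  # "VRAM used" from the table above

pages_per_second = processes / time_per_page_s
total_vram_gb = processes * vram_per_process_gb

print(f"{pages_per_second:.0f} pages/s, {total_vram_gb:.1f} GB VRAM")
# 122 pages/s, 69.7 GB VRAM
```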
+
+## Table Conversion
+
+Marker can extract tables from PDFs using `marker.converters.table.TableConverter`. The table extraction performance is measured by comparing the extracted HTML representation of tables against the original HTML representations using the test split of [FinTabNet](https://developer.ibm.com/exchanges/data/all/fintabnet/). The HTML representations are compared using a tree edit distance based metric to judge both structure and content. Marker detects and identifies the structure of all tables in a PDF page and achieves these scores:
+
+| Method           | Avg score | Total tables |
+|------------------|-----------|--------------|
+| marker           | 0.816     | 99           |
+| marker w/use_llm | 0.907     | 99           |
+| gemini           | 0.829     | 99           |
+
+The `--use_llm` flag can significantly improve table recognition performance, as you can see.
+
+We filter out tables that we cannot align with the ground truth, since fintabnet and our layout model have slightly different detection methods (this results in some tables being split/merged).
+
+## Running your own benchmarks
+
+You can benchmark the performance of marker on your machine. Install marker manually with:
+
+```shell
+git clone https://github.com/VikParuchuri/marker.git
+poetry install
+```
+
+### Overall PDF Conversion
+
+Download the benchmark data [here](https://drive.google.com/file/d/1ZSeWDo2g1y0BRLT7KnbmytV2bjWARWba/view?usp=sharing) and unzip. Then run the overall benchmark like this:
+
+```shell
+python benchmarks/overall.py --methods marker --scores heuristic,llm
+```
+
+Options:
+
+- `--use_llm` use an llm to improve the marker results.
+- `--max_rows` how many rows to process for the benchmark.
+- `--methods` can be `llamaparse`, `mathpix`, `docling`, `marker`.  Comma separated.
+- `--scores` which scoring functions to use, can be `llm`, `heuristic`.  Comma separated.
+
+### Table Conversion
+The processed FinTabNet dataset is hosted [here](https://huggingface.co/datasets/datalab-to/fintabnet-test) and is automatically downloaded. Run the benchmark with:
+
+```shell
+python benchmarks/table/table.py --max_rows 100
+```
+
+Options:
+
+- `--use_llm` uses an llm with marker to improve accuracy.
+- `--use_gemini` also benchmarks gemini 2.0 flash.
+
+# How it works
+
+Marker is a pipeline of deep learning models:
+
+- Extract text, OCR if necessary (heuristics, [surya](https://github.com/VikParuchuri/surya))
+- Detect page layout and find reading order ([surya](https://github.com/VikParuchuri/surya))
+- Clean and format each block (heuristics, [texify](https://github.com/VikParuchuri/texify), [surya](https://github.com/VikParuchuri/surya))
+- Optionally use an LLM to improve quality
+- Combine blocks and postprocess complete text
+
+It only uses models where necessary, which improves speed and accuracy.
+
+# Limitations
+
+PDF is a tricky format, so marker will not always work perfectly.  Here are some known limitations that are on the roadmap to address:
+
+- Very complex layouts, with nested tables and forms, may not work
+- Forms may not be rendered well
+
+Note: Passing the `--use_llm` and `--force_ocr` flags will mostly solve these issues.
+
+# Usage and Deployment Examples
+
+You can always run `marker` locally, but if you wanted to expose it as an API, we have a few options:
+- Our platform API, which is powered by `marker` and `surya` and is easy to test out. It's free to sign up, and we'll include credits; [try it out here](https://datalab.to).
+- Our painless on-prem solution for commercial use, which you can [read about here](https://www.datalab.to/blog/self-serve-on-prem-licensing) and gives you privacy guarantees with high throughput inference optimizations.
+- [Deployment example with Modal](./examples/README_MODAL.md) that shows you how to deploy and access `marker` through a web endpoint using [`Modal`](https://modal.com). Modal is an AI compute platform that enables developers to deploy and scale models on GPUs in minutes.
@@ -52,6 +52,76 @@ graph TD
    end
 ```

+## Marker API Response Structure
+
+The PDF parsing pipeline calls a local [Marker](https://github.com/VikParuchuri/marker) server (`POST /marker/upload`).
+
+### Top-level envelope
+
+```json
+{
+  "format": "json",
+  "output": "<JSON-encoded string>"
+}
+```
+
+`output` is a **JSON-encoded string** (not a nested object) and must be parsed a second time to get the document tree.
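A minimal sketch of the double parse, using only the Python standard library. The sample envelope is hand-built for illustration rather than captured from a real server, and `parse_marker_envelope` is a hypothetical helper name.

```python
import json


def parse_marker_envelope(body: str) -> dict:
    """Decode the envelope, then decode the JSON string held in `output`."""
    envelope = json.loads(body)             # first parse: the envelope object
    if envelope.get("format") != "json":
        raise ValueError(f"unexpected format: {envelope.get('format')!r}")
    return json.loads(envelope["output"])   # second parse: the document tree


# Hand-built sample; a real response body would come from the Marker server.
sample_tree = {"children": [{"id": "/page/0/Page/0", "block_type": "Page", "children": None}]}
sample_body = json.dumps({"format": "json", "output": json.dumps(sample_tree)})

tree = parse_marker_envelope(sample_body)
print(tree["children"][0]["block_type"])  # Page
```

Checking `format` before the second parse is a defensive choice in this sketch: if the server were asked for another output format, `output` would presumably not be JSON and the second `json.loads` would fail with a less helpful error.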
+
+### Parsed `output` shape
+
+```
+{
+  "children": [ <Page block>, ... ]
+}
+```
+
+### Block types
+
+Every block shares these fields:
+
+| Field               | Type            | Notes                                      |
+|---------------------|-----------------|--------------------------------------------|
+| `id`                | string          | e.g. `/page/0/Picture/2`                   |
+| `block_type`        | string          | see table below                            |
+| `html`              | string          | rendered HTML; may contain `<content-ref>` |
+| `bbox`              | `[x0,y0,x1,y1]` | bounding box in page coordinates           |
+| `children`          | array or null   | nested blocks                              |
+| `images`            | object or null  | base64 PNG map (leaf image blocks only)    |
+| `section_hierarchy` | object          | heading ancestry                           |
+
+#### Known `block_type` values
+
+| block_type       | Category | Notes                                                 |
+|------------------|----------|-------------------------------------------------------|
+| `Page`           | structure | Top-level; direct children are the page content       |
+| `SectionHeader`  | text      | Section / chapter heading                             |
+| `Text`           | text      |                                                       |
+| `TextInlineMath` | text      |                                                       |
+| `ListItem`       | text      |                                                       |
+| `Table`          | text      |                                                       |
+| `Code`           | text      |                                                       |
+| `Equation`       | text      |                                                       |
+| `Footnote`       | text      |                                                       |
+| `Caption`        | text      | Usually a child of a `*Group` block                   |
+| `PageHeader`     | text      |                                                       |
+| `PageFooter`     | text      |                                                       |
+| `Handwriting`    | text      |                                                       |
+| `Picture`        | image     | Leaf block; `images` map holds base64 PNG keyed by ID |
+| `Figure`         | image     | Leaf block; same as `Picture`                         |
+| `PictureGroup`   | container | Wraps one `Picture` + one `Caption` child             |
+| `FigureGroup`    | container | Wraps one `Figure` + one `Caption` child              |
+
+### Image extraction
+
+Images are only present on **leaf** image blocks (`Picture`, `Figure`).
+Group blocks (`PictureGroup`, `FigureGroup`) have `images: null` — the base64 PNG lives on the child leaf block.
+
+```
+PictureGroup
+├── Picture   ← images: { "/page/0/Picture/2": "<base64 PNG>" }
+└── Caption   ← html: "<p>Figure 1 — ...</p>"
+```
+
 ## Stack

 - **Backend**: Spring Boot 4.0.5 + Spring AI 2.0.0-M4, Java 21, Maven
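A minimal sketch of the leaf-only image rule described in the "Image extraction" section above, assuming the parsed `output` tree has already been mapped onto a hypothetical `Block` record. The record, class and method names are illustrative and do not come from this commit. The walk is depth-first and decodes a base64 payload only where `images` is non-null, which per the README means `Picture` and `Figure` leaves:

```java
import java.util.Base64;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class MarkerImageWalk {

    // Hypothetical mirror of a Marker block, reduced to the fields this sketch needs.
    record Block(String id, String blockType, List<Block> children, Map<String, String> images) {}

    // Depth-first walk: decode base64 PNGs wherever a block carries a non-null images map.
    static Map<String, byte[]> collectImages(Block root) {
        Map<String, byte[]> out = new LinkedHashMap<>();
        collect(root, out);
        return out;
    }

    private static void collect(Block block, Map<String, byte[]> out) {
        if (block.images() != null) {
            block.images().forEach((id, b64) -> out.put(id, Base64.getDecoder().decode(b64)));
        }
        if (block.children() != null) {
            for (Block child : block.children()) {
                collect(child, out);
            }
        }
    }

    public static void main(String[] args) {
        // Stand-in payload; a real response would carry PNG bytes here.
        String b64 = Base64.getEncoder().encodeToString(new byte[] {1, 2, 3});
        Block picture = new Block("/page/0/Picture/2", "Picture", null,
                Map.of("/page/0/Picture/2", b64));
        Block caption = new Block("/page/0/Caption/3", "Caption", null, null);
        Block group = new Block("/page/0/PictureGroup/1", "PictureGroup",
                List.of(picture, caption), null);
        Block page = new Block("/page/0/Page/0", "Page", List.of(group), null);

        System.out.println(collectImages(page).keySet());
    }
}
```

Keying the result by block ID keeps it compatible with a later block-ID-to-figure-ID mapping, since the same IDs appear in the rendered HTML.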
@@ -81,6 +151,8 @@ npm run dev

 ### Environment Variables

+#### Backend
+
 | Variable | Required | Description |
 |----------|----------|-------------|
 | `OPENAI_API_KEY` | Yes | OpenAI API key for embeddings and chat |
@@ -89,3 +161,14 @@ npm run dev
 | `DB_USERNAME` | Yes | Database username |
 | `DB_PASSWORD` | Yes | Database password |
 | `FIGURE_STORAGE_PATH` | No | Base path for uploaded PDFs and extracted figures (default: `./uploads`) |
+| `UPLOAD_ENABLED` | No | Set to `false` to disable the book upload endpoint (default: `true`) |
+| `DELETE_ENABLED` | No | Set to `false` to disable the book delete endpoint (default: `true`) |
+
+#### Frontend
+
+| Variable | Required | Description |
+|----------|----------|-------------|
+| `VITE_API_URL` | No | Backend API base URL (default: `/api/v1`) |
+| `VITE_APP_PASSWORD` | Yes | Shared password for HTTP Basic auth (must match `APP_PASSWORD`) |
+| `VITE_UPLOAD_ENABLED` | No | Set to `false` to hide the upload UI (default: `true`) |
+| `VITE_DELETE_ENABLED` | No | Set to `false` to hide the delete button (default: `true`) |
@@ -32,6 +32,13 @@
        <type>pom</type>
        <scope>import</scope>
      </dependency>
+      <dependency>
+        <groupId>software.amazon.awssdk</groupId>
+        <artifactId>bom</artifactId>
+        <version>2.30.14</version>
+        <type>pom</type>
+        <scope>import</scope>
+      </dependency>
    </dependencies>
  </dependencyManagement>

@@ -101,13 +108,19 @@
      <artifactId>spring-ai-pdf-document-reader</artifactId>
    </dependency>

-    <!-- PDFBox — explicit for image extraction per page -->
+    <!-- PDFBox — page rendering and cropping for figure extraction -->
    <dependency>
      <groupId>org.apache.pdfbox</groupId>
      <artifactId>pdfbox</artifactId>
      <version>3.0.3</version>
    </dependency>

+    <!-- AWS SDK v2 — S3 figure storage -->
+    <dependency>
+      <groupId>software.amazon.awssdk</groupId>
+      <artifactId>s3</artifactId>
+    </dependency>
+
    <!-- Jackson (JSON) -->
    <dependency>
      <groupId>com.fasterxml.jackson.core</groupId>
@@ -2,7 +2,10 @@ package com.aiteacher.book;

 import com.aiteacher.document.FigureEntity;
 import com.aiteacher.document.FigureRepository;
+import com.aiteacher.document.MarkdownStorageService;
+import org.springframework.beans.factory.annotation.Value;
 import org.springframework.http.HttpStatus;
+import org.springframework.http.MediaType;
 import org.springframework.http.ResponseEntity;
 import org.springframework.web.bind.annotation.*;
 import org.springframework.web.multipart.MultipartFile;
@@ -18,14 +21,24 @@ public class BookController {

    private final BookService bookService;
    private final FigureRepository figureRepository;
+    private final MarkdownStorageService markdownStorageService;

-    public BookController(BookService bookService, FigureRepository figureRepository) {
+    @Value("${app.features.upload-enabled:true}")
+    private boolean uploadEnabled;
+
+    @Value("${app.features.delete-enabled:true}")
+    private boolean deleteEnabled;
+
+    public BookController(BookService bookService, FigureRepository figureRepository,
+                          MarkdownStorageService markdownStorageService) {
        this.bookService = bookService;
        this.figureRepository = figureRepository;
+        this.markdownStorageService = markdownStorageService;
    }

    @PostMapping(consumes = "multipart/form-data")
    public ResponseEntity<?> upload(@RequestParam("file") MultipartFile file) throws IOException {
+        if (!uploadEnabled) return ResponseEntity.status(HttpStatus.METHOD_NOT_ALLOWED).build();
        Book book = bookService.upload(file);
        return ResponseEntity.status(HttpStatus.ACCEPTED).body(toSummaryResponse(book));
    }
@@ -46,6 +59,7 @@ public class BookController {

    @DeleteMapping("/{id}")
    public ResponseEntity<Void> delete(@PathVariable UUID id) {
+        if (!deleteEnabled) return ResponseEntity.status(HttpStatus.METHOD_NOT_ALLOWED).build();
        bookService.delete(id);
        return ResponseEntity.noContent().build();
    }
@@ -59,6 +73,17 @@ public class BookController {
        ));
    }

+    @GetMapping(value = "/{id}/pages/{pageNumber}/html", produces = MediaType.TEXT_HTML_VALUE)
+    public ResponseEntity<String> getPageHtml(@PathVariable UUID id,
+                                               @PathVariable int pageNumber) {
+        bookService.getById(id); // 404 if not found
+        try {
+            return ResponseEntity.ok(markdownStorageService.getText(id, pageNumber));
+        } catch (Exception e) {
+            return ResponseEntity.notFound().build();
+        }
+    }
+
    @GetMapping("/{id}/figures")
    public ResponseEntity<List<FigureResponse>> figures(@PathVariable UUID id) {
        bookService.getById(id); // 404 if not found
@@ -2,6 +2,7 @@ package com.aiteacher.book;

 import com.aiteacher.document.*;
 import com.aiteacher.figure.FigureStorageService;
+
 import org.slf4j.Logger;
 import org.slf4j.LoggerFactory;
 import org.springframework.ai.document.Document;
@@ -23,13 +24,7 @@ public class BookEmbeddingService {

    private final VectorStore vectorStore;
    private final BookRepository bookRepository;
-
-    @Value("${app.embedding.batch-size:50}")
-    private int embeddingBatchSize;
-
-    @Value("${app.embedding.batch-delay-ms:1000}")
-    private long embeddingBatchDelayMs;
-    private final PdfStructureParser pdfStructureParser;
+    private final MarkerPageParser markerPageParser;
    private final FigureExtractionService figureExtractionService;
    private final VisionDescriptionService visionDescriptionService;
    private final TextChunkingService textChunkingService;
@@ -39,11 +34,21 @@ public class BookEmbeddingService {
    private final FigureRepository figureRepository;
    private final ChunkFigureRefRepository chunkFigureRefRepository;
    private final FigureStorageService figureStorageService;
+    private final MarkdownStorageService markdownStorageService;
+
+    @Value("${app.embedding.batch-size:50}")
+    private int embeddingBatchSize;
+
+    @Value("${app.embedding.batch-delay-ms:1000}")
+    private long embeddingBatchDelayMs;
+
+    @Value("${app.embedding.skip-embedding:false}")
+    private boolean skipEmbedding;

    public BookEmbeddingService(
            VectorStore vectorStore,
            BookRepository bookRepository,
-            PdfStructureParser pdfStructureParser,
+            MarkerPageParser markerPageParser,
            FigureExtractionService figureExtractionService,
            VisionDescriptionService visionDescriptionService,
            TextChunkingService textChunkingService,
@@ -52,10 +57,11 @@ public class BookEmbeddingService {
            ChapterRepository chapterRepository,
            FigureRepository figureRepository,
            ChunkFigureRefRepository chunkFigureRefRepository,
-            FigureStorageService figureStorageService) {
+            FigureStorageService figureStorageService,
+            MarkdownStorageService markdownStorageService) {
        this.vectorStore = vectorStore;
        this.bookRepository = bookRepository;
-        this.pdfStructureParser = pdfStructureParser;
+        this.markerPageParser = markerPageParser;
        this.figureExtractionService = figureExtractionService;
        this.visionDescriptionService = visionDescriptionService;
        this.textChunkingService = textChunkingService;
@@ -65,11 +71,12 @@ public class BookEmbeddingService {
        this.figureRepository = figureRepository;
        this.chunkFigureRefRepository = chunkFigureRefRepository;
        this.figureStorageService = figureStorageService;
+        this.markdownStorageService = markdownStorageService;
    }

    @Async
    public void embedBook(UUID bookId, String bookTitle, Path pdfPath) {
-        log.info("Starting image-aware embedding for book {} ({})", bookId, bookTitle);
+        log.info("Starting Marker-powered embedding for book {} ({})", bookId, bookTitle);

        Book book = bookRepository.findById(bookId).orElse(null);
        if (book == null) {
@@ -81,59 +88,78 @@ public class BookEmbeddingService {
            book.setStatus(BookStatus.PROCESSING);
            bookRepository.save(book);

-            // Step 1: Parse PDF into page-level sections persisted in Postgres
-            List<SectionEntity> sections = pdfStructureParser.parse(bookId, bookTitle, pdfPath);
            String chapterId = bookId + "-ch1";
+            ChapterEntity chapter = new ChapterEntity(chapterId, bookId, 1, bookTitle, 1);
+            chapterRepository.save(chapter);

-            // Step 2: Build and embed text chunks for all sections in batches
+            // Step 1: Parse with Marker, producing JSON (structured) + HTML (per-page) in parallel
+            ParsedBook parsed = markerPageParser.parse(pdfPath);
+
+            List<PageResult> pageResults = parsed.pages();
+
+            // Step 2: Build SectionEntity per page and persist
+            List<SectionEntity> sections = buildAndSaveSections(bookId, bookTitle, chapterId, pageResults);
+
+            // Step 3: Chunk and embed text
            List<Document> allChunks = new ArrayList<>();
            for (SectionEntity section : sections) {
-                List<Document> chunks = textChunkingService.chunk(section, bookTitle);
-                allChunks.addAll(chunks);
+                allChunks.addAll(textChunkingService.chunk(section, bookTitle));
+            }
+            if (skipEmbedding) {
+                log.info("skip-embedding=true — skipping text embedding for book {}", bookId);
+            } else {
+                embedInBatches(allChunks, bookId);
+                log.info("Embedded {} text chunks for book {}", allChunks.size(), bookId);
            }
-            embedInBatches(allChunks, bookId);
-            log.info("Embedded {} text chunks for book {}", allChunks.size(), bookId);

-            // Step 3: Extract images from the PDF, save to file store, persist FigureEntity
-            List<FigureEntity> figures = figureExtractionService.extract(
-                bookId, chapterId, sections, pdfPath);
+            // Step 4: Decode pre-cropped figures from Marker output
+            FigureExtractionService.ExtractionResult extraction =
+                    figureExtractionService.extract(bookId, chapterId, pageResults);
+            List<FigureEntity> figures = extraction.figures();

-            // Step 4: For each figure, generate vision description and embed caption
+            // Step 4b: Save per-page HTML to S3, replacing Marker image src with API URLs
+            parsed.htmlByPage().forEach((pageNumber, html) -> {
+                String resolved = resolveImageSrcs(html, bookId, extraction.blockIdToFigureId());
+                markdownStorageService.save(bookId, pageNumber, resolved);
+            });
+            log.info("Saved {} HTML pages to S3 for book {}", parsed.htmlByPage().size(), bookId);
+
+            // Step 5: Vision analysis (description + visible text) → embed figure chunks
            for (FigureEntity figure : figures) {
-                Path imagePath = figureStorageService.resolve(figure.getImagePath());
-                String description = visionDescriptionService.describe(
-                    imagePath, figure.getCaption());
+                byte[] imageBytes = figureStorageService.getBytes(figure.getImagePath());
+                VisionDescriptionService.ImageAnalysis analysis =
+                        visionDescriptionService.analyze(imageBytes, figure.getCaption());

-                // Use description as caption fallback if no caption was detected
                if (figure.getCaption() == null || figure.getCaption().isBlank()) {
-                    figure.setCaption(description);
+                    figure.setCaption(analysis.description());
                    figureRepository.save(figure);
                }

-                // Content for embedding = vision description + caption for maximum signal
-                String embeddingContent = description
-                    + (figure.getCaption() != null ? "\n" + figure.getCaption() : "");
+                // Embedding content: description + caption + visible image text
+                String embeddingContent = analysis.description()
+                        + (figure.getCaption() != null ? "\n" + figure.getCaption() : "")
+                        + (analysis.imageText().isEmpty() ? "" : "\n" + analysis.imageText());

                String embeddingId = UUID.randomUUID().toString();
-                Map<String, Object> metadata = buildFigureMetadata(figure, bookTitle, embeddingId);
-                Document figureDoc = new Document(embeddingId, embeddingContent, metadata);
-                vectorStore.add(List.of(figureDoc));
-
-                figure.setCaptionEmbeddingId(UUID.fromString(embeddingId));
+                if (!skipEmbedding) {
+                    Document figureDoc = new Document(embeddingId, embeddingContent,
+                            buildFigureMetadata(figure, bookTitle, embeddingId, analysis.imageText()));
+                    vectorStore.add(List.of(figureDoc));
+                    figure.setCaptionEmbeddingId(UUID.fromString(embeddingId));
+                }
                figureRepository.save(figure);
            }
-            log.info("Embedded {} figure captions for book {}", figures.size(), bookId);
+            log.info("Processed {} figures for book {} (embedding skipped: {})", figures.size(), bookId, skipEmbedding);

-            // Step 5: Link text chunks to figures via text references
+            // Step 6: Link text chunks to figures via in-text references
            for (SectionEntity section : sections) {
                List<Document> sectionChunks = allChunks.stream()
-                    .filter(d -> section.getId().equals(d.getMetadata().get("section_id")))
-                    .toList();
+                        .filter(d -> section.getId().equals(d.getMetadata().get("section_id")))
+                        .toList();
                List<FigureEntity> sectionFigures = figures.stream()
-                    .filter(f -> section.getId().equals(f.getSectionId()))
-                    .toList();
-                chunkFigureRefService.linkChunksToFigures(
-                    sectionChunks, sectionFigures, section.getPageStart());
+                        .filter(f -> section.getId().equals(f.getSectionId()))
+                        .toList();
+                chunkFigureRefService.linkChunksToFigures(sectionChunks, sectionFigures, section.getPageStart());
            }

            book.setStatus(BookStatus.READY);
@@ -142,7 +168,7 @@ public class BookEmbeddingService {
            bookRepository.save(book);

            log.info("Finished embedding book {} — {} pages, {} figures",
-                bookId, sections.size(), figures.size());
+                    bookId, sections.size(), figures.size());

        } catch (Exception ex) {
            log.error("Failed to embed book {}", bookId, ex);
@@ -156,53 +182,63 @@ public class BookEmbeddingService {
    public void deleteBookChunks(UUID bookId) {
        log.info("Deleting all data for book {}", bookId);
        try {
-            // Delete chunk-figure refs (by figureId for this book)
            List<String> figureIds = figureRepository.findAllByBookId(bookId)
-                .stream().map(FigureEntity::getId).toList();
+                    .stream().map(FigureEntity::getId).toList();
            if (!figureIds.isEmpty()) {
                chunkFigureRefRepository.deleteByFigureIdIn(figureIds);
            }
-
-            // Delete figures from Postgres
            figureRepository.deleteAllByBookId(bookId);
-
-            // Delete figure files from disk
            figureStorageService.deleteAll(bookId);
-
-            // Delete sections and chapters from Postgres
+            markdownStorageService.deleteAll(bookId);
            sectionRepository.deleteAllByBookId(bookId);
            chapterRepository.deleteAllByBookId(bookId);

-            // Delete vector store entries (text chunks + figure embeddings)
            FilterExpressionBuilder b = new FilterExpressionBuilder();
            vectorStore.delete(b.eq("book_id", bookId.toString()).build());
-
        } catch (Exception ex) {
            log.warn("Error during cleanup for book {}: {}", bookId, ex.getMessage());
        }
    }

+    // --- Private helpers ---
+
+    private List<SectionEntity> buildAndSaveSections(UUID bookId, String bookTitle,
+                                                      String chapterId,
+                                                      List<PageResult> pageResults) {
+        List<SectionEntity> sections = new ArrayList<>();
+        for (PageResult page : pageResults) {
+            if (page.orderedText().isBlank()) continue;
+
+            String sectionId = bookId + "-p" + page.pageNumber();
+            String title = page.headingTitle() != null ? page.headingTitle() : "Page " + page.pageNumber();
+
+            SectionEntity section = new SectionEntity(
+                    sectionId, chapterId, bookId,
+                    String.valueOf(page.pageNumber()),
+                    title,
+                    page.pageNumber(), page.pageNumber(),
+                    page.orderedText());
+            sections.add(sectionRepository.save(section));
+        }
+        return sections;
+    }
+
    private void embedInBatches(List<Document> docs, UUID bookId) {
        int total = docs.size();
        for (int i = 0; i < total; i += embeddingBatchSize) {
            List<Document> batch = docs.subList(i, Math.min(i + embeddingBatchSize, total));
            vectorStore.add(batch);
-            int batchNum = i / embeddingBatchSize + 1;
-            int totalBatches = (total - 1) / embeddingBatchSize + 1;
-            log.debug("Embedded batch {}/{} for book {}", batchNum, totalBatches, bookId);
+            log.debug("Embedded batch {}/{} for book {}",
+                    i / embeddingBatchSize + 1, (total - 1) / embeddingBatchSize + 1, bookId);
            if (i + embeddingBatchSize < total) {
-                try {
-                    Thread.sleep(embeddingBatchDelayMs);
-                } catch (InterruptedException e) {
-                    Thread.currentThread().interrupt();
-                    log.warn("Embedding batch sleep interrupted for book {}", bookId);
-                }
+                try { Thread.sleep(embeddingBatchDelayMs); }
+                catch (InterruptedException e) { Thread.currentThread().interrupt(); }
            }
        }
    }

    private Map<String, Object> buildFigureMetadata(FigureEntity figure, String bookTitle,
-                                                     String embeddingId) {
+                                                     String embeddingId, String imageText) {
        Map<String, Object> m = new HashMap<>();
        m.put("type", "FIGURE");
        m.put("book_id", figure.getBookId().toString());
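The batching loop in `embedInBatches` above can be illustrated in isolation. This is a minimal sketch, assuming a generic `partition` helper and a `batchCount` function (both hypothetical, not part of this commit) that slice a list the same way and reuse the ceiling-division formula from the debug log line:

```java
import java.util.ArrayList;
import java.util.List;

class BatchSketch {

    // Ceiling division as used in the debug log, with an explicit guard for an empty list.
    static int batchCount(int total, int batchSize) {
        return total == 0 ? 0 : (total - 1) / batchSize + 1;
    }

    // Slices items into consecutive sublist views of at most batchSize elements.
    static <T> List<List<T>> partition(List<T> items, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int i = 0; i < items.size(); i += batchSize) {
            batches.add(items.subList(i, Math.min(i + batchSize, items.size())));
        }
        return batches;
    }

    public static void main(String[] args) {
        List<Integer> docs = List.of(1, 2, 3, 4, 5);
        System.out.println(partition(docs, 2));
        System.out.println(batchCount(docs.size(), 2));
    }
}
```

The positive-size guard is an addition of this sketch: in the loop above, a batch size of zero would keep `i += embeddingBatchSize` from ever advancing.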
@@ -215,9 +251,26 @@ public class BookEmbeddingService {
        m.put("label", figure.getLabel() != null ? figure.getLabel() : "");
        m.put("page", figure.getPage());
        m.put("embedding_id", embeddingId);
+        m.put("image_text", imageText);  // verbatim text visible inside the image
        return m;
    }

+    /**
+     * Replaces Marker's {@code src='{blockId}'} image attributes with resolved API URLs.
+     * Block IDs look like {@code /page/0/Figure/2}.
+     */
+    private String resolveImageSrcs(String html, UUID bookId, Map<String, String> blockIdToFigureId) {
+        for (Map.Entry<String, String> entry : blockIdToFigureId.entrySet()) {
+            String blockId = entry.getKey();
+            String figureId = entry.getValue();
+            String apiUrl = "/api/v1/figures/" + bookId + "/" + figureId + ".png";
+            // Marker emits both single and double-quoted src attributes
+            html = html.replace("src='" + blockId + "'", "src='" + apiUrl + "'");
+            html = html.replace("src=\"" + blockId + "\"", "src=\"" + apiUrl + "\"");
+        }
+        return html;
+    }
+
    private String truncate(String msg, int max) {
        if (msg == null) return null;
        return msg.length() <= max ? msg : msg.substring(0, max);
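A standalone sketch of the `src` rewriting done by `resolveImageSrcs` above, using the same two `String.replace` calls. The class name, the sample HTML and the figure ID `fig-a` are made up for illustration:

```java
import java.util.Map;
import java.util.UUID;

class ImageSrcSketch {

    // Same strategy as resolveImageSrcs: swap each quoted block ID for an API URL.
    static String resolve(String html, UUID bookId, Map<String, String> blockIdToFigureId) {
        for (Map.Entry<String, String> e : blockIdToFigureId.entrySet()) {
            String apiUrl = "/api/v1/figures/" + bookId + "/" + e.getValue() + ".png";
            // Handle both quoting styles; the closing quote is part of the match.
            html = html.replace("src='" + e.getKey() + "'", "src='" + apiUrl + "'");
            html = html.replace("src=\"" + e.getKey() + "\"", "src=\"" + apiUrl + "\"");
        }
        return html;
    }

    public static void main(String[] args) {
        UUID bookId = UUID.fromString("00000000-0000-0000-0000-000000000001");
        String html = "<img src='/page/0/Figure/2'><img src=\"/page/0/Figure/21\">";
        System.out.println(resolve(html, bookId, Map.of("/page/0/Figure/2", "fig-a")));
    }
}
```

Because the closing quote is included in the search string, a mapping for `/page/0/Figure/2` does not touch `/page/0/Figure/21`. Block IDs that were skipped during extraction (for example, images below the minimum size) have no map entry, so their `src` attributes stay as Marker emitted them.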
@@ -1,25 +1,37 @@
 package com.aiteacher.config;

-import org.springframework.beans.factory.annotation.Value;
-import org.springframework.context.annotation.Configuration;
-import org.springframework.web.servlet.config.annotation.ResourceHandlerRegistry;
-import org.springframework.web.servlet.config.annotation.WebMvcConfigurer;
+import com.aiteacher.figure.FigureStorageService;
+import org.springframework.http.HttpStatus;
+import org.springframework.web.bind.annotation.*;
+import org.springframework.web.server.ResponseStatusException;

-import java.nio.file.Paths;
+import jakarta.servlet.http.HttpServletResponse;
+import java.io.IOException;

-@Configuration
-public class FigureStorageConfig implements WebMvcConfigurer {
+/**
+ * Serves figure images by redirecting to a presigned S3 URL.
+ * The key stored in DB is the full S3 object key, e.g. "figures/{bookId}/{figureId}.png".
+ */
+@RestController
+@RequestMapping("/api/v1/figures")
+public class FigureStorageConfig {

-    private final String basePath;
+    private final FigureStorageService figureStorageService;

-    public FigureStorageConfig(@Value("${app.figure-storage.base-path:./uploads}") String basePath) {
-        this.basePath = Paths.get(basePath).toAbsolutePath().normalize().toString();
+    public FigureStorageConfig(FigureStorageService figureStorageService) {
+        this.figureStorageService = figureStorageService;
    }

-    @Override
-    public void addResourceHandlers(ResourceHandlerRegistry registry) {
-        // Serve GET /api/v1/figures/** from the local file store
-        registry.addResourceHandler("/api/v1/figures/**")
-                .addResourceLocations("file:" + basePath + "/figures/");
+    @GetMapping("/{bookId}/{filename}")
+    public void serve(@PathVariable String bookId,
+                      @PathVariable String filename,
+                      HttpServletResponse response) throws IOException {
+        String key = "figures/" + bookId + "/" + filename;
+        try {
+            String url = figureStorageService.presignedUrl(key);
+            response.sendRedirect(url);
+        } catch (Exception ex) {
+            throw new ResponseStatusException(HttpStatus.NOT_FOUND, "Figure not found: " + key);
+        }
    }
 }
@@ -0,0 +1,30 @@
+package com.aiteacher.config;
+
+import org.springframework.beans.factory.annotation.Value;
+import org.springframework.context.annotation.Bean;
+import org.springframework.context.annotation.Configuration;
+import org.springframework.http.client.JdkClientHttpRequestFactory;
+import org.springframework.web.client.RestClient;
+
+import java.net.http.HttpClient;
+
+@Configuration
+public class MarkerConfig {
+
+    @Value("${app.marker.base-url:http://localhost:8000}")
+    private String markerBaseUrl;
+
+    @Bean
+    RestClient markerRestClient() {
+        // Use the JDK HTTP client with no timeout — Marker conversions can take several minutes.
+        HttpClient httpClient = HttpClient.newBuilder()
+                .build();
+        JdkClientHttpRequestFactory factory = new JdkClientHttpRequestFactory(httpClient);
+        // No read timeout set: JDK HTTP client defaults to no deadline.
+
+        return RestClient.builder()
+                .baseUrl(markerBaseUrl)
+                .requestFactory(factory)
+                .build();
+    }
+}
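`MarkerConfig` above deliberately leaves the client without deadlines. A minimal sketch, using only the JDK `java.net.http` API, of one possible middle ground: bound the connection attempt so an unreachable Marker server fails fast, while leaving the response wait unbounded for long conversions. The 5-second value and the class and method names are illustrative assumptions, not settings from this commit:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.time.Duration;

class MarkerClientSketch {

    // Fails fast if the server is unreachable; does not limit how long a response may take.
    static HttpClient newClient(Duration connectTimeout) {
        return HttpClient.newBuilder()
                .connectTimeout(connectTimeout)
                .build();
    }

    // Request without a per-request timeout; HttpRequest.Builder#timeout could add one.
    // The raw byte-array body is a placeholder: the real upload format is not shown here.
    static HttpRequest uploadRequest(String baseUrl, byte[] body) {
        return HttpRequest.newBuilder(URI.create(baseUrl + "/marker/upload"))
                .POST(HttpRequest.BodyPublishers.ofByteArray(body))
                .build();
    }

    public static void main(String[] args) {
        HttpClient client = newClient(Duration.ofSeconds(5));
        HttpRequest request = uploadRequest("http://localhost:8000", new byte[0]);
        System.out.println(client.connectTimeout());
        System.out.println(request.timeout().isPresent());
    }
}
```

A client built this way could be passed to `JdkClientHttpRequestFactory` exactly as in the bean above; whether a connect timeout is desirable depends on how the Marker server is deployed.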
@@ -20,7 +20,9 @@ public class SecurityConfig {
    @Bean
    public SecurityFilterChain filterChain(HttpSecurity http) throws Exception {
        http
-            .authorizeHttpRequests(auth -> auth.anyRequest().authenticated())
+            .authorizeHttpRequests(auth -> auth
+                .requestMatchers("/api/v1/figures/**").permitAll()
+                .anyRequest().authenticated())
            .httpBasic(Customizer.withDefaults())
            .csrf(AbstractHttpConfigurer::disable);
        return http.build();
@@ -1,43 +1,43 @@
 package com.aiteacher.document;

 import com.aiteacher.figure.FigureStorageService;
-import org.apache.pdfbox.Loader;
-import org.apache.pdfbox.cos.COSName;
-import org.apache.pdfbox.pdmodel.PDDocument;
-import org.apache.pdfbox.pdmodel.PDPage;
-import org.apache.pdfbox.pdmodel.graphics.PDXObject;
-import org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject;
 import org.slf4j.Logger;
 import org.slf4j.LoggerFactory;
 import org.springframework.beans.factory.annotation.Value;
 import org.springframework.stereotype.Service;

+import javax.imageio.ImageIO;
 import java.awt.image.BufferedImage;
+import java.io.ByteArrayInputStream;
 import java.io.IOException;
-import java.nio.file.Path;
 import java.util.ArrayList;
+import java.util.HashMap;
 import java.util.List;
+import java.util.Map;
 import java.util.UUID;
 import java.util.regex.Matcher;
 import java.util.regex.Pattern;

 /**
- * Extracts images from each PDF page using PDFBox.
- * Images below the configured minimum size are skipped.
- * Caption is detected by the "Fig." pattern in page text.
+ * Extracts figure images from {@link PageResult.FigureData} entries produced by
+ * {@link MarkerPageParser}.
+ *
+ * <p>Marker returns pre-cropped PNG bytes for each detected figure, so no PDFBox
+ * page rendering or bounding-box cropping is needed. This service:
+ * <ol>
+ *   <li>Decodes the PNG bytes to check dimensions (skip images below min size)</li>
+ *   <li>Classifies the figure type from caption and surrounding text keywords</li>
+ *   <li>Persists the image via {@link FigureStorageService}</li>
+ *   <li>Persists a {@link FigureEntity} to the database</li>
+ * </ol>
 */
@Service
 public class FigureExtractionService {

    private static final Logger log = LoggerFactory.getLogger(FigureExtractionService.class);

-    // Caption: line starting with "Fig." or "Figure" followed by a number
-    private static final Pattern CAPTION_PATTERN =
-        Pattern.compile("(?m)^(Fig\\.?\\s*\\d+[\\-.]?\\d*[^\\n]*)", Pattern.CASE_INSENSITIVE);
-
-    // Figure label: "Fig. 12-4" or "Fig. 12.4"
    private static final Pattern LABEL_PATTERN =
-        Pattern.compile("(?i)Fig\\.?\\s*(\\d+[\\-.\\d]*)");
+            Pattern.compile("(?i)Fig\\.?\\s*(\\d+[\\-.\\d]*)");

    private final FigureStorageService storageService;
    private final FigureRepository figureRepository;
@@ -52,65 +52,77 @@ public class FigureExtractionService {
        this.minImageSizePx = minImageSizePx;
    }

+    /** Holds the extraction output: persisted figures and a Marker blockId → DB figureId map. */
+    public record ExtractionResult(List<FigureEntity> figures, Map<String, String> blockIdToFigureId) {}
+
    /**
-     * Extracts all qualifying images from the PDF for the given book.
-     * Returns persisted FigureEntity list (without vision descriptions — set later).
+     * Extracts and persists figures for all pages described by {@code pageResults}.
+     *
+     * @param bookId      owning book
+     * @param chapterId   chapter bucket for these sections
+     * @param pageResults Marker parse output — each entry's {@code figures} list
+     *                    carries pre-cropped PNG bytes for that page
+     * @return {@link ExtractionResult} with persisted figures and blockId→figureId map
+     *         (used to resolve image {@code src} placeholders in the per-page HTML)
     */
-    public List<FigureEntity> extract(UUID bookId, String chapterId,
-                                      List<SectionEntity> sections, Path pdfPath) {
+    public ExtractionResult extract(UUID bookId, String chapterId,
+                                    List<PageResult> pageResults) {
        List<FigureEntity> figures = new ArrayList<>();
+        Map<String, String> blockIdToFigureId = new HashMap<>();
        int figureCounter = 0;

-        try (PDDocument doc = Loader.loadPDF(pdfPath.toFile())) {
-            for (SectionEntity section : sections) {
-                int pageIndex = section.getPageStart() - 1; // 0-based
-                if (pageIndex < 0 || pageIndex >= doc.getNumberOfPages()) continue;
-
-                PDPage page = doc.getPage(pageIndex);
-                String pageText = section.getFullText();
+        for (PageResult page : pageResults) {
+            if (page.figures().isEmpty()) continue;

+            for (PageResult.FigureData figureData : page.figures()) {
                try {
-                    for (COSName name : page.getResources().getXObjectNames()) {
-                        PDXObject xObject = page.getResources().getXObject(name);
-                        if (!(xObject instanceof PDImageXObject image)) continue;
-
-                        BufferedImage bufferedImage = image.getImage();
-                        if (bufferedImage.getWidth() < minImageSizePx
-                                || bufferedImage.getHeight() < minImageSizePx) {
-                            continue; // skip decorative images
-                        }
-
-                        figureCounter++;
-                        String figureId = bookId + "-fig-" + pageIndex + "-" + figureCounter;
-                        String caption = detectCaption(pageText);
-                        String label = detectLabel(caption, figureCounter);
-                        FigureType type = classifyType(caption, pageText);
-
-                        String imagePath = storageService.save(bookId, figureId, bufferedImage);
-
-                        FigureEntity figure = new FigureEntity(
-                            figureId, bookId, section.getId(), chapterId,
-                            label, caption, type, section.getPageStart(), imagePath
-                        );
-                        figures.add(figureRepository.save(figure));
+                    BufferedImage image = decodeImage(figureData.imageBytes());
+                    if (image == null) {
+                        log.debug("Could not decode image on page {} of book {} (block {})",
+                                page.pageNumber(), bookId, figureData.blockId());
+                        continue;
                    }
-                } catch (IOException ex) {
-                    log.warn("Failed to extract images from page {} of book {}: {}",
-                        section.getPageStart(), bookId, ex.getMessage());
+                    if (image.getWidth() < minImageSizePx || image.getHeight() < minImageSizePx) {
+                        log.debug("Skipping small figure on page {} ({}×{})",
+                                page.pageNumber(), image.getWidth(), image.getHeight());
+                        continue;
+                    }
+
+                    figureCounter++;
+                    String figureId = bookId + "-fig-" + page.pageNumber() + "-" + figureCounter;
+                    String caption = figureData.nearestCaption();
+                    String label = detectLabel(caption, figureCounter);
+                    FigureType type = classifyType(caption, page.orderedText());
+
+                    String sectionId = bookId + "-p" + page.pageNumber();
+                    String imagePath = storageService.save(bookId, figureId, image);
+
+                    FigureEntity figure = new FigureEntity(
+                            figureId, bookId, sectionId, chapterId,
+                            label, caption, type, page.pageNumber(), imagePath);
+                    figures.add(figureRepository.save(figure));
+                    blockIdToFigureId.put(figureData.blockId(), figureId);
+
+                } catch (Exception ex) {
+                    log.warn("Failed to extract figure on page {} of book {}: {}",
+                            page.pageNumber(), bookId, ex.getMessage());
                }
            }
-        } catch (IOException ex) {
-            log.error("Could not open PDF for image extraction, book {}", bookId, ex);
        }

        log.info("Extracted {} figures for book {}", figures.size(), bookId);
-        return figures;
+        return new ExtractionResult(figures, blockIdToFigureId);
    }

-    private String detectCaption(String pageText) {
-        if (pageText == null) return null;
-        Matcher m = CAPTION_PATTERN.matcher(pageText);
-        return m.find() ? m.group(1).trim() : null;
+    // --- Private helpers ---
+
+    private BufferedImage decodeImage(byte[] imageBytes) {
+        if (imageBytes == null || imageBytes.length == 0) return null;
+        try {
+            return ImageIO.read(new ByteArrayInputStream(imageBytes));
+        } catch (IOException ex) {
+            return null;
+        }
    }

    private String detectLabel(String caption, int counter) {
@@ -122,14 +134,18 @@ public class FigureExtractionService {
    }

    private FigureType classifyType(String caption, String pageText) {
-        String combined = ((caption != null ? caption : "") + " " + (pageText != null ? pageText : "")).toLowerCase();
+        String combined = ((caption != null ? caption : "") + " " +
+                           (pageText != null ? pageText : "")).toLowerCase();
        if (combined.contains("mri") || combined.contains("ct ") || combined.contains("magnetic")
-                || combined.contains("tomography")) return FigureType.MRI_CT_SCAN;
-        if (combined.contains("intraoperative") || combined.contains("intra-op")) return FigureType.INTRAOPERATIVE_IMAGE;
-        if (caption != null && caption.toLowerCase().startsWith("table")) return FigureType.TABLE;
+                || combined.contains("tomography"))    return FigureType.MRI_CT_SCAN;
+        if (combined.contains("intraoperative") || combined.contains("intra-op"))
+                                                       return FigureType.INTRAOPERATIVE_IMAGE;
+        if (caption != null && caption.toLowerCase().startsWith("table"))
+                                                       return FigureType.TABLE;
        if (combined.contains("chart") || combined.contains("histogram") || combined.contains("graph"))
-            return FigureType.CHART;
-        if (combined.contains("photograph") || combined.contains("photo")) return FigureType.SURGICAL_PHOTOGRAPH;
+                                                       return FigureType.CHART;
+        if (combined.contains("photograph") || combined.contains("photo"))
+                                                       return FigureType.SURGICAL_PHOTOGRAPH;
        return FigureType.ANATOMICAL_DIAGRAM;
    }
 }
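The decode-then-filter step in `extract` above can be sketched in isolation. This is a minimal illustration, not the service itself: the class name `FigureFilterSketch`, the `png` helper, and the threshold of 50 pixels are assumptions made for the example (the real `minImageSizePx` is configuration that this diff does not show). It relies on `ImageIO.read` returning `null` for bytes that no registered reader recognises.

```java
import java.awt.image.BufferedImage;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import javax.imageio.ImageIO;

final class FigureFilterSketch {

    // Assumed value for illustration; the service reads minImageSizePx from configuration.
    static final int MIN_IMAGE_SIZE_PX = 50;

    static {
        // Keep the sketch independent of a writable temp directory.
        ImageIO.setUseCache(false);
    }

    /** Returns the decoded image, or null when the bytes are missing or not a readable image. */
    static BufferedImage decode(byte[] imageBytes) {
        if (imageBytes == null || imageBytes.length == 0) return null;
        try {
            return ImageIO.read(new ByteArrayInputStream(imageBytes));
        } catch (IOException ex) {
            return null;
        }
    }

    /** True when the bytes decode to an image at least MIN_IMAGE_SIZE_PX wide and high. */
    static boolean keep(byte[] imageBytes) {
        BufferedImage image = decode(imageBytes);
        return image != null
                && image.getWidth() >= MIN_IMAGE_SIZE_PX
                && image.getHeight() >= MIN_IMAGE_SIZE_PX;
    }

    /** Hypothetical helper: encodes a blank image as PNG so the filter can be exercised. */
    static byte[] png(int width, int height) {
        try {
            BufferedImage image = new BufferedImage(width, height, BufferedImage.TYPE_INT_RGB);
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            ImageIO.write(image, "png", out);
            return out.toByteArray();
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
    }
}
```

Under these assumptions a 100×100 image is kept, while a 10×10 image, undecodable bytes, and `null` are all skipped, matching the two `continue` branches in the loop.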
@@ -0,0 +1,14 @@
+package com.aiteacher.document;
+
+import java.util.UUID;
+
+public interface MarkdownStorageService {
+    /** Uploads the page content (the S3 implementation stores resolved page HTML) and returns the object key. */
+    String save(UUID bookId, int pageNumber, String markdown);
+
+    /** Downloads and returns the markdown content for the given book and page. */
+    String getText(UUID bookId, int pageNumber);
+
+    /** Deletes all markdown files for the given book. */
+    void deleteAll(UUID bookId);
+}
@@ -0,0 +1,287 @@
+package com.aiteacher.document;
+
+import tools.jackson.databind.JsonNode;
+import tools.jackson.databind.ObjectMapper;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+import org.springframework.beans.factory.annotation.Qualifier;
+import org.springframework.core.io.FileSystemResource;
+import org.springframework.http.MediaType;
+import org.springframework.stereotype.Service;
+import org.springframework.util.LinkedMultiValueMap;
+import org.springframework.util.MultiValueMap;
+import org.springframework.web.client.RestClient;
+
+import java.io.IOException;
+import java.nio.file.Files;
+import java.nio.file.Path;
+import java.util.*;
+
+/**
+ * Parses a PDF with a single call to the Marker server using {@code output_format=json}.
+ *
+ * <p>The JSON response contains an {@code output} field that is itself a JSON string with a
+ * tree structure: the root has a {@code children} array where each item is a {@code Page} block.
+ * Each block carries an {@code html} field with {@code <content-ref src='blockId'>} placeholders
+ * that reference its {@code children} by ID.
+ *
+ * <p>{@link #jsonToHtml} mirrors the Marker Python {@code json_to_html} utility: it walks the
+ * tree recursively and resolves every {@code content-ref} with the rendered HTML of the
+ * referenced child block.
+ *
+ * <p>Returns a {@link ParsedBook} with:
+ * <ul>
+ *   <li>{@code pages} — one {@link PageResult} per non-empty page (drives embeddings)</li>
+ *   <li>{@code htmlByPage} — full resolved HTML per page (saved to S3 for the reader)</li>
+ * </ul>
+ */
+@Service
+public class MarkerPageParser {
+
+    private static final Logger log = LoggerFactory.getLogger(MarkerPageParser.class);
+
+    private static final Set<String> TEXT_BLOCK_TYPES = Set.of(
+            "Text", "TextInlineMath", "ListItem", "Table", "TableOfContents", "Code", "Equation",
+            "Footnote", "Caption", "PageHeader", "PageFooter", "Handwriting"
+    );
+    private static final Set<String> FIGURE_BLOCK_TYPES = Set.of("Figure", "Picture", "FigureGroup", "PictureGroup");
+
+    private static final ObjectMapper MAPPER = new ObjectMapper();
+
+    private final RestClient restClient;
+
+    public MarkerPageParser(@Qualifier("markerRestClient") RestClient restClient) {
+        this.restClient = restClient;
+    }
+
+    public ParsedBook parse(Path pdfPath) {
+        log.info("Submitting {} to Marker (json)", pdfPath.getFileName());
+
+        MultiValueMap<String, Object> body = new LinkedMultiValueMap<>();
+        body.add("file", new FileSystemResource(pdfPath));
+        body.add("output_format", "json");
+
+        JsonNode response = restClient.post()
+                .uri("/marker/upload")
+                .contentType(MediaType.MULTIPART_FORM_DATA)
+                .body(body)
+                .retrieve()
+                .body(JsonNode.class);
+
+        try {
+            if (response != null) Files.writeString(Path.of("/tmp/marker-response-json.json"), response.toPrettyString());
+        } catch (IOException e) {
+            log.warn("Could not save Marker response to /tmp/marker-response-json.json", e);
+        }
+
+        List<JsonNode> pageNodes = extractPages(response);
+        if (pageNodes.isEmpty()) {
+            log.warn("Marker returned no pages for {}", pdfPath.getFileName());
+            return new ParsedBook(List.of(), Map.of());
+        }
+        log.info("Marker returned {} pages for {}", pageNodes.size(), pdfPath.getFileName());
+
+        List<PageResult> pages = new ArrayList<>();
+        Map<Integer, String> htmlByPage = new LinkedHashMap<>();
+
+        for (int i = 0; i < pageNodes.size(); i++) {
+            JsonNode pageNode = pageNodes.get(i);
+            int pageNumber = i + 1; // 1-based
+
+            PageResult result = buildPageResult(pageNode, pageNumber);
+            String html = jsonToHtml(pageNode);
+
+            if (!result.orderedText().isBlank() || !result.figures().isEmpty()) {
+                pages.add(result);
+                htmlByPage.put(pageNumber, html);
+            }
+        }
+
+        log.info("Marker produced {} non-empty pages from {}", pages.size(), pdfPath.getFileName());
+        return new ParsedBook(pages, htmlByPage);
+    }
+
+    // ── Page extraction ───────────────────────────────────────────────────────
+
+    /**
+     * Parses the {@code output} JSON string and returns the list of page nodes
+     * (the top-level {@code children} of the document root).
+     */
+    private List<JsonNode> extractPages(JsonNode response) {
+        if (response == null) return List.of();
+        JsonNode outputNode = response.path("output");
+        if (outputNode.isMissingNode()) {
+            log.warn("Marker response has no 'output' field");
+            return List.of();
+        }
+        try {
+            JsonNode root = MAPPER.readTree(outputNode.stringValue());
+            JsonNode children = root.path("children");
+            if (children.isMissingNode() || !children.isArray()) {
+                log.warn("Marker output root has no 'children' array");
+                return List.of();
+            }
+            List<JsonNode> result = new ArrayList<>();
+            children.forEach(result::add);
+            return result;
+        } catch (Exception e) {
+            log.warn("Could not parse Marker 'output' string as JSON: {}", e.getMessage());
+            return List.of();
+        }
+    }
+
+    // ── HTML rendering ────────────────────────────────────────────────────────
+
+    /**
+     * Java equivalent of the Marker Python {@code json_to_html} utility.
+     *
+     * <p>Algorithm:
+     * <ol>
+     *   <li>If the block has no children, return its {@code html} as-is (leaf node).</li>
+     *   <li>Otherwise recursively render each child, then replace every
+     *       {@code <content-ref src='childId'>} placeholder in the block's own {@code html}
+     *       with the rendered child HTML.</li>
+     * </ol>
+     */
+    String jsonToHtml(JsonNode block) {
+        String html = str(block.path("html"));
+
+        // If the block carries image data, inject <img> data-URI tags.
+        // Marker stores base64 image bytes in block.images keyed by block ID.
+        // Picture/Figure leaf blocks have empty html, so this is the only way to
+        // get the image into the rendered output.
+        JsonNode images = block.path("images");
+        if (!images.isMissingNode() && !images.isNull() && !images.isEmpty()) {
+            StringBuilder imgTags = new StringBuilder();
+            images.properties().forEach(entry -> {
+                String base64 = str(entry.getValue());
+                if (!base64.isEmpty()) {
+                    String mime = detectImageMime(base64);
+                    imgTags.append("<img src=\"data:").append(mime)
+                           .append(";base64,").append(base64).append("\">");
+                }
+            });
+            if (!imgTags.isEmpty()) {
+                html = html + imgTags;
+            }
+        }
+
+        JsonNode children = block.path("children");
+        if (children.isMissingNode() || children.isNull() || !children.isArray() || children.isEmpty()) {
+            return html; // leaf node
+        }
+
+        // Build id → rendered-html map for all direct children
+        Map<String, String> childHtml = new LinkedHashMap<>();
+        for (JsonNode child : children) {
+            String id = str(child.path("id"));
+            childHtml.put(id, jsonToHtml(child));
+        }
+
+        // Replace every <content-ref src='id'></content-ref> with the child's HTML
+        for (Map.Entry<String, String> entry : childHtml.entrySet()) {
+            String ref = "<content-ref src='" + entry.getKey() + "'></content-ref>";
+            html = html.replace(ref, entry.getValue());
+        }
+
+        return html;
+    }
+
+    // ── PageResult (text + figures for embeddings) ────────────────────────────
+
+    private PageResult buildPageResult(JsonNode pageBlock, int pageNumber) {
+        StringBuilder text = new StringBuilder();
+        String[] headingTitle = {null};
+        List<PageResult.FigureData> figures = new ArrayList<>();
+
+        walkBlock(pageBlock, text, headingTitle, figures);
+        return new PageResult(pageNumber, text.toString().strip(), headingTitle[0], figures);
+    }
+
+    /** Recursively walks the block tree, collecting text and figures in reading order. */
+    private void walkBlock(JsonNode block, StringBuilder text, String[] headingTitle,
+                           List<PageResult.FigureData> figures) {
+        String type = str(block.path("block_type"));
+
+        if ("SectionHeader".equals(type)) {
+            String heading = stripHtml(str(block.path("html"))).strip();
+            if (!heading.isEmpty() && headingTitle[0] == null) headingTitle[0] = heading;
+            appendText(text, heading);
+
+        } else if (TEXT_BLOCK_TYPES.contains(type)) {
+            appendText(text, stripHtml(str(block.path("html"))));
+
+        } else if (FIGURE_BLOCK_TYPES.contains(type)) {
+            String caption = findCaption(block);
+            extractFigures(block, caption, figures);
+        }
+
+        // Recurse into children (content-ref ordering is implicit via tree order)
+        JsonNode children = block.path("children");
+        if (!children.isMissingNode() && !children.isNull() && children.isArray()) {
+            for (JsonNode child : children) {
+                walkBlock(child, text, headingTitle, figures);
+            }
+        }
+    }
+
+    /** Finds the first Caption child inside a figure block, if any. */
+    private String findCaption(JsonNode figureBlock) {
+        JsonNode children = figureBlock.path("children");
+        if (children.isMissingNode() || !children.isArray()) return null;
+        for (JsonNode child : children) {
+            if ("Caption".equals(str(child.path("block_type")))) {
+                String caption = stripHtml(str(child.path("html"))).strip();
+                return caption.isEmpty() ? null : caption;
+            }
+        }
+        return null;
+    }
+
+    private void extractFigures(JsonNode block, String caption, List<PageResult.FigureData> out) {
+        JsonNode images = block.path("images");
+        if (images.isMissingNode() || images.isEmpty()) return;
+
+        images.properties().forEach(entry -> {
+            String blockId = entry.getKey();
+            String base64 = str(entry.getValue());
+            if (base64.isEmpty()) return;
+            try {
+                byte[] bytes = Base64.getDecoder().decode(base64);
+                out.add(new PageResult.FigureData(bytes, caption, blockId));
+            } catch (IllegalArgumentException ex) {
+                log.warn("Could not decode base64 image for block {}: {}", blockId, ex.getMessage());
+            }
+        });
+    }
+
+    // ── Utilities ─────────────────────────────────────────────────────────────
+
+    private void appendText(StringBuilder sb, String text) {
+        if (text == null) return;
+        String stripped = text.strip();
+        if (stripped.isEmpty()) return;
+        if (sb.length() > 0) sb.append("\n\n");
+        sb.append(stripped);
+    }
+
+    private String stripHtml(String html) {
+        if (html == null || html.isEmpty()) return "";
+        return html.replaceAll("<[^>]*>", "").replaceAll("\\s{2,}", " ").strip();
+    }
+
+    /** Detects MIME type from the first characters of a base64-encoded image. */
+    private static String detectImageMime(String base64) {
+        if (base64.startsWith("/9j/"))   return "image/jpeg";
+        if (base64.startsWith("iVBOR"))  return "image/png";
+        if (base64.startsWith("R0lGO"))  return "image/gif";
+        if (base64.startsWith("UklGR"))  return "image/webp";
+        return "image/png"; // safe fallback
+    }
+
+    /** Null-safe string extraction from a JsonNode (Jackson 3: stringValue() returns null for non-strings). */
+    private static String str(JsonNode node) {
+        String v = node.stringValue();
+        return v != null ? v : "";
+    }
+}
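The `content-ref` resolution performed by `jsonToHtml` can be sketched without Jackson. This is a simplified illustration under stated assumptions: the `Block` record is a hypothetical stand-in for a Marker JSON node (only `id`, `html`, and `children`), and the image-injection step is left out.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class ContentRefSketch {

    /** Hypothetical stand-in for a Marker block: its id, its own html, and its child blocks. */
    record Block(String id, String html, List<Block> children) {}

    /** Renders a block by substituting each child's rendered HTML for its placeholder. */
    static String render(Block block) {
        String html = block.html();
        if (block.children().isEmpty()) {
            return html; // leaf node: nothing to resolve
        }

        // Render children first, keyed by id, in tree order.
        Map<String, String> childHtml = new LinkedHashMap<>();
        for (Block child : block.children()) {
            childHtml.put(child.id(), render(child));
        }

        // Replace each placeholder with the rendered child HTML.
        for (Map.Entry<String, String> entry : childHtml.entrySet()) {
            String ref = "<content-ref src='" + entry.getKey() + "'></content-ref>";
            html = html.replace(ref, entry.getValue());
        }
        return html;
    }
}
```

For a page whose `html` holds two placeholders for a header block `<h1>Title</h1>` and a text block `<p>Hello</p>`, `render` returns `<h1>Title</h1><p>Hello</p>`. As in the parser, only the exact single-quoted placeholder form is matched; a placeholder written with double quotes would be left untouched.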
@@ -0,0 +1,25 @@
+package com.aiteacher.document;
+
+import java.util.List;
+
+/**
+ * Internal DTO produced by MarkerPageParser for one PDF page.
+ * Decouples the Marker HTTP API from downstream services.
+ */
+public record PageResult(
+        int pageNumber,           // 1-based, derived from Marker page block index
+        String orderedText,       // full page text in correct reading order (blocks joined by \n\n)
+        String headingTitle,      // first SectionHeader block on page, or null
+        List<FigureData> figures  // extracted figure images (may be empty)
+) {
+
+    /**
+     * A figure extracted from the page.
+     * Image bytes are base64-decoded from the Marker JSON {@code images} map (typically PNG).
+     */
+    public record FigureData(
+            byte[] imageBytes,       // raw image data (base64-decoded from Marker response, typically PNG)
+            String nearestCaption,   // text of the first Caption child of the figure block, or null
+            String blockId           // Marker block ID (e.g. "/page/0/Figure/2") for traceability
+    ) {}
+}
@@ -0,0 +1,16 @@
+package com.aiteacher.document;
+
+import java.util.List;
+import java.util.Map;
+
+/**
+ * Result of a full Marker parse: structured page data plus the resolved
+ * per-page HTML, both derived from the single {@code output_format=json} call.
+ *
+ * @param pages       one entry per non-empty page (text and figures, used for embeddings)
+ * @param htmlByPage  fully resolved page HTML keyed by 1-based page number
+ */
+public record ParsedBook(
+        List<PageResult> pages,
+        Map<Integer, String> htmlByPage
+) {}
@@ -1,13 +1,17 @@
 package com.aiteacher.document;

+import org.apache.pdfbox.Loader;
+import org.apache.pdfbox.pdmodel.PDDocument;
+import org.apache.pdfbox.pdmodel.PDPage;
+import org.apache.pdfbox.pdmodel.common.PDRectangle;
+import org.apache.pdfbox.text.PDFTextStripperByArea;
 import org.slf4j.Logger;
 import org.slf4j.LoggerFactory;
-import org.springframework.ai.reader.pdf.PagePdfDocumentReader;
-import org.springframework.ai.reader.pdf.config.PdfDocumentReaderConfig;
-import org.springframework.core.io.FileSystemResource;
 import org.springframework.stereotype.Service;
 import org.springframework.transaction.annotation.Transactional;

+import java.awt.Rectangle;
+import java.io.IOException;
 import java.nio.file.Path;
 import java.util.ArrayList;
 import java.util.List;
@@ -15,13 +19,18 @@ import java.util.UUID;

 /**
 * Parses a PDF into page-level SectionEntity records stored in Postgres.
- * Each page becomes one section, grouped under a single chapter per book.
+ * Uses column-aware extraction via PDFTextStripperByArea: for two-column pages,
+ * left column is extracted first then right, preserving correct reading order.
+ * Text is also normalized (collapsed whitespace) before storage.
 */
@Service
 public class PdfStructureParser {

    private static final Logger log = LoggerFactory.getLogger(PdfStructureParser.class);

+    // Right column is considered empty (single-column page) if it has < 20% of left column's content
+    private static final double TWO_COLUMN_THRESHOLD = 0.2;
+
    private final ChapterRepository chapterRepository;
    private final SectionRepository sectionRepository;

@@ -35,37 +44,71 @@ public class PdfStructureParser {
    public List<SectionEntity> parse(UUID bookId, String bookTitle, Path pdfPath) {
        log.info("Parsing PDF structure for book {}", bookId);

-        // One chapter per book
        String chapterId = bookId + "-ch1";
        ChapterEntity chapter = new ChapterEntity(chapterId, bookId, 1, bookTitle, 1);
        chapterRepository.save(chapter);

-        // One section per page
-        PagePdfDocumentReader reader = new PagePdfDocumentReader(
-            new FileSystemResource(pdfPath.toFile()),
-            PdfDocumentReaderConfig.builder().withPagesPerDocument(1).build()
-        );
-
-        List<org.springframework.ai.document.Document> pages = reader.get();
        List<SectionEntity> sections = new ArrayList<>();

-        for (int i = 0; i < pages.size(); i++) {
-            int pageNum = i + 1;
-            String text = pages.get(i).getText();
-            if (text == null || text.isBlank()) continue;
+        try (PDDocument doc = Loader.loadPDF(pdfPath.toFile())) {
+            List<PDPage> pages = new ArrayList<>();
+            doc.getPages().forEach(pages::add);

-            String sectionId = bookId + "-p" + pageNum;
-            SectionEntity section = new SectionEntity(
-                sectionId, chapterId, bookId,
-                String.valueOf(pageNum),
-                "Page " + pageNum,
-                pageNum, pageNum,
-                text
-            );
-            sections.add(sectionRepository.save(section));
+            for (int i = 0; i < pages.size(); i++) {
+                int pageNum = i + 1;
+                String text = normalizeWhitespace(extractPageText(pages.get(i)));
+                if (text.isBlank()) continue;
+
+                String sectionId = bookId + "-p" + pageNum;
+                SectionEntity section = new SectionEntity(
+                    sectionId, chapterId, bookId,
+                    String.valueOf(pageNum),
+                    "Page " + pageNum,
+                    pageNum, pageNum,
+                    text
+                );
+                sections.add(sectionRepository.save(section));
+            }
+        } catch (IOException e) {
+            throw new RuntimeException("Failed to parse PDF for book " + bookId, e);
        }

        log.info("Parsed {} sections for book {}", sections.size(), bookId);
        return sections;
    }
+
+    /**
+     * Extracts text from a single page using column-aware region extraction.
+     * Splits the page at the horizontal midpoint. If the right region has fewer
+     * than 20% of the characters of the left region, treats the page as single-column.
+     */
+    private String extractPageText(PDPage page) throws IOException {
+        PDRectangle mediaBox = page.getMediaBox();
+        int width  = (int) mediaBox.getWidth();
+        int height = (int) mediaBox.getHeight();
+        int mid    = width / 2;
+
+        PDFTextStripperByArea stripper = new PDFTextStripperByArea();
+        stripper.setSortByPosition(true);
+        stripper.addRegion("left",  new Rectangle(0,   0, mid,         height));
+        stripper.addRegion("right", new Rectangle(mid, 0, width - mid, height));
+        stripper.extractRegions(page);
+
+        String left  = stripper.getTextForRegion("left").strip();
+        String right = stripper.getTextForRegion("right").strip();
+
+        if (right.length() < left.length() * TWO_COLUMN_THRESHOLD) {
+            // Treated as single-column: only the left region is kept, so any sparse right-region text is dropped.
+            return left; // left is never empty here, because right.length() < 0 cannot hold
+        }
+        return left + "\n\n" + right;
+    }
+
+    /** Collapses multi-space/tab runs and excessive blank lines. */
+    private String normalizeWhitespace(String text) {
+        return text
+            .replaceAll("[ \t]{2,}", " ")
+            .replaceAll("\n{3,}", "\n\n")
+            .trim();
+    }
 }
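The column decision and whitespace normalisation above are pure string logic, so they can be tried without PDFBox. A minimal sketch, assuming the two region strings have already been extracted; the class name `ColumnMergeSketch` and the method name `mergeColumns` are hypothetical, while the threshold and regular expressions are copied from the diff.

```java
final class ColumnMergeSketch {

    // Same threshold as in the diff: right region under 20% of the left counts as empty.
    static final double TWO_COLUMN_THRESHOLD = 0.2;

    /** Joins left and right region text, or keeps only the left when the right is sparse. */
    static String mergeColumns(String leftRegion, String rightRegion) {
        String left = leftRegion.strip();
        String right = rightRegion.strip();
        if (right.length() < left.length() * TWO_COLUMN_THRESHOLD) {
            return left; // single-column case: sparse right-region text is discarded
        }
        return left + "\n\n" + right;
    }

    /** Collapses runs of spaces/tabs to one space and 3+ newlines to a blank line. */
    static String normalizeWhitespace(String text) {
        return text
                .replaceAll("[ \t]{2,}", " ")
                .replaceAll("\n{3,}", "\n\n")
                .trim();
    }
}
```

With a ten-character left region and a one-character right region the right text is dropped, since 1 is less than 10 × 0.2. Two regions of similar length are joined with a blank line between them.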
@@ -0,0 +1,97 @@
+package com.aiteacher.document;
+
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+import org.springframework.beans.factory.annotation.Value;
+import org.springframework.stereotype.Service;
+import software.amazon.awssdk.auth.credentials.AwsBasicCredentials;
+import software.amazon.awssdk.auth.credentials.StaticCredentialsProvider;
+import software.amazon.awssdk.core.sync.RequestBody;
+import software.amazon.awssdk.regions.Region;
+import software.amazon.awssdk.services.s3.S3Client;
+import software.amazon.awssdk.services.s3.S3Configuration;
+import software.amazon.awssdk.services.s3.model.*;
+
+import java.net.URI;
+import java.nio.charset.StandardCharsets;
+import java.util.ArrayList;
+import java.util.List;
+import java.util.UUID;
+
+@Service
+public class S3MarkdownStorageService implements MarkdownStorageService {
+
+    private static final Logger log = LoggerFactory.getLogger(S3MarkdownStorageService.class);
+
+    private final S3Client s3;
+    private final String bucket;
+
+    public S3MarkdownStorageService(
+            @Value("${app.figure-storage.endpoint}") String endpoint,
+            @Value("${app.figure-storage.region}") String region,
+            @Value("${app.figure-storage.bucket}") String bucket,
+            @Value("${app.figure-storage.access-key-id}") String accessKeyId,
+            @Value("${app.figure-storage.secret-access-key}") String secretKey) {
+        this.bucket = bucket;
+        URI endpointUri = URI.create(endpoint);
+        StaticCredentialsProvider credentials = StaticCredentialsProvider.create(
+                AwsBasicCredentials.create(accessKeyId, secretKey));
+        Region awsRegion = Region.of(region);
+        S3Configuration s3Config = S3Configuration.builder().pathStyleAccessEnabled(true).build();
+
+        this.s3 = S3Client.builder()
+                .endpointOverride(endpointUri)
+                .region(awsRegion)
+                .credentialsProvider(credentials)
+                .serviceConfiguration(s3Config)
+                .build();
+    }
+
+    @Override
+    public String save(UUID bookId, int pageNumber, String markdown) {
+        String key = key(bookId, pageNumber);
+        byte[] bytes = markdown.getBytes(StandardCharsets.UTF_8);
+        s3.putObject(
+                PutObjectRequest.builder().bucket(bucket).key(key)
+                        .contentType("text/html; charset=utf-8")
+                        .contentLength((long) bytes.length).build(),
+                RequestBody.fromBytes(bytes));
+        return key;
+    }
+
+    @Override
+    public String getText(UUID bookId, int pageNumber) {
+        byte[] bytes = s3.getObjectAsBytes(
+                GetObjectRequest.builder().bucket(bucket).key(key(bookId, pageNumber)).build()
+        ).asByteArray();
+        return new String(bytes, StandardCharsets.UTF_8);
+    }
+
+    @Override
+    public void deleteAll(UUID bookId) {
+        String prefix = "html/" + bookId + "/";
+        try {
+            List<ObjectIdentifier> toDelete = new ArrayList<>();
+            s3.listObjectsV2Paginator(ListObjectsV2Request.builder()
+                    .bucket(bucket).prefix(prefix).build()).stream()
+                    .flatMap(page -> page.contents().stream())
+                    .map(S3Object::key)
+                    .map(k -> ObjectIdentifier.builder().key(k).build())
+                    .forEach(toDelete::add);
+
+            if (toDelete.isEmpty()) return;
+
+            s3.deleteObjects(DeleteObjectsRequest.builder()
+                    .bucket(bucket)
+                    .delete(Delete.builder().objects(toDelete).build())
+                    .build());
+            log.info("Deleted {} markdown files from S3 for book {}", toDelete.size(), bookId);
+        } catch (S3Exception ex) {
+            log.warn("Could not fully delete markdown for book {} from S3: {}", bookId, ex.getMessage());
+        }
+    }
+
+    private static String key(UUID bookId, int pageNumber) {
+        return "html/" + bookId + "/page-" + pageNumber + ".html";
+    }
+}
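The object-key layout used by `save`, `getText`, and `deleteAll` can be shown on its own. A small sketch, with the class name `PageKeySketch` and the `prefix` helper introduced only for illustration; the key format itself is the one in the diff.

```java
import java.util.UUID;

final class PageKeySketch {

    /** Prefix shared by every page object of a book; deleteAll lists objects under it. */
    static String prefix(UUID bookId) {
        return "html/" + bookId + "/";
    }

    /** Key of a single page object, e.g. html/<bookId>/page-3.html. */
    static String key(UUID bookId, int pageNumber) {
        return prefix(bookId) + "page-" + pageNumber + ".html";
    }
}
```

Because every key for a book starts with the same prefix, a single prefix listing is enough to find all of that book's pages before a bulk delete.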
@@ -38,14 +38,52 @@ public class TextChunkingService {
        List<String> windows = new ArrayList<>();
        int start = 0;
        while (start < text.length()) {
-            int end = Math.min(start + TARGET_CHARS, text.length());
-            windows.add(text.substring(start, end));
-            if (end == text.length()) break;
-            start = end - OVERLAP_CHARS;
+            int hardEnd = Math.min(start + TARGET_CHARS, text.length());
+            if (hardEnd == text.length()) {
+                String last = text.substring(start).strip();
+                if (!last.isEmpty()) windows.add(last);
+                break;
+            }
+            int splitAt = findSplitPoint(text, start, hardEnd);
+            String chunk = text.substring(start, splitAt).strip();
+            if (!chunk.isEmpty()) windows.add(chunk);
+            // Overlap: back up from split point, align to a word start
+            int overlapStart = Math.max(start + 1, splitAt - OVERLAP_CHARS);
+            while (overlapStart < splitAt && text.charAt(overlapStart) != ' ') overlapStart++;
+            start = overlapStart < splitAt ? overlapStart + 1 : splitAt;
        }
        return windows;
    }

+    /**
+     * Finds the best split point at or before hardEnd, preferring (in order):
+     * paragraph boundary, sentence boundary, word boundary, hard cut.
+     */
+    private int findSplitPoint(String text, int start, int hardEnd) {
+        int lookback = Math.min(400, (hardEnd - start) / 2);
+
+        // 1. Paragraph boundary
+        int paraIdx = text.lastIndexOf("\n\n", hardEnd);
+        if (paraIdx > hardEnd - lookback && paraIdx > start) return paraIdx + 2;
+
+        // 2. Sentence boundary (. ! ?) followed by space or newline
+        for (int i = hardEnd - 1; i > hardEnd - lookback && i > start; i--) {
+            char c = text.charAt(i);
+            if ((c == '.' || c == '!' || c == '?') && i + 1 < text.length()) {
+                char next = text.charAt(i + 1);
+                if (next == ' ' || next == '\n') return i + 1;
+            }
+        }
+
+        // 3. Word boundary
+        for (int i = hardEnd - 1; i > hardEnd - 100 && i > start; i--) {
+            if (text.charAt(i) == ' ') return i + 1;
+        }
+
+        // 4. Hard cut
+        return hardEnd;
+    }
+
    private Map<String, Object> buildMetadata(SectionEntity section, String bookTitle,
                                               int index, int total, String chunkId) {
        Map<String, Object> m = new HashMap<>();
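The split-point preference in `findSplitPoint` (paragraph, then sentence, then word, then hard cut) can be exercised on short strings. The sketch below copies the method from the diff into a standalone class; the class name `SplitPointSketch` is hypothetical, and the inputs in the note that follows are made up for illustration.

```java
final class SplitPointSketch {

    /** Picks a split index, preferring paragraph, sentence, then word boundaries. */
    static int findSplitPoint(String text, int start, int hardEnd) {
        // Only the second half of the window (at most 400 chars) is searched.
        int lookback = Math.min(400, (hardEnd - start) / 2);

        // 1. Paragraph boundary: split just after the blank line.
        int paraIdx = text.lastIndexOf("\n\n", hardEnd);
        if (paraIdx > hardEnd - lookback && paraIdx > start) return paraIdx + 2;

        // 2. Sentence boundary: . ! ? followed by a space or newline.
        for (int i = hardEnd - 1; i > hardEnd - lookback && i > start; i--) {
            char c = text.charAt(i);
            if ((c == '.' || c == '!' || c == '?') && i + 1 < text.length()) {
                char next = text.charAt(i + 1);
                if (next == ' ' || next == '\n') return i + 1;
            }
        }

        // 3. Word boundary within the last 100 characters.
        for (int i = hardEnd - 1; i > hardEnd - 100 && i > start; i--) {
            if (text.charAt(i) == ' ') return i + 1;
        }

        // 4. Hard cut.
        return hardEnd;
    }
}
```

For `"Alpha beta gamma. Delta"` with `hardEnd = 20` the split lands at index 17, right after the full stop. For `"Intro line\n\nNext paragraph here"` with `hardEnd = 16` it lands at index 12, the start of the second paragraph. A string with no spaces or punctuation falls through to the hard cut.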
@@ -3,25 +3,34 @@ package com.aiteacher.document;
 import org.slf4j.Logger;
 import org.slf4j.LoggerFactory;
 import org.springframework.ai.chat.client.ChatClient;
-import org.springframework.core.io.FileSystemResource;
+import org.springframework.core.io.ByteArrayResource;
 import org.springframework.stereotype.Service;
 import org.springframework.util.MimeTypeUtils;

-import java.nio.file.Path;
-
 /**
- * Generates a clinical text description for an extracted figure image
- * using the OpenAI vision model via Spring AI ChatClient.
+ * Analyses an extracted figure image using the OpenAI vision model.
+ *
+ * <p>Returns an {@link ImageAnalysis} record containing:
+ * <ul>
+ *   <li>{@code description} — 2-3 sentence clinical description of the image</li>
+ *   <li>{@code imageText} — all visible text, labels, and annotations copied verbatim
+ *       from the image (empty string when none present)</li>
+ * </ul>
+ *
+ * <p>Both fields are stored: {@code description} drives the embedding; {@code imageText}
+ * is added to chunk metadata so queries can match exact labels (e.g., "Circle of Willis").
 */
@Service
 public class VisionDescriptionService {

    private static final Logger log = LoggerFactory.getLogger(VisionDescriptionService.class);

-    private static final String PROMPT =
-        "You are a neurosurgery educator. Provide a brief 2-3 sentence clinical description of " +
-        "this image. Focus on anatomical structures, surgical landmarks, labels, and clinical " +
-        "significance. If text or labels are visible, include them verbatim.";
+    private static final String PROMPT = """
+            You are a neurosurgery educator analysing a medical image.
+            Respond in EXACTLY this format — no other text, no markdown:
+            DESCRIPTION: <2-3 sentence clinical description focusing on anatomical structures, surgical landmarks, and clinical significance>
+            IMAGE_TEXT: <all visible text, labels, measurements, and annotations copied verbatim, comma-separated; write NONE if no text visible>
+            """;

    private final ChatClient chatClient;

@@ -30,20 +39,53 @@ public class VisionDescriptionService {
    }

    /**
-     * Returns a description string. Falls back to the provided caption if vision fails.
+     * Holds the structured output of a vision model call on one figure image.
+     *
+     * @param description clinical description of the image content
+     * @param imageText   verbatim text visible inside the image; empty string if none
     */
-    public String describe(Path imagePath, String captionFallback) {
+    public record ImageAnalysis(String description, String imageText) {}
+
+    /**
+     * Analyses the image bytes and returns an {@link ImageAnalysis}.
+     * Falls back gracefully: if the vision call fails, the caption is used as description
+     * and imageText is left empty.
+     *
+     * @param imageBytes      PNG bytes of the extracted figure
+     * @param captionFallback caption detected from surrounding text, may be null
+     */
+    public ImageAnalysis analyze(byte[] imageBytes, String captionFallback) {
        try {
-            return chatClient.prompt()
-                .user(u -> u
-                    .text(PROMPT)
-                    .media(MimeTypeUtils.IMAGE_PNG, new FileSystemResource(imagePath.toFile())))
-                .call()
-                .content();
+            String raw = chatClient.prompt()
+                    .user(u -> u
+                            .text(PROMPT)
+                            .media(MimeTypeUtils.IMAGE_PNG, new ByteArrayResource(imageBytes)))
+                    .call()
+                    .content();
+            return parse(raw, captionFallback);
        } catch (Exception ex) {
-            log.warn("Vision description failed for {}: {} — using caption as fallback",
-                imagePath.getFileName(), ex.getMessage());
-            return captionFallback != null ? captionFallback : "Figure";
+            log.warn("Vision analysis failed: {} — using caption as fallback", ex.getMessage());
+            return new ImageAnalysis(
+                    captionFallback != null ? captionFallback : "Figure",
+                    "");
        }
    }
+
+    private ImageAnalysis parse(String raw, String captionFallback) {
+        String description = captionFallback != null ? captionFallback : "Figure";
+        String imageText = "";
+
+        if (raw != null) {
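+            // Split on any line terminator and trim each line, so indented model output still matches
+            for (String rawLine : raw.split("\\R")) {
+                String line = rawLine.strip();
+                if (line.startsWith("DESCRIPTION:")) {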
+                    String val = line.substring("DESCRIPTION:".length()).strip();
+                    if (!val.isEmpty()) description = val;
+                } else if (line.startsWith("IMAGE_TEXT:")) {
+                    String val = line.substring("IMAGE_TEXT:".length()).strip();
+                    if (!val.isEmpty() && !"NONE".equalsIgnoreCase(val)) imageText = val;
+                }
+            }
+        }
+        return new ImageAnalysis(description, imageText);
+    }
 }
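
The two-line response format requested by `PROMPT` can be checked without calling the vision model. The sketch below is a standalone approximation of the parsing step, not the service's actual code: the class name `ImageAnalysisParseSketch`, the redeclared record, and any sample responses fed to it are assumptions made for illustration.

```java
final class ImageAnalysisParseSketch {

    /** Mirrors the shape of VisionDescriptionService.ImageAnalysis for this sketch only. */
    record ImageAnalysis(String description, String imageText) {}

    /** Parses "DESCRIPTION: ..." / "IMAGE_TEXT: ..." lines, falling back to the caption. */
    static ImageAnalysis parse(String raw, String captionFallback) {
        String description = captionFallback != null ? captionFallback : "Figure";
        String imageText = "";
        if (raw == null) return new ImageAnalysis(description, imageText);

        // \R matches any line terminator, including CRLF as a single unit
        for (String rawLine : raw.split("\\R")) {
            String line = rawLine.strip();
            if (line.startsWith("DESCRIPTION:")) {
                String val = line.substring("DESCRIPTION:".length()).strip();
                if (!val.isEmpty()) description = val;
            } else if (line.startsWith("IMAGE_TEXT:")) {
                String val = line.substring("IMAGE_TEXT:".length()).strip();
                // The prompt asks the model to write NONE when no text is visible
                if (!val.isEmpty() && !"NONE".equalsIgnoreCase(val)) imageText = val;
            }
        }
        return new ImageAnalysis(description, imageText);
    }
}
```

A response that omits the `DESCRIPTION:` line keeps the caption fallback, which matches the behaviour of the `catch` branch in `analyze`.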
@@ -1,24 +1,27 @@
 package com.aiteacher.figure;

 import java.awt.image.BufferedImage;
-import java.nio.file.Path;
 import java.util.UUID;

 public interface FigureStorageService {

    /**
-     * Saves an extracted image to the figure store and returns the relative path
-     * (relative to the configured base-path) stored in the database.
+     * Saves an extracted image to S3 and returns the object key stored in the database.
     */
    String save(UUID bookId, String figureId, BufferedImage image);

    /**
-     * Resolves a stored relative path to an absolute filesystem path.
+     * Downloads the image bytes for the given S3 object key.
     */
-    Path resolve(String relativePath);
+    byte[] getBytes(String key);

    /**
-     * Deletes all figure files for the given book.
+     * Returns a presigned GET URL valid for 1 hour for the given S3 object key.
+     */
+    String presignedUrl(String key);
+
+    /**
+     * Deletes all figure objects for the given book.
     */
    void deleteAll(UUID bookId);
 }
@@ -1,59 +0,0 @@
-package com.aiteacher.figure;
-
-import org.slf4j.Logger;
-import org.slf4j.LoggerFactory;
-import org.springframework.beans.factory.annotation.Value;
-import org.springframework.stereotype.Service;
-
-import javax.imageio.ImageIO;
-import java.awt.image.BufferedImage;
-import java.io.IOException;
-import java.nio.file.Files;
-import java.nio.file.Path;
-import java.nio.file.Paths;
-import java.util.UUID;
-
-@Service
-public class LocalFigureStorageService implements FigureStorageService {
-
-    private static final Logger log = LoggerFactory.getLogger(LocalFigureStorageService.class);
-
-    private final Path basePath;
-
-    public LocalFigureStorageService(@Value("${app.figure-storage.base-path:./uploads}") String basePath) {
-        this.basePath = Paths.get(basePath).toAbsolutePath().normalize();
-    }
-
-    @Override
-    public String save(UUID bookId, String figureId, BufferedImage image) {
-        try {
-            Path dir = basePath.resolve("figures").resolve(bookId.toString());
-            Files.createDirectories(dir);
-            String filename = figureId + ".png";
-            Path file = dir.resolve(filename);
-            ImageIO.write(image, "PNG", file.toFile());
-            // Return relative path for storage in DB
-            return "figures/" + bookId + "/" + filename;
-        } catch (IOException ex) {
-            throw new RuntimeException("Failed to save figure " + figureId, ex);
-        }
-    }
-
-    @Override
-    public Path resolve(String relativePath) {
-        return basePath.resolve(relativePath);
-    }
-
-    @Override
-    public void deleteAll(UUID bookId) {
-        Path dir = basePath.resolve("figures").resolve(bookId.toString());
-        if (!Files.exists(dir)) return;
-        try (var walk = Files.walk(dir)) {
-            walk.sorted(java.util.Comparator.reverseOrder())
-                .map(Path::toFile)
-                .forEach(java.io.File::delete);
-        } catch (IOException ex) {
-            log.warn("Could not fully delete figures for book {}: {}", bookId, ex.getMessage());
-        }
-    }
-}
@@ -0,0 +1,132 @@
+package com.aiteacher.figure;
+
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+import org.springframework.beans.factory.annotation.Value;
+import org.springframework.stereotype.Service;
+import software.amazon.awssdk.auth.credentials.AwsBasicCredentials;
+import software.amazon.awssdk.auth.credentials.StaticCredentialsProvider;
+import software.amazon.awssdk.core.sync.RequestBody;
+import software.amazon.awssdk.regions.Region;
+import software.amazon.awssdk.services.s3.S3Client;
+import software.amazon.awssdk.services.s3.S3Configuration;
+import software.amazon.awssdk.services.s3.model.*;
+import software.amazon.awssdk.services.s3.presigner.S3Presigner;
+import software.amazon.awssdk.services.s3.presigner.model.GetObjectPresignRequest;
+
+import javax.imageio.ImageIO;
+import java.awt.image.BufferedImage;
+import java.io.ByteArrayOutputStream;
+import java.io.IOException;
+import java.net.URI;
+import java.time.Duration;
+import java.util.ArrayList;
+import java.util.List;
+import java.util.UUID;
+
+@Service
+public class S3FigureStorageService implements FigureStorageService {
+
+    private static final Logger log = LoggerFactory.getLogger(S3FigureStorageService.class);
+
+    private final S3Client s3;
+    private final S3Presigner presigner;
+    private final String bucket;
+
+    public S3FigureStorageService(
+            @Value("${app.figure-storage.endpoint}") String endpoint,
+            @Value("${app.figure-storage.region}") String region,
+            @Value("${app.figure-storage.bucket}") String bucket,
+            @Value("${app.figure-storage.access-key-id}") String accessKeyId,
+            @Value("${app.figure-storage.secret-access-key}") String secretKey) {
+        this.bucket = bucket;
+        URI endpointUri = URI.create(endpoint);
+        StaticCredentialsProvider credentials = StaticCredentialsProvider.create(
+                AwsBasicCredentials.create(accessKeyId, secretKey));
+        Region awsRegion = Region.of(region);
+
+        S3Configuration s3Config = S3Configuration.builder()
+                .pathStyleAccessEnabled(true)
+                .build();
+
+        this.s3 = S3Client.builder()
+                .endpointOverride(endpointUri)
+                .region(awsRegion)
+                .credentialsProvider(credentials)
+                .serviceConfiguration(s3Config)
+                .build();
+
+        this.presigner = S3Presigner.builder()
+                .endpointOverride(endpointUri)
+                .region(awsRegion)
+                .credentialsProvider(credentials)
+                .serviceConfiguration(s3Config)
+                .build();
+    }
+
+    @Override
+    public String save(UUID bookId, String figureId, BufferedImage image) {
+        String key = "figures/" + bookId + "/" + figureId + ".png";
+        try {
+            ByteArrayOutputStream out = new ByteArrayOutputStream();
+            ImageIO.write(image, "PNG", out);
+            byte[] bytes = out.toByteArray();
+
+            s3.putObject(
+                    PutObjectRequest.builder().bucket(bucket).key(key)
+                            .contentType("image/png").contentLength((long) bytes.length).build(),
+                    RequestBody.fromBytes(bytes));
+            return key;
+        } catch (IOException ex) {
+            throw new RuntimeException("Failed to encode figure " + figureId, ex);
+        } catch (S3Exception ex) {
+            throw new RuntimeException("Failed to upload figure " + figureId + " to S3", ex);
+        }
+    }
+
+    @Override
+    public byte[] getBytes(String key) {
+        try {
+            return s3.getObjectAsBytes(
+                    GetObjectRequest.builder().bucket(bucket).key(key).build()).asByteArray();
+        } catch (S3Exception ex) {
+            throw new RuntimeException("Failed to download figure from S3: " + key, ex);
+        }
+    }
+
+    @Override
+    public String presignedUrl(String key) {
+        GetObjectPresignRequest request = GetObjectPresignRequest.builder()
+                .signatureDuration(Duration.ofHours(1))
+                .getObjectRequest(r -> r.bucket(bucket).key(key))
+                .build();
+        return presigner.presignGetObject(request).url().toString();
+    }
+
+    @Override
+    public void deleteAll(UUID bookId) {
+        String prefix = "figures/" + bookId + "/";
+        try {
+            List<ObjectIdentifier> toDelete = new ArrayList<>();
+            ListObjectsV2Request listRequest = ListObjectsV2Request.builder()
+                    .bucket(bucket).prefix(prefix).build();
+
+            s3.listObjectsV2Paginator(listRequest).stream()
+                    .flatMap(page -> page.contents().stream())
+                    .map(S3Object::key)
+                    .map(k -> ObjectIdentifier.builder().key(k).build())
+                    .forEach(toDelete::add);
+
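+            if (toDelete.isEmpty()) return;
+
+            // DeleteObjects accepts at most 1000 keys per request, so delete in batches
+            for (int i = 0; i < toDelete.size(); i += 1000) {
+                List<ObjectIdentifier> batch = toDelete.subList(i, Math.min(i + 1000, toDelete.size()));
+                s3.deleteObjects(DeleteObjectsRequest.builder()
+                        .bucket(bucket)
+                        .delete(Delete.builder().objects(batch).build())
+                        .build());
+            }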
+            log.info("Deleted {} figures from S3 for book {}", toDelete.size(), bookId);
+        } catch (S3Exception ex) {
+            log.warn("Could not fully delete figures for book {} from S3: {}", bookId, ex.getMessage());
+        }
+    }
+}
@@ -52,11 +52,21 @@ logging:
    "[org.apache.pdfbox]": ERROR

 app:
+  features:
+    upload-enabled: ${UPLOAD_ENABLED:true}
+    delete-enabled: ${DELETE_ENABLED:true}
  auth:
    password: ${APP_PASSWORD:changeme}
  figure-storage:
-    base-path: ${FIGURE_STORAGE_PATH:./uploads}
+    endpoint: https://s3.immich-ad.ovh
+    region: garage
+    bucket: ${S3_BUCKET:aiteacher}
+    access-key-id: ${S3_ACCESS_KEY_ID}
+    secret-access-key: ${S3_SECRET_ACCESS_KEY}
    min-image-size-px: 100
  embedding:
    batch-size: 20
    batch-delay-ms: 2000
+    skip-embedding: true
+  marker:
+    base-url: ${MARKER_BASE_URL:http://192.168.1.105:8000}
@@ -5,3 +5,9 @@ VITE_API_URL=/api/v1

 # Shared password for HTTP Basic auth (must match APP_PASSWORD on the backend).
 VITE_APP_PASSWORD=changeme
+
+# Set to 'false' to hide the upload UI (frontend). Also set UPLOAD_ENABLED=false on the backend to block the endpoint.
+VITE_UPLOAD_ENABLED=true
+
+# Set to 'false' to hide the delete button (frontend). Also set DELETE_ENABLED=false on the backend to block the endpoint.
+VITE_DELETE_ENABLED=true
@@ -64,11 +64,11 @@ body {
    Ubuntu, Cantarell, 'Fira Sans', 'Droid Sans', 'Helvetica Neue', sans-serif;
  background: #f0f4f8;
  color: #2d3748;
-  min-height: 100vh;
+  height: 100vh;
 }

 #app {
-  min-height: 100vh;
+  height: 100vh;
  display: flex;
  flex-direction: column;
 }
@@ -133,6 +133,9 @@ body {

 .main-content {
  flex: 1;
+  min-height: 0;
+  display: flex;
+  flex-direction: column;
  padding: 2rem;
  max-width: 1200px;
  margin: 0 auto;
@@ -33,7 +33,15 @@
    </div>

    <div class="book-actions">
+      <router-link
+        v-if="book.status === 'READY'"
+        :to="{ name: 'book-reader', params: { id: book.id } }"
+        class="btn btn-secondary"
+      >
+        Read
+      </router-link>
      <button
+        v-if="deleteEnabled"
        class="btn btn-danger"
        :disabled="book.status === 'PROCESSING' || deleting"
        @click="$emit('delete', book.id)"
@@ -52,6 +60,7 @@ import type { Book } from '@/stores/bookStore'
 const props = defineProps<{
  book: Book
  deleting?: boolean
+  deleteEnabled?: boolean
 }>()

 defineEmits<{
@@ -181,6 +190,7 @@ function formatDate(iso: string): string {
 .book-actions {
  display: flex;
  justify-content: flex-end;
+  gap: 0.5rem;
  margin-top: 0.25rem;
 }
 </style>
@@ -3,6 +3,8 @@
 interface ImportMetaEnv {
  readonly VITE_API_URL: string
  readonly VITE_APP_PASSWORD: string
+  readonly VITE_UPLOAD_ENABLED: string
+  readonly VITE_DELETE_ENABLED: string
 }

 interface ImportMeta {
@@ -2,6 +2,7 @@ import { createRouter, createWebHistory } from 'vue-router'
 import UploadView from '@/views/UploadView.vue'
 import TopicsView from '@/views/TopicsView.vue'
 import ChatView from '@/views/ChatView.vue'
+import BookReaderView from '@/views/BookReaderView.vue'

 const router = createRouter({
  history: createWebHistory(import.meta.env.BASE_URL),
@@ -20,6 +21,11 @@ const router = createRouter({
      path: '/chat',
      name: 'chat',
      component: ChatView
+    },
+    {
+      path: '/books/:id/read',
+      name: 'book-reader',
+      component: BookReaderView
    }
  ]
 })
@@ -0,0 +1,325 @@
+<template>
+  <div class="reader-view">
+    <!-- Header -->
+    <div class="reader-header">
+      <router-link to="/" class="back-link">← Library</router-link>
+      <div class="reader-title">
+        <h1 class="book-title">{{ book?.title ?? 'Loading…' }}</h1>
+      </div>
+      <div class="page-nav">
+        <button class="nav-btn" :disabled="currentPage <= 1" @click="goTo(currentPage - 1)">&#8592;</button>
+        <form class="page-jump" @submit.prevent="onJump">
+          <input
+            v-model.number="jumpInput"
+            type="number"
+            :min="1"
+            :max="book?.pageCount ?? 1"
+            class="page-input"
+          />
+          <span class="page-sep">/ {{ book?.pageCount ?? '…' }}</span>
+        </form>
+        <button class="nav-btn" :disabled="!book || currentPage >= book.pageCount!" @click="goTo(currentPage + 1)">&#8594;</button>
+      </div>
+    </div>
+
+    <!-- Content -->
+    <div class="reader-body">
+      <div v-if="loading" class="reader-loading">
+        <div class="spinner spinner-dark" style="width:28px;height:28px;margin:0 auto 0.75rem;"></div>
+        <p>Loading page {{ currentPage }}…</p>
+      </div>
+
+      <div v-else-if="error" class="reader-error card">
+        <strong>Could not load page {{ currentPage }}</strong><br />
+        {{ error }}
+      </div>
+
+      <div v-else class="reader-content card">
+        <div class="markdown-body" v-html="renderedHtml"></div>
+      </div>
+    </div>
+  </div>
+</template>
+
+<script setup lang="ts">
+import { ref, watch, onMounted } from 'vue'
+import { useRoute } from 'vue-router'
+import { api } from '@/services/api'
+import { useBookStore } from '@/stores/bookStore'
+import type { Book } from '@/stores/bookStore'
+
+const route = useRoute()
+const bookStore = useBookStore()
+
+const bookId = route.params.id as string
+const book = ref<Book | null>(null)
+const currentPage = ref(1)
+const jumpInput = ref(1)
+const loading = ref(false)
+const error = ref<string | null>(null)
+const renderedHtml = ref('')
+
+// Blob URLs created this session — revoked on next page load
+let activeBlobUrls: string[] = []
+
+onMounted(async () => {
+  book.value = bookStore.books.find(b => b.id === bookId) ?? null
+  if (!book.value) {
+    try {
+      const res = await api.get<Book>(`/books/${bookId}`)
+      book.value = res.data
+    } catch {
+      error.value = 'Book not found.'
+      return
+    }
+  }
+  await loadPage(1)
+})
+
+watch(currentPage, (page) => {
+  jumpInput.value = page
+  loadPage(page)
+})
+
+function goTo(page: number) {
+  if (!book.value) return
+  const clamped = Math.max(1, Math.min(page, book.value.pageCount ?? 1))
+  if (clamped !== currentPage.value) {
+    currentPage.value = clamped
+  }
+}
+
+function onJump() {
+  goTo(jumpInput.value)
+}
+
+async function loadPage(page: number) {
+  loading.value = true
+  error.value = null
+  renderedHtml.value = ''
+
+  // Revoke previous blob URLs to free memory
+  activeBlobUrls.forEach(u => URL.revokeObjectURL(u))
+  activeBlobUrls = []
+
+  try {
+    const res = await api.get<string>(`/books/${bookId}/pages/${page}/html`, {
+      headers: { Accept: 'text/html' },
+      responseType: 'text'
+    })
+    renderedHtml.value = await resolveImages(res.data)
+  } catch (e: any) {
+    error.value = e.message ?? 'Failed to load page.'
+  } finally {
+    loading.value = false
+  }
+}
+
+/**
+ * Finds <img src="/api/v1/figures/..."> in the HTML, fetches each image
+ * through the authenticated axios instance, and replaces the src with a
+ * temporary blob URL so the browser can render it without re-authenticating.
+ */
+async function resolveImages(html: string): Promise<string> {
+  const srcPattern = /src="(\/api\/v1\/figures\/[^"]+)"/g
+  const matches = [...html.matchAll(srcPattern)]
+  if (matches.length === 0) return html
+
+  const unique = [...new Set(matches.map(m => m[1]))]
+  const blobMap: Record<string, string> = {}
+
+  await Promise.all(
+    unique.map(async (src) => {
+      try {
+        const res = await api.get(src.replace(/^\/api\/v1/, ''), { responseType: 'blob' })
+        const blobUrl = URL.createObjectURL(res.data)
+        activeBlobUrls.push(blobUrl)
+        blobMap[src] = blobUrl
+      } catch {
+        // leave original src — browser will attempt (and likely fail silently)
+      }
+    })
+  )
+
+  return html.replace(/src="(\/api\/v1\/figures\/[^"]+)"/g, (_, src) =>
+    blobMap[src] ? `src="${blobMap[src]}"` : `src="${src}"`
+  )
+}
+</script>
+
+<style scoped>
+.reader-view {
+  display: flex;
+  flex-direction: column;
+  gap: 1rem;
+  max-width: 860px;
+  margin: 0 auto;
+  flex: 1;
+  min-height: 0;
+}
+
+.reader-header {
+  display: flex;
+  align-items: center;
+  gap: 1rem;
+  flex-wrap: wrap;
+}
+
+.back-link {
+  color: #3182ce;
+  text-decoration: none;
+  font-size: 0.9rem;
+  white-space: nowrap;
+}
+.back-link:hover { text-decoration: underline; }
+
+.reader-title {
+  flex: 1;
+  min-width: 0;
+}
+
+.book-title {
+  font-size: 1.1rem;
+  font-weight: 600;
+  color: #1a365d;
+  white-space: nowrap;
+  overflow: hidden;
+  text-overflow: ellipsis;
+}
+
+.page-nav {
+  display: flex;
+  align-items: center;
+  gap: 0.5rem;
+}
+
+.nav-btn {
+  width: 2rem;
+  height: 2rem;
+  border: 1px solid #cbd5e0;
+  border-radius: 6px;
+  background: #fff;
+  cursor: pointer;
+  font-size: 1rem;
+  display: flex;
+  align-items: center;
+  justify-content: center;
+  transition: background 0.15s;
+}
+.nav-btn:hover:not(:disabled) { background: #ebf8ff; border-color: #3182ce; }
+.nav-btn:disabled { opacity: 0.4; cursor: not-allowed; }
+
+.page-jump {
+  display: flex;
+  align-items: center;
+  gap: 0.35rem;
+}
+
+.page-input {
+  width: 3.5rem;
+  text-align: center;
+  border: 1px solid #cbd5e0;
+  border-radius: 6px;
+  padding: 0.25rem 0.4rem;
+  font-size: 0.9rem;
+  color: #2d3748;
+}
+.page-input:focus { outline: none; border-color: #3182ce; }
+
+.page-sep {
+  font-size: 0.85rem;
+  color: #718096;
+  white-space: nowrap;
+}
+
+.reader-body {
+  flex: 1;
+  min-height: 0;
+  display: flex;
+  flex-direction: column;
+}
+
+.reader-loading {
+  text-align: center;
+  padding: 3rem;
+  color: #718096;
+}
+
+.reader-error {
+  padding: 1.25rem;
+  background: #fff5f5;
+  border: 1px solid #fed7d7;
+  color: #742a2a;
+  border-radius: 8px;
+}
+
+.reader-content {
+  flex: 1;
+  min-height: 0;
+  overflow-y: auto;
+  padding: 2rem;
+}
+
+/* Markdown rendering */
+.markdown-body {
+  font-size: 0.95rem;
+  line-height: 1.75;
+  color: #2d3748;
+}
+
+.markdown-body :deep(h1),
+.markdown-body :deep(h2),
+.markdown-body :deep(h3) {
+  color: #1a365d;
+  font-weight: 600;
+  margin: 1.5rem 0 0.75rem;
+}
+.markdown-body :deep(h2) { font-size: 1.15rem; border-bottom: 1px solid #e2e8f0; padding-bottom: 0.4rem; }
+.markdown-body :deep(h3) { font-size: 1rem; }
+
+.markdown-body :deep(p) { margin: 0.75rem 0; }
+
+.markdown-body :deep(img) {
+  max-width: 100%;
+  border-radius: 6px;
+  display: block;
+  margin: 1rem auto;
+  box-shadow: 0 1px 4px rgba(0,0,0,0.12);
+}
+
+.markdown-body :deep(ul),
+.markdown-body :deep(ol) {
+  padding-left: 1.5rem;
+  margin: 0.75rem 0;
+}
+
+.markdown-body :deep(code) {
+  background: #f7fafc;
+  border: 1px solid #e2e8f0;
+  border-radius: 3px;
+  padding: 0.1em 0.35em;
+  font-size: 0.88em;
+}
+
+.markdown-body :deep(blockquote) {
+  border-left: 3px solid #3182ce;
+  padding-left: 1rem;
+  color: #4a5568;
+  margin: 0.75rem 0;
+}
+
+.markdown-body :deep(table) {
+  width: 100%;
+  border-collapse: collapse;
+  font-size: 0.9em;
+  margin: 1rem 0;
+}
+.markdown-body :deep(th),
+.markdown-body :deep(td) {
+  border: 1px solid #e2e8f0;
+  padding: 0.4rem 0.75rem;
+  text-align: left;
+}
+.markdown-body :deep(th) { background: #f7fafc; font-weight: 600; }
+</style>
@@ -1,10 +1,10 @@
 <template>
  <div class="upload-view">
    <h1 class="page-title">Book Library</h1>
-    <p class="page-subtitle">Upload medical textbooks (PDF) to build the knowledge base.</p>
+    <p v-if="uploadEnabled" class="page-subtitle">Upload medical textbooks (PDF) to build the knowledge base.</p>

    <!-- Upload Section -->
-    <div class="upload-section card">
+    <div v-if="uploadEnabled" class="upload-section card">
      <h2 class="section-title">Upload a Book</h2>

      <div
@@ -87,6 +87,7 @@
          :key="book.id"
          :book="book"
          :deleting="deletingId === book.id"
+          :delete-enabled="deleteEnabled"
          @delete="handleDelete"
        />
      </div>
@@ -99,6 +100,9 @@ import { ref, onMounted, onUnmounted, inject } from 'vue'
 import { useBookStore } from '@/stores/bookStore'
 import BookCard from '@/components/BookCard.vue'

+const uploadEnabled = import.meta.env.VITE_UPLOAD_ENABLED !== 'false'
+const deleteEnabled = import.meta.env.VITE_DELETE_ENABLED !== 'false'
+
 const bookStore = useBookStore()
 const showToast = inject<(msg: string, type?: 'error' | 'success') => void>('showToast')

@@ -0,0 +1,79 @@
+# Internal Contract: DocumentAiPageParser → FigureExtractionService
+
+**Branch**: `002-image-aware-embedding` | **Date**: 2026-04-04  
+**Type**: Internal Java DTO (not an HTTP contract)
+
+---
+
+## Purpose
+
+`PageResult` is the internal data transfer object produced by `DocumentAiPageParser` for each
+PDF page. It decouples the Google Document AI SDK types from the rest of the pipeline so that
+`PdfStructureParser` can be replaced without cascading changes.
+
+---
+
+## Java Record
+
+```java
+package com.aiteacher.document;
+
+import java.util.List;
+
+/**
+ * Internal DTO produced by DocumentAiPageParser for one PDF page.
+ * Decouples the Document AI SDK types from downstream services.
+ */
+public record PageResult(
+    int pageNumber,           // 1-based, matches Document.Page.getPageNumber()
+    String orderedText,       // full page text in correct reading order (blocks joined by \n\n)
+    String headingTitle,      // first HEADING block on page, or null
+    List<FigureBbox> figures  // detected figure regions (may be empty)
+) {
+
+    /**
+     * Normalized bounding box for a detected figure region.
+     * Coordinates are in the [0.0, 1.0] range relative to page dimensions.
+     */
+    public record FigureBbox(
+        float x,       // left edge (normalized)
+        float y,       // top edge (normalized)
+        float width,   // width (normalized)
+        float height,  // height (normalized)
+        String nearestCaption  // text of adjacent paragraph block, or null
+    ) {}
+}
+```
+
+---
+
+## Production Rules
+
+| Field | Rule |
+|-------|------|
+| `orderedText` | Concatenation of all `PARAGRAPH` and `HEADING_*` blocks, joined with `\n\n`. Tables are represented as tab-separated text. |
+| `headingTitle` | First block whose `blockType` is `HEADING_1` through `HEADING_6`. `null` if no heading detected. |
+| `figures` | One entry per `VisualElement` with `type == "figure"` and `confidence ≥ 0.5`. Sorted top-to-bottom by `y`. |
+| `nearestCaption` | The `PARAGRAPH` block immediately following the figure bbox (by Y coordinate). May be `null` if no paragraph follows within 10% of page height. |
+
+---
+
+## Mapping from Document AI Proto
+
+```
+Document.Page.Block         → orderedText (concatenated)
+Document.Page.Block (HEADING_*) → headingTitle (first match)
+Document.Page.VisualElement → FigureBbox
+  └─ layout.bounding_poly.normalized_vertices[0] → (x, y) top-left
+  └─ normalized_vertices[2] → (x+w, y+h) bottom-right
+```
+
+---
+
+## Consumers
+
+| Consumer | What It Uses |
+|----------|-------------|
+| `BookEmbeddingService` | `orderedText` → `SectionEntity.fullText`; `headingTitle` → `SectionEntity.title` |
+| `FigureExtractionService` | `figures` list → renders page via PDFBox, crops each bbox to `BufferedImage` |
+| `TextChunkingService` | Receives `SectionEntity` (indirectly uses `orderedText`) — **unchanged** |
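
The vertex mapping and the crop step described in this contract can be sketched with plain Java. This is an illustration under stated assumptions, not the code in `FigureExtractionService`: the class name `FigureBboxCropSketch`, the helper names, and the clamping behaviour are hypothetical, and the record is redeclared only so the sketch compiles on its own.

```java
import java.awt.Rectangle;

final class FigureBboxCropSketch {

    /** Same shape as PageResult.FigureBbox; redeclared for this sketch only. */
    record FigureBbox(float x, float y, float width, float height, String nearestCaption) {}

    /** Builds a bbox from the top-left (vertex 0) and bottom-right (vertex 2) normalized corners. */
    static FigureBbox fromCorners(float x0, float y0, float x2, float y2, String caption) {
        return new FigureBbox(x0, y0, x2 - x0, y2 - y0, caption);
    }

    /** Scales a normalized bbox to pixel coordinates and clamps it to the rendered page. */
    static Rectangle toPixels(FigureBbox bbox, int pageWidthPx, int pageHeightPx) {
        int left = clamp(Math.round(bbox.x() * pageWidthPx), 0, pageWidthPx);
        int top = clamp(Math.round(bbox.y() * pageHeightPx), 0, pageHeightPx);
        int right = clamp(Math.round((bbox.x() + bbox.width()) * pageWidthPx), left, pageWidthPx);
        int bottom = clamp(Math.round((bbox.y() + bbox.height()) * pageHeightPx), top, pageHeightPx);
        return new Rectangle(left, top, right - left, bottom - top);
    }

    private static int clamp(int value, int lo, int hi) {
        return Math.max(lo, Math.min(hi, value));
    }
}
```

The resulting rectangle could then be passed to `BufferedImage.getSubimage(x, y, width, height)` on the page rendered by PDFBox. Clamping is included because a bbox that extends past the page edge would otherwise make the crop fail.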
@@ -0,0 +1,84 @@
+# Internal Contract: MarkerPageParser → FigureExtractionService / BookEmbeddingService
+
+**Branch**: `002-image-aware-embedding` | **Date**: 2026-04-04  
+**Type**: Internal Java DTO (not an HTTP contract)
+
+---
+
+## Purpose
+
+`PageResult` is the internal data transfer object produced by `MarkerPageParser` for each
+PDF page. It decouples the Marker HTTP API from the rest of the pipeline. Downstream consumers
+(`BookEmbeddingService`, `FigureExtractionService`, `TextChunkingService`) are unaware of
+Marker and depend only on this DTO.
+
+---
+
+## Java Record
+
+```java
+package com.aiteacher.document;
+
+import java.util.List;
+
+/**
+ * Internal DTO produced by MarkerPageParser for one PDF page.
+ * Decouples the Marker HTTP API from downstream services.
+ */
+public record PageResult(
+    int pageNumber,              // 1-based, derived from Marker page block index
+    String orderedText,          // full page text in correct reading order (blocks joined by \n\n)
+    String headingTitle,         // first SectionHeader block on page, or null
+    List<FigureData> figures     // extracted figure images (may be empty)
+) {
+
+    /**
+     * A figure extracted from the page.
+     * Image bytes are PNG data decoded from the Marker JSON `images` map.
+     */
+    public record FigureData(
+        byte[] imageBytes,       // PNG image data (base64-decoded from Marker response)
+        String nearestCaption,   // text of the adjacent Caption block, or null
+        String blockId           // Marker block ID (e.g. "/page/0/Figure/2") for traceability
+    ) {}
+}
+```
+
+---
+
+## Production Rules
+
+| Field | Rule |
+|-------|------|
+| `pageNumber` | 1-based index derived from the Marker page block's position in the `children` array (index + 1). |
+| `orderedText` | HTML-stripped text from all `Text`, `TextInlineMath`, `SectionHeader`, `ListItem`, and `Table` blocks, joined with `\n\n`. Marker already returns them in reading order. |
+| `headingTitle` | Plain text of the first `SectionHeader` block on the page. `null` if no heading detected. |
+| `figures` | One `FigureData` per `Figure` or `Picture` block that has a non-empty `images` entry. Blocks with no image data are skipped. |
+| `imageBytes` | Base64-decoded bytes from `block.images[blockId]`. Marker returns PNG. |
+| `nearestCaption` | Plain text of the first `Caption` block that is a sibling appearing immediately after the figure block. `null` if absent. |
+
+---
+
+## Mapping from Marker JSON
+
+```
+Marker JSON → PageResult
+
+Page block ("/page/N/Page/M")       → PageResult(pageNumber = N + 1)
+  SectionHeader child                → headingTitle (first match, HTML-stripped)
+  Text / TextInlineMath / SectionHeader / ListItem / Table children → orderedText (HTML-stripped, joined \n\n)
+  Figure / Picture child            → FigureData
+    images[blockId]                  → FigureData.imageBytes (base64-decoded)
+    next Caption sibling             → FigureData.nearestCaption (HTML-stripped)
+    blockId                          → FigureData.blockId
+```
+
+---
+
+## Consumers
+
+| Consumer | What It Uses |
+|----------|-------------|
+| `BookEmbeddingService` | `orderedText` → `SectionEntity.fullText`; `headingTitle` → `SectionEntity.title` |
+| `FigureExtractionService` | `figures` list → reads `imageBytes` into a `BufferedImage`, checks min size, saves to S3 |
+| `TextChunkingService` | Receives `SectionEntity` (uses `orderedText` indirectly) — **unchanged** |
@@ -1,40 +1,42 @@
 # Implementation Plan: Enhanced Embedding with Image Parsing and Metadata

-**Branch**: `002-image-aware-embedding` | **Date**: 2026-04-03 | **Spec**: [spec.md](spec.md)  
+**Branch**: `002-image-aware-embedding` | **Date**: 2026-04-04 | **Spec**: [spec.md](spec.md)  
 **Input**: Feature specification from `/specs/002-image-aware-embedding/spec.md`

 ## Summary

-Enhance the book embedding pipeline to extract images from every PDF page, generate descriptive
-text for each image, and store all content (text chunks + figure captions) with rich, consistent
-metadata in the vector store. A new document hierarchy (Book → Chapter → Section → TextChunk +
-Figure) is introduced. Postgres holds the full-text sections and figure metadata; the vector
-store holds chunk and figure caption embeddings; the local file store holds extracted image files.
-At query time, both the text-chunk store and figure-caption store are searched in parallel and
-results are merged before being sent to the LLM.
+Enhance the PDF embedding pipeline to extract figures and generate AI descriptions for them,
+making image content semantically searchable alongside text. PDF parsing and figure extraction
+are delegated to a local **Marker** server (`http://localhost:8000/marker/upload`), which
+returns reading-order text and pre-cropped figure images (base64) in a single JSON response,
+eliminating the need for PDFBox column heuristics and figure bbox rendering.

 ## Technical Context

 **Language/Version**: Java 25 (backend), TypeScript / Node 20 (frontend)  
-**Primary Dependencies**: Spring Boot 4.0.5, Spring AI 2.0.0-M4, OpenAI API (embeddings + chat), PDFBox (via Spring AI PDF reader dependency)  
-**Storage**: PostgreSQL (JPA + Flyway), pgvector (Spring AI `VectorStore`), local file system (extracted images — `/uploads/figures/`)  
-**Testing**: Spring Boot Test, JUnit 5, Mockito  
-**Target Platform**: Linux server (Docker Compose)  
-**Project Type**: Web application — backend REST API + Vue 3 frontend  
-**Performance Goals**: Full book (up to 500 pages with images) processed in ≤ 30 minutes; query response unchanged from existing baseline  
-**Constraints**: No new deployable units; all changes within the existing `backend/` module; image storage on local disk (S3 migration is a future concern, behind an interface)  
-**Scale/Scope**: POC — <10 concurrent users; single shared book library
+**Primary Dependencies**: Spring Boot 4.0.5, Spring AI 2.0.0-M4, OpenAI API (embeddings +
+GPT-4o vision), PDFBox 3.0.3 (via `spring-ai-pdf-document-reader` — retained transitively,
+no longer used directly), Marker local HTTP API (`http://localhost:8000/marker/upload`)  
+**Storage**: PostgreSQL (JPA + Flyway), pgvector (Spring AI `VectorStore`), S3-compatible
+object store (figure images via `FigureStorageService`)  
+**Testing**: Maven / JUnit 5 (`spring-boot-starter-test`)  
+**Target Platform**: Linux server  
+**Project Type**: Web application (backend API + frontend client)  
+**Performance Goals**: SC-003 — book processing time ≤ 3× text-only for ≤ 500 pages  
+**Constraints**: REST API only (Constitution III); Marker server must be running locally;
+S3-compatible storage configured via env vars  
+**Scale/Scope**: POC — handful of books, <10 users

 ## Constitution Check

-*GATE: Must pass before Phase 0 research. Re-check after Phase 1 design.*
+*GATE: Must pass before Phase 0 research. Re-checked after Phase 1 design.*

 | Principle | Status | Notes |
 |-----------|--------|-------|
-| I — KISS | ⚠️ Justified violation — see Complexity Tracking | Hierarchical model + dual search adds complexity; justified by precision requirement |
-| II — Easy to Change | ✅ | Figure storage wrapped behind `FigureStorageService` interface; can swap local disk for S3 |
-| III — Web-First | ✅ | All new capabilities exposed via existing REST API; no new deployable units |
-| IV — Docs as Architecture | ⚠️ Required | README Mermaid diagram MUST be updated in this PR to show new storage tiers |
+| **I. KISS** | ✅ Justified | Marker replaces a bespoke PDFBox column heuristic + Google Cloud SDK with one HTTP call. Net complexity reduction vs. the Document AI approach. |
+| **II. Easy to Change** | ✅ | `MarkerPageParser` is the only class that knows about Marker; replacing Marker with another parser means swapping that one implementation. Downstream code depends only on the `PageResult` DTO, whose figure payload changes from `FigureBbox` to `FigureData`. |
+| **III. Web-First** | ✅ | Internal pipeline change; no public API contract change. |
+| **IV. Documentation** | ✅ | README must be updated to show Marker as a local external service. |

 ## Project Structure

@@ -46,60 +48,38 @@ specs/002-image-aware-embedding/
 ├── research.md          # Phase 0 output
 ├── data-model.md        # Phase 1 output
 ├── quickstart.md        # Phase 1 output
-├── contracts/           # Phase 1 output
-└── tasks.md             # Phase 2 output (/speckit.tasks)
+├── contracts/
+│   ├── api.md           # HTTP API contracts (unchanged from initial plan)
+│   └── marker-page-result.md  # Internal DTO contract (MarkerPageParser → downstream)
+└── tasks.md             # Phase 2 output (/speckit.tasks — not created here)
 ```

-### Source Code (repository root)
+### Source Code

 ```text
 backend/
 ├── src/main/java/com/aiteacher/
+│   ├── config/
+│   │   └── MarkerConfig.java          # NEW: RestClient bean + base-url property
+│   ├── document/
+│   │   ├── MarkerPageParser.java      # NEW: replaces DocumentAiPageParser + PdfStructureParser
+│   │   ├── PageResult.java            # UPDATED: FigureBbox → FigureData (bytes not bbox)
+│   │   ├── FigureExtractionService.java  # UPDATED: no PDFBox render; decode bytes directly
+│   │   ├── TextChunkingService.java   # UNCHANGED
+│   │   ├── VisionDescriptionService.java # UNCHANGED
+│   │   └── [removed] DocumentAiPageParser.java
 │   ├── book/
-│   │   ├── Book.java                         (existing)
-│   │   ├── BookController.java               (existing)
-│   │   ├── BookService.java                  (existing)
-│   │   ├── BookRepository.java               (existing)
-│   │   ├── BookStatus.java                   (existing)
-│   │   ├── BookEmbeddingService.java         (existing — enhanced)
-│   │   └── NoKnowledgeSourceException.java   (existing)
-│   ├── document/                             (new package)
-│   │   ├── BookNode.java
-│   │   ├── ChapterNode.java
-│   │   ├── SectionNode.java
-│   │   ├── SectionRepository.java
-│   │   ├── TextChunkNode.java
-│   │   ├── FigureNode.java
-│   │   ├── FigureRepository.java
-│   │   ├── FigureType.java
-│   │   ├── ChunkFigureRef.java
-│   │   └── ChunkFigureRefRepository.java
-│   ├── figure/                               (new package)
-│   │   ├── FigureStorageService.java         (interface)
-│   │   └── LocalFigureStorageService.java    (implementation)
-│   ├── retrieval/                            (new package)
-│   │   └── NeurosurgeryRetriever.java
-│   ├── chat/
-│   │   └── ChatService.java                  (updated — uses NeurosurgeryRetriever)
-│   └── config/
-│       └── FigureStorageConfig.java          (new — configures upload dir)
-└── src/main/resources/
-    └── db/migration/
-        ├── V4__document_hierarchy.sql        (new)
-        └── V5__figures_and_refs.sql          (new)
-
-uploads/
-└── figures/                                  (runtime — extracted images; gitignored)
+│   │   └── BookEmbeddingService.java  # MINOR UPDATE: inject MarkerPageParser, drop DocumentAiPageParser
+│   └── [removed] config/DocumentAiConfig.java
+├── src/main/resources/
+│   └── application.yaml               # UPDATED: remove document-ai.*, add marker.base-url
+└── pom.xml                            # UPDATED: remove google-cloud-document-ai
 ```

-**Structure Decision**: Option 2 (Web Application) confirmed. All backend changes stay within
-`backend/`. Two new packages (`document/`, `retrieval/`) plus one interface package (`figure/`)
-keep concerns separated without adding a deployable unit.
+**Structure Decision**: Option 2 (backend + frontend) per constitution Technology Constraints.
+Frontend changes are display-only (render figure citations inline).

 ## Complexity Tracking

-| Violation | Why Needed | Simpler Alternative Rejected Because |
-|-----------|------------|-------------------------------------|
-| Document hierarchy (BookNode → ChapterNode → SectionNode) | Parent-child retrieval: chunks reference their parent section so the LLM receives full section context, not just the matching fragment. This is the established solution for RAG precision. | Flat page-per-doc model (current) loses inter-sentence context; chunk-only retrieval produces incomplete answers for multi-paragraph clinical questions |
-| Dual vector search (text chunks + figure captions) | Figure captions must be independently searchable — a query about "cavernous sinus anatomy" must surface the diagram even if no text chunk scores highly | Single vector store search would miss figures whose captions don't happen to be the highest-similarity hit; this is the core deliverable of the feature |
-| Third storage tier (local file store for images) | Extracted images cannot live in Postgres (binary blobs degrade query performance) or the vector store (only vectors). A file-per-image approach is standard. | Storing images as base64 in Postgres JSONB would bloat the DB and complicate backup/restore; the `FigureStorageService` interface keeps the implementation swappable |
+> No constitution violations — Marker reduces complexity compared to the previous
+> Google Document AI approach (fewer dependencies, no GCP credentials, no 15-page batching).
@@ -1,34 +1,67 @@
 # Quickstart: Enhanced Embedding with Image Parsing and Metadata

-**Branch**: `002-image-aware-embedding` | **Date**: 2026-04-03
+**Branch**: `002-image-aware-embedding` | **Date**: 2026-04-04 (updated: Marker replaces Google Document AI)

 ---

 ## Prerequisites

 - Docker Compose running (PostgreSQL + pgvector)
- OpenAI API key set in `backend/src/main/resources/application.properties` or as env var `OPENAI_API_KEY`
+- OpenAI API key set as env var `OPENAI_API_KEY`
 - Java 25 + Maven on PATH
+- **Marker server running** on `http://localhost:8000` (see setup below)
+- S3-compatible bucket configured (existing setup)

 ---

-## New Configuration
+## Marker Server Setup (one-time)

-Add to `backend/src/main/resources/application.properties`:
+Marker is a local Python service — no cloud credentials required.

-```properties
-# Figure storage
-app.figure-storage.base-path=./uploads
-app.figure-storage.min-image-size-px=100
+```bash
+# Install (Python 3.10+ required)
+pip install marker-pdf
+
+# Start the server on port 8000
+marker_server --port 8000
 ```

-The `uploads/figures/` directory is created automatically on first use. Add it to `.gitignore`.
+The server is ready when you see:
+```
+INFO:     Uvicorn running on http://0.0.0.0:8000
+```
+
+Keep the server running in the background, for example as a `systemd` service or inside a `screen` session.
+
+---
+
+## Backend Configuration
+
+Add or update `backend/src/main/resources/application.yaml`:
+
+```yaml
+app:
+  figure-storage:
+    endpoint: https://your-s3-endpoint
+    region: your-region
+    bucket: ${S3_BUCKET:aiteacher}
+    access-key-id: ${S3_ACCESS_KEY_ID}
+    secret-access-key: ${S3_SECRET_ACCESS_KEY}
+    min-image-size-px: 100   # skip decorative images smaller than 100×100 px
+  marker:
+    base-url: ${MARKER_BASE_URL:http://localhost:8000}
+  embedding:
+    batch-size: 20
+    batch-delay-ms: 2000
+```
+
+No GCP credentials or project IDs are needed.

 ---

 ## Database Migration

-Two new Flyway migrations run automatically on startup:
+Two Flyway migrations run automatically on startup:

 - `V4__document_hierarchy.sql` — adds `chapter` and `section` tables
 - `V5__figures_and_refs.sql` — adds `figure` and `chunk_figure_ref` tables
@@ -54,10 +87,11 @@ image-aware pipeline runs. Status can be polled via `GET /api/v1/books`.

 ## Verifying Image Extraction

-1. Upload a PDF with diagrams: `POST /api/v1/books/upload`
-2. Wait for `status: "READY"` via `GET /api/v1/books`
-3. List figures: `GET /api/v1/books/{id}/figures` — should return at least one entry per image page
-4. Ask a diagram-specific question in chat — response `sources` should include a `type: "FIGURE"` entry
+1. Ensure Marker is running: `curl http://localhost:8000` should respond.
+2. Upload a PDF with diagrams: `POST /api/v1/books/upload`
+3. Wait for `status: "READY"` via `GET /api/v1/books`
+4. List figures: `GET /api/v1/books/{id}/figures` — should return at least one entry per image page
+5. Ask a diagram-specific question in chat — response `sources` should include a `type: "FIGURE"` entry

 ---

@@ -80,7 +114,8 @@ mvn test
 ```

 Key new test classes:
- `FigureExtractionServiceTest` — unit tests for image extraction and classification
+- `MarkerPageParserTest` — unit tests for JSON parsing and block-to-PageResult mapping
+- `FigureExtractionServiceTest` — unit tests for base64 decode, size filtering, classification
 - `NeurosurgeryRetrieverTest` — unit tests for dual-search merge and deduplication
 - `BookEmbeddingServiceIntegrationTest` — integration test: upload PDF with known figures,
  verify figures appear in `GET /api/v1/books/{id}/figures`
@@ -1,10 +1,10 @@
 # Research: Enhanced Embedding with Image Parsing and Metadata

-**Branch**: `002-image-aware-embedding` | **Date**: 2026-04-03
+**Branch**: `002-image-aware-embedding` | **Date**: 2026-04-04 (updated: Marker replaces Google Document AI)

-This document resolves all technical unknowns identified during planning. The primary source for
-decisions is the detailed architecture provided directly by the project owner, supplemented by
-Spring AI 2.0.0-M4 API specifics.
+This document resolves all technical unknowns identified during planning. Decisions 1–10 cover
+the core pipeline. The **Marker Study** section at the bottom explains why Marker was chosen
+over Google Document AI to drive PDF parsing and figure extraction.

 ---

@@ -28,19 +28,29 @@ association explicit and queryable.

 ---

-## Decision 2: Image Extraction Strategy
+## Decision 2: Document Parsing Strategy

-**Decision**: Use PDFBox (already on classpath via `spring-ai-pdf-document-reader`) to extract
-images per page. Each image is tagged with `page`, `figure_id` (derived from caption, e.g.
-"Fig. 12-4"), and the parent `sectionId`. Images are saved to local disk under
-`/uploads/figures/{bookId}/`.
+**Decision**: Use **Marker** (local HTTP server, `http://localhost:8000/marker/upload`) as the
+single entry point for PDF parsing. A single `POST` with `output_format=json` returns:
+- Reading-order text blocks (headings, paragraphs) — no column-split heuristic needed
+- Pre-cropped figure images as base64-encoded PNG in the `images` map of each `Figure` block
+- Table, equation, and code blocks as structured HTML

-**Rationale**: PDFBox is already present (Spring AI bundles it). No new dependency needed.
-Per-page extraction ensures every image is captured regardless of PDF structure.
+`MarkerPageParser` translates the Marker JSON response into `List<PageResult>`, which is the
+same internal DTO used by the rest of the pipeline.
+
+**Rationale**: Marker handles column reordering, scanned-page OCR, and figure cropping in one
+call, eliminating the PDFBox column heuristic (`PdfStructureParser`) and the PDFBox
+render+crop loop in `FigureExtractionService`. Net result: fewer classes, no cloud dependency,
+no GCP credentials.

 **Alternatives considered**:
- iText / iText7 → additional commercial dependency; overkill for extraction
- Screenshot each page as PNG, then OCR → far slower; loses vector quality
+- PDFBox column heuristic (previous approach) → rejected: 50/50 split fails on asymmetric
+  columns and scanned pages
+- Google Document AI Layout Parser → rejected: adds GCP credentials, per-page billing, 15-page
+  batch limit, and still requires PDFBox to render+crop figure regions from bounding boxes.
+  See Marker Study below for detailed comparison.
+- Screenshot each page + OCR → far slower; loses digital text quality

 ---

@@ -103,18 +113,19 @@ search. This is the higher-recall path; dual search (Decision 4) is the higher-p

 ## Decision 6: Image Storage

-**Decision**: Extracted images are saved as PNG files to a local directory
-(`${app.figure-storage.base-path}`, defaults to `./uploads/figures/{bookId}/`). The path is
-stored in `figure.image_path` in Postgres. A `FigureStorageService` interface wraps all disk
-I/O so the implementation can be swapped to S3 or another object store without changing
-callers.
+**Decision**: Marker returns figure images as base64-encoded PNG bytes in the JSON response.
+`FigureExtractionService` decodes these bytes and passes them to `FigureStorageService`, which
+persists them to an S3-compatible bucket (`${app.figure-storage.bucket}`). The image path/URL
+is stored in `figure.image_path` in Postgres.

-**Rationale**: Local disk is the simplest viable option for a POC with <10 users. The interface
-boundary satisfies Constitution Principle II (Easy to Change).
+The `FigureStorageService` interface is unchanged; only the caller changes (from PDFBox crop
+to base64 decode).
+
+**Rationale**: Marker's pre-cropped images remove the need for PDFBox rendering.
+`FigureStorageService` interface boundary satisfies Constitution Principle II (Easy to Change).

 **Alternatives considered**:
- S3 from day 1 → operational overhead not justified at POC scale
- Base64 in Postgres JSONB → bloats DB; complicates backup; query performance degrades
+- Store base64 in Postgres JSONB → bloats DB; complicates backup; query performance degrades

 ---

@@ -123,7 +134,8 @@ boundary satisfies Constitution Principle II (Easy to Change).
 **Decision**: Use the enum `FigureType { ANATOMICAL_DIAGRAM, SURGICAL_PHOTOGRAPH, MRI_CT_SCAN,
 TABLE, CHART, INTRAOPERATIVE_IMAGE }`. Classification is derived from:
 1. Caption keywords ("MRI", "CT", "Fig.", "Table") — heuristic, no model needed
-2. Fall back to `ANATOMICAL_DIAGRAM` if unclassifiable
+2. Marker `block_type` hint (`"Table"` → TABLE, `"Figure"` / `"Picture"` → ANATOMICAL_DIAGRAM default)
+3. Fall back to `ANATOMICAL_DIAGRAM` if unclassifiable

 **Rationale**: Allows the frontend to render different icon/label per type (e.g., "MRI" badge).
 Heuristic classification avoids a separate model call per image at extraction time.
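
The classification order described in this decision could be sketched as follows. The keyword rules here (`MRI`/`CT` as whole words, a caption starting with "Table") are assumptions made for illustration; the authoritative keyword table lives in data-model.md and is not reproduced here. The class name `FigureTypeSketch` is also hypothetical.

```java
import java.util.regex.Pattern;

class FigureTypeSketch {

    enum FigureType {
        ANATOMICAL_DIAGRAM, SURGICAL_PHOTOGRAPH, MRI_CT_SCAN, TABLE, CHART, INTRAOPERATIVE_IMAGE
    }

    // Assumed keyword rules, not the project's real table.
    private static final Pattern SCAN_KEYWORD = Pattern.compile("\\b(MRI|CT)\\b");
    private static final Pattern TABLE_CAPTION =
            Pattern.compile("^\\s*table\\b", Pattern.CASE_INSENSITIVE);

    /**
     * 1. caption keywords, 2. Marker block_type hint, 3. default ANATOMICAL_DIAGRAM.
     * Both arguments may be null.
     */
    static FigureType classify(String caption, String blockType) {
        if (caption != null) {
            if (SCAN_KEYWORD.matcher(caption).find()) {
                return FigureType.MRI_CT_SCAN;
            }
            if (TABLE_CAPTION.matcher(caption).find()) {
                return FigureType.TABLE;
            }
        }
        if ("Table".equals(blockType)) {
            return FigureType.TABLE;
        }
        // "Figure", "Picture", and anything unclassifiable fall back to the default.
        return FigureType.ANATOMICAL_DIAGRAM;
    }

    public static void main(String[] args) {
        System.out.println(classify("Fig. 3-1. Axial MRI of the sella", "Figure"));
        System.out.println(classify(null, "Table"));
        System.out.println(classify("Fig. 12-4. Coronal cross-section...", "Picture"));
    }
}
```

Checking the caption before the `block_type` hint lets a caption override Marker's layout label, which matches the ordering of the list above.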
@@ -175,14 +187,225 @@ the process fails mid-way. An explicit, idempotent trigger is safer and more obs

 ## Decision 10: Minimum Image Size Threshold

-**Decision**: Images smaller than 100×100 pixels are discarded and no chunk is created. This
-threshold filters out decorative elements (bullets, dividers, publisher logos) without a
-classification model.
+**Decision**: Images smaller than 100×100 pixels are discarded and no chunk is created. Marker
+returns PNG bytes; `FigureExtractionService` decodes to `BufferedImage` solely to check
+dimensions. This threshold filters out decorative elements without a classification model.

 **Rationale**: Neurosurgery textbook diagrams and MRI scans are never smaller than 100×100 px.
-The threshold is configurable via `app.figure-storage.min-image-size-px` in
-`application.properties`.
+The threshold is configurable via `app.figure-storage.min-image-size-px`.

 **Alternatives considered**:
 - No threshold → decorative icons pollute the figure index
 - ML-based classification → accurate but adds model dependency; not needed at POC scale
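
A minimal sketch of the size check described in this decision, assuming the image arrives as the base64 string Marker places in a block's `images` map. The class `ImageSizeFilterSketch` and its method names are hypothetical; only `java.util.Base64` and `javax.imageio.ImageIO` from the JDK are used. An image is kept only when both sides reach the threshold, and bytes that `ImageIO` cannot read are treated as too small.

```java
import java.awt.image.BufferedImage;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Base64;
import javax.imageio.ImageIO;

class ImageSizeFilterSketch {

    /** Decodes the base64 payload into raw PNG bytes. */
    static byte[] decode(String base64Png) {
        return Base64.getDecoder().decode(base64Png);
    }

    /** True when the bytes decode to an image whose width and height are both >= minSizePx. */
    static boolean isLargeEnough(byte[] imageBytes, int minSizePx) {
        try {
            BufferedImage image = ImageIO.read(new ByteArrayInputStream(imageBytes));
            if (image == null) {
                return false; // ImageIO found no reader for these bytes
            }
            return image.getWidth() >= minSizePx && image.getHeight() >= minSizePx;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Demo helper: builds a blank PNG of the given size and returns it base64-encoded. */
    static String blankPngBase64(int width, int height) {
        try {
            BufferedImage image = new BufferedImage(width, height, BufferedImage.TYPE_INT_RGB);
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            ImageIO.write(image, "png", out);
            return Base64.getEncoder().encodeToString(out.toByteArray());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        int minSizePx = 100; // mirrors app.figure-storage.min-image-size-px
        System.out.println(isLargeEnough(decode(blankPngBase64(120, 150)), minSizePx));
        System.out.println(isLargeEnough(decode(blankPngBase64(120, 40)), minSizePx));
    }
}
```

Decoding the full image only to read its dimensions is simple but allocates the whole bitmap; reading just the header through an `ImageReader` would be a possible refinement if memory becomes a concern.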
+
+---
+
+# Marker Study — Why Marker Replaces Google Document AI
+
+*Added 2026-04-04.*
+
+## What Marker Offers
+
+Marker is an open-source, locally-runnable PDF-to-structured-content converter that uses a
+pipeline of deep-learning models (surya for OCR + layout detection, texify for equations).
+Key capabilities relevant to this project:
+
+| Capability | Marker | Google Document AI |
+|-----------|--------|--------------------|
+| Multi-column reading order | ✅ | ✅ |
+| OCR on scanned pages | ✅ | ✅ |
+| Figure detection | ✅ returns pre-cropped images | ⚠️ returns bbox only; PDFBox still needed |
+| Table extraction | ✅ HTML tables | ✅ |
+| JSON output with image bytes | ✅ base64 in `images` map | ❌ |
+| No cloud credentials | ✅ | ❌ GCP service account required |
+| No per-page billing | ✅ | ❌ ~$10/1,000 pages |
+| Batch size limits | None (local) | 15 pages / 20 MB per sync call |
+| Setup | `pip install marker-pdf && marker_server` | GCP project + processor + IAM |
+
+---
+
+## Does Marker Solve the Current Pain Points?
+
+### Pain Point 1: Naive 50/50 Column Split
+
+**Answer: Yes, Marker fixes this completely.**
+
+`PdfStructureParser.extractPageText()` splits pages at the horizontal midpoint with a 20%
+threshold. This fails on asymmetric columns and scanned pages. Marker's surya layout model
+returns blocks in natural reading order — no heuristic needed.
+
+### Pain Point 2: Figure Detection Misses Rasterized Figures
+
+**Answer: Yes, Marker fixes this for most cases.**
+
+`FigureExtractionService` previously iterated PDF XObjects (only finds embedded XObject images,
+misses rasterized figures and vector-path drawings). Marker's layout model detects visual
+elements by type and returns the cropped image bytes directly — no PDFBox page rendering needed.
+
+### Pain Point 3: OCR on Scanned Pages
+
+**Answer: Yes, Marker handles scanned pages transparently via surya OCR.**
+
+### Pain Point 4: Caption Detection
+
+**Answer: Improved — Marker groups caption blocks with their figure block.**
+
+The `block_type = "Caption"` block appears as a sibling or child adjacent to the `"Figure"`
+block in the Marker JSON, making caption association structural rather than regex-based.
+
+---
+
+## Marker API Integration
+
+### Local Server Setup
+
+```bash
+pip install marker-pdf
+marker_server --port 8000
+```
+
+The server exposes `POST /marker/upload`, the endpoint this project is configured to call.
+
+### Request
+
+```
+POST http://localhost:8000/marker/upload
+Content-Type: multipart/form-data
+
+file=@document.pdf
+output_format=json
+```
+
+### Response (abbreviated)
+
+```json
+{
+  "output_format": "json",
+  "output": {
+    "block_type": "Document",
+    "children": [
+      {
+        "block_type": "Page",
+        "id": "/page/0/Page/0",
+        "children": [
+          {
+            "block_type": "SectionHeader",
+            "id": "/page/0/SectionHeader/0",
+            "html": "<h1>Cavernous Sinus Anatomy</h1>"
+          },
+          {
+            "block_type": "Text",
+            "id": "/page/0/Text/1",
+            "html": "<p>The cavernous sinus contains...</p>"
+          },
+          {
+            "block_type": "Figure",
+            "id": "/page/0/Figure/2",
+            "html": "<figure><img src='/page/0/Figure/2'/></figure>",
+            "images": {
+              "/page/0/Figure/2": "iVBORw0KGgo..."
+            }
+          },
+          {
+            "block_type": "Caption",
+            "id": "/page/0/Caption/3",
+            "html": "<p>Fig. 12-4. Coronal cross-section...</p>"
+          }
+        ]
+      }
+    ],
+    "metadata": { "page_stats": [...] }
+  }
+}
+```
+
+### Java Integration Pattern
+
+```java
+// MarkerPageParser — core call
+MultiValueMap<String, Object> body = new LinkedMultiValueMap<>();
+body.add("file", new FileSystemResource(pdfPath));
+body.add("output_format", "json");
+
+JsonNode response = restClient.post()
+    .uri(baseUrl + "/marker/upload")
+    .contentType(MediaType.MULTIPART_FORM_DATA)
+    .body(body)
+    .retrieve()
+    .body(JsonNode.class);
+
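+// RestClient's body(Class) can return null when the response has no body.
+if (response == null || !response.has("output")) {
+    throw new IllegalStateException("Marker returned no output for " + pdfPath);
+}
+JsonNode document = response.get("output");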
+```
+
+### Mapping Marker Blocks to PageResult
+
+```
+Page block (id "/page/N/Page/M") → PageResult(pageNumber = N+1)
+  SectionHeader children           → headingTitle (first match)
+  Text, TextInlineMath children    → orderedText (HTML stripped, joined \n\n)
+  Figure children with images map  → FigureData(imageBytes = base64decode(images[id]))
+  Caption sibling of Figure        → FigureData.nearestCaption
+```
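
The caption rule in the mapping above (a `Caption` block counts only when it immediately follows the figure among the page's children) can be sketched like this. The `Block` and `FigureRef` records are simplified, hypothetical stand-ins for the parsed Marker JSON and for `FigureData`; image bytes are left out to keep the pairing logic in focus.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

class CaptionPairingSketch {

    /** Simplified view of one Marker block: type, ID, and already HTML-stripped text. */
    record Block(String blockType, String id, String text) {}

    /** A figure's block ID plus the caption found right after it, or null. */
    record FigureRef(String blockId, String nearestCaption) {}

    private static final Set<String> FIGURE_TYPES = Set.of("Figure", "Picture");

    /** Walks a page's children in order and pairs each figure with the next sibling if it is a Caption. */
    static List<FigureRef> pairFigures(List<Block> pageChildren) {
        List<FigureRef> result = new ArrayList<>();
        for (int i = 0; i < pageChildren.size(); i++) {
            Block block = pageChildren.get(i);
            if (!FIGURE_TYPES.contains(block.blockType())) {
                continue;
            }
            String caption = null;
            if (i + 1 < pageChildren.size()
                    && "Caption".equals(pageChildren.get(i + 1).blockType())) {
                caption = pageChildren.get(i + 1).text();
            }
            result.add(new FigureRef(block.id(), caption));
        }
        return result;
    }

    public static void main(String[] args) {
        List<Block> page = List.of(
                new Block("Text", "/page/0/Text/1", "The cavernous sinus contains..."),
                new Block("Figure", "/page/0/Figure/2", ""),
                new Block("Caption", "/page/0/Caption/3", "Fig. 12-4. Coronal cross-section..."));
        pairFigures(page).forEach(System.out::println);
    }
}
```

This sketch only looks at the following sibling; captions that Marker nests as children of the figure block, or places before it, would need an extra rule.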
+
+---
+
+## Architecture Change
+
+```
+Before (Document AI — removed):
+  DocumentAiPageParser
+      → Google Document AI API (GCP, 15-page batches, credentials)
+      → returns text blocks + figure bboxes
+  PdfStructureParser (PDFBox column heuristic)
+  FigureExtractionService
+      → renders page via PDFBox at 150 DPI
+      → crops bbox region
+
+After (Marker):
+  MarkerPageParser
+      → POST PDF to http://localhost:8000/marker/upload (output_format=json)
+      → returns text blocks (correct reading order) + Figure blocks with base64 images
+      → produces List<PageResult> (same DTO, FigureData carries bytes not bbox)
+  FigureExtractionService (simplified)
+      → base64-decodes image bytes from PageResult.FigureData
+      → checks min size (ImageIO.read → getWidth/getHeight)
+      → saves to S3 via FigureStorageService (UNCHANGED)
+  VisionDescriptionService (UNCHANGED)
+  BookEmbeddingService orchestration (MINOR: inject MarkerPageParser)
+```
+
+**What is removed**:
+- `DocumentAiPageParser` — replaced by `MarkerPageParser`
+- `DocumentAiConfig` — replaced by `MarkerConfig`
+- `PdfStructureParser` — Marker handles reading order
+- `google-cloud-document-ai` Maven dependency
+- `app.document-ai.*` configuration properties
+
+**What stays the same**:
+- `PageResult` DTO structure (fields renamed, not restructured)
+- `FigureExtractionService` public interface
+- `TextChunkingService`, `VisionDescriptionService`, `BookEmbeddingService` orchestration
+- All JPA entities, repositories, vector store, S3 storage
+
+---
+
+## Constitution Compliance
+
+| Principle | Assessment |
+|-----------|------------|
+| **I. KISS** | ✅ Simpler than Document AI — one HTTP call replaces GCP SDK + PDFBox render loop. No new dependency beyond an HTTP client (Spring RestClient, already available). |
+| **II. Easy to Change** | ✅ `MarkerPageParser` is the only Marker-aware class. Swap it to use any other parser. `PageResult` DTO unchanged in contract. |
+| **III. Web-First** | ✅ Internal pipeline change; no API contract change. |
+| **IV. Documentation** | ✅ README must show Marker as a local external service dependency. |
+
+---
+
+## Risks & Mitigations
+
+| Risk | Likelihood | Mitigation |
+|------|-----------|------------|
+| Marker server not running when book is uploaded | Medium | `BookEmbeddingService` catches exception from `MarkerPageParser`, marks book as `FAILED`, logs full error. |
+| Marker misses some figures (complex PDFs) | Medium | `app.figure-storage.min-image-size-px` threshold can be tuned. Add fallback: if Marker returns 0 figures for a page with known images, log a warning. |
+| SC-003 (≤ 3× processing time) violated | Low | Marker runs locally (no network latency to cloud). Benchmark with a real 500-page book early. |
+| Large PDF upload to Marker (> 100 MB) | Low | Marker server handles the full file; no batching needed. The multipart upload limit is configurable. |
+| Marker image quality vs PDFBox crop | Low | Marker crops at native resolution; quality is equivalent or better than 150 DPI PDFBox render. |
@@ -48,12 +48,13 @@

 **Independent Test**: Upload a PDF containing at least one page with a labelled anatomical diagram. After status shows `READY`, call `GET /api/v1/books/{id}/figures` — response must contain at least one entry with `figureType`, `caption`, `page`, and `imageUrl` populated. Verify the PNG file exists at the path in `imagePath`.

- [X] T013 [US2] Create `PdfStructureParser` service in `backend/src/main/java/com/aiteacher/document/PdfStructureParser.java` — uses Spring AI's `PagePdfDocumentReader` to extract per-page text; groups pages into `SectionEntity` records using heading-detection heuristics (lines matching `^\d+(\.\d+)*\s+[A-Z]`); groups sections into `ChapterEntity` records; persists both to Postgres via `ChapterRepository` and `SectionRepository`; returns `List<SectionEntity>` for the book
- [X] T014 [US2] Create `FigureExtractionService` in `backend/src/main/java/com/aiteacher/document/FigureExtractionService.java` — opens PDF with PDFBox `PDDocument`; iterates pages; extracts `PDImageXObject` instances; skips images whose width or height are below `min-image-size-px`; classifies `FigureType` using the keyword-matching table from data-model.md §FigureType; parses caption from the nearest text line matching `CAPTION_PATTERN`; saves PNG via `FigureStorageService`; persists `FigureEntity` to `FigureRepository`; returns `List<FigureEntity>` per book
+- [X] T013 [US2] ~~Create `PdfStructureParser`~~ → **SUPERSEDED**: PDF parsing is handled by `MarkerPageParser` (see T013b). `PdfStructureParser` exists but is not wired into the pipeline.
+- [X] T013b [US2] Create `MarkerPageParser` in `backend/src/main/java/com/aiteacher/document/MarkerPageParser.java`: POSTs the PDF as `multipart/form-data` to `/marker/upload` on the Marker server (default `http://localhost:8000`) with form field `output_format=json` via Spring `RestClient`; parses the JSON response into `List<PageResult>` (one per page block); extracts heading, ordered text, and pre-cropped figure PNG bytes per page
+- [X] T014 [US2] Update `FigureExtractionService` in `backend/src/main/java/com/aiteacher/document/FigureExtractionService.java` — **Marker migration**: removed PDFBox rendering + bbox-crop loop; decodes PNG bytes from `PageResult.FigureData` via `ImageIO.read()`; skips images below `min-image-size-px`; classifies `FigureType`; saves via `FigureStorageService`; persists `FigureEntity`
 - [X] T015 [US2] Create `VisionDescriptionService` in `backend/src/main/java/com/aiteacher/document/VisionDescriptionService.java` — accepts a `Path` to a PNG and a caption String; calls the OpenAI vision model (via Spring AI `ChatClient` with image media type) to generate a 2–4 sentence clinical description; returns the generated description string; handles API failures by returning the caption as fallback
 - [X] T016 [US2] Create `TextChunkingService` in `backend/src/main/java/com/aiteacher/document/TextChunkingService.java` — accepts a `SectionEntity`; splits `fullText` into overlapping 400–600 token windows (20-token overlap); wraps each window in a Spring AI `Document` with the flat metadata map defined in data-model.md §Text chunk document; returns `List<Document>`
 - [X] T017 [US2] Create `ChunkFigureRefService` in `backend/src/main/java/com/aiteacher/document/ChunkFigureRefService.java` — accepts a Spring AI `Document` (with its `id` as `chunkId`) and a `List<FigureEntity>` for the book; scans chunk text for patterns `Fig\.\s*\d+[\-\.]\d+` and `Figure\s+\d+[\-\.]\d+`; matches against figure labels; persists `ChunkFigureRefEntity` rows via `ChunkFigureRefRepository`
- [X] T018 [US2] Rewrite `BookEmbeddingService.embedBook()` in `backend/src/main/java/com/aiteacher/book/BookEmbeddingService.java` to orchestrate the full pipeline: (1) `PdfStructureParser` → sections; (2) parallel: `FigureExtractionService` + `TextChunkingService` for each section; (3) `VisionDescriptionService` for each figure; (4) embed figure captions+descriptions as `Document`s (metadata per data-model.md §Figure caption document) into `vectorStore`; (5) embed text chunks into `vectorStore`; (6) `ChunkFigureRefService` for each chunk; update `captionEmbeddingId` on `FigureEntity` after embedding
+- [X] T018 [US2] Update `BookEmbeddingService.embedBook()` — **Marker migration**: injected `MarkerPageParser` replacing `DocumentAiPageParser`; updated `figureExtractionService.extract()` call (removed `pdfPath` arg); updated log message. Pipeline: (1) `MarkerPageParser` → `List<PageResult>`; (2) `buildAndSaveSections()` → sections; (3) `TextChunkingService` → chunks → embed; (4) `FigureExtractionService.extract()` → figures; (5) `VisionDescriptionService` → embed figure chunks; (6) `ChunkFigureRefService` → refs
 - [X] T019 [US2] Extend `BookEmbeddingService.deleteBookChunks()` to also delete: all `ChunkFigureRefEntity` rows (via `findByFigureIdIn`), all `FigureEntity` rows (via `deleteAllByBookId`), all figure PNG files (via `FigureStorageService.delete(bookId)`), all `SectionEntity` and `ChapterEntity` rows for the book
 - [X] T020 [US2] Add `POST /api/v1/books/{id}/reembed` endpoint to `BookController` in `backend/src/main/java/com/aiteacher/book/BookController.java` — returns `202` with `{ bookId, status: "PROCESSING" }`; returns `404` if not found; returns `409` if already `PROCESSING`; calls `deleteBookChunks()` then `embedBook()` asynchronously
Author	SHA1	Message	Date
Adrien	e5d53b4e80	add possibility to disable delete and upload of books	2026-04-06 14:09:17 +02:00
Adrien	5c641f4bcc	enhance page parsing using json output and html	2026-04-05 21:55:30 +02:00
Adrien	ea1276dc2e	adding Marker to parse effectively pdf	2026-04-04 21:30:18 +02:00
Adrien	b154e29f2d	s3 bucket integration for image storage	2026-04-04 13:26:55 +02:00