This commit is contained in:
Adrien
2026-03-31 15:42:49 +02:00
parent 3507ce27e5
commit dc0bcab36e
10 changed files with 1246 additions and 0 deletions

CLAUDE.md Normal file

@@ -0,0 +1,29 @@
# ai-teacher Development Guidelines
Auto-generated from all feature plans. Last updated: 2026-03-31
## Active Technologies
- Java 21 (backend), TypeScript / Node 20 (frontend) (001-neuro-rag-learning)
## Project Structure
```text
src/
tests/
```
## Commands
```sh
npm test && npm run lint
```
## Code Style
Java 21 (backend), TypeScript / Node 20 (frontend): Follow standard conventions
## Recent Changes
- 001-neuro-rag-learning: Added Java 21 (backend), TypeScript / Node 20 (frontend)
<!-- MANUAL ADDITIONS START -->
<!-- MANUAL ADDITIONS END -->


@@ -0,0 +1,35 @@
# Specification Quality Checklist: Neurosurgeon RAG Learning Platform
**Purpose**: Validate specification completeness and quality before proceeding to planning
**Created**: 2026-03-31
**Feature**: [spec.md](../spec.md)
## Content Quality
- [x] No implementation details (languages, frameworks, APIs)
- [x] Focused on user value and business needs
- [x] Written for non-technical stakeholders
- [x] All mandatory sections completed
## Requirement Completeness
- [x] No [NEEDS CLARIFICATION] markers remain — resolved 2026-03-31
- [x] Requirements are testable and unambiguous
- [x] Success criteria are measurable
- [x] Success criteria are technology-agnostic (no implementation details)
- [x] All acceptance scenarios are defined
- [x] Edge cases are identified
- [x] Scope is clearly bounded
- [x] Dependencies and assumptions identified
## Feature Readiness
- [x] All functional requirements have clear acceptance criteria
- [x] User scenarios cover primary flows
- [x] Feature meets measurable outcomes defined in Success Criteria
- [x] No implementation details leak into specification
## Notes
- All items pass. Spec is ready for `/speckit.plan`.
- Resolved 2026-03-31: shared global library (Q1-A), simple shared-password auth (Q2-B).


@@ -0,0 +1,103 @@
# Contract: Books API
**Base path**: `/api/v1/books`
**Auth**: HTTP Basic (shared credential) required on all endpoints.
---
## POST /api/v1/books
Upload a new book for embedding.
**Request**: `multipart/form-data`
| Field | Type | Required | Notes |
|-------|------|----------|-------|
| file | File | Yes | PDF only; max 100 MB |
**Response 202 Accepted**:
```json
{
"id": "3fa85f64-5717-4562-b3fc-2c963f66afa6",
"title": "Principles of Neurosurgery",
"fileName": "principles-of-neurosurgery.pdf",
"status": "PENDING",
"uploadedAt": "2026-03-31T10:00:00Z"
}
```
**Response 400 Bad Request** (unsupported format):
```json
{ "error": "Only PDF files are accepted." }
```
**Response 413 Payload Too Large**:
```json
{ "error": "File exceeds maximum size of 100 MB." }
```
---
## GET /api/v1/books
List all uploaded books with their current processing status.
**Response 200 OK**:
```json
[
{
"id": "3fa85f64-5717-4562-b3fc-2c963f66afa6",
"title": "Principles of Neurosurgery",
"fileName": "principles-of-neurosurgery.pdf",
"fileSizeBytes": 45234567,
"pageCount": 842,
"status": "READY",
"uploadedAt": "2026-03-31T10:00:00Z",
"processedAt": "2026-03-31T10:07:23Z"
}
]
```
---
## GET /api/v1/books/{id}
Get status and metadata for a single book.
**Path param**: `id` — UUID
**Response 200 OK**: Same shape as a single item in the list above, plus optional `errorMessage` field when `status = FAILED`.
**Response 404 Not Found**:
```json
{ "error": "Book not found." }
```
---
## DELETE /api/v1/books/{id}
Delete a book and all its embedding chunks.
**Response 204 No Content**: Book deleted.
**Response 404 Not Found**:
```json
{ "error": "Book not found." }
```
**Response 409 Conflict** (book currently processing):
```json
{ "error": "Cannot delete a book that is currently being processed." }
```
---
## Status Lifecycle
```
PENDING → returned immediately after upload
PROCESSING → set when embedding pipeline starts
READY → set when all chunks are embedded
FAILED → set on any unrecoverable error; errorMessage populated
```
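The lifecycle above can be pinned down as an enum with explicit transition checks — a sketch only; the enum and method names are assumptions, not part of the contract:

```java
// Illustrative sketch of the status lifecycle (names are assumptions).
enum BookStatus {
    PENDING, PROCESSING, READY, FAILED;

    // True when moving from this status to `next` follows the lifecycle above.
    boolean canTransitionTo(BookStatus next) {
        return switch (this) {
            case PENDING     -> next == PROCESSING;
            case PROCESSING  -> next == READY || next == FAILED;
            case READY, FAILED -> false; // terminal for the POC
        };
    }
}
```

A `FAILED` book is not re-processed in place; per the delete endpoint above, it is deleted and re-uploaded as a new book.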


@@ -0,0 +1,133 @@
# Contract: Chat API
**Base path**: `/api/v1/chat`
**Auth**: HTTP Basic (shared credential) required on all endpoints.
---
## POST /api/v1/chat/sessions
Create a new chat session, optionally tied to a topic.
**Request body** (`application/json`):
```json
{
"topicId": "cerebral-aneurysm"
}
```
`topicId` is optional. If omitted, the session is free-form (any neurosurgery question).
**Response 201 Created**:
```json
{
"sessionId": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
"topicId": "cerebral-aneurysm",
"createdAt": "2026-03-31T10:20:00Z"
}
```
---
## GET /api/v1/chat/sessions/{sessionId}/messages
Retrieve the full message history for a session, ordered chronologically.
**Path param**: `sessionId` — UUID
**Response 200 OK**:
```json
[
{
"id": "msg-uuid-1",
"role": "USER",
"content": "What are the Hunt and Hess grading criteria?",
"createdAt": "2026-03-31T10:21:00Z"
},
{
"id": "msg-uuid-2",
"role": "ASSISTANT",
"content": "The Hunt and Hess scale grades subarachnoid haemorrhage severity...",
"sources": [
{ "bookTitle": "Principles of Neurosurgery", "page": 318 }
],
"createdAt": "2026-03-31T10:21:04Z"
}
]
```
**Response 404 Not Found**:
```json
{ "error": "Session not found." }
```
---
## POST /api/v1/chat/sessions/{sessionId}/messages
Send a user message and receive an AI response grounded in uploaded books.
**Path param**: `sessionId` — UUID
**Request body** (`application/json`):
```json
{
"content": "What are the Hunt and Hess grading criteria?"
}
```
**Response 200 OK**:
```json
{
"id": "msg-uuid-2",
"role": "ASSISTANT",
"content": "The Hunt and Hess scale grades subarachnoid haemorrhage severity on a scale of IV...",
"sources": [
{ "bookTitle": "Principles of Neurosurgery", "page": 318 }
],
"createdAt": "2026-03-31T10:21:04Z"
}
```
**When no source found** — response still 200 OK, but `sources` is empty and `content` explicitly states the limitation:
```json
{
"id": "msg-uuid-3",
"role": "ASSISTANT",
"content": "I could not find relevant information about this topic in the uploaded books.",
"sources": [],
"createdAt": "2026-03-31T10:22:00Z"
}
```
**Response 404 Not Found**:
```json
{ "error": "Session not found." }
```
**Response 503 Service Unavailable** (no READY books):
```json
{ "error": "No books are available as knowledge sources." }
```
---
## DELETE /api/v1/chat/sessions/{sessionId}
Delete a session and all its messages (clear conversation history).
**Response 204 No Content**: Session deleted.
**Response 404 Not Found**:
```json
{ "error": "Session not found." }
```
---
## Notes
- Responses are synchronous for the POC; no streaming or SSE required at this stage.
- The backend includes the full conversation history (all prior messages in the session)
in the LLM context window to maintain multi-turn coherence.
- The AI is instructed via system prompt to answer **only** from retrieved book chunks
and to explicitly state when no relevant context was found.
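The full-history rule in the notes above reduces to flattening prior turns into the context handed to the LLM. A framework-free sketch (the class, record, and prompt layout are assumptions for illustration):

```java
import java.util.List;

// Illustrative: flatten session history plus the new message into one LLM context string.
class PromptAssembler {
    record Msg(String role, String content) {}

    static String buildContext(List<Msg> history, String newUserMessage) {
        StringBuilder sb = new StringBuilder();
        for (Msg m : history) {
            sb.append(m.role()).append(": ").append(m.content()).append('\n');
        }
        sb.append("USER: ").append(newUserMessage);
        return sb.toString();
    }
}
```

In the real backend this string would be built from `Message` rows ordered by `created_at` and combined with the retrieved book chunks before the LLM call.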


@@ -0,0 +1,74 @@
# Contract: Topics API
**Base path**: `/api/v1/topics`
**Auth**: HTTP Basic (shared credential) required on all endpoints.
---
## GET /api/v1/topics
List all predefined neurosurgery topics.
**Response 200 OK**:
```json
[
{
"id": "cerebral-aneurysm",
"name": "Cerebral Aneurysm Management",
"description": "Diagnosis, grading, and surgical/endovascular treatment of cerebral aneurysms.",
"category": "Vascular"
},
{
"id": "glioblastoma",
"name": "Glioblastoma (GBM)",
"description": "Pathophysiology, surgical resection strategies, and adjuvant therapy for GBM.",
"category": "Oncology"
}
]
```
---
## POST /api/v1/topics/{id}/summary
Generate an AI summary for the selected topic by cross-referencing all READY books.
**Path param**: `id` — topic slug (e.g., `cerebral-aneurysm`)
**Response 200 OK**:
```json
{
"topicId": "cerebral-aneurysm",
"topicName": "Cerebral Aneurysm Management",
"summary": "Cerebral aneurysms are focal dilations of intracranial arteries...",
"sources": [
{
"bookTitle": "Principles of Neurosurgery",
"page": 312
},
{
"bookTitle": "Youmans and Winn Neurological Surgery",
"page": 1048
}
],
"generatedAt": "2026-03-31T10:15:00Z"
}
```
**Response 404 Not Found** (unknown topic):
```json
{ "error": "Topic not found." }
```
**Response 503 Service Unavailable** (no READY books):
```json
{ "error": "No books are available as knowledge sources. Please upload and process at least one book." }
```
---
## Notes
- Topics are read-only; they cannot be created or deleted via the API.
- The topic list is loaded from `topics.json` at application startup.
- Summary generation is synchronous for the POC (< 30 s per SC-002); no polling needed.


@@ -0,0 +1,165 @@
# Data Model: Neurosurgeon RAG Learning Platform
**Branch**: `001-neuro-rag-learning`
**Date**: 2026-03-31
## Entities
### Book
Represents an uploaded medical textbook.
| Field | Type | Constraints | Notes |
|-------|------|-------------|-------|
| id | UUID | PK, generated | |
| title | VARCHAR(500) | NOT NULL | Extracted from PDF metadata or filename |
| file_name | VARCHAR(500) | NOT NULL | Original upload filename |
| file_size_bytes | BIGINT | NOT NULL | |
| page_count | INT | nullable | Populated after processing |
| status | ENUM | NOT NULL | `PENDING`, `PROCESSING`, `READY`, `FAILED` |
| error_message | TEXT | nullable | Populated if status = FAILED |
| uploaded_at | TIMESTAMPTZ | NOT NULL, default now() | |
| processed_at | TIMESTAMPTZ | nullable | When embedding completed |
**State machine**:
```
PENDING → PROCESSING → READY
↘ FAILED
```
**Business rules**:
- Only books in `READY` status are used as RAG sources.
- A `FAILED` book can be deleted and re-uploaded.
- `title` defaults to the filename (without extension) if PDF metadata is absent.
---
### EmbeddingChunk
A semantically coherent segment of a book's content stored as a vector embedding.
Managed by Spring AI's pgvector `VectorStore` — the table is auto-created by Spring AI.
| Field | Type | Notes |
|-------|------|-------|
| id | UUID | PK |
| content | TEXT | Raw text of the chunk (passage or diagram caption + surrounding text) |
| embedding | VECTOR(1536) | pgvector column; dimension matches the embedding model |
| metadata | JSONB | `{ "book_id": "…", "book_title": "…", "page": N, "chunk_type": "text\|diagram" }` |
**Notes**:
- `chunk_type = diagram` means the chunk was derived from a diagram caption and adjacent descriptive text.
- All chunks for a given book are deleted when the book is deleted.
- Spring AI manages this table; direct access is through `VectorStore.similaritySearch(…)`.
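The metadata shape above can be enforced at the point where chunks are built, before handing them to the vector store — a sketch with an assumed helper name:

```java
import java.util.Map;

// Illustrative: build the per-chunk metadata map described above.
class ChunkMetadata {
    static Map<String, Object> of(String bookId, String bookTitle, int page, String chunkType) {
        if (!chunkType.equals("text") && !chunkType.equals("diagram")) {
            throw new IllegalArgumentException("chunk_type must be 'text' or 'diagram'");
        }
        return Map.of(
            "book_id", bookId,
            "book_title", bookTitle,
            "page", page,
            "chunk_type", chunkType
        );
    }
}
```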
---
### Topic
A predefined neurosurgery learning subject. **Not stored in the database** for the POC —
loaded at startup from `backend/src/main/resources/topics.json`.
| Field | Type | Notes |
|-------|------|-------|
| id | String | Slug, e.g., `cerebral-aneurysm` |
| name | String | Display name, e.g., "Cerebral Aneurysm Management" |
| description | String | One-sentence description |
| category | String | Grouping label, e.g., "Vascular", "Oncology" |
**Business rules**:
- Topics are read-only from the application's perspective.
- The project owner edits `topics.json` to add/remove topics; no admin UI is needed.
---
### ChatSession
A conversation thread, optionally associated with a topic.
| Field | Type | Constraints | Notes |
|-------|------|-------------|-------|
| id | UUID | PK, generated | |
| topic_id | VARCHAR(100) | nullable | References a topic slug; null = free-form chat |
| created_at | TIMESTAMPTZ | NOT NULL, default now() | |
---
### Message
A single turn in a chat session.
| Field | Type | Constraints | Notes |
|-------|------|-------------|-------|
| id | UUID | PK, generated | |
| session_id | UUID | FK → ChatSession.id, ON DELETE CASCADE | |
| role | ENUM | NOT NULL | `USER`, `ASSISTANT` |
| content | TEXT | NOT NULL | |
| sources | JSONB | nullable | Array of `{ "book_title": "…", "page": N }` for ASSISTANT messages |
| created_at | TIMESTAMPTZ | NOT NULL, default now() | |
**Business rules**:
- Messages are ordered by `created_at` ASC within a session.
- `sources` is only populated on `ASSISTANT` messages.
- Deleting a session cascades to delete all its messages.
---
## Relationships
```
Book (1) ──────────────── (N) EmbeddingChunk
(via metadata.book_id)
Topic (config file) ──────── (N) ChatSession [optional association]
ChatSession (1) ──────────── (N) Message
```
---
## Database Schema (DDL summary)
```sql
-- Spring AI creates the vector table automatically.
-- Application-managed tables:
CREATE TABLE book (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
title VARCHAR(500) NOT NULL,
file_name VARCHAR(500) NOT NULL,
file_size_bytes BIGINT NOT NULL,
page_count INT,
status VARCHAR(20) NOT NULL DEFAULT 'PENDING',
error_message TEXT,
uploaded_at TIMESTAMPTZ NOT NULL DEFAULT now(),
processed_at TIMESTAMPTZ
);
CREATE TABLE chat_session (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
topic_id VARCHAR(100),
created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
CREATE TABLE message (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
session_id UUID NOT NULL REFERENCES chat_session(id) ON DELETE CASCADE,
role VARCHAR(10) NOT NULL,
content TEXT NOT NULL,
sources JSONB,
created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
CREATE INDEX idx_message_session ON message(session_id, created_at);
```
---
## Validation Rules
| Entity | Field | Rule |
|--------|-------|------|
| Book | status | MUST be one of `PENDING`, `PROCESSING`, `READY`, `FAILED` |
| Book | file_name | MUST end in `.pdf` (case-insensitive) |
| Message | role | MUST be `USER` or `ASSISTANT` |
| EmbeddingChunk | metadata.chunk_type | MUST be `text` or `diagram` |
| ChatSession | topic_id | If non-null, MUST match a known topic slug from `topics.json` |
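Several of these rules reduce to one-line predicates; sketched here framework-free (class and method names are assumptions):

```java
import java.util.Set;

// Illustrative predicates for the validation rules above.
class Validation {
    static final Set<String> BOOK_STATUSES = Set.of("PENDING", "PROCESSING", "READY", "FAILED");
    static final Set<String> ROLES = Set.of("USER", "ASSISTANT");

    // Book.file_name MUST end in .pdf (case-insensitive)
    static boolean isPdfFileName(String fileName) {
        return fileName != null && fileName.toLowerCase().endsWith(".pdf");
    }

    // Message.role MUST be USER or ASSISTANT
    static boolean isValidRole(String role) {
        return ROLES.contains(role);
    }
}
```

The `topic_id` rule is the one check that needs external state (the slugs loaded from `topics.json`), so it belongs in a service rather than a static predicate.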


@@ -0,0 +1,103 @@
# Implementation Plan: Neurosurgeon RAG Learning Platform
**Branch**: `001-neuro-rag-learning` | **Date**: 2026-03-31 | **Spec**: [spec.md](spec.md)
**Input**: Feature specification from `/specs/001-neuro-rag-learning/spec.md`
## Summary
Build a web application for neurosurgeons to upload medical textbooks (PDF), have them
precisely embedded (text + diagram captions) into a pgvector store, then select from a
predefined topic list to receive an AI-generated cross-book summary, and engage in a
grounded RAG chat. Monorepo: Vue.js 3 frontend + Spring Boot 4 / Spring AI backend +
PostgreSQL with pgvector.
## Technical Context
**Language/Version**: Java 21 (backend), TypeScript / Node 20 (frontend)
**Primary Dependencies**:
- Backend: Spring Boot **4.0.5**, Spring AI **1.1.4** (BOM), Spring Security, `spring-ai-pdf-document-reader`, Spring Data JPA, Flyway
- Frontend: Vue.js 3, Vite, Pinia, Vue Router, Axios
**Storage**: PostgreSQL 16 with pgvector extension (provided externally)
**Testing**: JUnit 5 + Spring Boot Test (backend); Vitest + Vue Test Utils (frontend)
**Target Platform**: Linux server / Docker-compose local dev
**Project Type**: Web application — monorepo with `backend/` and `frontend/` at repo root
**Performance Goals**:
- PDF processing: < 10 min per 500-page textbook (SC-001)
- Topic summary generation: < 30 s (SC-002)
**Constraints**: POC scale (< 10 concurrent users); shared-password auth (no per-user accounts)
**Scale/Scope**: Single shared book library; HTTP Basic with single in-memory user
## Constitution Check
*GATE: Must pass before Phase 0 research. Re-checked after Phase 1 design.*
| Principle | Status | Notes |
|-----------|--------|-------|
| I. KISS | ✅ PASS | Two deployable units only (backend + frontend). No microservices. Shared library — no per-user isolation complexity. |
| II. Easy to Change | ✅ PASS | Spring AI abstracts the LLM provider behind `ChatClient` / `EmbeddingModel` interfaces — swappable. PDF parser isolated in a service class. |
| III. Web-First Architecture | ✅ PASS | REST API (`/api/v1/…`) backend; Vue.js SPA frontend communicating via API only. |
| IV. Documentation as Architecture | ⚠ PENDING | README.md with Mermaid system-context diagram MUST be created in this PR. See quickstart.md. |
| V. POC Validation Gate | ✅ PASS | All 3 user stories have defined manual smoke tests in spec.md. |
## Project Structure
### Documentation (this feature)
```text
specs/001-neuro-rag-learning/
├── plan.md # This file
├── research.md # Phase 0 output
├── data-model.md # Phase 1 output
├── quickstart.md # Phase 1 output
├── contracts/ # Phase 1 output
│ ├── books-api.md
│ ├── topics-api.md
│ └── chat-api.md
└── tasks.md # Phase 2 output (/speckit.tasks — NOT created here)
```
### Source Code (repository root)
```text
backend/
├── src/
│ ├── main/
│ │ ├── java/com/aiteacher/
│ │ │ ├── book/ # Book upload, processing, embedding
│ │ │ ├── topic/ # Topic list and summary generation
│ │ │ ├── chat/ # Chat session and RAG chat
│ │ │ ├── config/ # Spring AI, Security, Web config
│ │ │ └── AiTeacherApplication.java
│ │ └── resources/
│ │ ├── application.properties
│ │ └── topics.json # Predefined topic list (config file)
│ └── test/
│ └── java/com/aiteacher/
├── pom.xml
└── Dockerfile
frontend/
├── src/
│ ├── components/
│ ├── views/ # UploadView, TopicsView, ChatView
│ ├── stores/ # Pinia: bookStore, topicStore, chatStore
│ ├── services/ # api.ts — Axios wrapper
│ ├── router/
│ └── main.ts
├── index.html
├── vite.config.ts
├── package.json
└── Dockerfile
README.md # Architecture + Mermaid diagrams (required by Principle IV)
docker-compose.yml # Local dev: backend + frontend + postgres
```
**Structure Decision**: Web application layout (Option 2). Two deployable units at repo root
(`backend/`, `frontend/`). No third project. Internal packages organised by domain slice
(book / topic / chat) rather than layer, so each slice is self-contained and easy to change
(Principle II).
## Complexity Tracking
> No Constitution violations requiring justification.


@@ -0,0 +1,142 @@
# Quickstart: Neurosurgeon RAG Learning Platform
**Branch**: `001-neuro-rag-learning`
**Date**: 2026-03-31
This guide covers how to run the full stack locally and validate each user story manually
(per Principle V — POC Validation Gate).
---
## Prerequisites
| Tool | Version | Notes |
|------|---------|-------|
| Java | 21+ | Check: `java -version` |
| Maven | 3.9+ | Check: `mvn -version` |
| Node.js | 20+ | Check: `node -version` |
| Docker + Compose | any recent | PostgreSQL with pgvector is provided via Docker |
| PostgreSQL + pgvector | provided | See environment setup below |
---
## Environment Setup
1. **Start the database** (pgvector already provided — configure connection):
```properties
# backend/src/main/resources/application.properties
spring.datasource.url=jdbc:postgresql://localhost:5432/aiteacher
spring.datasource.username=aiteacher
spring.datasource.password=<your-password>
# Spring AI — vector store
spring.ai.vectorstore.pgvector.dimensions=1536
spring.ai.vectorstore.pgvector.distance-type=COSINE_DISTANCE
# Spring AI — LLM provider (example: OpenAI)
spring.ai.openai.api-key=${OPENAI_API_KEY}
spring.ai.openai.chat.options.model=gpt-4o
spring.ai.openai.embedding.options.model=text-embedding-3-small
# Shared password auth
spring.security.user.name=neurosurgeon
spring.security.user.password=${APP_PASSWORD}
```
2. **Set environment variables**:
```bash
export OPENAI_API_KEY=sk-...
export APP_PASSWORD=changeme
```
---
## Running Locally
### Backend
```bash
cd backend
mvn spring-boot:run
# API available at http://localhost:8080
```
### Frontend
```bash
cd frontend
npm install
npm run dev
# UI available at http://localhost:5173
```
---
## Smoke Tests (Principle V Validation)
Run these after the full stack is up to confirm each user story works end-to-end.
### Smoke Test 1 — Book Upload & Embedding (US1 / P1)
1. Open `http://localhost:5173` in a browser.
2. Navigate to the **Library** section.
3. Upload a PDF textbook (any medical PDF, 10+ pages with diagrams).
4. Observe status changes: `PENDING` → `PROCESSING` → `READY`.
5. **Pass criteria**: Book appears as `READY` within 10 minutes.
6. **Diagram check**: Ask a topic question in Smoke Test 2 that references a diagram
caption from the book — confirm it surfaces.
### Smoke Test 2 — Topic Summary (US2 / P2)
1. At least one book MUST be in `READY` state.
2. Navigate to the **Topics** section.
3. Select any topic from the list.
4. Click **Generate Summary**.
5. **Pass criteria**:
- Summary appears within 30 seconds.
- At least one source citation is shown with a book title and page number.
- The summary content is visibly related to the selected topic.
6. **No-source check**: Select a topic completely unrelated to your uploaded book.
Confirm the system responds with a clear "no relevant content found" message rather
than a hallucinated answer.
### Smoke Test 3 — Knowledge Chat (US3 / P3)
1. Navigate to the **Chat** section.
2. Start a new session (optionally tied to a topic).
3. Ask a specific clinical question answerable from the uploaded book.
4. **Pass criteria**:
- Response arrives within 30 seconds.
- Response cites a source book and page number.
5. Ask a follow-up question that references the previous answer
(e.g., "Can you expand on the grading scale you mentioned?").
6. **Pass criteria**: The response is coherent and contextually connected to the prior turn.
7. Ask something completely outside the books' content.
8. **Pass criteria**: The system explicitly states it cannot find relevant information
(not a hallucinated answer).
9. Use **Clear conversation** and verify the session resets.
---
## README Architecture Requirement (Principle IV)
The `README.md` at the repo root MUST contain at minimum this system-context diagram
(update as the architecture evolves):
```mermaid
graph TD
User["Neurosurgeon (Browser)"]
FE["Frontend\nVue.js 3 / Vite\n:5173"]
BE["Backend\nSpring Boot 4 / Spring AI\n:8080"]
DB["PostgreSQL + pgvector\n(provided)"]
LLM["LLM Provider\n(OpenAI / configurable)"]
User -->|HTTP| FE
FE -->|REST /api/v1/...| BE
BE -->|JDBC / pgvector| DB
BE -->|Embedding + Chat API| LLM
```
Copy this diagram into `README.md` to satisfy Principle IV before the PR is merged.


@@ -0,0 +1,281 @@
# Research: Neurosurgeon RAG Learning Platform
**Branch**: `001-neuro-rag-learning`
**Date**: 2026-03-31
---
## 1. Spring Boot 4 + Spring AI Versions & BOM
**Decision**: Spring Boot **4.0.5** + Spring AI **1.1.4**.
**Rationale**: Spring Boot 4.0.5 is GA (released February 2026) — this matches the user's
original requirement. Spring AI 1.1.4 is the current stable release compatible with
Spring Boot 4.x. Spring AI 2.0.0-M4 is available in preview but not used (KISS — no
preview dependencies).
**Maven BOM** (in `<dependencyManagement>`):
```xml
<dependency>
<groupId>org.springframework.ai</groupId>
<artifactId>spring-ai-bom</artifactId>
<version>1.1.4</version>
<type>pom</type>
<scope>import</scope>
</dependency>
```
**Key starters** (versions managed by BOM):
```xml
<!-- pgvector vector store -->
<dependency>
<groupId>org.springframework.ai</groupId>
<artifactId>spring-ai-starter-vector-store-pgvector</artifactId>
</dependency>
<!-- OpenAI (embedding + chat; swap for any other provider) -->
<dependency>
<groupId>org.springframework.ai</groupId>
<artifactId>spring-ai-starter-model-openai</artifactId>
</dependency>
<!-- PDF document reader (Spring AI native, Apache PDFBox-based) -->
<dependency>
<groupId>org.springframework.ai</groupId>
<artifactId>spring-ai-pdf-document-reader</artifactId>
</dependency>
<!-- PostgreSQL JDBC driver -->
<dependency>
<groupId>org.postgresql</groupId>
<artifactId>postgresql</artifactId>
<scope>runtime</scope>
</dependency>
```
**Alternatives considered**: Spring AI 2.0.0-M4 — rejected (milestone, KISS principle).
---
## 2. Spring AI RAG Pipeline
**Decision**: Use Spring AI's `PagePdfDocumentReader`, `EmbeddingModel`, `PgVectorStore`,
and `ChatClient` with `QuestionAnswerAdvisor` for RAG.
**Key classes**:
| Component | Class / Interface | Purpose |
|-----------|-------------------|---------|
| Document ingestion | `PagePdfDocumentReader` | Parse PDF pages to `Document` objects |
| Chunking | `TokenTextSplitter` | Split documents into token-bounded chunks |
| Embedding | `EmbeddingModel` | Convert text chunks to vectors |
| Storage | `VectorStore` / `PgVectorStore` | Persist and search embeddings |
| RAG query | `QuestionAnswerAdvisor` | Augments prompt with retrieved context |
| Chat | `ChatClient` | Fluent API for LLM interactions |
**RAG pipeline flow**:
```
PDF file
→ PagePdfDocumentReader (extract text per page as Document)
  → TokenTextSplitter (chunk to embedding token limit)
→ EmbeddingModel.embed() (vectorise each chunk)
→ PgVectorStore.add() (persist chunk + vector + metadata)
User query
→ ChatClient.prompt()
.advisors(new QuestionAnswerAdvisor(vectorStore))
.user(question)
.call()
→ QuestionAnswerAdvisor runs similaritySearch, injects context
→ ChatModel generates response grounded in retrieved chunks
```
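Stripped of Spring AI types, the advisor's augmentation step amounts to: retrieve the top-k chunks, then prepend them to the user question under a grounding instruction. A framework-free sketch (the class name and prompt template are assumptions, not Spring AI's actual template):

```java
import java.util.List;

// Illustrative: inject retrieved chunks ahead of the user question.
class RagPrompt {
    static String augment(String question, List<String> retrievedChunks) {
        String context = String.join("\n---\n", retrievedChunks);
        return "Answer ONLY from the context below. If the context is empty or "
             + "irrelevant, say you could not find relevant information.\n\n"
             + "Context:\n" + context + "\n\nQuestion: " + question;
    }
}
```

`QuestionAnswerAdvisor` does this transparently inside the `ChatClient` call chain, which is why the controller code never touches retrieval directly.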
**application.properties**:
```properties
spring.ai.vectorstore.pgvector.dimensions=1536
spring.ai.vectorstore.pgvector.distance-type=COSINE_DISTANCE
spring.ai.vectorstore.pgvector.index-type=HNSW
spring.ai.vectorstore.pgvector.initialize-schema=true
spring.ai.openai.api-key=${OPENAI_API_KEY}
spring.ai.openai.chat.options.model=gpt-4o
spring.ai.openai.embedding.options.model=text-embedding-3-small
```
**Rationale**: `QuestionAnswerAdvisor` is the Spring AI-idiomatic RAG pattern — zero
boilerplate. `EmbeddingModel` and `ChatClient` are interfaces; swapping the LLM provider
is a single property change (Principle II).
---
## 3. PDF Ingestion & Chunking
**Decision**: `PagePdfDocumentReader` (from `spring-ai-pdf-document-reader`) for text
extraction; `TokenTextSplitter` for chunking.
**Approach**:
```java
PagePdfDocumentReader reader = new PagePdfDocumentReader(
new FileSystemResource("textbook.pdf"),
PdfDocumentReaderConfig.builder()
.withPagesPerDocument(1) // one Document per page
.build()
);
List<Document> pages = reader.get();
List<Document> chunks = new TokenTextSplitter().apply(pages); // token-bounded chunking
vectorStore.add(chunks); // embedding + request batching handled internally
```
- Each `Document` carries metadata: source filename, page number.
- `TokenTextSplitter` keeps each chunk within the embedding model's input limit
  (~8,000 tokens for OpenAI embedding models).
- Custom metadata (`book_id`, `book_title`, `chunk_type`) is added before calling
`vectorStore.add()`.
**Rationale**: `PagePdfDocumentReader` is the recommended Spring AI PDF reader for
text-focused RAG — lighter than the Tika reader and purpose-built for PDFs.
**Alternatives considered**: `TikaDocumentReader` — provides multi-format support but is
heavier; rejected for POC (only PDFs are in scope).
---
## 4. Diagram / Visual Content Handling
**Decision**: Extract diagram captions and surrounding text as text chunks tagged
`chunk_type=diagram`. No pixel-level image embedding for the POC.
**Approach**:
- `PagePdfDocumentReader` extracts all text including figure captions
(e.g., `"Figure 3.2: Circle of Willis anatomy..."`).
- A post-processing step identifies lines matching caption patterns
(`^(Figure|Fig\.|Table|Diagram)\s+[\d.]+`) and tags those `Document` objects with
`metadata.put("chunk_type", "diagram")`.
- The caption text plus the surrounding descriptive paragraph are included in the chunk,
making the diagram content semantically searchable.
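The caption pattern quoted above can be exercised directly — a minimal sketch of the tagging check (class and method names are assumptions):

```java
import java.util.regex.Pattern;

// Illustrative: detect figure/table caption lines per the pattern above.
class CaptionDetector {
    static final Pattern CAPTION = Pattern.compile("^(Figure|Fig\\.|Table|Diagram)\\s+[\\d.]+");

    static boolean isCaption(String line) {
        return CAPTION.matcher(line).find();
    }
}
```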
**Rationale**: This is the simplest approach that satisfies FR-003 within KISS constraints.
The spec explicitly excludes pixel-level image search from the POC scope.
**Future upgrade path**: Use a vision model (GPT-4o vision) to generate text descriptions
of extracted images and add them as additional `chunk_type=diagram` documents — no
architectural change needed, just a new processing step.
---
## 5. Simple Shared-Password Authentication
**Decision**: Spring Security HTTP Basic with a single in-memory user.
**Spring Security config**:
```java
@Configuration
@EnableWebSecurity
public class SecurityConfig {
@Bean
public SecurityFilterChain filterChain(HttpSecurity http) throws Exception {
http.authorizeHttpRequests(a -> a.anyRequest().authenticated())
.httpBasic(Customizer.withDefaults())
.csrf(AbstractHttpConfigurer::disable); // REST API — no CSRF needed
return http.build();
}
@Bean
public UserDetailsService userDetailsService(
@Value("${app.auth.password}") String password) {
UserDetails user = User.builder()
.username("neurosurgeon")
.password("{noop}" + password) // {noop} = plain text for POC
.roles("USER")
.build();
return new InMemoryUserDetailsManager(user);
}
}
```
```properties
# application.properties
app.auth.password=${APP_PASSWORD}
```
**Rationale**: Zero database dependency; zero token management. The Vue.js frontend
sets `Authorization: Basic <base64>` via Axios `auth` config. Fully sufficient for
< 10 trusted users on a private network (POC constraint).
**Alternatives considered**: JWT — rejected (requires token endpoint, more code);
custom API key filter — rejected (HTTP Basic is simpler and just as secure for this scale).
---
## 6. Vue.js 3 Project Structure
**Decision**: Vite + Vue 3 + TypeScript + Pinia + Vue Router + Axios.
**Standard layout** (`npm create vue@latest`):
```
frontend/src/
├── components/ # Reusable UI components (BookCard, ChatMessage, etc.)
├── views/ # Route-level pages: UploadView, TopicsView, ChatView
├── stores/ # Pinia: bookStore, topicStore, chatStore
├── services/ # api.ts — Axios instance with base URL + Basic auth header
├── router/ # index.ts — Vue Router routes
└── main.ts
```
**Axios setup** (`services/api.ts`):
```typescript
import axios from 'axios'
export const api = axios.create({
baseURL: import.meta.env.VITE_API_URL ?? 'http://localhost:8080/api/v1',
auth: {
username: 'neurosurgeon',
password: import.meta.env.VITE_APP_PASSWORD
}
})
```
**Rationale**: Axios handles HTTP Basic auth via its `auth` config — no manual
`btoa()` needed. Pinia is Vue's official state manager (replaced Vuex).
---
## 7. pgvector Configuration & Schema
**Decision**: Spring AI auto-creates the `vector_store` table via `initialize-schema=true`.
Application tables use Flyway migrations.
**Required PostgreSQL extensions** (run once on the provided database):
```sql
CREATE EXTENSION IF NOT EXISTS vector;
CREATE EXTENSION IF NOT EXISTS hstore;
CREATE EXTENSION IF NOT EXISTS "uuid-ossp";
```
**Spring AI auto-created table**:
```sql
CREATE TABLE IF NOT EXISTS vector_store (
id uuid DEFAULT uuid_generate_v4() PRIMARY KEY,
content TEXT NOT NULL,
metadata JSONB NOT NULL DEFAULT '{}',
embedding VECTOR(1536) NOT NULL
);
CREATE INDEX ON vector_store USING HNSW (embedding vector_cosine_ops);
```
**Key properties**:
| Property | Value | Notes |
|----------|-------|-------|
| `dimensions` | `1536` | Matches `text-embedding-3-small`; update if provider changes |
| `distance-type` | `COSINE_DISTANCE` | Standard for normalised text embeddings |
| `index-type` | `HNSW` | O(log N) search; best default for POC |
| `initialize-schema` | `true` | Auto-create table on startup (safe for POC) |
**Embedding dimensions note**: if the LLM provider is switched (e.g., to Ollama with a
768-dim model), update `dimensions` in properties **and** re-embed all books — the
`vector_store` table must be recreated with the new dimension.
**Alternatives considered**: IVFFlat index — more memory-efficient but slower; NONE for very
small datasets. HNSW is the best default for a POC where correctness matters more than
storage.
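For intuition on `COSINE_DISTANCE`: pgvector's `<=>` operator returns 1 − cosine similarity, so identical directions score 0 and orthogonal vectors score 1. A pure-Java sketch of the same formula (not pgvector's implementation):

```java
// Illustrative: cosine distance as pgvector's <=> operator defines it (1 - cosine similarity).
class CosineDistance {
    static double of(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na  += a[i] * a[i];
            nb  += b[i] * b[i];
        }
        return 1.0 - dot / (Math.sqrt(na) * Math.sqrt(nb));
    }
}
```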


@@ -0,0 +1,181 @@
# Feature Specification: Neurosurgeon RAG Learning Platform
**Feature Branch**: `001-neuro-rag-learning`
**Created**: 2026-03-31
**Status**: Draft
**Input**: User description: "I want to build a web application to help learn complex topics for neurosurgeons. I will provide to the system a predefined list of topics. The system let user upload books. The books should be embedded to be used latter for LLM (RAG). Embedding is crucial, it MUST be precise, with diagram. The user can select a topics, a LLM will provide a summary by crossing information from the uploaded books. User can have a chat to deepen their knowledge"
## User Scenarios & Testing *(mandatory)*
### User Story 1 - Book Upload & Precise Embedding (Priority: P1)
A neurosurgeon uploads a medical textbook (PDF) to the platform. The system processes the
book, extracting both textual content and embedded diagrams/figures with high fidelity.
Once processing is complete, the book is available as a knowledge source for topic summaries
and chat. The user can see the upload status and a confirmation that the book is ready.
**Why this priority**: Without embedded books there is no knowledge base. All other stories
depend on this story being complete and functional.
**Independent Test**: Upload a single PDF textbook. Verify the upload is accepted, processing
completes, and the book appears in the library as "Ready". Then ask a topic question whose
answer appears only in that book and confirm the correct answer surfaces.
**Acceptance Scenarios**:
1. **Given** a user on the upload page, **When** they select a valid PDF file and confirm
upload, **Then** the system accepts the file, shows a processing progress indicator, and
eventually marks the book as "Ready" in the library.
2. **Given** a book that contains anatomical diagrams, **When** the system finishes
embedding, **Then** diagram content (labels, captions, spatial relationships described
in the diagram) is searchable and retrievable alongside the surrounding text.
3. **Given** an upload of an unsupported file format (e.g., DOCX), **When** the user
submits, **Then** the system rejects the file with a clear error message explaining
accepted formats.
4. **Given** a book is currently being processed, **When** the user navigates to the
library, **Then** the book appears with a "Processing" status and cannot yet be used
as a knowledge source.
---
### User Story 2 - Topic-Guided Summary (Priority: P2)
The user browses a predefined list of neurosurgery topics, selects one (e.g., "Cerebral
Aneurysm Management"), and receives an AI-generated summary that cross-references all
uploaded books. The summary synthesizes information from multiple sources, citing which
book each piece of information comes from.
**Why this priority**: This is the core learning feature — the primary reason a neurosurgeon
uses the platform. It delivers immediate value once at least one book is embedded.
**Independent Test**: With at least one book covering the target topic embedded, select that
topic from the list and confirm: (a) a coherent summary is generated, (b) the summary
references content present in the uploaded book(s), and (c) source citations are visible.
**Acceptance Scenarios**:
1. **Given** at least one book is in "Ready" state, **When** the user selects a topic from
the predefined list, **Then** the system generates a summary within 30 seconds that
draws on content from the uploaded books.
2. **Given** multiple books are uploaded, **When** the user requests a topic summary,
**Then** the summary synthesizes information from all relevant books and indicates
which source each key point came from.
3. **Given** no uploaded book contains content relevant to the selected topic, **When**
the user requests a summary, **Then** the system clearly communicates that its knowledge
is limited and no relevant source was found.
4. **Given** a topic summary is displayed, **When** the user inspects a cited passage,
**Then** they can identify the originating book title and approximate location
(chapter or page range).
---
### User Story 3 - Knowledge Deepening Chat (Priority: P3)
After reading a topic summary (or independently), the user enters a conversational chat
to ask follow-up questions. The AI answers using the embedded books as its exclusive
knowledge source, enabling the user to drill into specific areas, request clarifications,
or explore edge cases.
**Why this priority**: The chat extends the value of the summary by enabling personalised,
interactive learning. It builds on the RAG infrastructure established by P1 and P2.
**Independent Test**: Start a chat session on a specific topic. Ask a specific clinical
question whose answer is in an uploaded book. Confirm the response references the correct
source and that the conversation maintains context across at least 3 turns.
**Acceptance Scenarios**:
1. **Given** a user in a chat session, **When** they ask a question relevant to an
uploaded book, **Then** the system responds with a grounded answer and cites the
source book.
2. **Given** an ongoing chat, **When** the user asks a follow-up question that refers
to a previous turn ("What about the complication you just mentioned?"), **Then** the
system maintains conversational context and provides a coherent answer.
3. **Given** a user asks a question outside the scope of any uploaded book, **When**
the system responds, **Then** it clearly states that no relevant source was found
rather than generating unsupported claims.
4. **Given** a chat session is ongoing, **When** the user wishes to start fresh,
**Then** they can clear the conversation history and begin a new session.
---
### Edge Cases
- What happens when a PDF is corrupted or password-protected?
- How does the system handle books that are very large (500+ pages)?
- What if two uploaded books contain contradictory information on the same topic?
- How does diagram embedding behave if a diagram has no caption or label?
## Requirements *(mandatory)*
### Functional Requirements
- **FR-001**: System MUST allow users to upload books in PDF format.
- **FR-002**: System MUST extract and embed textual content from uploaded books with
high precision for use in semantic search.
- **FR-003**: System MUST extract and embed visual content (diagrams, figures, and their
captions/labels) from uploaded books so diagram information is retrievable by the RAG system.
- **FR-004**: System MUST display a predefined, curated list of neurosurgery topics
for user selection.
- **FR-005**: System MUST generate a topic summary by cross-referencing all "Ready"
uploaded books and synthesizing relevant passages.
- **FR-006**: System MUST cite the source book (and approximate location) for each key
claim in a generated summary.
- **FR-007**: System MUST provide a conversational chat interface where AI answers are
grounded exclusively in uploaded book content.
- **FR-008**: System MUST maintain conversational context within a chat session across
multiple turns.
- **FR-009**: System MUST display the embedding/processing status of each uploaded book
(Pending, Processing, Ready, Failed).
- **FR-010**: System MUST reject uploaded files that are not in a supported format and
provide a clear error message.
- **FR-011**: System MUST communicate clearly when a query cannot be answered from the
available book content, rather than generating unsupported claims.
- **FR-012**: The book library MUST be a single shared global library; all users see and
benefit from the same uploaded books. Per-user isolation is out of scope for the POC.
### Key Entities
- **Book**: Uploaded document. Attributes: title, file name, upload date, processing
status (Pending / Processing / Ready / Failed), page count.
- **Topic**: Predefined learning subject. Attributes: name, description, category.
Managed via configuration (not user-editable in the POC).
- **Embedding Chunk**: A semantically coherent unit of book content (text passage or
diagram + caption) with its vector representation and source reference.
- **Chat Session**: A conversation thread. Attributes: creation date, associated topic
(optional), message history.
- **Message**: A single turn in a chat session. Attributes: role (user / assistant),
content, source citations (for assistant messages).
## Success Criteria *(mandatory)*
### Measurable Outcomes
- **SC-001**: A user can upload a book and have it fully processed and searchable within
10 minutes for a standard-length medical textbook (up to 500 pages).
- **SC-002**: Topic summaries are generated within 30 seconds of user request.
- **SC-003**: At least 90% of generated summary claims can be traced back to a cited
passage in an uploaded book.
- **SC-004**: A user completing the primary flow (upload → topic summary → chat) requires
no external instructions — the interface is self-explanatory.
- **SC-005**: Diagram content from uploaded books is retrievable by the RAG system;
at least one diagram-sourced fact surfaces correctly in a controlled test query.
- **SC-006**: The system correctly declines to answer (and explains why) when a question
has no grounding in uploaded books, in at least 9 out of 10 out-of-scope test queries.
## Assumptions
- Books are uploaded as PDF files; other formats (EPUB, DOCX) are out of scope for the POC.
- The predefined topic list is small (10–50 topics) and curated manually by the project
  owner via a configuration file; no admin UI is needed for the POC.
- Access is protected by a simple shared password or API token (no individual user accounts);
anyone who knows the credential can access the application. Full account management is out
of scope for the POC.
- The LLM used for summary generation and chat is accessed via an external API (not
self-hosted); the specific provider is a technical implementation decision.
- Diagram embedding means extracting diagram images and their associated captions/labels
as descriptive text for semantic search; pixel-level image similarity search is out of
scope for the POC.
- The system is designed for a small number of concurrent users (POC scale: < 10
simultaneous users); horizontal scaling is not a requirement at this stage.
- Internet connectivity is assumed for both the user and the server (external LLM API calls).