This commit is contained in:
Adrien
2026-03-31 15:42:49 +02:00
parent 3507ce27e5
commit dc0bcab36e
10 changed files with 1246 additions and 0 deletions

@@ -0,0 +1,165 @@
# Data Model: Neurosurgeon RAG Learning Platform
**Branch**: `001-neuro-rag-learning`
**Date**: 2026-03-31
## Entities
### Book
Represents an uploaded medical textbook.
| Field | Type | Constraints | Notes |
|-------|------|-------------|-------|
| id | UUID | PK, generated | |
| title | VARCHAR(500) | NOT NULL | Extracted from PDF metadata or filename |
| file_name | VARCHAR(500) | NOT NULL | Original upload filename |
| file_size_bytes | BIGINT | NOT NULL | |
| page_count | INT | nullable | Populated after processing |
| status | ENUM | NOT NULL, default `PENDING` | `PENDING`, `PROCESSING`, `READY`, `FAILED`; stored as `VARCHAR(20)` in the DDL |
| error_message | TEXT | nullable | Populated if status = FAILED |
| uploaded_at | TIMESTAMPTZ | NOT NULL, default now() | |
| processed_at | TIMESTAMPTZ | nullable | When embedding completed |

**State machine**:
```
PENDING → PROCESSING → READY
                     ↘ FAILED
```
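
The lifecycle above can be sketched as a small Java enum. This is only an illustration: `BookStatus`, `nextStates`, and `canTransitionTo` are hypothetical names, and the sketch assumes that `FAILED` is reached only from `PROCESSING` and that `READY` and `FAILED` are terminal.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical enum mirroring the Book.status values.
enum BookStatus {
    PENDING, PROCESSING, READY, FAILED;

    // Allowed next states. Assumption: READY and FAILED are terminal,
    // and FAILED can only follow PROCESSING.
    Set<BookStatus> nextStates() {
        switch (this) {
            case PENDING:
                return EnumSet.of(PROCESSING);
            case PROCESSING:
                return EnumSet.of(READY, FAILED);
            default:
                return EnumSet.noneOf(BookStatus.class);
        }
    }

    // True if moving from this status to the target is allowed.
    boolean canTransitionTo(BookStatus target) {
        return nextStates().contains(target);
    }
}
```

A service updating `book.status` could call `canTransitionTo` before writing, so that an invalid jump such as `PENDING` to `READY` is rejected in one place.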
**Business rules**:
- Only books in `READY` status are used as RAG sources.
- A `FAILED` book can be deleted and re-uploaded.
- `title` defaults to the filename (without extension) if PDF metadata is absent.
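
The title fallback rule could look roughly like the following. `BookTitles.resolveTitle` is a hypothetical helper, and treating a blank metadata title the same as an absent one is an assumption the rules above do not state.

```java
// Hypothetical helper that picks Book.title for an upload.
final class BookTitles {
    private BookTitles() {}

    static String resolveTitle(String metadataTitle, String fileName) {
        // Assumption: a blank metadata title counts as absent.
        if (metadataTitle != null && !metadataTitle.trim().isEmpty()) {
            return metadataTitle.trim();
        }
        int dot = fileName.lastIndexOf('.');
        // Strip only a real extension; names without a dot (or starting with one) stay as-is.
        return dot > 0 ? fileName.substring(0, dot) : fileName;
    }
}
```

For example, `resolveTitle(null, "atlas.pdf")` yields `atlas`, while a non-blank metadata title is returned unchanged apart from surrounding whitespace.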
---
### EmbeddingChunk
A semantically coherent segment of a book's content stored as a vector embedding.
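Managed by Spring AI's pgvector `VectorStore`; Spring AI can create the table automatically when schema initialization is enabled (`spring.ai.vectorstore.pgvector.initialize-schema=true` in recent Spring AI releases).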
| Field | Type | Notes |
|-------|------|-------|
| id | UUID | PK |
| content | TEXT | Raw text of the chunk (passage or diagram caption + surrounding text) |
| embedding | VECTOR(1536) | pgvector column; dimension matches the embedding model |
| metadata | JSONB | `{ "book_id": "…", "book_title": "…", "page": N, "chunk_type": "text\|diagram" }` |
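
**Notes**:
- `chunk_type = diagram` means the chunk was derived from a diagram caption and adjacent descriptive text.
- All chunks for a given book are deleted when the book is deleted. The link is `metadata.book_id` rather than a foreign key, so the application must remove them explicitly; the database cannot cascade the delete.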
- Spring AI manages this table; direct access is through `VectorStore.similaritySearch(…)`.
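
To make the `metadata` shape concrete, here is a minimal sketch of a value type that produces the map described in the table. `ChunkMetadata` and `toMap` are hypothetical names, and the sketch assumes Java 16+ for records; how the map is handed to the vector store is left out.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.UUID;

// Hypothetical value type for the metadata attached to each chunk.
record ChunkMetadata(UUID bookId, String bookTitle, int page, String chunkType) {

    ChunkMetadata {
        if (bookId == null || bookTitle == null) {
            throw new IllegalArgumentException("book_id and book_title are required");
        }
        // chunk_type MUST be "text" or "diagram".
        if (!"text".equals(chunkType) && !"diagram".equals(chunkType)) {
            throw new IllegalArgumentException("chunk_type must be 'text' or 'diagram'");
        }
    }

    // Builds the map using the same keys as the metadata JSON.
    Map<String, Object> toMap() {
        Map<String, Object> map = new LinkedHashMap<>();
        map.put("book_id", bookId.toString());
        map.put("book_title", bookTitle);
        map.put("page", page);
        map.put("chunk_type", chunkType);
        return map;
    }
}
```

Keeping `book_id` in every chunk's metadata is what makes it possible to find and delete all chunks of a book later, since no foreign key exists.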
---
### Topic
A predefined neurosurgery learning subject. **Not stored in the database** for the POC —
loaded at startup from `backend/src/main/resources/topics.json`.
| Field | Type | Notes |
|-------|------|-------|
| id | String | Slug, e.g., `cerebral-aneurysm` |
| name | String | Display name, e.g., "Cerebral Aneurysm Management" |
| description | String | One-sentence description |
| category | String | Grouping label, e.g., "Vascular", "Oncology" |

**Business rules**:
- Topics are read-only from the application's perspective.
- The project owner edits `topics.json` to add/remove topics; no admin UI is needed.
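
One way to hold the topics in memory after startup is a read-only catalog keyed by slug. The sketch below is an assumption about structure, not the project's actual code: `Topic` and `TopicCatalog` are hypothetical names, parsing of `topics.json` is omitted, and rejecting duplicate slugs is an added safeguard that the rules above do not require. It assumes Java 16+ for records.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Hypothetical in-memory form of one topics.json entry.
record Topic(String id, String name, String description, String category) {}

// Hypothetical read-only catalog built once at startup.
final class TopicCatalog {
    private final Map<String, Topic> bySlug = new LinkedHashMap<>();

    TopicCatalog(List<Topic> topics) {
        for (Topic topic : topics) {
            // Assumption: slugs are unique; fail fast on a duplicate.
            if (bySlug.putIfAbsent(topic.id(), topic) != null) {
                throw new IllegalArgumentException("Duplicate topic slug: " + topic.id());
            }
        }
    }

    // Looks up a topic by slug; empty if the slug is unknown.
    Optional<Topic> find(String slug) {
        return Optional.ofNullable(bySlug.get(slug));
    }

    // Returns an immutable snapshot in file order.
    List<Topic> all() {
        return List.copyOf(bySlug.values());
    }
}
```

Because the catalog exposes no mutators, topics stay read-only from the application's perspective.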
---
### ChatSession
A conversation thread, optionally associated with a topic.
| Field | Type | Constraints | Notes |
|-------|------|-------------|-------|
| id | UUID | PK, generated | |
| topic_id | VARCHAR(100) | nullable | References a topic slug; null = free-form chat |
| created_at | TIMESTAMPTZ | NOT NULL, default now() | |
---
### Message
A single turn in a chat session.
| Field | Type | Constraints | Notes |
|-------|------|-------------|-------|
| id | UUID | PK, generated | |
| session_id | UUID | NOT NULL, FK → ChatSession.id, ON DELETE CASCADE | |
| role | ENUM | NOT NULL | `USER`, `ASSISTANT`; stored as `VARCHAR(10)` in the DDL |
| content | TEXT | NOT NULL | |
| sources | JSONB | nullable | Array of `{ "book_title": "…", "page": N }` for ASSISTANT messages |
| created_at | TIMESTAMPTZ | NOT NULL, default now() | |

**Business rules**:
- Messages are ordered by `created_at` ASC within a session.
- `sources` is only populated on `ASSISTANT` messages.
- Deleting a session cascades to delete all its messages.
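
The `sources` rule can be enforced where a message is constructed. The following is a minimal sketch under stated assumptions: `SourceRef`, `Role`, and `ChatMessage` are hypothetical names, `null` stands for "no sources", and Java 16+ is assumed for records.

```java
import java.util.List;

// Hypothetical element of the sources array: { "book_title": "…", "page": N }.
record SourceRef(String bookTitle, int page) {}

enum Role { USER, ASSISTANT }

// Hypothetical in-memory form of a message row.
record ChatMessage(Role role, String content, List<SourceRef> sources) {

    ChatMessage {
        if (role == null || content == null) {
            throw new IllegalArgumentException("role and content are required");
        }
        // sources is only populated on ASSISTANT messages.
        if (role == Role.USER && sources != null) {
            throw new IllegalArgumentException("sources must be null for USER messages");
        }
        if (sources != null) {
            sources = List.copyOf(sources); // defensive, immutable copy
        }
    }
}
```

An assistant reply citing one page would be built as `new ChatMessage(Role.ASSISTANT, "…", List.of(new SourceRef("Atlas", 42)))`, where the title and page are placeholder values.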
---
## Relationships
```
Book (1) ──────────────── (N) EmbeddingChunk
                          (via metadata.book_id)
Topic (config file) ──────── (N) ChatSession [optional association]
ChatSession (1) ──────────── (N) Message
```
---
## Database Schema (DDL summary)
```sql
-- Spring AI creates the vector table automatically.
-- Application-managed tables:
CREATE TABLE book (
    id              UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    title           VARCHAR(500) NOT NULL,
    file_name       VARCHAR(500) NOT NULL,
    file_size_bytes BIGINT NOT NULL,
    page_count      INT,
    status          VARCHAR(20) NOT NULL DEFAULT 'PENDING',
    error_message   TEXT,
    uploaded_at     TIMESTAMPTZ NOT NULL DEFAULT now(),
    processed_at    TIMESTAMPTZ
);

CREATE TABLE chat_session (
    id         UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    topic_id   VARCHAR(100),
    created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);

CREATE TABLE message (
    id         UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    session_id UUID NOT NULL REFERENCES chat_session(id) ON DELETE CASCADE,
    role       VARCHAR(10) NOT NULL,
    content    TEXT NOT NULL,
    sources    JSONB,
    created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);

CREATE INDEX idx_message_session ON message(session_id, created_at);
```
---
## Validation Rules
| Entity | Field | Rule |
|--------|-------|------|
| Book | status | MUST be one of `PENDING`, `PROCESSING`, `READY`, `FAILED` |
| Book | file_name | MUST end in `.pdf` (case-insensitive) |
| Message | role | MUST be `USER` or `ASSISTANT` |
| EmbeddingChunk | metadata.chunk_type | MUST be `text` or `diagram` |
| ChatSession | topic_id | If non-null, MUST match a known topic slug from `topics.json` |
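
Two of the rules above are not expressed in the DDL and would need application-side checks. A minimal sketch, assuming the known slugs are available as a `Set<String>`; `ModelValidation` and its method names are hypothetical:

```java
import java.util.Locale;
import java.util.Set;

// Hypothetical helpers for rules the database schema does not enforce.
final class ModelValidation {
    private ModelValidation() {}

    // Book.file_name MUST end in ".pdf", compared case-insensitively.
    static boolean isPdfFileName(String fileName) {
        return fileName != null
                && fileName.toLowerCase(Locale.ROOT).endsWith(".pdf");
    }

    // ChatSession.topic_id: null means free-form chat; otherwise it must be a known slug.
    static boolean isValidTopicId(String topicId, Set<String> knownSlugs) {
        return topicId == null || knownSlugs.contains(topicId);
    }
}
```

`Locale.ROOT` keeps the lowercase conversion independent of the server's default locale, which matters for extensions such as `.PDF` under locales with special casing rules.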