plan

2026-03-31 15:42:49 +02:00
parent 3507ce27e5
commit dc0bcab36e
10 changed files with 1246 additions and 0 deletions
@@ -0,0 +1,165 @@
+# Data Model: Neurosurgeon RAG Learning Platform
+
+**Branch**: `001-neuro-rag-learning`
+**Date**: 2026-03-31
+
+## Entities
+
+### Book
+
+Represents an uploaded medical textbook.
+
+| Field | Type | Constraints | Notes |
+|-------|------|-------------|-------|
+| id | UUID | PK, generated | |
+| title | VARCHAR(500) | NOT NULL | Extracted from PDF metadata or filename |
+| file_name | VARCHAR(500) | NOT NULL | Original upload filename |
+| file_size_bytes | BIGINT | NOT NULL | |
+| page_count | INT | nullable | Populated after processing |
+| status | ENUM | NOT NULL, default `PENDING` | `PENDING`, `PROCESSING`, `READY`, `FAILED`; stored as `VARCHAR(20)` |
+| error_message | TEXT | nullable | Populated if status = FAILED |
+| uploaded_at | TIMESTAMPTZ | NOT NULL, default now() | |
+| processed_at | TIMESTAMPTZ | nullable | When embedding completed |
+
+**State machine**:
+```
+PENDING → PROCESSING → READY
+                     ↘ FAILED
+```
+
+**Business rules**:
+- Only books in `READY` status are used as RAG sources.
+- A `FAILED` book can be deleted and re-uploaded.
+- `title` defaults to the filename (without extension) if PDF metadata is absent.
+
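The state machine and business rules above could be enforced in code along these lines. This is a minimal Java sketch, not the project's implementation: `canTransitionTo`, `isRagSource`, and the `BookTitles.resolve` helper are hypothetical names, and treating `READY` and `FAILED` as terminal states is an assumption read off the diagram.

```java
import java.util.EnumSet;
import java.util.Set;

/** Lifecycle states of a Book, mirroring the state machine above. */
enum BookStatus {
    PENDING, PROCESSING, READY, FAILED;

    /** Statuses reachable from this one; READY and FAILED are assumed terminal. */
    Set<BookStatus> allowedNext() {
        switch (this) {
            case PENDING:    return EnumSet.of(PROCESSING);
            case PROCESSING: return EnumSet.of(READY, FAILED);
            default:         return EnumSet.noneOf(BookStatus.class);
        }
    }

    boolean canTransitionTo(BookStatus next) {
        return allowedNext().contains(next);
    }

    /** Only READY books are used as RAG sources. */
    boolean isRagSource() {
        return this == READY;
    }
}

/** Hypothetical helper for the title fallback rule. */
final class BookTitles {
    private BookTitles() {}

    /** Returns the PDF metadata title, or the file name without its extension if the title is absent. */
    static String resolve(String metadataTitle, String fileName) {
        if (metadataTitle != null && !metadataTitle.isBlank()) {
            return metadataTitle.strip();
        }
        int dot = fileName.lastIndexOf('.');
        return dot > 0 ? fileName.substring(0, dot) : fileName;
    }
}
```

Keeping the allowed transitions on the enum itself gives the upload and processing code a single place to check before updating `status`.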
+---
+
+### EmbeddingChunk
+
+A semantically coherent segment of a book's content stored as a vector embedding.
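+Managed by Spring AI's pgvector `VectorStore`. Spring AI creates the table automatically when schema initialization is enabled (in recent Spring AI versions, `spring.ai.vectorstore.pgvector.initialize-schema=true`).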
+
+| Field | Type | Notes |
+|-------|------|-------|
+| id | UUID | PK |
+| content | TEXT | Raw text of the chunk (passage or diagram caption + surrounding text) |
+| embedding | VECTOR(1536) | pgvector column; dimension matches the embedding model |
+| metadata | JSONB | `{ "book_id": "…", "book_title": "…", "page": N, "chunk_type": "text\|diagram" }` |
+
+**Notes**:
+- `chunk_type = diagram` means the chunk was derived from a diagram caption and adjacent descriptive text.
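+- All chunks for a given book are deleted when the book is deleted. Because the link is `metadata.book_id` rather than a foreign key, the application must perform this deletion explicitly.
+- Spring AI manages this table; the application accesses it through the `VectorStore` API (e.g., `VectorStore.similaritySearch(…)`) rather than querying it directly.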
+
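The metadata layout above can be captured in a small typed wrapper before a chunk is handed to the vector store. The following is a sketch under stated assumptions: `ChunkMetadata` and `toMap` are hypothetical names, and the 1-based page check is an assumption that the source does not state.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.UUID;

/** Hypothetical typed view of the metadata JSON stored with each chunk. */
record ChunkMetadata(UUID bookId, String bookTitle, int page, String chunkType) {

    static final Set<String> CHUNK_TYPES = Set.of("text", "diagram");

    ChunkMetadata {
        if (bookId == null) {
            throw new IllegalArgumentException("bookId is required");
        }
        if (page < 1) { // assumption: pages are numbered from 1
            throw new IllegalArgumentException("page must be >= 1");
        }
        if (chunkType == null || !CHUNK_TYPES.contains(chunkType)) {
            throw new IllegalArgumentException("chunkType must be 'text' or 'diagram'");
        }
    }

    /** Produces the key/value layout described above: book_id, book_title, page, chunk_type. */
    Map<String, Object> toMap() {
        Map<String, Object> map = new LinkedHashMap<>();
        map.put("book_id", bookId.toString());
        map.put("book_title", bookTitle);
        map.put("page", page);
        map.put("chunk_type", chunkType);
        return map;
    }
}
```

Validating at construction time keeps malformed metadata out of the vector table, which matters here because no database constraint guards the metadata column.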
+---
+
+### Topic
+
+A predefined neurosurgery learning subject. **Not stored in the database** for the POC;
+topics are loaded at startup from `backend/src/main/resources/topics.json`.
+
+| Field | Type | Notes |
+|-------|------|-------|
+| id | String | Slug, e.g., `cerebral-aneurysm` |
+| name | String | Display name, e.g., "Cerebral Aneurysm Management" |
+| description | String | One-sentence description |
+| category | String | Grouping label, e.g., "Vascular", "Oncology" |
+
+**Business rules**:
+- Topics are read-only from the application's perspective.
+- The project owner edits `topics.json` to add/remove topics; no admin UI is needed.
+
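A read-only topic lookup matching the fields above might look like the following sketch. `TopicCatalog`, `find`, and `isValidTopicId` are hypothetical names, rejecting duplicate slugs is an assumption, and parsing `topics.json` is deliberately left out because it would need a JSON library.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

/** One entry of topics.json. */
record Topic(String id, String name, String description, String category) {}

/** Read-only, in-memory topic lookup keyed by slug. */
final class TopicCatalog {
    private final Map<String, Topic> bySlug = new LinkedHashMap<>();

    TopicCatalog(List<Topic> topics) {
        for (Topic topic : topics) {
            // Assumption: slugs are unique, so a duplicate is a configuration error.
            if (bySlug.putIfAbsent(topic.id(), topic) != null) {
                throw new IllegalArgumentException("duplicate topic slug: " + topic.id());
            }
        }
    }

    Optional<Topic> find(String slug) {
        return Optional.ofNullable(bySlug.get(slug));
    }

    /** A null topic id means free-form chat; a non-null id must be a known slug. */
    boolean isValidTopicId(String topicId) {
        return topicId == null || bySlug.containsKey(topicId);
    }
}
```

Because the catalog is built once at startup and never mutated, it can be shared across request threads without synchronization.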
+---
+
+### ChatSession
+
+A conversation thread, optionally associated with a topic.
+
+| Field | Type | Constraints | Notes |
+|-------|------|-------------|-------|
+| id | UUID | PK, generated | |
+| topic_id | VARCHAR(100) | nullable | References a topic slug (no FK, because topics live in `topics.json`); null = free-form chat |
+| created_at | TIMESTAMPTZ | NOT NULL, default now() | |
+
+---
+
+### Message
+
+A single turn in a chat session.
+
+| Field | Type | Constraints | Notes |
+|-------|------|-------------|-------|
+| id | UUID | PK, generated | |
+| session_id | UUID | FK → ChatSession.id, ON DELETE CASCADE | |
+| role | ENUM | NOT NULL | `USER`, `ASSISTANT`; stored as `VARCHAR(10)` |
+| content | TEXT | NOT NULL | |
+| sources | JSONB | nullable | Array of `{ "book_title": "…", "page": N }` for ASSISTANT messages |
+| created_at | TIMESTAMPTZ | NOT NULL, default now() | |
+
+**Business rules**:
+- Messages are ordered by `created_at` ASC within a session.
+- `sources` is only populated on `ASSISTANT` messages.
+- Deleting a session cascades to delete all its messages.
+
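The message rules above translate into a small domain type. This is a hedged sketch rather than the project's entity class: `Role`, `Source`, `Message`, and `BY_CREATED_AT` are hypothetical names chosen to mirror the table columns.

```java
import java.time.Instant;
import java.util.Comparator;
import java.util.List;
import java.util.UUID;

enum Role { USER, ASSISTANT }

/** One citation entry of the sources array. */
record Source(String bookTitle, int page) {}

record Message(UUID id, UUID sessionId, Role role, String content,
               List<Source> sources, Instant createdAt) {

    /** Chronological order within a session (created_at ASC). */
    static final Comparator<Message> BY_CREATED_AT =
            Comparator.comparing(Message::createdAt);

    Message {
        if (role == null || content == null) {
            throw new IllegalArgumentException("role and content are required");
        }
        // sources is only populated on ASSISTANT messages.
        if (role == Role.USER && sources != null) {
            throw new IllegalArgumentException("sources are only allowed on ASSISTANT messages");
        }
    }
}
```

Sorting a session's messages with `Message.BY_CREATED_AT` reproduces the order the `idx_message_session` index serves from the database.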
+---
+
+## Relationships
+
+```
+Book (1) ──────────────── (N) EmbeddingChunk
+                                (via metadata.book_id)
+
+Topic (config file) ──────── (N) ChatSession  [optional association]
+
+ChatSession (1) ──────────── (N) Message
+```
+
+---
+
+## Database Schema (DDL summary)
+
+```sql
+-- Spring AI creates the vector table automatically.
+-- gen_random_uuid() is built in from PostgreSQL 13; older versions need the pgcrypto extension.
+-- Application-managed tables:
+
+CREATE TABLE book (
+    id            UUID PRIMARY KEY DEFAULT gen_random_uuid(),
+    title         VARCHAR(500) NOT NULL,
+    file_name     VARCHAR(500) NOT NULL,
+    file_size_bytes BIGINT NOT NULL,
+    page_count    INT,
+    status        VARCHAR(20) NOT NULL DEFAULT 'PENDING'
+                  CHECK (status IN ('PENDING', 'PROCESSING', 'READY', 'FAILED')),
+    error_message TEXT,
+    uploaded_at   TIMESTAMPTZ NOT NULL DEFAULT now(),
+    processed_at  TIMESTAMPTZ
+);
+
+CREATE TABLE chat_session (
+    id         UUID PRIMARY KEY DEFAULT gen_random_uuid(),
+    topic_id   VARCHAR(100),
+    created_at TIMESTAMPTZ NOT NULL DEFAULT now()
+);
+
+CREATE TABLE message (
+    id         UUID PRIMARY KEY DEFAULT gen_random_uuid(),
+    session_id UUID NOT NULL REFERENCES chat_session(id) ON DELETE CASCADE,
+    role       VARCHAR(10) NOT NULL CHECK (role IN ('USER', 'ASSISTANT')),
+    content    TEXT NOT NULL,
+    sources    JSONB,
+    created_at TIMESTAMPTZ NOT NULL DEFAULT now()
+);
+
+CREATE INDEX idx_message_session ON message(session_id, created_at);
+```
+
+---
+
+## Validation Rules
+
+| Entity | Field | Rule |
+|--------|-------|------|
+| Book | status | MUST be one of `PENDING`, `PROCESSING`, `READY`, `FAILED` |
+| Book | file_name | MUST end in `.pdf` (case-insensitive) |
+| Message | role | MUST be `USER` or `ASSISTANT` |
+| EmbeddingChunk | metadata.chunk_type | MUST be `text` or `diagram` |
+| ChatSession | topic_id | If non-null, MUST match a known topic slug from `topics.json` |
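The `file_name` rule in the table above is the one check that depends on string handling, so a short sketch may help. `BookValidation.isPdfFileName` is a hypothetical helper, not part of the source.

```java
import java.util.Locale;

/** Hypothetical validation helper for uploaded books. */
final class BookValidation {
    private BookValidation() {}

    /** file_name MUST end in ".pdf", compared case-insensitively. */
    static boolean isPdfFileName(String fileName) {
        return fileName != null
                && fileName.toLowerCase(Locale.ROOT).endsWith(".pdf");
    }
}
```

Lower-casing with `Locale.ROOT` keeps the comparison independent of the server's default locale, which would otherwise mishandle names such as `SCAN.PDF` under a Turkish locale.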