ai-teacher/specs/001-neuro-rag-learning/data-model.md
Adrien dc0bcab36e plan
2026-03-31 15:42:49 +02:00


Data Model: Neurosurgeon RAG Learning Platform

Branch: 001-neuro-rag-learning | Date: 2026-03-31

Entities

Book

Represents an uploaded medical textbook.

| Field | Type | Constraints | Notes |
| --- | --- | --- | --- |
| id | UUID | PK, generated | |
| title | VARCHAR(500) | NOT NULL | Extracted from PDF metadata or filename |
| file_name | VARCHAR(500) | NOT NULL | Original upload filename |
| file_size_bytes | BIGINT | NOT NULL | |
| page_count | INT | nullable | Populated after processing |
| status | ENUM (stored as VARCHAR(20)) | NOT NULL, default PENDING | PENDING, PROCESSING, READY, FAILED |
| error_message | TEXT | nullable | Populated if status = FAILED |
| uploaded_at | TIMESTAMPTZ | NOT NULL, default now() | |
| processed_at | TIMESTAMPTZ | nullable | When embedding completed |

State machine:

PENDING → PROCESSING → READY
                     ↘ FAILED
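
The diagram above can be encoded directly in the status type. The following is a minimal sketch, assuming the status is modelled as a Java enum; the method names `nextStates`, `canTransitionTo`, and `transitionTo` are illustrative and not part of the spec. It treats READY and FAILED as terminal because the diagram shows no outgoing arrows from them.

```java
import java.util.EnumSet;
import java.util.Set;

/** Lifecycle states of a Book, mirroring the state machine diagram. */
enum BookStatus {
    PENDING, PROCESSING, READY, FAILED;

    /** States reachable from this one in a single step. */
    Set<BookStatus> nextStates() {
        switch (this) {
            case PENDING:
                return EnumSet.of(PROCESSING);
            case PROCESSING:
                return EnumSet.of(READY, FAILED);
            default:
                // Assumption: READY and FAILED have no outgoing transitions.
                return EnumSet.noneOf(BookStatus.class);
        }
    }

    boolean canTransitionTo(BookStatus target) {
        return nextStates().contains(target);
    }

    /** Returns the target state, or throws if the diagram does not allow the move. */
    BookStatus transitionTo(BookStatus target) {
        if (!canTransitionTo(target)) {
            throw new IllegalStateException("Illegal transition: " + this + " -> " + target);
        }
        return target;
    }
}
```

For example, `BookStatus.PENDING.transitionTo(BookStatus.PROCESSING)` succeeds, while `BookStatus.READY.transitionTo(BookStatus.PROCESSING)` throws.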

Business rules:

  • Only books in READY status are used as RAG sources.
  • A FAILED book can be deleted and re-uploaded.
  • title defaults to the filename (without extension) if PDF metadata is absent.
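
The title defaulting rule could be implemented along these lines. This is a sketch only: `BookTitles` and `resolveTitle` are hypothetical names, and treating a blank metadata title the same as an absent one is an assumption.

```java
/** Hypothetical helper that resolves a Book title per the defaulting rule. */
final class BookTitles {
    private BookTitles() {}

    /**
     * @param pdfMetadataTitle title read from the PDF metadata, or null if absent
     * @param fileName         original upload filename
     */
    static String resolveTitle(String pdfMetadataTitle, String fileName) {
        // Assumption: a blank metadata title counts as absent.
        if (pdfMetadataTitle != null && !pdfMetadataTitle.isBlank()) {
            return pdfMetadataTitle.strip();
        }
        int dot = fileName.lastIndexOf('.');
        // dot > 0 leaves names with no extension, or with only a leading dot, unchanged.
        return dot > 0 ? fileName.substring(0, dot) : fileName;
    }
}
```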

EmbeddingChunk

A semantically coherent segment of a book's content, stored as a vector embedding. The table is created and managed by Spring AI's pgvector VectorStore.

| Field | Type | Notes |
| --- | --- | --- |
| id | UUID | PK |
| content | TEXT | Raw text of the chunk (passage or diagram caption + surrounding text) |
| embedding | VECTOR(1536) | pgvector column; dimension matches the embedding model |
| metadata | JSONB | { "book_id": "…", "book_title": "…", "page": N, "chunk_type": "…" }, where chunk_type is text or diagram |

Notes:

  • chunk_type = diagram means the chunk was derived from a diagram caption and adjacent descriptive text.
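  • All chunks for a given book are deleted when the book is deleted. Because the link is only metadata.book_id (there is no foreign key), the application must remove them explicitly.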
  • Spring AI manages this table; direct access is through VectorStore.similaritySearch(…).
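
As an illustration of the metadata shape, the sketch below builds the map that would accompany each chunk. `ChunkMetadata` and its factory method are hypothetical names; the assumption is that the resulting map is handed to Spring AI as the document metadata when the chunk is stored.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.UUID;

/** Hypothetical builder for the EmbeddingChunk metadata object. */
final class ChunkMetadata {
    private static final Set<String> CHUNK_TYPES = Set.of("text", "diagram");

    private ChunkMetadata() {}

    static Map<String, Object> of(UUID bookId, String bookTitle, int page, String chunkType) {
        // Null check first: Set.of(...) sets reject null in contains().
        if (chunkType == null || !CHUNK_TYPES.contains(chunkType)) {
            throw new IllegalArgumentException("chunk_type must be text or diagram, got: " + chunkType);
        }
        Map<String, Object> metadata = new LinkedHashMap<>();
        metadata.put("book_id", bookId.toString()); // stored as a string so it serializes cleanly to JSON
        metadata.put("book_title", bookTitle);
        metadata.put("page", page);
        metadata.put("chunk_type", chunkType);
        return Collections.unmodifiableMap(metadata);
    }
}
```

Keeping book_id in the metadata is what makes it possible to find, and later delete, every chunk that belongs to one book.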

Topic

A predefined neurosurgery learning subject. For the POC, topics are not stored in the database; they are loaded at startup from backend/src/main/resources/topics.json.

| Field | Type | Notes |
| --- | --- | --- |
| id | String | Slug, e.g., cerebral-aneurysm |
| name | String | Display name, e.g., "Cerebral Aneurysm Management" |
| description | String | One-sentence description |
| category | String | Grouping label, e.g., "Vascular", "Oncology" |

Business rules:

  • Topics are read-only from the application's perspective.
  • The project owner edits topics.json to add/remove topics; no admin UI is needed.
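
A read-only, in-memory topic catalog could look like the sketch below. `Topic` follows the field table above; `TopicCatalog` is a hypothetical name, parsing of topics.json is deliberately left out, and rejecting duplicate slugs is an assumption rather than a stated requirement.

```java
import java.util.Collection;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

/** One entry of topics.json. */
record Topic(String id, String name, String description, String category) {}

/** Hypothetical immutable lookup built once at startup from the parsed topics. */
final class TopicCatalog {
    private final Map<String, Topic> bySlug;

    TopicCatalog(List<Topic> topics) {
        Map<String, Topic> index = new LinkedHashMap<>(); // keeps the file's order
        for (Topic topic : topics) {
            // Assumption: a duplicate slug is a configuration error.
            if (index.putIfAbsent(topic.id(), topic) != null) {
                throw new IllegalArgumentException("Duplicate topic slug: " + topic.id());
            }
        }
        this.bySlug = Collections.unmodifiableMap(index);
    }

    Optional<Topic> findById(String slug) {
        return Optional.ofNullable(bySlug.get(slug));
    }

    /** All topics, in file order; the returned view cannot be modified. */
    Collection<Topic> all() {
        return bySlug.values();
    }
}
```

Because the catalog exposes no mutating methods, topics stay read-only at runtime, and changes only take effect when topics.json is edited and the application is restarted.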

ChatSession

A conversation thread, optionally associated with a topic.

| Field | Type | Constraints | Notes |
| --- | --- | --- | --- |
| id | UUID | PK, generated | |
| topic_id | VARCHAR(100) | nullable | References a topic slug; null = free-form chat |
| created_at | TIMESTAMPTZ | NOT NULL, default now() | |

Message

A single turn in a chat session.

| Field | Type | Constraints | Notes |
| --- | --- | --- | --- |
| id | UUID | PK, generated | |
| session_id | UUID | NOT NULL, FK → ChatSession.id, ON DELETE CASCADE | |
| role | ENUM (stored as VARCHAR(10)) | NOT NULL | USER, ASSISTANT |
| content | TEXT | NOT NULL | |
| sources | JSONB | nullable | Array of { "book_title": "…", "page": N } for ASSISTANT messages |
| created_at | TIMESTAMPTZ | NOT NULL, default now() | |

Business rules:

  • Messages are ordered by created_at ASC within a session.
  • sources is only populated on ASSISTANT messages.
  • Deleting a session cascades to delete all its messages.
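
The first two rules can be expressed in the domain type itself. The sketch below is one possible shape, not the project's actual classes: `ChatMessage`, `MessageRole`, `SourceRef`, and `ChatTranscript` are hypothetical names. The cascade rule is left to the database (ON DELETE CASCADE) and is not modelled here.

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

enum MessageRole { USER, ASSISTANT }

/** One entry of the sources array. */
record SourceRef(String bookTitle, int page) {}

record ChatMessage(MessageRole role, String content, List<SourceRef> sources, Instant createdAt) {
    ChatMessage {
        if (role == null || content == null || createdAt == null) {
            throw new IllegalArgumentException("role, content and createdAt are NOT NULL");
        }
        // sources is only populated on ASSISTANT messages; it stays nullable there.
        if (role == MessageRole.USER && sources != null) {
            throw new IllegalArgumentException("USER messages must not carry sources");
        }
        sources = (sources == null) ? null : List.copyOf(sources); // defensive copy
    }
}

final class ChatTranscript {
    private ChatTranscript() {}

    /** Returns a new list ordered by created_at ascending; the input is not modified. */
    static List<ChatMessage> ordered(List<ChatMessage> messages) {
        List<ChatMessage> copy = new ArrayList<>(messages);
        copy.sort(Comparator.comparing(ChatMessage::createdAt));
        return copy;
    }
}
```

In practice the ordering would normally come from the query itself (ORDER BY created_at), which the idx_message_session index supports; the in-memory sort is shown only to make the rule concrete.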

Relationships

Book (1) ──────────────── (N) EmbeddingChunk
                                (via metadata.book_id)

Topic (config file) ──────── (N) ChatSession  [optional association]

ChatSession (1) ──────────── (N) Message

Database Schema (DDL summary)

-- Spring AI creates the vector table automatically.
-- Application-managed tables:

CREATE TABLE book (
    id            UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    title         VARCHAR(500) NOT NULL,
    file_name     VARCHAR(500) NOT NULL,
    file_size_bytes BIGINT NOT NULL,
    page_count    INT,
    status        VARCHAR(20) NOT NULL DEFAULT 'PENDING',
    error_message TEXT,
    uploaded_at   TIMESTAMPTZ NOT NULL DEFAULT now(),
    processed_at  TIMESTAMPTZ
);

CREATE TABLE chat_session (
    id         UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    topic_id   VARCHAR(100),
    created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);

CREATE TABLE message (
    id         UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    session_id UUID NOT NULL REFERENCES chat_session(id) ON DELETE CASCADE,
    role       VARCHAR(10) NOT NULL,
    content    TEXT NOT NULL,
    sources    JSONB,
    created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);

CREATE INDEX idx_message_session ON message(session_id, created_at);

Validation Rules

| Entity | Field | Rule |
| --- | --- | --- |
| Book | status | MUST be one of PENDING, PROCESSING, READY, FAILED |
| Book | file_name | MUST end in .pdf (case-insensitive) |
| Message | role | MUST be USER or ASSISTANT |
| EmbeddingChunk | metadata.chunk_type | MUST be text or diagram |
| ChatSession | topic_id | If non-null, MUST match a known topic slug from topics.json |
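
Since the DDL stores status and role as plain VARCHAR columns without CHECK constraints, these rules would have to be enforced in application code. The following is a minimal sketch of such checks using only the standard library; `DataModelRules` and its method names are hypothetical, and the set of known topic slugs is passed in by the caller.

```java
import java.util.Locale;
import java.util.Set;

/** Hypothetical validator for the rules in the table above. */
final class DataModelRules {
    private static final Set<String> BOOK_STATUSES = Set.of("PENDING", "PROCESSING", "READY", "FAILED");
    private static final Set<String> MESSAGE_ROLES = Set.of("USER", "ASSISTANT");
    private static final Set<String> CHUNK_TYPES = Set.of("text", "diagram");

    private DataModelRules() {}

    // Null checks come first because Set.of(...) sets reject null in contains().
    static boolean isValidBookStatus(String status) {
        return status != null && BOOK_STATUSES.contains(status);
    }

    static boolean isPdfFileName(String fileName) {
        // Locale.ROOT keeps lower-casing independent of the server's default locale.
        return fileName != null && fileName.toLowerCase(Locale.ROOT).endsWith(".pdf");
    }

    static boolean isValidMessageRole(String role) {
        return role != null && MESSAGE_ROLES.contains(role);
    }

    static boolean isValidChunkType(String chunkType) {
        return chunkType != null && CHUNK_TYPES.contains(chunkType);
    }

    /** A null topic_id means free-form chat; otherwise the slug must be known. */
    static boolean isValidTopicId(String topicId, Set<String> knownSlugs) {
        return topicId == null || knownSlugs.contains(topicId);
    }
}
```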