Adrien dc0bcab36e plan
2026-03-31 15:42:49 +02:00
# Feature Specification: Neurosurgeon RAG Learning Platform
**Feature Branch**: `001-neuro-rag-learning`
**Created**: 2026-03-31
**Status**: Draft
**Input**: User description: "I want to build a web application to help learn complex topics for neurosurgeons. I will provide to the system a predefined list of topics. The system let user upload books. The books should be embedded to be used latter for LLM (RAG). Embedding is crucial, it MUST be precise, with diagram. The user can select a topics, a LLM will provide a summary by crossing information from the uploaded books. User can have a chat to deepen their knowledge"
## User Scenarios & Testing *(mandatory)*
### User Story 1 - Book Upload & Precise Embedding (Priority: P1)
A neurosurgeon uploads a medical textbook (PDF) to the platform. The system processes the
book, extracting both textual content and embedded diagrams/figures with high fidelity.
Once processing is complete, the book is available as a knowledge source for topic summaries
and chat. The user can see the upload status and a confirmation that the book is ready.
**Why this priority**: Without embedded books there is no knowledge base. All other stories
depend on this story being complete and functional.
**Independent Test**: Upload a single PDF textbook. Verify the upload is accepted, processing
completes, and the book appears in the library as "Ready". Then ask a topic question whose
answer appears only in that book and confirm the correct answer surfaces.
**Acceptance Scenarios**:
1. **Given** a user on the upload page, **When** they select a valid PDF file and confirm
upload, **Then** the system accepts the file, shows a processing progress indicator, and
eventually marks the book as "Ready" in the library.
2. **Given** a book that contains anatomical diagrams, **When** the system finishes
embedding, **Then** diagram content (labels, captions, spatial relationships described
in the diagram) is searchable and retrievable alongside the surrounding text.
3. **Given** an upload of an unsupported file format (e.g., DOCX), **When** the user
submits, **Then** the system rejects the file with a clear error message explaining
accepted formats.
4. **Given** a book is currently being processed, **When** the user navigates to the
library, **Then** the book appears with a "Processing" status and cannot yet be used
as a knowledge source.
---
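The status lifecycle implied by these scenarios can be sketched as a small state machine. This is a minimal illustration, not a prescribed implementation; all names are hypothetical:

```python
from enum import Enum

class BookStatus(Enum):
    PENDING = "Pending"
    PROCESSING = "Processing"
    READY = "Ready"
    FAILED = "Failed"

# Transitions implied by the scenarios above: a book never becomes
# "Ready" without passing through "Processing".
TRANSITIONS = {
    BookStatus.PENDING: {BookStatus.PROCESSING},
    BookStatus.PROCESSING: {BookStatus.READY, BookStatus.FAILED},
    BookStatus.READY: set(),
    BookStatus.FAILED: set(),
}

def advance(current: BookStatus, target: BookStatus) -> BookStatus:
    """Move a book to a new status, rejecting illegal jumps."""
    if target not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition {current.value} -> {target.value}")
    return target
```

A book in any state other than "Ready" is excluded from retrieval, which covers scenario 4.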
### User Story 2 - Topic-Guided Summary (Priority: P2)
The user browses a predefined list of neurosurgery topics, selects one (e.g., "Cerebral
Aneurysm Management"), and receives an AI-generated summary that cross-references all
uploaded books. The summary synthesizes information from multiple sources, citing which
book each piece of information comes from.
**Why this priority**: This is the core learning feature — the primary reason a neurosurgeon
uses the platform. It delivers immediate value once at least one book is embedded.
**Independent Test**: With at least one book covering the target topic embedded, select that
topic from the list and confirm: (a) a coherent summary is generated, (b) the summary
references content present in the uploaded book(s), and (c) source citations are visible.
**Acceptance Scenarios**:
1. **Given** at least one book is in "Ready" state, **When** the user selects a topic from
the predefined list, **Then** the system generates a summary within 30 seconds that
draws on content from the uploaded books.
2. **Given** multiple books are uploaded, **When** the user requests a topic summary,
**Then** the summary synthesizes information from all relevant books and indicates
which source each key point came from.
3. **Given** no uploaded book contains content relevant to the selected topic, **When**
the user requests a summary, **Then** the system clearly communicates that its knowledge
is limited and no relevant source was found.
4. **Given** a topic summary is displayed, **When** the user inspects a cited passage,
**Then** they can identify the originating book title and approximate location
(chapter or page range).
---
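One way to satisfy the citation requirement in scenario 4 is to attach a small citation record to each key point. A minimal sketch, with hypothetical names and rendering:

```python
from dataclasses import dataclass

@dataclass
class Citation:
    """Source reference displayed next to a summary claim."""
    book_title: str
    page_start: int
    page_end: int

    def label(self) -> str:
        # The "book title and approximate location" a reader inspects.
        if self.page_start == self.page_end:
            return f"{self.book_title}, p. {self.page_start}"
        return f"{self.book_title}, pp. {self.page_start}-{self.page_end}"
```

Chapter-level citations could use the same shape with a chapter field instead of a page range.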
### User Story 3 - Knowledge Deepening Chat (Priority: P3)
After reading a topic summary (or independently), the user enters a conversational chat
to ask follow-up questions. The AI answers using the embedded books as its exclusive
knowledge source, enabling the user to drill into specific areas, request clarifications,
or explore edge cases.
**Why this priority**: The chat extends the value of the summary by enabling personalised,
interactive learning. It builds on the RAG infrastructure established by P1 and P2.
**Independent Test**: Start a chat session on a specific topic. Ask a specific clinical
question whose answer is in an uploaded book. Confirm the response references the correct
source and that the conversation maintains context across at least 3 turns.
**Acceptance Scenarios**:
1. **Given** a user in a chat session, **When** they ask a question relevant to an
uploaded book, **Then** the system responds with a grounded answer and cites the
source book.
2. **Given** an ongoing chat, **When** the user asks a follow-up question that refers
to a previous turn ("What about the complication you just mentioned?"), **Then** the
system maintains conversational context and provides a coherent answer.
3. **Given** a user asks a question outside the scope of any uploaded book, **When**
the system responds, **Then** it clearly states that no relevant source was found
rather than generating unsupported claims.
4. **Given** a chat session is ongoing, **When** the user wishes to start fresh,
**Then** they can clear the conversation history and begin a new session.
---
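The context-keeping behaviour in scenarios 1 and 2 is commonly implemented by replaying recent turns alongside retrieved passages in each prompt. A minimal sketch, assuming messages and chunks are plain dicts (all names and the prompt wording are illustrative):

```python
def build_chat_prompt(history, retrieved_chunks, question, max_turns=10):
    """Assemble a grounded prompt from recent turns plus retrieved passages."""
    sources = "\n".join(
        f"[{c['book']} p.{c['page']}] {c['text']}" for c in retrieved_chunks
    )
    turns = "\n".join(
        f"{m['role']}: {m['content']}" for m in history[-max_turns:]
    )
    return (
        "Answer ONLY from the sources below. If they are insufficient, "
        "say that no relevant source was found.\n\n"
        f"Sources:\n{sources}\n\n"
        f"Conversation:\n{turns}\nuser: {question}"
    )
```

Bounding the replayed history (`max_turns`) keeps prompts within the model's context window; clearing the session (scenario 4) is then just emptying `history`.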
### Edge Cases
- What happens when a PDF is corrupted or password-protected?
- How does the system handle books that are very large (500+ pages)?
- What if two uploaded books contain contradictory information on the same topic?
- How does diagram embedding behave if a diagram has no caption or label?
## Requirements *(mandatory)*
### Functional Requirements
- **FR-001**: System MUST allow users to upload books in PDF format.
- **FR-002**: System MUST extract and embed textual content from uploaded books with
high precision for use in semantic search.
- **FR-003**: System MUST extract and embed visual content (diagrams, figures, and their
captions/labels) from uploaded books so diagram information is retrievable by the RAG system.
- **FR-004**: System MUST display a predefined, curated list of neurosurgery topics
for user selection.
- **FR-005**: System MUST generate a topic summary by cross-referencing all "Ready"
uploaded books and synthesizing relevant passages.
- **FR-006**: System MUST cite the source book (and approximate location) for each key
claim in a generated summary.
- **FR-007**: System MUST provide a conversational chat interface where AI answers are
grounded exclusively in uploaded book content.
- **FR-008**: System MUST maintain conversational context within a chat session across
multiple turns.
- **FR-009**: System MUST display the embedding/processing status of each uploaded book
(Pending, Processing, Ready, Failed).
- **FR-010**: System MUST reject uploaded files that are not in a supported format and
provide a clear error message.
- **FR-011**: System MUST communicate clearly when a query cannot be answered from the
available book content, rather than generating unsupported claims.
- **FR-012**: The book library MUST be a single shared global library; all users see and
benefit from the same uploaded books. Per-user isolation is out of scope for the POC.
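FR-011 is typically enforced with a retrieval-score floor: if no chunk clears it, the system declines rather than letting the model improvise. A minimal sketch; the 0.35 floor is a hypothetical starting value, not a validated one:

```python
def grounded_chunks(retrieved, floor=0.35):
    """Keep only retrieved chunks whose similarity score clears the floor.

    An empty result means no relevant source was found, and the caller
    should decline to answer (FR-011) instead of generating
    unsupported claims.
    """
    return [c for c in retrieved if c["score"] >= floor]
```

The caller then branches: an empty list yields the "no relevant source was found" message; otherwise the surviving chunks are passed to the LLM as the sole grounding context (FR-007).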
### Key Entities
- **Book**: Uploaded document. Attributes: title, file name, upload date, processing
status (Pending / Processing / Ready / Failed), page count.
- **Topic**: Predefined learning subject. Attributes: name, description, category.
Managed via configuration (not user-editable in the POC).
- **Embedding Chunk**: A semantically coherent unit of book content (text passage or
diagram + caption) with its vector representation and source reference.
- **Chat Session**: A conversation thread. Attributes: creation date, associated topic
(optional), message history.
- **Message**: A single turn in a chat session. Attributes: role (user / assistant),
content, source citations (for assistant messages).
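The entities above can be sketched as plain data types. Field names follow the attributes listed; everything else (types, defaults) is illustrative:

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional

@dataclass
class Book:
    title: str
    file_name: str
    upload_date: date
    status: str = "Pending"        # Pending / Processing / Ready / Failed
    page_count: Optional[int] = None

@dataclass
class Topic:
    name: str
    category: str
    description: str = ""          # managed via configuration, not user-editable

@dataclass
class EmbeddingChunk:
    book_title: str
    location: str                  # e.g. "ch. 4" or "pp. 120-123"
    text: str                      # passage, or diagram caption/labels
    vector: List[float] = field(default_factory=list)

@dataclass
class Message:
    role: str                      # "user" or "assistant"
    content: str
    citations: List[str] = field(default_factory=list)

@dataclass
class ChatSession:
    created: date
    topic: Optional[str] = None    # a session may start without a topic
    messages: List[Message] = field(default_factory=list)
```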
## Success Criteria *(mandatory)*
### Measurable Outcomes
- **SC-001**: A user can upload a book and have it fully processed and searchable within
10 minutes for a standard-length medical textbook (up to 500 pages).
- **SC-002**: Topic summaries are generated within 30 seconds of user request.
- **SC-003**: At least 90% of generated summary claims can be traced back to a cited
passage in an uploaded book.
- **SC-004**: A user completing the primary flow (upload → topic summary → chat) requires
no external instructions — the interface is self-explanatory.
- **SC-005**: Diagram content from uploaded books is retrievable by the RAG system;
at least one diagram-sourced fact surfaces correctly in a controlled test query.
- **SC-006**: The system correctly declines to answer (and explains why) when a question
has no grounding in uploaded books, in at least 9 out of 10 out-of-scope test queries.
## Assumptions
- Books are uploaded as PDF files; other formats (EPUB, DOCX) are out of scope for the POC.
- The predefined topic list is small (10–50 topics) and curated manually by the project
owner via a configuration file; no admin UI is needed for the POC.
- Access is protected by a simple shared password or API token (no individual user accounts);
anyone who knows the credential can access the application. Full account management is out
of scope for the POC.
- The LLM used for summary generation and chat is accessed via an external API (not
self-hosted); the specific provider is a technical implementation decision.
- Diagram embedding means extracting diagram images and their associated captions/labels
as descriptive text for semantic search; pixel-level image similarity search is out of
scope for the POC.
- The system is designed for a small number of concurrent users (POC scale: < 10
simultaneous users); horizontal scaling is not a requirement at this stage.
- Internet connectivity is assumed for both the user and the server (external LLM API calls).
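The topic-list assumption above could correspond to a configuration file along these lines; the layout and field names are purely illustrative, not a committed format:

```yaml
# topics.yaml - hypothetical layout
topics:
  - name: "Cerebral Aneurysm Management"
    category: "Vascular"
    description: "Diagnosis, rupture risk, clipping vs. coiling"
  - name: "Intracranial Pressure Monitoring"
    category: "Neurocritical Care"
    description: "Indications, devices, treatment thresholds"
```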