# Research: Neurosurgeon RAG Learning Platform

**Branch**: `001-neuro-rag-learning`  
**Date**: 2026-03-31

---
## 1. Spring Boot 4 + Spring AI Versions & BOM

**Decision**: Spring Boot **4.0.5** + Spring AI **1.1.4**.

**Rationale**: Spring Boot 4.0.5 is GA (released February 2026) — this matches the user's original requirement. Spring AI 1.1.4 is the current stable release compatible with Spring Boot 4.x. Spring AI 2.0.0-M4 is available in preview but is not used (KISS — no preview dependencies).

**Maven BOM** (in `<dependencyManagement>`):
```xml
<dependency>
    <groupId>org.springframework.ai</groupId>
    <artifactId>spring-ai-bom</artifactId>
    <version>1.1.4</version>
    <type>pom</type>
    <scope>import</scope>
</dependency>
```

**Key starters** (versions managed by the BOM):
```xml
<!-- pgvector vector store -->
<dependency>
    <groupId>org.springframework.ai</groupId>
    <artifactId>spring-ai-starter-vector-store-pgvector</artifactId>
</dependency>

<!-- OpenAI (embedding + chat; swap for any other provider) -->
<dependency>
    <groupId>org.springframework.ai</groupId>
    <artifactId>spring-ai-starter-model-openai</artifactId>
</dependency>

<!-- PDF document reader (Spring AI native, Apache PDFBox-based) -->
<dependency>
    <groupId>org.springframework.ai</groupId>
    <artifactId>spring-ai-pdf-document-reader</artifactId>
</dependency>

<!-- PostgreSQL JDBC driver -->
<dependency>
    <groupId>org.postgresql</groupId>
    <artifactId>postgresql</artifactId>
    <scope>runtime</scope>
</dependency>
```

**Alternatives considered**: Spring AI 2.0.0-M4 — rejected (milestone release, KISS principle).

---
## 2. Spring AI RAG Pipeline

**Decision**: Use Spring AI's `PagePdfDocumentReader`, `EmbeddingModel`, `PgVectorStore`, and `ChatClient` with `QuestionAnswerAdvisor` for RAG.

**Key classes**:

| Component | Class / Interface | Purpose |
|-----------|-------------------|---------|
| Document ingestion | `PagePdfDocumentReader` | Parses PDF pages into `Document` objects |
| Chunking | `TokenCountBatchingStrategy` | Splits documents to respect token limits |
| Embedding | `EmbeddingModel` | Converts text chunks to vectors |
| Storage | `VectorStore` / `PgVectorStore` | Persists and searches embeddings |
| RAG query | `QuestionAnswerAdvisor` | Augments the prompt with retrieved context |
| Chat | `ChatClient` | Fluent API for LLM interactions |

**RAG pipeline flow**:
```
PDF file
→ PagePdfDocumentReader (extract text per page as Document)
→ TokenCountBatchingStrategy (chunk to embedding token limit)
→ EmbeddingModel.embed() (vectorise each chunk)
→ PgVectorStore.add() (persist chunk + vector + metadata)

User query
→ ChatClient.prompt()
    .advisors(new QuestionAnswerAdvisor(vectorStore))
    .user(question)
    .call()
→ QuestionAnswerAdvisor runs similaritySearch, injects context
→ ChatModel generates response grounded in retrieved chunks
```
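The query half of this flow can be sketched as a small Spring service. This is a minimal sketch, matching the document's snippet style (imports omitted); the `RagService` name and the constructor wiring are illustrative assumptions, not part of the plan.

```java
// Hypothetical service wrapping the query side of the pipeline.
// ChatClient.Builder and VectorStore are auto-configured beans.
@Service
public class RagService {

    private final ChatClient chatClient;
    private final VectorStore vectorStore;

    public RagService(ChatClient.Builder builder, VectorStore vectorStore) {
        this.chatClient = builder.build();
        this.vectorStore = vectorStore;
    }

    public String ask(String question) {
        // The advisor runs the similarity search and injects the retrieved
        // chunks into the prompt before the model is called.
        return chatClient.prompt()
                .advisors(new QuestionAnswerAdvisor(vectorStore))
                .user(question)
                .call()
                .content();
    }
}
```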
**application.properties**:
```properties
spring.ai.vectorstore.pgvector.dimensions=1536
spring.ai.vectorstore.pgvector.distance-type=COSINE_DISTANCE
spring.ai.vectorstore.pgvector.index-type=HNSW
spring.ai.vectorstore.pgvector.initialize-schema=true

spring.ai.openai.api-key=${OPENAI_API_KEY}
spring.ai.openai.chat.options.model=gpt-4o
spring.ai.openai.embedding.options.model=text-embedding-3-small
```

**Rationale**: `QuestionAnswerAdvisor` is the idiomatic Spring AI RAG pattern — zero boilerplate. `EmbeddingModel` and `ChatClient` are interfaces; swapping the LLM provider is a single property change (Principle II).

---
## 3. PDF Ingestion & Chunking

**Decision**: `PagePdfDocumentReader` (from `spring-ai-pdf-document-reader`) for text extraction; the default `TokenCountBatchingStrategy` for chunking.

**Approach**:
```java
PagePdfDocumentReader reader = new PagePdfDocumentReader(
        new FileSystemResource("textbook.pdf"),
        PdfDocumentReaderConfig.builder()
                .withPagesPerDocument(1) // one Document per page
                .build()
);
List<Document> pages = reader.get();
vectorStore.add(pages); // batching + embedding handled internally
```

- Each `Document` carries metadata: source filename and page number.
- `TokenCountBatchingStrategy` ensures chunks fit the embedding model's context window (~8,000 tokens for OpenAI models).
- Custom metadata (`book_id`, `book_title`, `chunk_type`) is added before calling `vectorStore.add()`.
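The metadata-enrichment step in the last bullet can be sketched as follows. This is a minimal sketch in the document's snippet style (imports omitted); the helper name `ingestWithMetadata` and the parameter values are illustrative, and it assumes the `Document` metadata map is mutable.

```java
// Hypothetical helper: tag each page Document with book-level metadata
// before it is embedded and persisted by the vector store.
void ingestWithMetadata(List<Document> pages, VectorStore vectorStore,
                        String bookId, String bookTitle) {
    for (Document page : pages) {
        page.getMetadata().put("book_id", bookId);
        page.getMetadata().put("book_title", bookTitle);
        page.getMetadata().put("chunk_type", "text"); // "diagram" for caption chunks
    }
    vectorStore.add(pages); // chunking + embedding still handled internally
}
```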
**Rationale**: `PagePdfDocumentReader` is the recommended Spring AI PDF reader for text-focused RAG — lighter than the Tika reader and purpose-built for PDFs.

**Alternatives considered**: `TikaDocumentReader` — provides multi-format support but is heavier; rejected for the POC (only PDFs are in scope).

---
## 4. Diagram / Visual Content Handling

**Decision**: Extract diagram captions and surrounding text as text chunks tagged `chunk_type=diagram`. No pixel-level image embedding for the POC.

**Approach**:
- `PagePdfDocumentReader` extracts all text, including figure captions (e.g., `"Figure 3.2: Circle of Willis anatomy..."`).
- A post-processing step identifies lines matching caption patterns (`^(Figure|Fig\.|Table|Diagram)\s+[\d.]+`) and tags those `Document` objects with `metadata.put("chunk_type", "diagram")`.
- The caption text plus the surrounding descriptive paragraph are included in the chunk, making the diagram content semantically searchable.
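The caption-matching rule above is plain regex work and can be isolated in a small helper. A minimal sketch; the class and method names are illustrative, and tagging the matched `Document` objects happens separately.

```java
import java.util.regex.Pattern;

public class CaptionDetector {

    // Matches lines such as "Figure 3.2: ...", "Fig. 12.4 ...", "Table 7.1 ...".
    private static final Pattern CAPTION =
            Pattern.compile("^(Figure|Fig\\.|Table|Diagram)\\s+[\\d.]+");

    // True when the line starts with a recognised caption prefix.
    public static boolean isCaption(String line) {
        return CAPTION.matcher(line).find();
    }
}
```

For example, `isCaption("Figure 3.2: Circle of Willis anatomy")` returns `true`, while a line of running prose does not match.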
**Rationale**: This is the simplest approach that satisfies FR-003 within the KISS constraints. The spec explicitly excludes pixel-level image search from the POC scope.

**Future upgrade path**: Use a vision model (e.g., GPT-4o vision) to generate text descriptions of extracted images and add them as additional `chunk_type=diagram` documents — no architectural change needed, just a new processing step.

---
## 5. Simple Shared-Password Authentication

**Decision**: Spring Security HTTP Basic with a single in-memory user.

**Spring Security config**:
```java
@Configuration
@EnableWebSecurity
public class SecurityConfig {

    @Bean
    public SecurityFilterChain filterChain(HttpSecurity http) throws Exception {
        http.authorizeHttpRequests(a -> a.anyRequest().authenticated())
            .httpBasic(Customizer.withDefaults())
            .csrf(AbstractHttpConfigurer::disable); // REST API — no CSRF needed
        return http.build();
    }

    @Bean
    public UserDetailsService userDetailsService(
            @Value("${app.auth.password}") String password) {
        UserDetails user = User.builder()
                .username("neurosurgeon")
                .password("{noop}" + password) // {noop} = plain text for the POC
                .roles("USER")
                .build();
        return new InMemoryUserDetailsManager(user);
    }
}
```

```properties
# application.properties
app.auth.password=${APP_PASSWORD}
```

**Rationale**: Zero database dependency; zero token management. The Vue.js frontend sends `Authorization: Basic <base64>` via Axios's `auth` config. Fully sufficient for < 10 trusted users on a private network (POC constraint).
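The header the frontend sends can be reproduced in plain Java, which is handy for smoke-testing the API with `curl` or an HTTP client. A minimal sketch; the class name and credential values are placeholders.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class BasicAuth {

    // Build the Authorization header value that Axios's `auth` config
    // (or any browser) produces for HTTP Basic.
    public static String basicAuthHeader(String username, String password) {
        String credentials = username + ":" + password;
        return "Basic " + Base64.getEncoder()
                .encodeToString(credentials.getBytes(StandardCharsets.UTF_8));
    }
}
```

For example, `basicAuthHeader("neurosurgeon", "changeme")` yields `Basic bmV1cm9zdXJnZW9uOmNoYW5nZW1l`.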
**Alternatives considered**: JWT — rejected (requires a token endpoint, more code); custom API-key filter — rejected (HTTP Basic is simpler and just as secure at this scale).

---
## 6. Vue.js 3 Project Structure

**Decision**: Vite + Vue 3 + TypeScript + Pinia + Vue Router + Axios.

**Standard layout** (`npm create vue@latest`):
```
frontend/src/
├── components/   # Reusable UI components (BookCard, ChatMessage, etc.)
├── views/        # Route-level pages: UploadView, TopicsView, ChatView
├── stores/       # Pinia: bookStore, topicStore, chatStore
├── services/     # api.ts — Axios instance with base URL + Basic auth header
├── router/       # index.ts — Vue Router routes
└── main.ts
```

**Axios setup** (`services/api.ts`):
```typescript
import axios from 'axios'

export const api = axios.create({
  baseURL: import.meta.env.VITE_API_URL ?? 'http://localhost:8080/api/v1',
  auth: {
    username: 'neurosurgeon',
    password: import.meta.env.VITE_APP_PASSWORD
  }
})
```

**Rationale**: Axios handles HTTP Basic auth via its `auth` config — no manual `btoa()` needed. Pinia is Vue's official state manager (it replaced Vuex).

---
## 7. pgvector Configuration & Schema

**Decision**: Spring AI auto-creates the `vector_store` table via `initialize-schema=true`. Application tables use Flyway migrations.

**Required PostgreSQL extensions** (run once on the provided database):
```sql
CREATE EXTENSION IF NOT EXISTS vector;
CREATE EXTENSION IF NOT EXISTS hstore;
CREATE EXTENSION IF NOT EXISTS "uuid-ossp";
```

**Spring AI auto-created table**:
```sql
CREATE TABLE IF NOT EXISTS vector_store (
    id        uuid DEFAULT uuid_generate_v4() PRIMARY KEY,
    content   TEXT NOT NULL,
    metadata  JSONB NOT NULL DEFAULT '{}',
    embedding VECTOR(1536) NOT NULL
);
CREATE INDEX ON vector_store USING HNSW (embedding vector_cosine_ops);
```

**Key properties**:

| Property | Value | Notes |
|----------|-------|-------|
| `dimensions` | `1536` | Matches `text-embedding-3-small`; update if the provider changes |
| `distance-type` | `COSINE_DISTANCE` | Standard for normalised text embeddings |
| `index-type` | `HNSW` | O(log N) search; best default for the POC |
| `initialize-schema` | `true` | Auto-create the table on startup (safe for a POC) |

**Embedding dimensions note**: if the LLM provider is switched (e.g., to Ollama with a 768-dimension model), update `dimensions` in the properties **and** re-embed all books — the `vector_store` table must be recreated with the new dimension.
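Once the store is initialised, retrieval against the JSONB `metadata` column can be sketched as follows. A minimal sketch in the document's snippet style (imports omitted); the query string and the `book_id` value are illustrative.

```java
// Hypothetical lookup: top-5 most similar chunks, restricted to a single
// book via a metadata filter expression on the JSONB column.
List<Document> topChunks(VectorStore vectorStore) {
    return vectorStore.similaritySearch(
            SearchRequest.builder()
                    .query("circle of Willis anatomy")
                    .topK(5)
                    .filterExpression("book_id == '42'") // illustrative id
                    .build());
}
```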
**Alternatives considered**: IVFFlat index — more memory-efficient but slower to query; `NONE` (exact search) for very small datasets. HNSW is the best default for a POC where correctness matters more than storage.