ai-teacher/specs/001-neuro-rag-learning/research.md
Adrien dc0bcab36e plan
2026-03-31 15:42:49 +02:00


Research: Neurosurgeon RAG Learning Platform

Branch: 001-neuro-rag-learning Date: 2026-03-31


1. Spring Boot 4 + Spring AI Versions & BOM

Decision: Spring Boot 4.0.5 + Spring AI 1.1.4.

Rationale: Spring Boot 4.0.5 is GA (released February 2026) — this matches the user's original requirement. Spring AI 1.1.4 is the current stable release compatible with Spring Boot 4.x. Spring AI 2.0.0-M4 is available in preview but not used (KISS — no preview dependencies).

Maven BOM (in <dependencyManagement>):

<dependency>
  <groupId>org.springframework.ai</groupId>
  <artifactId>spring-ai-bom</artifactId>
  <version>1.1.4</version>
  <type>pom</type>
  <scope>import</scope>
</dependency>

Key starters (versions managed by BOM):

<!-- pgvector vector store -->
<dependency>
  <groupId>org.springframework.ai</groupId>
  <artifactId>spring-ai-starter-vector-store-pgvector</artifactId>
</dependency>

<!-- OpenAI (embedding + chat; swap for any other provider) -->
<dependency>
  <groupId>org.springframework.ai</groupId>
  <artifactId>spring-ai-starter-model-openai</artifactId>
</dependency>

<!-- PDF document reader (Spring AI native, Apache PDFBox-based) -->
<dependency>
  <groupId>org.springframework.ai</groupId>
  <artifactId>spring-ai-pdf-document-reader</artifactId>
</dependency>

<!-- PostgreSQL JDBC driver -->
<dependency>
  <groupId>org.postgresql</groupId>
  <artifactId>postgresql</artifactId>
  <scope>runtime</scope>
</dependency>

Alternatives considered: Spring AI 2.0.0-M4 — rejected (milestone, KISS principle).


2. Spring AI RAG Pipeline

Decision: Use Spring AI's PagePdfDocumentReader, EmbeddingModel, PgVectorStore, and ChatClient with QuestionAnswerAdvisor for RAG.

Key classes:

Component            Class / Interface              Purpose
-------------------  -----------------------------  -----------------------------------------
Document ingestion   PagePdfDocumentReader          Parse PDF pages into Document objects
Chunking             TokenCountBatchingStrategy     Split docs to respect token limits
Embedding            EmbeddingModel                 Convert text chunks to vectors
Storage              VectorStore / PgVectorStore    Persist and search embeddings
RAG query            QuestionAnswerAdvisor          Augment the prompt with retrieved context
Chat                 ChatClient                     Fluent API for LLM interactions

RAG pipeline flow:

PDF file
  → PagePdfDocumentReader     (extract text per page as Document)
  → TokenCountBatchingStrategy (chunk to embedding token limit)
  → EmbeddingModel.embed()    (vectorise each chunk)
  → PgVectorStore.add()       (persist chunk + vector + metadata)

User query
  → ChatClient.prompt()
      .advisors(new QuestionAnswerAdvisor(vectorStore))
      .user(question)
      .call()
  → QuestionAnswerAdvisor runs similaritySearch, injects context
  → ChatModel generates response grounded in retrieved chunks
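
The query half of the flow above can be wired as a small Spring service. This is an orientation sketch assuming Spring AI 1.x (advisor package names have moved between releases, so verify the imports against the BOM version in use):

```java
import org.springframework.ai.chat.client.ChatClient;
import org.springframework.ai.chat.client.advisor.vectorstore.QuestionAnswerAdvisor;
import org.springframework.ai.vectorstore.VectorStore;
import org.springframework.stereotype.Service;

// Sketch only: wires the QuestionAnswerAdvisor from the flow above into a
// reusable service. The advisor runs the similarity search and injects the
// retrieved chunks into the prompt before the model is called.
@Service
public class RagChatService {

    private final ChatClient chatClient;

    public RagChatService(ChatClient.Builder builder, VectorStore vectorStore) {
        this.chatClient = builder
            .defaultAdvisors(new QuestionAnswerAdvisor(vectorStore))
            .build();
    }

    public String ask(String question) {
        return this.chatClient.prompt()
            .user(question)
            .call()
            .content();
    }
}
```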

application.properties:

spring.ai.vectorstore.pgvector.dimensions=1536
spring.ai.vectorstore.pgvector.distance-type=COSINE_DISTANCE
spring.ai.vectorstore.pgvector.index-type=HNSW
spring.ai.vectorstore.pgvector.initialize-schema=true

spring.ai.openai.api-key=${OPENAI_API_KEY}
spring.ai.openai.chat.options.model=gpt-4o
spring.ai.openai.embedding.options.model=text-embedding-3-small

Rationale: QuestionAnswerAdvisor is the Spring AI-idiomatic RAG pattern — zero boilerplate. EmbeddingModel and ChatClient are interfaces; swapping the LLM provider is a single property change (Principle II).


3. PDF Ingestion & Chunking

Decision: PagePdfDocumentReader (from spring-ai-pdf-document-reader) for text extraction; default TokenCountBatchingStrategy for chunking.

Approach:

PagePdfDocumentReader reader = new PagePdfDocumentReader(
    new FileSystemResource("textbook.pdf"),
    PdfDocumentReaderConfig.builder()
        .withPagesPerDocument(1)   // one Document per page
        .build()
);
List<Document> pages = reader.get();
vectorStore.add(pages);  // batching + embedding handled internally

  • Each Document carries metadata: source filename, page number.
  • TokenCountBatchingStrategy ensures chunks fit the embedding model's context window (~8,000 tokens for OpenAI embedding models).
  • Custom metadata (book_id, book_title, chunk_type) is added before calling vectorStore.add().
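
For orientation, the chunking step above can be illustrated with a deliberately naive splitter. This is NOT Spring AI's TokenCountBatchingStrategy; token counts are approximated as characters / 4, a common rough heuristic for English text:

```java
import java.util.ArrayList;
import java.util.List;

// Naive illustration of batching text into token-limited chunks.
public class NaiveChunker {

    // Rough token estimate: ~4 characters per token for English text.
    static int approxTokens(String text) {
        return text.length() / 4;
    }

    // Split a page's text into word-aligned chunks under maxTokens each.
    static List<String> chunk(String pageText, int maxTokens) {
        List<String> chunks = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (String word : pageText.split("\\s+")) {
            if (approxTokens(current.toString() + " " + word) > maxTokens
                    && current.length() > 0) {
                chunks.add(current.toString());
                current.setLength(0);
            }
            if (current.length() > 0) current.append(' ');
            current.append(word);
        }
        if (current.length() > 0) chunks.add(current.toString());
        return chunks;
    }
}
```

The real strategy counts tokens with the embedding model's tokenizer, but the shape of the operation is the same: no text is lost, only regrouped.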

Rationale: PagePdfDocumentReader is the recommended Spring AI PDF reader for text-focused RAG — lighter than the Tika reader and purpose-built for PDFs.

Alternatives considered: TikaDocumentReader — provides multi-format support but is heavier; rejected for POC (only PDFs are in scope).


4. Diagram / Visual Content Handling

Decision: Extract diagram captions and surrounding text as text chunks tagged chunk_type=diagram. No pixel-level image embedding for the POC.

Approach:

  • PagePdfDocumentReader extracts all text including figure captions (e.g., "Figure 3.2: Circle of Willis anatomy...").
  • A post-processing step identifies lines matching caption patterns (^(Figure|Fig\.|Table|Diagram)\s+[\d.]+) and tags those Document objects with metadata.put("chunk_type", "diagram").
  • The caption text plus the surrounding descriptive paragraph are included in the chunk, making the diagram content semantically searchable.
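
The caption-detection step can be sketched with plain java.util.regex, using the same pattern given in the bullets above (the class name is illustrative):

```java
import java.util.regex.Pattern;

// Heuristic caption detector for the post-processing step described above.
public class CaptionDetector {

    // Matches lines opening with "Figure 3.2", "Fig. 1.4", "Table 2",
    // "Diagram 5.1", etc.
    private static final Pattern CAPTION =
        Pattern.compile("^(Figure|Fig\\.|Table|Diagram)\\s+[\\d.]+");

    public static boolean isCaption(String line) {
        return CAPTION.matcher(line).find();
    }
}
```

In the ingestion pipeline, a Document whose text matches would then get metadata.put("chunk_type", "diagram") before vectorStore.add().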

Rationale: This is the simplest approach that satisfies FR-003 within KISS constraints. The spec explicitly excludes pixel-level image search from the POC scope.

Future upgrade path: Use a vision model (GPT-4o vision) to generate text descriptions of extracted images and add them as additional chunk_type=diagram documents — no architectural change needed, just a new processing step.


5. Simple Shared-Password Authentication

Decision: Spring Security HTTP Basic with a single in-memory user.

Spring Security config:

@Configuration
@EnableWebSecurity
public class SecurityConfig {

    @Bean
    public SecurityFilterChain filterChain(HttpSecurity http) throws Exception {
        http.authorizeHttpRequests(a -> a.anyRequest().authenticated())
            .httpBasic(Customizer.withDefaults())
            .csrf(AbstractHttpConfigurer::disable);  // REST API — no CSRF needed
        return http.build();
    }

    @Bean
    public UserDetailsService userDetailsService(
            @Value("${app.auth.password}") String password) {
        UserDetails user = User.builder()
            .username("neurosurgeon")
            .password("{noop}" + password)   // {noop} = plain text for POC
            .roles("USER")
            .build();
        return new InMemoryUserDetailsManager(user);
    }
}

# application.properties
app.auth.password=${APP_PASSWORD}

Rationale: Zero database dependency; zero token management. The Vue.js frontend sets Authorization: Basic <base64> via Axios auth config. Fully sufficient for < 10 trusted users on a private network (POC constraint).
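
For reference, the header Axios produces is just the base64 encoding of username:password (encoding, not encryption, which is why the private-network assumption matters). A minimal sketch with the JDK's Base64; the credentials below are made-up examples:

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class BasicAuthHeader {

    // Build the value of the Authorization header for HTTP Basic auth.
    public static String of(String username, String password) {
        String credentials = username + ":" + password;
        String encoded = Base64.getEncoder()
            .encodeToString(credentials.getBytes(StandardCharsets.UTF_8));
        return "Basic " + encoded;
    }
}
```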

Alternatives considered: JWT — rejected (requires token endpoint, more code); custom API key filter — rejected (HTTP Basic is simpler and just as secure for this scale).


6. Vue.js 3 Project Structure

Decision: Vite + Vue 3 + TypeScript + Pinia + Vue Router + Axios.

Standard layout (npm create vue@latest):

frontend/src/
├── components/      # Reusable UI components (BookCard, ChatMessage, etc.)
├── views/           # Route-level pages: UploadView, TopicsView, ChatView
├── stores/          # Pinia: bookStore, topicStore, chatStore
├── services/        # api.ts — Axios instance with base URL + Basic auth header
├── router/          # index.ts — Vue Router routes
└── main.ts

Axios setup (services/api.ts):

import axios from 'axios'

export const api = axios.create({
  baseURL: import.meta.env.VITE_API_URL ?? 'http://localhost:8080/api/v1',
  auth: {
    username: 'neurosurgeon',
    password: import.meta.env.VITE_APP_PASSWORD
  }
})

Rationale: Axios handles HTTP Basic auth via its auth config — no manual btoa() needed. Pinia is Vue's official state manager (replaced Vuex).


7. pgvector Configuration & Schema

Decision: Spring AI auto-creates the vector_store table via initialize-schema=true. Application tables use Flyway migrations.

Required PostgreSQL extensions (run once on the provided database):

CREATE EXTENSION IF NOT EXISTS vector;
CREATE EXTENSION IF NOT EXISTS hstore;
CREATE EXTENSION IF NOT EXISTS "uuid-ossp";

Spring AI auto-created table:

CREATE TABLE IF NOT EXISTS vector_store (
    id       uuid DEFAULT uuid_generate_v4() PRIMARY KEY,
    content  TEXT NOT NULL,
    metadata JSONB NOT NULL DEFAULT '{}',
    embedding VECTOR(1536) NOT NULL
);
CREATE INDEX ON vector_store USING HNSW (embedding vector_cosine_ops);

Key properties:

Property           Value            Notes
-----------------  ---------------  -----------------------------------------------------------
dimensions         1536             Matches text-embedding-3-small; update if provider changes
distance-type      COSINE_DISTANCE  Standard for normalised text embeddings
index-type         HNSW             O(log N) search; best default for POC
initialize-schema  true             Auto-creates the table on startup (safe for POC)
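
COSINE_DISTANCE corresponds to pgvector's vector_cosine_ops: distance = 1 − cosine similarity, so 0 means identical direction and 2 means opposite. A plain-Java sketch of the formula:

```java
public class CosineDistance {

    // distance = 1 - (a·b) / (|a| * |b|), as computed by pgvector's
    // vector_cosine_ops operator class.
    public static double of(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return 1.0 - dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}
```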

Embedding dimensions note: if the LLM provider is switched (e.g., to Ollama with a 768-dim model), update dimensions in properties and re-embed all books — the vector_store table must be recreated with the new dimension.
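
For example, a hypothetical swap to a 768-dimension Ollama embedding model (the model name below is an illustrative assumption, not part of this plan) would only touch properties, followed by dropping vector_store and re-ingesting:

```properties
spring.ai.vectorstore.pgvector.dimensions=768
spring.ai.ollama.embedding.options.model=nomic-embed-text
```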

Alternatives considered: IVFFlat index (more memory-efficient to build, but slower and less accurate to query) and index-type NONE, i.e. an exact sequential scan, which is viable only for very small datasets. HNSW is the best default for a POC where search quality matters more than storage.