# Research: Neurosurgeon RAG Learning Platform

**Branch**: `001-neuro-rag-learning`  
**Date**: 2026-03-31

---
## 1. Spring Boot 4 + Spring AI Versions & BOM

**Decision**: Spring Boot **4.0.5** + Spring AI **1.1.4**.

**Rationale**: Spring Boot 4.0.5 is GA (released February 2026) — this matches the user's original requirement. Spring AI 1.1.4 is the current stable release compatible with Spring Boot 4.x. Spring AI 2.0.0-M4 is available in preview but is not used (KISS — no preview dependencies).

**Maven BOM** (in `<dependencyManagement>`):
```xml
<dependency>
    <groupId>org.springframework.ai</groupId>
    <artifactId>spring-ai-bom</artifactId>
    <version>1.1.4</version>
    <type>pom</type>
    <scope>import</scope>
</dependency>
```

**Key starters** (versions managed by the BOM):
```xml
<!-- pgvector vector store -->
<dependency>
    <groupId>org.springframework.ai</groupId>
    <artifactId>spring-ai-starter-vector-store-pgvector</artifactId>
</dependency>

<!-- OpenAI (embedding + chat; swap for any other provider) -->
<dependency>
    <groupId>org.springframework.ai</groupId>
    <artifactId>spring-ai-starter-model-openai</artifactId>
</dependency>

<!-- PDF document reader (Spring AI native, Apache PDFBox-based) -->
<dependency>
    <groupId>org.springframework.ai</groupId>
    <artifactId>spring-ai-pdf-document-reader</artifactId>
</dependency>

<!-- PostgreSQL JDBC driver -->
<dependency>
    <groupId>org.postgresql</groupId>
    <artifactId>postgresql</artifactId>
    <scope>runtime</scope>
</dependency>
```

**Alternatives considered**: Spring AI 2.0.0-M4 — rejected (milestone release, KISS principle).

---
## 2. Spring AI RAG Pipeline

**Decision**: Use Spring AI's `PagePdfDocumentReader`, `EmbeddingModel`, `PgVectorStore`, and `ChatClient` with `QuestionAnswerAdvisor` for RAG.

**Key classes**:

| Component | Class / Interface | Purpose |
|-----------|-------------------|---------|
| Document ingestion | `PagePdfDocumentReader` | Parses PDF pages into `Document` objects |
| Chunking | `TokenCountBatchingStrategy` | Splits documents to respect token limits |
| Embedding | `EmbeddingModel` | Converts text chunks to vectors |
| Storage | `VectorStore` / `PgVectorStore` | Persists and searches embeddings |
| RAG query | `QuestionAnswerAdvisor` | Augments the prompt with retrieved context |
| Chat | `ChatClient` | Fluent API for LLM interactions |

**RAG pipeline flow**:
```
PDF file
→ PagePdfDocumentReader (extract text per page as Document)
→ TokenCountBatchingStrategy (chunk to embedding token limit)
→ EmbeddingModel.embed() (vectorise each chunk)
→ PgVectorStore.add() (persist chunk + vector + metadata)

User query
→ ChatClient.prompt()
    .advisors(new QuestionAnswerAdvisor(vectorStore))
    .user(question)
    .call()
→ QuestionAnswerAdvisor runs similaritySearch, injects context
→ ChatModel generates response grounded in retrieved chunks
```
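The query half of this flow can be sketched as a small Spring service. This is a minimal sketch, matching the document's snippet style (imports omitted); the `RagService` name and the constructor wiring are illustrative assumptions, not part of the plan.

```java
// Hypothetical service wrapping the query side of the pipeline.
// ChatClient.Builder and VectorStore are auto-configured beans.
@Service
public class RagService {

    private final ChatClient chatClient;
    private final VectorStore vectorStore;

    public RagService(ChatClient.Builder builder, VectorStore vectorStore) {
        this.chatClient = builder.build();
        this.vectorStore = vectorStore;
    }

    public String ask(String question) {
        // The advisor runs the similarity search and injects the retrieved
        // chunks into the prompt before the model is called.
        return chatClient.prompt()
                .advisors(new QuestionAnswerAdvisor(vectorStore))
                .user(question)
                .call()
                .content();
    }
}
```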
**application.properties**:
```properties
spring.ai.vectorstore.pgvector.dimensions=1536
spring.ai.vectorstore.pgvector.distance-type=COSINE_DISTANCE
spring.ai.vectorstore.pgvector.index-type=HNSW
spring.ai.vectorstore.pgvector.initialize-schema=true

spring.ai.openai.api-key=${OPENAI_API_KEY}
spring.ai.openai.chat.options.model=gpt-4o
spring.ai.openai.embedding.options.model=text-embedding-3-small
```

**Rationale**: `QuestionAnswerAdvisor` is the idiomatic Spring AI RAG pattern — zero boilerplate. `EmbeddingModel` and `ChatClient` are interfaces; swapping the LLM provider is a single property change (Principle II).

---
## 3. PDF Ingestion & Chunking

**Decision**: `PagePdfDocumentReader` (from `spring-ai-pdf-document-reader`) for text extraction; the default `TokenCountBatchingStrategy` for chunking.

**Approach**:
```java
PagePdfDocumentReader reader = new PagePdfDocumentReader(
        new FileSystemResource("textbook.pdf"),
        PdfDocumentReaderConfig.builder()
                .withPagesPerDocument(1) // one Document per page
                .build()
);
List<Document> pages = reader.get();
vectorStore.add(pages); // batching + embedding handled internally
```

- Each `Document` carries metadata: source filename and page number.
- `TokenCountBatchingStrategy` ensures chunks fit the embedding model's context window (~8,000 tokens for OpenAI models).
- Custom metadata (`book_id`, `book_title`, `chunk_type`) is added before calling `vectorStore.add()`.
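The metadata-enrichment step in the last bullet can be sketched as follows. This is a minimal sketch in the document's snippet style (imports omitted); the helper name `ingestWithMetadata` and the parameter values are illustrative, and it assumes the `Document` metadata map is mutable.

```java
// Hypothetical helper: tag each page Document with book-level metadata
// before it is embedded and persisted by the vector store.
void ingestWithMetadata(List<Document> pages, VectorStore vectorStore,
                        String bookId, String bookTitle) {
    for (Document page : pages) {
        page.getMetadata().put("book_id", bookId);
        page.getMetadata().put("book_title", bookTitle);
        page.getMetadata().put("chunk_type", "text"); // "diagram" for caption chunks
    }
    vectorStore.add(pages); // chunking + embedding still handled internally
}
```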
**Rationale**: `PagePdfDocumentReader` is the recommended Spring AI PDF reader for text-focused RAG — lighter than the Tika reader and purpose-built for PDFs.

**Alternatives considered**: `TikaDocumentReader` — provides multi-format support but is heavier; rejected for the POC (only PDFs are in scope).

---
## 4. Diagram / Visual Content Handling

**Decision**: Extract diagram captions and surrounding text as text chunks tagged `chunk_type=diagram`. No pixel-level image embedding for the POC.

**Approach**:
- `PagePdfDocumentReader` extracts all text, including figure captions (e.g., `"Figure 3.2: Circle of Willis anatomy..."`).
- A post-processing step identifies lines matching caption patterns (`^(Figure|Fig\.|Table|Diagram)\s+[\d.]+`) and tags those `Document` objects with `metadata.put("chunk_type", "diagram")`.
- The caption text plus the surrounding descriptive paragraph are included in the chunk, making the diagram content semantically searchable.
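The caption-matching rule above is plain regex work and can be isolated in a small helper. A minimal sketch; the class and method names are illustrative, and tagging the matched `Document` objects happens separately.

```java
import java.util.regex.Pattern;

public class CaptionDetector {

    // Matches lines such as "Figure 3.2: ...", "Fig. 12.4 ...", "Table 7.1 ...".
    private static final Pattern CAPTION =
            Pattern.compile("^(Figure|Fig\\.|Table|Diagram)\\s+[\\d.]+");

    // True when the line starts with a recognised caption prefix.
    public static boolean isCaption(String line) {
        return CAPTION.matcher(line).find();
    }
}
```

For example, `isCaption("Figure 3.2: Circle of Willis anatomy")` returns `true`, while a line of running prose does not match.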
**Rationale**: This is the simplest approach that satisfies FR-003 within the KISS constraints. The spec explicitly excludes pixel-level image search from the POC scope.

**Future upgrade path**: Use a vision model (e.g., GPT-4o vision) to generate text descriptions of extracted images and add them as additional `chunk_type=diagram` documents — no architectural change needed, just a new processing step.

---
## 5. Simple Shared-Password Authentication

**Decision**: Spring Security HTTP Basic with a single in-memory user.

**Spring Security config**:
```java
@Configuration
@EnableWebSecurity
public class SecurityConfig {

    @Bean
    public SecurityFilterChain filterChain(HttpSecurity http) throws Exception {
        http.authorizeHttpRequests(a -> a.anyRequest().authenticated())
            .httpBasic(Customizer.withDefaults())
            .csrf(AbstractHttpConfigurer::disable); // REST API — no CSRF needed
        return http.build();
    }

    @Bean
    public UserDetailsService userDetailsService(
            @Value("${app.auth.password}") String password) {
        UserDetails user = User.builder()
                .username("neurosurgeon")
                .password("{noop}" + password) // {noop} = plain text for the POC
                .roles("USER")
                .build();
        return new InMemoryUserDetailsManager(user);
    }
}
```

```properties
# application.properties
app.auth.password=${APP_PASSWORD}
```

**Rationale**: Zero database dependency; zero token management. The Vue.js frontend sends `Authorization: Basic <base64>` via Axios's `auth` config. Fully sufficient for < 10 trusted users on a private network (POC constraint).
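The header the frontend sends can be reproduced in plain Java, which is handy for smoke-testing the API with `curl` or an HTTP client. A minimal sketch; the class name and credential values are placeholders.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class BasicAuth {

    // Build the Authorization header value that Axios's `auth` config
    // (or any browser) produces for HTTP Basic.
    public static String basicAuthHeader(String username, String password) {
        String credentials = username + ":" + password;
        return "Basic " + Base64.getEncoder()
                .encodeToString(credentials.getBytes(StandardCharsets.UTF_8));
    }
}
```

For example, `basicAuthHeader("neurosurgeon", "changeme")` yields `Basic bmV1cm9zdXJnZW9uOmNoYW5nZW1l`.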
**Alternatives considered**: JWT — rejected (requires a token endpoint, more code); custom API-key filter — rejected (HTTP Basic is simpler and just as secure at this scale).

---
## 6. Vue.js 3 Project Structure

**Decision**: Vite + Vue 3 + TypeScript + Pinia + Vue Router + Axios.

**Standard layout** (`npm create vue@latest`):
```
frontend/src/
├── components/   # Reusable UI components (BookCard, ChatMessage, etc.)
├── views/        # Route-level pages: UploadView, TopicsView, ChatView
├── stores/       # Pinia: bookStore, topicStore, chatStore
├── services/     # api.ts — Axios instance with base URL + Basic auth header
├── router/       # index.ts — Vue Router routes
└── main.ts
```

**Axios setup** (`services/api.ts`):
```typescript
import axios from 'axios'

export const api = axios.create({
  baseURL: import.meta.env.VITE_API_URL ?? 'http://localhost:8080/api/v1',
  auth: {
    username: 'neurosurgeon',
    password: import.meta.env.VITE_APP_PASSWORD
  }
})
```

**Rationale**: Axios handles HTTP Basic auth via its `auth` config — no manual `btoa()` needed. Pinia is Vue's official state manager (it replaced Vuex).

---
## 7. pgvector Configuration & Schema

**Decision**: Spring AI auto-creates the `vector_store` table via `initialize-schema=true`. Application tables use Flyway migrations.

**Required PostgreSQL extensions** (run once on the provided database):
```sql
CREATE EXTENSION IF NOT EXISTS vector;
CREATE EXTENSION IF NOT EXISTS hstore;
CREATE EXTENSION IF NOT EXISTS "uuid-ossp";
```

**Spring AI auto-created table**:
```sql
CREATE TABLE IF NOT EXISTS vector_store (
    id        uuid DEFAULT uuid_generate_v4() PRIMARY KEY,
    content   TEXT NOT NULL,
    metadata  JSONB NOT NULL DEFAULT '{}',
    embedding VECTOR(1536) NOT NULL
);
CREATE INDEX ON vector_store USING HNSW (embedding vector_cosine_ops);
```

**Key properties**:

| Property | Value | Notes |
|----------|-------|-------|
| `dimensions` | `1536` | Matches `text-embedding-3-small`; update if the provider changes |
| `distance-type` | `COSINE_DISTANCE` | Standard for normalised text embeddings |
| `index-type` | `HNSW` | O(log N) search; best default for the POC |
| `initialize-schema` | `true` | Auto-create the table on startup (safe for a POC) |

**Embedding dimensions note**: if the LLM provider is switched (e.g., to Ollama with a 768-dimension model), update `dimensions` in the properties **and** re-embed all books — the `vector_store` table must be recreated with the new dimension.
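Once the store is initialised, retrieval against the JSONB `metadata` column can be sketched as follows. A minimal sketch in the document's snippet style (imports omitted); the query string and the `book_id` value are illustrative.

```java
// Hypothetical lookup: top-5 most similar chunks, restricted to a single
// book via a metadata filter expression on the JSONB column.
List<Document> topChunks(VectorStore vectorStore) {
    return vectorStore.similaritySearch(
            SearchRequest.builder()
                    .query("circle of Willis anatomy")
                    .topK(5)
                    .filterExpression("book_id == '42'") // illustrative id
                    .build());
}
```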
**Alternatives considered**: IVFFlat index — more memory-efficient but slower to query; `NONE` (exact search) for very small datasets. HNSW is the best default for a POC where correctness matters more than storage.