Vibe-coded with love

IS469 Final Project

From raw Microsoft filings to grounded financial QA.

FinSight is a domain-specific retrieval-augmented generation system built to answer questions over Microsoft 10-K and 10-Q filings with citations, measurable grounding, and a controlled comparison of seven pipeline designs.

  • 9 SEC filings indexed
  • 7 pipeline variants
  • 20 benchmark questions

Presentation Flow

The project story from start to finish.

01

The problem

SEC filings are information-rich but hard to navigate. Financial questions often require exact figures, fiscal-period awareness, and cross-document comparison.

02

The system

FinSight ingests Microsoft filings, chunks them into searchable evidence, indexes them with dense and sparse retrieval, and generates citation-backed answers.

03

The experiment

Seven variants were evaluated to isolate the impact of retrieval, reranking, hybrid search, query rewriting, metadata filtering, and context compression.

04

The finding

No single pipeline wins every metric. Simpler versions ground answers better, while broader pipelines answer complex questions more completely at higher latency.

System Flow

How the application turns filings into answers.

A

Ingest

Parse 10-K and 10-Q filings into structured text with fiscal metadata.

B

Chunk

Split filings into evidence units sized for retrieval quality and citation traceability.
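The report describes fixed-size chunking (noted later as a limitation for tables), though exact sizes are not stated. A minimal sketch of an overlapping fixed-size chunker, with illustrative size and overlap values:

```python
def chunk_text(text, size=800, overlap=120):
    """Split filing text into overlapping fixed-size character windows.
    The overlap preserves context across chunk boundaries; in practice
    each chunk would also carry fiscal metadata for citation traceability.
    Sizes here are illustrative, not the project's actual settings."""
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break
        start += size - overlap
    return chunks
```

Overlap is the usual mitigation for evidence split across boundaries, though as the limitations section notes, it does not keep financial statement rows intact.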

C

Index

Build a dense ChromaDB index and a BM25 sparse index for complementary search signals.
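The dense side is handled by ChromaDB; the sparse side is BM25 scoring over tokenized chunks. As a self-contained illustration of the sparse signal (a standard BM25 formula, not the project's exact code), scoring works like this:

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Score each tokenized document against the query terms with BM25.
    Rewards exact lexical matches, which is why sparse retrieval helps
    recover verbatim figures and finance terms that embeddings blur."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    df = Counter()                      # document frequency per term
    for d in docs:
        for t in set(d):
            df[t] += 1
    scores = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for t in query_terms:
            if t not in tf:
                continue
            idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
            norm = k1 * (1 - b + b * len(d) / avgdl)
            s += idf * tf[t] * (k1 + 1) / (tf[t] + norm)
        scores.append(s)
    return scores
```

A document that repeats the query term scores higher than one that mentions it once, while documents without the term score zero: exactly the complementary behaviour to dense similarity.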

D

Retrieve

Use dense, sparse, hybrid, filtered, or rewritten-query retrieval depending on the pipeline.
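The report does not state how the hybrid variant fuses dense and sparse results; reciprocal rank fusion is one common, score-scale-free choice, sketched here as an assumption:

```python
def rrf_fuse(dense_ranking, sparse_ranking, k=60):
    """Merge two ranked lists of chunk ids with Reciprocal Rank Fusion.
    Each list contributes 1/(k + rank); chunks ranked well by either
    retriever rise without needing comparable score scales. k=60 is the
    conventional default, not a project-specific setting."""
    scores = {}
    for ranking in (dense_ranking, sparse_ranking):
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)
```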

E

Refine

Optionally rerank or compress context so the generator sees cleaner, more targeted evidence.
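The actual reranker is a cross-encoder model; to stay self-contained, this sketch takes an arbitrary `score_fn` in its place and adds a crude keyword-based compression step as an illustration of the refine stage, not the project's implementation:

```python
def rerank_and_compress(query, chunks, score_fn, top_k=3):
    """Rerank retrieved chunks by score_fn (standing in for a
    cross-encoder), keep the top_k, then drop sentences that share no
    terms with the query - a toy form of context compression."""
    q_terms = set(query.lower().split())
    ranked = sorted(chunks, key=lambda c: score_fn(query, c), reverse=True)[:top_k]
    compressed = []
    for chunk in ranked:
        kept = [s for s in chunk.split(". ")
                if q_terms & set(s.lower().split())]
        compressed.append(". ".join(kept) or chunk)  # never emit an empty chunk
    return compressed
```

The trade-off the evaluation surfaces lives in this stage: pruning lifts precision and faithfulness but can cut recall by discarding evidence the generator needed.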

F

Generate

Produce grounded answers with citations, then evaluate them for faithfulness, relevancy, and numerical accuracy.
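Generation is prompt-driven: evidence chunks are numbered so the model can cite them inline. The exact prompt wording is not given in the report; a minimal sketch of the assembly step, with illustrative instruction text, might look like:

```python
def build_prompt(question, chunks):
    """Assemble a grounded prompt: number each evidence chunk as [Doc-N]
    so answers are auditable, and instruct the model to refuse rather
    than guess when evidence is insufficient. Wording is illustrative."""
    evidence = "\n".join(f"[Doc-{i + 1}] {c}" for i, c in enumerate(chunks))
    return (
        "Answer using only the evidence below. Cite sources as [Doc-N]. "
        "If the evidence is insufficient, say so instead of guessing.\n\n"
        f"Evidence:\n{evidence}\n\nQuestion: {question}\nAnswer:"
    )
```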

Pipeline Variants

Each version adds one more retrieval capability.

V0

LLM-only baseline

No retrieval. The model answers from parametric memory, which makes it fluent but weakly grounded.

Pipeline

Generate only

Primary use

Establish the hallucination floor.

Strength

Fast and simple.

Trade-off

Unsupported claims and no evidence traceability.

Evaluation

What the results actually showed.

Best faithfulness V2 — 0.843

Reranking was the most efficient way to improve grounding (+0.105 over V1 at only +1s latency).

Best relevancy V4 — 0.784

Query rewriting improved coverage and ambiguity handling for comparative queries.

Best numerical accuracy V3 — 0.600

Hybrid retrieval helped recover exact figures and finance terms via lexical matching.

Fastest grounded version V1 — 3.46s

Dense retrieval alone was already strong on simple factual questions at minimal latency.

Faithfulness

Grounded answers peak with reranking.

Higher is better. Faithfulness measures whether answer claims are supported by retrieved evidence.

Simple questions

Dense retrieval (V1) was usually enough for factual lookups, achieving 0.988 relevancy. More complex retrieval stages added little value for single-chunk answers.

Temporal and comparative questions

Broader retrieval mattered more because the system had to align multiple periods or comparison targets. V1's 0.199 comparative relevancy vs V3's 0.726 shows the gap.

Main takeaway

The bottleneck shifts with task complexity: first precision, then coverage, then structure. Pipeline selection should be query-type-driven, not aggregate-score-optimised.

Metric Guide

How the versions were compared.

Faithfulness

Checks whether answer claims are supported by the retrieved evidence. This is the grounding metric. Peaks at V2 (0.843) — the benefit of cross-encoder reranking.

Answer relevancy

Measures whether the response actually addresses the user's question. Peaks at V4 (0.784) — query rewriting expands coverage of multi-part questions.

Context recall

Tests whether retrieval surfaced the evidence needed to answer the question. Peaks at V1 (0.592) — reranking and filtering can reduce recall by over-pruning.

Context precision

Measures how much of the retrieved context was useful rather than noisy. Peaks at V2 (0.741) — reranking lifts both precision and faithfulness together.

Numerical accuracy

Custom finance metric checking whether the answer includes the correct figures verbatim. Peaks at V3 (0.600) — BM25 lexical matching is essential for exact numbers.
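The metric's implementation is not shown in the report; one plausible minimal sketch is to extract figures with a regex and check each expected figure appears verbatim:

```python
import re

# Matches figures like 7%, 1,234.5, or $211,915 (illustrative pattern)
NUM = re.compile(r"\$?\d[\d,]*(?:\.\d+)?%?")

def numerical_accuracy(answer, expected_figures):
    """Fraction of expected figures found verbatim in the answer.
    Verbatim matching is the point: a rounded or paraphrased figure
    ($212B for $211,915M) counts as a miss."""
    found = set(NUM.findall(answer))
    hits = sum(1 for f in expected_figures if f in found)
    return hits / len(expected_figures) if expected_figures else 1.0
```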

Latency

Average end-to-end query time. Ranges from 3.46s (V1) to 10.52s (V6). Each additional retrieval stage adds measurable cost without guaranteed quality gains.

Limitations & Future Work

What the current system still struggles with.

The report identifies a clear next step: improve evidence structure before trying to push generation harder.

Current limitations

  • Fixed-size chunking can split tables and financial rows across boundaries.
  • No table-aware parsing makes dense numeric extraction unreliable in statement-heavy pages.
  • The benchmark is small at 20 questions, which limits statistical confidence.
  • V5 depends heavily on consistent metadata tagging at ingest time — brittle for cross-period queries.
  • The study is scoped to Microsoft only, so cross-company generalisation is still untested.
  • LLM outputs can still vary slightly across runs even with temperature 0 and a fixed seed.

Most useful improvements

  • Table-aware chunking to keep financial statement rows intact within chunks.
  • Semantic section-aware chunking for MD&A, risk factors, and financial statements.
  • Adaptive runtime routing so query type selects the best variant automatically.
  • A larger benchmark with 50+ questions and real analyst-style prompts to reduce synthetic bias.
  • Fine-tuned embeddings for fiscal periods, financial terminology, and segment names.
  • Retrieval-level metrics such as MRR and top-k hit rate for cleaner error diagnosis.

Guardrails

How the system reduces risk when answering financial questions.

Critical risk: Hallucination

Plausible but unsupported figures are the highest-severity failure in financial QA. V0 faithfulness of 0.098 quantifies this floor.

High risk: Stale or missed filings

Answers become outdated if retrieval misses the newest period or the index is not refreshed after EDGAR releases.

Medium risk: Numerical drift

The model may round, paraphrase, or simplify figures even when the correct chunk is present in context.

Scope risk: Out-of-domain prompts

The system must avoid answering unsupported questions outside the indexed Microsoft filings corpus.

Implemented guardrails

  • Insufficient-evidence prompting tells the generator to refuse unsupported claims instead of guessing.
  • Inline citation requirements make factual answers auditable against retrieved chunks via [Doc-N] references.
  • Scope filtering rejects non-Microsoft or out-of-topic prompts before retrieval begins.
  • Temperature 0 reduces variation in numeric wording and improves output consistency across runs.
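How the scope filter decides is not detailed in the report; a deliberately simple keyword heuristic is one way such pre-retrieval rejection could work, sketched here purely as an assumption:

```python
# Hypothetical scope vocabulary - the real system's criteria are not shown
SCOPE_TERMS = {"microsoft", "msft", "azure", "10-k", "10-q", "fiscal"}

def in_scope(question):
    """Crude pre-retrieval scope filter: accept only prompts that mention
    the indexed issuer or filing vocabulary. A real system would likely
    use an intent classifier instead of keywords."""
    words = set(question.lower().replace("?", " ").split())
    return bool(words & SCOPE_TERMS)
```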

Residual risks and next mitigations

  • Automate EDGAR ingestion so stale filings are indexed immediately upon release.
  • Add post-processing checks to verify quoted figures against source chunks verbatim.
  • Classify prompt intent and sanitise adversarial or ambiguous inputs before retrieval.
  • Re-embed indexes when the embedding model changes to prevent index drift over time.

Conclusion

No single pipeline wins. Query type should drive the design.

The report's central result is that RAG components provide selective, query-type-dependent benefits. Financial QA improves most when the system matches the pipeline to the task instead of forcing every query through one universal stack. The most consistent failure mode across all variants was not retrieval miss or generation error — it was evidence misalignment.

Faithfulness priority

Use V2 when grounding is the priority. It delivered the best faithfulness (0.843) per unit of added latency — only 1s more than the V1 baseline.

Cross-period baseline

Use hybrid retrieval (V3) as the minimum for temporal and comparative queries. Dense-only retrieval is too narrow — V1 achieves just 0.199 relevancy on comparative tasks vs V3's 0.726.

Hardest unresolved issue

60% of classified failures were evidence misalignment: systems retrieved correct documents but failed to organise them into a usable reasoning structure for the generator.

Routing recommendations

  • Factual queries → V1 or V2
  • Temporal queries → V3 (not V4: faithfulness 0.861 vs 0.561)
  • Comparative queries → V4 or V6
  • Multi-hop queries → V6
  • Fast single-period lookups → V5 (3.63s, faithfulness 0.804)
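The routing table above translates directly into a dispatch map; the classifier that labels the query type is the unsolved part, so this sketch assumes the type is already known (hypothetical function and labels):

```python
# Recommended variant per query type, from the report's routing table.
# Factual queries list two options; V2 is chosen here for its grounding.
ROUTES = {
    "factual": "V2",
    "temporal": "V3",       # not V4: faithfulness 0.861 vs 0.561
    "comparative": "V4",
    "multi_hop": "V6",
    "single_period": "V5",  # fastest grounded option at 3.63s
}

def route(query_type):
    """Map a classified query type to the recommended pipeline variant.
    Falls back to V2, the best-grounded variant, for unknown types."""
    return ROUTES.get(query_type, "V2")
```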

Final takeaway

FinSight's main contribution is not only the application itself, but the controlled framework for testing how retrieval, reranking, rewriting, filtering, and compression behave across different financial reasoning tasks. Component-level evaluation with query-type sensitivity surfaces more actionable insights than aggregate benchmarking alone.

Team

The project team behind FinSight.

  • Si Ken
  • Ashraf
  • Nicholas Ang