The problem
SEC filings are information-rich but hard to navigate. Financial questions often require exact figures, fiscal-period awareness, and cross-document comparison.
Vibe-coded with love
IS469 Final Project
FinSight is a domain-specific retrieval-augmented generation system built to answer questions over Microsoft 10-K and 10-Q filings with citations, measurable grounding, and a controlled comparison of seven pipeline designs.
Presentation Flow
SEC filings are information-rich but hard to navigate. Financial questions often require exact figures, fiscal-period awareness, and cross-document comparison.
FinSight ingests Microsoft filings, chunks them into searchable evidence, indexes them with dense and sparse retrieval, and generates citation-backed answers.
Seven variants were evaluated to isolate the impact of retrieval, reranking, hybrid search, query rewriting, metadata filtering, and context compression.
No single pipeline wins every metric. Simpler versions ground answers better, while broader pipelines answer complex questions more completely at higher latency.
System Flow
Parse 10-K and 10-Q filings into structured text with fiscal metadata.
Split filings into evidence units sized for retrieval quality and citation traceability.
Build a dense ChromaDB index and a BM25 sparse index for complementary search signals.
Use dense, sparse, hybrid, filtered, or rewritten-query retrieval depending on the pipeline.
Optionally rerank or compress context so the generator sees cleaner, more targeted evidence.
Produce grounded answers with citations, then evaluate them for faithfulness, relevancy, and numerical accuracy.
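The dense-plus-sparse step above can be sketched with reciprocal rank fusion (RRF), one common way to combine the two rankings without calibrating their scores. The chunk ids and the fusion constant `k` below are illustrative assumptions, not FinSight's actual fusion method.

```python
# Hybrid retrieval sketch: fuse a dense (semantic) ranking and a sparse
# (BM25-style lexical) ranking with reciprocal rank fusion. Assumes the
# two rankings come from upstream indexes (e.g. ChromaDB and BM25).

def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Score each chunk id by summed 1/(k + rank + 1) across rankings."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

dense = ["c3", "c1", "c7"]   # hypothetical semantic ranking
sparse = ["c7", "c3", "c9"]  # hypothetical lexical ranking
fused = rrf_fuse([dense, sparse])  # chunks ranked by fused score
```

Because RRF works on ranks rather than raw scores, it needs no normalisation between the dense and sparse indexes, which is why it is a common default for hybrid search.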
Pipeline Variants
V0: Generate only
No retrieval. The model answers from parametric memory, which makes it fluent but weakly grounded.
Purpose: establish the hallucination floor.
Strength: fast and simple.
Weakness: unsupported claims and no evidence traceability.
Evaluation
Reranking was the most efficient way to improve grounding (+0.105 over V1 at only +1s latency).
Query rewriting improved coverage and ambiguity handling for comparative queries.
Hybrid retrieval helped recover exact figures and finance terms via lexical matching.
Dense retrieval alone was already strong on simple factual questions at minimal latency.
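As a concrete illustration of the query-rewriting finding, a comparative question naming two fiscal years can be split into one sub-query per period so each can be retrieved independently. This regex-based splitter is a hedged sketch, not the report's actual rewriter.

```python
import re

def rewrite_comparative(query: str) -> list[str]:
    """Split a two-period comparison into one retrieval query per period."""
    periods = re.findall(r"(?:FY)?(20\d{2})", query)
    if len(periods) < 2:
        return [query]  # not comparative: pass the query through unchanged
    # Drop comparison words and the period mentions, then re-attach periods.
    base = re.sub(r"\b(compare|versus|vs)\b", " ", query, flags=re.I)
    base = re.sub(r"(?:FY)?20\d{2}", " ", base)
    base = re.sub(r"\s+", " ", base).strip(" ?")
    return [f"{base} FY{p}" for p in periods]

subqueries = rewrite_comparative("Compare Microsoft cloud revenue in FY2023 vs FY2024")
```

Each sub-query then retrieves evidence for one period, and the generator compares the two evidence sets rather than hoping a single retrieval covers both.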
Faithfulness
Higher is better. Faithfulness measures whether answer claims are supported by retrieved evidence.
For simple factual queries, dense retrieval (V1) was usually enough, achieving 0.988 relevancy; more complex retrieval stages added little value for single-chunk answers.
For comparative queries, broader retrieval mattered more because the system had to align multiple periods or comparison targets. V1's 0.199 comparative relevancy vs V3's 0.726 shows the gap.
The bottleneck shifts with task complexity: first precision, then coverage, then structure. Pipeline selection should be query-type-driven, not aggregate-score-optimised.
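The query-type-driven selection argued for above could look like the router below. The keyword heuristics and the variant mapping (V3 for comparative or temporal queries, V2 otherwise) are assumptions drawn from the report's findings, not production logic.

```python
import re

def select_pipeline(query: str) -> str:
    """Route a query to a pipeline variant based on coarse query-type cues."""
    q = query.lower()
    comparative = any(w in q for w in ("compare", " vs ", "versus", "change from"))
    temporal = len(re.findall(r"20\d{2}", q)) >= 2  # two or more years mentioned
    if comparative or temporal:
        return "V3"  # hybrid retrieval: the minimum for comparative/temporal
    return "V2"      # reranked retrieval: best faithfulness per unit latency
```

A production router would likely use a classifier rather than keywords, but even this cheap gate avoids paying V6-level latency on simple factual lookups.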
Metric Guide
Faithfulness: checks whether answer claims are supported by the retrieved evidence. This is the grounding metric. Peaks at V2 (0.843) — the benefit of cross-encoder reranking.
Answer relevancy: measures whether the response actually addresses the user's question. Peaks at V4 (0.784) — query rewriting expands coverage of multi-part questions.
Context recall: tests whether retrieval surfaced the evidence needed to answer the question. Peaks at V1 (0.592) — reranking and filtering can reduce recall by over-pruning.
Context precision: measures how much of the retrieved context was useful rather than noisy. Peaks at V2 (0.741) — reranking lifts both precision and faithfulness together.
Numerical accuracy: a custom finance metric checking whether the answer includes the correct figures verbatim. Peaks at V3 (0.600) — BM25 lexical matching is essential for exact numbers.
Latency: average end-to-end query time, ranging from 3.46s (V1) to 10.52s (V6). Each additional retrieval stage adds measurable cost without guaranteed quality gains.
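A verbatim numerical check like the one described above can be sketched as a containment test. The scoring rule here (fraction of expected figures found verbatim in the answer) is an assumption in the spirit of the metric, not the report's exact implementation.

```python
def numerical_accuracy(answer: str, expected_figures: list[str]) -> float:
    """Fraction of gold figures that appear verbatim in the answer."""
    if not expected_figures:
        return 1.0  # nothing numeric to check
    hits = sum(1 for figure in expected_figures if figure in answer)
    return hits / len(expected_figures)

score = numerical_accuracy(
    "Revenue was $211.9 billion, up 7% year over year.",
    ["$211.9 billion", "7%"],
)
```

The strictness is deliberate: an answer that rounds or paraphrases a figure scores zero for that figure, which surfaces exactly the rounding failure mode noted under Guardrails.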
Limitations & Future Work
The report identifies a clear next step: improve evidence structure before trying to push generation harder.
Guardrails
Plausible but unsupported figures are the highest-severity failure in financial QA. V0 faithfulness of 0.098 quantifies this floor.
Answers become outdated if retrieval misses the newest period or the index is not refreshed after EDGAR releases.
The model may round, paraphrase, or simplify figures even when the correct chunk is present in context.
The system must decline questions that fall outside the indexed Microsoft filings corpus rather than answer without supporting evidence.
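One way to enforce the last guardrail is a retrieval-confidence gate: decline whenever no retrieved chunk is similar enough to the query. The threshold value, the tuple shape, and the refusal message below are all illustrative assumptions.

```python
# Scope guard sketch: refuse out-of-corpus questions instead of letting
# the generator fall back to parametric memory.

REFUSAL = "This question is outside the indexed Microsoft filings, so it cannot be answered with evidence."

def guard_answer(retrieved: list[tuple[str, float]], threshold: float = 0.35):
    """retrieved: (chunk_text, similarity) pairs sorted best-first."""
    if not retrieved or retrieved[0][1] < threshold:
        return None, REFUSAL  # refuse rather than answer unsupported
    return retrieved[0][0], None  # pass the top chunk on to generation

chunk, refusal = guard_answer([("Revenue was $211.9 billion in FY2023.", 0.82)])
```

The right threshold depends on the embedding model and should be tuned on held-out out-of-scope queries rather than set by hand.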
Conclusion
The report's central result is that RAG components provide selective, query-type-dependent benefits. Financial QA improves most when the system matches the pipeline to the task instead of forcing every query through one universal stack. The most consistent failure mode across all variants was not retrieval miss or generation error — it was evidence misalignment.
Use V2 when grounding is the priority. It delivered the best faithfulness (0.843) per unit of added latency — only 1s more than the V1 baseline.
Use hybrid retrieval (V3) as the minimum for temporal and comparative queries. Dense-only retrieval is too narrow — V1 achieves just 0.199 relevancy on comparative tasks vs V3's 0.726.
60% of classified failures were evidence misalignment: systems retrieved correct documents but failed to organise them into a usable reasoning structure for the generator.
FinSight's main contribution is not only the application itself, but the controlled framework for testing how retrieval, reranking, rewriting, filtering, and compression behave across different financial reasoning tasks. Component-level evaluation with query-type sensitivity surfaces more actionable insights than aggregate benchmarking alone.
Team