Prerequisites
Before starting, ensure the following are in place:

- A running VectorAI DB instance (see the installation guide).
- Python 3.9 or later.
- An OpenAI API key set as the environment variable OPENAI_API_KEY.
- The following packages installed.
System architecture
The following diagram shows how a financial document flows from raw PDF through ingestion, vector storage, and into AI-powered analysis:

Concepts
Financial document analysis introduces challenges that go beyond standard semantic search. The following concepts are critical when building systems that operate on real-world financial data:

Context-aware document chunking
Financial documents are highly structured, and meaning is often tied to sections rather than raw text proximity. Instead of fixed-size chunks, segment documents along logical boundaries such as Management Discussion and Analysis, Risk Factors, or Financial Statements.
Preserving section-level metadata enables more precise retrieval and allows downstream systems to distinguish between forward-looking statements and historical reporting.
Financial-domain embeddings
Financial language is dense, nuanced, and often indirect. General-purpose embeddings may miss subtle relationships between phrases like “margin compression” and “cost pressures.”
Using domain-adapted models—or validating that your embedding pipeline captures financial semantics—is essential for high-quality retrieval in this space.
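One lightweight way to validate this is to spot-check similarity scores for phrase pairs like the ones above. The sketch below uses plain cosine similarity; the commented-out check assumes the embed_texts() helper defined in Step 4 and a live API key, so it is illustrative rather than runnable as-is:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Sanity check (requires API access, shown commented out): related financial
# phrases should score clearly higher than an unrelated control phrase.
# vecs = embed_texts(["margin compression", "cost pressures", "office furniture"])
# assert cosine(vecs[0], vecs[1]) > cosine(vecs[0], vecs[2])
```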
Time-aware retrieval and comparison
Financial insights are rarely static—they evolve across reporting periods. A meaningful analysis system must account for time by associating documents with structured temporal metadata such as quarter or fiscal year.
This enables queries that go beyond retrieval, supporting comparisons like quarter-over-quarter performance or shifts in company outlook.
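As a minimal illustration, once each stored chunk carries a reporting period in its payload, a quarter-over-quarter comparison reduces to two period-scoped retrievals. The payload shape below is illustrative, not the database's confirmed schema:

```python
def filter_by_period(points, company, periods):
    """Select stored chunks for one company across the given reporting periods."""
    return [
        p for p in points
        if p["payload"]["company"] == company
        and p["payload"]["period"] in periods
    ]

# Example payloads as they might come back from the vector store:
points = [
    {"payload": {"company": "Apple Inc", "period": "Q2 2024", "text": "..."}},
    {"payload": {"company": "Apple Inc", "period": "Q3 2024", "text": "..."}},
    {"payload": {"company": "Microsoft", "period": "Q3 2024", "text": "..."}},
]

# Quarter-over-quarter comparison: retrieve each period separately.
previous = filter_by_period(points, "Apple Inc", {"Q2 2024"})
current = filter_by_period(points, "Apple Inc", {"Q3 2024"})
```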
Implementation
The pipeline covers everything from ingesting raw PDFs to running AI-driven analysis. Each step below builds on the previous one: set up the collection, parse and chunk documents, generate embeddings, index them, build the analyzer, and run the full pipeline.

Step 1: Set up document processing
The following code imports the required libraries, initializes the OpenAI client, and defines create_financial_collection(), which connects to VectorAI DB and creates a financial_docs collection with 1536-dimensional cosine vectors. Running asyncio.run(create_financial_collection()) at the bottom executes the setup immediately:
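A minimal sketch of this step. The VectorAI DB client import, constructor arguments, and method names below are assumptions for illustration (check the installation guide for the real SDK), and the final asyncio.run() call is left commented so the module loads without a live instance:

```python
import asyncio

# NOTE: the VectorAI DB client used here is hypothetical -- substitute the
# import, constructor arguments, and method names from the actual SDK.

async def create_financial_collection():
    """Create a 'financial_docs' collection with 1536-dimensional cosine vectors."""
    from vectorai_db import VectorAIClient  # hypothetical import

    client = VectorAIClient(url="http://localhost:8000")  # assumed local instance
    await client.create_collection(
        name="financial_docs",
        dimension=1536,      # matches OpenAI's text-embedding-3-small
        distance="cosine",
    )

# Run once to provision the collection (requires a running instance):
# asyncio.run(create_financial_collection())
```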
Step 2: Implement document parsing
The following code defines three functions. parse_financial_pdf() reads all pages from a PDF and returns the full text, document metadata, and a list of sections. extract_document_metadata() uses heuristics to detect the company name and reporting period from the first page. identify_sections() scans the full text for known section headers, sorts them by their position in the document, and returns each section as a dictionary with its title and content:
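A sketch of these three functions. The specific heuristics (company-suffix matching, the period regex, and the header list) are illustrative choices, and pypdf is assumed as the PDF library; its import is deferred so the text helpers work without it installed:

```python
import re

# Section headers named in the Concepts section above.
SECTION_HEADERS = [
    "Management Discussion and Analysis",
    "Risk Factors",
    "Financial Statements",
]

def extract_document_metadata(first_page_text):
    """Heuristically pull the company name and reporting period from page one."""
    company = None
    for line in first_page_text.splitlines():
        line = line.strip()
        if line.endswith(("Inc", "Inc.", "Corp", "Corp.", "Corporation", "Ltd.")):
            company = line
            break
    period = re.search(r"(Q[1-4]\s+\d{4}|fiscal year \d{4}|FY\s?\d{4})",
                       first_page_text, re.IGNORECASE)
    return {"company": company, "period": period.group(1) if period else None}

def identify_sections(full_text):
    """Locate known headers, sort by position, and slice the text between them."""
    hits = sorted(
        (m.start(), header)
        for header in SECTION_HEADERS
        for m in re.finditer(re.escape(header), full_text)
    )
    sections = []
    for i, (start, title) in enumerate(hits):
        end = hits[i + 1][0] if i + 1 < len(hits) else len(full_text)
        sections.append({"title": title, "content": full_text[start:end].strip()})
    return sections

def parse_financial_pdf(path):
    """Read all pages and return (full_text, metadata, sections)."""
    from pypdf import PdfReader  # deferred import; pypdf is an assumed dependency
    reader = PdfReader(path)
    pages = [page.extract_text() or "" for page in reader.pages]
    full_text = "\n".join(pages)
    return full_text, extract_document_metadata(pages[0]), identify_sections(full_text)
```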
Step 3: Implement chunking
The following code defines chunk_financial_document(), which iterates over each section and splits its text into overlapping chunks of up to 1000 characters. Each chunk boundary is aligned to the nearest sentence ending to avoid cutting mid-sentence. A guard prevents an infinite loop when the remaining text is shorter than the overlap window:
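A sketch of the chunker described above. Searching for a sentence boundary only in the second half of the window is an implementation choice of this sketch, to avoid producing tiny degenerate chunks:

```python
def chunk_financial_document(sections, max_chars=1000, overlap=100):
    """Split each section into overlapping, sentence-aligned chunks."""
    chunks = []
    for section in sections:
        text = section["content"]
        start = 0
        while start < len(text):
            end = min(start + max_chars, len(text))
            if end < len(text):
                # Align to the nearest sentence ending, searching only the
                # back half of the window so chunks stay reasonably sized.
                boundary = text.rfind(". ", start + max_chars // 2, end)
                if boundary != -1:
                    end = boundary + 1
            chunks.append({
                "section": section["title"],
                "text": text[start:end].strip(),
            })
            if end >= len(text):
                break  # the remaining text fit in this chunk
            # Guard: always advance by at least one character, even when the
            # remaining text is shorter than the overlap window.
            start = max(end - overlap, start + 1)
    return chunks
```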
Step 4: Index documents
embed_texts() calls the OpenAI embeddings API synchronously and blocks until the response returns. Keep this in mind when integrating into larger async pipelines.
The following code defines embed_texts(), which sends a batch of text strings to OpenAI and returns a list of embedding vectors. index_financial_document() orchestrates the full ingestion sequence: it parses the PDF, chunks the result, generates embeddings for every chunk, and upserts all vectors into the financial_docs collection. It returns the number of chunks indexed, or 0 if no text was extracted:
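A sketch of both functions, assuming the parse_financial_pdf() and chunk_financial_document() helpers from Steps 2 and 3 are in scope. The OpenAI calls use the real SDK surface; the db_client.upsert() call and its payload shape are assumptions about the VectorAI DB client:

```python
def embed_texts(texts, model="text-embedding-3-small"):
    """Embed a batch of strings with OpenAI. Synchronous: blocks until the
    response returns (keep this in mind inside larger async pipelines)."""
    from openai import OpenAI  # deferred so the module loads without the SDK
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    response = client.embeddings.create(model=model, input=texts)
    return [item.embedding for item in response.data]

async def index_financial_document(db_client, path, doc_type="10-K"):
    """Parse, chunk, embed, and upsert one PDF. Returns the chunk count.

    Assumes parse_financial_pdf() and chunk_financial_document() from
    Steps 2-3 are in scope; the upsert signature below is hypothetical.
    """
    full_text, metadata, sections = parse_financial_pdf(path)
    if not full_text.strip():
        return 0  # nothing extractable from this PDF
    chunks = chunk_financial_document(sections)
    vectors = embed_texts([c["text"] for c in chunks])
    await db_client.upsert(
        collection="financial_docs",
        points=[
            {"vector": vec,
             "payload": {**metadata, "doc_type": doc_type,
                         "section": chunk["section"], "text": chunk["text"]}}
            for vec, chunk in zip(vectors, chunks)
        ],
    )
    return len(chunks)
```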
Step 5: Build the analysis system
The following code defines the FinancialAnalyzer class, which provides four methods. search() embeds a natural-language query and retrieves the most relevant chunks from the collection, with optional filters for company, document type, and section. analyze_topic() retrieves relevant chunks and passes them to GPT-4o to produce a cited analysis. compare_companies() retrieves chunks per company and asks GPT-4o for a structured comparison. extract_metrics() queries specific metrics per company and uses GPT-4o Mini to extract values from the retrieved text.
Note that each chunk passed to the LLM is truncated to 500 characters. Precise figures may require cross-referencing the full source document:
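A condensed sketch of the first two methods, assuming the embed_texts() helper from Step 4 is in scope. The db_client.search() signature is an assumption about the VectorAI DB client; compare_companies() and extract_metrics() follow the same retrieve-then-prompt pattern (with gpt-4o-mini for extraction) and are omitted here for brevity:

```python
class FinancialAnalyzer:
    """Semantic search and LLM analysis over the financial_docs collection."""

    def __init__(self, db_client):
        self.db = db_client

    async def search(self, query, company=None, doc_type=None, section=None, limit=5):
        """Embed the query and retrieve the most relevant chunks, optionally filtered."""
        vector = embed_texts([query])[0]  # embed_texts() from Step 4
        filters = {k: v for k, v in [("company", company),
                                     ("doc_type", doc_type),
                                     ("section", section)] if v is not None}
        # Hypothetical search signature -- adjust to the real VectorAI DB SDK.
        return await self.db.search(collection="financial_docs",
                                    vector=vector, filters=filters, limit=limit)

    async def analyze_topic(self, query, **filters):
        """Retrieve relevant chunks and have GPT-4o produce a cited analysis."""
        from openai import OpenAI  # deferred so the class loads without the SDK
        hits = await self.search(query, **filters)
        context = "\n\n".join(
            # Each chunk is truncated to 500 characters before prompting.
            f"[{i + 1}] ({h['payload']['section']}) {h['payload']['text'][:500]}"
            for i, h in enumerate(hits)
        )
        response = OpenAI().chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system",
                 "content": "Answer using only the numbered excerpts; cite them as [n]."},
                {"role": "user", "content": f"{query}\n\nExcerpts:\n{context}"},
            ],
        )
        return response.choices[0].message.content
```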
Step 6: Usage example
The following code indexes three earnings PDFs as 10-K filings by calling index_financial_document() for each file. It then instantiates FinancialAnalyzer and runs three queries: a topic analysis of Apple's revenue growth drivers filtered to the company "Apple Inc", a structured comparison of Microsoft and Google on cloud computing revenue, and a metric extraction for Apple targeting Total Revenue, Gross Margin, and R&D Expenses from the Financial Statements section. The output reports the number of chunks indexed per file, a cited narrative analysis of Apple's revenue trends sourced from the top-ranked document chunks, a side-by-side cloud revenue comparison between Microsoft Azure and Google Cloud, and the three extracted metric values with their associated reporting periods:
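A sketch of the driver described above, assuming the helpers and class from Steps 4-5 are in scope. The PDF filenames and db_client are placeholders, the compare_companies() and extract_metrics() call signatures are assumptions echoing Step 5's descriptions, and the final asyncio.run() call is left commented because it needs a live database and an API key:

```python
import asyncio

async def main():
    db_client = ...  # your connected VectorAI DB client (see Step 1; API assumed)

    # Index three earnings PDFs as 10-K filings (filenames are placeholders).
    for path in ["apple_10k.pdf", "microsoft_10k.pdf", "google_10k.pdf"]:
        count = await index_financial_document(db_client, path, doc_type="10-K")
        print(f"{path}: {count} chunks indexed")

    analyzer = FinancialAnalyzer(db_client)

    # Topic analysis filtered to one company.
    print(await analyzer.analyze_topic(
        "What is driving revenue growth?", company="Apple Inc"))

    # Structured cross-company comparison.
    print(await analyzer.compare_companies(
        ["Microsoft", "Google"], topic="cloud computing revenue"))

    # Targeted metric extraction from the Financial Statements section.
    print(await analyzer.extract_metrics(
        "Apple Inc",
        ["Total Revenue", "Gross Margin", "R&D Expenses"],
        section="Financial Statements"))

# asyncio.run(main())  # requires a running VectorAI DB instance and OPENAI_API_KEY
```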
Expected output
Next steps
The following cards link to related articles and tutorials that extend the concepts covered in this guide:

RAG fundamentals
Retrieval-augmented generation
Vector databases
Vector database fundamentals
Predicate filters
Advanced metadata filtering
Similarity search
Search patterns and techniques