- Charts and graphs—revenue trends, system architecture diagrams
- Tables—financial statements, comparison matrices
- Images—product photos, screenshots, annotated figures
- Complex layouts—multi-column reports, slide decks, scanned documents
The pipeline is built on the following components:

- `clip-ViT-B-32` for page-level image embeddings (512-dimensional dense vectors).
- `actian-vectorai` for vector storage and semantic retrieval.
- `openai` GPT-4o vision API for answer generation from retrieved page images.
- `pdf2image` for converting PDF pages to images.
Architecture overview
The diagram below shows the three phases of the pipeline. During ingestion, each PDF page is rendered to a high-resolution image, embedded with CLIP, saved to disk, and stored as a vector in Actian VectorAI DB. During semantic retrieval, a user’s text query is encoded into the same CLIP vector space and compared against stored page vectors using cosine similarity, returning the top-K most relevant pages. Finally, during visual RAG answer generation, the retrieved page images are base64-encoded and sent to GPT-4o vision alongside the original query, producing a Markdown-formatted answer grounded in the actual page content.

Why visual document RAG
Standard text-based RAG pipelines lose critical information when documents contain visual content. This section explains where text extraction falls short and how the multivector approach addresses it.

The problem with text extraction
Standard RAG pipelines use libraries like PyPDF2 or pdfplumber to extract text. But consider a financial report PDF:

- Page 3 has a revenue chart — text extraction produces nothing useful.
- Page 7 has a comparison table — extraction loses row/column alignment.
- Page 12 has an architecture diagram — extraction ignores it entirely.
The multivector approach
Instead of extracting text, the pipeline:

- Render each page as a high-resolution image (200 DPI).
- Embed the image with CLIP — capturing visual layout, text, charts, and diagrams.
- Store the embedding in Actian VectorAI DB with page metadata.
- At query time, encode the text query with CLIP’s text encoder (same vector space).
- Retrieve the most visually relevant pages via cosine similarity.
- Send the page images to GPT-4o vision to generate an answer.
Environment setup
The pipeline depends on an Actian VectorAI DB instance, a Python environment, an OpenAI API key, and at least one PDF document. The sections below describe each requirement.

Actian VectorAI DB instance
The pipeline stores and retrieves page-level CLIP vectors through a VectorAI DB server.

- A running Actian VectorAI DB server accessible over gRPC (default port `50051`)
- The server URL — for local development this is typically `http://localhost:50051`
- If you do not have an instance running yet, follow the installation guide to set one up before continuing
Python environment
All dependencies are installed through `pip` into a standard Python environment.
- Python 3.9 or later
- `pip` for package installation
OpenAI account
GPT-4o generates answers from the retrieved page images during the final RAG step.

- An OpenAI API key with access to `gpt-4o` (required for Step 6 answer generation)
- Set the key as an environment variable before running the tutorial (covered below)
PDF documents
The tutorial references `annual_report.pdf` as a placeholder. Substitute any PDF you have available — the pipeline processes one or more files and renders every page as an image.
Install Python packages
Install the required dependencies. The OpenAI API key is read via `os.getenv("OPENAI_API_KEY")` in Step 6. Without it, the GPT-4o vision call will raise an authentication error.
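Assuming the package names listed below are correct on PyPI, the install step likely looks like the following (note that `pdf2image` also needs the poppler system package, e.g. `poppler-utils` on Debian-based systems):

```shell
# Install the Python dependencies used throughout the tutorial.
pip install actian-vectorai sentence-transformers pillow pdf2image openai

# Step 6 reads the OpenAI key from the environment.
export OPENAI_API_KEY="sk-..."
```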
These provide:
- `actian-vectorai` — Actian VectorAI Python SDK (async client, gRPC transport)
- `sentence-transformers` — CLIP ViT-B-32 for image and text embeddings
- `pillow` — Image processing
- `pdf2image` — Converts PDF pages to PIL images (requires poppler)
- `openai` — GPT-4o vision API for answer generation
Implementation
The following steps build the pipeline end-to-end, from importing dependencies to running a full RAG query against ingested documents.

Step 1: Import dependencies and configure
Load all libraries, set the VectorAI server address and collection name, and initialize the CLIP model so every subsequent step can reference them.

Why this step matters
Every component is configured upfront. The key settings are listed below.

- `SERVER` — The Actian VectorAI gRPC endpoint (default `localhost:50051`).
- `COLLECTION` — The collection name for document page vectors.
- `clip_model` — The CLIP ViT-B-32 model that embeds both page images and text queries into the same 512-dimensional space.
- `PAGE_IMAGES_DIR` — The local directory where rendered page images are saved for later vision-language model (VLM) input.
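A sketch of what the configuration block might look like. The constant names follow the list above; the values and the collection name are assumptions chosen for local development, and the CLIP model load is shown as a comment because it downloads model weights on first use:

```python
import os

SERVER = "localhost:50051"        # Actian VectorAI gRPC endpoint
COLLECTION = "document_pages"     # one vector per PDF page (assumed name)
PAGE_IMAGES_DIR = "page_images"   # rendered page images, reused by the VLM step

# Create the image directory once so later steps can write into it.
os.makedirs(PAGE_IMAGES_DIR, exist_ok=True)

# The CLIP model is loaded once and shared by every step, e.g.:
# from sentence_transformers import SentenceTransformer
# clip_model = SentenceTransformer("clip-ViT-B-32")   # 512-dim embeddings
```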
Expected output
Running the configuration block prints the following to confirm each component loaded successfully:

Step 2: Define embedding helpers
CLIP encodes both images and text into the same 512-dim vector space, so cosine similarity can compare a text query against page-image vectors. In practice, related text and visuals tend to sit closer together than unrelated pairs, but ranking is not guaranteed: results depend on query wording, the document set, and how strongly each page matches the query in CLIP’s representation.

Why this step matters
Each helper has a distinct role in the pipeline.

- `embed_image` — Converts a rendered page image into a CLIP vector for storage.
- `embed_text_clip` — Converts a user’s text query into the same CLIP space for retrieval.
- `pil_image_to_base64` — Prepares page images for the GPT-4o vision API, which accepts base64-encoded images.
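The three helpers could look like the sketch below. The CLIP functions take the model as a parameter so the snippet stays self-contained, and `normalize_embeddings=True` is an assumption that matches the cosine-distance collection configured in Step 3:

```python
import base64
import io

from PIL import Image


def embed_image(clip_model, img: Image.Image):
    """Embed a rendered page image into CLIP's 512-dim space."""
    return clip_model.encode(img, normalize_embeddings=True)


def embed_text_clip(clip_model, text: str):
    """Embed a text query into the same CLIP space as the page images."""
    return clip_model.encode(text, normalize_embeddings=True)


def pil_image_to_base64(img: Image.Image) -> str:
    """Base64-encode a PIL image as PNG for the GPT-4o vision API."""
    buf = io.BytesIO()
    img.save(buf, format="PNG")
    return base64.b64encode(buf.getvalue()).decode("ascii")
```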
Step 3: Initialize the VectorAI collection
Create the collection with 512-dim cosine distance and HNSW indexing. `collections.get_or_create` takes the same `vectors_config` and optional `hnsw_config` arguments as `collections.create` in the Create a collection guide, including `hnsw_config=HnswConfigDiff(m=..., ef_construct=...)`. Match the `actian-vectorai` version you install to the docs or SDK release you are using.
Why this step matters
The collection stores one vector per document page, configured with the following settings.

- `Distance.Cosine` — appropriate for normalized CLIP embeddings.
- `HnswConfigDiff(m=32, ef_construct=256)` — high-recall HNSW settings for accurate retrieval.
Expected output
If the collection already exists, then `get_or_create` is a no-op and prints the same confirmation:
Step 4: Ingest a PDF document
Ingestion runs end-to-end for each PDF: each page is rendered to an image, embedded with CLIP, saved to disk, and upserted into VectorAI.

Why this step matters
The ingestion pipeline performs four operations per page.

- Render — `convert_from_bytes(pdf_bytes, dpi=200)` produces high-resolution page images.
- Embed — CLIP converts each page image into a 512-dim vector.
- Save — The rendered image is saved to disk for later retrieval by the VLM.
- Store — `client.points.upsert()` inserts the vector with metadata payload.
Point IDs come from `page_point_id(filename, page_number)` — an MD5 hash of the filename and page number, truncated to a 64-bit integer. Using a deterministic hash means re-ingesting the same file always produces the same IDs, so the upsert is idempotent and there are no collisions from deletions, concurrent ingests, or re-runs. The `get_vector_count` call is no longer needed for ID generation.
The payload stores `source_file`, `page_number`, and `image_filename` so retrieved results can be traced back to their source.
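A sketch of the ID and payload helpers described above. The exact string joined before hashing is an assumption; what matters is that the same inputs always hash to the same 64-bit integer:

```python
import hashlib


def page_point_id(filename: str, page_number: int) -> int:
    """Deterministic 64-bit point ID: MD5 of the filename and page number,
    truncated to the first 16 hex digits (64 bits). Re-ingesting the same
    file always produces the same IDs, so upserts are idempotent."""
    digest = hashlib.md5(f"{filename}:{page_number}".encode()).hexdigest()
    return int(digest[:16], 16)


def page_payload(source_file: str, page_number: int, image_filename: str) -> dict:
    """Metadata stored with each vector so results trace back to their source."""
    return {
        "source_file": source_file,
        "page_number": page_number,
        "image_filename": image_filename,
    }
```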
Example usage
The following snippet reads a PDF from disk and passes it to `ingest_pdf`, which renders each page, embeds it, saves the image, and upserts the vector into the collection.
Expected output
The output confirms how many pages were ingested and the running total in the collection:

Step 5: Semantic search for document pages
Search for pages that are visually and semantically similar to a text query. Vector similarity search uses `AsyncVectorAIClient.points.search` with the query embedding as `vector`, a result cap as `limit`, and `with_payload=True` to return metadata. That matches the pattern and parameter table in Similarity search basics (Step 4).
Why this step matters
The text query is encoded using CLIP’s text encoder into the same 512-dim space as the page images. Actian VectorAI’s `points.search` ranks pages by cosine similarity to the query vector — the mathematically nearest neighbors in that space, which approximate relevance but are not a formal guarantee of correctness.
Because CLIP was trained on image-text pairs, a query like “quarterly revenue breakdown” will often place relevant pages—those with charts or tables that match the intent—higher in the similarity list than unrelated pages. Treat top-K results as a best-effort shortlist: you may still need to tune `top_k`, rephrase queries, or add filters for production accuracy.
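Conceptually, the search performs a cosine-similarity ranking over the stored page vectors. The following plain-Python illustration uses toy 3-dim vectors rather than real CLIP embeddings or the SDK call, purely to show the ranking mechanism:

```python
import math


def top_k(query_vec, pages, k=3):
    """Rank stored (page_id, vector) pairs by cosine similarity to the query."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb)

    scored = [(page_id, cos(query_vec, vec)) for page_id, vec in pages]
    return sorted(scored, key=lambda s: s[1], reverse=True)[:k]


# Toy 3-dim stand-ins for 512-dim CLIP page vectors.
pages = [
    ("page_3", [0.9, 0.1, 0.0]),   # revenue chart
    ("page_7", [0.7, 0.3, 0.1]),   # comparison table
    ("page_12", [0.0, 0.2, 0.9]),  # unrelated diagram
]
results = top_k([0.8, 0.2, 0.0], pages, k=2)
print([page_id for page_id, _ in results])  # → ['page_3', 'page_7']
```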
Example usage
The following snippet runs a semantic search for pages related to quarterly revenue and prints each result’s page number, source file, and similarity score.

Expected output
Each result shows the page number, source file, and cosine similarity score:

Step 6: Generate answers with GPT-4o vision
The retrieved page images are sent to OpenAI’s GPT-4o vision model along with the user’s question.

Why this step matters
At answer time, the workflow becomes a true visual RAG pipeline. GPT-4o can perform the following tasks.

- Read text from page images (no separate OCR pipeline required)
- Interpret charts and graphs
- Parse tables and extract specific values
- Interpret diagrams and architecture drawings
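The vision request wraps the question and the base64-encoded page images into a single multimodal message. A sketch of the payload construction follows; the prompt wording is an assumption, and `b64_pages` would come from `pil_image_to_base64` applied to the retrieved page images:

```python
def build_vision_messages(query: str, b64_pages: list) -> list:
    """Build a GPT-4o chat message: one text part plus one image part per page."""
    content = [{"type": "text", "text": f"Answer from the attached pages: {query}"}]
    for b64 in b64_pages:
        content.append({
            "type": "image_url",
            "image_url": {"url": f"data:image/png;base64,{b64}"},
        })
    return [{"role": "user", "content": content}]


messages = build_vision_messages("What was Q3 revenue?", ["<b64-of-page-3>"])
# The messages list is then passed to the OpenAI client, e.g.:
# client.chat.completions.create(model="gpt-4o", messages=messages)
```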
Step 7: End-to-end RAG pipeline
Connect search and answer generation into a single function.

Why this step matters

The RAG pipeline has three stages.

- Retrieve — Find the top-K most relevant pages via CLIP similarity search on Actian VectorAI.
- Load — Get the page image filenames from the search results payload.
- Generate — Send images + query to GPT-4o vision for answer generation.
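The three stages compose naturally into one function. The skeleton below injects each stage as a callable so the control flow is visible without the SDK or the OpenAI client; the function and field names are illustrative:

```python
def visual_rag_query(query, search_fn, load_image_fn, generate_fn, top_k=3):
    """Retrieve top-K pages, load their images, and generate an answer."""
    hits = search_fn(query, top_k)                               # 1. Retrieve
    images = [load_image_fn(h["image_filename"]) for h in hits]  # 2. Load
    answer = generate_fn(query, images)                          # 3. Generate
    return {"answer": answer, "sources": hits}


# Stub stages showing the data flow end-to-end.
result = visual_rag_query(
    "quarterly revenue",
    search_fn=lambda q, k: [{"image_filename": "report_p3.png", "score": 0.91}],
    load_image_fn=lambda name: f"<image:{name}>",
    generate_fn=lambda q, imgs: f"Answer based on {len(imgs)} page(s).",
)
print(result["answer"])  # → Answer based on 1 page(s).
```

In the real pipeline, `search_fn` would wrap `points.search`, `load_image_fn` would open the saved page image, and `generate_fn` would call GPT-4o vision.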
Example usage
The following snippet runs a complete RAG query, then prints a truncated preview of the generated answer alongside the source pages and their similarity scores.

Expected output
The `answer` field contains raw Markdown text returned by GPT-4o — it is not rendered here, so `**Revenue**` and list hyphens appear as literal characters rather than formatted output.
Step 8: Collection administration
Use the following operations to inspect the collection, list ingested documents, flush data to disk, or delete the collection entirely.

Expected output
Running the admin operations prints the total vector count, a sorted list of ingested document names, and a flush confirmation:

How the visual RAG pipeline differs from text RAG
The table below compares the two approaches across each stage of the pipeline. It covers everything from input processing to answer generation, helping you decide which approach fits your documents.

| Aspect | Traditional text RAG | Visual document RAG |
|---|---|---|
| Input processing | Extract text, chunk into passages | Render pages as images |
| Embedding model | Text embedder (such as `text-embedding-3-small`) | Vision model (CLIP ViT-B-32) |
| What gets embedded | Text chunks (500-1000 tokens) | Full page images (200 DPI) |
| Charts and tables | Lost during extraction | Preserved as visual content |
| Retrieval unit | Text chunk | Document page |
| Answer generation | LLM reads retrieved text | VLM reads retrieved page images |
| OCR dependency | Requires text extraction | No OCR needed — VLM reads images directly |
Actian VectorAI features used
The following table summarises the SDK methods used in this article, mapping each feature to its API call and its role in the pipeline.

| Feature | API | Purpose |
|---|---|---|
| Collection creation | client.collections.get_or_create() | Create 512-dim cosine vector space with HNSW |
| Batch point upsert | client.points.upsert() | Store CLIP page vectors with metadata payload |
| Semantic search | client.points.search() | Find visually similar pages by cosine similarity |
| Point scroll | client.points.scroll() | Page through all points for document listing |
| Vector count | client.vde.get_vector_count() | Track total indexed pages |
| Flush | client.vde.flush() | Persist vectors to disk after ingestion |
| Delete collection | client.collections.delete() | Clean up all data |
The ColPali inspiration
This system is inspired by the ColPali architecture, which demonstrated that:

- Treating document pages as images avoids lossy text extraction
- Vision encoders capture layout, typography, and visual elements
- Late interaction between query tokens and page patch embeddings improves retrieval
This pipeline simplifies ColPali’s design in three ways:

- CLIP instead of PaliGemma for embedding — simpler to deploy and widely available
- Single vector per page instead of multi-vector patch embeddings — compatible with standard vector databases without specialised infrastructure
- GPT-4o vision instead of a specialised reader model for answer generation — no custom training required
When to use visual document RAG
Visual document RAG is not the right fit for every use case. The sections below outline where it excels and where text-based RAG remains the better option.

Best suited for
This approach works best for documents where critical information is encoded visually rather than as plain text.

- Financial reports with charts and tables
- Slide decks and presentations
- Technical manuals with diagrams
- Scanned documents and forms
- Multi-column layouts and complex formatting
Consider text RAG instead when
Text RAG is the better choice when documents are text-dominant and token-level precision matters more than visual fidelity.

- Documents are purely text-based (novels, articles)
- Token-level precision matters more than page-level retrieval
- You need to process thousands of pages per query (VLM calls add cost at scale)
Next steps
Now that you have built a full visual RAG pipeline, explore these topics to extend and improve your system:

- Multimodal system patterns — combine vector similarity with structured constraints
- Similarity search basics — learn the core retrieval workflow
- Filtering with predicates — add `must`, `should`, and `must_not` conditions
- Retrieval quality — measure and improve search result accuracy