Keep SERVER aligned with your environment as you follow along.
Architecture overview
The diagram below shows how documents flow through model selection, embedding, and storage in Actian VectorAI DB. Each model produces vectors of a different size, stored in separate collections so you can compare retrieval quality across configurations.
Environment setup
Run the following command to install the two packages this tutorial depends on.
What this installs
Both packages are required: one communicates with the database, and the other loads and runs embedding models on your machine. The list below maps each dependency to the role it plays in later steps.
- actian-vectorai — Official Python SDK for Actian VectorAI DB; provides async/sync clients, Filter DSL, and gRPC transport.
- sentence-transformers — Framework for loading and running open-source embedding models; downloads and caches models from Hugging Face.
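Assuming pip and the two package names listed above, the install command would be:

```shell
pip install actian-vectorai sentence-transformers
```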
Step 1: Understand the model landscape
Before writing any code, review the table below to understand how the available models differ in dimension count, speed, and quality. The model you choose determines the shape of every vector stored in the database.

| Model | Dimensions | Speed | Quality | Best for |
|---|---|---|---|---|
| sentence-transformers/all-MiniLM-L6-v2 | 384 | Very fast | Good | Prototyping, low-latency apps |
| sentence-transformers/all-MiniLM-L12-v2 | 384 | Fast | Better | Production with speed constraints |
| sentence-transformers/all-mpnet-base-v2 | 768 | Moderate | High | General production use |
| sentence-transformers/multi-qa-mpnet-base-dot-v1 | 768 | Moderate | High (QA) | Question-answering systems |
| sentence-transformers/all-distilroberta-v1 | 768 | Moderate | High | Diverse text types |
| intfloat/e5-large-v2 | 1024 | Slow | Very high | Maximum quality, offline indexing |
| BAAI/bge-large-en-v1.5 | 1024 | Slow | Very high | Benchmarks, academic use |
| sentence-transformers/clip-ViT-B-32 | 512 | Moderate | High (multi) | Text + image multimodal |
Key trade-offs
Keep the following trade-offs in mind before choosing a model.
- More dimensions means more storage and slower search, but better semantic resolution.
- Fewer dimensions means less RAM and faster search, but may lose subtle meaning.
- Model architecture matters more than dimension count — a well-trained 384-dim model can outperform a poorly trained 768-dim one.
Step 2: Import dependencies and configure
The block below imports every module used across all steps of this tutorial and sets the server address. Run it once at the top of your script or notebook. If the import succeeds and the server address prints, your environment is ready.
Expected output
This block imports all SDK classes and utility modules needed throughout the tutorial — including the async client, distance enums, point structures, vector parameters, quantization types, and search parameter models — and sets SERVER to the local gRPC address. The final print statement confirms that the configuration loaded without errors and that the server address is set correctly.
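A minimal sketch of that configuration block. The module path (`actian_vectorai`), the client class name, and the default gRPC port are assumptions; adjust all three to match your installed SDK and server.

```python
# Hypothetical import layout; verify module paths and class names
# against your installed SDK version.
try:
    from actian_vectorai import VectorAIClient  # assumed client class name
    from actian_vectorai.models import (        # assumed module path
        Distance, VectorParams, PointStruct,
        QuantizationConfig, ScalarQuantization, SearchParams,
    )
except ImportError:
    print("actian-vectorai not installed; run the environment setup step first.")

SERVER = "localhost:50051"  # assumed local gRPC address; match your server
print(f"Configuration loaded. Server: {SERVER}")
```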
Step 3: Load multiple models and compare embedding output
The code below loads three models — small, medium, and large — and encodes the same sample sentence with each one. Running it prints each model’s load time, dimension count, and the first five values of the resulting vector, confirming that each model produces a vector of a different size.
Expected output
This block iterates over the three model definitions — MiniLM-L6 (384 dimensions), MPNet-base (768 dimensions), and E5-large-v2 (1024 dimensions) — loads each one from the Hugging Face cache via SentenceTransformer, and records the load time. It then encodes the same sample sentence with every loaded model and prints each model’s actual output dimension and the first five vector values to confirm that the models are producing embeddings of the expected shape and are ready for ingestion.
Load times and vector values vary by hardware, library version, and model revision.
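The load-and-probe loop can be sketched as follows. Model names and dimensions come from the Step 1 table; the sample sentence is illustrative, and the probe loop is left commented out because it downloads model weights on first run.

```python
from time import perf_counter

# Small / medium / large lineup from the Step 1 table.
MODELS = [
    ("sentence-transformers/all-MiniLM-L6-v2", 384),
    ("sentence-transformers/all-mpnet-base-v2", 768),
    ("intfloat/e5-large-v2", 1024),
]
SAMPLE = "Vector search finds the nearest neighbours of an embedding."

def probe(name: str, expected_dim: int) -> None:
    """Load one model, time the load, and encode the shared sample sentence."""
    from sentence_transformers import SentenceTransformer  # deferred heavy import
    t0 = perf_counter()
    model = SentenceTransformer(name)
    load_s = perf_counter() - t0
    vec = model.encode(SAMPLE)
    print(f"{name}: load {load_s:.1f}s, dim {len(vec)} "
          f"(expected {expected_dim}), first 5: {vec[:5]}")

# for name, dim in MODELS:   # uncomment to run; downloads models on first use
#     probe(name, dim)
```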
Step 4: Measure embedding speed
The code below encodes a batch of 100 texts with each loaded model and prints total encoding time and throughput in texts per second. Running it shows how much slower larger models are relative to smaller ones, which directly affects ingestion time and real-time query latency.
Expected output
This block constructs a 100-text corpus by repeating five representative sentences twenty times, then passes the full batch to each loaded model’s encode method. It measures the wall-clock time for each encoding pass and computes throughput in texts per second. The output shows the absolute encoding time and throughput for each model at its native dimension count, making the latency cost of moving from MiniLM to E5-large directly visible.
Throughput figures depend on your hardware and whether a GPU is available. Relative ordering — MiniLM fastest, E5-large slowest — is consistent across environments.
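The benchmark described above can be sketched like this. The five sentences are illustrative stand-ins; the timing helper works with any object exposing an encode method, such as a loaded SentenceTransformer.

```python
from time import perf_counter

# Five representative sentences repeated twenty times -> 100-text corpus.
SENTENCES = [
    "HNSW indexes trade memory for faster approximate search.",
    "Scalar quantization compresses vectors to int8.",
    "Payload filters narrow a search to matching metadata.",
    "Cosine distance compares the angle between embeddings.",
    "Batch uploads avoid gRPC message size limits.",
]
corpus = SENTENCES * 20

def throughput(model, texts):
    """Return (seconds, texts/sec) for one batch encode pass."""
    t0 = perf_counter()
    model.encode(texts)
    elapsed = perf_counter() - t0
    return elapsed, len(texts) / elapsed

# elapsed, tps = throughput(loaded_model, corpus)  # run per loaded model
```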
Step 5: Match distance metrics to models
Different models are trained with different objectives, and using the wrong distance metric silently degrades retrieval quality. The code below prints the correct metric for each model in the tutorial so you can verify your collection configuration matches the model.

| Distance | When to use | Models trained with it |
|---|---|---|
| Distance.Cosine | Most general-purpose models, where outputs are normalized or benefit from angular comparison. | MiniLM, MPNet, E5, BGE |
| Distance.Dot | Models trained with dot-product loss, where outputs are not normalized and magnitude matters. | multi-qa-mpnet-base-dot-v1 |
| Distance.Euclid | When absolute distance matters; rare for text, but common for structured or tabular embeddings. | Custom models |
| Distance.Manhattan | L1 distance, more robust to outliers in individual dimensions. | Specialized pipelines |
Check the model card: if it says “cosine similarity”, use Distance.Cosine. If it says “dot product”, use Distance.Dot. When in doubt, use Distance.Cosine. The mapping for multi-qa-mpnet-base-dot-v1 to Distance.Dot reflects its training objective; verify against your VectorAI DB version’s scoring semantics before deploying.
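The model-to-metric mapping from the table can be expressed as a plain lookup; strings stand in for the SDK's Distance enum members so the sketch runs without the SDK installed.

```python
# Model -> distance metric, per the table above. Strings stand in for
# the SDK's Distance enum members (Distance.Cosine, Distance.Dot, ...).
MODEL_DISTANCE = {
    "sentence-transformers/all-MiniLM-L6-v2": "Cosine",
    "sentence-transformers/all-MiniLM-L12-v2": "Cosine",
    "sentence-transformers/all-mpnet-base-v2": "Cosine",
    "sentence-transformers/multi-qa-mpnet-base-dot-v1": "Dot",
    "sentence-transformers/all-distilroberta-v1": "Cosine",
    "intfloat/e5-large-v2": "Cosine",
    "BAAI/bge-large-en-v1.5": "Cosine",
    "sentence-transformers/clip-ViT-B-32": "Cosine",
}

for model, metric in MODEL_DISTANCE.items():
    print(f"{model}: Distance.{metric}")
```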
Step 6: Create collections for different models
The code below creates one collection for each of the three models, each configured with the matching dimension count, distance metric, and HNSW settings. Running it prints a confirmation line per collection showing its dimension and distance metric. The E5-large collection also applies int8 scalar quantization to reduce its memory footprint at scale.
Why quantization for E5-large
E5-large produces 1024-dimensional float32 vectors. The table below shows how much memory each model requires at scale, and how scalar int8 quantization cuts that footprint by 4x.

| Model | Dims | Bytes/vector (float32) | Bytes/vector (int8) | Savings |
|---|---|---|---|---|
| MiniLM | 384 | 1,536 | 384 | 4x |
| MPNet | 768 | 3,072 | 768 | 4x |
| E5-large | 1,024 | 4,096 | 1,024 | 4x |
always_ram=True keeps the compressed vectors in RAM for fast search while storing full-precision vectors on disk for rescoring. Verify this behavior against your product version and configuration before relying on it in production.
Expected output
This block iterates over the collection_configs dictionary and calls get_or_create for each entry, passing the matching VectorParams (dimension size, distance metric, and optional quantization config) along with the HNSW settings. Each collection is provisioned independently — MiniLM and MPNet without quantization, E5-large with int8 scalar quantization — and a confirmation line is printed per collection once it is ready. The output confirms the dimension count and distance metric that were applied to each collection.
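A sketch of that provisioning loop. The collection names are illustrative, the module path is assumed, and the get_or_create, VectorParams, QuantizationConfig, and ScalarQuantization signatures follow the call patterns listed in this tutorial's feature table; verify all of them against your SDK version.

```python
# Collection name -> configuration. Collection names are illustrative;
# dimensions and metrics follow the model table in Step 1.
collection_configs = {
    "emb-minilm": {"dim": 384, "distance": "Cosine", "int8": False},
    "emb-mpnet":  {"dim": 768, "distance": "Cosine", "int8": False},
    "emb-e5":     {"dim": 1024, "distance": "Cosine", "int8": True},
}

def create_collections(client):
    """Provision one collection per model (hypothetical call pattern)."""
    from actian_vectorai.models import (  # assumed module path
        Distance, VectorParams, QuantizationConfig, ScalarQuantization,
    )
    for name, cfg in collection_configs.items():
        quant = None
        if cfg["int8"]:
            # int8 scalar quantization, as described for E5-large above
            quant = QuantizationConfig(scalar=ScalarQuantization(
                type="int8", quantile=0.99, always_ram=True))
        client.collections.get_or_create(
            name,
            vectors_config=VectorParams(
                size=cfg["dim"],
                distance=getattr(Distance, cfg["distance"]),
                quantization_config=quant,
            ),
        )
        print(f"{name}: dim={cfg['dim']}, distance={cfg['distance']}")
```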
Step 7: Prepare a shared dataset
The code below defines a list of 20 short passages with topic and difficulty metadata. All three models embed this same dataset so that retrieval quality can be compared directly. The variety of topics — indexing, filtering, quantization, search — means different models may disagree on the top result for a given query, which makes the later comparison meaningful.
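A dataset of that shape can be sketched as follows. The passage texts here are illustrative examples (not the tutorial's originals); only the structure — 20 entries with text, topic, and difficulty — matches the description above.

```python
# Illustrative 20-passage dataset: 5 passages per topic, each with
# topic and difficulty metadata. Texts are example stand-ins.
T = [
    ("HNSW builds a layered proximity graph for approximate search.", "indexing", "basic"),
    ("Higher ef_construct improves recall at index build time.", "indexing", "medium"),
    ("The m parameter controls graph connectivity per node.", "indexing", "medium"),
    ("Index parameters can be tuned per collection.", "indexing", "basic"),
    ("Rebuilding an index applies new construction settings.", "indexing", "advanced"),
    ("Payload filters restrict search to matching metadata.", "filtering", "basic"),
    ("Keyword indexes accelerate exact-match filtering.", "filtering", "medium"),
    ("Range filters select numeric payload values.", "filtering", "basic"),
    ("Filters combine with vector search in one query.", "filtering", "medium"),
    ("Nested payload fields can be indexed for filtering.", "filtering", "advanced"),
    ("Scalar quantization compresses float32 vectors to int8.", "quantization", "basic"),
    ("Quantized vectors cut memory usage by four times.", "quantization", "basic"),
    ("Rescoring restores accuracy after quantized search.", "quantization", "medium"),
    ("Oversampling widens the quantized candidate pool.", "quantization", "medium"),
    ("always_ram keeps compressed vectors resident in memory.", "quantization", "advanced"),
    ("Cosine distance compares the angle between embeddings.", "search", "basic"),
    ("Dot product scoring respects vector magnitude.", "search", "medium"),
    ("Batch search sends many queries in one call.", "search", "basic"),
    ("Exact search brute-forces every stored vector.", "search", "medium"),
    ("Named vectors let one collection hold multiple spaces.", "search", "advanced"),
]
DOCUMENTS = [{"text": t, "topic": topic, "difficulty": d} for t, topic, d in T]
print(f"{len(DOCUMENTS)} documents across {len({d['topic'] for d in DOCUMENTS})} topics")
```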
Step 8: Embed and ingest with each model using upload_points
The code below embeds all 20 documents with each model and uploads the resulting vectors into the corresponding collection. Running it prints one line per model showing embedding time, ingestion time, and the total number of points confirmed in the collection after flushing.
Why upload_points instead of upsert
upload_points splits a large list into batches automatically (default batch_size=256). This avoids hitting gRPC message size limits when ingesting thousands of high-dimensional vectors at once. The method returns the number of points successfully uploaded; confirm the return type against your SDK version.
The snippet below shows the minimal call pattern.
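A sketch of that minimal pattern, assuming the upload_points and PointStruct shapes listed in this tutorial's feature table; the helper name, module path, and payload layout are assumptions to verify against your SDK version.

```python
def upload(client, collection, model, documents, batch_size=64):
    """Embed documents and upload them in batches (hypothetical pattern).

    upload_points is expected to split `points` into batches of
    `batch_size` automatically and return the number uploaded.
    """
    from actian_vectorai.models import PointStruct  # assumed module path
    vectors = model.encode([d["text"] for d in documents])
    points = [
        PointStruct(id=i, vector=vec.tolist(), payload=doc)
        for i, (vec, doc) in enumerate(zip(vectors, documents))
    ]
    return client.upload_points(collection, points, batch_size=batch_size)
```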
Expected output
The ingest_all function loops over all three model-to-collection mappings, embeds the 20 shared documents with each model, wraps the resulting vectors into PointStruct objects that carry the text, topic, difficulty, and model name as payload, and uploads them in batches of 64 using upload_points. After uploading, it flushes pending writes and reads the confirmed vector count from the collection. The output reports per-model embedding time, ingestion time, and the final point count, confirming that all 20 documents were stored in each collection.
Timing values vary by hardware and network conditions.
Step 9: Compare search quality across models
The code below runs three test queries against all three collections and prints the top-scoring documents from each. Running it lets you see whether different models surface different documents for the same query, and how confidently each model scores its top result.
What to look for
Use the same query across collections to see how each model ranks passages. Not every model will agree on rank 1, so the checks below keep comparisons fair.
- Check whether scores are spread between relevant and irrelevant results. Higher-quality models tend to produce wider separation, making it easier to set a threshold.
- Check whether the top result is the most relevant document. When models disagree on rank 1, the larger model is often a better reference, though results vary by query and domain.
- Check whether cosine scores fall above 0.7, which indicates strong relevance for most sentence transformers. Scores below 0.4 are usually noise.
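The rules of thumb above can be captured in two tiny helpers (names are illustrative) for banding scores and measuring separation:

```python
def relevance_band(cosine_score: float) -> str:
    """Classify a cosine score using the thresholds above:
    >= 0.7 indicates strong relevance, < 0.4 is usually noise."""
    if cosine_score >= 0.7:
        return "strong"
    if cosine_score < 0.4:
        return "noise"
    return "borderline"

def score_separation(scores: list[float]) -> float:
    """Gap between best and worst top-K scores; wider gaps make
    it easier to set a relevance threshold."""
    return max(scores) - min(scores)

print(relevance_band(0.82))   # strong
print(relevance_band(0.31))   # noise
print(score_separation([0.82, 0.55, 0.41]))
```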
Step 10: Quantized search with rescoring
Scalar quantization (as shown for E5-large) compresses vectors but can reduce ranking accuracy. The code below runs the same query three ways — quantized without rescoring, quantized with rescoring, and exact brute-force — and prints timing and top results for each mode so you can see the accuracy-speed trade-off directly.
How rescoring works
Quantized vectors make the first pass fast; rescoring then recomputes distances in full precision for the shortlist so rankings match what you would get without compression. The two passes work as follows.
- Initial pass: Search quantized (int8) vectors and retrieve top K × oversampling candidates.
- Rescore pass: Recompute distance using full float32 vectors and return top K.
| Setting | Speed | Accuracy | When to use |
|---|---|---|---|
| rescore=False | Fastest | Approximate | High-throughput workloads that can tolerate small errors. |
| rescore=True, oversampling=1.5 | Fast | Near-exact | Default recommendation for most production workloads. |
| rescore=True, oversampling=3.0 | Moderate | Very high | Quality-critical applications where accuracy outweighs speed. |
| exact=True | Slowest | Perfect | Benchmarking and establishing ground truth. |
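The "top K × oversampling" arithmetic of the initial pass is worth seeing concretely; the helper name below is illustrative.

```python
import math

def first_pass_candidates(k: int, oversampling: float) -> int:
    """The quantized first pass retrieves top K x oversampling candidates,
    which the rescore pass then narrows back down to top K in full precision."""
    return math.ceil(k * oversampling)

print(first_pass_candidates(10, 1.5))  # 15
print(first_pass_candidates(10, 3.0))  # 30
```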
Step 11: Use Datatype.Float16 for medium compression
Between full float32 and int8 quantization, Datatype.Float16 offers a middle ground: 2x memory reduction with negligible quality loss. The code below creates a float16 collection for MPNet, embeds the shared dataset, and prints a confirmation with the document count.
Datatype comparison
VectorParams can store vector components as float32 (default), float16, or uint8, trading RAM for precision. Choose a datatype before creating the collection; changing it later requires re-ingestion or migration. The table below shows the memory impact at one million 768-dim vectors.
Datatype.Uint8 is listed for completeness. Verify with your engineering team that it is supported in your SDK version before using it in production.

| Datatype | Bytes/component | Precision | Memory (1M × 768-dim) |
|---|---|---|---|
| Datatype.Float32 (default) | 4 | Full | ~3.0 GB |
| Datatype.Float16 | 2 | Near-full | ~1.5 GB |
| Datatype.Uint8 | 1 | Reduced | ~0.75 GB |
Float16 is a good choice for models like MPNet where you want to reduce memory without adding the complexity of quantization rescoring.
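The memory column in the table above is straightforward arithmetic; this sketch reproduces it (raw vector storage only, excluding index overhead).

```python
# Bytes per vector component for each datatype in the table above.
BYTES = {"Float32": 4, "Float16": 2, "Uint8": 1}

def memory_gb(n_vectors: int, dims: int, datatype: str) -> float:
    """Raw vector storage in decimal GB; excludes HNSW index overhead."""
    return n_vectors * dims * BYTES[datatype] / 1e9

for dt in BYTES:
    print(f"{dt}: {memory_gb(1_000_000, 768, dt):.2f} GB")
# Float32 ~3.07 GB, Float16 ~1.54 GB, Uint8 ~0.77 GB
```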
Step 12: Named vectors — multiple models in one collection
Rather than creating separate collections, you can store embeddings from multiple models in a single collection using named vectors. The code below creates a collection with two named vector spaces, embeds the shared dataset with both models, and uploads each document with both vectors attached. Running it prints a confirmation showing the document count and number of vector spaces.
Expected output
This block creates a single collection named embeddings-multi-model with two named vector spaces — "minilm" (384-dim, Cosine, float32) and "mpnet" (768-dim, Cosine, float16) — then encodes all 20 shared documents with both models in parallel. Each document is stored as a single PointStruct carrying both embedding vectors alongside its text, topic, and difficulty metadata. After uploading and flushing, it reads the confirmed vector count and prints a summary showing the number of documents stored and the number of named vector spaces active in the collection.
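The shape of the two named spaces can be sketched with plain dicts standing in for VectorParams (so the block runs without the SDK); in real code each entry would be a VectorParams under the vectors_config mapping shown in this tutorial's feature table.

```python
# Named vector spaces for the multi-model collection described above.
# Plain dicts stand in for VectorParams so the shape is visible
# without the SDK installed.
named_spaces = {
    "minilm": {"size": 384, "distance": "Cosine", "datatype": "Float32"},
    "mpnet":  {"size": 768, "distance": "Cosine", "datatype": "Float16"},
}

for name, params in named_spaces.items():
    print(f"{name}: {params['size']}-dim, {params['distance']}, {params['datatype']}")
```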
Step 13: Search individual models and fuse results
Each named space was produced by a different encoder, so you embed the query once per model and pass the vector that matches the using parameter ("minilm" or "mpnet"). The code below runs two single-space searches, then one fused query that merges candidate lists with reciprocal rank fusion (RRF). Running it prints the top three results from each approach side by side for the same query.
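RRF itself is simple enough to sketch in pure Python, independent of the SDK. Each list contributes 1/(k + rank) per document, so documents ranked highly by several models rise to the top; k = 60 is the conventional constant from the original RRF paper, and the document IDs here are illustrative.

```python
def rrf_fuse(ranked_lists, k: int = 60):
    """Reciprocal rank fusion over several ranked candidate lists."""
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)

minilm_top = ["d3", "d1", "d7"]  # hypothetical per-model result lists
mpnet_top = ["d3", "d9", "d1"]
print(rrf_fuse([minilm_top, mpnet_top]))  # ['d3', 'd1', 'd9', 'd7']
```

Note how d3, ranked first by both models, wins decisively; that consensus effect is exactly why fusion helps.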
Why multi-model fusion improves results
Different models capture different aspects of meaning. MiniLM is strong at lexical similarity, where a query containing “search” closely matches documents containing “search”. MPNet better captures paraphrases, where “ANN search” matches “approximate nearest-neighbour”. Documents that rank highly in both models are almost certainly relevant, and RRF fusion naturally promotes those consensus results.
The client.points.query(query={"fusion": Fusion.RRF}, prefetch=[...]) call pattern is SDK-specific. Verify the method name and payload shape against your Python client version before deploying.
Step 14: Re-embed specific vectors when switching models
To upgrade from one model to another, use update_vectors to re-embed specific vector spaces without touching other data. The code below simulates upgrading the MiniLM space from L6 to L12 by fetching all existing points, re-encoding each text with the upgraded model, and pushing only the new MiniLM vectors back. Running it prints a confirmation that the minilm space was updated and that MPNet vectors are unchanged.
Why update_vectors instead of re-ingesting
update_vectors only replaces the specified named vector(s). Payloads, other vector spaces, and point IDs remain untouched. This matters in the following situations.
- You want to upgrade one model without re-processing all data.
- Different teams own different embedding spaces.
- You need zero-downtime model upgrades.
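A sketch of the upgrade flow. The update_vectors and PointStruct shapes follow this tutorial's feature table, but the point-fetching call (scroll here), the module path, and the function name are assumptions; verify each against your SDK version.

```python
def upgrade_minilm_space(client, collection, new_model):
    """Re-embed only the 'minilm' named space, leaving payloads,
    point IDs, and the 'mpnet' space untouched (hypothetical pattern)."""
    from actian_vectorai.models import PointStruct  # assumed module path
    points = client.points.scroll(collection, with_payload=True)  # assumed fetch API
    updates = [
        PointStruct(
            id=p.id,
            # Only the named space being upgraded appears in the vector dict.
            vector={"minilm": new_model.encode(p.payload["text"]).tolist()},
        )
        for p in points
    ]
    client.points.update_vectors(collection, points=updates)
    print(f"Updated 'minilm' vectors for {len(updates)} points; 'mpnet' unchanged.")
```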
Step 15: Batch search — multiple queries in a single call
The code below builds a batch of three queries, encodes them, and sends all three to the same collection in a single gRPC call using search_batch. Running it loops over the MiniLM and MPNet collections in turn, printing the top two results for each query grouped by collection.
Why search_batch
search_batch sends up to 100 search requests in a single gRPC call. This is far more efficient than sending individual requests for three reasons.
- Network round-trips are reduced from N to 1.
- The server can parallelize the searches internally.
- Latency is dominated by the slowest query, not the sum of all queries.
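A sketch of the batched call, assuming the search_batch signature listed in this tutorial's feature table; the searches payload shape and helper name are assumptions to verify against your SDK version.

```python
def batch_search(client, collection, model, queries, top_k=2):
    """Encode all queries, then send them in one search_batch call
    instead of N separate requests (hypothetical payload shape)."""
    vectors = model.encode(queries)
    searches = [{"vector": v.tolist(), "limit": top_k} for v in vectors]
    results = client.points.search_batch(collection, searches=searches)
    for query, hits in zip(queries, results):
        print(f"{query!r} -> {[h.id for h in hits]}")
```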
Verify that client.points.search_batch(collection, searches=searches) matches the current Python SDK method signature before deploying.
Step 16: Build a model selection helper
The code below defines a recommend_model function that takes corpus size, latency budget, quality priority, and RAM budget as inputs and returns a fully configured ModelRecommendation dataclass. Running the four test scenarios at the end prints the recommended model, HNSW settings, and a plain-language reason for each configuration.
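A helper of that shape can be sketched as follows. The HNSW settings come from this tutorial's quick-reference table, but the selection thresholds are example heuristics, not product guidance.

```python
from dataclasses import dataclass

@dataclass
class ModelRecommendation:
    model: str
    dims: int
    hnsw_m: int
    hnsw_ef_construct: int
    quantize_int8: bool
    reason: str

def recommend_model(corpus_size: int, latency_critical: bool,
                    quality_first: bool, ram_gb: float) -> ModelRecommendation:
    """Illustrative selection logic; thresholds are example heuristics."""
    if latency_critical:
        return ModelRecommendation(
            "sentence-transformers/all-MiniLM-L6-v2", 384, 16, 128, False,
            "Smallest, fastest model for tight latency budgets.")
    # E5-large needs 4,096 bytes/vector in float32 before quantization.
    if quality_first and ram_gb >= corpus_size * 4096 / 1e9:
        return ModelRecommendation(
            "intfloat/e5-large-v2", 1024, 32, 256, True,
            "Highest quality; int8 quantization keeps RAM in check.")
    return ModelRecommendation(
        "sentence-transformers/all-mpnet-base-v2", 768, 16, 128, False,
        "Balanced quality and speed for general production use.")

rec = recommend_model(1_000_000, latency_critical=False,
                      quality_first=True, ram_gb=8.0)
print(rec.model, rec.dims, rec.reason)
```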
Step 17: Report and clean up tutorial collections
The code below lists all five tutorial collections with their document counts, flushes each one, then deletes them all. Running it confirms which collections exist before removing them so resources are freed without leaving orphaned data.
Quick reference: model → VectorAI configuration
The table below maps each model covered in this tutorial to its recommended VectorAI DB configuration. Use it as a checklist when setting up a new collection.

| Model | Dims | Distance | Datatype | Quantization | HNSW m | HNSW ef_construct |
|---|---|---|---|---|---|---|
| sentence-transformers/all-MiniLM-L6-v2 | 384 | Cosine | Default (f32) | None | 16 | 100–128 |
| sentence-transformers/all-MiniLM-L12-v2 | 384 | Cosine | Default (f32) | None | 16 | 128 |
| sentence-transformers/all-mpnet-base-v2 | 768 | Cosine | Float16 (if RAM tight) | None | 16 | 128 |
| sentence-transformers/multi-qa-mpnet-base-dot-v1 | 768 | Dot | Default (f32) | None | 16 | 128 |
| sentence-transformers/all-distilroberta-v1 | 768 | Cosine | Float16 (if RAM tight) | None | 16 | 128 |
| intfloat/e5-large-v2 | 1024 | Cosine | Default (f32) | ScalarQuantization (int8) | 32 | 256 |
| BAAI/bge-large-en-v1.5 | 1024 | Cosine | Default (f32) | ScalarQuantization (int8) | 32 | 256 |
| sentence-transformers/clip-ViT-B-32 | 512 | Cosine | Default (f32) | None | 16 | 128 |
Actian VectorAI features used
The table below lists every SDK feature and API used in this tutorial alongside its purpose.

| Feature | API | Purpose |
|---|---|---|
| Collection creation | collections.get_or_create(vectors_config=VectorParams(...)) | Configure dimensions, distance, and quantization. |
| Distance metrics | Distance.Cosine, Distance.Dot, Distance.Euclid, Distance.Manhattan | Match the metric to the model’s training objective. |
| Scalar quantization | QuantizationConfig(scalar=ScalarQuantization(type=Int8, quantile=0.99, always_ram=True)) | Compress large model vectors to reduce memory usage. |
| Datatype control | VectorParams(datatype=Datatype.Float16) | 2x memory reduction with near-full precision. |
| Quantized search | SearchParams(quantization=QuantizationSearchParams(rescore=True, oversampling=2.0)) | Fast quantized search with accuracy recovery via rescoring. |
| Exact search | SearchParams(exact=True) | Brute-force ground truth for benchmarking. |
| Batched upload | client.upload_points(collection, points, batch_size=64) | Automatic batching for large ingestions. |
| Named vectors | vectors_config={"minilm": VectorParams(...), "mpnet": VectorParams(...)} | Multiple models stored in one collection. |
| Named vector search | points.search(..., using="mpnet") | Search a specific embedding space by name. |
| Multi-model fusion | PrefetchQuery(using="minilm") + PrefetchQuery(using="mpnet") + Fusion.RRF | Combine results from different models using RRF. |
| Update vectors | points.update_vectors(points=[PointStruct(id=..., vector={"minilm": new_vec})]) | Re-embed one model’s space without touching others. |
| Batch search | points.search_batch(searches=[...]) | Send multiple queries in one gRPC call. |
| Selective payload | WithPayloadSelector(include=["text"]) | Return only the needed payload fields. |
| Collection exists | collections.exists(name) | Check whether a collection exists before operating on it. |
| Vector count | vde.get_vector_count() | Verify ingestion completed successfully. |
| Flush | vde.flush() | Persist pending writes to durable storage. |
Next steps
Explore the tutorials below to apply open-source embedding models alongside other Actian VectorAI DB capabilities.
Building multi-modal systems
Add image embeddings with CLIP alongside text models
Optimizing retrieval quality
Tune HNSW parameters, quantization, and search settings
Re-ranking search results
Improve result relevance with cross-encoders and fusion
Similarity search fundamentals
Master the core search and query workflow