- Comparing Euclidean and Manhattan distance metrics for medical text similarity and understanding when to use each.
- Applying scalar quantization at the collection level for memory-efficient storage of large patient registries.
- Using IVF indexing (IndexType.INDEX_TYPE_IVF_FLAT) with IvfConfigDiff as an alternative to HNSW for high-volume datasets.
- Configuring VectorParams with datatype and on_disk for storage optimization.
- Running server-side fusion using the Fusion enum to merge results from multiple prefetch stages without client-side code.
- Using random sampling via the Sample enum to retrieve random patient profiles for exploratory analysis.
- Applying selective payload retrieval with WithPayloadSelector to return only specific fields (for example, to exclude PHI).
- Applying selective vector retrieval with WithVectorsSelector for named vector subsets.
- Creating UUID payload indexes with UuidIndexParams for patient and trial identifier lookups.
- Configuring sharding, replication, and on-disk payload storage for production-grade deployments.
- Tuning search with indexed_only and ivf_nprobe in SearchParams.
Architecture Overview
Patient records and trial eligibility criteria are embedded into the same vector space, enabling semantic matching across collections. The pipeline supports multi-metric search, scalar quantization for performance, and production sharding for scale.

Environment Setup
Install the required packages before running any of the tutorial code.

Install Python Dependencies
The following command installs the SDK and the embedding model library. Running it confirms that both packages are available before any other step in this tutorial.

What This Installs
The two packages cover database operations and text embedding generation.

- actian-vectorai — Official Python SDK for Actian VectorAI DB (IVF indexing, scalar quantization, server-side fusion, sharding, gRPC transport).
- sentence-transformers — For generating text embeddings with all-MiniLM-L6-v2.
Implementation
This section walks through the full pipeline from collection setup to trial matching, covering each step in order.

Step 1: Import Dependencies and Configure
The following block imports every module used in this tutorial — including quantization types, IVF config, distance metrics, Fusion and Sample enums, selective retrieval selectors, UUID index params, and sharding types — then sets the connection and collection constants and loads the embedding model.
- ScalarQuantization, QuantizationConfig — Collection-level vector compression.
- WithPayloadSelector, WithVectorsSelector — Fine-grained control over returned fields.
- Datatype — Vector storage precision (Float32, Float16, Uint8).
- Fusion, Sample — Server-side fusion and random sampling enums.
- IndexType — Alternative index algorithms (IVF, FLAT, HNSW).
- IvfConfigDiff — IVF index tuning parameters.
- UuidIndexParams — UUID field indexing.
- ShardingMethod — Collection sharding strategy.
Expected Output
Running this block sets the server address, collection names, and embedding dimension as constants, then initializes the all-MiniLM-L6-v2 model. All four print statements confirm the active configuration values before any collection or data operation is performed.
Step 2: Define Embedding Helpers
The following two functions wrap the SentenceTransformer model. The rest of the pipeline calls these helpers instead of using the model directly, keeping embedding logic in one place.
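A minimal sketch of such helpers, written model-agnostically so the model object is passed in; it assumes only the standard sentence-transformers encode behavior (a single string yields one embedding array, a list yields one array per text):

```python
from typing import List, Sequence

def embed_text(model, text: str) -> List[float]:
    # Encode a single string and return a plain Python list,
    # which is the shape most vector DB clients expect for a query vector.
    return model.encode(text).tolist()

def embed_texts(model, texts: Sequence[str]) -> List[List[float]]:
    # Batch-encode for throughput during ingestion: one embedding per
    # input text, each converted from an array row to a plain list.
    return [row.tolist() for row in model.encode(list(texts))]
```

Keeping the conversion to plain lists in one place means the rest of the pipeline never handles numpy arrays directly.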
Step 3: Create the Patient Collection
The following create_patient_collection function creates a collection with Euclidean distance, Int8 scalar quantization, explicit Float32 vector precision, and two-shard automatic distribution. Running asyncio.run(create_patient_collection()) provisions the collection with all of these settings and prints a confirmation summary.
Scalar Quantization
ScalarQuantization compresses 32-bit float vectors to 8-bit integers. The table below describes each parameter and its effect.
| Parameter | Effect |
|---|---|
| type=QuantizationType.Int8 | Compress each float to 8 bits (4x memory reduction). |
| quantile=0.99 | Use the 99th percentile for calibration (handles outliers). |
| always_ram=True | Keep quantized vectors in RAM even if on_disk=True in VectorParams. |
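The server performs this compression internally; to make the quantile calibration concrete, here is a pure-Python sketch of Int8 scalar quantization (illustrative only, not the database's actual implementation):

```python
def quantize_int8(vector, quantile=0.99):
    # Calibrate the scale on a quantile of the absolute values so a
    # single outlier dimension does not stretch the int8 range.
    mags = sorted(abs(x) for x in vector)
    bound = mags[int(quantile * (len(mags) - 1))] or 1.0
    scale = bound / 127.0
    # Values beyond the calibration bound are clamped to the int8 limits.
    quantized = [max(-127, min(127, round(x / scale))) for x in vector]
    return quantized, scale

def dequantize_int8(quantized, scale):
    # Approximate reconstruction; quantization is lossy by design.
    return [q * scale for q in quantized]
```

With quantile=0.99, the rare dimensions above the 99th percentile saturate at the clamp, and in exchange the other 99% of values keep much finer resolution.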
VectorParams Fields
datatype=Datatype.Float32 explicitly sets vector precision, and on_disk=False keeps vectors in memory. The table below lists the available datatypes.
| Datatype | Bits per Dimension | Use Case |
|---|---|---|
| Float32 | 32 | Full precision (default). |
| Float16 | 16 | Half precision, 2x memory savings. |
| Uint8 | 8 | Smallest, for pre-quantized vectors. |
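To see what each precision means for storage, a quick back-of-the-envelope calculation for the 384-dimensional MiniLM embeddings used in this tutorial (raw vector bytes only, ignoring index and payload overhead):

```python
def vector_bytes(dimensions: int, bits_per_dim: int) -> int:
    # Raw storage for one vector at the given precision.
    return dimensions * bits_per_dim // 8

for name, bits in [("Float32", 32), ("Float16", 16), ("Uint8", 8)]:
    per_vector = vector_bytes(384, bits)
    # Scale up to a registry of one million patient vectors.
    print(f"{name}: {per_vector} B/vector, "
          f"{per_vector * 1_000_000 / 1e9:.2f} GB per 1M vectors")
```

At one million records the gap between Float32 and Uint8 is roughly a gigabyte of raw vector storage, which is why precision and quantization choices matter for large registries.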
Sharding and Replication
The table below describes the production deployment parameters passed to get_or_create.
| Parameter | Effect |
|---|---|
| shard_number=2 | Split data across 2 shards for parallelism. |
| replication_factor=1 | Number of data copies (1 = no replication). |
| write_consistency_factor=1 | Minimum replicas that must acknowledge a write. |
| on_disk_payload=False | Keep payload in memory for fast filtered search. |
| sharding_method=ShardingMethod.Auto | Automatic shard distribution. |
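The actual placement algorithm under ShardingMethod.Auto is server-internal, but a useful mental model is hashing each point id onto one of the shards, as in this illustrative sketch (the function and its behavior are an assumption for intuition, not the server's real routing):

```python
import zlib

def shard_for_point(point_id, shard_number: int = 2) -> int:
    # Stable hash of the id modulo the shard count: the same point
    # always lands on the same shard, and ids spread roughly evenly.
    return zlib.crc32(str(point_id).encode("utf-8")) % shard_number
```

The practical consequence is that searches fan out to all shards in parallel, while each individual point lives on exactly one of them.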
Expected Output
Running create_patient_collection() creates the collection with all of these settings applied — Euclidean distance, Int8 scalar quantization calibrated to the 99th percentile, two shards with automatic distribution, and Float32 vector precision held in RAM — then prints a summary confirming each parameter.
Step 4: Create the Trial Collection
The following create_trial_collection function creates a collection that uses Manhattan distance and an IVF-Flat index instead of HNSW. Running asyncio.run(create_trial_collection()) creates the collection and prints the distance metric and index configuration.
Manhattan Distance (L1 Norm)
Manhattan distance computes similarity as the sum of absolute differences. The table below compares the two metrics used in this tutorial. Cosine and dot product are supported but not demonstrated in this example — they are included for reference only.

| Metric | Formula | Behavior |
|---|---|---|
| Euclidean (L2) | sqrt(sum((a-b)^2)) | Penalizes large differences heavily. |
| Manhattan (L1) | sum(abs(a-b)) | Penalizes all differences equally. |
| Cosine (reference) | dot(a,b) / (norm(a)*norm(b)) | Direction only, ignores magnitude. |
| Dot (reference) | dot(a,b) | Direction and magnitude. |
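The formulas in the table translate directly into code; the following plain-Python reference implementations compute each metric:

```python
import math

def euclidean(a, b):
    # L2: differences are squared, so one large mismatch dominates.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    # L1: every dimension contributes linearly to the total.
    return sum(abs(x - y) for x, y in zip(a, b))

def cosine_similarity(a, b):
    # Direction only; vector magnitude cancels out of the ratio.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```

On the pair (0, 0) and (3, 4), Euclidean gives 5 while Manhattan gives 7, a small hint of why the two metrics produce different score scales when compared later in Step 13.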
IVF Indexing
IVF (inverted file index) is an alternative to HNSW. The table below describes each parameter.

| Parameter | Effect |
|---|---|
| IndexType.INDEX_TYPE_IVF_FLAT | Partitions vectors into nlist clusters; searches nprobe clusters at query time. |
| nlist=16 | Number of Voronoi partitions. |
| nprobe=4 | Number of partitions to search (higher = more accurate, slower). |
| training_sample_size=1000 | Number of vectors used to train cluster centroids. |
IVF is a good choice when:
- You need predictable memory usage (IVF has lower overhead than HNSW graphs).
- The dataset is large and you want fast approximate search with tunable accuracy.
- You plan to rebuild the index periodically.
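To build intuition for nlist and nprobe, here is a toy IVF index in plain Python (illustrative only: the database trains real centroids with clustering over a training sample, not random selection):

```python
import math
import random

def _l2(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def build_ivf(vectors, nlist):
    # Toy "training": sample nlist vectors as centroids, then put each
    # vector id into the inverted list of its nearest centroid.
    centroids = random.sample(vectors, nlist)
    inverted = [[] for _ in range(nlist)]
    for vid, vec in enumerate(vectors):
        nearest = min(range(nlist), key=lambda c: _l2(vec, centroids[c]))
        inverted[nearest].append(vid)
    return centroids, inverted

def ivf_search(query, vectors, centroids, inverted, nprobe, k):
    # Visit only the nprobe partitions whose centroids are closest to
    # the query, then rank their members exhaustively. Raising nprobe
    # improves recall at the cost of scanning more candidates.
    probes = sorted(range(len(centroids)),
                    key=lambda c: _l2(query, centroids[c]))[:nprobe]
    candidates = [vid for p in probes for vid in inverted[p]]
    return sorted(candidates, key=lambda vid: _l2(query, vectors[vid]))[:k]
```

When nprobe equals nlist, every partition is probed and the search degenerates to exact brute force; smaller nprobe trades recall for speed, which is exactly the knob ivf_nprobe exposes later in Step 10.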
Expected Output
Running create_trial_collection() provisions the Clinical-Trials collection with Manhattan distance and an IVF-Flat index configured with 16 Voronoi partitions and a default probe count of 4. The print statements confirm the selected distance metric and index settings upon successful creation.
Step 5: Create Payload Indexes
The following create_indexes function adds payload indexes across both collections — including UUID indexes, keyword filters, integer range indexes, and a datetime index. Running asyncio.run(create_indexes()) creates all indexes and prints a confirmation for each group.
UuidIndexParams is a dedicated index type for UUID-formatted string fields. The table below describes its two modes.
| Parameter | Effect |
|---|---|
| is_tenant=False | Standard UUID index, optimized for general lookups. |
| is_tenant=True | Tenant-optimized index, suited for multi-tenant filtering where UUIDs partition data. |
Expected Output
Running create_indexes() registers all eight payload indexes — one UUID index, one keyword index, one integer range index, and one datetime index on the patient collection, plus one tenant-optimized UUID index, two keyword indexes, and one integer range index on the trial collection — then prints a confirmation line for each group created.
Step 6: Prepare Sample Patient and Trial Data
The following block defines the patient records and clinical trials that will be ingested in Step 7. Running it loads the data into memory and prints a count of both collections.

Expected Output
This block defines six patient records and four clinical trial entries as Python dictionaries, with each patient carrying a free-text profile_text field alongside structured payload fields such as primary_condition, age, biomarkers, and prior_treatments. The final print statement confirms how many records of each type are held in memory before ingestion begins.
Step 7: Ingest Data into Both Collections
The following ingest_data function batch-embeds patient profiles and trial criteria, upserts both sets of points into their respective collections, flushes the data to disk, and prints the final vector counts. Running asyncio.run(ingest_data()) stores all records and confirms the totals.
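Conceptually, ingestion pairs each record's embedding with its structured payload before upserting. A schematic sketch, using plain dictionaries as stand-ins for the SDK's PointStruct objects (the sequential integer ids are an assumption of this sketch):

```python
def build_points(records, embeddings):
    # One point per record: a sequential id, the record's embedding,
    # and the full record dictionary carried along as the payload.
    return [
        {"id": i, "vector": vec, "payload": dict(rec)}
        for i, (rec, vec) in enumerate(zip(records, embeddings))
    ]
```

Because the payload travels with the vector, later steps can filter on fields like primary_condition without a second lookup.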
Expected Output
Running ingest_data() batch-embeds all patient profile texts and trial criteria texts using embed_texts, constructs PointStruct objects with full payloads, upserts each set into its respective collection, and flushes both to disk. The final get_vector_count calls verify that all vectors have been persisted, and the print statement confirms the ingested totals for patients and trials.
Step 8: Selective Payload Retrieval with WithPayloadSelector
WithPayloadSelector controls exactly which payload fields are returned in a search result. This is useful in healthcare contexts where query results should limit exposure of sensitive fields such as patient identifiers or free-text profiles. Note that selective payload retrieval reduces data exposure at the query level; it is not a substitute for a comprehensive data governance or compliance program.
The following two functions demonstrate the two selector modes. search_patients_phi_safe returns only the de-identified clinical attributes listed in the include selector. search_patients_exclude_fields returns all fields except those listed in the exclude selector. Running both functions with the same query and printing the payload keys shows the difference in what each mode returns.
The table below describes the three WithPayloadSelector modes.
| Mode | Effect |
|---|---|
| WithPayloadSelector(include=["field1", "field2"]) | Return only these fields. |
| WithPayloadSelector(exclude=["field3", "field4"]) | Return everything except these fields. |
| WithPayloadSelector(enable=False) | Equivalent to with_payload=False. |
Example access policies:
- Research queries — Include only de-identified clinical attributes (condition, age, biomarkers).
- Admin queries — Include everything.
- External API responses — Exclude patient_uuid, enrolled_date, and free-text profiles.
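The selector semantics fit in a few lines; this client-side sketch mirrors what the server does when trimming payloads (the field names are from the tutorial's data, the function itself is illustrative):

```python
def apply_payload_selector(payload, include=None, exclude=None, enable=True):
    # Mirrors the three WithPayloadSelector modes on a plain dict.
    if not enable:
        return {}  # enable=False: no payload at all
    if include is not None:
        return {k: v for k, v in payload.items() if k in include}
    if exclude is not None:
        return {k: v for k, v in payload.items() if k not in exclude}
    return dict(payload)  # default: return everything
```

An external-API policy, for example, would pass exclude=["patient_uuid", "enrolled_date", "profile_text"] so identifying fields never leave the result-shaping layer.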
Expected Output
Running both search functions with the query "HER2 positive breast cancer patient for clinical trial matching" retrieves the top matching patients. The include-only run returns only the six de-identified fields listed in the selector, omitting patient_uuid, profile_text, and enrolled_date. The exclude run returns all fields except those three, so prior_treatments reappears in the key list. The two result sets show the same patients and scores but different sets of payload keys.
Step 9: Selective Vector Retrieval with WithVectorsSelector
WithVectorsSelector controls which vectors are returned with each result point — useful when working with named vector collections or when you want to reduce response payload size.
The following code retrieves the same set of patient points twice: once with full vectors included and once without. The output shows the difference in vector_dim between the two calls, which illustrates the bandwidth savings from omitting vectors when they are not needed.
WithVectorsSelector supports three modes. For collections with named vectors, include lets you retrieve a subset, saving bandwidth when only one vector is needed for a downstream task.
| Mode | Effect |
|---|---|
| WithVectorsSelector(enable=True) | Return all vectors (same as with_vectors=True). |
| WithVectorsSelector(enable=False) | Return no vectors. |
| WithVectorsSelector(include=["narrative"]) | Return only the specified named vectors. |
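As with the payload selector, the vector selector's behavior can be sketched client-side on a mapping of vector name to values (illustrative, with the named vectors represented as a plain dict):

```python
def apply_vectors_selector(named_vectors, enable=None, include=None):
    # named_vectors maps a vector name -> its list of floats.
    if include is not None:
        # Named-vector subset: keep only the requested vectors.
        return {k: v for k, v in named_vectors.items() if k in include}
    if enable:
        return dict(named_vectors)  # enable=True: all vectors
    return {}                       # enable=False: none
```

Dropping a 384-dimensional float vector from each of, say, 100 results saves on the order of 150 KB per response, which is the bandwidth argument for enable=False.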
Expected Output
The code fetches patient points 0, 1, and 2 from the Clinical-Patients collection twice: once with WithVectorsSelector(enable=True) to include the full 384-dimensional embedding, and once with WithVectorsSelector(enable=False) to suppress vector retrieval entirely. Both calls include only the primary_condition payload field. The output below shows that the enable=True call returns 384-dimensional vectors while the enable=False call returns none, confirming the bandwidth savings when vectors are not needed downstream.
Step 10: Search with IVF-Specific Parameters
When using IVF indexes, SearchParams supports ivf_nprobe and indexed_only for fine-grained control over accuracy and coverage. The following function overrides the collection’s default nprobe value and restricts the search to fully indexed segments. Running asyncio.run(search_trials_ivf_tuned(...)) returns trials ranked by Manhattan distance using those overridden parameters.
The table below describes the SearchParams fields used here.
| Parameter | Default | Effect |
|---|---|---|
| ivf_nprobe | Collection default | Override the number of IVF partitions to search at query time. |
| indexed_only | False | If True, skip unindexed segments (vectors not yet assigned to IVF clusters). |
ivf_nprobe=8 searches twice as many partitions as the collection default (nprobe=4), improving recall at the cost of latency. indexed_only=True ensures only properly indexed vectors are searched — useful during bulk ingestion when some vectors have not yet been assigned to clusters.
Expected Output
The function embeds the query "Clinical trial for EGFR mutant lung cancer after osimertinib progression", searches the Clinical-Trials collection with nprobe=8 and indexed_only=True, and returns the top trials ranked by Manhattan distance. The EAGLE-LUNG trial scores lowest (most similar) because its criteria text closely matches the EGFR progression query, while BEACON-HER2 ranks second despite being an oncology trial for a different condition.
Step 11: Server-Side Fusion with the Fusion Enum
Instead of merging results in client code, the query endpoint supports server-side fusion via the Fusion enum. The fusion strategy is passed directly in the query parameter — for example, query={"fusion": Fusion.RRF} — while the prefetch list supplies the individual result sets to merge. The following function issues two prefetch stages — one filtered to breast cancer patients and one unfiltered — and merges them on the server using RRF. Running asyncio.run(server_side_fusion_search(...)) returns patients ranked by their fused RRF score without any client-side merge code.
| Approach | Where Fusion Happens | API |
|---|---|---|
| Client-side | Python SDK | reciprocal_rank_fusion([results_a, results_b]) |
| Server-side | VectorAI DB server | query(query={"fusion": Fusion.RRF}, prefetch=[...]) |
Server-side fusion has several advantages:
- Results never leave the server until after fusion, reducing network overhead.
- The server can optimize the merge internally.
- The approach works with any number of prefetch stages.
The Fusion enum has two values: Fusion.RRF (reciprocal rank fusion) and Fusion.DBSF (distribution-based score fusion).
Breast cancer patients score highest in the output because they appear in both the filtered and unfiltered prefetch stages.
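RRF itself is a simple rank-based formula; the server-side merge behaves like this client-side reference implementation (illustrative, using the conventional k = 60 smoothing constant):

```python
def reciprocal_rank_fusion(result_lists, k=60):
    # Each result list is an ordered sequence of point ids.
    # score(id) = sum over lists of 1 / (k + rank), with rank starting at 1,
    # so an id that ranks well in several lists accumulates a high score.
    scores = {}
    for results in result_lists:
        for rank, point_id in enumerate(results, start=1):
            scores[point_id] = scores.get(point_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

A point appearing near the top of both prefetch stages (a breast cancer patient in both the filtered and unfiltered stage, say) collects two large reciprocal terms and rises above points that appear in only one list.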
Expected Output
The function searches the Clinical-Patients collection with the query "HER2 positive breast cancer patient failed prior targeted therapy". The first prefetch stage is filtered to primary_condition == breast_cancer, and the second is unfiltered. Server-side RRF merges both result sets by rank position without any client-side code. Breast cancer patients receive boosted RRF scores because they rank highly in both stages, placing them at the top of the fused results. Only primary_condition, age, biomarkers, and ecog_status are returned in the payload.
Step 12: Random Sampling with the Sample Enum
The Sample enum enables random point retrieval — useful for selecting audit samples or spot-checking data quality. Random sampling retrieves points without any vector similarity scoring, so the results are non-deterministic and differ on each call. The following code calls random_patient_sample twice with the same parameters. Because each call returns a different random set, the two outputs together confirm that sampling is non-deterministic.
Sample.Random retrieves random points from the collection without any vector similarity scoring. The table below shows how it differs from the other query modes.
| Mode | Behavior |
|---|---|
| search(vector=...) | Returns points ranked by vector similarity. |
| query(query={"order_by": ...}) | Returns points sorted by a payload field. |
| query(query={"sample": Sample.Random}) | Returns random points with no ranking. |
- Quality audits — Randomly sample records for review.
- Data validation — Spot-check random entries for data integrity.
- Exploratory analysis — Retrieve a random subset without any bias from similarity ranking.
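The server performs the sampling internally; conceptually it is the same operation as drawing a uniform sample of point ids with the standard library, as in this illustrative sketch:

```python
import random

def random_audit_sample(point_ids, n):
    # Uniform sample without replacement: no similarity ranking is
    # involved, and successive calls return different subsets.
    return random.sample(list(point_ids), n)
```

Because no seed is fixed, two calls with the same arguments generally return different id sets, which is the non-determinism the tutorial demonstrates by running the sample function twice.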
Expected Output
Running the sample function twice retrieves three patients at random from the Clinical-Patients collection on each call, applying no vector similarity scoring. The payload selector returns only primary_condition, age, sex, and site for each result. Because Sample.Random bypasses any ranking, the two calls return completely different patient sets, and neither order reflects medical relevance or record insertion order.
Step 13: Compare Distance Metrics — Euclidean vs Manhattan
Different distance metrics produce different similarity rankings for the same query. The following function runs the same query against the patient collection (Euclidean) and the trial collection (Manhattan) and returns both result sets so the score ranges can be compared directly.

- Euclidean — Absolute distance in vector space. Lower scores mean more similar. Sensitive to magnitude differences.
- Manhattan — Sum of absolute differences. Lower scores mean more similar. Less sensitive to outlier dimensions than Euclidean.
Expected Output
The function embeds the query "Advanced breast cancer with HER2 targeted therapy" once and uses the resulting vector to search both collections. The patient collection uses Euclidean distance and returns patients ranked by L2 norm, while the trial collection uses Manhattan distance and returns trials ranked by L1 norm. The output shows Euclidean scores in the 5–8 range and Manhattan scores in the 40–60 range — the same query vector produces numeric scales that differ by roughly an order of magnitude because each formula accumulates distance differently across 384 dimensions.
Step 14: Build the Trial Matching Engine
The following evaluate_eligibility function applies four rule-based checks to each trial returned by semantic search: condition match, ECOG status, site availability, and trial recruiting status. It computes a match score as the percentage of checks passed, then sorts results with eligible trials first.
Note that this eligibility engine is a simplified demonstration. Real clinical trial inclusion and exclusion criteria cover many additional factors — biomarker profiles, prior treatment sequences, organ function thresholds, geographic constraints, and more. This example should not be used as a substitute for a clinically validated matching workflow.
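A minimal sketch of such a rule layer, assuming simplified field names (primary_condition, ecog_status, and site on the patient; condition, max_ecog, sites, and status on the trial payload — these names are illustrative, not a fixed schema):

```python
def evaluate_eligibility(patient, trial):
    # The four rule-based checks described above; each is a hard
    # boolean gate layered on top of the semantic search results.
    checks = {
        "condition_match": patient["primary_condition"] == trial["condition"],
        "ecog_ok": patient["ecog_status"] <= trial["max_ecog"],
        "site_available": patient["site"] in trial["sites"],
        "recruiting": trial["status"] == "recruiting",
    }
    # Match score is the percentage of checks passed.
    match_score = 100 * sum(checks.values()) // len(checks)
    return {"eligible": all(checks.values()),
            "match_score": match_score,
            "checks": checks}
```

In the pipeline, a function like this would run once per trial returned by the vector search, with the results then sorted so eligible trials appear first.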
Step 15: Run the End-to-End Trial Matching Pipeline
The following match_patient_to_trials function wires the matching engine from Step 14 into a complete pipeline. For each patient, it embeds the profile text, retrieves the most similar trials via semantic search, passes those results to evaluate_eligibility, and prints a formatted report showing which trials the patient qualifies for and which eligibility checks passed or failed. Running the function for patient_records[0] and patient_records[1] produces one report per patient.
Expected Output
Running the pipeline for the first two patients produces a matching report for each, showing which trials they qualify for and which eligibility checks passed or failed. The function embeds each patient’s profile_text, retrieves the top five semantically similar trials from Clinical-Trials, then passes those results to evaluate_eligibility for rule-based scoring. For patient_records[0] (a 58-year-old female with HER2-positive breast cancer at Memorial Cancer Center), BEACON-HER2 passes all four eligibility checks and receives a 100% match score, while EAGLE-LUNG fails on condition and site checks. For patient_records[1] (a 72-year-old male with EGFR-mutant NSCLC at University Medical Center), EAGLE-LUNG passes all four checks and is marked eligible.