VectorAI DB exposes a /metrics endpoint on the REST API port (default 6333) that serves metrics in Prometheus/OpenMetrics format. Use these metrics to monitor REST API usage, process health, application status, and collection statistics.
| | |
|---|---|
| Endpoint | GET /metrics |
| Port | REST API port (default 6333) |
| Format | Prometheus / OpenMetrics |
Scrape configuration
Add VectorAI DB as a Prometheus scrape target. The following example shows a minimal prometheus.yml configuration:
```yaml
scrape_configs:
  - job_name: "vectorai"
    scrape_interval: 15s
    static_configs:
      - targets: ["localhost:6333"]
```
For Docker Compose deployments, replace localhost with the service name:
```yaml
scrape_configs:
  - job_name: "vectorai"
    scrape_interval: 15s
    static_configs:
      - targets: ["vectorai:6333"]
```
The /metrics endpoint does not require authentication. If you expose it on a public network, restrict access with a firewall rule or reverse proxy.
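Outside of Prometheus, you can also fetch the endpoint directly and parse the text exposition format yourself. The following is a minimal sketch; the `parse_metrics`/`fetch_metrics` helpers and the localhost URL are illustrative assumptions, not part of the VectorAI DB API:

```python
import re
import urllib.request

def parse_metrics(text):
    """Parse Prometheus text exposition format into {metric: [(labels, value), ...]}."""
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comment lines
        m = re.match(r"^([a-zA-Z_:][a-zA-Z0-9_:]*)(\{[^}]*\})?\s+(\S+)", line)
        if not m:
            continue
        name, labels, value = m.group(1), m.group(2) or "", float(m.group(3))
        metrics.setdefault(name, []).append((labels, value))
    return metrics

def fetch_metrics(url="http://localhost:6333/metrics"):
    # Hypothetical local deployment; adjust host and port for your setup.
    with urllib.request.urlopen(url) as resp:
        return parse_metrics(resp.read().decode())

# Demonstrate the parser on a small sample in the exposition format:
sample = """\
# TYPE collections_total gauge
collections_total 3
rest_responses_total{endpoint="/collections",method="GET",status="200"} 42
"""
parsed = parse_metrics(sample)
print(parsed["collections_total"][0][1])  # 3.0
```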
Available metrics
The following sections describe every metric exposed by the /metrics endpoint, grouped by category.
Application info
These metrics expose application identity and operational state.
| Metric | Type | Labels | Description |
|---|---|---|---|
| app_info | Info | name, version | Application name and version. Set once at process start from built-in metadata. |
| app_status_recovery_mode | Gauge | — | 1 if the engine is in recovery mode, 0 otherwise. Changed whenever the engine enters or exits recovery mode. |
Collection metrics
These metrics provide visibility into collection sizes, vector counts, and optimization state.
| Metric | Type | Labels | Description |
|---|---|---|---|
| collections_total | Gauge | — | Total number of collections (both loaded in memory and present on disk). Increased on creation and decreased on removal. |
| collections_vector_total | Gauge | — | Total number of vectors across all collections. Recomputed whenever any collection's vector count changes. |
| collection_points | Gauge | collection | Number of points in a collection. Taken from the count of external identifiers the collection tracks. |
| collection_vectors | Gauge | collection, vector_name | Number of vectors in a collection across all vector spaces. Calculated by summing vector counts per space; updated on inserts, deletes, and rebuilds. |
| collection_running_optimizations | Gauge | collection | 1 if the collection is undergoing a rebuild or optimization, 0 if idle. Set when a rebuild task begins and cleared when it ends. |
| collection_indexed_only_excluded_points | Gauge | collection | Number of points excluded from the indexed-only view (for example, deleted or hidden points). |
Rebuild metrics
These metrics track index rebuild operations across all collections.
| Metric | Type | Labels | Description |
|---|---|---|---|
| rebuild_running | Gauge | — | 1 if at least one rebuild is in progress, 0 otherwise. Reset to 0 when the last active rebuild finishes. |
| rebuild_triggered_total | Counter | — | Cumulative count of rebuild tasks submitted. Incremented each time a rebuild request is accepted. |
| rebuild_success_total | Counter | — | Cumulative count of rebuilds that completed successfully. |
| rebuild_failed_total | Counter | — | Cumulative count of rebuilds that failed or were cancelled. |
| rebuild_duration_seconds | Histogram | — | Total rebuild durations, measured from start to finish and recorded in predefined time buckets. |
| rebuild_vectors_processed_total | Counter | — | Total vectors processed across all rebuilds (read or written). |
| rebuild_vectors_skipped_total | Counter | — | Total vectors skipped during rebuilds because they were already up to date. |
| rebuild_vectors_deleted_total | Counter | — | Total vectors deleted as part of rebuilds. |
| rebuild_phase_duration_seconds | Histogram | phase | Duration of individual rebuild phases (for example, initialization, population, finalization). |
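Because rebuild_duration_seconds is a histogram, Prometheus exposes its buckets as series with a _bucket suffix, which can be fed into histogram_quantile. For example, a p95 rebuild duration over the last five minutes might look like:

```promql
histogram_quantile(0.95, sum by (le) (rate(rebuild_duration_seconds_bucket[5m])))
```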
Snapshot metrics
These metrics track snapshot creation and recovery operations.
| Metric | Type | Labels | Description |
|---|---|---|---|
| snapshot_creation_running | Gauge | collection | 1 if a snapshot creation is in progress for a collection, 0 if idle. |
| snapshot_recovery_running | Gauge | collection | 1 if a snapshot recovery is in progress for a collection, 0 if idle. |
| snapshot_created_total | Counter | — | Cumulative count of successful snapshot creations. |
REST API metrics
These metrics track HTTP request volume and latency across all REST endpoints.
| Metric | Type | Labels | Description |
|---|---|---|---|
| rest_responses_total | Counter | endpoint, method, status | Total number of REST responses. Increased for every response the server sends. |
| rest_responses_fail_total | Counter | endpoint, method, status | REST responses that returned a non-2xx status. |
| rest_responses_duration_seconds | Histogram | endpoint, method, status | REST request latency measured from request arrival to response. |
Use rest_responses_total to track request rates and error ratios. Use rest_responses_duration_seconds to compute percentile latencies (p50, p95, p99) per endpoint.
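For example, a per-endpoint p99 latency can be computed from the histogram buckets (the _bucket suffix is the standard Prometheus histogram convention):

```promql
histogram_quantile(0.99, sum by (le, endpoint) (rate(rest_responses_duration_seconds_bucket[5m])))
```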
gRPC API metrics
These metrics track gRPC call volume and latency.
| Metric | Type | Labels | Description |
|---|---|---|---|
| grpc_responses_total | Counter | method, status | Total number of gRPC responses. Increased for every completed RPC call. |
| grpc_responses_fail_total | Counter | method, status | gRPC responses that finished with an error status. |
| grpc_responses_duration_seconds | Histogram | method, status | gRPC call latency measured from call start to final status. |
Memory metrics
These metrics report on memory usage from the allocator and the operating system.
| Metric | Type | Description |
|---|---|---|
| memory_active_bytes | Gauge | Current active memory usage reported by the allocator. |
| memory_resident_bytes | Gauge | Resident set size (RSS) of the process, obtained from the OS. |
| memory_allocated_bytes | Gauge | Total memory allocated by the allocator, including memory that has been freed. |
| memory_metadata_bytes | Gauge | Memory used by the allocator for its own bookkeeping structures. |
| memory_retained_bytes | Gauge | Memory retained by the allocator but not currently in use. |
Process metrics
These metrics report on the operating-system-level health of the VectorAI DB process.
| Metric | Type | Description |
|---|---|---|
| process_threads | Gauge | Number of threads currently running in the process. |
| process_open_fds | Gauge | Number of open file descriptors held by the process. |
| process_open_mmaps | Gauge | Number of memory-mapped regions owned by the process. |
| process_minor_page_faults_total | Counter | Cumulative minor page faults since process start. |
| process_major_page_faults_total | Counter | Cumulative major page faults since process start. Each major fault requires disk I/O. |
| process_cpu_seconds_total | Counter | Total CPU time consumed (user + kernel) in seconds. |
A sustained increase in process_major_page_faults_total indicates the system is running low on physical memory and paging to disk, which severely degrades search performance. Consider increasing available memory or reducing the number of loaded collections.
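To watch for this condition, graph the major-fault rate directly; a sustained non-zero value is the signal described above:

```promql
rate(process_major_page_faults_total[5m])
```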
Example PromQL queries
The following queries demonstrate common monitoring patterns you can use in Grafana or any Prometheus-compatible dashboard tool.
REST request rate by endpoint
```promql
sum by (endpoint) (rate(rest_responses_total[5m]))
```
REST error ratio
```promql
sum(rate(rest_responses_fail_total[5m]))
/
sum(rate(rest_responses_total[5m]))
```
REST p95 latency per endpoint
```promql
histogram_quantile(0.95, sum by (le, endpoint) (rate(rest_responses_duration_seconds_bucket[5m])))
```
gRPC request rate by method
```promql
sum by (method) (rate(grpc_responses_total[5m]))
```
gRPC error ratio
```promql
sum(rate(grpc_responses_fail_total[5m]))
/
sum(rate(grpc_responses_total[5m]))
```
Memory usage
Total vectors across all collections
Points per collection
Active rebuilds
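The four dashboard queries above were listed without expressions. Based on the metrics documented earlier, minimal sketches would be (the "Active rebuilds" expression assumes you want a count of collections currently optimizing; use rebuild_running instead for a simple 0/1 indicator):

```promql
# Memory usage
memory_resident_bytes

# Total vectors across all collections
collections_vector_total

# Points per collection
collection_points

# Active rebuilds (count of collections currently optimizing)
sum(collection_running_optimizations)
```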
Rebuild success rate
```promql
sum(rate(rebuild_success_total[1h]))
/
sum(rate(rebuild_triggered_total[1h]))
```
Recommended alerts
The following table lists suggested Prometheus alerting rules for production deployments.
| Alert | Condition | Severity | Description |
|---|---|---|---|
| High REST error rate | sum(rate(rest_responses_fail_total[5m])) / sum(rate(rest_responses_total[5m])) > 0.05 | Warning | More than 5% of REST requests failing |
| High REST p95 latency | histogram_quantile(0.95, sum by (le) (rate(rest_responses_duration_seconds_bucket[5m]))) > 2 | Warning | REST p95 latency exceeds 2 seconds |
| High gRPC error rate | sum(rate(grpc_responses_fail_total[5m])) / sum(rate(grpc_responses_total[5m])) > 0.05 | Warning | More than 5% of gRPC calls failing |
| Recovery mode active | app_status_recovery_mode == 1 | Critical | Engine is in recovery mode |
| High memory usage | memory_resident_bytes > 0.8 * <memory_limit> | Warning | RSS exceeds 80% of available memory |
| Major page faults rising | rate(process_major_page_faults_total[5m]) > 10 | Warning | Sustained major page faults indicate memory pressure |
| File descriptor exhaustion | process_open_fds > 0.8 * <fd_limit> | Warning | Open file descriptors approaching system limit |
| Rebuild failures | rate(rebuild_failed_total[1h]) > 0 | Warning | One or more index rebuilds have failed |
Replace <memory_limit> and <fd_limit> with the actual limits for your deployment environment.
Example alerting rule
The following Prometheus alerting rule fires when the REST error ratio exceeds 5% for more than 5 minutes:
```yaml
groups:
  - name: vectorai
    rules:
      - alert: VectorAIHighErrorRate
        expr: >
          sum(rate(rest_responses_fail_total[5m]))
          /
          sum(rate(rest_responses_total[5m]))
          > 0.05
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "VectorAI DB error rate above 5%"
          description: "{{ $value | humanizePercentage }} of requests are returning errors."
```
Logging
VectorAI DB writes structured logs to stdout. Configure the log format and level to suit your log aggregation pipeline.
Log format
The default format is text, which is human-readable but harder to parse programmatically. Set the format to json for machine-readable output compatible with log aggregation tools such as Elasticsearch, Loki, or Datadog.
Log level
Control log verbosity with the level setting:
| Level | Use case |
|---|---|
| error | Production — only errors |
| warn | Production — errors and warnings |
| info | Production default — normal operational messages |
| debug | Troubleshooting — verbose output |
| trace | Development only — extremely verbose |
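As a sketch, format and level might be set together in a YAML configuration file with a log section. The exact key names here are an assumption for illustration; check your deployment's configuration reference:

```yaml
log:
  format: json   # "text" (default) or "json"
  level: info    # error, warn, info, debug, or trace
```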
Running at debug or trace level in production generates significant log volume and may impact performance. Use these levels only for short-term troubleshooting.
Next steps
Explore these related guides to learn more.
Troubleshooting
Diagnose connection, performance, and startup issues.
Error handling
Handle specific gRPC error codes in your application code.
Docker installation
Container setup, volume mounts, and Docker Compose configuration.
License and upgrade
Manage license keys and upgrade your VectorAI DB deployment.