VectorAI DB exposes a /metrics endpoint on the REST API port (default 6333) that serves metrics in Prometheus/OpenMetrics format. Use these metrics to monitor REST API usage, process health, application status, and collection statistics.
Endpoint: GET /metrics
Port: REST API port (default 6333)
Format: Prometheus / OpenMetrics

Scrape configuration

Add VectorAI DB as a Prometheus scrape target. The following example shows a minimal prometheus.yml configuration:
scrape_configs:
  - job_name: "vectorai"
    scrape_interval: 15s
    static_configs:
      - targets: ["localhost:6333"]
For Docker Compose deployments, replace localhost with the service name:
scrape_configs:
  - job_name: "vectorai"
    scrape_interval: 15s
    static_configs:
      - targets: ["vectorai:6333"]
The /metrics endpoint does not require authentication. If you expose it on a public network, restrict access with a firewall rule or reverse proxy.
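To spot-check the endpoint without a full Prometheus deployment, you can fetch and parse the text exposition format directly. The sketch below parses a hard-coded sample payload (the metric values are illustrative, not real output); against a live instance you could instead read the text with `urllib.request.urlopen("http://localhost:6333/metrics")`. The parser is deliberately simplified and does not handle label values containing commas or braces.

```python
# Minimal sketch of parsing the Prometheus text exposition format.
# Sample payload is illustrative; real /metrics output has many more series.

def parse_metrics(text: str) -> dict:
    """Parse text-format metrics into {metric_name: [(labels_dict, value)]}."""
    metrics: dict = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blank, HELP, and TYPE lines
            continue
        name_part, _, value_part = line.rpartition(" ")
        if "{" in name_part:
            name, raw = name_part.split("{", 1)
            # naive label split: breaks on commas/braces inside quoted values
            labels = dict(item.split("=", 1)
                          for item in raw.rstrip("}").split(","))
            labels = {k: v.strip('"') for k, v in labels.items()}
        else:
            name, labels = name_part, {}
        metrics.setdefault(name, []).append((labels, float(value_part)))
    return metrics

sample = """\
# TYPE collections_total gauge
collections_total 3
collection_points{collection="docs"} 1200
"""
parsed = parse_metrics(sample)
print(parsed["collections_total"])        # [({}, 3.0)]
print(parsed["collection_points"][0][0])  # {'collection': 'docs'}
```

For production use, an off-the-shelf parser (such as the one shipped with the official Prometheus Python client) is a better choice than hand-rolled splitting.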

Available metrics

The following sections describe every metric exposed by the /metrics endpoint, grouped by category.

Application info

These metrics expose application identity and operational state.
| Metric | Type | Labels | Description |
|---|---|---|---|
| app_info | Info | name, version | Application name and version. Set once when the process starts from built-in metadata. |
| app_status_recovery_mode | Gauge | — | 1 if the engine is in recovery mode, 0 otherwise. Changed whenever the engine enters or exits recovery mode. |

Collection metrics

These metrics provide visibility into collection sizes, vector counts, and optimization state.
| Metric | Type | Labels | Description |
|---|---|---|---|
| collections_total | Gauge | — | Total number of collections (both loaded in memory and present on disk). Increased on creation and decreased on removal. |
| collections_vector_total | Gauge | — | Total number of vectors across all collections. Recomputed whenever any collection's vector count changes. |
| collection_points | Gauge | collection | Number of points in a collection. Taken from the count of external identifiers the collection tracks. |
| collection_vectors | Gauge | collection, vector_name | Number of vectors in a collection across all vector spaces. Calculated by summing vector counts per space; updated on inserts, deletes, and rebuilds. |
| collection_running_optimizations | Gauge | collection | 1 if the collection is undergoing a rebuild or optimization, 0 if idle. Set when a rebuild task begins and cleared when it ends. |
| collection_indexed_only_excluded_points | Gauge | collection | Number of points excluded from the indexed-only view (for example, deleted or hidden points). |

Rebuild metrics

These metrics track index rebuild operations across all collections.
| Metric | Type | Labels | Description |
|---|---|---|---|
| rebuild_running | Gauge | — | 1 if at least one rebuild is in progress, 0 otherwise. Reset to 0 when the last active rebuild finishes. |
| rebuild_triggered_total | Counter | — | Cumulative count of rebuild tasks submitted. Incremented each time a rebuild request is accepted. |
| rebuild_success_total | Counter | — | Cumulative count of rebuilds that completed successfully. |
| rebuild_failed_total | Counter | — | Cumulative count of rebuilds that failed or were cancelled. |
| rebuild_duration_seconds | Histogram | — | Rebuild durations, measured from start to finish and recorded in predefined time buckets. |
| rebuild_vectors_processed_total | Counter | — | Total vectors processed across all rebuilds (read or written). |
| rebuild_vectors_skipped_total | Counter | — | Total vectors skipped during rebuilds because they were already up to date. |
| rebuild_vectors_deleted_total | Counter | — | Total vectors deleted as part of rebuilds. |
| rebuild_phase_duration_seconds | Histogram | phase | Duration of individual rebuild phases (for example, initialization, population, finalization). |
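Because rebuild_duration_seconds is a histogram, percentile rebuild times can be computed the same way as request latencies, assuming the conventional _bucket series that Prometheus histograms expose:

```promql
# p95 rebuild duration over the last hour
histogram_quantile(0.95, sum by (le) (rate(rebuild_duration_seconds_bucket[1h])))
```

Adding phase to the by clause and switching to rebuild_phase_duration_seconds_bucket breaks the same figure down per rebuild phase.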

Snapshot metrics

These metrics track snapshot creation and recovery operations.
| Metric | Type | Labels | Description |
|---|---|---|---|
| snapshot_creation_running | Gauge | collection | 1 if a snapshot creation is in progress for a collection, 0 if idle. |
| snapshot_recovery_running | Gauge | collection | 1 if a snapshot recovery is in progress for a collection, 0 if idle. |
| snapshot_created_total | Counter | — | Cumulative count of successful snapshot creations. |

REST API metrics

These metrics track HTTP request volume and latency across all REST endpoints.
| Metric | Type | Labels | Description |
|---|---|---|---|
| rest_responses_total | Counter | endpoint, method, status | Total number of REST responses. Increased for every response the server sends. |
| rest_responses_fail_total | Counter | endpoint, method, status | REST responses that returned a non-2xx status. |
| rest_responses_duration_seconds | Histogram | endpoint, method, status | REST request latency measured from request arrival to response. |
Use rest_responses_total to track request rates and error ratios. Use rest_responses_duration_seconds to compute percentile latencies (p50, p95, p99) per endpoint.

gRPC API metrics

These metrics track gRPC call volume and latency.
| Metric | Type | Labels | Description |
|---|---|---|---|
| grpc_responses_total | Counter | method, status | Total number of gRPC responses. Increased for every completed RPC call. |
| grpc_responses_fail_total | Counter | method, status | gRPC responses that finished with an error status. |
| grpc_responses_duration_seconds | Histogram | method, status | gRPC call latency measured from call start to final status. |

Memory metrics

These metrics report on memory usage from the allocator and the operating system.
| Metric | Type | Description |
|---|---|---|
| memory_active_bytes | Gauge | Current active memory usage reported by the allocator. |
| memory_resident_bytes | Gauge | Resident set size (RSS) of the process, obtained from the OS. |
| memory_allocated_bytes | Gauge | Total memory allocated by the allocator, including memory that has been freed. |
| memory_metadata_bytes | Gauge | Memory used by the allocator for its own bookkeeping structures. |
| memory_retained_bytes | Gauge | Memory retained by the allocator but not currently in use. |
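Since memory_retained_bytes counts memory the allocator holds but is not actively using, comparing it to the resident set gives a rough view of allocator overhead. This ratio is an interpretation, not an official formula:

```promql
# Fraction of RSS held by the allocator but not in active use
memory_retained_bytes / memory_resident_bytes
```

A persistently high ratio suggests the allocator is hoarding freed memory rather than returning it to the OS.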

Process metrics

These metrics report on the operating-system-level health of the VectorAI DB process.
| Metric | Type | Description |
|---|---|---|
| process_threads | Gauge | Number of threads currently running in the process. |
| process_open_fds | Gauge | Number of open file descriptors held by the process. |
| process_open_mmaps | Gauge | Number of memory-mapped regions owned by the process. |
| process_minor_page_faults_total | Counter | Cumulative minor page faults since process start. |
| process_major_page_faults_total | Counter | Cumulative major page faults since process start; each major fault requires disk I/O. |
| process_cpu_seconds_total | Counter | Total CPU time consumed (user + kernel), in seconds. |
A sustained increase in process_major_page_faults_total indicates the system is running low on physical memory and paging to disk, which severely degrades search performance. Consider increasing available memory or reducing the number of loaded collections.
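For context alongside the page-fault counters, the CPU counter can be turned into a utilization figure with the standard rate() idiom:

```promql
# Average number of CPU cores consumed over the last 5 minutes
rate(process_cpu_seconds_total[5m])
```

A value of 2.0, for example, means the process averaged two full cores of CPU time over the window.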

Example PromQL queries

The following queries demonstrate common monitoring patterns you can use in Grafana or any Prometheus-compatible dashboard tool.

REST request rate by endpoint

sum by (endpoint) (rate(rest_responses_total[5m]))

REST error ratio

sum(rate(rest_responses_fail_total[5m]))
/
sum(rate(rest_responses_total[5m]))

REST p95 latency per endpoint

histogram_quantile(0.95, sum by (le, endpoint) (rate(rest_responses_duration_seconds_bucket[5m])))

gRPC request rate by method

sum by (method) (rate(grpc_responses_total[5m]))

gRPC error ratio

sum(rate(grpc_responses_fail_total[5m]))
/
sum(rate(grpc_responses_total[5m]))

Memory usage

memory_resident_bytes

Total vectors across all collections

collections_vector_total

Points per collection

collection_points

Active rebuilds

rebuild_running

Rebuild success rate

sum(rate(rebuild_success_total[1h]))
/
sum(rate(rebuild_triggered_total[1h]))
Suggested alerts

The following table lists suggested Prometheus alerting rules for production deployments.

| Alert | Condition | Severity | Description |
|---|---|---|---|
| High REST error rate | sum(rate(rest_responses_fail_total[5m])) / sum(rate(rest_responses_total[5m])) > 0.05 | Warning | More than 5% of REST requests failing |
| High REST p95 latency | histogram_quantile(0.95, sum by (le) (rate(rest_responses_duration_seconds_bucket[5m]))) > 2 | Warning | REST p95 latency exceeds 2 seconds |
| High gRPC error rate | sum(rate(grpc_responses_fail_total[5m])) / sum(rate(grpc_responses_total[5m])) > 0.05 | Warning | More than 5% of gRPC calls failing |
| Recovery mode active | app_status_recovery_mode == 1 | Critical | Engine is in recovery mode |
| High memory usage | memory_resident_bytes > 0.8 * <memory_limit> | Warning | RSS exceeds 80% of available memory |
| Major page faults rising | rate(process_major_page_faults_total[5m]) > 10 | Warning | Sustained major page faults indicate memory pressure |
| File descriptor exhaustion | process_open_fds > 0.8 * <fd_limit> | Warning | Open file descriptors approaching system limit |
| Rebuild failures | rate(rebuild_failed_total[1h]) > 0 | Warning | One or more index rebuilds have failed |
Replace <memory_limit> and <fd_limit> with the actual limits for your deployment environment.

Example alerting rule

The following Prometheus alerting rule fires when the REST error ratio exceeds 5% for more than 5 minutes:
groups:
  - name: vectorai
    rules:
      - alert: VectorAIHighErrorRate
        expr: >
          sum(rate(rest_responses_fail_total[5m]))
          /
          sum(rate(rest_responses_total[5m]))
          > 0.05
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "VectorAI DB error rate above 5%"
          description: "{{ $value | humanizePercentage }} of requests are returning errors."

Logging

VectorAI DB writes structured logs to stdout. Configure the log format and level to suit your log aggregation pipeline.

Log format

Set the log format to json for machine-readable output compatible with log aggregation tools such as Elasticsearch, Loki, or Datadog:
logging:
  format: json
The default format is text, which is human-readable but harder to parse programmatically.

Log level

Control log verbosity with the level setting:
logging:
  level: info
| Level | Use case |
|---|---|
| error | Production — only errors |
| warn | Production — errors and warnings |
| info | Production default — normal operational messages |
| debug | Troubleshooting — verbose output |
| trace | Development only — extremely verbose |
Running at debug or trace level in production generates significant log volume and may impact performance. Use these levels only for short-term troubleshooting.
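The format and level settings described above live under the same logging block, so a typical production configuration combines them:

```yaml
logging:
  format: json
  level: info
```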

Next steps

Explore these related guides to learn more.

Troubleshooting

Diagnose connection, performance, and startup issues.

Error handling

Handle specific gRPC error codes in your application code.

Docker installation

Container setup, volume mounts, and Docker Compose configuration.

License and upgrade

Manage license keys and upgrade your VectorAI DB deployment.