Version: 0.33.0

Operations Enterprise

caution

This feature is in alpha and certain aspects will change

We're very excited for people to use this feature. However, please note that changes in the API, behaviour and security will evolve. The feature is suitable to use in controlled testing environments.

As platform engineer you could need to have a finer understanding on the underlying logic for Explorer. The following options are available to you to operate and troubleshoot it.

Debug Access Rules

It is a debugging tool to make visible explorer authorization logic. You could find it as tab Access Rules alongside the Query tab.

access rules

You could discover by Cluster and Subject the Kinds it is allowed to read. These are the rules that will be the source of truth doing authorization when a user does a query.

Monitoring

Explorer provides the following telemetry to use for operations.

Metrics

Explorer exports Prometheus metrics. Configuration happens during releasing as shown below.

---
apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: weave-gitops-enterprise
  namespace: flux-system
spec:
  values:
    #### Metrics - Prometheus metrics configuration
    metrics:
      # Enables metrics generation and prometheus endpoint
      enabled: true
      service:
        # -- Port to start the metrics exporter on
        port: 8080
        # -- Annotations to set on the service
        annotations:
          prometheus.io/scrape: "true"
          prometheus.io/path: "/metrics"
          prometheus.io/port: "{{ .Values.metrics.service.port }}"

Querying

Explorer querying path is composed of three components exporting metrics:

API server
Datastore Reads
Indexer Reads

API Server

Based on go-http-metrics, the following metrics are generated.

Request Duration: histogram with the latency of the HTTP requests.

http_request_duration_seconds_bucket{handler="/v1/query",method="POST",le="0.05"} 0
http_request_duration_seconds_sum{handler="/v1/query",method="POST"} 10.088081923
http_request_duration_seconds_count{handler="/v1/query",method="POST"} 51

Response Size: histogram with the size of the HTTP responses in bytes

http_response_size_bytes_bucket{handler="/v1/query",method="POST",le="0.05"} 10
http_response_size_bytes_sum{handler="/v1/query",method="POST"} 120
http_response_size_bytes_count{handler="/v1/query",method="POST"} 10

Requests In Flight: gauge with the number of inflight requests being handled at the same time.

http_requests_inflight{handler="/v1/query"} 0

Datastore Reads

Request Latency: histogram with the latency of the datastore read requests.

action is the datastore read operation that could be either GetObjects, GetAccessRules, GetObjectByID, GetRoles or GetRoleBindings.
status is the result of the operation. It could be either read operation that could be either success or error.

datastore_latency_seconds_bucket{action="GetObjectByID", le="+Inf", status="success"} 1175
datastore_latency_seconds_bucket{action="GetObjectByID", le="0.01", status="success"} 1174

datastore_latency_seconds_count{action="GetObjectByID",  status="success"} 1175
datastore_latency_seconds_count{action="GetRoleBindings",  status="success"} 47
datastore_latency_seconds_count{action="GetRoles",  status="success"} 47

datastore_latency_seconds_sum{action="GetObjectByID",  status="success"} 0.6924557999999995
datastore_latency_seconds_sum{action="GetRoleBindings",  status="success"} 1.329158916
datastore_latency_seconds_sum{action="GetRoles",  status="success"} 3.942473879999999

Requests In Flight: gauge with the number of inflight requests being handled at the same time.

action is the datastore read operation that could be either GetObjects, GetAccessRules, GetObjectByID, GetRoles or GetRoleBindings

datastore_inflight_requests{action="GetObjectByID"} 0
datastore_inflight_requests{action="GetRoleBindings"} 0
datastore_inflight_requests{action="GetRoles"} 0

Indexer Reads

Request Latency: histogram with the latency of the indexer read requests.

action is the index read operation that could be either ListFacets or Search
status is the result of the operation. It could be either read operation that could be either success or error

indexer_latency_seconds_bucket{action="ListFacets", le="+Inf", status="success"} 1
indexer_latency_seconds_bucket{action="Search", le="+Inf", status="success"} 47

indexer_latency_seconds_sum{action="ListFacets", status="success"} 0.008928666
indexer_latency_seconds_sum{action="Search", status="success"} 0.06231312599999999

indexer_latency_seconds_count{action="ListFacets", status="success"} 1
indexer_latency_seconds_count{action="Search", status="success"} 47

Requests In Flight: gauge with the number of inflight requests being handled at the same time.

action is the index read operation that could be either ListFacets or Search

indexer_inflight_requests{action="ListFacets"} 0
indexer_inflight_requests{action="Search"} 0

Collecting

Explorer collecting path is composed of three components exporting metrics:

Cluster Watcher Manager
Datastore Writes
Indexer Writes

The following metrics are available to monitor its health.

Cluster Watcher

The metric collector_cluster_watcher provides the number of the cluster watchers it the following status:

Starting: a cluster watcher is starting at the back of detecting that a new cluster has been registered.
Started: cluster watcher has been started and collecting events from the remote cluster. This is the stable state.
Stopping: a cluster has been deregistered so its cluster watcher is no longer required. In the process of stopping it.
Failed: a cluster watcher has failed during the creation or starting process and cannot collect events from the remote clusters. This is the unstable state.

Where collector is the type of collector, it could be

rbac: for collecting RBAC resources (ie roles)
objects: for collecting non-rbac resources (ie kustomizations)

collector_cluster_watcher{collector="objects", status="started"} 1
collector_cluster_watcher{collector="objects", status="starting"} 0
collector_cluster_watcher{collector="rbac", status="started"} 1
collector_cluster_watcher{collector="rbac", status="starting"} 0

A sum on collector_cluster_watcher gives the total number of cluster watchers that should be equal to the number of clusters

Datastore Writes

Request Latency: histogram with the latency of the datastore write requests.

action is the datastore write operation that could be either StoreRoles, StoreRoleBindings, StoreObjects, DeleteObjects, DeleteAllObjects, DeleteRoles, DeleteAllRoles, DeleteRoleBindings, DeleteAllRoleBindings
status is the result of the operation. It could be either read operation that could be either success or error

datastore_latency_seconds_bucket{action="StoreRoles", le="+Inf", status="success"} 1175
datastore_latency_seconds_bucket{action="StoreRoles", le="0.01", status="success"} 1174

datastore_latency_seconds_count{action="StoreRoles",  status="success"} 1175
datastore_latency_seconds_count{action="DeleteRoles",  status="success"} 47
datastore_latency_seconds_count{action="DeleteAllRoleBindings",  status="success"} 47

datastore_latency_seconds_sum{action="StoreRoles",  status="success"} 0.6924557999999995
datastore_latency_seconds_sum{action="DeleteRoles",  status="success"} 1.329158916
datastore_latency_seconds_sum{action="DeleteAllRoleBindings",  status="success"} 3.942473879999999

Requests In Flight: gauge with the number of inflight write requests being handled at the same time.

action is the datastore write operation that could be either StoreRoles, StoreRoleBindings, StoreObjects, DeleteObjects, DeleteAllObjects, DeleteRoles, DeleteAllRoles, DeleteRoleBindings, DeleteAllRoleBindings

datastore_inflight_requests{action="StoreRoles"} 0
datastore_inflight_requests{action="StoreRoleBindings"} 0
datastore_inflight_requests{action="DeleteAllRoleBindings"} 0

Indexer Writes

Request Latency: histogram with the latency of the indexer write requests.

action is the index write operation that could be either Add, Remove or RemoveByQuery
status is the result of the operation. It could be either success or error

indexer_latency_seconds_bucket{action="Add",status="success",le="+Inf"} 109
indexer_latency_seconds_bucket{action="Remove",status="success",le="+Inf"} 3

indexer_latency_seconds_sum{action="Add",status="success"} 8.393912168
indexer_latency_seconds_sum{action="Remove",status="success"} 0.012298476

indexer_latency_seconds_count{action="Add",status="success"} 109
indexer_latency_seconds_count{action="Remove",status="success"} 3

Requests In Flight: gauge with the number of inflight requests being handled at the same time.

action is the index write operation that could be either Add, Remove or RemoveByQuery

indexer_inflight_requests{action="Add"} 0
indexer_inflight_requests{action="Remove"} 0

Dashboard

You could leverage this grafana dashboard in Grafana to monitor its golden signals

explorer

Operations Enterprise

Debug Access Rules​

Monitoring​

Metrics​

Querying​

API Server​

Datastore Reads​

Indexer Reads​

Collecting​

Cluster Watcher​

Datastore Writes​

Indexer Writes​

Dashboard​

Debug Access Rules

Monitoring

Metrics

Querying

API Server

Datastore Reads

Indexer Reads

Collecting

Cluster Watcher

Datastore Writes

Indexer Writes

Dashboard