Loki Canary

Loki Canary is a standalone app that audits the log-capturing performance of Loki.

How it works

[Figure: block diagram]

Loki Canary writes a log to a file and stores the timestamp in an internal array. The contents look something like this:

    1557935669096040040 ppppppppppppppppppppppppppppppppppppppppppppppppppppppppppp

The relevant part of the log entry is the timestamp; the ps are just filler bytes to make the size of the log configurable.
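The log line format described above can be sketched in a few lines of Python. This is only an illustration, not the canary's actual implementation: the function names `build_canary_line` and `parse_canary_timestamp` are hypothetical, and the exact accounting for the separator and newline bytes is an assumption.

```python
import time
from typing import Optional


def build_canary_line(size: int, timestamp_ns: Optional[int] = None) -> str:
    """Build '<timestamp> <padding>' so the whole line is about `size` bytes.

    Assumption: one byte is reserved for the separating space and one for
    the trailing newline; the real canary may count bytes differently.
    """
    if timestamp_ns is None:
        timestamp_ns = time.time_ns()
    ts = str(timestamp_ns)
    padding = max(size - len(ts) - 2, 0)
    return f"{ts} {'p' * padding}\n"


def parse_canary_timestamp(line: str) -> int:
    """Extract the nanosecond timestamp; the filler bytes are ignored."""
    return int(line.split(" ", 1)[0])


if __name__ == "__main__":
    line = build_canary_line(100, 1557935669096040040)
    print(len(line), parse_canary_timestamp(line))
```

Because only the leading timestamp is ever parsed, the padding length can change freely without affecting how entries are matched.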

An agent (like Promtail) should be configured to read the log file and ship it to Loki.

Meanwhile, Loki Canary will open a WebSocket connection to Loki and will tail the logs it creates. When a log is received on the WebSocket, the timestamp in the log message is compared to the internal array.

If the received log is:

  • The next in the array to be received, it is removed from the array and the (current time - log timestamp) is recorded in the response_latency histogram. This is the expected behavior for well-behaved logs.
  • Not the next in the array to be received, it is removed from the array, the response time is recorded in the response_latency histogram, and the out_of_order_entries counter is incremented.
  • Not in the array at all, it is checked against a separate list of received logs to either increment the duplicate_entries counter or the unexpected_entries counter.

In the background, Loki Canary also runs a timer which iterates through all of the entries in the internal array. If any of the entries are older than the duration specified by the -wait flag (defaulting to 60s), they are removed from the array and the websocket_missing_entries counter is incremented. An additional query is then made directly to Loki for any missing entries to determine if they are truly missing or only missing from the WebSocket. If missing entries are not found in the direct query, the missing_entries counter is incremented.
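The comparison and pruning rules above can be modeled as a small in-memory sketch. The counter names come from the text; everything else (the `CanaryComparator` class, its method names, the `confirm_in_loki` callback standing in for the direct query, and the plain list standing in for the response_latency histogram) is hypothetical and is not the canary's real code.

```python
from collections import Counter
from typing import Callable, List


class CanaryComparator:
    """Toy model of the sent/received bookkeeping described above."""

    def __init__(self, wait_ns: int) -> None:
        self.wait_ns = wait_ns
        self.pending: List[int] = []       # sent timestamps, oldest first
        self.received = set()              # timestamps already matched
        self.latencies_ns: List[int] = []  # stand-in for response_latency
        self.counters = Counter()

    def entry_sent(self, ts: int) -> None:
        self.pending.append(ts)

    def entry_received(self, ts: int, now_ns: int) -> None:
        if ts in self.pending:
            # Anything other than the oldest pending entry is out of order.
            if self.pending[0] != ts:
                self.counters["out_of_order_entries"] += 1
            self.pending.remove(ts)
            self.received.add(ts)
            self.latencies_ns.append(now_ns - ts)
        elif ts in self.received:
            self.counters["duplicate_entries"] += 1
        else:
            self.counters["unexpected_entries"] += 1

    def prune(self, now_ns: int, confirm_in_loki: Callable[[int], bool]) -> List[int]:
        """Drop entries older than the wait duration and double-check them."""
        expired = [ts for ts in self.pending if now_ns - ts > self.wait_ns]
        for ts in expired:
            self.pending.remove(ts)
            self.counters["websocket_missing_entries"] += 1
            # Hypothetical direct query: True if Loki does have the entry.
            if not confirm_in_loki(ts):
                self.counters["missing_entries"] += 1
        return expired
```

In this model, an entry that arrives after it has already been pruned falls through to the unexpected_entries branch, which fits the later advice to raise -wait when that counter is high.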

Installation

Binary

Loki Canary is provided as a pre-compiled binary as part of the Loki Releases on GitHub.

Docker

Loki Canary is also provided as a Docker container image:

    # change tag to the most recent release
    $ docker pull grafana/loki-canary:v0.2.0

Kubernetes

To run on Kubernetes, you can do something simple like:

kubectl run loki-canary --generator=run-pod/v1 --image=grafana/loki-canary:latest --restart=Never --image-pull-policy=IfNotPresent --labels=name=loki-canary -- -addr=loki:3100

Or you can do something more complex, like deploying it as a DaemonSet. There is a Tanka setup for this in the production folder, which you can import using jsonnet-bundler:

    jb install github.com/grafana/loki-canary/production/ksonnet/loki-canary

Then in your Tanka environment’s main.jsonnet you’ll want something like this:

    local loki_canary = import 'loki-canary/loki-canary.libsonnet';

    loki_canary {
      loki_canary_args+:: {
        addr: "loki:3100",
        port: 80,
        labelname: "instance",
        interval: "100ms",
        size: 1024,
        wait: "3m",
      },
      _config+:: {
        namespace: "default",
      }
    }

Examples

Standalone Pod Implementation of loki-canary

    ---
    apiVersion: v1
    kind: Pod
    metadata:
      labels:
        app: loki-canary
        name: loki-canary
      name: loki-canary
    spec:
      containers:
      - args:
        - -addr=loki:3100
        image: grafana/loki-canary:latest
        imagePullPolicy: IfNotPresent
        name: loki-canary
        resources: {}
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: loki-canary
      labels:
        app: loki-canary
    spec:
      type: ClusterIP
      selector:
        app: loki-canary
      ports:
      - name: metrics
        protocol: TCP
        port: 3500
        targetPort: 3500

DaemonSet Implementation of loki-canary

    ---
    kind: DaemonSet
    apiVersion: extensions/v1beta1
    metadata:
      labels:
        app: loki-canary
        name: loki-canary
      name: loki-canary
    spec:
      template:
        metadata:
          name: loki-canary
          labels:
            app: loki-canary
        spec:
          containers:
          - args:
            - -addr=loki:3100
            image: grafana/loki-canary:latest
            imagePullPolicy: IfNotPresent
            name: loki-canary
            resources: {}
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: loki-canary
      labels:
        app: loki-canary
    spec:
      type: ClusterIP
      selector:
        app: loki-canary
      ports:
      - name: metrics
        protocol: TCP
        port: 3500
        targetPort: 3500

From Source

If the other options are not sufficient for your use case, you can compile loki-canary yourself:

    # clone the source tree
    $ git clone https://github.com/grafana/loki
    # enter the cloned repository
    $ cd loki
    # build the binary
    $ make loki-canary
    # (optionally build the container image)
    $ make loki-canary-image

Configuration

The address of Loki must be passed in with the -addr flag, and if your Loki server uses TLS, -tls=true must also be provided. Note that using TLS will cause the WebSocket connection to use wss:// instead of ws://.

The -labelname and -labelvalue flags should also be provided, as these are used by Loki Canary to filter the log stream to only process logs for the current instance of the canary. Ensure that the values provided to the flags are unique to each instance of Loki Canary. Grafana Labs’ Tanka config accomplishes this by passing in the pod name as the label value.
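To make the effect of these flags concrete, here is a rough sketch of how -addr, -tls, -labelname and -labelvalue could combine into a WebSocket tail URL. The helper `build_tail_url` is hypothetical, and the default path `/loki/api/v1/tail` is an assumption based on Loki's HTTP API; the path used by any given canary release may differ.

```python
from urllib.parse import urlencode


def build_tail_url(addr: str, labelname: str, labelvalue: str,
                   tls: bool = False, path: str = "/loki/api/v1/tail") -> str:
    """Compose a tail URL whose selector matches only this canary instance."""
    # TLS switches the WebSocket scheme from ws:// to wss://.
    scheme = "wss" if tls else "ws"
    # Log stream selector of the form {labelname="labelvalue"}.
    selector = f'{{{labelname}="{labelvalue}"}}'
    # `path` and the `query` parameter name are assumptions, see above.
    return f"{scheme}://{addr}{path}?{urlencode({'query': selector})}"


if __name__ == "__main__":
    print(build_tail_url("loki:3100", "name", "loki-canary"))
    print(build_tail_url("loki:3100", "name", "loki-canary", tls=True))
```

If two canaries shared the same label value, both would tail the same stream and see each other's entries, which is why the value must be unique per instance.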

If Loki Canary reports a high number of unexpected_entries, Loki Canary may not be waiting long enough and the value for the -wait flag should be increased to a larger value than 60s.

Be aware of the relationship between pruneinterval and the interval. For example, with an interval of 10ms (100 logs per second) and a prune interval of 60s, you will write 6000 logs per minute. If those logs were not received over the WebSocket, the canary will attempt to query Loki directly to see if they are completely lost. However, the query return is limited to 1000 results, so you will not be able to return all the logs even if they did make it to Loki.

Likewise, if you lower the pruneinterval, you risk causing a denial of service attack on Loki, as all your canaries attempt to query for missing logs at whatever frequency your pruneinterval defines.
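The arithmetic behind this trade-off is easy to check before deploying. The sketch below reproduces the example from the text; the function names are hypothetical, and the 1000-result limit is the figure quoted above rather than something read from Loki's configuration.

```python
def entries_per_prune(interval_ms: int, pruneinterval_ms: int) -> int:
    """Number of log entries written during one prune interval."""
    return pruneinterval_ms // interval_ms


def exceeds_query_limit(interval_ms: int, pruneinterval_ms: int,
                        limit: int = 1000) -> bool:
    """True if a single direct query could not return every written entry."""
    return entries_per_prune(interval_ms, pruneinterval_ms) > limit


if __name__ == "__main__":
    # 10ms interval with a 60s prune interval, as in the example above.
    print(entries_per_prune(10, 60_000), exceeds_query_limit(10, 60_000))
    # Default flags: 1s interval with a 1m prune interval.
    print(entries_per_prune(1_000, 60_000), exceeds_query_limit(1_000, 60_000))
```

With the default flags, only 60 entries are written per prune interval, well under the limit; the problem appears only when the interval is shortened without shortening pruneinterval to match.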

All options:

    -addr string
        The Loki server URL:Port, e.g. loki:3100
    -buckets int
        Number of buckets in the response_latency histogram (default 10)
    -interval duration
        Duration between log entries (default 1s)
    -labelname string
        The label name for this instance of loki-canary to use in the log selector (default "name")
    -labelvalue string
        The unique label value for this instance of loki-canary to use in the log selector (default "loki-canary")
    -pass string
        Loki password
    -port int
        Port which loki-canary should expose metrics (default 3500)
    -pruneinterval duration
        Frequency to check sent vs received logs, also the frequency which queries for missing logs will be dispatched to loki (default 1m0s)
    -size int
        Size in bytes of each log line (default 100)
    -tls
        Does the loki connection use TLS?
    -user string
        Loki username
    -wait duration
        Duration to wait for log entries before reporting them lost (default 1m0s)