Configuring concurrency
- Soft versus hard concurrency limits
  - Soft limit
  - Hard limit
- Target utilization

Concurrency determines the number of simultaneous requests that can be processed by each replica of an application at any given time.

For per-revision concurrency, you must configure both autoscaling.knative.dev/metric and autoscaling.knative.dev/target for a soft limit, or containerConcurrency for a hard limit.
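As a minimal sketch of the per-revision soft limit, the following Service sets both annotations together. The metric value "concurrency" and the target of "200" are illustrative assumptions rather than values prescribed by this section; the Service name and image are borrowed from the other examples on this page.

```yaml
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: helloworld-go
  namespace: default
spec:
  template:
    metadata:
      annotations:
        # Assumed metric value: scale on concurrent requests per replica.
        autoscaling.knative.dev/metric: "concurrency"
        # Illustrative soft limit of 200 concurrent requests per replica.
        autoscaling.knative.dev/target: "200"
    spec:
      containers:
        - image: gcr.io/knative-samples/helloworld-go
```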

For global concurrency, you can set the container-concurrency-target-default value in the config-autoscaler ConfigMap. This value is the global soft limit.

Soft versus hard concurrency limits

You can set a soft concurrency limit, a hard concurrency limit, or both.

NOTE: If both a soft and a hard limit are specified, the smaller of the two values will be used. This prevents the Autoscaler from having a target value that is not permitted by the hard limit value.
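To illustrate the note above, here is a sketch of a Service that specifies both limits. The values 200 and 50 are arbitrary choices for illustration; the Service name and image are taken from the other examples on this page.

```yaml
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: helloworld-go
  namespace: default
spec:
  template:
    metadata:
      annotations:
        # Soft limit: 200 concurrent requests per replica.
        autoscaling.knative.dev/target: "200"
    spec:
      # Hard limit: 50 concurrent requests per replica.
      # Per the note above, the smaller value wins: min(200, 50) = 50.
      containerConcurrency: 50
      containers:
        - image: gcr.io/knative-samples/helloworld-go
```

With this configuration, the soft limit of 200 has no effect, because the Autoscaler cannot target more than the hard limit permits.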

The soft limit is a targeted limit rather than a strictly enforced bound. In some situations, particularly if there is a sudden burst of requests, this value can be exceeded.

The hard limit is an enforced upper bound. If concurrency reaches the hard limit, surplus requests will be buffered and must wait until enough capacity is free to execute the requests.

IMPORTANT: Using a hard limit configuration is only recommended if there is a clear use case for it with your application. Having a low hard limit specified may have a negative impact on the throughput and latency of an application, and may cause additional cold starts.

Soft limit

Global key: container-concurrency-target-default
Per-revision annotation key: autoscaling.knative.dev/target
Possible values: An integer.
Default: "100"

Example:

Per Revision:

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: helloworld-go
  namespace: default
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/target: "200"

Global (ConfigMap):

apiVersion: v1
kind: ConfigMap
metadata:
  name: config-autoscaler
  namespace: knative-serving
data:
  container-concurrency-target-default: "200"

Global (Operator):

apiVersion: operator.knative.dev/v1alpha1
kind: KnativeServing
metadata:
  name: knative-serving
spec:
  config:
    autoscaler:
      container-concurrency-target-default: "200"

Hard limit

The hard limit is specified per Revision using the containerConcurrency field on the Revision spec. This setting is not an annotation.

There is no global setting for the hard limit in the autoscaling ConfigMap, because containerConcurrency has implications outside of autoscaling, such as on buffering and queuing of requests. However, a default value can be set for the Revision’s containerConcurrency field in config-defaults.yaml.

The default value is 0, meaning that there is no limit on the number of requests that are allowed to flow into the revision.
A value greater than 0 specifies the exact number of requests that are allowed to flow to the replica at any one time.

Global key: container-concurrency (in config-defaults.yaml)
Per-revision spec key: containerConcurrency
Possible values: integer
Default: 0, meaning no limit

Example:

Per Revision:

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: helloworld-go
  namespace: default
spec:
  template:
    spec:
      containerConcurrency: 50

Global (Defaults ConfigMap):

apiVersion: v1
kind: ConfigMap
metadata:
  name: config-defaults
  namespace: knative-serving
data:
  container-concurrency: "50"

Global (Operator):

apiVersion: operator.knative.dev/v1alpha1
kind: KnativeServing
metadata:
  name: knative-serving
spec:
  config:
    defaults:
      container-concurrency: "50"

Target utilization

In addition to the literal settings explained previously, concurrency values can be further adjusted by using a target utilization value.

This value specifies what percentage of the previously specified target should actually be targeted by the Autoscaler. This is also known as specifying the hotness at which a replica runs, which causes the Autoscaler to scale up before the defined hard limit is reached.

For example, if containerConcurrency is set to 10, and the target utilization value is set to 70 (percent), the Autoscaler will create a new replica when the average number of concurrent requests across all existing replicas reaches 7. Requests numbered 7 to 10 will still be sent to the existing replicas, but this allows for additional replicas to be started in anticipation of being needed when the containerConcurrency limit is reached.
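The scenario just described could be written as the following sketch. The Service name and image are reused from the other examples on this page; the arithmetic in the comments restates the example above and is not an additional setting.

```yaml
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: helloworld-go
  namespace: default
spec:
  template:
    metadata:
      annotations:
        # Target 70% of the configured concurrency.
        autoscaling.knative.dev/targetUtilizationPercentage: "70"
    spec:
      # Hard limit of 10 concurrent requests per replica.
      # Effective scaling target: 10 * 70 / 100 = 7 requests per replica.
      containerConcurrency: 10
      containers:
        - image: gcr.io/knative-samples/helloworld-go
```

The gap between the scaling target of 7 and the hard limit of 10 is the headroom each replica keeps while new replicas start.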

Global key: container-concurrency-target-percentage
Per-revision annotation key: autoscaling.knative.dev/targetUtilizationPercentage
Possible values: float
Default: 70

Example:

Per Revision:

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: helloworld-go
  namespace: default
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/targetUtilizationPercentage: "80"
    spec:
      containers:
        - image: gcr.io/knative-samples/helloworld-go

Global (ConfigMap):

apiVersion: v1
kind: ConfigMap
metadata:
  name: config-autoscaler
  namespace: knative-serving
data:
  container-concurrency-target-percentage: "80"

Global (Operator):

apiVersion: operator.knative.dev/v1alpha1
kind: KnativeServing
metadata:
  name: knative-serving
spec:
  config:
    autoscaler:
      container-concurrency-target-percentage: "80"