Consul on AWS Elastic Container Service (ECS) Architecture
The following diagram shows the main components of the Consul architecture when deployed to an ECS cluster:
- Consul servers: A production-ready Consul server cluster.
- Application tasks: Each task runs user application containers along with two helper containers:
  - Consul client: The `consul-client` container runs Consul. The Consul client communicates with the Consul servers and configures the Envoy proxy sidecar. This communication is called control plane communication.
  - Sidecar proxy: The `sidecar-proxy` container runs Envoy. All requests to and from the application container(s) run through the sidecar proxy. This communication is called data plane communication.
- Mesh init: Each task runs a short-lived container, called `mesh-init`, which sets up initial configuration for Consul and Envoy.
- Health syncing: Optionally, an additional `health-sync` container can be included in a task to sync health statuses from ECS into Consul.
- ACL controller: The ACL controller automates configuration and cleanup in the Consul servers. It automatically configures the AWS IAM auth method and cleans up unused ACL tokens from Consul. When using Consul Enterprise namespaces, the ACL controller also automatically creates Consul namespaces for ECS tasks.
For more information about how Consul works in general, see Consul’s Architecture Overview.
Task Startup
This diagram shows the timeline of a task starting up and all its containers:
- T0: ECS starts the task. The `consul-client` and `mesh-init` containers start:
  - `consul-client` does the following:
    - If ACLs are enabled, a startup script runs a `consul login` command to obtain a token from the AWS IAM auth method for the Consul client. This token has `node:write` permissions.
    - It uses the `retry-join` option to join the Consul cluster.
  - `mesh-init` does the following:
    - If ACLs are enabled, `mesh-init` runs a `consul login` command to obtain a token from the AWS IAM auth method for the service registration. This token has `service:write` permissions for the service and its sidecar proxy. The token is written to a shared volume for use by the `health-sync` container.
    - It registers the service for the current task and its sidecar proxy with Consul.
    - It runs `consul connect envoy -bootstrap` to generate Envoy's bootstrap JSON file and writes it to a shared volume.
- T1: The following containers start:
  - `sidecar-proxy` starts using a custom entrypoint command, `consul-ecs envoy-entrypoint`. The entrypoint command starts Envoy by running `envoy -c <path-to-bootstrap-json>`.
  - `health-sync` starts if ECS health checks are defined or if ACLs are enabled. It syncs health checks from ECS to Consul (see ECS Health Check Syncing).
- T2: The `sidecar-proxy` container is marked as healthy by ECS. It uses a health check that detects whether its public listener port is open. At this point, your application containers start, since all Consul machinery is ready to service requests.
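One way to picture this ordering is through ECS container dependencies. The sketch below is an illustrative Python fragment of a task definition, not the real output of the `mesh-task` Terraform module; container names follow this page, and the exact conditions your deployment uses may differ:

```python
# Sketch: how the startup timeline above maps onto ECS "dependsOn"
# container dependencies. Names and structure are illustrative.

app_container = {
    "name": "user-app",
    "essential": True,
    "dependsOn": [
        # mesh-init must finish (T0) before the rest of the task is useful.
        {"containerName": "mesh-init", "condition": "SUCCESS"},
        # The app waits for the proxy's health check (T2) so the mesh can
        # service requests as soon as the application starts.
        {"containerName": "sidecar-proxy", "condition": "HEALTHY"},
    ],
}

sidecar_container = {
    "name": "sidecar-proxy",
    "dependsOn": [
        # Envoy needs the bootstrap JSON written by mesh-init.
        {"containerName": "mesh-init", "condition": "SUCCESS"},
    ],
}
```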
Task Shutdown
This diagram shows an example timeline of a task shutting down:
- T0: ECS sends a TERM signal to all containers. Each container reacts to the TERM signal:
  - `consul-client` begins to gracefully leave the Consul cluster.
  - `health-sync` stops syncing health status from ECS into Consul checks.
  - `sidecar-proxy` ignores the TERM signal and continues running until the `user-app` container exits. The custom entrypoint command, `consul-ecs envoy-entrypoint`, monitors the local ECS task metadata. It waits until the `user-app` container has exited before terminating Envoy. This enables the application to continue making outgoing requests through the proxy to the mesh for a graceful shutdown.
  - `user-app` exits if it is not configured to ignore the TERM signal; otherwise, it continues running.
- T1:
  - `health-sync` does the following:
    - It updates its Consul checks to critical status and exits. This ensures the service instance is marked unhealthy.
    - If ACLs are enabled, it runs `consul logout` for the two tokens created by the `consul-client` and `mesh-init` containers, which removes those tokens from Consul. If `consul logout` fails for some reason, the ACL controller removes the tokens after the task has stopped.
  - `sidecar-proxy` notices the `user-app` container has stopped and exits.
- T2: `consul-client` finishes gracefully leaving the Consul datacenter and exits.
- T3:
  - ECS notices all containers have exited, and will soon change the task status to `STOPPED`.
  - Updates about this task have reached the rest of the Consul cluster, so downstream proxies have been updated to stop sending traffic to this task.
- T4: At this point, task shutdown should be complete. Otherwise, ECS sends a KILL signal to any containers still running. The KILL signal cannot be ignored and forcefully stops containers, which can interrupt in-progress operations and cause errors.
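The wait-for-the-app behavior of `consul-ecs envoy-entrypoint` at T0/T1 can be sketched as a polling loop over ECS task metadata. This is a minimal illustration, not the real consul-ecs implementation; the function names and metadata shape are assumptions modeled on the ECS task metadata endpoint:

```python
import time

def app_has_exited(task_metadata: dict, app_name: str = "user-app") -> bool:
    """Return True once the application container reports STOPPED."""
    for container in task_metadata.get("Containers", []):
        if container.get("Name") == app_name:
            return container.get("KnownStatus") == "STOPPED"
    return False

def wait_then_stop_envoy(fetch_metadata, stop_envoy, poll_seconds: float = 1.0):
    """Ignore TERM conceptually: keep Envoy alive, polling task metadata,
    and terminate Envoy only after the application container exits."""
    while not app_has_exited(fetch_metadata()):
        time.sleep(poll_seconds)
    stop_envoy()
```

In the real container, `fetch_metadata` would query the local ECS task metadata endpoint and `stop_envoy` would signal the Envoy process.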
ACL Tokens
Two types of ACL tokens are required by ECS tasks:
- Client tokens: used by the `consul-client` containers to join the Consul cluster.
- Service tokens: used by sidecar containers for service registration and health syncing.
With Consul on ECS, these tokens are obtained dynamically when a task starts up by logging in via Consul’s AWS IAM auth method.
Consul Client Token
Consul client tokens require `node:write` permissions for any node name. This is necessary because Consul node names on ECS are not known until runtime.
Service Token
Service tokens are associated with a service identity. The service identity includes `service:write` permissions for the service and its sidecar proxy.
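For a concrete picture, a Consul service identity expands into ACL rules like the following. The helper below is a hypothetical illustration of that expansion, not part of consul-ecs; it follows Consul's general service identity rule template:

```python
def service_identity_rules(service: str) -> str:
    """Illustrative expansion of a Consul service identity into ACL rules:
    write on the service and its sidecar proxy (as described above), plus
    the read access for discovery that service identities also grant."""
    return "\n".join([
        f'service "{service}" {{ policy = "write" }}',
        f'service "{service}-sidecar-proxy" {{ policy = "write" }}',
        'service_prefix "" { policy = "read" }',
        'node_prefix "" { policy = "read" }',
    ])
```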
AWS IAM Auth Method
Consul's AWS IAM auth method is used by ECS tasks to automatically obtain Consul ACL tokens. When a service mesh task on ECS starts up, it runs two `consul login` commands to obtain a client token and a service token via the auth method. When the task stops, it attempts two `consul logout` commands in order to destroy these tokens.

During a `consul login`, the task's IAM role is presented to the AWS IAM auth method on the Consul servers. The role is validated with AWS. If the role is valid, and if the auth method trusts the IAM role, then the role is permitted to log in. A new Consul ACL token is created, and binding rules associate permissions with the newly created token. These permissions are mapped to the token based on the IAM role details. For example, tags on the IAM role specify the service name and the Consul Enterprise namespace to be associated with a service token created by a successful login to the auth method.
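The tag-to-token mapping can be sketched as a small pure function. The tag keys are the ones this page documents; the selection logic itself is illustrative, not the servers' actual binding-rule evaluation:

```python
# Sketch: how IAM role tags determine the attributes of the service token
# created by a successful consul login, per the description above.

SERVICE_TAG = "consul.hashicorp.com.service-name"
NAMESPACE_TAG = "consul.hashicorp.com.namespace"

def token_attributes(iam_role_tags: dict) -> dict:
    """Map IAM role tags to the service name (and, on Consul Enterprise,
    the namespace) bound to the resulting service token."""
    attrs = {"service": iam_role_tags[SERVICE_TAG]}
    if NAMESPACE_TAG in iam_role_tags:  # Consul Enterprise only
        attrs["namespace"] = iam_role_tags[NAMESPACE_TAG]
    return attrs
```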
Task IAM Role
The following configuration is required for the task IAM role in order to be compatible with the auth method. When using Terraform, the `mesh-task` module creates the task role with this configuration by default.

- A scoped `iam:GetRole` permission must be included on the IAM role, enabling the role to fetch details about itself.
- A `consul.hashicorp.com.service-name` tag on the IAM role must be set to the Consul service name.
- Enterprise: A `consul.hashicorp.com.namespace` tag must be set on the IAM role to the Consul Enterprise namespace of the Consul service for the task.

Task IAM roles typically should not be shared across task families. Because a task family represents a single Consul service, and because the task role must include the Consul service name, one task role is required for each task family when using the auth method.
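Putting these requirements together, a compatible task role looks roughly like the following. This is expressed as Python dicts purely for illustration; the role name and account ID are placeholders, and the real role is produced by the `mesh-task` Terraform module:

```python
# Illustrative shape of a task IAM role compatible with the auth method.
# All names and the account ID are placeholders.

task_role = {
    "RoleName": "my-service-task-role",
    "Path": "/consul-ecs/",  # matches the auth method's default bound ARN pattern
    "Tags": [
        {"Key": "consul.hashicorp.com.service-name", "Value": "my-service"},
        # Consul Enterprise only:
        {"Key": "consul.hashicorp.com.namespace", "Value": "default"},
    ],
}

# Scoped iam:GetRole permission so the role can fetch details about itself.
get_role_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": "iam:GetRole",
        "Resource": "arn:aws:iam::123456789012:role/consul-ecs/my-service-task-role",
    }],
}
```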
Security
The auth method relies on the configuration of AWS resources, such as IAM roles, IAM policies, and ECS tasks. If these AWS resources are misconfigured or if the account has loose access controls, then the security of your service mesh may be at risk.
Any entity in your AWS account with the ability to obtain credentials for an IAM role could potentially obtain a Consul ACL token and impersonate a Consul service. The `mesh-task` Terraform module mitigates this risk by creating the task role with an `AssumeRolePolicyDocument` that allows only the AWS ECS service to assume the task role. By default, other entities are unable to obtain credentials for task roles, and are unable to abuse the AWS IAM auth method to obtain Consul ACL tokens.
However, other entities in your AWS account with the ability to create or modify IAM roles can potentially circumvent this. For example, if they are able to create an IAM role with the correct tags, they can obtain a Consul ACL token for any service. Or, if they can pass a role to an ECS task and start an ECS task, they can use the task to obtain a Consul ACL token via the auth method.
The IAM policy actions `iam:CreateRole`, `iam:TagRole`, `iam:PassRole`, and `sts:AssumeRole` can be used to restrict these capabilities in your AWS account and improve security when using the AWS IAM auth method. See the AWS documentation to learn how to restrict these permissions in your AWS account.
ACL Controller
The ACL controller performs the following operations on the Consul servers:
- Configures the Consul AWS IAM auth method.
- Monitors tasks in the ECS cluster where the controller is running.
- Cleans up unused Consul ACL tokens created by tasks in this cluster.
- Enterprise: Manages Consul admin partitions and namespaces.
Auth Method Configuration
The ACL controller is responsible for configuring the AWS IAM auth method. The following resources are created by the ACL controller when it starts up:
- Client role: The controller creates the Consul (not IAM) role and policy used for client tokens, if they do not exist. This policy has `node:write` permissions so that Consul clients can join the Consul cluster.
- Auth method for client tokens: One instance of the AWS IAM auth method is created for client tokens, if it does not exist. A binding rule is configured that attaches the Consul client role to each token created during a successful login to this auth method instance.
- Auth method for service tokens: One instance of the AWS IAM auth method is created for service tokens, if it does not exist:
  - A binding rule is configured to attach a service identity to each token created during a successful login to this auth method instance. The service name for this service identity is taken from the `consul.hashicorp.com.service-name` tag on the IAM role used to log in.
  - Enterprise: A namespace binding rule is configured to create service tokens in the namespace specified by the `consul.hashicorp.com.namespace` tag on the IAM role used to log in.
The ACL controller configures both instances of the auth method to permit only certain IAM roles to log in, by setting the `BoundIAMPrincipalARNs` field of the AWS IAM auth method as follows:

- By default, the only IAM roles permitted to log in must have an ARN matching the pattern `arn:aws:iam::<ACCOUNT>:role/consul-ecs/*`. This permits only IAM roles at the role path `/consul-ecs/`, and only those in the same AWS account where the ACL controller is running.
- The role path can be changed by setting the `iam_role_path` input variable for the `mesh-task` and `acl-controller` modules, or by passing the `-iam-role-path` flag to the `consul-ecs acl-controller` command.
- Each instance of the auth method is shared by ACL controllers in the same Consul datacenter. Each controller updates the auth method, if necessary, to include additional entries in the `BoundIAMPrincipalARNs` list. This enables the use of the auth method with ECS clusters in different AWS accounts, for example. This does not apply when using Consul Enterprise admin partitions, because auth method instances are not shared by multiple controllers in that case.
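The effect of the default bound ARN pattern can be illustrated with simple glob matching. This is a sketch of the described behavior, not Consul's actual matching code; the account ID is a placeholder:

```python
from fnmatch import fnmatchcase

def login_permitted(role_arn: str, bound_patterns: list) -> bool:
    """An IAM role may log in only if its ARN matches a bound pattern."""
    return any(fnmatchcase(role_arn, p) for p in bound_patterns)

# Default: only roles at path /consul-ecs/ in the controller's own account.
default_bound = ["arn:aws:iam::123456789012:role/consul-ecs/*"]
```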
Task Monitoring
After startup, the ACL controller monitors tasks in the same ECS cluster where the ACL controller is running in order to discover newly running tasks and tasks that have stopped.
The ACL controller cleans up tokens created by `consul login` for tasks that are no longer running. Normally, each task attempts `consul logout` commands when the task stops to destroy its tokens. However, in unstable conditions the `consul logout` command may fail to clean up a token. The ACL controller runs continually to ensure those unused tokens are soon removed.
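The cleanup step reduces to a reconciliation between login tokens and running tasks. The data shapes below are illustrative (tokens keyed by accessor ID and associated with the task that created them), not the controller's actual internals:

```python
# Sketch of the controller's cleanup decision: any token whose creating
# ECS task is no longer running should be deleted.

def tokens_to_delete(login_tokens: dict, running_task_ids: set) -> set:
    """login_tokens: token accessor ID -> ECS task ID that created it.
    Returns the accessors of tokens whose task has stopped."""
    return {
        accessor
        for accessor, task_id in login_tokens.items()
        if task_id not in running_task_ids
    }
```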
Admin Partitions and Namespaces (Enterprise)
When admin partitions and namespaces are enabled, the ACL controller is assigned to its configured admin partition. Consul on ECS supports one ACL controller instance per ECS cluster, which results in an architecture with one admin partition per ECS cluster.
When admin partitions and namespaces are enabled, the ACL controller performs the following additional actions:
- At startup, creates its assigned admin partition if it does not exist.
- Inspects task tags for new ECS tasks to discover the task's intended partition and namespace. The ACL controller ignores tasks with a partition tag that does not match the controller's assigned partition.
- Creates namespaces when tasks start up. Namespaces are only created if they do not exist.
- Creates auth method instances for client and service tokens in the controller's assigned admin partition.
ECS Health Check Syncing
ECS health checks automatically sync with Consul health checks for all application containers that meet the following conditions:
- The container is marked as `essential`.
- The container has ECS `healthChecks`.
- The container is not configured with native Consul health checks.

The `mesh-init` container creates a TTL health check for every container that fits these criteria, and the `health-sync` container ensures that the ECS and Consul health checks remain in sync.
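The sync itself can be sketched as a status translation plus a per-container update. The mapping and the check ID naming below are illustrative assumptions, not the actual consul-ecs behavior, and `update_check` stands in for Consul's TTL check update API:

```python
# Sketch of ECS -> Consul health status syncing by the health-sync container.

def consul_check_status(ecs_health: str) -> str:
    """Translate an ECS container health status into a Consul check status.
    Anything other than HEALTHY is treated as critical (assumption)."""
    return "passing" if ecs_health == "HEALTHY" else "critical"

def sync(containers: list, update_check) -> None:
    """containers: [{"Name": ..., "Health": {"Status": ...}}, ...]
    update_check(check_id, status) stands in for the Consul TTL check update."""
    for c in containers:
        status = consul_check_status(c.get("Health", {}).get("Status", "UNKNOWN"))
        update_check(f"ecs-health-{c['Name']}", status)  # check ID is hypothetical
```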