Tiered Locality
The tiered locality feature enables applications to make intelligent, topology-based decisions regarding which Alluxio workers to read from and write to. For an application running on host1, reading data from an Alluxio worker on host1 is more efficient than reading from a worker on host2. Similarly, reading from a worker on the same rack or in the same data center is faster than reading from a worker on a different rack or different data center. Tiered locality allows users to take advantage of various levels of locality by configuring servers and clients with network topology information.
Tiered Identity
Each entity is identified with a Tiered Identity, where an entity is a master, worker, or client. This identity is an address tuple in the format (tierName1=value1, tierName2=value2, …). Each entry in the tuple is called a locality tier. Alluxio clients will favor interacting with workers that share identical identity entries in the provided order.
Using the template identity of (node=node_name, rack=rack_name, datacenter=datacenter_name), a client with an identity (node=A, rack=rack1, datacenter=dc1) will prefer to read from a worker at (node=B, rack=rack1, datacenter=dc1) over a worker at (node=C, rack=rack2, datacenter=dc1) because:
- no worker shares the same node value as the client
- the first worker shares the same rack value, rack1, as the client
- the datacenter entry is ignored because at least one match was found in the previous locality tier
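The preference rule in the example above can be sketched as a small shell function. This is an illustration only, not Alluxio's implementation: tier_value and pick_worker are hypothetical helper names, identities are assumed to be written as k=v pairs separated by commas with no spaces, and the fallback to the first listed worker when no tier matches is an assumption made for the sketch.

```shell
#!/usr/bin/env bash
# Illustrative sketch only; not Alluxio's actual worker-selection code.

# tier_value IDENTITY TIER
# Prints the value of TIER in an identity written as "k=v,k=v,...".
# Returns non-zero if the tier is absent.
tier_value() {
  local rest="$1" tier="$2" pair
  while [ -n "$rest" ]; do
    pair=${rest%%,*}                      # first remaining k=v pair
    if [ "${pair%%=*}" = "$tier" ]; then
      printf '%s\n' "${pair#*=}"
      return 0
    fi
    case $rest in
      *,*) rest=${rest#*,} ;;             # drop the pair just examined
      *)   rest= ;;
    esac
  done
  return 1
}

# pick_worker ORDER CLIENT WORKER...
# Walks the tiers from most to least specific and prints the first worker
# that matches the client at the current tier. Later tiers are never
# consulted once a match is found.
pick_worker() {
  local order="$1" client="$2" tier worker client_value worker_value
  shift 2
  while [ -n "$order" ]; do
    tier=${order%%,*}
    case $order in
      *,*) order=${order#*,} ;;
      *)   order= ;;
    esac
    client_value=$(tier_value "$client" "$tier") || continue
    for worker in "$@"; do
      worker_value=$(tier_value "$worker" "$tier") || continue
      if [ "$worker_value" = "$client_value" ]; then
        printf '%s\n' "$worker"
        return 0
      fi
    done
  done
  # Assumption for this sketch: with no match at any tier, use the first worker.
  [ "$#" -gt 0 ] && printf '%s\n' "$1"
}

# The example from the text; prints node=B,rack=rack1,datacenter=dc1
pick_worker "node,rack,datacenter" \
  "node=A,rack=rack1,datacenter=dc1" \
  "node=C,rack=rack2,datacenter=dc1" \
  "node=B,rack=rack1,datacenter=dc1"
```

The worker on rack2 is listed first on purpose, to show that the result depends on the tier match rather than on argument order.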
Basic Setup
For this example, suppose Alluxio workers are spread across multiple availability zones within EC2. To configure tiered locality:
1. Write a script named alluxio-locality.sh. Alluxio uses this script to determine the tiered identity for each entity. The output format is a comma-separated list of tierName=tierValue pairs.

#!/usr/bin/env bash
node=$(hostname)
# Ask EC2 which availability zone we're in
availability_zone=$(curl -s http://169.254.169.254/latest/meta-data/placement/availability-zone)
echo "node=${node},availability_zone=${availability_zone}"

2. Make the script executable with chmod +x alluxio-locality.sh.
3. Include the script on the classpath of your applications and Alluxio servers. For servers, put the file in the conf directory.
4. On the Alluxio master(s), set alluxio.locality.order=node,availability_zone to define the order of locality tiers.
5. Restart Alluxio servers to pick up the configuration changes.
To verify that the configuration is working, check the master, worker, and application logs. A log entry of the following format should appear:
INFO TieredIdentityFactory - Initialized tiered identity TieredIdentity(node=ip-xx-xx-xx-xx, availability_zone=us-east-1)
If the log entry does not appear, try running the locality script directly to check its output and ensure it is executable by the user that launched the Alluxio server.
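When running the script by hand, a quick format check can catch mistakes such as a missing value or a stray comma. The function below is a minimal sketch: check_locality_output is a hypothetical helper, and its rules (no empty names or values, exactly one = per pair) are assumptions that may be stricter than what Alluxio itself accepts.

```shell
#!/usr/bin/env bash
# check_locality_output OUTPUT
# Succeeds when OUTPUT looks like a comma-separated list of
# tierName=tierValue pairs. The rules are assumptions for this sketch.
check_locality_output() {
  local rest="$1" pair
  case $rest in
    ''|*,) return 1 ;;                    # empty output or trailing comma
  esac
  while [ -n "$rest" ]; do
    pair=${rest%%,*}
    case $pair in
      =*|*=|*=*=*) return 1 ;;            # empty name, empty value, or extra '='
      *=*) ;;                             # well-formed name=value
      *) return 1 ;;                      # no '=' at all (includes empty pairs)
    esac
    case $rest in
      *,*) rest=${rest#*,} ;;
      *)   rest= ;;
    esac
  done
  return 0
}

# Usage (the path is an assumption; adjust to where the script lives):
#   [ -x ./alluxio-locality.sh ] || echo "script is not executable"
#   check_locality_output "$(./alluxio-locality.sh)" && echo "format OK"
```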
Advanced
Custom locality tiers
Alluxio configures two locality tiers by default: node and rack. Users may customize the set of locality tiers to take advantage of more advanced topologies. The list of tiers is determined by the alluxio.locality.order property on the master. The order should go from most to least specific. For example, to add availability zone locality to a cluster, set:
alluxio.locality.order=node,rack,availability_zone
If the user does nothing to provide tiered identity info, each entity will perform a hostname lookup to set its node-level identity info. If other locality tiers are left unset, they will not be used to inform locality decisions.
Setting locality
Locality script
By default, Alluxio clients and servers search the classpath for a script named alluxio-locality.sh. The output of this script is a comma-separated list of tierName=tierValue pairs. The script name can be overridden by setting:
alluxio.locality.script=locality_script_name
Configuration properties
Using a locality script is the preferred way to configure tiered locality because the same script can be used for Alluxio servers and compute applications. In situations where users do not have access to application classpaths, tiered locality can be configured by setting alluxio.locality.[tiername]:
alluxio.locality.node=node_name
alluxio.locality.rack=rack_name
See the Configuration-Settings page for the different ways to set configuration properties.
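The two mechanisms carry the same information in different shapes. As a rough illustration, the sketch below converts locality-script output into the equivalent alluxio.locality.[tiername] property lines. locality_to_properties is a hypothetical helper, not part of Alluxio, and it assumes well-formed input with no spaces.

```shell
#!/usr/bin/env bash
# locality_to_properties OUTPUT
# Prints one alluxio.locality.<tier>=<value> line for each tierName=tierValue
# pair in OUTPUT. Hypothetical helper; assumes well-formed input.
locality_to_properties() {
  local rest="$1" pair
  while [ -n "$rest" ]; do
    pair=${rest%%,*}
    printf 'alluxio.locality.%s=%s\n' "${pair%%=*}" "${pair#*=}"
    case $rest in
      *,*) rest=${rest#*,} ;;
      *)   rest= ;;
    esac
  done
}

# Prints:
#   alluxio.locality.node=node_name
#   alluxio.locality.rack=rack_name
locality_to_properties "node=node_name,rack=rack_name"
```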
Tier value priority order
For every tier name (e.g. node, rack, availability_zone), the order of precedence for obtaining its value, from highest priority to lowest priority, is as follows:
1. From alluxio.locality.node set in alluxio-site.properties
2. From node=... in the output of the script whose name is configured by alluxio.locality.script
To supply a default value for the node tier, the above list is followed by two more sources, from highest to lowest priority:
3. From alluxio.worker.hostname on a worker, alluxio.master.hostname on a master, or alluxio.user.hostname on a client, in their respective alluxio-site.properties
4. If none of the above are configured, node locality is determined by hostname lookup
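The precedence chain for the node tier amounts to taking the first non-empty source. The sketch below illustrates that logic only: resolve_node is a hypothetical function, and the three candidate values are passed in as arguments, whereas Alluxio reads them from its own configuration and from the locality script.

```shell
#!/usr/bin/env bash
# resolve_node LOCALITY_PROP SCRIPT_VALUE HOSTNAME_PROP
#   LOCALITY_PROP  value of alluxio.locality.node (may be empty)
#   SCRIPT_VALUE   node=... value from the locality script (may be empty)
#   HOSTNAME_PROP  alluxio.worker.hostname, alluxio.master.hostname, or
#                  alluxio.user.hostname, depending on the entity (may be empty)
# Prints the first non-empty candidate, else falls back to a hostname lookup.
resolve_node() {
  local candidate
  for candidate in "$1" "$2" "$3"; do
    if [ -n "$candidate" ]; then
      printf '%s\n' "$candidate"
      return 0
    fi
  done
  hostname
}

# The script value wins here because alluxio.locality.node is unset.
resolve_node "" "node-from-script" "node-from-hostname-property"
```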
When exactly is tiered locality used?
- When clients choose which worker to read through during UFS reads
- When clients choose which worker to read from when multiple Alluxio workers hold a block
- If using the LocalFirstPolicy or LocalFirstAvoidEvictionPolicy, clients use tiered locality to choose which worker to write to when writing data to Alluxio
Custom locality tiers
By default, Alluxio has two locality tiers: node and rack. Users may customize the set of locality tiers to take advantage of more advanced topologies. To change the set of tiers available, set alluxio.locality.order. The order should go from most specific to least specific. For example, to add availability zone locality to a cluster, set:
alluxio.locality.order=node,rack,availability_zone
Note that this configuration must be set for all entities, including Alluxio clients.
To set the availability zone for each entity, either set the alluxio.locality.availability_zone property key, or use a locality script and include availability_zone=... in the output.