Amazon AWS S3
This guide describes how to configure Amazon S3 as Alluxio’s under storage system. Alluxio recognizes the s3:// scheme and uses the aws-sdk to access S3.
Prerequisites
The Alluxio binaries must be on your machine. You can either compile Alluxio, or download the binaries locally.
In preparation for using S3 with Alluxio, create a bucket (or use an existing bucket). You should also note the directory you want to use in that bucket, either by creating a new directory in the bucket, or using an existing one. For the purposes of this guide, the S3 bucket name is called S3_BUCKET, and the directory in that bucket is called S3_DIRECTORY.
Basic Setup
Alluxio unifies access to different storage systems through the unified namespace feature. An S3 location can be either mounted at the root of the Alluxio namespace or at a nested directory.
Root Mount Point
Create conf/alluxio-site.properties if it does not exist.
$ cp conf/alluxio-site.properties.template conf/alluxio-site.properties
Configure Alluxio to use S3 as its under storage system. Specify an existing S3 bucket and directory as the under storage system by modifying conf/alluxio-site.properties to include:
alluxio.master.mount.table.root.ufs=s3://S3_BUCKET/S3_DIRECTORY
Note that if you want to mount the whole S3 bucket, add a trailing slash after the bucket name (e.g. s3://S3_BUCKET/).
Specify the AWS credentials for S3 access by setting aws.accessKeyId and aws.secretKey in alluxio-site.properties:
aws.accessKeyId=<S3 ACCESS KEY>
aws.secretKey=<S3 SECRET KEY>
For other methods of setting AWS credentials, see the credentials section in Advanced Setup.
After these changes, Alluxio should be configured to work with S3 as its under storage system, and you can try Running Alluxio Locally with S3.
Nested Mount
An S3 location can be mounted at a nested directory in the Alluxio namespace to have unified access to multiple under storage systems. Alluxio’s Command Line Interface can be used for this purpose.
$ ./bin/alluxio fs mount \
--option aws.accessKeyId=<AWS_ACCESS_KEY_ID> \
--option aws.secretKey=<AWS_SECRET_KEY> \
/mnt/s3 s3://<S3_BUCKET>/<S3_DIRECTORY>
Running Alluxio Locally with S3
Start up Alluxio locally to see that everything works.
$ ./bin/alluxio format
$ ./bin/alluxio-start.sh local
This should start an Alluxio master and an Alluxio worker. You can see the master UI at http://localhost:19999.
Run a simple example program:
$ ./bin/alluxio runTests
Visit your S3 directory S3_BUCKET/S3_DIRECTORY to verify the files and directories created by Alluxio exist. For this test, you should see files named like:
S3_BUCKET/S3_DIRECTORY/alluxio/data/default_tests_files/Basic_CACHE_THROUGH
To stop Alluxio, you can run:
$ ./bin/alluxio-stop.sh local
Advanced Setup
Advanced Credentials Setup
You can specify credentials in different ways, from highest to lowest priority:
- aws.accessKeyId and aws.secretKey specified as mount options
- aws.accessKeyId and aws.secretKey specified as Java system properties
- aws.accessKeyId and aws.secretKey in alluxio-site.properties
- Environment variables AWS_ACCESS_KEY_ID or AWS_ACCESS_KEY (either is acceptable) and AWS_SECRET_ACCESS_KEY or AWS_SECRET_KEY (either is acceptable) on the Alluxio servers
- Profile file containing credentials at ~/.aws/credentials
- AWS Instance profile credentials, if you are using an EC2 instance
When using an AWS Instance profile as the credentials provider:
- Create an IAM Role with access to the mounted bucket
- Create an Instance profile as a container for the defined IAM Role
- Launch an EC2 instance using the created profile
Note that the IAM role will need access to both the files in the bucket and the bucket itself in order to determine the bucket’s owner. Automatically assigning an owner to the bucket can be avoided by setting the property alluxio.underfs.s3.inherit.acl=false.
See Amazon’s documentation for more details.
Enabling Server Side Encryption
You may encrypt your data stored in S3. The encryption only applies to data at rest in S3; data is transferred in decrypted form when read by clients.
Enable this feature by configuring conf/alluxio-site.properties:
alluxio.underfs.s3.server.side.encryption.enabled=true
DNS-Buckets
By default, a request directed at the bucket named “mybucket” is sent to the host name “mybucket.s3.amazonaws.com”. You can instead use path style data access, for example “http://s3.amazonaws.com/mybucket”, by setting the following configuration:
alluxio.underfs.s3.disable.dns.buckets=true
Accessing S3 through a proxy
To communicate with S3 through a proxy, modify conf/alluxio-site.properties to include:
alluxio.underfs.s3.proxy.host=<PROXY_HOST>
alluxio.underfs.s3.proxy.port=<PROXY_PORT>
Replace <PROXY_HOST> and <PROXY_PORT> with the host and port of your proxy.
Configuring Application Dependency
When building your application to use Alluxio, your application should include a client module: the alluxio-core-client-fs module to use the Alluxio file system interface, or the alluxio-core-client-hdfs module to use the Hadoop file system interface. For example, if you are using Maven, you can add the dependency to your application with:
Alluxio file system interface
<dependency>
<groupId>org.alluxio</groupId>
<artifactId>alluxio-core-client-fs</artifactId>
<version>2.1.2</version>
</dependency>
HDFS file system interface
<dependency>
<groupId>org.alluxio</groupId>
<artifactId>alluxio-core-client-hdfs</artifactId>
<version>2.1.2</version>
</dependency>
Alternatively, you may copy conf/alluxio-site.properties (with the properties setting the credentials) to the classpath of your application runtime (e.g., $SPARK_CLASSPATH for Spark), or append the path to this site properties file to the classpath.
Using a non-Amazon service provider
To use an S3 service provider other than “s3.amazonaws.com”, modify conf/alluxio-site.properties to include:
alluxio.underfs.s3.endpoint=<S3_ENDPOINT>
Replace <S3_ENDPOINT> with the hostname and port of your S3 service, e.g., http://localhost:9000. Only use this parameter if you are using a provider other than s3.amazonaws.com.
Using v2 S3 Signatures
Some S3 service providers only support v2 signatures. For these S3 providers, you can enforce using the v2 signatures by setting alluxio.underfs.s3.signer.algorithm to S3SignerType.
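For example, the following line in conf/alluxio-site.properties enforces v2 signatures (only needed for providers without v4 support):

```properties
# Force the AWS SDK to sign requests with the v2 algorithm
alluxio.underfs.s3.signer.algorithm=S3SignerType
```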
[Experimental] S3 streaming upload
S3 is an object store; because of this, by default the whole file is sent from the client to the worker, stored in the local disk temporary directory, and uploaded to S3 in the close() method.
To enable S3 streaming upload, you need to modify conf/alluxio-site.properties
to include:
alluxio.underfs.s3.streaming.upload.enabled=true
The default upload process is safer but has the following issues:
- Slow upload time. The file has to be sent to the Alluxio worker first, and then the worker uploads the file to S3. The two processes are sequential.
- The temporary directory must have the capacity to store the whole file.
- Slow close(). The time spent in close() is proportional to the file size and inversely proportional to the bandwidth, that is, O(FILE_SIZE/BANDWIDTH). A slow close() is unexpected and has already been a bottleneck in the Alluxio Fuse integration. The Alluxio Fuse method which calls close() is asynchronous, so if we write a big file through Alluxio Fuse to S3, the Fuse write operation returns much earlier than the file is actually written to S3.
The S3 streaming upload feature addresses the above issues and is based on the S3 low-level multipart upload.
The S3 streaming upload has the following advantages:
- Shorter upload time. The Alluxio worker uploads buffered data while receiving new data. The total upload time will be at least as fast as the default method.
- Smaller capacity requirement. Data is buffered and uploaded in partitions (alluxio.underfs.s3.streaming.upload.partition.size, which is 64MB by default). When a partition is successfully uploaded, it is deleted.
- Faster close(). Uploading begins when the buffered data reaches the partition size instead of uploading the whole file in close().
If an S3 streaming upload is interrupted, intermediate partitions may remain in S3, and S3 will charge for that data. To reduce the charges, users can modify conf/alluxio-site.properties to include:
alluxio.underfs.cleanup.enabled=true
Intermediate multipart uploads in all non-readonly S3 mount points older than the clean age (configured by alluxio.underfs.s3.intermediate.upload.clean.age) will be cleaned when a leading master starts or a cleanup interval (configured by alluxio.underfs.cleanup.interval) is reached.
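A sketch of a cleanup configuration combining these properties; the interval and age values below are illustrative, not necessarily the defaults:

```properties
# Enable cleanup of stale intermediate multipart uploads
alluxio.underfs.cleanup.enabled=true
# How often the leading master runs the cleanup (illustrative value)
alluxio.underfs.cleanup.interval=1day
# Intermediate uploads older than this are cleaned (illustrative value)
alluxio.underfs.s3.intermediate.upload.clean.age=3day
```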
Tuning for High Concurrency
When using Alluxio to access S3 with a high number of clients per Alluxio server, these parameters can be tuned so that Alluxio uses a configuration optimized for the S3 backend.
# If the S3 connection is slow, a larger timeout is useful
alluxio.underfs.s3.socket.timeout=500sec
alluxio.underfs.s3.request.timeout=5min
# If we expect a high number of concurrent metadata operations
alluxio.underfs.s3.admin.threads.max=80
# If the total metadata + data operations is high
alluxio.underfs.s3.threads.max=160
# A. For an Alluxio worker, this controls the number of concurrent writes to S3
# B. For an Alluxio master, this controls the number of threads used to concurrently rename
# files within a directory
alluxio.underfs.s3.upload.threads.max=80
# An Alluxio master uses this thread-pool size to submit delete and rename operations to S3
alluxio.underfs.object.store.service.threads=80
S3 Access Control
If Alluxio security is enabled, Alluxio enforces the access control inherited from the underlying object storage.
The S3 credentials specified in the Alluxio configuration represent an S3 user. The S3 service backend checks the user’s permissions on the bucket and the object for access control. If the given S3 user does not have access permissions to the specified bucket, a permission denied error will be thrown. When Alluxio security is enabled, Alluxio loads the bucket ACL into Alluxio permissions the first time the metadata is loaded into the Alluxio namespace.
Mapping from S3 user to Alluxio file owner
By default, Alluxio tries to extract the S3 user display name from the S3 credentials. Optionally, alluxio.underfs.s3.owner.id.to.username.mapping can be used to specify a static mapping from S3 canonical IDs to Alluxio usernames, in the format “id1=user1;id2=user2”. The AWS S3 canonical ID can be found at the console https://console.aws.amazon.com/iam/home?#/security_credentials. Expand the “Account Identifiers” tab and refer to “Canonical User ID”.
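For example, a mapping entry might look like the following in conf/alluxio-site.properties (the canonical ID and username below are hypothetical):

```properties
# Map a hypothetical AWS canonical user ID to the Alluxio user "alice"
alluxio.underfs.s3.owner.id.to.username.mapping=1234567890abcdef1234567890abcdef1234567890abcdef1234567890abcdef=alice
```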
Mapping from S3 ACL to Alluxio permission
Alluxio checks the S3 bucket read/write ACLs to determine the owner’s permissions to an Alluxio file. For example, if the S3 user has read-only access to the underlying bucket, the mounted directory and files will be set with 0500 permissions. If the S3 user has full access to the underlying bucket, the mounted directory and files will be set with 0700 permissions.
Mount point sharing
If you want to share the S3 mount point with other users in the Alluxio namespace, you can enable alluxio.underfs.object.store.mount.shared.publicly.
Permission change
chown, chgrp, and chmod of Alluxio directories and files do NOT propagate to the underlying S3 buckets or objects.
Troubleshooting
If issues are encountered running against your S3 backend, enable additional logging to track HTTP traffic. Modify conf/log4j.properties to add the following properties:
log4j.logger.com.amazonaws=WARN
log4j.logger.com.amazonaws.request=DEBUG
log4j.logger.org.apache.http.wire=DEBUG
See Amazon’s documentation for more details.
Note: Alluxio creates zero-byte files in S3 as a performance optimization. For a bucket mounted with read-write access, zero-byte file creation (S3 PUT operation) is not restricted to writes using Alluxio but also occurs when listing contents of the under storage. To disable the PUT operation, mount the bucket with the --readonly flag.
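Following the Nested Mount example above, a read-only mount might look like this (the mount path and placeholders are illustrative):

```shell
$ ./bin/alluxio fs mount --readonly \
    --option aws.accessKeyId=<AWS_ACCESS_KEY_ID> \
    --option aws.secretKey=<AWS_SECRET_KEY> \
    /mnt/s3 s3://<S3_BUCKET>/<S3_DIRECTORY>
```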