Running Alluxio on Google Cloud Dataproc
This guide describes how to configure Alluxio to run on Google Cloud Dataproc.
Overview
Google Cloud Dataproc is a managed, on-demand service for running Presto, Spark, and Hadoop compute workloads. It manages the deployment of various Hadoop services and provides hooks into these services for customization. In addition to the performance benefits of caching, Alluxio enables users to run compute workloads against on-premises storage or a different cloud provider’s storage, such as AWS S3 and Azure Blob Store.
Prerequisites
- A project with Cloud Dataproc API and Compute Engine API enabled.
- A GCS Bucket.
Make sure that the gcloud CLI is set up with the necessary GCS interoperable storage access keys.
Note: GCS interoperability should be enabled in the Interoperability tab of the GCS settings.
A GCS bucket is required only if it is mounted as the root of the Alluxio namespace. Alternatively, the root UFS can be reconfigured to HDFS or any other supported under store.
Basic Setup
When creating a Dataproc cluster, Alluxio can be installed using an initialization action.
Create a cluster
Several properties, set as metadata labels, control the Alluxio deployment.
- A required argument is the root UFS address configured using alluxio_root_ufs_uri.
- Properties must be specified using the metadata key alluxio_site_properties, delimited using a semicolon (;).
$ gcloud dataproc clusters create <cluster_name> \
--initialization-actions gs://alluxio-public/dataproc/2.2.2/alluxio-dataproc.sh \
--metadata \
alluxio_root_ufs_uri=gs://<my_bucket>,\
alluxio_site_properties="fs.gcs.accessKeyId=<my_access_key>;fs.gcs.secretAccessKey=<my_secret_key>"
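When several properties are needed, the semicolon-delimited value can be assembled before calling gcloud. The following is a minimal sketch: the helper name join_site_properties is hypothetical and not part of Alluxio or gcloud, and the property values are the same placeholders used above.

```shell
# Hypothetical helper (not part of Alluxio or gcloud): joins key=value
# arguments with ';' to form a value for alluxio_site_properties.
join_site_properties() {
  result=""
  for prop in "$@"; do
    if [ -z "$result" ]; then
      result="$prop"
    else
      result="$result;$prop"
    fi
  done
  printf '%s\n' "$result"
}

# Placeholder values; substitute real keys before use.
site_props="$(join_site_properties \
  "fs.gcs.accessKeyId=<my_access_key>" \
  "fs.gcs.secretAccessKey=<my_secret_key>")"
echo "$site_props"
```

The resulting string could then be supplied as alluxio_site_properties="$site_props" in the --metadata argument.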
Customization
The Alluxio deployment on Google Dataproc can be customized for more complex scenarios by passing additional metadata labels to the gcloud dataproc clusters create command.
Download Additional Files
Additional files can be downloaded into the Alluxio installation directory at /opt/alluxio/conf using the metadata key alluxio_download_files_list. Specify http(s) or gs URIs delimited using ;.
...
--metadata \
alluxio_download_files_list="gs://<my_bucket>/<my_file>;https://<server>/<file>",\
...
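Because only http(s) and gs URIs are accepted, it can help to validate the list before creating the cluster. The sketch below assumes a hypothetical helper, build_download_list, which is not part of the initialization action; it rejects any other scheme and joins the remaining URIs with ;.

```shell
# Hypothetical helper: accepts only gs://, http:// and https:// URIs and
# joins them with ';' for alluxio_download_files_list.
build_download_list() {
  list=""
  for uri in "$@"; do
    case "$uri" in
      gs://*|http://*|https://*) ;;
      *) echo "unsupported URI: $uri" >&2; return 1 ;;
    esac
    # Prepend the separator only when the list already has an entry.
    list="${list:+$list;}$uri"
  done
  printf '%s\n' "$list"
}

# Placeholder URIs from the example above.
build_download_list "gs://<my_bucket>/<my_file>" "https://<server>/<file>"
```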
Tiered Storage
The default Alluxio Worker memory is set to 1/3 of the physical memory on the instance. If a specific value is desired, set alluxio.worker.memory.size in the provided alluxio-site.properties.
Alternatively, when volumes such as Dataproc Local SSDs are mounted, specify the metadata label alluxio_ssd_capacity_usage to configure the percentage of all available SSD capacity on the virtual machine that is provisioned as Alluxio worker storage. Memory is not configured as the primary Alluxio storage tier in this case.
Pass additional arguments to the gcloud dataproc clusters create command.
...
--num-worker-local-ssds=1 \
--metadata \
alluxio_ssd_capacity_usage="60",\
...
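To make the two sizing rules concrete, the arithmetic can be sketched as follows. The 15 GB instance memory is an illustrative assumption, 375 GB is the size of a single Compute Engine local SSD, and the function names are hypothetical. The initialization action performs its own calculation, which may round differently.

```shell
# Illustrative arithmetic only; the initialization action computes the real values.

# Default worker memory: one third of physical memory (integer GB).
default_worker_mem_gb() {
  echo $(( $1 / 3 ))
}

# SSD capacity given to Alluxio: usage percent of (SSD count * GB per SSD).
alluxio_ssd_gb() {
  echo $(( $1 * $2 * $3 / 100 ))
}

default_worker_mem_gb 15     # assumed 15 GB instance
alluxio_ssd_gb 1 375 60      # 1 local SSD of 375 GB at 60% usage
```

Under these assumptions the script prints 5 and 225, that is, roughly 5 GB of memory by default, or 225 GB of SSD storage when alluxio_ssd_capacity_usage is 60.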
Next steps
The status of the cluster deployment can be monitored using the CLI.
$ gcloud dataproc clusters list
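For scripted deployments, the status check can be repeated until the cluster reports RUNNING. The following is a rough sketch rather than part of the Alluxio tooling: wait_until_running and POLL_INTERVAL are hypothetical names, and the commented invocation assumes the cluster state is read with gcloud dataproc clusters describe and that a region must be given.

```shell
# Hypothetical helper: runs the given command up to max_attempts times and
# succeeds once it prints RUNNING. POLL_INTERVAL (seconds) defaults to 10.
wait_until_running() {
  max_attempts="$1"
  shift
  attempt=0
  while [ "$attempt" -lt "$max_attempts" ]; do
    state="$("$@")"
    if [ "$state" = "RUNNING" ]; then
      return 0
    fi
    attempt=$((attempt + 1))
    sleep "${POLL_INTERVAL:-10}"
  done
  return 1
}

# Example invocation (assumption: the cluster lives in <region>):
# wait_until_running 30 gcloud dataproc clusters describe <cluster_name> \
#   --region <region> --format 'value(status.state)'
```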
Identify the instance name and SSH into this instance to test the deployment.
$ gcloud compute ssh <cluster_name>-m
Test that Alluxio is running as expected.
$ alluxio runTests
Alluxio is installed in /opt/alluxio/ by default.
Compute Applications
Spark, Hive and Presto on Dataproc are pre-configured to connect to Alluxio.
To run a Spark application accessing data from Alluxio, simply refer to the path as alluxio:///<path_to_file>.
Open a shell.
$ spark-shell
Run a sample job.
scala> sc.textFile("alluxio:///default_tests_files/BASIC_NO_CACHE_MUST_CACHE").count
For further information, visit our Spark on Alluxio documentation.
To test Hive on Alluxio, download a sample dataset.
$ wget http://files.grouplens.org/datasets/movielens/ml-100k.zip
$ unzip ml-100k.zip
Copy the data to Alluxio.
$ alluxio fs mkdir /ml-100k
$ alluxio fs copyFromLocal ~/ml-100k/u.user /ml-100k/
Open the Hive CLI.
$ hive
Create a table.
hive> CREATE EXTERNAL TABLE u_user (
userid INT,
age INT,
gender CHAR(1),
occupation STRING,
zipcode STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
LOCATION 'alluxio:///ml-100k';
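The table definition assumes that every line of u.user has five fields separated by |. A quick check of that assumption can be sketched as below. The function name has_five_fields is hypothetical, and the two sample rows are written inline in the layout of the dataset rather than read from the download.

```shell
# Hypothetical check: succeeds only if every input line has exactly five
# '|'-separated fields, matching the five columns of u_user.
has_five_fields() {
  awk -F'|' 'NF != 5 { bad = 1 } END { exit bad }' "$@"
}

# Two illustrative rows in the userid|age|gender|occupation|zipcode layout.
printf '%s\n' '1|24|M|technician|85711' '2|53|F|other|94043' | has_five_fields \
  && echo "sample matches the table layout"
```

On the cluster, the same function could be pointed at the downloaded file, for example has_five_fields ~/ml-100k/u.user.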
Run a query.
hive> select * from u_user limit 10;
For further information, visit our Hive on Alluxio documentation.
Note:
- Initialization actions are executed sequentially, so the Presto initialization action must precede the Alluxio one.
- The Presto initialization action should install Presto in the home directory /opt/presto-server.
To test Presto on Alluxio, simply run a query on the table created in the Hive section above:
$ presto --execute "select * from u_user limit 10;" --catalog hive --schema default
For further information, visit our Presto on Alluxio documentation.