Running Presto with Alluxio
Presto is an open source distributed SQL query engine for running interactive analytic queries on data at a large scale. This guide describes how to run Presto to query Alluxio as a distributed caching layer, where the data sources can be AWS S3, Azure Blob Storage, HDFS, and many others. With this setup, Alluxio helps Presto access data regardless of the underlying data source and transparently caches frequently accessed data (e.g., commonly used tables) in Alluxio distributed storage. Co-locating Alluxio workers with Presto workers improves data locality and reduces I/O access latency, especially when the data is remote or the network is slow or congested.
Using Presto with the Alluxio Catalog Service
Currently, there are two ways to enable Presto to interact with Alluxio:
- Presto interacts with the Alluxio Catalog Service
- Presto interacts with the Hive Metastore (with table definitions updated to use Alluxio paths)
The primary benefits of using Presto with the Alluxio Catalog Service are simpler deployment of Alluxio with Presto (no modifications to the Hive Metastore) and schema-aware optimizations (transformations such as coalescing and file conversions). However, the Catalog Service is currently limited to read-only workloads.
For more details and instructions on how to use the Alluxio Catalog Service with Presto, please visit the Alluxio Catalog Service documentation.
The rest of this page discusses the alternative approach of Presto interacting with the Hive Metastore.
Prerequisites
- Set up Java 8 Update 161 or higher (8u161+), 64-bit.
- Deploy Presto. This guide is tested with presto-315.
- Alluxio has been set up and is running.
- Make sure that the Alluxio client jar is available. This client jar can be found at /<PATH_TO_ALLUXIO>/client/alluxio-2.2.2-client.jar in the tarball downloaded from the Alluxio download page.
- Make sure that the Hive metastore is running to serve metadata information of Hive tables.
Basic Setup
Configure Presto to connect to Hive Metastore
Presto gets the database and table metadata information (including file system locations) from the Hive Metastore, via Presto’s Hive connector. Here is an example Presto configuration file ${PRESTO_HOME}/etc/catalog/hive.properties for a catalog using the Hive connector:
connector.name=hive-hadoop2
hive.metastore.uri=thrift://localhost:9083
Distribute the Alluxio client jar to all Presto servers
In order for Presto to be able to communicate with the Alluxio servers, the Alluxio client jar must be in the classpath of the Presto servers. Put the Alluxio client jar /<PATH_TO_ALLUXIO>/client/alluxio-2.2.2-client.jar into the directory ${PRESTO_HOME}/plugin/hive-hadoop2/ (this directory may differ across versions) on all Presto servers, then restart the Presto service:
$ ${PRESTO_HOME}/bin/launcher restart
After completing the basic configuration, Presto should be able to access data in Alluxio. To configure more advanced features for Presto (e.g., connect to Alluxio with HA), please follow the instructions at Advanced Setup.
Examples: Use Presto to Query Tables on Alluxio
Create a Hive table on Alluxio
Here is an example of creating an internal table in Hive backed by files in Alluxio. You can download a data file (e.g., ml-100k.zip) from http://grouplens.org/datasets/movielens/. Unzip this file and upload the file u.user into /ml-100k/ on Alluxio:
$ ./bin/alluxio fs mkdir /ml-100k
$ ./bin/alluxio fs copyFromLocal /path/to/ml-100k/u.user alluxio:///ml-100k
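To confirm the upload, you can list the directory (output will vary with your environment):
$ ./bin/alluxio fs ls /ml-100k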
Create an internal Hive table pointing to the Alluxio file location:
hive> CREATE TABLE u_user (
userid INT,
age INT,
gender CHAR(1),
occupation STRING,
zipcode STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
STORED AS TEXTFILE
LOCATION 'alluxio://master_hostname:port/ml-100k';
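As an optional sanity check, you can query the new table from the Hive CLI before moving on to Presto:
hive> SELECT * FROM u_user LIMIT 3;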
View the Alluxio WebUI at http://master_hostname:19999 to see the directory and file that Hive creates.
Start Hive metastore
Ensure your Hive metastore service is running. The Hive metastore listens on port 9083 by default. If it is not running, start it with:
$ ${HIVE_HOME}/bin/hive --service metastore
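To verify that the metastore is up, you can check that something is listening on its port (the command below is one option and assumes netstat is available):
$ netstat -ln | grep 9083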
Start Presto server
Start your Presto server. The Presto server runs on port 8080 by default (configurable with http-server.http.port in ${PRESTO_HOME}/etc/config.properties):
$ ${PRESTO_HOME}/bin/launcher run
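For reference, the relevant entry in ${PRESTO_HOME}/etc/config.properties looks like the following if you keep the default port (a minimal sketch, not a complete config file):
http-server.http.port=8080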
Query tables using Presto
Follow the Presto CLI instructions to download the presto-cli-<PRESTO_VERSION>-executable.jar, rename it to presto, and make it executable with chmod +x. (In some distributions, an executable presto already exists in ${PRESTO_HOME}/bin/presto and you can use it directly.)
Run a single query (replace localhost:8080 with your actual Presto server hostname and port):
$ ./presto --server localhost:8080 --execute "use default; select * from u_user limit 10;" \
--catalog hive --debug
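Alternatively, you can start an interactive CLI session against the same catalog (assuming the same server address as above):
$ ./presto --server localhost:8080 --catalog hive --schema default
presto:default> SELECT * FROM u_user LIMIT 10;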
You should see the query results in the console. You can also find some of the Alluxio client log messages in the Presto server log.
Advanced Setup
Customize Alluxio User Properties
To configure additional Alluxio properties, you can append the conf path (i.e., ${ALLUXIO_HOME}/conf) containing alluxio-site.properties to Presto’s JVM config at etc/jvm.config under the Presto installation directory. The advantage of this approach is that all Alluxio properties are set in a single file, alluxio-site.properties.
...
-Xbootclasspath/p:<path-to-alluxio-conf>
Alternatively, one can add the Alluxio properties to the Hadoop configuration files (core-site.xml, hdfs-site.xml), and use the Presto property hive.config.resources in the file ${PRESTO_HOME}/etc/catalog/hive.properties to point to the files’ locations for every Presto worker.
hive.config.resources=/<PATH_TO_CONF>/core-site.xml,/<PATH_TO_CONF>/hdfs-site.xml
Example: connect to Alluxio with HA
If the Alluxio HA cluster uses internal leader election, set the Alluxio cluster property appropriately in the alluxio-site.properties file, which is on the classpath.
alluxio.master.rpc.addresses=master_hostname_1:19998,master_hostname_2:19998,master_hostname_3:19998
Alternatively, you can add the property to the Hadoop core-site.xml configuration referenced by hive.config.resources.
<configuration>
<property>
<name>alluxio.master.rpc.addresses</name>
<value>master_hostname_1:19998,master_hostname_2:19998,master_hostname_3:19998</value>
</property>
</configuration>
For information about how to connect to Alluxio HA cluster using Zookeeper-based leader election, please refer to HA mode client configuration parameters.
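For reference, a Zookeeper-based setup typically involves properties along these lines in alluxio-site.properties (a sketch with hypothetical hostnames; see the linked documentation for the authoritative list):
alluxio.zookeeper.enabled=true
alluxio.zookeeper.address=zk_host1:2181,zk_host2:2181,zk_host3:2181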
Example: change default Alluxio write type
For example, change alluxio.user.file.writetype.default from its default value ASYNC_THROUGH to CACHE_THROUGH.
One way is to specify the property in alluxio-site.properties and distribute this file to the classpath of each Presto node:
alluxio.user.file.writetype.default=CACHE_THROUGH
Alternatively, modify conf/hive-site.xml to include:
<property>
<name>alluxio.user.file.writetype.default</name>
<value>CACHE_THROUGH</value>
</property>
Increase parallelism
Presto’s Hive connector uses the config hive.max-split-size to control the parallelism of the query. For Alluxio 1.6 or earlier, it is recommended to set this size no less than Alluxio’s block size to avoid read contention within the same block. For later Alluxio versions, this is no longer an issue because of Alluxio’s async caching abilities.
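For example, you could set the split size in ${PRESTO_HOME}/etc/catalog/hive.properties (the value below is illustrative; on Alluxio 1.6 or earlier, choose one that is at least your Alluxio block size):
hive.max-split-size=256MB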
Avoid Presto timeout reading large files
It is recommended to increase alluxio.user.network.data.timeout to a larger value (e.g., 10min) to avoid timeout failures when reading large files from remote workers.
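For example, in the alluxio-site.properties file on the classpath of each Presto node:
alluxio.user.network.data.timeout=10min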
Troubleshooting
Error message “No FileSystem for scheme: alluxio” on queries
When you see error messages like the following, it is likely that the Alluxio client jar is not on the classpath of the Presto workers. Please follow the instructions in Distribute the Alluxio client jar to all Presto servers to fix this issue.
Query 20180907_063430_00001_cm7xe failed: No FileSystem for scheme: alluxio
com.facebook.presto.spi.PrestoException: No FileSystem for scheme: alluxio
at com.facebook.presto.hive.BackgroundHiveSplitLoader$HiveSplitLoaderTask.process(BackgroundHiveSplitLoader.java:189)
at com.facebook.presto.hive.util.ResumableTasks.safeProcessTask(ResumableTasks.java:47)
at com.facebook.presto.hive.util.ResumableTasks.access$000(ResumableTasks.java:20)
at com.facebook.presto.hive.util.ResumableTasks$1.run(ResumableTasks.java:35)
at io.airlift.concurrent.BoundedExecutor.drainQueue(BoundedExecutor.java:78)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)