JuiceFS

This page explains how to use Hudi with JuiceFS.

JuiceFS configs

JuiceFS is a high-performance distributed file system. Data written to JuiceFS is persisted in object storage (e.g. Amazon S3), while the corresponding metadata can be persisted in a database engine such as Redis, MySQL, or TiKV, depending on the use case.

Three steps are required to make Hudi work with JuiceFS:

  1. Creating JuiceFS file system
  2. Adding JuiceFS configuration for Hudi
  3. Adding required JAR to classpath

Creating JuiceFS file system

JuiceFS supports multiple metadata engines, such as Redis, MySQL, SQLite, and TiKV, and supports almost all object storage services as data storage, e.g. Amazon S3, Google Cloud Storage, and Azure Blob Storage.

The following example uses Redis as the “Metadata Engine” and Amazon S3 as the “Data Storage” in a Linux environment.

Download JuiceFS client

    $ JFS_LATEST_TAG=$(curl -s https://api.github.com/repos/juicedata/juicefs/releases/latest | grep 'tag_name' | cut -d '"' -f 4 | tr -d 'v')
    $ wget "https://github.com/juicedata/juicefs/releases/download/v${JFS_LATEST_TAG}/juicefs-${JFS_LATEST_TAG}-linux-amd64.tar.gz"
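
The download command above hard-codes the linux-amd64 build. As a minimal sketch, the helpers below pick the asset suffix from the machine architecture and build the matching URL. The function names jfs_arch and jfs_tarball_url are hypothetical, and the sketch assumes that an arm64 asset follows the same naming pattern as the amd64 one; check the releases page before relying on it.

```shell
# Hypothetical helper: map `uname -m` output to the suffix used in release asset names.
# Assumption: only amd64 and arm64 Linux builds are handled here.
jfs_arch() {
  case "$1" in
    x86_64|amd64)  echo "amd64" ;;
    aarch64|arm64) echo "arm64" ;;
    *) echo "unsupported architecture: $1" >&2; return 1 ;;
  esac
}

# Hypothetical helper: build the tarball URL for a version (without the leading "v")
# and an architecture suffix, following the URL pattern used in the wget command.
jfs_tarball_url() {
  echo "https://github.com/juicedata/juicefs/releases/download/v$1/juicefs-$1-linux-$2.tar.gz"
}

# Usage (needs network access; JFS_LATEST_TAG is set by the first download command):
#   wget "$(jfs_tarball_url "${JFS_LATEST_TAG}" "$(jfs_arch "$(uname -m)")")"
```

If you use a different architecture, remember that the tarball name in the install step changes accordingly.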

Install JuiceFS client

    $ mkdir juice && tar -zxvf "juicefs-${JFS_LATEST_TAG}-linux-amd64.tar.gz" -C juice
    $ sudo install juice/juicefs /usr/local/bin

Format a JuiceFS file system

    $ juicefs format \
        --storage s3 \
        --bucket https://<bucket>.s3.<region>.amazonaws.com \
        --access-key <your-access-key-id> \
        --secret-key <your-access-key-secret> \
        redis://:<password>@<redis-host>:6379/1 \
        myjfs

For more information, please refer to “JuiceFS Quick Start Guide”.

Adding JuiceFS configuration for Hudi

Add the required configurations to your core-site.xml, from which Hudi can read them.

    <property>
      <name>fs.defaultFS</name>
      <value>jfs://myjfs</value>
      <description>Optional, you can also specify full path "jfs://myjfs/path-to-dir" with location to use JuiceFS</description>
    </property>
    <property>
      <name>fs.jfs.impl</name>
      <value>io.juicefs.JuiceFileSystem</value>
    </property>
    <property>
      <name>fs.AbstractFileSystem.jfs.impl</name>
      <value>io.juicefs.JuiceFS</value>
    </property>
    <property>
      <name>juicefs.meta</name>
      <value>redis://:<password>@<redis-host>:6379/1</value>
    </property>
    <property>
      <name>juicefs.cache-dir</name>
      <value>/path-to-your-disk</value>
    </property>
    <property>
      <name>juicefs.cache-size</name>
      <value>1024</value>
    </property>
    <property>
      <name>juicefs.access-log</name>
      <value>/tmp/juicefs.access.log</value>
    </property>
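
Before starting a job, it can help to confirm that core-site.xml contains the JuiceFS entries. The following is a rough sketch and not part of Hudi or JuiceFS: jfs_missing_props is a hypothetical helper, the set of properties it treats as mandatory is an assumption, and the check expects each <name>...</name> element to sit on a single line.

```shell
# Hypothetical helper: print every expected property name that is absent from
# the given core-site.xml. Returns non-zero if at least one is missing.
# Assumption: these three properties are the ones JuiceFS cannot work without;
# fs.defaultFS is marked optional above and the rest are treated as tuning options.
jfs_missing_props() {
  conf_file="$1"
  missing=0
  for name in fs.jfs.impl fs.AbstractFileSystem.jfs.impl juicefs.meta; do
    # -F matches the string literally, so dots in the name are not wildcards.
    if ! grep -qF "<name>${name}</name>" "$conf_file"; then
      echo "$name"
      missing=1
    fi
  done
  return "$missing"
}

# Usage (the path is only an example):
#   jfs_missing_props "$HADOOP_CONF_DIR/core-site.xml" || echo "fix core-site.xml first"
```

The value of juicefs.meta should be the same metadata URL that was passed to juicefs format.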

Refer to the JuiceFS documentation for more configuration information.

Adding JuiceFS Hadoop Java SDK

Download the latest JuiceFS Hadoop Java SDK from the JuiceFS releases page (the file is named like juicefs-hadoop-X.Y.Z-linux-amd64.jar) and place it on the classpath. You can also compile it yourself.

For example, if you use Hudi with Spark, put the JAR in $SPARK_HOME/jars.
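
As an illustration of the Spark case, the sketch below copies the JAR into the jars directory and confirms that it landed there. install_jfs_jar is a hypothetical helper name, and the JAR file name in the usage line keeps the X.Y.Z placeholder from the text above.

```shell
# Hypothetical helper: copy the SDK JAR into <spark_home>/jars and print the
# installed path. Fails early if the JAR or the jars directory does not exist.
install_jfs_jar() {
  jar_path="$1"
  spark_home="$2"
  if [ ! -f "$jar_path" ]; then
    echo "JAR not found: $jar_path" >&2
    return 1
  fi
  if [ ! -d "$spark_home/jars" ]; then
    echo "no jars directory under: $spark_home" >&2
    return 1
  fi
  cp "$jar_path" "$spark_home/jars/" || return 1
  # ls prints the copied file's path, which doubles as a simple confirmation.
  ls "$spark_home/jars/$(basename "$jar_path")"
}

# Usage:
#   install_jfs_jar ./juicefs-hadoop-X.Y.Z-linux-amd64.jar "$SPARK_HOME"
```

On a cluster, the JAR has to be available on every node that runs Spark tasks, not only on the machine that submits the job.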