FUSE-based POSIX API

The Alluxio POSIX API is a feature that allows mounting an Alluxio File System as a standard file system on most flavors of Unix. By using this feature, standard tools (for example, ls, cat, or mkdir) have basic access to the Alluxio namespace. More importantly, with the POSIX API integration, applications can interact with Alluxio no matter what language (C, C++, Python, Ruby, Perl, or Java) they are written in, without any Alluxio library integrations.

Note that Alluxio-FUSE is different from projects like s3fs or mountableHdfs, which mount a specific storage service such as S3 or HDFS to the local filesystem. The Alluxio POSIX API is a generic solution for the many storage systems supported by Alluxio. Data orchestration and caching features from Alluxio speed up I/O access to frequently used data.

The Alluxio POSIX API is based on the Filesystem in Userspace (FUSE) project. Most basic file system operations are supported. However, given the intrinsic characteristics of Alluxio, like its write-once/read-many-times file data model, the mounted file system does not have full POSIX semantics and has some limitations. See the Assumptions and Limitations section for details.

Choose POSIX API Implementation

The Alluxio POSIX API has two implementations for users to choose from:

Alluxio JNR-Fuse: Alluxio’s first-generation FUSE implementation, which uses JNR-Fuse for FUSE on Java. JNR-Fuse targets low-concurrency scenarios and has some known performance limitations.
Alluxio JNI-Fuse: Alluxio’s default in-house implementation based on JNI (Java Native Interface). It targets more performance-sensitive applications (such as model training workloads) and was initiated by researchers from Nanjing University and engineers from Alibaba Inc.

Here is a guideline for choosing between JNR-Fuse and JNI-Fuse:

Workloads: If your data access pattern is highly concurrent (e.g., deep learning training), JNI-Fuse is better and more stable.
Maintenance: JNI-Fuse is experimental but under active development (check out our developer meeting notes). The Alluxio community will focus more on developing JNI-Fuse and will eventually deprecate Alluxio JNR-Fuse.

Requirements

The following are the basic requirements for running the Alluxio FUSE integration to support the POSIX API in a standalone way. Installing Alluxio using Docker and Kubernetes can further simplify the setup.

Install JDK 1.8 or newer
Install libfuse
- On Linux, install libfuse 2.9.3 or newer (2.8.3 has been reported to also work, with some warnings). For example, on Red Hat, run yum install fuse fuse-devel
- On macOS, install osxfuse 3.7.1 or newer. For example, run brew install osxfuse
JNI-Fuse is enabled by default for better performance. If JNR-Fuse is needed, set alluxio.fuse.jnifuse.enabled to false in ${ALLUXIO_HOME}/conf/alluxio-site.properties:

alluxio.fuse.jnifuse.enabled=false
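
If the property may already be present in the file, appending a second line would leave two conflicting entries. The following is a minimal sketch of an idempotent update; set_property is a hypothetical helper (not part of the Alluxio distribution), and the example runs against a scratch file rather than the real alluxio-site.properties.

```shell
# Sketch only: set_property is a hypothetical helper, not an Alluxio tool.
# It sets KEY=VALUE in a properties file, replacing any existing KEY= line.
set_property() {
  file=$1 key=$2 value=$3
  touch "$file"
  # Keep every line that does not start with "KEY=", then append the new value.
  awk -v k="$key" 'index($0, k "=") != 1' "$file" > "${file}.tmp"
  printf '%s=%s\n' "$key" "$value" >> "${file}.tmp"
  mv "${file}.tmp" "$file"
}

# Example against a scratch file standing in for alluxio-site.properties.
conf=$(mktemp)
printf 'alluxio.fuse.jnifuse.enabled=true\n' > "$conf"
set_property "$conf" alluxio.fuse.jnifuse.enabled false
cat "$conf"   # prints alluxio.fuse.jnifuse.enabled=false
```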

Basic Setup

Mount Alluxio as a FUSE mount point

After properly configuring and starting an Alluxio cluster, run the following command on the node where you want to create the mount point:

$ ${ALLUXIO_HOME}/integration/fuse/bin/alluxio-fuse mount \
  <mount_point> [<alluxio_path>]

This spawns a background user-space Java process (alluxio-fuse) that mounts the Alluxio path specified by <alluxio_path> to the local file system at the specified <mount_point>.

For example, running the following commands from the ${ALLUXIO_HOME} directory will mount the Alluxio path /people to the folder /mnt/people on the local file system.

$ ./bin/alluxio fs mkdir /people
$ sudo mkdir -p /mnt/people
$ sudo chown $(whoami) /mnt/people
$ chmod 755 /mnt/people
$ integration/fuse/bin/alluxio-fuse mount /mnt/people /people

When <alluxio_path> is not given, the value defaults to the root (/). Note that the <mount_point> must be an existing and empty path in your local file system hierarchy, and that the user who runs the integration/fuse/bin/alluxio-fuse script must own the mount point and have read and write permissions on it. Multiple Alluxio FUSE mount points can be created on the same node. All the AlluxioFuse processes share the same log output at ${ALLUXIO_HOME}/logs/fuse.log, which is useful for troubleshooting when operations on the mounted file system fail.
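
The mount point requirements above (existing, empty, owned by the current user, readable and writable) can be checked before calling alluxio-fuse. This is a minimal sketch; check_mount_point is a hypothetical helper that only inspects the local directory and does not talk to Alluxio.

```shell
# Sketch only: check_mount_point is a hypothetical pre-flight helper.
check_mount_point() {
  dir=$1
  [ -d "$dir" ] || { echo "not a directory: $dir"; return 1; }
  # The directory must be empty, hidden entries included.
  [ -z "$(ls -A "$dir")" ] || { echo "not empty: $dir"; return 1; }
  # -O: owned by the effective user; -r / -w: readable and writable.
  [ -O "$dir" ] && [ -r "$dir" ] && [ -w "$dir" ] \
    || { echo "need ownership and read/write access: $dir"; return 1; }
  echo "ok: $dir"
}

# Example against a temporary directory standing in for /mnt/people.
demo=$(mktemp -d)
check_mount_point "$demo"
```

A non-zero return status means the directory would not be accepted as a mount point under the rules stated above.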

Unmount Alluxio from FUSE

To unmount a previously mounted Alluxio-FUSE file system, on the node where the file system is mounted run:

$ ${ALLUXIO_HOME}/integration/fuse/bin/alluxio-fuse unmount mount_point

This unmounts the file system at the mount point and stops the corresponding Alluxio-FUSE process. For example,

$ ${ALLUXIO_HOME}/integration/fuse/bin/alluxio-fuse unmount /mnt/people
Unmount fuse at /mnt/people (PID:97626).

Check the Alluxio POSIX API mounting status

To list the mount points, on the node where the file system is mounted run:

$ ${ALLUXIO_HOME}/integration/fuse/bin/alluxio-fuse stat

This outputs the pid, mount_point, and alluxio_path of all the running Alluxio-FUSE processes.

For example, the output could be:

pid mount_point alluxio_path
80846 /mnt/people /people
80847 /mnt/sales  /sales
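
Because the output is whitespace-separated columns, it can be filtered with standard tools. The sketch below assumes the layout shown in the example (one header line, then pid, mount_point, and alluxio_path columns); find_fuse_pid is a hypothetical helper, and the sample text is copied from the example output rather than from a live cluster.

```shell
# Sketch only: find_fuse_pid is a hypothetical helper. It reads stat-style
# output on stdin and prints the pid whose mount_point column equals $1.
find_fuse_pid() {
  awk -v mp="$1" 'NR > 1 && $2 == mp { print $1 }'
}

# Sample text taken from the example output, not from a running system.
sample='pid mount_point alluxio_path
80846 /mnt/people /people
80847 /mnt/sales  /sales'

printf '%s\n' "$sample" | find_fuse_pid /mnt/sales   # prints 80847
```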

Advanced Setup

Configure Alluxio fuse options

These are the configuration parameters for Alluxio POSIX API.

Parameter	Default Value	Description
alluxio.fuse.cached.paths.max	500	Defines the size of the internal Alluxio-FUSE cache that maintains the most frequently used translations between local file system paths and Alluxio file URIs.
alluxio.fuse.debug.enabled	false	Enable FUSE debug output. This output will be redirected to a `fuse.out` log file inside `alluxio.logs.dir`.
alluxio.fuse.fs.name	alluxio-fuse	Descriptive name used by FUSE to mount the file system.
alluxio.fuse.jnifuse.enabled	true	Use JNI-Fuse library for better performance. If disabled, JNR-Fuse will be used.
alluxio.fuse.shared.caching.reader.enabled	false	(Experimental) Use a shared gRPC data reader for better performance on multi-process file reading through Alluxio JNI-Fuse. Block data will be cached on the client side, so more memory is required for the FUSE process.
alluxio.fuse.logging.threshold	10s	Log a FUSE API call when it takes longer than the threshold.
alluxio.fuse.maxwrite.bytes	131072	The desired granularity of FUSE write upcalls in bytes. Note that 128K is currently an upper bound imposed by the Linux kernel.
alluxio.fuse.user.group.translation.enabled	false	Whether to translate Alluxio users and groups into Unix users and groups when exposing Alluxio files through the FUSE API. When this property is set to false, the user and group for all FUSE files will match the user who started the alluxio-fuse process.

Configure mount point options

You can use -o [mount options] to set mount options. If you want to set multiple mount options, you can pass in comma separated mount options as the value of -o. The -o [mount options] must follow the mount command.

Different versions of libfuse and osxfuse may support different mount options. The available Linux mount options are listed in the libfuse documentation, and the mount options for macOS are listed in the osxfuse documentation. Some mount options (e.g. allow_other and allow_root) need additional setup, and the setup process may differ depending on the platform.

$ ${ALLUXIO_HOME}/integration/fuse/bin/alluxio-fuse mount \
  -o [comma separated mount options] mount_point [alluxio_path]

Tuning mount options

Mount option	Default value	Tuning suggestion	Description
direct_io	set by default in JNR-Fuse	don’t set in JNI-Fuse	When `direct_io` is enabled, the kernel will not cache data or perform read-ahead. `direct_io` is enabled by default in JNR-Fuse but is not recommended in JNI-Fuse because it may cause stability issues under high I/O load.
kernel_cache		Cannot be set in JNR-Fuse; recommended in JNI-Fuse depending on the workload	`kernel_cache` utilizes the kernel’s system cache and improves read performance. It should only be enabled on filesystems where the file data is never changed externally (that is, not through the mounted FUSE filesystem).
attr_timeout=N	1.0	7200	The timeout in seconds for which file/directory attributes are cached. The default is 1 second. Setting a larger value is recommended to reduce the time spent retrieving file metadata from the Alluxio master and improve performance.
big_writes		Set	Stops FUSE from splitting I/O into small chunks and speeds up writes.
entry_timeout=N	1.0	7200	The timeout in seconds for which name lookups are cached. The default is 1 second. Setting a larger value is recommended to reduce file metadata operations in Alluxio-FUSE and improve performance.
max_read=N	131072	Use default value	Defines the maximum size of data that can be read in a single FUSE request. The libfuse default is infinite. Note that the size of read requests is limited anyway to 32 pages (which is 128kbyte on i386).
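
As an illustration of passing several tuning options at once, the sketch below joins a list of options with commas to form the value of -o. join_opts is a hypothetical helper, the option values are the suggestions from the table for JNI-Fuse, and the command is only printed (a dry run), not executed.

```shell
# Sketch only: join_opts is a hypothetical helper that joins its arguments
# with commas, which is the form expected after -o.
join_opts() {
  old_ifs=$IFS
  IFS=,
  joined="$*"      # "$*" joins arguments using the first character of IFS
  IFS=$old_ifs
  printf '%s\n' "$joined"
}

opts=$(join_opts kernel_cache attr_timeout=7200 entry_timeout=7200 big_writes)

# Dry run: print the mount command instead of running it.
echo "integration/fuse/bin/alluxio-fuse mount -o $opts /mnt/people /people"
```

Whether each option is accepted depends on the libfuse or osxfuse version in use, so the list should be adjusted to the platform.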

A special mount option is max_idle_threads=N, which defines the maximum number of idle FUSE daemon threads allowed. If the value is too small, FUSE may frequently create and destroy threads, which introduces extra performance overhead. Note that libfuse introduced this mount option in 3.2, while JNI-Fuse supports libfuse 2.9.X during its experimental stage. The Alluxio FUSE Docker image alluxio/alluxio-fuse enables this option by modifying the libfuse source code.

If you are using the Alluxio FUSE Docker image, set MAX_IDLE_THREADS via an environment variable:

$ docker run --rm \
    ...
    --env MAX_IDLE_THREADS=64 \
    alluxio/alluxio-fuse fuse

Example: `allow_other` and `allow_root`

By default, an Alluxio-FUSE mount point can only be accessed by the user who mounted the Alluxio namespace to the local filesystem.

For Linux, add the following line to file /etc/fuse.conf to allow other users or allow root to access the mounted folder:

user_allow_other

This option allows non-root users to specify the allow_other or allow_root mount options.

For macOS, follow the osxfuse allow_other instructions to allow other users to use the allow_other and allow_root mount options.

After setting up, pass the allow_other or allow_root mount options when mounting Alluxio-FUSE:

# All users (including root) can access the files.
$ integration/fuse/bin/alluxio-fuse mount -o allow_other mount_point [alluxio_path]
# The user mounting the filesystem and root can access the files.
$ integration/fuse/bin/alluxio-fuse mount -o allow_root mount_point [alluxio_path]

Note that only one of allow_other and allow_root can be set.
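
Both rules from this example (user_allow_other must be present in the FUSE configuration file on Linux, and the two options are mutually exclusive) can be checked before mounting. The following is a minimal sketch; both functions are hypothetical helpers, and the example uses a scratch file in place of /etc/fuse.conf.

```shell
# Sketch only: both functions are hypothetical helpers for illustration.

# Succeeds if the given fuse.conf-style file has an uncommented
# user_allow_other line.
has_user_allow_other() {
  grep -Eq '^[[:space:]]*user_allow_other[[:space:]]*$' "$1"
}

# Fails if a comma-separated option string contains both allow_other
# and allow_root.
check_allow_opts() {
  case ",$1," in
    *,allow_other,*)
      case ",$1," in
        *,allow_root,*) return 1 ;;
      esac ;;
  esac
  return 0
}

# Example with a scratch file standing in for /etc/fuse.conf.
conf=$(mktemp)
echo 'user_allow_other' > "$conf"
has_user_allow_other "$conf" && echo "user_allow_other is set"
check_allow_opts "allow_other,allow_root" || echo "choose only one of allow_other and allow_root"
```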

Assumptions and Limitations

Currently, most basic file system operations are supported. However, due to Alluxio’s intrinsic characteristics, please be aware that:

Files can be written only once, only sequentially, and can never be modified. This means that overwriting a file is not allowed; an explicit delete followed by a create is needed. For example, the cp command will fail when the destination file exists. The vi and vim commands will succeed because they perform a combination of create, delete, and rename operations underneath.
Alluxio does not have hard links or soft links, so commands like ln are not supported. The hard link count is not displayed in ll output.
The user and group are mapped to the Unix user and group only when the Alluxio POSIX API is configured to use the shell user group translation service, by setting alluxio.fuse.user.group.translation.enabled to true. Otherwise, chown and chgrp are no-ops, and ll returns the user and group of the user who started the Alluxio-FUSE process. The translation service does not change the actual file permission when running ll.
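
The delete-then-create pattern from the first limitation can be wrapped in a small function so that scripts do not rely on overwriting in place. This is a minimal sketch; replace_file is a hypothetical helper, and the example runs in a temporary local directory standing in for a path under the mount point.

```shell
# Sketch only: replace_file is a hypothetical helper. It removes the
# destination if it exists and then copies the source, instead of
# overwriting the destination in place.
replace_file() {
  src=$1 dst=$2
  rm -f "$dst" && cp "$src" "$dst"
}

# Example in a temporary directory standing in for /mnt/people.
work=$(mktemp -d)
echo "version 1" > "$work/data.txt"
echo "version 2" > "$work/new.txt"
replace_file "$work/new.txt" "$work/data.txt"
cat "$work/data.txt"   # prints version 2
```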

Performance Optimization

Because of the additional FUSE layer, the performance of the mounted file system is expected to be lower than that of using the Alluxio Java client directly.

Most of the overhead comes from the several memory copies performed for each read or write call. FUSE also caps the maximum granularity of writes to 128KB. This could probably be improved to a large extent by leveraging the FUSE write-back cache feature introduced in the Linux 3.15 kernel (supported by libfuse 3.x but not yet supported in jnr-fuse/jni-fuse).

Based on our experience, the following client options are useful when running deep learning workloads against Alluxio JNI-Fuse. If you find other options useful, please share them with us via the Alluxio community Slack channel or a pull request. Note that these changes should be made before the mounting steps.

Enable Metadata Caching

The Alluxio FUSE process can cache file metadata locally to reduce the overhead of repeatedly requesting metadata of the same file from the Alluxio master. Enable this when the workload repeatedly requests information about a large number of files.

Configuration	Default Value	Description
alluxio.user.metadata.cache.enabled	false	If this is enabled, metadata of paths will be cached. The cached metadata will be evicted when it expires after alluxio.user.metadata.cache.expiration.time or the cache size is over the limit of alluxio.user.metadata.cache.max.size.
alluxio.user.metadata.cache.max.size	100000	Maximum number of paths with cached metadata. Only valid if the filesystem is alluxio.client.file.MetadataCachingBaseFileSystem.
alluxio.user.metadata.cache.expiration.time	10min	Metadata will expire and be evicted after being cached for this time period. Only valid if the filesystem is alluxio.client.file.MetadataCachingBaseFileSystem.

For example, a workload that repeatedly gets information of 1 million files and runs for 50 minutes can set the following configuration:

alluxio.user.metadata.cache.enabled=true
alluxio.user.metadata.cache.max.size=1000000
alluxio.user.metadata.cache.expiration.time=1h

The metadata of 1 million files usually takes around 25MB to 100MB. Enabling the metadata cache may also introduce some overhead, but it may not be as large as that of the client data cache.
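
For a rough memory budget at other cache sizes, the 25MB to 100MB figure can be scaled by the number of cached paths. The sketch below assumes the footprint grows linearly with the path count, which is an assumption for illustration and not a measured property; estimate_cache_mb is a hypothetical helper, and results are rounded down to whole megabytes.

```shell
# Sketch only: estimate_cache_mb is a hypothetical helper. It assumes the
# metadata footprint scales linearly from 25MB-100MB per 1,000,000 paths.
estimate_cache_mb() {
  paths=$1
  low=$(( paths * 25 / 1000000 ))    # lower bound in MB, rounded down
  high=$(( paths * 100 / 1000000 ))  # upper bound in MB, rounded down
  echo "${low}MB to ${high}MB"
}

estimate_cache_mb 1000000   # prints 25MB to 100MB
estimate_cache_mb 4000000   # prints 100MB to 400MB
```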

Other Performance or Debugging Tips

The following client options may affect training performance or provide more information about the training workload.

Configuration	Default Value	Description
alluxio.user.metrics.collection.enabled	false	Enable the collection of fuse client side metrics like short-circuit read/write information to show on the Alluxio Web UI.
alluxio.user.logging.threshold	10s	Logging a client RPC when it takes more time than the threshold.
alluxio.user.unsafe.direct.local.io.enabled	false	(Experimental) If this is enabled, clients will read from the local worker directly without invoking extra RPCs to the worker to obtain locations. Note that this optimization is only safe when the workload is read-only and the worker has only one tier and one storage directory in this tier.
alluxio.user.update.file.accesstime.disabled	false	(Experimental) By default, an RPC is issued to the Alluxio master to update the file access time whenever a user accesses a file. If this is enabled, clients do not update the file access time, which may improve file access performance but cause issues for some applications.
alluxio.user.block.worker.client.pool.max	1024	Limits the number of block worker clients for Alluxio JNI-Fuse to read data from remote workers or validate block locations. Some deep learning training jobs don’t release the block worker clients immediately and may get stuck waiting for an available one.
alluxio.user.block.master.client.pool.size.max	1024	Limits the number of block master clients for Alluxio JNI-Fuse to get block information.
alluxio.user.file.master.client.pool.size.max	1024	Limits the number of file master clients for Alluxio JNI-Fuse to get or update file metadata.

Increase Direct Memory Size

If you encounter out-of-direct-memory errors, add the following JVM options to ${ALLUXIO_HOME}/conf/alluxio-env.sh to increase the maximum amount of direct memory.

ALLUXIO_FUSE_JAVA_OPTS+=" -XX:MaxDirectMemorySize=8G"