Filesystem API


Applications primarily interact with Alluxio through its Filesystem API. Java users can either use the Alluxio Java Client, or the Hadoop-Compatible Java Client, which wraps the Alluxio Java Client to implement the Hadoop API.

Alluxio also provides a POSIX API after mounting Alluxio as a local FUSE volume.

By setting up an Alluxio Proxy, users can also interact with Alluxio through a REST API similar to the Filesystem API. The REST API is currently used for the Go and Python language bindings.

A fourth option is to interact with Alluxio through its S3 API. Users can interact using the same S3 clients used for AWS S3 operations. This makes it easy to change existing S3 workloads to use Alluxio.

Java Client

Alluxio provides access to data through a filesystem interface. Files in Alluxio offer write-once semantics: they become immutable after they have been written in their entirety and cannot be read before being completed. Alluxio provides users two different Filesystem APIs to access the same file system:

  1. Alluxio file system API and
  2. Hadoop compatible file system API

The Alluxio file system API provides full functionality, while the Hadoop-compatible API, with some limitations, lets users leverage Alluxio without modifying existing code written against Hadoop's API.

Configuring Dependency

To build a Java application that accesses the Alluxio file system using Maven, include the alluxio-shaded-client artifact in your pom.xml as follows:

    <dependency>
      <groupId>org.alluxio</groupId>
      <artifactId>alluxio-shaded-client</artifactId>
      <version>2.3.0</version>
    </dependency>

Available since 2.0.1, this artifact is self-contained: it includes all of its transitive dependencies in shaded form to prevent potential dependency conflicts. It is the generally recommended artifact for projects that use the Alluxio client.

Alternatively, an application can depend on the alluxio-core-client-fs artifact for the Alluxio file system interface or on the alluxio-core-client-hdfs artifact for the Hadoop-compatible file system interface. These two artifacts do not include transitive dependencies and are therefore much smaller. Both are also included in the alluxio-shaded-client artifact.

Alluxio Java API

This section introduces the basic operations of the Alluxio FileSystem interface. Read its javadoc for the complete list of API methods. All resources in the Alluxio Java API are specified through an AlluxioURI, which represents the path to the resource.

Getting a Filesystem Client

To obtain an Alluxio filesystem client in Java code, use:

    FileSystem fs = FileSystem.Factory.get();

Creating a File

All metadata operations as well as opening a file for reading or creating a file for writing are executed through the FileSystem object. Since Alluxio files are immutable once written, the idiomatic way to create files is to use FileSystem#createFile(AlluxioURI), which returns a stream object that can be used to write the file. For example:

    import alluxio.AlluxioURI;
    import alluxio.client.file.FileOutStream;
    import alluxio.client.file.FileSystem;

    FileSystem fs = FileSystem.Factory.get();
    AlluxioURI path = new AlluxioURI("/myFile");
    // Create a file and get its output stream
    FileOutStream out = fs.createFile(path);
    // Write data
    out.write(...);
    // Close and complete the file
    out.close();
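
The write-once semantics described above can be illustrated with a small, self-contained model. This is only a sketch and does not use the Alluxio client: WriteOnceFile and its methods are hypothetical names invented here, and the class merely mirrors the two rules that a file cannot be read before it is completed and cannot be modified afterwards.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Toy model of write-once semantics; hypothetical, not part of the Alluxio API.
final class WriteOnceFile {
  private final ByteArrayOutputStream buffer = new ByteArrayOutputStream();
  private boolean completed = false;

  // Appends data while the file is still being written.
  void write(byte[] data) {
    if (completed) {
      throw new IllegalStateException("file is immutable once completed");
    }
    buffer.write(data, 0, data.length);
  }

  // Completes the file; from here on it can be read but not modified.
  void close() {
    completed = true;
  }

  // Returns the contents; only allowed after the file has been completed.
  byte[] read() {
    if (!completed) {
      throw new IllegalStateException("file cannot be read before it is completed");
    }
    return buffer.toByteArray();
  }

  public static void main(String[] args) {
    WriteOnceFile file = new WriteOnceFile();
    file.write("hello".getBytes(StandardCharsets.UTF_8));
    file.close();
    System.out.println(new String(file.read(), StandardCharsets.UTF_8));
  }
}
```

In the real client the same life cycle is driven by FileOutStream: the file becomes readable only after close() has completed it.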

Specifying Operation Options

For all FileSystem operations, an additional options field may be specified, which allows users to specify non-default settings for the operation. For example:

    // CreateFilePOptions is in the alluxio.grpc package; Constants is in alluxio.
    FileSystem fs = FileSystem.Factory.get();
    AlluxioURI path = new AlluxioURI("/myFile");
    // Generate options to set a custom block size of 64 MB
    CreateFilePOptions options = CreateFilePOptions.newBuilder()
        .setBlockSizeBytes(64 * Constants.MB)
        .build();
    FileOutStream out = fs.createFile(path, options);

Programmatically Modifying Configuration

Alluxio configuration can be set through alluxio-site.properties, but these properties apply to all instances of Alluxio that read from the file. If fine-grained configuration management is required, pass in a customized configuration object when creating the FileSystem object. The generated FileSystem object will have modified configuration properties, independent of any other FileSystem clients.

    // InstancedConfiguration and PropertyKey are in the alluxio.conf package.
    FileSystem normalFs = FileSystem.Factory.get();
    AlluxioURI normalPath = new AlluxioURI("/normalFile");
    // Create a file with default properties
    FileOutStream normalOut = normalFs.createFile(normalPath);
    ...
    normalOut.close();

    // Create a file system with custom configuration
    InstancedConfiguration conf = InstancedConfiguration.defaults();
    conf.set(PropertyKey.SECURITY_LOGIN_USERNAME, "alice");
    FileSystem customizedFs = FileSystem.Factory.create(conf);
    AlluxioURI customizedPath = new AlluxioURI("/customizedFile");
    // The newly created file will be created under the username "alice"
    FileOutStream customizedOut = customizedFs.createFile(customizedPath);
    ...
    customizedOut.close();

    // normalFs can still be used as a FileSystem client with the default username.
    // Likewise, using customizedFs will use the username "alice".

IO Options

Alluxio uses two different storage types: Alluxio managed storage and under storage. Alluxio managed storage is the memory, SSD, and/or HDD allocated to Alluxio workers. Under storage is the storage resource managed by the underlying storage system, such as S3, Swift or HDFS. Users can specify the interaction with Alluxio managed storage and under storage through ReadType and WriteType. ReadType specifies the data read behavior when reading a file. WriteType specifies the data write behavior when writing a new file, i.e., whether the data should be written to Alluxio storage, to the under storage, or to both.

Below is a table of the expected behaviors of ReadType. Reads will always prefer Alluxio storage over the under storage.

Read Type | Behavior
CACHE_PROMOTE | Data is moved to the highest tier in the worker where the data was read. If the data was not in the Alluxio storage of the local worker, a replica will be added to the local Alluxio worker. If there is no local Alluxio worker, a replica will be added to a remote Alluxio worker if the data was fetched from the under storage system.
CACHE | If the data was not in the Alluxio storage of the local worker, a replica will be added to the local Alluxio worker. If there is no local Alluxio worker, a replica will be added to a remote Alluxio worker if the data was fetched from the under storage system.
NO_CACHE | Data is read without storing a replica in Alluxio. Note that a replica may already exist in Alluxio.

Below is a table of the expected behaviors of WriteType.

Write Type | Behavior
CACHE_THROUGH | Data is written synchronously to an Alluxio worker and the under storage system.
MUST_CACHE | Data is written synchronously to an Alluxio worker. No data will be written to the under storage. This is the default write type.
THROUGH | Data is written synchronously to the under storage. No data will be written to Alluxio.
ASYNC_THROUGH | Data is written synchronously to an Alluxio worker and asynchronously to the under storage system. Experimental.
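
The two tables above can be condensed into a small lookup structure, which may help when deciding which type fits a workload. The following is a minimal sketch, not Alluxio code: ReadTypeSketch, WriteTypeSketch and UnderStorageWrite are illustrative stand-ins invented here, and the real client uses its own enum types.

```java
// Illustrative stand-ins that restate the ReadType and WriteType tables.
// None of these types belong to the Alluxio client.
enum UnderStorageWrite { SYNC, ASYNC, NEVER }

enum WriteTypeSketch {
  CACHE_THROUGH(true, UnderStorageWrite.SYNC),
  MUST_CACHE(true, UnderStorageWrite.NEVER),
  THROUGH(false, UnderStorageWrite.SYNC),
  ASYNC_THROUGH(true, UnderStorageWrite.ASYNC);

  final boolean writesToAlluxio;        // written synchronously to an Alluxio worker
  final UnderStorageWrite underStorage; // how the under storage is written, if at all

  WriteTypeSketch(boolean writesToAlluxio, UnderStorageWrite underStorage) {
    this.writesToAlluxio = writesToAlluxio;
    this.underStorage = underStorage;
  }
}

enum ReadTypeSketch {
  CACHE_PROMOTE(true, true),
  CACHE(true, false),
  NO_CACHE(false, false);

  final boolean addsReplica;       // a read may add a replica to an Alluxio worker
  final boolean promotesToTopTier; // data is moved to the highest tier of the worker

  ReadTypeSketch(boolean addsReplica, boolean promotesToTopTier) {
    this.addsReplica = addsReplica;
    this.promotesToTopTier = promotesToTopTier;
  }
}

final class IoOptionsSketch {
  public static void main(String[] args) {
    // Print one summary line per write type.
    for (WriteTypeSketch w : WriteTypeSketch.values()) {
      System.out.println(w + ": Alluxio=" + w.writesToAlluxio
          + ", under storage=" + w.underStorage);
    }
  }
}
```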

Location policy

Alluxio provides location policies that choose which workers store the blocks of a file.

Using Alluxio’s Java API, users can set the policy in CreateFilePOptions for writing files and OpenFilePOptions for reading files into Alluxio.

Users can override the default policy class in the configuration file through the property alluxio.user.block.write.location.policy.class. The built-in policies include:

  • LocalFirstPolicy (alluxio.client.block.policy.LocalFirstPolicy)

    Returns the local worker first, and if it does not have enough capacity for a block, randomly picks a worker from the active workers list. This is the default policy.

  • MostAvailableFirstPolicy (alluxio.client.block.policy.MostAvailableFirstPolicy)

    Returns the worker with the most available bytes.

  • RoundRobinPolicy (alluxio.client.block.policy.RoundRobinPolicy)

    Chooses the worker for the next block in a round-robin manner and skips workers that do not have enough capacity.

  • SpecificHostPolicy (alluxio.client.block.policy.SpecificHostPolicy)

    Returns a worker with the specified host name. This policy cannot be set as default policy.

Alluxio supports custom policies, so you can also develop a policy appropriate for your workload by implementing the interface alluxio.client.block.policy.BlockLocationPolicy. Note that a default policy must have a constructor which takes alluxio.conf.AlluxioConfiguration. To use the ASYNC_THROUGH write type, all the blocks of a file must be written to the same worker.
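
To make the selection rules of the built-in policies more concrete, here is a self-contained sketch of two of them as described above. It is a simplified model under stated assumptions, not the Alluxio implementation: Worker and PolicySketch are hypothetical classes, the real BlockLocationPolicy interface has a different signature, and worker state is reduced to a host name and a number of available bytes.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical, minimal worker description used only by this sketch.
final class Worker {
  final String host;
  final long availableBytes;

  Worker(String host, long availableBytes) {
    this.host = host;
    this.availableBytes = availableBytes;
  }
}

// Simplified models of two built-in policies; not the Alluxio classes.
final class PolicySketch {
  private int nextIndex = 0; // round-robin cursor, kept between calls

  // MostAvailableFirst: returns the worker with the most available bytes,
  // or null if the list is empty.
  static Worker mostAvailableFirst(List<Worker> workers) {
    Worker best = null;
    for (Worker w : workers) {
      if (best == null || w.availableBytes > best.availableBytes) {
        best = w;
      }
    }
    return best;
  }

  // RoundRobin: returns the next worker in order, skipping workers that
  // cannot hold the block. Returns null if no worker has enough capacity.
  Worker roundRobin(List<Worker> workers, long blockSizeBytes) {
    for (int i = 0; i < workers.size(); i++) {
      Worker candidate = workers.get(nextIndex);
      nextIndex = (nextIndex + 1) % workers.size();
      if (candidate.availableBytes >= blockSizeBytes) {
        return candidate;
      }
    }
    return null;
  }

  public static void main(String[] args) {
    List<Worker> workers = Arrays.asList(
        new Worker("a", 100), new Worker("b", 10), new Worker("c", 50));
    PolicySketch policy = new PolicySketch();
    System.out.println(policy.roundRobin(workers, 20).host);
    System.out.println(mostAvailableFirst(workers).host);
  }
}
```

The round-robin cursor is kept as instance state, so successive blocks of the same file are spread across workers, whereas a policy that always returns the same worker is what ASYNC_THROUGH requires.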

Write Tier

Alluxio allows a client to select a tier preference when writing blocks to a local worker. Currently this policy preference exists only for local workers, not remote workers; remote workers will write blocks to the highest tier.

By default, data is written to the top tier. Users can modify the default setting through the alluxio.user.file.write.tier.default configuration property or override it through an option to the FileSystem#createFile(AlluxioURI) API call.
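
As a rough illustration of how a tier preference could map to an actual tier, the sketch below resolves an integer preference against the number of tiers on a worker. It assumes the interpretation commonly given for alluxio.user.file.write.tier.default, where non-negative values count from the top tier down, negative values count from the bottom tier up, and out-of-range values are clamped. That interpretation is an assumption here and should be checked against the property reference for your version. WriteTierSketch and resolveTier are hypothetical names.

```java
// Hypothetical helper; models an assumed interpretation of the write tier
// preference and is not part of the Alluxio client.
final class WriteTierSketch {

  // Returns the zero-based tier index (0 = top tier) for a preference.
  // Assumption: 0, 1, 2... count from the top; -1, -2... count from the
  // bottom; values outside the range are clamped to the nearest tier.
  static int resolveTier(int preference, int numTiers) {
    if (numTiers <= 0) {
      throw new IllegalArgumentException("numTiers must be positive");
    }
    if (preference >= 0) {
      return Math.min(preference, numTiers - 1);
    }
    return Math.max(numTiers + preference, 0);
  }

  public static void main(String[] args) {
    // With three tiers, the default preference 0 selects the top tier.
    System.out.println(resolveTier(0, 3));
    // A preference of -1 selects the lowest tier.
    System.out.println(resolveTier(-1, 3));
  }
}
```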

Accessing an existing file in Alluxio

All operations on existing files or directories require the user to specify the AlluxioURI. With the AlluxioURI, the user may use any of the methods of FileSystem to access the resource.

Reading Data

An AlluxioURI can be used to perform Alluxio FileSystem operations, such as modifying the file metadata (e.g., TTL or pin state) or getting an input stream to read the file.

For example, to read a file:

    FileSystem fs = FileSystem.Factory.get();
    AlluxioURI path = new AlluxioURI("/myFile");
    // Open the file for reading
    FileInStream in = fs.openFile(path);
    // Read data
    in.read(...);
    // Close file relinquishing the lock
    in.close();

Javadoc

For additional API information, please refer to the Alluxio javadocs.

Hadoop-Compatible Java Client

On top of the Alluxio file system, Alluxio also provides a convenience class, alluxio.hadoop.FileSystem, which gives applications a Hadoop-compatible FileSystem interface. This client translates Hadoop file operations to Alluxio file system operations, allowing users to reuse existing code written for Hadoop without modification. Read its javadoc for more details.

Example

The following example reads an ORC file from the Alluxio file system using the Hadoop interface.

    // Create a new Hadoop configuration
    org.apache.hadoop.conf.Configuration conf = new org.apache.hadoop.conf.Configuration();
    // Make the Hadoop client bind alluxio.hadoop.FileSystem for URIs like alluxio://
    conf.set("fs.alluxio.impl", "alluxio.hadoop.FileSystem");
    conf.set("fs.AbstractFileSystem.alluxio.impl", "alluxio.hadoop.AlluxioFileSystem");
    // Now an Alluxio address can be used like any other Hadoop-compatible file system URI
    org.apache.orc.OrcFile.ReaderOptions options = new org.apache.orc.OrcFile.ReaderOptions(conf);
    org.apache.orc.Reader orc = org.apache.orc.OrcFile.createReader(
        new org.apache.hadoop.fs.Path("alluxio://localhost:19998/path/file.orc"), options);

REST API

For portability with other languages, the Alluxio API is also accessible via an HTTP proxy in the form of a REST API.

The REST API documentation is generated as part of the Alluxio build and is accessible through ${ALLUXIO_HOME}/core/server/proxy/target/miredot/index.html. The main difference between the REST API and the Alluxio Java API is in how streams are represented. While the Alluxio Java API can use in-memory streams, the REST API decouples stream creation from stream access (see the create and open REST API methods and the streams resource endpoints for details).

The HTTP proxy is a standalone server that can be started using ${ALLUXIO_HOME}/bin/alluxio-start.sh proxy and stopped using ${ALLUXIO_HOME}/bin/alluxio-stop.sh proxy. By default, the REST API is available on port 39999.

There are performance implications of using the HTTP proxy. In particular, using the proxy requires an extra hop. For optimal performance, it is recommended to run the proxy server and an Alluxio worker on each compute node.
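
The decoupling of stream creation and access can be pictured as two families of URLs: one addressed by Alluxio path, which creates or opens a stream and returns its id, and one addressed by stream id, which reads, writes, or closes it. The sketch below only builds such URLs and sends no requests. The endpoint layout (/api/v1/paths/<path>/<action> and /api/v1/streams/<id>/<action>) and the action names are assumptions that should be checked against the generated REST documentation. ProxyUrlSketch is a hypothetical helper.

```java
// Hypothetical URL builder for the proxy; the endpoint layout is an
// assumption and must be verified against the generated REST documentation.
final class ProxyUrlSketch {
  private final String base;

  ProxyUrlSketch(String host, int port) {
    this.base = "http://" + host + ":" + port + "/api/v1";
  }

  // URL for an operation addressed by Alluxio path, e.g. creating a file.
  // The Alluxio path keeps its leading slash, so the URL contains "//".
  String pathUrl(String alluxioPath, String action) {
    return base + "/paths/" + alluxioPath + "/" + action;
  }

  // URL for an operation on an already created stream, addressed by its id.
  String streamUrl(long streamId, String action) {
    return base + "/streams/" + streamId + "/" + action;
  }

  public static void main(String[] args) {
    ProxyUrlSketch proxy = new ProxyUrlSketch("localhost", 39999);
    // Step 1: create the file; the response is expected to carry a stream id.
    System.out.println(proxy.pathUrl("/myFile", "create-file"));
    // Steps 2 and 3: write to that stream, then close it (id 1 is made up).
    System.out.println(proxy.streamUrl(1, "write"));
    System.out.println(proxy.streamUrl(1, "close"));
  }
}
```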

Python

Alluxio has a Python Client for interacting with Alluxio through its REST API. The Python client exposes an API similar to the Alluxio Java API. See the doc for detailed documentation about all available methods. See the example of how to perform basic filesystem operations in Alluxio.

Alluxio Proxy dependency

The Python client interacts with Alluxio through the REST API provided by the Alluxio proxy.

The proxy is a standalone server that can be started using ${ALLUXIO_HOME}/bin/alluxio-start.sh proxy and stopped using ${ALLUXIO_HOME}/bin/alluxio-stop.sh proxy. By default, the REST API is available on port 39999.

There are performance implications of using the HTTP proxy. In particular, using the proxy requires an extra hop. For optimal performance, it is recommended to run the proxy server and an Alluxio worker on each compute node.

Install Python Client Library

    $ pip install alluxio

Example Usage

The following program shows how to create a directory, write and read a file, get and list file status, rename, delete, and check existence for paths in Alluxio.

    #!/usr/bin/env python
    # -*- coding: utf-8 -*-
    import json
    import sys

    import alluxio
    from alluxio import option


    def colorize(code):
        # Returns a function that wraps text in the given ANSI color code
        def _(text, bold=False):
            c = code
            if bold:
                c = '1;%s' % c
            return '\033[%sm%s\033[0m' % (c, text)
        return _


    green = colorize('32')


    def info(s):
        print(green(s))


    def pretty_json(obj):
        return json.dumps(obj, indent=2)


    def main():
        py_test_root_dir = '/py-test-dir'
        py_test_nested_dir = '/py-test-dir/nested'
        py_test = py_test_nested_dir + '/py-test'
        py_test_renamed = py_test_root_dir + '/py-test-renamed'

        # Connect to the Alluxio proxy on its default port
        client = alluxio.Client('localhost', 39999)

        info("creating directory %s" % py_test_nested_dir)
        opt = option.CreateDirectory(recursive=True)
        client.create_directory(py_test_nested_dir, opt)
        info("done")

        info("writing to %s" % py_test)
        with client.open(py_test, 'w') as f:
            f.write('Alluxio works with Python!\n')
            with open(sys.argv[0]) as this_file:
                f.write(this_file)
        info("done")

        info("getting status of %s" % py_test)
        stat = client.get_status(py_test)
        print(pretty_json(stat.json()))
        info("done")

        info("renaming %s to %s" % (py_test, py_test_renamed))
        client.rename(py_test, py_test_renamed)
        info("done")

        info("getting status of %s" % py_test_renamed)
        stat = client.get_status(py_test_renamed)
        print(pretty_json(stat.json()))
        info("done")

        info("reading %s" % py_test_renamed)
        with client.open(py_test_renamed, 'r') as f:
            print(f.read())
        info("done")

        info("listing status of paths under /")
        root_stats = client.list_status('/')
        for stat in root_stats:
            print(pretty_json(stat.json()))
        info("done")

        info("deleting %s" % py_test_root_dir)
        opt = option.Delete(recursive=True)
        client.delete(py_test_root_dir, opt)
        info("done")

        info("asserting that %s is deleted" % py_test_root_dir)
        assert not client.exists(py_test_root_dir)
        info("done")


    if __name__ == '__main__':
        main()

Go

Alluxio has a Go Client for interacting with Alluxio through its REST API. The Go client exposes an API similar to the Alluxio Java API. See the godoc for detailed documentation about all available methods. The godoc includes examples of how to download, upload, check existence for, and list status for files in Alluxio.

Alluxio Proxy dependency

The Go client talks to Alluxio through the REST API provided by the Alluxio proxy.

The proxy is a standalone server that can be started using ${ALLUXIO_HOME}/bin/alluxio-start.sh proxy and stopped using ${ALLUXIO_HOME}/bin/alluxio-stop.sh proxy. By default, the REST API is available on port 39999.

There are performance implications of using the HTTP proxy. In particular, using the proxy requires an extra hop. For optimal performance, it is recommended to run the proxy server and an Alluxio worker on each compute node.

Install Go Client Library

    $ go get -d github.com/Alluxio/alluxio-go

Example Usage

If there is no Alluxio proxy running locally, replace “localhost” below with a hostname of a proxy.

    package main

    import (
        "fmt"
        "io/ioutil"
        "log"
        "strings"
        "time"

        alluxio "github.com/Alluxio/alluxio-go"
        "github.com/Alluxio/alluxio-go/option"
    )

    func write(fs *alluxio.Client, path, s string) error {
        id, err := fs.CreateFile(path, &option.CreateFile{})
        if err != nil {
            return err
        }
        defer fs.Close(id)
        _, err = fs.Write(id, strings.NewReader(s))
        return err
    }

    func read(fs *alluxio.Client, path string) (string, error) {
        id, err := fs.OpenFile(path, &option.OpenFile{})
        if err != nil {
            return "", err
        }
        defer fs.Close(id)
        r, err := fs.Read(id)
        if err != nil {
            return "", err
        }
        defer r.Close()
        content, err := ioutil.ReadAll(r)
        if err != nil {
            return "", err
        }
        return string(content), err
    }

    func main() {
        fs := alluxio.NewClient("localhost", 39999, 10*time.Second)
        path := "/test_path"
        exists, err := fs.Exists(path, &option.Exists{})
        if err != nil {
            log.Fatal(err)
        }
        if exists {
            if err := fs.Delete(path, &option.Delete{}); err != nil {
                log.Fatal(err)
            }
        }
        if err := write(fs, path, "Success"); err != nil {
            log.Fatal(err)
        }
        content, err := read(fs, path)
        if err != nil {
            log.Fatal(err)
        }
        fmt.Printf("Result: %v\n", content)
    }