Use filesystem offloader with Pulsar
This chapter guides you through every step of installing and configuring the filesystem offloader and using it with Pulsar.
Installation
Follow the steps below to install the filesystem offloader.
Prerequisites
Pulsar: 2.4.2 or later versions
Hadoop: 3.x.x
Steps
This example uses Pulsar 2.5.1.
Download the Pulsar tarball using one of the following ways:
Download from the Apache mirror
Download from the Pulsar download page
Use wget
wget https://archive.apache.org/dist/pulsar/pulsar-2.5.1/apache-pulsar-2.5.1-bin.tar.gz
Download and untar the Pulsar offloaders package.
wget https://archive.apache.org/dist/pulsar/pulsar-2.5.1/apache-pulsar-offloaders-2.5.1-bin.tar.gz
tar xvfz apache-pulsar-offloaders-2.5.1-bin.tar.gz
Note
If you are running Pulsar in a bare metal cluster, make sure that the offloaders tarball is unzipped in every broker's Pulsar directory.
If you are running Pulsar in Docker or deploying Pulsar using a Docker image (for example, on Kubernetes or DC/OS), you can use the apachepulsar/pulsar-all image instead of the apachepulsar/pulsar image. The apachepulsar/pulsar-all image already bundles the tiered storage offloaders.
Copy the Pulsar offloaders as offloaders in the Pulsar directory.
mv apache-pulsar-offloaders-2.5.1/offloaders apache-pulsar-2.5.1/offloaders
ls apache-pulsar-2.5.1/offloaders
Output
tiered-storage-file-system-2.5.1.nar
tiered-storage-jcloud-2.5.1.nar
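Before moving on to configuration, it can help to confirm that the filesystem offloader NAR is actually present in the offloaders directory. The following is a minimal shell sketch, not part of Pulsar: has_filesystem_offloader is a hypothetical helper, and the demonstration runs against a temporary directory that mimics the listing above rather than a real installation.

```shell
#!/bin/sh
# has_filesystem_offloader DIR
# Hypothetical helper (not a Pulsar command): succeeds if DIR contains a
# file named tiered-storage-file-system-<version>.nar.
has_filesystem_offloader() {
    dir="$1"
    # An unmatched glob stays as a literal string, so test each candidate.
    for nar in "$dir"/tiered-storage-file-system-*.nar; do
        if [ -f "$nar" ]; then
            return 0
        fi
    done
    return 1
}

# Demonstration against a throwaway directory that mimics the listing above.
demo_dir="$(mktemp -d)"
touch "$demo_dir/tiered-storage-file-system-2.5.1.nar" \
      "$demo_dir/tiered-storage-jcloud-2.5.1.nar"

if has_filesystem_offloader "$demo_dir"; then
    echo "filesystem offloader found"
else
    echo "filesystem offloader missing"
fi

rm -rf "$demo_dir"
```

On a real broker, pass the path of the Pulsar offloaders directory instead, for example has_filesystem_offloader apache-pulsar-2.5.1/offloaders.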
Configuration
Note
Before offloading data from BookKeeper to the filesystem, you need to configure some properties of the filesystem offloader driver.
You can also configure the filesystem offloader to run automatically or trigger it manually.
Configure filesystem offloader driver
You can configure the filesystem offloader driver in the configuration file broker.conf or standalone.conf.
Required configurations are as below.

| Required configuration | Description | Example value |
|---|---|---|
| managedLedgerOffloadDriver | Offloader driver name, which is case-insensitive. | filesystem |
| fileSystemURI | Connection address | hdfs://127.0.0.1:9000 |
| fileSystemProfilePath | Hadoop profile path | ../conf/filesystem_offload_core_site.xml |

Optional configurations are as below.

| Optional configuration | Description | Example value |
|---|---|---|
| managedLedgerMinLedgerRolloverTimeMinutes | Minimum time between ledger rollovers for a topic. Note: it is not recommended that you set this configuration in the production environment. | 2 |
| managedLedgerMaxEntriesPerLedger | Maximum number of entries to append to a ledger before triggering a rollover. Note: it is not recommended that you set this configuration in the production environment. | 5000 |

Offloader driver (required)
Offloader driver name, which is case-insensitive.
This example sets the offloader driver name as filesystem.
managedLedgerOffloadDriver=filesystem
Connection address (required)
Connection address is the URI to access the default Hadoop distributed file system.
Example
This example sets the connection address as hdfs://127.0.0.1:9000.
fileSystemURI=hdfs://127.0.0.1:9000
Hadoop profile path (required)
The configuration file is stored in the Hadoop profile path. It contains various settings for Hadoop performance tuning.
Example
This example sets the Hadoop profile path as ../conf/filesystem_offload_core_site.xml.
fileSystemProfilePath=../conf/filesystem_offload_core_site.xml
You can set the following configurations in the filesystem_offload_core_site.xml file.
<property>
    <name>fs.defaultFS</name>
    <value></value>
</property>
<property>
    <name>hadoop.tmp.dir</name>
    <value>pulsar</value>
</property>
<property>
    <name>io.file.buffer.size</name>
    <value>4096</value>
</property>
<property>
    <name>io.seqfile.compress.blocksize</name>
    <value>1000000</value>
</property>
<property>
    <name>io.seqfile.compression.type</name>
    <value>BLOCK</value>
</property>
<property>
    <name>io.map.index.interval</name>
    <value>128</value>
</property>
Tip
For more information about the Hadoop HDFS, see here.
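The three required driver settings can be written to the configuration file in one pass. The sketch below shows one possible way to do that in shell. It assumes a hypothetical helper named set_conf_property and works on a temporary stand-in file rather than the real broker.conf or standalone.conf; the property names and example values are the ones used in this section.

```shell
#!/bin/sh
# set_conf_property FILE KEY VALUE
# Hypothetical helper: replaces any existing KEY=... line in FILE,
# or appends KEY=VALUE if the key is not present yet.
set_conf_property() {
    file="$1"
    key="$2"
    value="$3"
    tmp="$(mktemp)"
    # Keep every line except existing assignments of this key.
    # grep exits non-zero when nothing is left, which is fine here.
    grep -v "^${key}=" "$file" > "$tmp" || true
    printf '%s=%s\n' "$key" "$value" >> "$tmp"
    # Write back in place so the file keeps its original permissions.
    cat "$tmp" > "$file"
    rm -f "$tmp"
}

# Stand-in for broker.conf (assumption: the key starts out empty).
conf="$(mktemp)"
printf 'managedLedgerOffloadDriver=\n' > "$conf"

set_conf_property "$conf" managedLedgerOffloadDriver filesystem
set_conf_property "$conf" fileSystemURI hdfs://127.0.0.1:9000
set_conf_property "$conf" fileSystemProfilePath ../conf/filesystem_offload_core_site.xml

# Show the resulting settings.
cat "$conf"
rm -f "$conf"
```

Replacing an existing line instead of blindly appending avoids ending up with two conflicting assignments of the same key. Point the helper at conf/broker.conf only after backing the file up, and restart the broker for the change to take effect.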
Configure filesystem offloader to run automatically
Namespace policy can be configured to offload data automatically once a threshold is reached. The threshold is based on the size of data that a topic has stored on a Pulsar cluster. Once the topic reaches the threshold, an offload operation is triggered automatically.

| Threshold value | Action |
|---|---|
| > 0 | It triggers the offloading operation if the topic storage reaches its threshold. |
| = 0 | It causes a broker to offload data as soon as possible. |
| < 0 | It disables automatic offloading operation. |

Automatic offload runs when a new segment is added to a topic log. If you set the threshold on a namespace but few messages are being produced to the topic, the offloader does not run until the current segment is full.
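The three threshold cases can be summarized as a small decision function. This is an illustrative shell sketch only: describe_offload_threshold is a hypothetical name, it expects the threshold as an integer number of bytes, and it merely restates the rules above rather than reproducing broker code.

```shell
#!/bin/sh
# describe_offload_threshold BYTES
# Hypothetical helper that restates the threshold rules from this section.
# BYTES must be an integer (positive, zero, or negative).
describe_offload_threshold() {
    threshold="$1"
    if [ "$threshold" -gt 0 ]; then
        echo "offload once topic storage reaches ${threshold} bytes"
    elif [ "$threshold" -eq 0 ]; then
        echo "offload as soon as possible"
    else
        echo "automatic offloading disabled"
    fi
}

describe_offload_threshold 10485760
describe_offload_threshold 0
describe_offload_threshold -1
```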
You can configure the threshold size using CLI tools, such as pulsar-admin.
Example
This example sets the filesystem offloader threshold size to 10 MB using pulsar-admin.
pulsar-admin namespaces set-offload-threshold --size 10M my-tenant/my-namespace
Tip
For more information about the pulsar-admin namespaces set-offload-threshold options command, including flags, descriptions, default values, and shorthands, see here.
Configure filesystem offloader to run manually
For individual topics, you can trigger filesystem offloader manually using one of the following methods:
Use REST endpoint.
Use CLI tools (such as pulsar-admin).
To trigger via CLI tools, you need to specify the maximum amount of data (threshold) that should be retained on a Pulsar cluster for a topic. If the size of the topic data on the Pulsar cluster exceeds this threshold, segments from the topic are offloaded to the filesystem until the threshold is no longer exceeded. Older segments are offloaded first.
Example
This example triggers the filesystem offloader to run manually using pulsar-admin.
pulsar-admin topics offload --size-threshold 10M persistent://my-tenant/my-namespace/topic1
Output
Offload triggered for persistent://my-tenant/my-namespace/topic1 for messages before 2:0:-1
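To make the size threshold rule concrete, the sketch below models which segments a manual offload would pick. It is a simplified model of the rule as described in this section (offload the oldest segments until the retained data no longer exceeds the threshold), not Pulsar's actual implementation. select_segments_to_offload is a hypothetical function, and the segment sizes are made-up values in the same unit as the threshold.

```shell
#!/bin/sh
# select_segments_to_offload THRESHOLD SIZE...
# Hypothetical model: SIZE arguments are segment sizes listed oldest first.
# Prints the 1-based positions of the segments that would be offloaded.
select_segments_to_offload() {
    threshold="$1"
    shift

    # Total amount of data currently kept on the cluster.
    total=0
    for size in "$@"; do
        total=$((total + size))
    done

    index=0
    selected=""
    for size in "$@"; do
        index=$((index + 1))
        # Stop as soon as the retained data fits within the threshold.
        if [ "$total" -le "$threshold" ]; then
            break
        fi
        selected="$selected $index"
        total=$((total - size))
    done

    # Strip the leading space before printing.
    echo "${selected# }"
}

# Four segments of 4 MB each (oldest first) with a 10 MB threshold.
select_segments_to_offload 10 4 4 4 4   # prints "1 2"
```

With 16 MB stored and a 10 MB threshold, the two oldest segments are selected, which leaves 8 MB on the cluster.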
Tip
For more information about the pulsar-admin topics offload options command, including flags, descriptions, default values, and shorthands, see here.
This example checks filesystem offloader status using pulsar-admin.
pulsar-admin topics offload-status persistent://my-tenant/my-namespace/topic1
Output
Offload is currently running
To wait for the filesystem offloader to complete the job, add the -w flag.
pulsar-admin topics offload-status -w persistent://my-tenant/my-namespace/topic1
Output
Offload was a success
If there is an error in the offloading operation, the error is propagated to the pulsar-admin topics offload-status command.
pulsar-admin topics offload-status persistent://my-tenant/my-namespace/topic1
Output
Reason: Error offloading: org.apache.bookkeeper.mledger.ManagedLedgerException: java.util.concurrent.CompletionException: com.amazonaws.services.s3.model.AmazonS3Exception: Anonymous users cannot initiate multipart uploads. Please authenticate. (Service: Amazon S3; Status Code: 403; Error Code: AccessDenied; Request ID: 798758DE3F1776DF; S3 Extended Request ID: dhBFz/lZm1oiG/oBEepeNlhrtsDlzoOhocuYMpKihQGXe6EG8puRGOkK6UwqzVrMXTWBxxHcS+g=), S3 Extended Request ID: dhBFz/lZm1oiG/oBEepeNlhrtsDlzoOhocuYMpKihQGXe6EG8puRGOkK6UwqzVrMXTWBxxHcS+g=
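When offload status is checked from a script, the three kinds of output shown above can be mapped to simple states. The following shell sketch assumes the message texts match the example outputs in this section exactly, which may not hold for every Pulsar version. classify_offload_status is a hypothetical helper that only inspects text and does not call pulsar-admin itself.

```shell
#!/bin/sh
# classify_offload_status TEXT
# Hypothetical helper: maps the output of
# "pulsar-admin topics offload-status" to a one-word state.
# The matched phrases are copied from the example outputs in this section.
classify_offload_status() {
    case "$1" in
        *"Offload is currently running"*) echo "running" ;;
        *"Offload was a success"*)        echo "success" ;;
        *"Error offloading"*)             echo "failed" ;;
        *)                                echo "unknown" ;;
    esac
}

classify_offload_status "Offload is currently running"
classify_offload_status "Offload was a success"
```

In a real script, the argument would be the captured command output, for example classify_offload_status "$(pulsar-admin topics offload-status persistent://my-tenant/my-namespace/topic1 2>&1)".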
Tip
For more information about the pulsar-admin topics offload-status options command, including flags, descriptions, default values, and shorthands, see here (reference-pulsar-admin.md#offload-status).
Tutorial
For complete step-by-step instructions on how to use the filesystem offloader with Pulsar, see here.