How to connect Pulsar to a database
This tutorial provides a hands-on look at how you can move data out of Pulsar without writing a single line of code.
Reviewing the Pulsar I/O concepts while you run the steps in this guide helps you gain a deeper understanding.
At the end of this tutorial, you can connect Pulsar to Cassandra and to PostgreSQL.
tip
- These instructions assume you are running Pulsar in standalone mode. However, all the commands used in this tutorial can be used in a multi-node Pulsar cluster without any changes.
- All the instructions are assumed to run at the root directory of a Pulsar binary distribution.
Install Pulsar and built-in connector
Before connecting Pulsar to a database, you need to install Pulsar and the desired built-in connector.
To download the Pulsar distribution, see Run a standalone Pulsar cluster locally.
To enable Pulsar connectors, you need to download the connectors’ tarball release from the download page.
After you download the NAR file, copy the file to the connectors directory in the Pulsar directory. For example, if you download the pulsar-io-aerospike-3.1.1.nar connector file, enter the following commands:
mkdir connectors
mv pulsar-io-aerospike-3.1.1.nar connectors
ls connectors
# pulsar-io-aerospike-3.1.1.nar
# ...
note
- If you are running Pulsar in a bare metal cluster, make sure the connectors tarball is unzipped in every pulsar directory of the broker (or in every pulsar directory of function-worker if you are running a separate worker cluster for Pulsar Functions).
- If you are running Pulsar in Docker or deploying Pulsar using a Docker image (e.g. K8S), you can use the apachepulsar/pulsar-all image instead of the apachepulsar/pulsar image. The apachepulsar/pulsar-all image has already bundled all built-in connectors.
Start Pulsar standalone
Start Pulsar locally.
bin/pulsar standalone
All the components of a Pulsar service are started in order.
You can query the following Pulsar service endpoints to make sure the Pulsar service is up and running correctly.
Check Pulsar binary protocol port.
telnet localhost 6650
Check Pulsar Function cluster.
curl -s http://localhost:8080/admin/v2/worker/cluster
Example output
[{"workerId":"c-standalone-fw-localhost-6750","workerHostname":"localhost","port":6750}]
Make sure a public tenant and a default namespace exist.
curl -s http://localhost:8080/admin/v2/namespaces/public
Example output
["public/default","public/functions"]
All built-in connectors should be listed as available.
curl -s http://localhost:8080/admin/v2/functions/connectors
Example output
[{"name":"aerospike","description":"Aerospike database sink","sinkClass":"org.apache.pulsar.io.aerospike.AerospikeStringSink"},{"name":"cassandra","description":"Writes data into Cassandra","sinkClass":"org.apache.pulsar.io.cassandra.CassandraStringSink"},{"name":"kafka","description":"Kafka source and sink connector","sourceClass":"org.apache.pulsar.io.kafka.KafkaStringSource","sinkClass":"org.apache.pulsar.io.kafka.KafkaBytesSink"},{"name":"kinesis","description":"Kinesis sink connector","sinkClass":"org.apache.pulsar.io.kinesis.KinesisSink"},{"name":"rabbitmq","description":"RabbitMQ source connector","sourceClass":"org.apache.pulsar.io.rabbitmq.RabbitMQSource"},{"name":"twitter","description":"Ingest data from Twitter firehose","sourceClass":"org.apache.pulsar.io.twitter.TwitterFireHose"}]
If an error occurs when starting the Pulsar service, you may see an exception in the terminal running bin/pulsar standalone, or you can navigate to the logs directory under the Pulsar directory to view the logs.
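Because the standalone service can take a little while to come up, the checks above may fail on the first try. The following is a minimal sketch of a retry helper, not part of Pulsar; the function name wait_until and its argument order are assumptions made for illustration.

```shell
# wait_until (hypothetical helper): re-run a command until it succeeds
# or the attempts run out.
# Usage: wait_until <max_attempts> <delay_seconds> <command> [args...]
wait_until() {
  attempts=$1
  delay=$2
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    # Discard the command's output; only its exit status matters here.
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}
```

For example, wait_until 30 2 curl -sf http://localhost:8080/admin/v2/worker/cluster would poll the function worker endpoint up to 30 times, two seconds apart, and return a non-zero status if it never responds.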
Connect Pulsar to Cassandra
This section demonstrates how to connect Pulsar to Cassandra.
tip
- Make sure you have Docker installed. If you do not have it, see install Docker. For more information about Docker commands, see Docker CLI.
- The Cassandra sink connector reads messages from Pulsar topics and writes the messages into Cassandra tables. For more information, see Cassandra sink connector.
Set up a Cassandra cluster
This example uses the cassandra Docker image to start a single-node Cassandra cluster in Docker.
Start a Cassandra cluster.
docker run -d --rm --name=cassandra -p 9042:9042 cassandra
note
Before moving to the next steps, make sure the Cassandra cluster is running.
Make sure the Docker process is running.
docker ps
Check the Cassandra logs to make sure the Cassandra process is running as expected.
docker logs cassandra
Check the status of the Cassandra cluster.
docker exec cassandra nodetool status
Example output
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns (effective) Host ID Rack
UN 172.17.0.2 103.67 KiB 256 100.0% af0e4b2f-84e0-4f0b-bb14-bd5f9070ff26 rack1
Use cqlsh to connect to the Cassandra cluster.
docker exec -ti cassandra cqlsh localhost
Output
Connected to Test Cluster at localhost:9042.
[cqlsh 5.0.1 | Cassandra 3.11.2 | CQL spec 3.4.4 | Native protocol v4]
Use HELP for help.
cqlsh>
Create a keyspace pulsar_test_keyspace.
cqlsh> CREATE KEYSPACE pulsar_test_keyspace WITH replication = {'class':'SimpleStrategy', 'replication_factor':1};
Create a table pulsar_test_table.
cqlsh> USE pulsar_test_keyspace;
cqlsh:pulsar_test_keyspace> CREATE TABLE pulsar_test_table (key text PRIMARY KEY, col text);
Configure a Cassandra sink
Now that you have a Cassandra cluster running locally, configure a Cassandra sink connector.
To run a Cassandra sink connector, you need to prepare a configuration file that includes the information the Pulsar connector runtime needs to know, for example, how the connector finds the Cassandra cluster and which keyspace and table it uses for writing Pulsar messages.
You can create the configuration file in one of the following formats.
JSON
{
"roots": "localhost:9042",
"keyspace": "pulsar_test_keyspace",
"columnFamily": "pulsar_test_table",
"keyname": "key",
"columnName": "col"
}
YAML
configs:
  roots: "localhost:9042"
  keyspace: "pulsar_test_keyspace"
  columnFamily: "pulsar_test_table"
  keyname: "key"
  columnName: "col"
For more information, see Cassandra sink connector.
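The sink creation step later in this tutorial reads the configuration from examples/cassandra-sink.yml. As a sketch of one way to produce that file, assuming you run it from the root of the Pulsar distribution, the YAML variant can be written with a heredoc:

```shell
# Create the examples directory if it does not exist yet.
mkdir -p examples

# Write the YAML configuration; the quoted 'EOF' keeps the content literal.
cat > examples/cassandra-sink.yml <<'EOF'
configs:
  roots: "localhost:9042"
  keyspace: "pulsar_test_keyspace"
  columnFamily: "pulsar_test_table"
  keyname: "key"
  columnName: "col"
EOF

# Print the file back to confirm what was written.
cat examples/cassandra-sink.yml
```

The keys nested under configs must stay indented; without the indentation they are no longer part of the configs map.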
Create a Cassandra sink
You can use the Connector Admin CLI to create a sink connector and perform other operations on it.
Run the following command to create a Cassandra sink connector with sink type cassandra and the config file examples/cassandra-sink.yml created previously.
note
The sink-type parameter of the currently built-in connectors is determined by the setting of the name parameter specified in the pulsar-io.yaml file.
bin/pulsar-admin sinks create \
--tenant public \
--namespace default \
--name cassandra-test-sink \
--sink-type cassandra \
--sink-config-file examples/cassandra-sink.yml \
--inputs test_cassandra
Once the command is executed, Pulsar creates the sink connector cassandra-test-sink.
This sink connector runs as a Pulsar Function and writes the messages produced in the topic test_cassandra to the Cassandra table pulsar_test_table.
Inspect a Cassandra sink
You can use the Connector Admin CLI to monitor a connector and perform other operations on it.
Get the information of a Cassandra sink.
bin/pulsar-admin sinks get \
--tenant public \
--namespace default \
--name cassandra-test-sink
Example output
{
"tenant": "public",
"namespace": "default",
"name": "cassandra-test-sink",
"className": "org.apache.pulsar.io.cassandra.CassandraStringSink",
"inputSpecs": {
"test_cassandra": {
"isRegexPattern": false
}
},
"configs": {
"roots": "localhost:9042",
"keyspace": "pulsar_test_keyspace",
"columnFamily": "pulsar_test_table",
"keyname": "key",
"columnName": "col"
},
"parallelism": 1,
"processingGuarantees": "ATLEAST_ONCE",
"retainOrdering": false,
"autoAck": true,
"archive": "builtin://cassandra"
}
Check the status of a Cassandra sink.
bin/pulsar-admin sinks status \
--tenant public \
--namespace default \
--name cassandra-test-sink
Example output
{
"numInstances" : 1,
"numRunning" : 1,
"instances" : [ {
"instanceId" : 0,
"status" : {
"running" : true,
"error" : "",
"numRestarts" : 0,
"numReadFromPulsar" : 0,
"numSystemExceptions" : 0,
"latestSystemExceptions" : [ ],
"numSinkExceptions" : 0,
"latestSinkExceptions" : [ ],
"numWrittenToSink" : 0,
"lastReceivedTime" : 0,
"workerId" : "c-standalone-fw-localhost-8080"
}
} ]
}
Verify a Cassandra sink
Produce some messages to test_cassandra, the input topic of the Cassandra sink.
for i in {0..9}; do bin/pulsar-client produce -m "key-$i" -n 1 test_cassandra; done
Inspect the status of the Cassandra sink cassandra-test-sink.
bin/pulsar-admin sinks status \
--tenant public \
--namespace default \
--name cassandra-test-sink
You can see that 10 messages are processed by the Cassandra sink cassandra-test-sink.
Example output
{
"numInstances" : 1,
"numRunning" : 1,
"instances" : [ {
"instanceId" : 0,
"status" : {
"running" : true,
"error" : "",
"numRestarts" : 0,
"numReadFromPulsar" : 10,
"numSystemExceptions" : 0,
"latestSystemExceptions" : [ ],
"numSinkExceptions" : 0,
"latestSinkExceptions" : [ ],
"numWrittenToSink" : 10,
"lastReceivedTime" : 1551685489136,
"workerId" : "c-standalone-fw-localhost-8080"
}
} ]
}
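To compare counters such as numReadFromPulsar and numWrittenToSink without reading the whole status document, you can filter the output. The following is a rough sketch that uses only grep; the function name sink_counter is hypothetical, and the pattern assumes the field layout shown in the example output above.

```shell
# sink_counter (hypothetical helper): read `sinks status` JSON on stdin
# and print the value of one numeric field, one line per sink instance.
sink_counter() {
  field=$1
  grep -o "\"$field\" *: *[0-9][0-9]*" | grep -o '[0-9][0-9]*$'
}

# Sample fragment copied from the example output above. In practice, pipe
# the output of `bin/pulsar-admin sinks status ...` into the function.
sample='"numReadFromPulsar" : 10, "numWrittenToSink" : 10,'
printf '%s\n' "$sample" | sink_counter numWrittenToSink
```

If the two counters differ for long, the numSinkExceptions and latestSinkExceptions fields in the same output are the next place to look.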
Use cqlsh to connect to the Cassandra cluster.
docker exec -ti cassandra cqlsh localhost
Check the data of the Cassandra table pulsar_test_table.
cqlsh> use pulsar_test_keyspace;
cqlsh:pulsar_test_keyspace> select * from pulsar_test_table;
key | col
--------+--------
key-5 | key-5
key-0 | key-0
key-9 | key-9
key-2 | key-2
key-1 | key-1
key-3 | key-3
key-6 | key-6
key-7 | key-7
key-4 | key-4
key-8 | key-8
Delete a Cassandra sink
You can use the Connector Admin CLI to delete a connector and perform other operations on it.
bin/pulsar-admin sinks delete \
--tenant public \
--namespace default \
--name cassandra-test-sink
Connect Pulsar to PostgreSQL
This section demonstrates how to connect Pulsar to PostgreSQL.
tip
- Make sure you have Docker installed. If you do not have it, see install Docker. For more information about Docker commands, see Docker CLI.
- The JDBC sink connector pulls messages from Pulsar topics and persists the messages to ClickHouse, MariaDB, PostgreSQL, or SQLite. For more information, see JDBC sink connector.
Set up a PostgreSQL cluster
This example uses the PostgreSQL 12 docker image to start a single-node PostgreSQL cluster in Docker.
Pull the PostgreSQL 12 image from Docker.
docker pull postgres:12
Start PostgreSQL.
docker run -d -it --rm \
--name pulsar-postgres \
-p 5432:5432 \
-e POSTGRES_PASSWORD=password \
-e POSTGRES_USER=postgres \
postgres:12
Check if PostgreSQL has been started successfully.
docker logs -f pulsar-postgres
PostgreSQL has been started successfully if the following message appears.
2020-05-11 20:09:24.492 UTC [1] LOG: starting PostgreSQL 12.2 (Debian 12.2-2.pgdg100+1) on x86_64-pc-linux-gnu, compiled by gcc (Debian 8.3.0-6) 8.3.0, 64-bit
2020-05-11 20:09:24.492 UTC [1] LOG: listening on IPv4 address "0.0.0.0", port 5432
2020-05-11 20:09:24.492 UTC [1] LOG: listening on IPv6 address "::", port 5432
2020-05-11 20:09:24.499 UTC [1] LOG: listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
2020-05-11 20:09:24.523 UTC [55] LOG: database system was shut down at 2020-05-11 20:09:24 UTC
2020-05-11 20:09:24.533 UTC [1] LOG: database system is ready to accept connections
Access the PostgreSQL container.
docker exec -it pulsar-postgres /bin/bash
Log in to PostgreSQL with the default username and password:
psql -U postgres postgres
Create a pulsar_postgres_jdbc_sink table using the following command:
create table if not exists pulsar_postgres_jdbc_sink
(
  id serial PRIMARY KEY,
  name VARCHAR(255) NOT NULL
);
Configure a JDBC sink
Now that you have PostgreSQL running locally, configure a JDBC sink connector.
Add a configuration file.
To run a JDBC sink connector, you need to prepare a YAML configuration file including the information that the Pulsar connector runtime needs to know.
For example, the configuration specifies how the Pulsar connector finds the PostgreSQL cluster, and the JDBC URL and table that the connector uses for writing messages.
Create a pulsar-postgres-jdbc-sink.yaml file, copy the following contents to this file, and place the file in the pulsar/connectors folder.
configs:
  userName: "postgres"
  password: "password"
  jdbcUrl: "jdbc:postgresql://localhost:5432/postgres"
  tableName: "pulsar_postgres_jdbc_sink"
Create a schema.
Create an avro-schema file, copy the following contents to this file, and place the file in the pulsar/connectors folder.
{
"type": "AVRO",
"schema": "{\"type\":\"record\",\"name\":\"Test\",\"fields\":[{\"name\":\"id\",\"type\":[\"null\",\"int\"]},{\"name\":\"name\",\"type\":[\"null\",\"string\"]}]}",
"properties": {}
}
tip
For more information about AVRO, see Apache Avro.
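The schema field is a JSON string nested inside JSON, so the escaped quotes have to survive when the file is created from a shell. As a sketch, assuming you run it from the root of the Pulsar distribution, a heredoc with a quoted delimiter writes the file without the shell touching the backslashes:

```shell
# Create the connectors directory if it does not exist yet.
mkdir -p connectors

# The quoted 'EOF' delimiter disables expansion, so every \" is written as-is.
cat > connectors/avro-schema <<'EOF'
{
  "type": "AVRO",
  "schema": "{\"type\":\"record\",\"name\":\"Test\",\"fields\":[{\"name\":\"id\",\"type\":[\"null\",\"int\"]},{\"name\":\"name\",\"type\":[\"null\",\"string\"]}]}",
  "properties": {}
}
EOF

# Show the escaped schema line to confirm the backslashes are intact.
grep '"schema"' connectors/avro-schema
```

With an unquoted delimiter (<<EOF), the shell would process the heredoc body before writing it, which is easy to get wrong for content like this.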
Upload a schema to a topic.
This example uploads the avro-schema schema to the pulsar-postgres-jdbc-sink-topic topic.
bin/pulsar-admin schemas upload pulsar-postgres-jdbc-sink-topic -f ./connectors/avro-schema
Check if the schema has been uploaded successfully.
bin/pulsar-admin schemas get pulsar-postgres-jdbc-sink-topic
The schema has been uploaded successfully if the following message appears.
{"name":"pulsar-postgres-jdbc-sink-topic","schema":"{\"type\":\"record\",\"name\":\"Test\",\"fields\":[{\"name\":\"id\",\"type\":[\"null\",\"int\"]},{\"name\":\"name\",\"type\":[\"null\",\"string\"]}]}","type":"AVRO","properties":{}}
Create a JDBC sink
You can use the Connector Admin CLI to create a sink connector and perform other operations on it.
This example creates a sink connector and specifies the desired information.
bin/pulsar-admin sinks create \
--archive ./connectors/pulsar-io-jdbc-postgres-3.1.1.nar \
--inputs pulsar-postgres-jdbc-sink-topic \
--name pulsar-postgres-jdbc-sink \
--sink-config-file ./connectors/pulsar-postgres-jdbc-sink.yaml \
--parallelism 1
Once the command is executed, Pulsar creates a sink connector pulsar-postgres-jdbc-sink.
This sink connector runs as a Pulsar Function and writes the messages produced in the topic pulsar-postgres-jdbc-sink-topic to the PostgreSQL table pulsar_postgres_jdbc_sink.
tip
Flag | Description | Example |
---|---|---|
--archive | The path to the archive file for the sink. | pulsar-io-jdbc-postgres-3.1.1.nar |
--inputs | The input topic(s) of the sink. Multiple topics can be specified as a comma-separated list. | |
--name | The name of the sink. | pulsar-postgres-jdbc-sink |
--sink-config-file | The path to a YAML config file specifying the configuration of the sink. | pulsar-postgres-jdbc-sink.yaml |
--parallelism | The parallelism factor of the sink, that is, the number of sink instances to run. | 1 |
tip
For more information about pulsar-admin sinks create options, see Pulsar admin docs.
The sink has been created successfully if the following message appears.
Created successfully
Inspect a JDBC sink
You can use the Connector Admin CLI to monitor a connector and perform other operations on it.
List all running JDBC sink(s).
bin/pulsar-admin sinks list \
--tenant public \
--namespace default
tip
For more information about pulsar-admin sinks list options, see Pulsar admin docs.
The result shows that only the pulsar-postgres-jdbc-sink sink is running.
[
"pulsar-postgres-jdbc-sink"
]
Get the information of a JDBC sink.
bin/pulsar-admin sinks get \
--tenant public \
--namespace default \
--name pulsar-postgres-jdbc-sink
tip
For more information about pulsar-admin sinks get options, see Pulsar admin docs.
The result shows the information of the sink connector, including tenant, namespace, topic and so on.
{
"tenant": "public",
"namespace": "default",
"name": "pulsar-postgres-jdbc-sink",
"className": "org.apache.pulsar.io.jdbc.PostgresJdbcAutoSchemaSink",
"inputSpecs": {
"pulsar-postgres-jdbc-sink-topic": {
"isRegexPattern": false
}
},
"configs": {
"password": "password",
"jdbcUrl": "jdbc:postgresql://localhost:5432/postgres",
"userName": "postgres",
"tableName": "pulsar_postgres_jdbc_sink"
},
"parallelism": 1,
"processingGuarantees": "ATLEAST_ONCE",
"retainOrdering": false,
"autoAck": true
}
Get the status of a JDBC sink.
bin/pulsar-admin sinks status \
--tenant public \
--namespace default \
--name pulsar-postgres-jdbc-sink
tip
For more information about pulsar-admin sinks status options, see Pulsar admin docs.
The result shows the current status of the sink connector, including the number of instances, running status, worker ID and so on.
{
"numInstances" : 1,
"numRunning" : 1,
"instances" : [ {
"instanceId" : 0,
"status" : {
"running" : true,
"error" : "",
"numRestarts" : 0,
"numReadFromPulsar" : 0,
"numSystemExceptions" : 0,
"latestSystemExceptions" : [ ],
"numSinkExceptions" : 0,
"latestSinkExceptions" : [ ],
"numWrittenToSink" : 0,
"lastReceivedTime" : 0,
"workerId" : "c-standalone-fw-192.168.2.52-8080"
}
} ]
}
Stop a JDBC sink
You can use the Connector Admin CLI to stop a connector and perform other operations on it.
bin/pulsar-admin sinks stop \
--tenant public \
--namespace default \
--name pulsar-postgres-jdbc-sink
tip
For more information about pulsar-admin sinks stop options, see Pulsar admin docs.
The sink instance has been stopped successfully if the following message appears.
Stopped successfully
Restart a JDBC sink
You can use the Connector Admin CLI to restart a connector and perform other operations on it.
bin/pulsar-admin sinks restart \
--tenant public \
--namespace default \
--name pulsar-postgres-jdbc-sink
tip
For more information about pulsar-admin sinks restart options, see Pulsar admin docs.
The sink instance has been started successfully if the following message appears.
Started successfully
tip
- Optionally, you can run a standalone sink connector using pulsar-admin sinks localrun options. Note that pulsar-admin sinks localrun options runs a sink connector locally, while pulsar-admin sinks start options starts a sink connector in a cluster.
- For more information about pulsar-admin sinks localrun options, see Pulsar admin docs.
Update a JDBC sink
You can use the Connector Admin CLI to update a connector and perform other operations on it.
This example updates the parallelism of the pulsar-postgres-jdbc-sink sink connector to 2.
bin/pulsar-admin sinks update \
--name pulsar-postgres-jdbc-sink \
--parallelism 2
tip
For more information about pulsar-admin sinks update options, see Pulsar admin docs.
The sink connector has been updated successfully if the following message appears.
Updated successfully
This example double-checks the information.
bin/pulsar-admin sinks get \
--tenant public \
--namespace default \
--name pulsar-postgres-jdbc-sink
The result shows that the parallelism is 2.
{
"tenant": "public",
"namespace": "default",
"name": "pulsar-postgres-jdbc-sink",
"className": "org.apache.pulsar.io.jdbc.PostgresJdbcAutoSchemaSink",
"inputSpecs": {
"pulsar-postgres-jdbc-sink-topic": {
"isRegexPattern": false
}
},
"configs": {
"password": "password",
"jdbcUrl": "jdbc:postgresql://localhost:5432/postgres",
"userName": "postgres",
"tableName": "pulsar_postgres_jdbc_sink"
},
"parallelism": 2,
"processingGuarantees": "ATLEAST_ONCE",
"retainOrdering": false,
"autoAck": true
}
Delete a JDBC sink
You can use the Connector Admin CLI to delete a connector and perform other operations on it.
This example deletes the pulsar-postgres-jdbc-sink sink connector.
bin/pulsar-admin sinks delete \
--tenant public \
--namespace default \
--name pulsar-postgres-jdbc-sink
tip
For more information about pulsar-admin sinks delete options, see Pulsar admin docs.
The sink connector has been deleted successfully if the following message appears.
Deleted successfully
This example double-checks the status of the sink connector.
bin/pulsar-admin sinks get \
--tenant public \
--namespace default \
--name pulsar-postgres-jdbc-sink
The result shows that the sink connector does not exist.
HTTP 404 Not Found
Reason: Sink pulsar-postgres-jdbc-sink doesn't exist