Bulk Loading

Bulk loading Apache Cassandra data is supported by different tools. The data to bulk load must be in the form of SSTables. Cassandra does not directly support loading data in any other format, such as CSV, JSON, or XML. Although the cqlsh COPY command can load CSV data, it is not a good option for large amounts of data. Bulk loading is used to:

  • Restore incremental backups and snapshots. Backups and snapshots are already in the form of SSTables.

  • Load existing SSTables into another cluster. The destination cluster can have a different number of nodes or a different replication strategy.

  • Load external data to a cluster.

Tools for Bulk Loading

Cassandra provides two commands or tools for bulk loading data:

  • Cassandra Bulk loader, also called sstableloader

  • The nodetool import command

The sstableloader and nodetool import are accessible if the Cassandra installation bin directory is in the PATH environment variable; otherwise they may be run directly from the bin directory. The examples use the keyspaces and tables created in Backups.

Using sstableloader

The sstableloader is the main tool for bulk uploading data. sstableloader streams SSTable data files to a running cluster, conforming to the replication strategy and replication factor. The table to upload data to does not need to be empty.

The only requirements to run sstableloader are:

  • One or more comma separated initial hosts to connect to and get ring information

  • A directory path for the SSTables to load

    sstableloader [options] <dir_path>

sstableloader bulk loads the SSTables found in the directory <dir_path> to the configured cluster. The last two levels of <dir_path> are used as the target keyspace and table name. For example, to load an SSTable named Standard1-g-1-Data.db into Keyspace1/Standard1, the files Standard1-g-1-Data.db and Standard1-g-1-Index.db must be in a directory /path/to/Keyspace1/Standard1/.
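A minimal invocation for that example would then be (the host address here is purely illustrative):

    $ sstableloader -d 127.0.0.1 /path/to/Keyspace1/Standard1/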

Sstableloader Option to accept Target keyspace name

Often, as part of a backup strategy, some Cassandra DBAs store an entire data directory. When corruption in the data is found, restoring the data to the same cluster (which may be large, 200 nodes or more) but under a different keyspace name is a common scenario.

Currently sstableloader derives the keyspace name from the folder structure. To specify the target keyspace name explicitly, version 4.0 adds support for the --target-keyspace option (CASSANDRA-13884).
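For example, to load SSTables under a different keyspace name (a sketch; the keyspace name restoredcatalog is illustrative):

    $ sstableloader --nodes 10.0.2.238 --target-keyspace restoredcatalog /catalogkeyspace/magazine/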

The following options are supported, with -d,--nodes <initial hosts> required:

    -alg,--ssl-alg <ALGORITHM>                            Client SSL: algorithm
    -ap,--auth-provider <auth provider>                   Custom AuthProvider class name for cassandra authentication
    -ciphers,--ssl-ciphers <CIPHER-SUITES>                Client SSL: comma-separated list of encryption suites to use
    -cph,--connections-per-host <connectionsPerHost>      Number of concurrent connections-per-host
    -d,--nodes <initial hosts>                            Required. Try to connect to these hosts (comma separated) initially for ring information
    --entire-sstable-throttle-mib <throttle-mib>          Entire SSTable throttle speed in MiB/s (default 0 for unlimited)
    --entire-sstable-inter-dc-throttle-mib <inter-dc-throttle-mib>
                                                          Entire SSTable inter-datacenter throttle speed in MiB/s (default 0 for unlimited)
    -f,--conf-path <path to config file>                  cassandra.yaml file path for streaming throughput and client/server SSL
    -h,--help                                             Display this help message
    -i,--ignore <NODES>                                   Don't stream to this (comma separated) list of nodes
    -idct,--inter-dc-throttle <inter-dc-throttle>         (deprecated) Inter-datacenter throttle speed in Mbits (default 0 for unlimited). Use --inter-dc-throttle-mib instead
    --inter-dc-throttle-mib <inter-dc-throttle-mib>       Inter-datacenter throttle speed in MiB/s (default 0 for unlimited)
    -k,--target-keyspace <target keyspace name>           Target keyspace name
    -ks,--keystore <KEYSTORE>                             Client SSL: full path to keystore
    -kspw,--keystore-password <KEYSTORE-PASSWORD>         Client SSL: password of the keystore
    --no-progress                                         Don't display progress
    -p,--port <native transport port>                     Port used for native connection (default 9042)
    -prtcl,--ssl-protocol <PROTOCOL>                      Client SSL: connections protocol to use (default: TLS)
    -pw,--password <password>                             Password for cassandra authentication
    -sp,--storage-port <storage port>                     Port used for internode communication (default 7000)
    -spd,--server-port-discovery <allow server port discovery>
                                                          Use ports published by server to decide how to connect. With SSL requires StartTLS to be used
    -ssp,--ssl-storage-port <ssl storage port>            Port used for TLS internode communication (default 7001)
    -st,--store-type <STORE-TYPE>                         Client SSL: type of store
    -t,--throttle <throttle>                              (deprecated) Throttle speed in Mbits (default 0 for unlimited). Use --throttle-mib instead
    --throttle-mib <throttle-mib>                         Throttle speed in MiB/s (default 0 for unlimited)
    -ts,--truststore <TRUSTSTORE>                         Client SSL: full path to truststore
    -tspw,--truststore-password <TRUSTSTORE-PASSWORD>     Client SSL: password of the truststore
    -u,--username <username>                              Username for cassandra authentication
    -v,--verbose                                          Verbose output

The cassandra.yaml file can be provided on the command line with the -f option to set up streaming throughput and client and server encryption options. Only stream_throughput_outbound_megabits_per_sec, server_encryption_options, and client_encryption_options are read from the cassandra.yaml file. Options read from cassandra.yaml may be overridden with the corresponding command-line options.
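For example, a configuration file combined with a command-line override might look as follows (a sketch; the configuration path and throttle value are illustrative):

    $ sstableloader --nodes 10.0.2.238 -f /etc/cassandra/cassandra.yaml \
          --throttle-mib 50 /catalogkeyspace/magazine/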

An sstableloader Demo

This example shows how to use sstableloader to upload incremental backup data for the table catalogkeyspace.magazine. In addition, a snapshot of the same table is created and bulk uploaded, also with sstableloader.

The backups and snapshots for the catalogkeyspace.magazine table are listed as follows:

    $ cd ./cassandra/data/data/catalogkeyspace/magazine-446eae30c22a11e9b1350d927649052c && ls -l

results in

    total 0
    drwxrwxr-x. 2 ec2-user ec2-user 226 Aug 19 02:38 backups
    drwxrwxr-x. 4 ec2-user ec2-user  40 Aug 19 02:45 snapshots

The directory path structure of the SSTables to be uploaded with sstableloader determines the target keyspace/table. You could upload directly from the backups or snapshots directories if their structure matched the format sstableloader expects, but the backups and snapshots for the SSTables live at /catalogkeyspace/magazine-446eae30c22a11e9b1350d927649052c/backups and /catalogkeyspace/magazine-446eae30c22a11e9b1350d927649052c/snapshots respectively, which cannot be used to upload SSTables to the catalogkeyspace.magazine table. The directory path structure must be /catalogkeyspace/magazine/ for sstableloader to use it. Create a new directory structure at /catalogkeyspace/magazine to upload SSTables with sstableloader, and set appropriate permissions:

    $ sudo mkdir -p /catalogkeyspace/magazine
    $ sudo chmod -R 777 /catalogkeyspace/magazine

Bulk Loading from an Incremental Backup

An incremental backup does not include the DDL for a table; the table must already exist. If the table was dropped, it can be created using the schema.cql file generated with every snapshot of a table (an example follows the query below). Prior to using sstableloader to load SSTables into the magazine table, the table must exist. The table does not need to be empty, but an empty table is used here, as indicated by a CQL query:

    SELECT * FROM magazine;

results in

    id | name | publisher
    ----+------+-----------
    (0 rows)
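If the magazine table had been dropped, one way to recreate it is to run the DDL in the snapshot's schema.cql with cqlsh (a sketch; the path is the snapshot directory shown later in this section):

    $ cqlsh -f ./cassandra/data/data/catalogkeyspace/magazine-446eae30c22a11e9b1350d927649052c/snapshots/magazine/schema.cql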

After creating the table to upload to, copy the SSTable files from the backups directory to the /catalogkeyspace/magazine/ directory.

    $ sudo cp ./cassandra/data/data/catalogkeyspace/magazine-446eae30c22a11e9b1350d927649052c/backups/* \
          /catalogkeyspace/magazine/

Run sstableloader to upload SSTables from the /catalogkeyspace/magazine/ directory:

    $ sstableloader --nodes 10.0.2.238 /catalogkeyspace/magazine/

The output from the sstableloader command should be similar to this listing:

    Opening SSTables and calculating sections to stream
    Streaming relevant part of /catalogkeyspace/magazine/na-1-big-Data.db
    /catalogkeyspace/magazine/na-2-big-Data.db to [35.173.233.153:7000, 10.0.2.238:7000, 54.158.45.75:7000]
    progress: [35.173.233.153:7000]0:1/2 88 % total: 88% 0.018KiB/s (avg: 0.018KiB/s)
    progress: [35.173.233.153:7000]0:2/2 176% total: 176% 33.807KiB/s (avg: 0.036KiB/s)
    progress: [35.173.233.153:7000]0:2/2 176% total: 176% 0.000KiB/s (avg: 0.029KiB/s)
    progress: [35.173.233.153:7000]0:2/2 176% [10.0.2.238:7000]0:1/2 39 % total: 81% 0.115KiB/s (avg: 0.024KiB/s)
    progress: [35.173.233.153:7000]0:2/2 176% [10.0.2.238:7000]0:2/2 78 % total: 108% 97.683KiB/s (avg: 0.033KiB/s)
    progress: [35.173.233.153:7000]0:2/2 176% [10.0.2.238:7000]0:2/2 78 % [54.158.45.75:7000]0:1/2 39 % total: 80% 0.233KiB/s (avg: 0.040KiB/s)
    progress: [35.173.233.153:7000]0:2/2 176% [10.0.2.238:7000]0:2/2 78 % [54.158.45.75:7000]0:2/2 78 % total: 96% 88.522KiB/s (avg: 0.049KiB/s)
    progress: [35.173.233.153:7000]0:2/2 176% [10.0.2.238:7000]0:2/2 78 % [54.158.45.75:7000]0:2/2 78 % total: 96% 0.000KiB/s (avg: 0.045KiB/s)
    progress: [35.173.233.153:7000]0:2/2 176% [10.0.2.238:7000]0:2/2 78 % [54.158.45.75:7000]0:2/2 78 % total: 96% 0.000KiB/s (avg: 0.044KiB/s)

After sstableloader has finished loading the data, run a query on the magazine table to check:

    SELECT * FROM magazine;

results in

    id | name                      | publisher
    ----+---------------------------+------------------
     1 |        Couchbase Magazine |        Couchbase
     0 | Apache Cassandra Magazine | Apache Cassandra
    (2 rows)

Bulk Loading from a Snapshot

Restoring a snapshot of a table to the same table can be easily accomplished. If the directory structure needed to load SSTables to catalogkeyspace.magazine does not exist, create the directories and set appropriate permissions:

    $ sudo mkdir -p /catalogkeyspace/magazine
    $ sudo chmod -R 777 /catalogkeyspace/magazine

Remove any files from the directory, so that the snapshot files can be copied without interference:

    $ sudo rm /catalogkeyspace/magazine/*
    $ cd /catalogkeyspace/magazine/
    $ ls -l

results in

    total 0

Copy the snapshot files to the /catalogkeyspace/magazine directory.

    $ sudo cp ./cassandra/data/data/catalogkeyspace/magazine-446eae30c22a11e9b1350d927649052c/snapshots/magazine/* \
          /catalogkeyspace/magazine

List the files in the /catalogkeyspace/magazine directory. The schema.cql will also be listed.

    $ cd /catalogkeyspace/magazine && ls -l

results in

    total 44
    -rw-r--r--. 1 root root   31 Aug 19 04:13 manifest.json
    -rw-r--r--. 1 root root   47 Aug 19 04:13 na-1-big-CompressionInfo.db
    -rw-r--r--. 1 root root   97 Aug 19 04:13 na-1-big-Data.db
    -rw-r--r--. 1 root root   10 Aug 19 04:13 na-1-big-Digest.crc32
    -rw-r--r--. 1 root root   16 Aug 19 04:13 na-1-big-Filter.db
    -rw-r--r--. 1 root root   16 Aug 19 04:13 na-1-big-Index.db
    -rw-r--r--. 1 root root 4687 Aug 19 04:13 na-1-big-Statistics.db
    -rw-r--r--. 1 root root   56 Aug 19 04:13 na-1-big-Summary.db
    -rw-r--r--. 1 root root   92 Aug 19 04:13 na-1-big-TOC.txt
    -rw-r--r--. 1 root root  815 Aug 19 04:13 schema.cql

Alternatively, create symlinks to the snapshot folder instead of copying the data:

    $ mkdir <keyspace_name>
    $ ln -s <path_to_snapshot_folder> <keyspace_name>/<table_name>
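For the magazine table this would look something like the following (a sketch; the paths assume the data directory layout used earlier, and symlink targets may need to be absolute):

    $ mkdir catalogkeyspace
    $ ln -s ./cassandra/data/data/catalogkeyspace/magazine-446eae30c22a11e9b1350d927649052c/snapshots/magazine \
          catalogkeyspace/magazine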

If the magazine table was dropped, run the DDL in the schema.cql to create the table. Run the sstableloader with the following command:

    $ sstableloader --nodes 10.0.2.238 /catalogkeyspace/magazine/

As the output from the command indicates, SSTables get streamed to the cluster:

    Established connection to initial hosts
    Opening SSTables and calculating sections to stream
    Streaming relevant part of /catalogkeyspace/magazine/na-1-big-Data.db to [35.173.233.153:7000, 10.0.2.238:7000, 54.158.45.75:7000]
    progress: [35.173.233.153:7000]0:1/1 176% total: 176% 0.017KiB/s (avg: 0.017KiB/s)
    progress: [35.173.233.153:7000]0:1/1 176% total: 176% 0.000KiB/s (avg: 0.014KiB/s)
    progress: [35.173.233.153:7000]0:1/1 176% [10.0.2.238:7000]0:1/1 78 % total: 108% 0.115KiB/s (avg: 0.017KiB/s)
    progress: [35.173.233.153:7000]0:1/1 176% [10.0.2.238:7000]0:1/1 78 % [54.158.45.75:7000]0:1/1 78 % total: 96% 0.232KiB/s (avg: 0.024KiB/s)
    progress: [35.173.233.153:7000]0:1/1 176% [10.0.2.238:7000]0:1/1 78 % [54.158.45.75:7000]0:1/1 78 % total: 96% 0.000KiB/s (avg: 0.022KiB/s)
    progress: [35.173.233.153:7000]0:1/1 176% [10.0.2.238:7000]0:1/1 78 % [54.158.45.75:7000]0:1/1 78 % total: 96% 0.000KiB/s (avg: 0.021KiB/s)

Some other requirements of sstableloader that should be taken into consideration are:

  • The SSTables loaded must be compatible with the Cassandra version being loaded into.

  • Repairing tables that have been loaded into a different cluster does not repair the source tables.

  • The sstableloader uses port 7000 for internode communication.

  • Before restoring incremental backups, run nodetool flush to back up any data in memtables (see the example below).
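For example, to flush memtables for the table used in this demo before copying its incremental backups:

    $ nodetool flush catalogkeyspace magazine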

Using nodetool import

Importing SSTables into a table using the nodetool import command is recommended instead of the deprecated nodetool refresh command. The nodetool import command has an option to load new SSTables from a separate directory.

The command usage is as follows:

    nodetool [(-h <host> | --host <host>)] [(-p <port> | --port <port>)]
            [(-pp | --print-port)] [(-pw <password> | --password <password>)]
            [(-pwf <passwordFilePath> | --password-file <passwordFilePath>)]
            [(-u <username> | --username <username>)] import
            [(-c | --no-invalidate-caches)] [(-e | --extended-verify)]
            [(-l | --keep-level)] [(-q | --quick)] [(-r | --keep-repaired)]
            [(-t | --no-tokens)] [(-v | --no-verify)] [--] <keyspace> <table>
            <directory> ...

The keyspace, table name, and directory arguments are required.

The following options are supported:

    -c, --no-invalidate-caches
        Don't invalidate the row cache when importing
    -e, --extended-verify
        Run an extended verify, verifying all values in the new SSTables
    -h <host>, --host <host>
        Node hostname or ip address
    -l, --keep-level
        Keep the level on the new SSTables
    -p <port>, --port <port>
        Remote jmx agent port number
    -pp, --print-port
        Operate in 4.0 mode with hosts disambiguated by port number
    -pw <password>, --password <password>
        Remote jmx agent password
    -pwf <passwordFilePath>, --password-file <passwordFilePath>
        Path to the JMX password file
    -q, --quick
        Do a quick import without verifying SSTables, clearing row cache or checking in which data directory to put the file
    -r, --keep-repaired
        Keep any repaired information from the SSTables
    -t, --no-tokens
        Don't verify that all tokens in the new SSTable are owned by the current node
    -u <username>, --username <username>
        Remote jmx agent username
    -v, --no-verify
        Don't verify new SSTables
    --
        This option can be used to separate command-line options from the list of arguments (useful when arguments might be mistaken for command-line options)

Because the keyspace and table are specified on the command line for nodetool import, there is no requirement, as there is with sstableloader, to have the SSTables in a specific directory path. When importing snapshots or incremental backups with nodetool import, the SSTables don't need to be copied to another directory.

Importing Data from an Incremental Backup

The following example uses nodetool import to import SSTables from an incremental backup and restore the table, which is first dropped:

    DROP TABLE t;

An incremental backup for a table does not include the schema definition for the table. If the schema definition is not kept as a separate backup, the schema.cql from a backup of the table may be used to create the table as follows:

    CREATE TABLE IF NOT EXISTS cqlkeyspace.t (
        id int PRIMARY KEY,
        k int,
        v text)
        WITH ID = d132e240-c217-11e9-bbee-19821dcea330
        AND bloom_filter_fp_chance = 0.01
        AND crc_check_chance = 1.0
        AND default_time_to_live = 0
        AND gc_grace_seconds = 864000
        AND min_index_interval = 128
        AND max_index_interval = 2048
        AND memtable_flush_period_in_ms = 0
        AND speculative_retry = '99p'
        AND additional_write_policy = '99p'
        AND comment = ''
        AND caching = { 'keys': 'ALL', 'rows_per_partition': 'NONE' }
        AND compaction = { 'max_threshold': '32', 'min_threshold': '4',
            'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy' }
        AND compression = { 'chunk_length_in_kb': '16',
            'class': 'org.apache.cassandra.io.compress.LZ4Compressor' }
        AND cdc = false
        AND extensions = { };

Initially the table could be empty, but does not have to be.

    SELECT * FROM t;

results in

    id | k | v
    ----+---+---
    (0 rows)

Run the nodetool import command, providing the keyspace, table, and the backups directory. Unlike with sstableloader, the table backups do not need to be copied to another directory.

    $ nodetool import -- cqlkeyspace t \
          ./cassandra/data/data/cqlkeyspace/t-d132e240c21711e9bbee19821dcea330/backups

The SSTables are imported into the table. Run a query in cqlsh to check:

    SELECT * FROM t;

results in

    id | k | v
    ----+---+------
     1 | 1 | val1
     0 | 0 | val0
    (2 rows)

Importing Data from a Snapshot

Importing SSTables from a snapshot with the nodetool import command is similar to importing SSTables from an incremental backup. Shown here is an import of a snapshot for table catalogkeyspace.journal, after dropping the table to demonstrate the restore.

    USE CATALOGKEYSPACE;
    DROP TABLE journal;

Use the catalog-ks snapshot for the journal table. Check the files in the snapshot, and note the existence of the schema.cql file:

    $ ls -l

results in

    total 44
    -rw-rw-r--. 1 ec2-user ec2-user   31 Aug 19 02:44 manifest.json
    -rw-rw-r--. 3 ec2-user ec2-user   47 Aug 19 02:38 na-1-big-CompressionInfo.db
    -rw-rw-r--. 3 ec2-user ec2-user   97 Aug 19 02:38 na-1-big-Data.db
    -rw-rw-r--. 3 ec2-user ec2-user   10 Aug 19 02:38 na-1-big-Digest.crc32
    -rw-rw-r--. 3 ec2-user ec2-user   16 Aug 19 02:38 na-1-big-Filter.db
    -rw-rw-r--. 3 ec2-user ec2-user   16 Aug 19 02:38 na-1-big-Index.db
    -rw-rw-r--. 3 ec2-user ec2-user 4687 Aug 19 02:38 na-1-big-Statistics.db
    -rw-rw-r--. 3 ec2-user ec2-user   56 Aug 19 02:38 na-1-big-Summary.db
    -rw-rw-r--. 3 ec2-user ec2-user   92 Aug 19 02:38 na-1-big-TOC.txt
    -rw-rw-r--. 1 ec2-user ec2-user  814 Aug 19 02:44 schema.cql

Copy the DDL from the schema.cql file and run it in cqlsh to create the catalogkeyspace.journal table:

    CREATE TABLE IF NOT EXISTS catalogkeyspace.journal (
        id int PRIMARY KEY,
        name text,
        publisher text)
        WITH ID = 296a2d30-c22a-11e9-b135-0d927649052c
        AND bloom_filter_fp_chance = 0.01
        AND crc_check_chance = 1.0
        AND default_time_to_live = 0
        AND gc_grace_seconds = 864000
        AND min_index_interval = 128
        AND max_index_interval = 2048
        AND memtable_flush_period_in_ms = 0
        AND speculative_retry = '99p'
        AND additional_write_policy = '99p'
        AND comment = ''
        AND caching = { 'keys': 'ALL', 'rows_per_partition': 'NONE' }
        AND compaction = { 'min_threshold': '4', 'max_threshold': '32',
            'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy' }
        AND compression = { 'chunk_length_in_kb': '16',
            'class': 'org.apache.cassandra.io.compress.LZ4Compressor' }
        AND cdc = false
        AND extensions = { };

Run the nodetool import command to import the SSTables for the snapshot:

    $ nodetool import -- catalogkeyspace journal \
          ./cassandra/data/data/catalogkeyspace/journal-296a2d30c22a11e9b1350d927649052c/snapshots/catalog-ks/

Subsequently run a CQL query on the journal table to check the imported data:

    SELECT * FROM journal;

results in

    id | name                      | publisher
    ----+---------------------------+------------------
     1 |        Couchbase Magazine |        Couchbase
     0 | Apache Cassandra Magazine | Apache Cassandra
    (2 rows)

Bulk Loading External Data

Bulk loading external data directly is not supported by either of the tools discussed, sstableloader and nodetool import; both require the data to be in the form of SSTables. Apache Cassandra provides a Java API for generating SSTables from input data: the org.apache.cassandra.io.sstable.CQLSSTableWriter class. Subsequently, either sstableloader or nodetool import is used to bulk load the SSTables, as sketched below.
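As an illustration of that final step, SSTables generated into the ./sstables output directory created in the walkthrough below could be streamed with sstableloader (a sketch; the host address is illustrative):

    $ sstableloader --nodes 10.0.2.238 ./sstables/CQLKeyspace/t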

Generating SSTables with CQLSSTableWriter Java API

To generate SSTables using the CQLSSTableWriter class the following are required:

  • An output directory to generate the SSTable in

  • The schema for the SSTable

  • A prepared statement for the INSERT

  • A partitioner

The output directory must exist before starting. Create a directory (/sstables as an example) and set appropriate permissions.

    $ sudo mkdir /sstables
    $ sudo chmod 777 -R /sstables

To use CQLSSTableWriter in a Java application, create a Java constant for the output directory.

    public static final String OUTPUT_DIR = "./sstables";

The CQLSSTableWriter Java API can create a user-defined type. Create a new type to store int data:

    String type = "CREATE TYPE CQLKeyspace.intType (a int, b int)";
    // Define a String variable for the SSTable schema.
    String schema = "CREATE TABLE CQLKeyspace.t ("
                  + "  id int PRIMARY KEY,"
                  + "  k int,"
                  + "  v1 text,"
                  + "  v2 intType"
                  + ")";

Define a String variable for the prepared statement to use:

    String insertStmt = "INSERT INTO CQLKeyspace.t (id, k, v1, v2) VALUES (?, ?, ?, ?)";

The partitioner only needs to be set if the cluster uses a partitioner other than the default, Murmur3Partitioner.
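For example, the builder described below accepts a partitioner; a minimal sketch, where RandomPartitioner stands in for whatever partitioner the target cluster actually uses:

    // Illustrative only: needed when the target cluster does not use Murmur3Partitioner.
    CQLSSTableWriter.Builder builder = CQLSSTableWriter.builder()
            .withPartitioner(org.apache.cassandra.dht.RandomPartitioner.instance);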

All these variables or settings are used by the builder class CQLSSTableWriter.Builder to create a CQLSSTableWriter object.

Create a File object for the output directory.

    File outputDir = new File(OUTPUT_DIR + File.separator + "CQLKeyspace" + File.separator + "t");

Obtain a CQLSSTableWriter.Builder object using the static method CQLSSTableWriter.builder(). Set the following items:

  • output directory File object

  • user-defined type

  • SSTable schema

  • buffer size

  • prepared statement

  • optionally any of the other builder options

and invoke the build() method to create a CQLSSTableWriter object:

    CQLSSTableWriter writer = CQLSSTableWriter.builder()
            .inDirectory(outputDir)
            .withType(type)
            .forTable(schema)
            .withBufferSizeInMB(256)
            .using(insertStmt)
            .build();

Set the SSTable data. If any user-defined types are used, obtain a UserType object for each type:

    UserType userType = writer.getUDType("intType");

Add data rows for the resulting SSTable:

    writer.addRow(0, 0, "val0", userType.newValue().setInt("a", 0).setInt("b", 0));
    writer.addRow(1, 1, "val1", userType.newValue().setInt("a", 1).setInt("b", 1));
    writer.addRow(2, 2, "val2", userType.newValue().setInt("a", 2).setInt("b", 2));

Close the writer, finalizing the SSTable:

    writer.close();
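Putting the steps together, a minimal end-to-end sketch might look as follows. Assumptions are flagged in comments: the UserType import path shown is the one used in the Cassandra 4.x tree, and error handling is omitted for brevity.

    import java.io.File;
    import java.io.IOException;

    import org.apache.cassandra.io.sstable.CQLSSTableWriter;
    // Assumption: the package of the UserType returned by getUDType in Cassandra 4.x.
    import org.apache.cassandra.cql3.functions.types.UserType;

    public class SSTableGenerator {

        public static final String OUTPUT_DIR = "./sstables";

        public static void main(String[] args) throws IOException {
            String type = "CREATE TYPE CQLKeyspace.intType (a int, b int)";
            String schema = "CREATE TABLE CQLKeyspace.t ("
                          + "  id int PRIMARY KEY,"
                          + "  k int,"
                          + "  v1 text,"
                          + "  v2 intType"
                          + ")";
            String insertStmt = "INSERT INTO CQLKeyspace.t (id, k, v1, v2) VALUES (?, ?, ?, ?)";

            // The output directory must already exist: ./sstables/CQLKeyspace/t
            File outputDir = new File(OUTPUT_DIR + File.separator + "CQLKeyspace" + File.separator + "t");

            CQLSSTableWriter writer = CQLSSTableWriter.builder()
                    .inDirectory(outputDir)
                    .withType(type)
                    .forTable(schema)
                    .using(insertStmt)
                    .build();

            // Obtain the user-defined type to build values for the v2 column.
            UserType userType = writer.getUDType("intType");
            writer.addRow(0, 0, "val0", userType.newValue().setInt("a", 0).setInt("b", 0));
            writer.addRow(1, 1, "val1", userType.newValue().setInt("a", 1).setInt("b", 1));

            // Finalize the SSTable on disk.
            writer.close();
        }
    }

The generated SSTables can then be bulk loaded with sstableloader or nodetool import as described earlier.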

Other public methods the CQLSSTableWriter class provides are:


addRow(java.util.List<java.lang.Object> values)

Adds a new row to the writer. Returns a CQLSSTableWriter object. Each provided value type should correspond to the type of the CQL column the value is for. The correspondence between Java types and CQL types is the same as the one documented at www.datastax.com/drivers/java/2.0/apidocs/com/datastax/driver/core/DataType.Name.html#asJavaClass().

addRow(java.util.Map<java.lang.String,java.lang.Object> values)

Adds a new row to the writer. Returns a CQLSSTableWriter object. This is equivalent to the other addRow methods, but takes a map whose keys are the names of the columns to add, instead of a list of values in the order of the insert statement used during construction of this SSTable writer (see the sketch after this list). The column names in the map keys must be in lowercase unless the declared column name is a case-sensitive quoted identifier, in which case the map key must use the exact case of the column. The values parameter is a map of column name to column value representing the new row to add. If a column is not included in the map, its value will be null. If the map contains keys that do not correspond to one of the columns of the insert statement used when creating this SSTable writer, the corresponding value is ignored.

addRow(java.lang.Object…​ values)

Adds a new row to the writer. Returns a CQLSSTableWriter object.

CQLSSTableWriter.builder()

Returns a new builder for a CQLSSTableWriter.

close()

Closes the writer.

rawAddRow(java.nio.ByteBuffer…​ values)

Adds a new row to the writer given already serialized binary values. Returns a CQLSSTableWriter object. The row values must correspond to the bind variables of the insertion statement used when creating this SSTable writer.

rawAddRow(java.util.List<java.nio.ByteBuffer> values)

Adds a new row to the writer given already serialized binary values. Returns a CQLSSTableWriter object. The row values must correspond to the bind variables of the insertion statement used when creating this SSTable writer.

rawAddRow(java.util.Map<java.lang.String, java.nio.ByteBuffer> values)

Adds a new row to the writer given already serialized binary values. Returns a CQLSSTableWriter object. The row values must correspond to the bind variables of the insertion statement used when creating this SSTable writer.

getUDType(String dataType)

Returns the User Defined type used in this SSTable Writer that can be used to create UDTValue instances.
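A hedged sketch of the map-based addRow variant described above, assuming java.util.Map and java.util.HashMap are imported and reusing the writer and userType objects from the earlier example:

    // Map keys are lowercase column names; order does not matter.
    Map<String, Object> row = new HashMap<>();
    row.put("id", 3);
    row.put("k", 3);
    row.put("v1", "val3");
    row.put("v2", userType.newValue().setInt("a", 3).setInt("b", 3));
    writer.addRow(row);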

Other public methods the CQLSSTableWriter.Builder class provides are:


inDirectory(String directory)

The directory to write the SSTables to. This is a mandatory option. The directory should already exist and be writable.

inDirectory(File directory)

The directory to write the SSTables to. This is a mandatory option. The directory should already exist and be writable.

forTable(String schema)

The schema (CREATE TABLE statement) for the table for which the SSTable is to be created. The provided CREATE TABLE statement must use a fully-qualified table name, one that includes the keyspace name. This is a mandatory option.

withPartitioner(IPartitioner partitioner)

The partitioner to use. By default, Murmur3Partitioner will be used. If this is not the partitioner used by the cluster for which the SSTables are created, the correct partitioner needs to be provided.

using(String insert)

The INSERT or UPDATE statement defining the order of the values to add for a given CQL row. The provided INSERT statement must use a fully-qualified table name, one that includes the keyspace name. Moreover, said statement must use bind variables since these variables will be bound to values by the resulting SSTable writer. This is a mandatory option.

withBufferSizeInMiB(int size)

The size of the buffer to use. This defines how much data will be buffered before being written as a new SSTable, and corresponds roughly to the data size the created SSTable will have. The default is 128MB, which should be reasonable for a 1GB heap. If an OutOfMemory exception is generated while using the SSTable writer, lower this value.

withBufferSizeInMB(int size)

Deprecated; it will remain available at least until the next major release. Please use withBufferSizeInMiB(int size), which is the same method with a new name.

sorted()

Creates a CQLSSTableWriter that expects sorted inputs. If this option is used, the resulting SSTable writer will expect rows to be added in SSTable sorted order (and an exception will be thrown if that is not the case during row insertion). SSTable sorted order means that rows are added such that their partition keys respect the partitioner order. This option should only be used if the rows can be provided in order, which is rarely the case. If the rows can be provided in order, however, using this sorted writer might be more efficient. If this option is used, options like withBufferSizeInMiB will be ignored.

build()

Builds a CQLSSTableWriter object.