High Performance Writing
- How to achieve high performance data writing
- Sample Programs
  - Scenario
  - Sample Programs
    - note

High Performance Writing

This chapter introduces how to write data into TDengine with high throughput.

How to achieve high performance data writing

To achieve high performance writing, there are a few aspects to consider. In the following sections we will describe these important factors in achieving high performance writing.

Application Program

From the perspective of application program, you need to consider:

The data size of each single write, also known as batch size. Generally speaking, higher batch size generates better writing performance. However, once the batch size is over a specific value, you will not get any additional benefit anymore. When using SQL to write into TDengine, it’s better to put as much as possible data in single SQL. The maximum SQL length supported by TDengine is 1,048,576 bytes, i.e. 1 MB. It can be configured by parameter maxSQLLength on client side, and the default value is 65,480.
The number of concurrent connections. Normally more connections can get better result. However, once the number of connections exceeds the processing ability of the server side, the performance may downgrade.
The distribution of data to be written across tables or sub-tables. Writing to single table in one batch is more efficient than writing to multiple tables in one batch.
Data Writing Protocol.
- Prameter binding mode is more efficient than SQL because it doesn’t have the cost of parsing SQL.
- Writing to known existing tables is more efficient than wirting to uncertain tables in automatic creating mode because the later needs to check whether the table exists or not before actually writing data into it
- Writing in SQL is more efficient than writing in schemaless mode because schemaless writing creats table automatically and may alter table schema

Application programs need to take care of the above factors and try to take advantage of them. The application progam should write to single table in each write batch. The batch size needs to be tuned to a proper value on a specific system. The number of concurrent connections needs to be tuned to a proper value too to achieve the best writing throughput.

Data Source

Application programs need to read data from data source then write into TDengine. If you meet one or more of below situations, you need to setup message queues between the threads for reading from data source and the threads for writing into TDengine.

There are multiple data sources, the data generation speed of each data source is much slower than the speed of single writing thread. In this case, the purpose of message queues is to consolidate the data from multiple data sources together to increase the batch size of single write.
The speed of data generation from single data source is much higher than the speed of single writing thread. The purpose of message queue in this case is to provide buffer so that data is not lost and multiple writing threads can get data from the buffer.
The data for single table are from multiple data source. In this case the purpose of message queues is to combine the data for single table together to improve the write efficiency.

If the data source is Kafka, then the appication program is a consumer of Kafka, you can benefit from some kafka features to achieve high performance writing:

Put the data for a table in single partition of single topic so that it’s easier to put the data for each table together and write in batch
Subscribe multiple topics to accumulate data together.
Add more consumers to gain more concurrency and throughput.
Incrase the size of single fetch to increase the size of write batch.

Tune TDengine

TDengine is a distributed and high performance time series database, there are also some ways to tune TDengine to get better writing performance.

Set proper number of vgroups according to available CPU cores. Normally, we recommend 2 * number_of_cores as a starting point. If the verification result shows this is not enough to utilize CPU resources, you can use a higher value.
Set proper minTablesPerVnode, tableIncStepPerVnode, and maxVgroupsPerDb according to the number of tables so that tables are distributed even across vgroups. The purpose is to balance the workload among all vnodes so that system resources can be utilized better to get higher performance.

For more performance tuning tips, please refer to Performance Optimization and Configuration Parameters.

Sample Programs

This section will introduce the sample programs to demonstrate how to write into TDengine with high performance.

Scenario

Below are the scenario for the sample programs of high performance wrting.

Application program reads data from data source, the sample program simulates a data source by generating data
The speed of single writing thread is much slower than the speed of generating data, so the program starts multiple writing threads while each thread establish a connection to TDengine and each thread has a message queue of fixed size.
Application program maps the received data to different writing threads based on table name to make sure all the data for each table is always processed by a specific writing thread.
Each writing thread writes the received data into TDengine once the message queue becomes empty or the read data meets a threshold.

Sample Programs

The sample programs listed in this section are based on the scenario described previously. If your scenarios is different, please try to adjust the code based on the principles described in this chapter.

The sample programs assume the source data is for all the different sub tables in same super table (meters). The super table has been created before the sample program starts to writing data. Sub tables are created automatically according to received data. If there are multiple super tables in your case, please try to adjust the part of creating table automatically.

Java
Python

Program Inventory

Class	Description
FastWriteExample	Main Program
ReadTask	Read data from simulated data source and put into a queue according to the hash value of table name
WriteTask	Read data from Queue, compose a wirte batch and write into TDengine
MockDataSource	Generate data for some sub tables of super table meters
SQLWriter	WriteTask uses this class to compose SQL, create table automatically, check SQL length and write data
StmtWriter	Write in Parameter binding mode (Not finished yet)
DataBaseMonitor	Calculate the writing speed and output on console every 10 seconds

Below is the list of complete code of the classes in above table and more detailed description.

FastWriteExample

The main Program is responsible for:

Create message queues
Start writing threads
Start reading threads
Otuput writing speed every 10 seconds

The main program provides 4 parameters for tuning：

The number of reading threads, default value is 1
The number of writing threads, default alue is 2
The total number of tables in the generated data, default value is 1000. These tables are distributed evenly across all writing threads. If the number of tables is very big, it will cost much time to firstly create these tables.
The batch size of single write, default value is 3,000

The capacity of message queue also impacts performance and can be tuned by modifying program. Normally it’s always better to have a larger message queue. A larger message queue means lower possibility of being blocked when enqueueing and higher throughput. But a larger message queue consumes more memory space. The default value used in the sample programs is already big enoug.

package com.taos.example.highvolume;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import java.sql.*;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
public class FastWriteExample {
    final static Logger logger = LoggerFactory.getLogger(FastWriteExample.class);
    final static int taskQueueCapacity = 1000000;
    final static List<BlockingQueue<String>> taskQueues = new ArrayList<>();
    final static List<ReadTask> readTasks = new ArrayList<>();
    final static List<WriteTask> writeTasks = new ArrayList<>();
    final static DataBaseMonitor databaseMonitor = new DataBaseMonitor();
    public static void stopAll() {
        logger.info("shutting down");
        readTasks.forEach(task -> task.stop());
        writeTasks.forEach(task -> task.stop());
        databaseMonitor.close();
    }
    public static void main(String[] args) throws InterruptedException, SQLException {
        int readTaskCount = args.length > 0 ? Integer.parseInt(args[0]) : 1;
        int writeTaskCount = args.length > 1 ? Integer.parseInt(args[1]) : 3;
        int tableCount = args.length > 2 ? Integer.parseInt(args[2]) : 1000;
        int maxBatchSize = args.length > 3 ? Integer.parseInt(args[3]) : 3000;
        logger.info("readTaskCount={}, writeTaskCount={} tableCount={} maxBatchSize={}",
                readTaskCount, writeTaskCount, tableCount, maxBatchSize);
        databaseMonitor.init().prepareDatabase();
        // Create task queues, whiting tasks and start writing threads.
        for (int i = 0; i < writeTaskCount; ++i) {
            BlockingQueue<String> queue = new ArrayBlockingQueue<>(taskQueueCapacity);
            taskQueues.add(queue);
            WriteTask task = new WriteTask(queue, maxBatchSize);
            Thread t = new Thread(task);
            t.setName("WriteThread-" + i);
            t.start();
        }
        // create reading tasks and start reading threads
        int tableCountPerTask = tableCount / readTaskCount;
        for (int i = 0; i < readTaskCount; ++i) {
            ReadTask task = new ReadTask(i, taskQueues, tableCountPerTask);
            Thread t = new Thread(task);
            t.setName("ReadThread-" + i);
            t.start();
        }
        Runtime.getRuntime().addShutdownHook(new Thread(FastWriteExample::stopAll));
        long lastCount = 0;
        while (true) {
            Thread.sleep(10000);
            long numberOfTable = databaseMonitor.getTableCount();
            long count = databaseMonitor.count();
            logger.info("numberOfTable={} count={} speed={}", numberOfTable, count, (count - lastCount) / 10);
            lastCount = count;
        }
    }
}

view source code

ReadTask

ReadTask reads data from data source. Each ReadTask is associated with a simulated data source, each data source generates data for a group of specific tables, and the data of any table is only generated from a single specific data source.

ReadTask puts data in message queue in blocking mode. That means, the putting operation is blocked if the message queue is full.

package com.taos.example.highvolume;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.BlockingQueue;
class ReadTask implements Runnable {
    private final static Logger logger = LoggerFactory.getLogger(ReadTask.class);
    private final int taskId;
    private final List<BlockingQueue<String>> taskQueues;
    private final int queueCount;
    private final int tableCount;
    private boolean active = true;
    public ReadTask(int readTaskId, List<BlockingQueue<String>> queues, int tableCount) {
        this.taskId = readTaskId;
        this.taskQueues = queues;
        this.queueCount = queues.size();
        this.tableCount = tableCount;
    }
    /**
     * Assign data received to different queues.
     * Here we use the suffix number in table name.
     * You are expected to define your own rule in practice.
     *
     * @param line record received
     * @return which queue to use
     */
    public int getQueueId(String line) {
        String tbName = line.substring(0, line.indexOf(',')); // For example: tb1_101
        String suffixNumber = tbName.split("_")[1];
        return Integer.parseInt(suffixNumber) % this.queueCount;
    }
    @Override
    public void run() {
        logger.info("started");
        Iterator<String> it = new MockDataSource("tb" + this.taskId, tableCount);
        try {
            while (it.hasNext() && active) {
                String line = it.next();
                int queueId = getQueueId(line);
                taskQueues.get(queueId).put(line);
            }
        } catch (Exception e) {
            logger.error("Read Task Error", e);
        }
    }
    public void stop() {
        logger.info("stop");
        this.active = false;
    }
}

view source code

WriteTask

package com.taos.example.highvolume;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import java.util.concurrent.BlockingQueue;
class WriteTask implements Runnable {
    private final static Logger logger = LoggerFactory.getLogger(WriteTask.class);
    private final int maxBatchSize;
    // the queue from which this writing task get raw data.
    private final BlockingQueue<String> queue;
    // A flag indicate whether to continue.
    private boolean active = true;
    public WriteTask(BlockingQueue<String> taskQueue, int maxBatchSize) {
        this.queue = taskQueue;
        this.maxBatchSize = maxBatchSize;
    }
    @Override
    public void run() {
        logger.info("started");
        String line = null; // data getting from the queue just now.
        SQLWriter writer = new SQLWriter(maxBatchSize);
        try {
            writer.init();
            while (active) {
                line = queue.poll();
                if (line != null) {
                    // parse raw data and buffer the data.
                    writer.processLine(line);
                } else if (writer.hasBufferedValues()) {
                    // write data immediately if no more data in the queue
                    writer.flush();
                } else {
                    // sleep a while to avoid high CPU usage if no more data in the queue and no buffered records, .
                    Thread.sleep(100);
                }
            }
            if (writer.hasBufferedValues()) {
                writer.flush();
            }
        } catch (Exception e) {
            String msg = String.format("line=%s, bufferedCount=%s", line, writer.getBufferedCount());
            logger.error(msg, e);
        } finally {
            writer.close();
        }
    }
    public void stop() {
        logger.info("stop");
        this.active = false;
    }
}

view source code

MockDataSource

package com.taos.example.highvolume;
import java.util.Iterator;
/**
 * Generate test data
 */
class MockDataSource implements Iterator {
    private String tbNamePrefix;
    private int tableCount;
    private long maxRowsPerTable = 1000000000L;
    // 100 milliseconds between two neighbouring rows.
    long startMs = System.currentTimeMillis() - maxRowsPerTable * 100;
    private int currentRow = 0;
    private int currentTbId = -1;
    // mock values
    String[] location = {"California.LosAngeles", "California.SanDiego", "California.SanJose", "California.Campbell", "California.SanFrancisco"};
    float[] current = {8.8f, 10.7f, 9.9f, 8.9f, 9.4f};
    int[] voltage = {119, 116, 111, 113, 118};
    float[] phase = {0.32f, 0.34f, 0.33f, 0.329f, 0.141f};
    public MockDataSource(String tbNamePrefix, int tableCount) {
        this.tbNamePrefix = tbNamePrefix;
        this.tableCount = tableCount;
    }
    @Override
    public boolean hasNext() {
        currentTbId += 1;
        if (currentTbId == tableCount) {
            currentTbId = 0;
            currentRow += 1;
        }
        return currentRow < maxRowsPerTable;
    }
    @Override
    public String next() {
        long ts = startMs + 100 * currentRow;
        int groupId = currentTbId % 5 == 0 ? currentTbId / 5 : currentTbId / 5 + 1;
        StringBuilder sb = new StringBuilder(tbNamePrefix + "_" + currentTbId + ","); // tbName
        sb.append(ts).append(','); // ts
        sb.append(current[currentRow % 5]).append(','); // current
        sb.append(voltage[currentRow % 5]).append(','); // voltage
        sb.append(phase[currentRow % 5]).append(','); // phase
        sb.append(location[currentRow % 5]).append(','); // location
        sb.append(groupId); // groupID
        return sb.toString();
    }
}

view source code

SQLWriter

package com.taos.example.highvolume;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import java.sql.*;
import java.util.HashMap;
import java.util.Map;
/**
 * A helper class encapsulate the logic of writing using SQL.
 * <p>
 * The main interfaces are two methods:
 * <ol>
 *     <li>{@link SQLWriter#processLine}, which receive raw lines from WriteTask and group them by table names.</li>
 *     <li>{@link  SQLWriter#flush}, which assemble INSERT statement and execute it.</li>
 * </ol>
 * <p>
 * There is a technical skill worth mentioning: we create table as needed when "table does not exist" error occur instead of creating table automatically using syntax "INSET INTO tb USING stb".
 * This ensure that checking table existence is a one-time-only operation.
 * </p>
 *
 * </p>
 */
public class SQLWriter {
    final static Logger logger = LoggerFactory.getLogger(SQLWriter.class);
    private Connection conn;
    private Statement stmt;
    /**
     * current number of buffered records
     */
    private int bufferedCount = 0;
    /**
     * Maximum number of buffered records.
     * Flush action will be triggered if bufferedCount reached this value,
     */
    private int maxBatchSize;
    /**
     * Maximum SQL length.
     */
    private int maxSQLLength;
    /**
     * Map from table name to column values. For example:
     * "tb001" -> "(1648432611249,2.1,114,0.09) (1648432611250,2.2,135,0.2)"
     */
    private Map<String, String> tbValues = new HashMap<>();
    /**
     * Map from table name to tag values in the same order as creating stable.
     * Used for creating table.
     */
    private Map<String, String> tbTags = new HashMap<>();
    public SQLWriter(int maxBatchSize) {
        this.maxBatchSize = maxBatchSize;
    }
    /**
     * Get Database Connection
     *
     * @return Connection
     * @throws SQLException
     */
    private static Connection getConnection() throws SQLException {
        String jdbcURL = System.getenv("TDENGINE_JDBC_URL");
        return DriverManager.getConnection(jdbcURL);
    }
    /**
     * Create Connection and Statement
     *
     * @throws SQLException
     */
    public void init() throws SQLException {
        conn = getConnection();
        stmt = conn.createStatement();
        stmt.execute("use test");
        ResultSet rs = stmt.executeQuery("show variables");
        while (rs.next()) {
            String configName = rs.getString(1);
            if ("maxSQLLength".equals(configName)) {
                maxSQLLength = Integer.parseInt(rs.getString(2));
                logger.info("maxSQLLength={}", maxSQLLength);
            }
        }
    }
    /**
     * Convert raw data to SQL fragments, group them by table name and cache them in a HashMap.
     * Trigger writing when number of buffered records reached maxBachSize.
     *
     * @param line raw data get from task queue in format: tbName,ts,current,voltage,phase,location,groupId
     */
    public void processLine(String line) throws SQLException {
        bufferedCount += 1;
        int firstComma = line.indexOf(',');
        String tbName = line.substring(0, firstComma);
        int lastComma = line.lastIndexOf(',');
        int secondLastComma = line.lastIndexOf(',', lastComma - 1);
        String value = "(" + line.substring(firstComma + 1, secondLastComma) + ") ";
        if (tbValues.containsKey(tbName)) {
            tbValues.put(tbName, tbValues.get(tbName) + value);
        } else {
            tbValues.put(tbName, value);
        }
        if (!tbTags.containsKey(tbName)) {
            String location = line.substring(secondLastComma + 1, lastComma);
            String groupId = line.substring(lastComma + 1);
            String tagValues = "('" + location + "'," + groupId + ')';
            tbTags.put(tbName, tagValues);
        }
        if (bufferedCount == maxBatchSize) {
            flush();
        }
    }
    /**
     * Assemble INSERT statement using buffered SQL fragments in Map {@link SQLWriter#tbValues} and execute it.
     * In case of "Table does not exit" exception, create all tables in the sql and retry the sql.
     */
    public void flush() throws SQLException {
        StringBuilder sb = new StringBuilder("INSERT INTO ");
        for (Map.Entry<String, String> entry : tbValues.entrySet()) {
            String tableName = entry.getKey();
            String values = entry.getValue();
            String q = tableName + " values " + values + " ";
            if (sb.length() + q.length() > maxSQLLength) {
                executeSQL(sb.toString());
                logger.warn("increase maxSQLLength or decrease maxBatchSize to gain better performance");
                sb = new StringBuilder("INSERT INTO ");
            }
            sb.append(q);
        }
        executeSQL(sb.toString());
        tbValues.clear();
        bufferedCount = 0;
    }
    private void executeSQL(String sql) throws SQLException {
        try {
            stmt.executeUpdate(sql);
        } catch (SQLException e) {
            // convert to error code defined in taoserror.h
            int errorCode = e.getErrorCode() & 0xffff;
            if (errorCode == 0x362 || errorCode == 0x218) {
                // Table does not exist
                createTables();
                executeSQL(sql);
            } else {
                logger.error("Execute SQL: {}", sql);
                throw e;
            }
        } catch (Throwable throwable) {
            logger.error("Execute SQL: {}", sql);
            throw throwable;
        }
    }
    /**
     * Create tables in batch using syntax:
     * <p>
     * CREATE TABLE [IF NOT EXISTS] tb_name1 USING stb_name TAGS (tag_value1, ...) [IF NOT EXISTS] tb_name2 USING stb_name TAGS (tag_value2, ...) ...;
     * </p>
     */
    private void createTables() throws SQLException {
        StringBuilder sb = new StringBuilder("CREATE TABLE ");
        for (String tbName : tbValues.keySet()) {
            String tagValues = tbTags.get(tbName);
            sb.append("IF NOT EXISTS ").append(tbName).append(" USING meters TAGS ").append(tagValues).append(" ");
        }
        String sql = sb.toString();
        try {
            stmt.executeUpdate(sql);
        } catch (Throwable throwable) {
            logger.error("Execute SQL: {}", sql);
            throw throwable;
        }
    }
    public boolean hasBufferedValues() {
        return bufferedCount > 0;
    }
    public int getBufferedCount() {
        return bufferedCount;
    }
    public void close() {
        try {
            stmt.close();
        } catch (SQLException e) {
        }
        try {
            conn.close();
        } catch (SQLException e) {
        }
    }
}

view source code

DataBaseMonitor

package com.taos.example.highvolume;
import java.sql.*;
/**
 * Prepare target database.
 * Count total records in database periodically so that we can estimate the writing speed.
 */
public class DataBaseMonitor {
    private Connection conn;
    private Statement stmt;
    public DataBaseMonitor init() throws SQLException {
        if (conn == null) {
            String jdbcURL = System.getenv("TDENGINE_JDBC_URL");
            conn = DriverManager.getConnection(jdbcURL);
            stmt = conn.createStatement();
        }
        return this;
    }
    public void close() {
        try {
            stmt.close();
        } catch (SQLException e) {
        }
        try {
            conn.close();
        } catch (SQLException e) {
        }
    }
    public void prepareDatabase() throws SQLException {
        stmt.execute("DROP DATABASE IF EXISTS test");
        stmt.execute("CREATE DATABASE test");
        stmt.execute("CREATE STABLE test.meters (ts TIMESTAMP, current FLOAT, voltage INT, phase FLOAT) TAGS (location BINARY(64), groupId INT)");
    }
    public Long count() throws SQLException {
        if (!stmt.isClosed()) {
            ResultSet result = stmt.executeQuery("SELECT count(*) from test.meters");
            result.next();
            return result.getLong(1);
        }
        return null;
    }
    /**
     * show test.stables;
     *
     *               name              |      created_time       | columns |  tags  |   tables    |
     * ============================================================================================
     *  meters                         | 2022-07-20 08:39:30.902 |       4 |      2 |      620000 |
     */
    public  Long getTableCount() throws SQLException {
        if (!stmt.isClosed()) {
            ResultSet result = stmt.executeQuery("show test.stables");
            result.next();
            return result.getLong(5);
        }
        return null;
    }
}

view source code

Steps to Launch

Launch Java Sample Program

You need to set environment variable TDENGINE_JDBC_URL before launching the program. If TDengine Server is setup on localhost, then the default value for user name, password and port can be used, like below:

TDENGINE_JDBC_URL="jdbc:TAOS://localhost:6030?user=root&password=taosdata"

Launch in IDE

Clone TDengine repolitory

git clone git@github.com:taosdata/TDengine.git --depth 1

Use IDE to open docs/examples/java directory
Configure environment variable TDENGINE_JDBC_URL, you can also configure it before launching the IDE, if so you can skip this step.
Run class com.taos.example.highvolume.FastWriteExample

Launch on server

If you want to launch the sample program on a remote server, please follow below steps:

Package the sample programs. Execute below command under directory TDengine/docs/examples/java ：
```
mvn package
```
Create examples/java directory on the server
```
mkdir -p examples/java
```
Copy dependencies (below commands assume you are working on a local Windows host and try to launch on a remote Linux host)
- Copy dependent packages
```
scp -r .\target\lib <user>@<host>:~/examples/java
```
- Copy the jar of sample programs
```
scp -r .\target\javaexample-1.0.jar <user>@<host>:~/examples/java
```
Configure environment variable Edit ~/.bash_profile or ~/.bashrc and add below:
```
export TDENGINE_JDBC_URL="jdbc:TAOS://localhost:6030?user=root&password=taosdata"
```
If your TDengine server is not deployed on localhost or doesn’t use default port, you need to change the above URL to correct value in your environment.

Launch the sample program

java -classpath lib/*:javaexample-1.0.jar  com.taos.example.highvolume.FastWriteExample <read_thread_count>  <white_thread_count> <total_table_count> <max_batch_size>

The sample program doesn’t exit unless you press CTRL + C to terminate it. Below is the output of running on a server of 16 cores, 64GB memory and SSD hard disk.

root@vm85$ java -classpath lib/*:javaexample-1.0.jar  com.taos.example.highvolume.FastWriteExample 2 12
18:56:35.896 [main] INFO  c.t.e.highvolume.FastWriteExample - readTaskCount=2, writeTaskCount=12 tableCount=1000 maxBatchSize=3000
18:56:36.011 [WriteThread-0] INFO  c.taos.example.highvolume.WriteTask - started
18:56:36.015 [WriteThread-0] INFO  c.taos.example.highvolume.SQLWriter - maxSQLLength=1048576
18:56:36.021 [WriteThread-1] INFO  c.taos.example.highvolume.WriteTask - started
18:56:36.022 [WriteThread-1] INFO  c.taos.example.highvolume.SQLWriter - maxSQLLength=1048576
18:56:36.031 [WriteThread-2] INFO  c.taos.example.highvolume.WriteTask - started
18:56:36.032 [WriteThread-2] INFO  c.taos.example.highvolume.SQLWriter - maxSQLLength=1048576
18:56:36.041 [WriteThread-3] INFO  c.taos.example.highvolume.WriteTask - started
18:56:36.042 [WriteThread-3] INFO  c.taos.example.highvolume.SQLWriter - maxSQLLength=1048576
18:56:36.093 [WriteThread-4] INFO  c.taos.example.highvolume.WriteTask - started
18:56:36.094 [WriteThread-4] INFO  c.taos.example.highvolume.SQLWriter - maxSQLLength=1048576
18:56:36.099 [WriteThread-5] INFO  c.taos.example.highvolume.WriteTask - started
18:56:36.100 [WriteThread-5] INFO  c.taos.example.highvolume.SQLWriter - maxSQLLength=1048576
18:56:36.100 [WriteThread-6] INFO  c.taos.example.highvolume.WriteTask - started
18:56:36.101 [WriteThread-6] INFO  c.taos.example.highvolume.SQLWriter - maxSQLLength=1048576
18:56:36.103 [WriteThread-7] INFO  c.taos.example.highvolume.WriteTask - started
18:56:36.104 [WriteThread-7] INFO  c.taos.example.highvolume.SQLWriter - maxSQLLength=1048576
18:56:36.105 [WriteThread-8] INFO  c.taos.example.highvolume.WriteTask - started
18:56:36.107 [WriteThread-8] INFO  c.taos.example.highvolume.SQLWriter - maxSQLLength=1048576
18:56:36.108 [WriteThread-9] INFO  c.taos.example.highvolume.WriteTask - started
18:56:36.109 [WriteThread-9] INFO  c.taos.example.highvolume.SQLWriter - maxSQLLength=1048576
18:56:36.156 [WriteThread-10] INFO  c.taos.example.highvolume.WriteTask - started
18:56:36.157 [WriteThread-11] INFO  c.taos.example.highvolume.WriteTask - started
18:56:36.158 [WriteThread-10] INFO  c.taos.example.highvolume.SQLWriter - maxSQLLength=1048576
18:56:36.158 [ReadThread-0] INFO  com.taos.example.highvolume.ReadTask - started
18:56:36.158 [ReadThread-1] INFO  com.taos.example.highvolume.ReadTask - started
18:56:36.158 [WriteThread-11] INFO  c.taos.example.highvolume.SQLWriter - maxSQLLength=1048576
18:56:46.369 [main] INFO  c.t.e.highvolume.FastWriteExample - count=18554448 speed=1855444
18:56:56.946 [main] INFO  c.t.e.highvolume.FastWriteExample - count=39059660 speed=2050521
18:57:07.322 [main] INFO  c.t.e.highvolume.FastWriteExample - count=59403604 speed=2034394
18:57:18.032 [main] INFO  c.t.e.highvolume.FastWriteExample - count=80262938 speed=2085933
18:57:28.432 [main] INFO  c.t.e.highvolume.FastWriteExample - count=101139906 speed=2087696
18:57:38.921 [main] INFO  c.t.e.highvolume.FastWriteExample - count=121807202 speed=2066729
18:57:49.375 [main] INFO  c.t.e.highvolume.FastWriteExample - count=142952417 speed=2114521
18:58:00.689 [main] INFO  c.t.e.highvolume.FastWriteExample - count=163650306 speed=2069788
18:58:11.646 [main] INFO  c.t.e.highvolume.FastWriteExample - count=185019808 speed=2136950

Program Inventory

Sample programs in Python uses multi-process and cross-process message queues.

Function/CLass	Description
main Function	Program entry point, create child processes and message queues
run_monitor_process Function	Create database, super table, calculate writing speed and output to console
run_read_task Function	Read data and distribute to message queues
MockDataSource Class	Simulate data source, return next 1,000 rows of each table
run_write_task Function	Read as much as possible data from message queue and write in batch
SQLWriter Class	Write in SQL and create table utomatically
StmtWriter Class	Write in parameter binding mode (not finished yet)

main function

main function is responsible for creating message queues and fork child processes, there are 3 kinds of child processes:

Monitoring process, initializes database and calculating writing speed
Reading process (n), reads data from data source
Writing process (m), wirtes data into TDengine

main function provides 5 parameters:

The number of reading tasks, default value is 1
The number of writing tasks, default value is 1
The number of tables, default value is 1,000
The capacity of message queue, default value is 1,000,000 bytes
The batch size in single write, default value is 3000

def main():
    set_global_config()
    logging.info(f"READ_TASK_COUNT={READ_TASK_COUNT}, WRITE_TASK_COUNT={WRITE_TASK_COUNT}, "
                 f"TABLE_COUNT={TABLE_COUNT}, QUEUE_SIZE={QUEUE_SIZE}, MAX_BATCH_SIZE={MAX_BATCH_SIZE}")
    monitor_process = Process(target=run_monitor_process)
    monitor_process.start()
    time.sleep(3)  # waiting for database ready.
    task_queues: List[Queue] = []
    # create task queues
    for i in range(WRITE_TASK_COUNT):
        queue = Queue(max_size_bytes=QUEUE_SIZE)
        task_queues.append(queue)
    # create write processes
    for i in range(WRITE_TASK_COUNT):
        p = Process(target=run_write_task, args=(i, task_queues[i]))
        p.start()
        logging.debug(f"WriteTask-{i} started with pid {p.pid}")
        write_processes.append(p)
    # create read processes
    for i in range(READ_TASK_COUNT):
        queues = assign_queues(i, task_queues)
        p = Process(target=run_read_task, args=(i, queues))
        p.start()
        logging.debug(f"ReadTask-{i} started with pid {p.pid}")
        read_processes.append(p)
    try:
        monitor_process.join()
    except KeyboardInterrupt:
        monitor_process.terminate()
        [p.terminate() for p in read_processes]
        [p.terminate() for p in write_processes]
        [q.close() for q in task_queues]
def assign_queues(read_task_id, task_queues):
    """
    Compute target queues for a specific read task.
    """
    ratio = WRITE_TASK_COUNT / READ_TASK_COUNT
    from_index = math.floor(read_task_id * ratio)
    end_index = math.ceil((read_task_id + 1) * ratio)
    return task_queues[from_index:end_index]
if __name__ == '__main__':
    main()

view source code

run_monitor_process

Monitoring process initilizes database and monitoring writing speed.

def run_monitor_process():
    log = logging.getLogger("DataBaseMonitor")
    conn = get_connection()
    conn.execute("DROP DATABASE IF EXISTS test")
    conn.execute("CREATE DATABASE test")
    conn.execute("CREATE STABLE test.meters (ts TIMESTAMP, current FLOAT, voltage INT, phase FLOAT) "
                 "TAGS (location BINARY(64), groupId INT)")
    def get_count():
        res = conn.query("SELECT count(*) FROM test.meters")
        rows = res.fetch_all()
        return rows[0][0] if rows else 0
    last_count = 0
    while True:
        time.sleep(10)
        count = get_count()
        log.info(f"count={count} speed={(count - last_count) / 10}")
        last_count = count

view source code

run_read_task function

Reading process reads data from other data system and distributes to the message queue allocated for it.


def run_read_task(task_id: int, task_queues: List[Queue]):
    table_count_per_task = TABLE_COUNT // READ_TASK_COUNT
    data_source = MockDataSource(f"tb{task_id}", table_count_per_task)
    try:
        for batch in data_source:
            for table_id, rows in batch:
                # hash data to different queue
                i = table_id % len(task_queues)
                # block putting forever when the queue is full
                task_queues[i].put_many(rows, block=True, timeout=-1)
    except KeyboardInterrupt:
        pass

view source code

MockDataSource

Below is the simulated data source, we assume table name exists in each generated data.

import time
class MockDataSource:
    samples = [
        "8.8,119,0.32,California.LosAngeles,0",
        "10.7,116,0.34,California.SanDiego,1",
        "9.9,111,0.33,California.SanJose,2",
        "8.9,113,0.329,California.Campbell,3",
        "9.4,118,0.141,California.SanFrancisco,4"
    ]
    def __init__(self, tb_name_prefix, table_count):
        self.table_name_prefix = tb_name_prefix + "_"
        self.table_count = table_count
        self.max_rows = 10000000
        self.current_ts = round(time.time() * 1000) - self.max_rows * 100
        # [(tableId, tableName, values),]
        self.data = self._init_data()
    def _init_data(self):
        lines = self.samples * (self.table_count // 5 + 1)
        data = []
        for i in range(self.table_count):
            table_name = self.table_name_prefix + str(i)
            data.append((i, table_name, lines[i]))  # tableId, row
        return data
    def __iter__(self):
        self.row = 0
        return self
    def __next__(self):
        """
        next 1000 rows for each table.
        return: {tableId:[row,...]}
        """
        # generate 1000 timestamps
        ts = []
        for _ in range(1000):
            self.current_ts += 100
            ts.append(str(self.current_ts))
        # add timestamp to each row
        # [(tableId, ["tableName,ts,current,voltage,phase,location,groupId"])]
        result = []
        for table_id, table_name, values in self.data:
            rows = [table_name + ',' + t + ',' + values for t in ts]
            result.append((table_id, rows))
        return result

view source code

run_write_task function

Writing process tries to read as much as possible data from message queue and writes in batch.

def run_write_task(task_id: int, queue: Queue):
    from sql_writer import SQLWriter
    log = logging.getLogger(f"WriteTask-{task_id}")
    writer = SQLWriter(get_connection)
    lines = None
    try:
        while True:
            try:
                # get as many as possible
                lines = queue.get_many(block=False, max_messages_to_get=MAX_BATCH_SIZE)
                writer.process_lines(lines)
            except Empty:
                time.sleep(0.01)
    except KeyboardInterrupt:
        pass
    except BaseException as e:
        log.debug(f"lines={lines}")
        raise e

view source code

SQLWriter

SQLWriter class encapsulates the logic of composing SQL and writing data. Please be noted that the tables have not been created before writing, but are created automatically when catching the exception of table doesn’t exist. For other exceptions caught, the SQL which caused the exception are logged for you to debug. This class also checks the SQL length, if the SQL length is closed to maxSQLLength the SQL will be executed immediately. To improve writing efficiency, it’s better to increase maxSQLLength properly.

import logging
import taos
class SQLWriter:
    log = logging.getLogger("SQLWriter")
    def __init__(self, get_connection_func):
        self._tb_values = {}
        self._tb_tags = {}
        self._conn = get_connection_func()
        self._max_sql_length = self.get_max_sql_length()
        self._conn.execute("USE test")
    def get_max_sql_length(self):
        rows = self._conn.query("SHOW variables").fetch_all()
        for r in rows:
            name = r[0]
            if name == "maxSQLLength":
                return int(r[1])
        return 1024 * 1024
    def process_lines(self, lines: str):
        """
        :param lines: [[tbName,ts,current,voltage,phase,location,groupId]]
        """
        for line in lines:
            ps = line.split(",")
            table_name = ps[0]
            value = '(' + ",".join(ps[1:-2]) + ') '
            if table_name in self._tb_values:
                self._tb_values[table_name] += value
            else:
                self._tb_values[table_name] = value
            if table_name not in self._tb_tags:
                location = ps[-2]
                group_id = ps[-1]
                tag_value = f"('{location}',{group_id})"
                self._tb_tags[table_name] = tag_value
        self.flush()
    def flush(self):
        """
        Assemble INSERT statement and execute it.
        When the sql length grows close to MAX_SQL_LENGTH, the sql will be executed immediately, and a new INSERT statement will be created.
        In case of "Table does not exit" exception, tables in the sql will be created and the sql will be re-executed.
        """
        sql = "INSERT INTO "
        sql_len = len(sql)
        buf = []
        for tb_name, values in self._tb_values.items():
            q = tb_name + " VALUES " + values
            if sql_len + len(q) >= self._max_sql_length:
                sql += " ".join(buf)
                self.execute_sql(sql)
                sql = "INSERT INTO "
                sql_len = len(sql)
                buf = []
            buf.append(q)
            sql_len += len(q)
        sql += " ".join(buf)
        self.execute_sql(sql)
        self._tb_values.clear()
    def execute_sql(self, sql):
        try:
            self._conn.execute(sql)
        except taos.Error as e:
            error_code = e.errno & 0xffff
            # Table does not exit
            if error_code == 0x362 or error_code == 0x218:
                self.create_tables()
            else:
                self.log.error("Execute SQL: %s", sql)
                raise e
        except BaseException as baseException:
            self.log.error("Execute SQL: %s", sql)
            raise baseException
    def create_tables(self):
        sql = "CREATE TABLE "
        for tb in self._tb_values.keys():
            tag_values = self._tb_tags[tb]
            sql += "IF NOT EXISTS " + tb + " USING meters TAGS " + tag_values + " "
        try:
            self._conn.execute(sql)
        except BaseException as e:
            self.log.error("Execute SQL: %s", sql)
            raise e

view source code

Steps to Launch

Launch Sample Program in Python

Prerequisities
- TDengine client driver has been installed
- Python3 has been installed, the the version >= 3.8
- TDengine Python connector taospy has been installed
Install faster-fifo to replace python builtin multiprocessing.Queue
```
pip3 install faster-fifo
```
Click the “Copy” in the above sample programs to copy fast_write_example.py 、 sql_writer.py and mockdatasource.py.

Execute the program

python3  fast_write_example.py <READ_TASK_COUNT> <WRITE_TASK_COUNT> <TABLE_COUNT> <QUEUE_SIZE> <MAX_BATCH_SIZE>

Below is the output of running on a server of 16 cores, 64GB memory and SSD hard disk.

root@vm85$ python3 fast_write_example.py  8 8
2022-07-14 19:13:45,869 [root] - READ_TASK_COUNT=8, WRITE_TASK_COUNT=8, TABLE_COUNT=1000, QUEUE_SIZE=1000000, MAX_BATCH_SIZE=3000
2022-07-14 19:13:48,882 [root] - WriteTask-0 started with pid 718347
2022-07-14 19:13:48,883 [root] - WriteTask-1 started with pid 718348
2022-07-14 19:13:48,884 [root] - WriteTask-2 started with pid 718349
2022-07-14 19:13:48,884 [root] - WriteTask-3 started with pid 718350
2022-07-14 19:13:48,885 [root] - WriteTask-4 started with pid 718351
2022-07-14 19:13:48,885 [root] - WriteTask-5 started with pid 718352
2022-07-14 19:13:48,886 [root] - WriteTask-6 started with pid 718353
2022-07-14 19:13:48,886 [root] - WriteTask-7 started with pid 718354
2022-07-14 19:13:48,887 [root] - ReadTask-0 started with pid 718355
2022-07-14 19:13:48,888 [root] - ReadTask-1 started with pid 718356
2022-07-14 19:13:48,889 [root] - ReadTask-2 started with pid 718357
2022-07-14 19:13:48,889 [root] - ReadTask-3 started with pid 718358
2022-07-14 19:13:48,890 [root] - ReadTask-4 started with pid 718359
2022-07-14 19:13:48,891 [root] - ReadTask-5 started with pid 718361
2022-07-14 19:13:48,892 [root] - ReadTask-6 started with pid 718364
2022-07-14 19:13:48,893 [root] - ReadTask-7 started with pid 718365
2022-07-14 19:13:56,042 [DataBaseMonitor] - count=6676310 speed=667631.0
2022-07-14 19:14:06,196 [DataBaseMonitor] - count=20004310 speed=1332800.0
2022-07-14 19:14:16,366 [DataBaseMonitor] - count=32290310 speed=1228600.0
2022-07-14 19:14:26,527 [DataBaseMonitor] - count=44438310 speed=1214800.0
2022-07-14 19:14:36,673 [DataBaseMonitor] - count=56608310 speed=1217000.0
2022-07-14 19:14:46,834 [DataBaseMonitor] - count=68757310 speed=1214900.0
2022-07-14 19:14:57,280 [DataBaseMonitor] - count=80992310 speed=1223500.0
2022-07-14 19:15:07,689 [DataBaseMonitor] - count=93805310 speed=1281300.0
2022-07-14 19:15:18,020 [DataBaseMonitor] - count=106111310 speed=1230600.0
2022-07-14 19:15:28,356 [DataBaseMonitor] - count=118394310 speed=1228300.0
2022-07-14 19:15:38,690 [DataBaseMonitor] - count=130742310 speed=1234800.0
2022-07-14 19:15:49,000 [DataBaseMonitor] - count=143051310 speed=1230900.0
2022-07-14 19:15:59,323 [DataBaseMonitor] - count=155276310 speed=1222500.0
2022-07-14 19:16:09,649 [DataBaseMonitor] - count=167603310 speed=1232700.0
2022-07-14 19:16:19,995 [DataBaseMonitor] - count=179976310 speed=1237300.0

note

Don’t establish connection to TDengine in the parent process if using Python connector in multi-process way, otherwise all the connections in child processes are blocked always. This is a known issue.