IoT Fleet Management - Confluent Kafka, KSQL, Apache Spark

Overview

This is an end-to-end functional application with source code and installation instructions available on GitHub. It is a blueprint for an IoT application built on top of YugabyteDB (using the Cassandra-compatible YCQL API) as the database, Confluent Kafka as the message broker, KSQL or Apache Spark Streaming for real-time analytics and Spring Boot as the application framework.

Scenario

Assume that a fleet management company wants to track its fleet of vehicles that are delivering shipments. The vehicles are of different types (18 Wheelers, buses, large trucks, etc.), and the shipments happen over 3 routes (Route-37, Route-82, Route-43). The company wants to track:

  • the breakdown of their vehicle types per shipment delivery route
  • which vehicles are near road closures so that they can predict delays in deliveries

This app renders a dashboard showing both of the above. Below is a view of the real-time, auto-refreshing dashboard.

Figure: YB IoT Fleet Management Dashboard

Application architecture

This application has the following subcomponents:

  • Data Store - YugabyteDB for storing raw events from Kafka as well as the aggregates from the Data Processor
  • Data Producer - Test program writing into Kafka
  • Data Processor - KSQL or Apache Spark Streaming reading from Kafka, computing the aggregates and storing the results in the Data Store
  • Data Dashboard - Spring Boot app using web sockets, jQuery and bootstrap

We will look at each of these components in detail. Below is an architecture diagram showing how these components fit together.

Confluent Kafka, KSQL, and YugabyteDB (CKY Stack)

The app architecture with the CKY stack is shown below. The same Kafka Connect Sink Connector for YugabyteDB stores both the raw events and the aggregate data generated by KSQL.

Figure: YB IoT Fleet Management Architecture with KSQL

Spark, Kafka, and YugabyteDB (SKY Stack)

The app architecture with the SKY stack is shown below. The Kafka Connect Sink Connector for YugabyteDB stores the raw events from Kafka in YugabyteDB. The aggregate data generated by Apache Spark Streaming is persisted in YugabyteDB using the Spark-Cassandra Connector.

Figure: YB IoT Fleet Management Architecture with Apache Spark

Data store

The data store holds all the user-facing data. YugabyteDB is used here, with the Cassandra-compatible YCQL as the query language.

All the data is stored in the keyspace TrafficKeySpace:

    CREATE KEYSPACE IF NOT EXISTS TrafficKeySpace;

There is one table for storing the raw events. Note the default_time_to_live value below, which makes raw events expire automatically after the specified period (3600 seconds here). This ensures that the raw events do not consume all the storage in the database and are efficiently deleted a short while after their aggregates have been computed.

    CREATE TABLE TrafficKeySpace.Origin_Table (
        vehicleId text,
        routeId text,
        vehicleType text,
        longitude text,
        latitude text,
        timeStamp timestamp,
        speed double,
        fuelLevel double,
        PRIMARY KEY ((vehicleId), timeStamp))
    WITH default_time_to_live = 3600;

There are three tables that hold the user-facing data - Total_Traffic for the lifetime traffic information, Window_Traffic for the last 30 seconds of traffic and poi_traffic for the traffic near a point of interest (road closures). The data processor constantly updates these tables, and the dashboard reads from these tables. Below are the schemas for these tables.

    CREATE TABLE TrafficKeySpace.Total_Traffic (
        routeId text,
        vehicleType text,
        totalCount bigint,
        timeStamp timestamp,
        recordDate text,
        PRIMARY KEY (routeId, recordDate, vehicleType)
    );

    CREATE TABLE TrafficKeySpace.Window_Traffic (
        routeId text,
        vehicleType text,
        totalCount bigint,
        timeStamp timestamp,
        recordDate text,
        PRIMARY KEY (routeId, recordDate, vehicleType)
    );

    CREATE TABLE TrafficKeySpace.poi_traffic (
        vehicleid text,
        vehicletype text,
        distance bigint,
        timeStamp timestamp,
        PRIMARY KEY (vehicleid)
    );

Data producer

A program that generates random test data and publishes it to the Kafka topic iot-data-event. This emulates the data received from the connected vehicles using a message broker in the real world.

A single data point is a JSON payload and looks as follows:

    {
      "vehicleId":"0bf45cac-d1b8-4364-a906-980e1c2bdbcb",
      "vehicleType":"Taxi",
      "routeId":"Route-37",
      "longitude":"-95.255615",
      "latitude":"33.49808",
      "timestamp":"2017-10-16 12:31:03",
      "speed":49.0,
      "fuelLevel":38.0
    }
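As an illustration of how such test data could be generated, here is a minimal Java sketch that builds a payload with the same fields. It is not the application's producer: the class name IoTEventGenerator, the list of vehicle types, and the value ranges are assumptions made for this example, and publishing the string to Kafka is left out.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.List;
import java.util.Locale;
import java.util.Random;
import java.util.UUID;

/** Hypothetical generator for test events shaped like the payload above. */
final class IoTEventGenerator {
    // Route names come from the scenario; the vehicle types are illustrative.
    static final List<String> ROUTES = List.of("Route-37", "Route-82", "Route-43");
    static final List<String> VEHICLE_TYPES = List.of("18 Wheeler", "Bus", "Large Truck", "Taxi");
    private static final DateTimeFormatter FORMAT =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss");

    private final Random random;

    IoTEventGenerator(Random random) {
        this.random = random;
    }

    /** Builds one JSON event; the coordinate, speed, and fuel ranges are arbitrary. */
    String nextEvent(LocalDateTime eventTime) {
        String vehicleId = new UUID(random.nextLong(), random.nextLong()).toString();
        String vehicleType = VEHICLE_TYPES.get(random.nextInt(VEHICLE_TYPES.size()));
        String routeId = ROUTES.get(random.nextInt(ROUTES.size()));
        double longitude = -96.0 + random.nextDouble();
        double latitude = 33.0 + random.nextDouble();
        double speed = 20 + random.nextInt(80);
        double fuelLevel = 10 + random.nextInt(30);
        // Coordinates are sent as strings and speed/fuelLevel as numbers, as in the sample.
        return String.format(Locale.ROOT,
                "{\"vehicleId\":\"%s\",\"vehicleType\":\"%s\",\"routeId\":\"%s\","
                        + "\"longitude\":\"%.6f\",\"latitude\":\"%.6f\",\"timestamp\":\"%s\","
                        + "\"speed\":%.1f,\"fuelLevel\":%.1f}",
                vehicleId, vehicleType, routeId, longitude, latitude,
                FORMAT.format(eventTime), speed, fuelLevel);
    }

    public static void main(String[] args) {
        IoTEventGenerator generator = new IoTEventGenerator(new Random());
        System.out.println(generator.nextEvent(LocalDateTime.now()));
    }
}
```

Passing in the Random instance keeps the sketch deterministic when a fixed seed is used, which is convenient for repeatable test runs.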

The Kafka Connect Sink Connector for YugabyteDB reads the above iot-data-event topic, transforms the event into a YCQL INSERT statement and then calls YugabyteDB to persist the event in the TrafficKeySpace.Origin_Table table.
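The transformation step can be pictured with the following Java sketch. It is not the connector's actual code: it assumes the event has already been parsed into an ordered map whose keys match the table's column names, and the names InsertStatementSketch and BoundInsert are hypothetical. It produces a parameterized INSERT plus the values to bind, which a driver would then execute.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative mapping from a parsed event to a parameterized INSERT. */
final class InsertStatementSketch {

    /** A statement with bind markers plus the values to bind, in column order. */
    static final class BoundInsert {
        final String cql;
        final List<Object> values;

        BoundInsert(String cql, List<Object> values) {
            this.cql = cql;
            this.values = values;
        }
    }

    /** Builds "INSERT INTO table (c1, c2, ...) VALUES (?, ?, ...);" from the event fields. */
    static BoundInsert toInsert(String table, Map<String, Object> event) {
        if (event.isEmpty()) {
            throw new IllegalArgumentException("event has no fields");
        }
        String columns = String.join(", ", event.keySet());
        String markers = String.join(", ", Collections.nCopies(event.size(), "?"));
        String cql = "INSERT INTO " + table + " (" + columns + ") VALUES (" + markers + ");";
        return new BoundInsert(cql, new ArrayList<>(event.values()));
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps insertion order so columns and values stay aligned.
        Map<String, Object> event = new LinkedHashMap<>();
        event.put("vehicleId", "0bf45cac-d1b8-4364-a906-980e1c2bdbcb");
        event.put("routeId", "Route-37");
        event.put("speed", 49.0);
        BoundInsert insert = toInsert("TrafficKeySpace.Origin_Table", event);
        System.out.println(insert.cql + " " + insert.values);
    }
}
```

Bind markers are used instead of concatenating values into the statement so that event contents are never interpreted as query text.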

Data processor

KSQL

KSQL is the open source streaming SQL engine for Apache Kafka. It provides an easy-to-use yet powerful interactive SQL interface for stream processing on Kafka, without the need to write code in a programming language such as Java or Python. It supports a wide range of streaming operations, including data filtering, transformations, aggregations, joins, windowing, and sessionization.

The first step in using KSQL is to create a stream from the raw events as shown below.

    CREATE STREAM traffic_stream (
        vehicleId varchar,
        vehicleType varchar,
        routeId varchar,
        timeStamp varchar,
        latitude varchar,
        longitude varchar)
    WITH (
        KAFKA_TOPIC='iot-data-event',
        VALUE_FORMAT='json',
        TIMESTAMP='timeStamp',
        TIMESTAMP_FORMAT='yyyy-MM-dd HH:mm:ss');

Various aggregations and queries can now be run on the above stream, with the results of each query stored in a topic of its own. This application uses 3 such queries and topics. The Kafka Connect Sink Connector for YugabyteDB then reads these 3 topics and persists the results into the 3 corresponding tables in YugabyteDB.

The total_traffic query counts events per route and vehicle type over the lifetime of the stream:

    CREATE TABLE total_traffic
    WITH ( PARTITIONS=1,
        KAFKA_TOPIC='total_traffic',
        TIMESTAMP='timeStamp',
        TIMESTAMP_FORMAT='yyyy-MM-dd HH:mm:ss') AS
    SELECT routeId,
        vehicleType,
        count(vehicleId) AS totalCount,
        max(rowtime) AS timeStamp,
        TIMESTAMPTOSTRING(max(rowtime), 'yyyy-MM-dd') AS recordDate
    FROM traffic_stream
    GROUP BY routeId, vehicleType;

The window_traffic query computes the same counts over a hopping window of 30 seconds that advances every 10 seconds:

    CREATE TABLE window_traffic
    WITH ( TIMESTAMP='timeStamp',
        KAFKA_TOPIC='window_traffic',
        TIMESTAMP_FORMAT='yyyy-MM-dd HH:mm:ss',
        PARTITIONS=1) AS
    SELECT routeId,
        vehicleType,
        count(vehicleId) AS totalCount,
        max(rowtime) AS timeStamp,
        TIMESTAMPTOSTRING(max(rowtime), 'yyyy-MM-dd') AS recordDate
    FROM traffic_stream
    WINDOW HOPPING (SIZE 30 SECONDS, ADVANCE BY 10 SECONDS)
    GROUP BY routeId, vehicleType;
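
Because a hopping window advances by less than its size, each event falls into several overlapping windows. The Java sketch below shows which windows a single timestamp belongs to for a 30-second size and 10-second advance. It assumes windows are aligned to the epoch and are half-open, [start, start + size); the class and method names are hypothetical and this is not KSQL's own code.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Illustrative assignment of a timestamp to overlapping hopping windows. */
final class HoppingWindows {

    /**
     * Returns, in ascending order, the start times (ms) of every window
     * [start, start + sizeMs) that contains timestampMs.
     */
    static List<Long> windowStartsFor(long timestampMs, long sizeMs, long advanceMs) {
        List<Long> starts = new ArrayList<>();
        // Latest window start at or before the timestamp, aligned to the advance interval.
        long latest = Math.floorDiv(timestampMs, advanceMs) * advanceMs;
        // Walk backwards while the window still covers the timestamp.
        for (long start = latest; start > timestampMs - sizeMs && start >= 0; start -= advanceMs) {
            starts.add(start);
        }
        Collections.reverse(starts);
        return starts;
    }

    public static void main(String[] args) {
        // An event 25 seconds after the epoch, with SIZE 30 SECONDS and ADVANCE BY 10 SECONDS.
        System.out.println(windowStartsFor(25_000L, 30_000L, 10_000L));
    }
}
```

With these settings an event generally lands in three windows (size divided by advance), so it contributes to three successive rows of windowed counts.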

The poi_traffic query keeps only the events that are within 30 km of the point of interest and records their distance from it:

    CREATE STREAM poi_traffic
    WITH ( PARTITIONS=1,
        KAFKA_TOPIC='poi_traffic',
        TIMESTAMP='timeStamp',
        TIMESTAMP_FORMAT='yyyy-MM-dd HH:mm:ss') AS
    SELECT vehicleId,
        vehicleType,
        cast(GEO_DISTANCE(cast(latitude AS double),cast(longitude AS double),33.877495,-95.50238,'KM') AS bigint) AS distance,
        timeStamp
    FROM traffic_stream
    WHERE GEO_DISTANCE(cast(latitude AS double),cast(longitude AS double),33.877495,-95.50238,'KM') < 30;
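
To make the distance filter concrete, here is a minimal Java sketch of a great-circle distance check against the same point of interest. It uses the haversine formula with a mean Earth radius of 6371 km, which is an assumption of this example and may differ in detail from how GEO_DISTANCE is implemented; the class and method names are hypothetical.

```java
/** Illustrative great-circle distance filter around a point of interest. */
final class PoiFilter {
    static final double EARTH_RADIUS_KM = 6371.0; // mean Earth radius (assumption)
    static final double POI_LATITUDE = 33.877495;  // POI coordinates from the query above
    static final double POI_LONGITUDE = -95.50238;
    static final double RADIUS_KM = 30.0;

    /** Haversine distance in kilometers between two points given in decimal degrees. */
    static double distanceKm(double lat1, double lon1, double lat2, double lon2) {
        double dLat = Math.toRadians(lat2 - lat1);
        double dLon = Math.toRadians(lon2 - lon1);
        double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
                + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                * Math.sin(dLon / 2) * Math.sin(dLon / 2);
        return 2 * EARTH_RADIUS_KM * Math.asin(Math.sqrt(a));
    }

    /** Parses the string coordinates carried by an event and applies the radius check. */
    static boolean isNearPoi(String latitude, String longitude) {
        double distance = distanceKm(Double.parseDouble(latitude),
                Double.parseDouble(longitude), POI_LATITUDE, POI_LONGITUDE);
        return distance < RADIUS_KM;
    }

    public static void main(String[] args) {
        // Coordinates from the sample payload in the Data producer section.
        System.out.println(isNearPoi("33.49808", "-95.255615"));
    }
}
```

The coordinates arrive as strings in the event, which is why both the query and this sketch convert them to doubles before computing the distance.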

Apache Spark streaming

This is an Apache Spark streaming application that consumes the data stream from the Kafka topic, converts it into meaningful insights, and writes the resulting aggregate data back to YugabyteDB.

Spark communicates with YugabyteDB using the Spark-Cassandra connector. This is done as follows:

    // prop holds the application's loaded properties.
    SparkConf conf =
        new SparkConf().setAppName(prop.getProperty("com.iot.app.spark.app.name"))
            .set("spark.cassandra.connection.host", prop.getProperty("com.iot.app.cassandra.host"));

The data is consumed from a Kafka stream and collected in 5-second batches. This is achieved as follows:

    JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(5));
    JavaPairInputDStream<String, IoTData> directKafkaStream =
        KafkaUtils.createDirectStream(jssc,
            String.class,
            IoTData.class,
            StringDecoder.class,
            IoTDataDecoder.class,
            kafkaParams,
            topicsSet
        );

The application computes the following:

  • A breakdown by vehicle type and shipment route across all the vehicles and shipments so far.
  • The same breakdown for active shipments, computed by vehicle type and shipment route over the last 30 seconds.
  • The vehicles that are within a 20-mile radius of a given Point of Interest (POI), which represents a road closure.
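
The first two computations in the list above can be sketched in plain Java as a group-and-count, once over all events and once over the last 30 seconds only. This is an in-memory illustration, not the Spark job: the Event class, the "routeId/vehicleType" key format, and the method names are assumptions, and the sketch counts events rather than distinct vehicles.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

/** Illustrative in-memory version of the traffic breakdowns. */
final class TrafficBreakdown {

    /** Minimal stand-in for one vehicle event. */
    static final class Event {
        final String routeId;
        final String vehicleType;
        final long timestampMs;

        Event(String routeId, String vehicleType, long timestampMs) {
            this.routeId = routeId;
            this.vehicleType = vehicleType;
            this.timestampMs = timestampMs;
        }
    }

    /** Counts events per "routeId/vehicleType" key across all given events. */
    static Map<String, Long> countByRouteAndType(List<Event> events) {
        return events.stream().collect(Collectors.groupingBy(
                e -> e.routeId + "/" + e.vehicleType, TreeMap::new, Collectors.counting()));
    }

    /** Same breakdown, restricted to events in the interval (nowMs - windowMs, nowMs]. */
    static Map<String, Long> countInLastWindow(List<Event> events, long nowMs, long windowMs) {
        List<Event> recent = events.stream()
                .filter(e -> e.timestampMs > nowMs - windowMs && e.timestampMs <= nowMs)
                .collect(Collectors.toList());
        return countByRouteAndType(recent);
    }

    public static void main(String[] args) {
        List<Event> events = List.of(
                new Event("Route-37", "Taxi", 1_000L),
                new Event("Route-37", "Taxi", 40_000L),
                new Event("Route-82", "Bus", 45_000L));
        System.out.println(countByRouteAndType(events));
        System.out.println(countInLastWindow(events, 50_000L, 30_000L));
    }
}
```

In a streaming job the same grouping runs incrementally on each batch instead of over a full in-memory list.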

Data dashboard

This is a Spring Boot application that queries the data from YugabyteDB and pushes it to the web page using Web Sockets and jQuery. The data is pushed at fixed intervals, so the dashboard refreshes automatically. The web page uses bootstrap.js to display the data in charts and tables.
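
The fixed-interval push can be sketched with the Java standard library as follows. This is only an illustration of the pattern, not the dashboard's code: the Supplier stands in for the queries against the aggregate tables, the Consumer stands in for sending a message to web socket subscribers, and the class name DashboardPusher and the 5-second interval are assumptions. The real application would more likely rely on Spring's own scheduling and messaging support.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;
import java.util.function.Supplier;

/** Illustrative fixed-interval "query, then push" loop. */
final class DashboardPusher<T> {
    private final Supplier<T> query;    // stand-in for reading the aggregate tables
    private final Consumer<T> publish;  // stand-in for sending to web socket subscribers

    DashboardPusher(Supplier<T> query, Consumer<T> publish) {
        this.query = query;
        this.publish = publish;
    }

    /** Runs one refresh cycle: read the latest data and push it out. */
    void pushOnce() {
        publish.accept(query.get());
    }

    /** Schedules pushOnce at a fixed rate; the caller shuts the executor down when done. */
    ScheduledExecutorService start(long intervalSeconds) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        // Note: if pushOnce throws, scheduleAtFixedRate cancels all later runs.
        scheduler.scheduleAtFixedRate(this::pushOnce, 0, intervalSeconds, TimeUnit.SECONDS);
        return scheduler;
    }

    public static void main(String[] args) throws InterruptedException {
        DashboardPusher<String> pusher =
                new DashboardPusher<>(() -> "traffic snapshot", System.out::println);
        ScheduledExecutorService scheduler = pusher.start(5);
        Thread.sleep(11_000);
        scheduler.shutdown();
    }
}
```

Keeping the query and publish steps behind small interfaces makes a single refresh cycle easy to exercise without a database or a browser.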

We create entity classes for the three tables Total_Traffic, Window_Traffic and Poi_Traffic, and DAO interfaces for all the entities extending CassandraRepository. For example, we create the DAO interface for the TotalTrafficData entity as follows.

    @Repository
    public interface TotalTrafficDataRepository extends CassandraRepository<TotalTrafficData> {
        @Query("SELECT * FROM traffickeyspace.total_traffic WHERE recorddate = ? ALLOW FILTERING")
        Iterable<TotalTrafficData> findTrafficDataByDate(String date);
    }

To connect to the YugabyteDB cluster and obtain a connection for database operations, we write the CassandraConfig class. This is done as follows:

    public class CassandraConfig extends AbstractCassandraConfiguration {
        @Bean
        public CassandraClusterFactoryBean cluster() {
            // Create a Cassandra cluster to access YugabyteDB using CQL.
            CassandraClusterFactoryBean cluster = new CassandraClusterFactoryBean();
            // Set the database host.
            cluster.setContactPoints(environment.getProperty("com.iot.app.cassandra.host"));
            // Set the database port.
            cluster.setPort(Integer.parseInt(environment.getProperty("com.iot.app.cassandra.port")));
            return cluster;
        }
    }

Note that currently the Dashboard does not use the raw events table and relies only on the data stored in the aggregates tables.

Summary

This application is a blueprint for building IoT applications. The instructions to build and run the application, as well as the source code, can be found in the IoT Fleet Management GitHub repository.