7 - Ecosystem Integration - Spark TsFile - IoTDB User Manual (V0.9.x)

TsFile-Spark-Connector User Guide

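1. About TsFile-Spark-Connector

TsFile-Spark-Connector adds Spark support for TsFile as an external data source. It lets users read, write, and query TsFiles through Spark.

With this connector, you can:

load a single TsFile into Spark from the local file system or HDFS
load all files in a specific directory into Spark from the local file system or HDFS
write data from Spark into a TsFile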

2. System Requirements

Spark Version	Scala Version	Java Version	TsFile
`2.4.3`	`2.11.8`	`1.8`	`0.9.3`

Note: For more information about how to download and use TsFile, see https://github.com/apache/incubator-iotdb/tree/master/tsfile.

3. Quick Start

Local Mode

Start Spark with TsFile-Spark-Connector in local mode:

./<spark-shell-path>  --jars  tsfile-spark-connector.jar,tsfile-0.9.3-jar-with-dependencies.jar

Note:

<spark-shell-path> is the real path of your spark-shell.
Separate multiple jar packages with commas, without any spaces.
For how to obtain TsFile, see https://github.com/apache/incubator-iotdb/tree/master/tsfile.

Distributed Mode

Start Spark with TsFile-Spark-Connector in distributed mode (that is, spark-shell connects to a Spark cluster):

./<spark-shell-path>  --jars  tsfile-spark-connector.jar,tsfile-0.9.3-jar-with-dependencies.jar  --master spark://ip:7077

Note:

<spark-shell-path> is the real path of your spark-shell.
Separate multiple jar packages with commas, without any spaces.
For how to obtain TsFile, see https://github.com/apache/incubator-iotdb/tree/master/tsfile.

4. Data Type Correspondence

TsFile data type	SparkSQL data type
BOOLEAN	BooleanType
INT32	IntegerType
INT64	LongType
FLOAT	FloatType
DOUBLE	DoubleType
TEXT	StringType
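
The correspondence above can be written down as a small lookup, for example to validate a type name before building a schema by hand. This is only an illustrative sketch: the object name, the function name, and the use of plain strings instead of Spark's type objects are assumptions made here, not part of the connector's API.

```scala
// Illustrative lookup only. TsFileTypeMapping and sparkTypeOf are hypothetical
// names; the pairs themselves are copied from the table above.
object TsFileTypeMapping {
  // TsFile data type name -> SparkSQL data type name.
  val toSparkSqlType: Map[String, String] = Map(
    "BOOLEAN" -> "BooleanType",
    "INT32"   -> "IntegerType",
    "INT64"   -> "LongType",
    "FLOAT"   -> "FloatType",
    "DOUBLE"  -> "DoubleType",
    "TEXT"    -> "StringType"
  )

  // Case-insensitive lookup; returns None for a type the table does not list.
  def sparkTypeOf(tsFileType: String): Option[String] =
    toSparkSqlType.get(tsFileType.toUpperCase)
}
```

Returning an Option keeps an unlisted type name from silently falling back to some default type.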

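5. Schema Inference

How a TsFile is displayed depends on its schema. Take the following TsFile structure as an example: the TsFile schema contains three measurements: status, temperature, and hardware. Their basic information is as follows:

Name	Type	Encoding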
status	Boolean	PLAIN
temperature	Float	RLE
hardware	Text	PLAIN

The existing data in the TsFile is as follows:

device:root.ln.wf01.wt01				device:root.ln.wf02.wt02
status		temperature		hardware		status
time	value	time	value	time	value	time	value
1	True	1	2.2	2	“aaa”	1	True
3	True	2	2.2	4	“bbb”	2	False
5	False	3	2.1	6	“ccc”	4	True

The corresponding SparkSQL table is as follows:

time	root.ln.wf02.wt02.temperature	root.ln.wf02.wt02.status	root.ln.wf02.wt02.hardware	root.ln.wf01.wt01.temperature	root.ln.wf01.wt01.status	root.ln.wf01.wt01.hardware
1	null	true	null	2.2	true	null
2	null	false	aaa	2.2	null	null
3	null	null	null	2.1	true	null
4	null	true	bbb	null	null	null
5	null	null	null	null	false	null
6	null	null	ccc	null	null	null
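
The wide form can be pictured as a pivot: one row per distinct timestamp and one column per device and measurement combination, with null wherever a series has no point at that time. The following is a minimal plain-Scala sketch of that idea, not the connector's implementation. The names WideFormSketch, Points, columns, and toWide are hypothetical, None stands in for null, and the alphabetical column order is an assumption that differs from the order shown in the table.

```scala
// Sketch only: plain Scala collections, no Spark or connector classes involved.
object WideFormSketch {
  // Raw points keyed by (device, measurement); each value maps time -> value.
  type Points = Map[(String, String), Map[Long, Any]]

  // The sample data of this section.
  val sample: Points = Map(
    ("root.ln.wf01.wt01", "status")      -> Map(1L -> true, 3L -> true, 5L -> false),
    ("root.ln.wf01.wt01", "temperature") -> Map(1L -> 2.2f, 2L -> 2.2f, 3L -> 2.1f),
    ("root.ln.wf02.wt02", "hardware")    -> Map(2L -> "aaa", 4L -> "bbb", 6L -> "ccc"),
    ("root.ln.wf02.wt02", "status")      -> Map(1L -> true, 2L -> false, 4L -> true)
  )

  private def devices(data: Points): Seq[String] =
    data.keys.map(_._1).toSeq.distinct.sorted

  private def measurements(data: Points): Seq[String] =
    data.keys.map(_._2).toSeq.distinct.sorted

  // Column names: every device combined with every measurement of the schema,
  // so a device that lacks a measurement still gets an (all-null) column.
  def columns(data: Points): Seq[String] =
    for (d <- devices(data); m <- measurements(data)) yield s"$d.$m"

  // One row per distinct timestamp; cells follow the order of columns(data).
  def toWide(data: Points): Seq[(Long, Seq[Option[Any]])] = {
    val times = data.values.flatMap(_.keys).toSeq.distinct.sorted
    times.map { t =>
      val cells = for (d <- devices(data); m <- measurements(data))
        yield data.get((d, m)).flatMap(_.get(t))
      (t, cells)
    }
  }
}
```

Calling WideFormSketch.toWide(WideFormSketch.sample) yields six rows, one for each timestamp from 1 to 6, matching the row count of the table above.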

You can also use the narrow table form shown below (see Section 6 for how to use the narrow form):

time	device_name	status	hardware	temperature
1	root.ln.wf01.wt01	true	null	2.2
1	root.ln.wf02.wt02	true	null	null
2	root.ln.wf01.wt01	null	null	2.2
2	root.ln.wf02.wt02	false	aaa	null
3	root.ln.wf01.wt01	true	null	2.1
4	root.ln.wf02.wt02	true	bbb	null
5	root.ln.wf01.wt01	false	null	null
6	root.ln.wf02.wt02	null	ccc	null
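
The narrow form can be pictured as grouping by (time, device) instead of by time alone: each row carries the device name, and the columns are just the measurements. Below is a minimal plain-Scala sketch of that grouping over the sample data of this section. It is not the connector's implementation; NarrowFormSketch, NarrowRow, and toNarrow are hypothetical names, None stands in for null, and the alphabetical measurement order is an assumption.

```scala
// Sketch only: plain Scala collections, no Spark or connector classes involved.
object NarrowFormSketch {
  // Raw points keyed by (device, measurement); each value maps time -> value.
  type Points = Map[(String, String), Map[Long, Any]]

  // One output row: time, device_name, then one optional cell per measurement.
  final case class NarrowRow(time: Long, deviceName: String, cells: Seq[Option[Any]])

  // The sample data of this section.
  val sample: Points = Map(
    ("root.ln.wf01.wt01", "status")      -> Map(1L -> true, 3L -> true, 5L -> false),
    ("root.ln.wf01.wt01", "temperature") -> Map(1L -> 2.2f, 2L -> 2.2f, 3L -> 2.1f),
    ("root.ln.wf02.wt02", "hardware")    -> Map(2L -> "aaa", 4L -> "bbb", 6L -> "ccc"),
    ("root.ln.wf02.wt02", "status")      -> Map(1L -> true, 2L -> false, 4L -> true)
  )

  def toNarrow(data: Points): Seq[NarrowRow] = {
    // Measurement columns, here in alphabetical order.
    val measurements = data.keys.map(_._2).toSeq.distinct.sorted
    // Every (time, device) pair that has at least one point.
    val keys = for {
      ((device, _), points) <- data.toSeq
      t <- points.keys
    } yield (t, device)
    // Sort by time, then by device name, and fill in the cells.
    keys.distinct.sorted.map { case (t, device) =>
      val cells = measurements.map(m => data.get((device, m)).flatMap(_.get(t)))
      NarrowRow(t, device, cells)
    }
  }
}
```

Compared with the wide form, the number of columns no longer grows with the number of devices, which is why filters such as a condition on device_name become possible.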

6. Scala API

Note: Remember to grant the necessary read and write permissions in advance.

Example 1: Read from the local file system

import org.apache.iotdb.tsfile._
val wide_df = spark.read.tsfile("test.tsfile")  
wide_df.show
val narrow_df = spark.read.tsfile("test.tsfile", true)  
narrow_df.show

Example 2: Read from the Hadoop file system

import org.apache.iotdb.tsfile._
val wide_df = spark.read.tsfile("hdfs://localhost:9000/test.tsfile") 
wide_df.show
val narrow_df = spark.read.tsfile("hdfs://localhost:9000/test.tsfile", true)  
narrow_df.show

Example 3: Read from a specific directory

import org.apache.iotdb.tsfile._
val df = spark.read.tsfile("hdfs://localhost:9000/usr/hadoop") 
df.show

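Note 1: Global time ordering of all TsFiles in a directory is not supported for now.

Note 2: Measurements with the same name should have the same schema.

Example 4: Query in wide form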

import org.apache.iotdb.tsfile._
val df = spark.read.tsfile("hdfs://localhost:9000/test.tsfile") 
df.createOrReplaceTempView("tsfile_table")
val newDf = spark.sql("select * from tsfile_table where `device_1.sensor_1`>0 and `device_1.sensor_2` < 22")
newDf.show

import org.apache.iotdb.tsfile._
val df = spark.read.tsfile("hdfs://localhost:9000/test.tsfile") 
df.createOrReplaceTempView("tsfile_table")
val newDf = spark.sql("select count(*) from tsfile_table")
newDf.show

Example 5: Query in narrow form

import org.apache.iotdb.tsfile._
val df = spark.read.tsfile("hdfs://localhost:9000/test.tsfile", true) 
df.createOrReplaceTempView("tsfile_table")
val newDf = spark.sql("select * from tsfile_table where device_name = 'root.ln.wf02.wt02' and temperature > 5")
newDf.show

import org.apache.iotdb.tsfile._
val df = spark.read.tsfile("hdfs://localhost:9000/test.tsfile", true) 
df.createOrReplaceTempView("tsfile_table")
val newDf = spark.sql("select count(*) from tsfile_table")
newDf.show

Example 6: Write in wide form

// write the wide-form DataFrame back to a TsFile
import org.apache.iotdb.tsfile._
val df = spark.read.tsfile("hdfs://localhost:9000/test.tsfile") 
df.show
df.write.tsfile("hdfs://localhost:9000/output")
val newDf = spark.read.tsfile("hdfs://localhost:9000/output")
newDf.show

Example 7: Write in narrow form

// write the narrow-form DataFrame back to a TsFile
import org.apache.iotdb.tsfile._
val df = spark.read.tsfile("hdfs://localhost:9000/test.tsfile", true) 
df.show
df.write.tsfile("hdfs://localhost:9000/output", true)
val newDf = spark.read.tsfile("hdfs://localhost:9000/output", true)
newDf.show

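Appendix A: Old Design of Schema Inference

How a TsFile is displayed depends on the TsFile schema. Take the following TsFile structure as an example: the TsFile schema contains three measurements: status, temperature, and hardware. Their basic information is as follows:

Name	Type	Encoding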
status	Boolean	PLAIN
temperature	Float	RLE
hardware	Text	PLAIN

Basic information of the measurements

The existing data in the file is as follows:

delta_object:root.ln.wf01.wt01				delta_object:root.ln.wf02.wt02				delta_object:root.sgcc.wf03.wt01
status		temperature		hardware		status		status		temperature
time	value	time	value	time	value	time	value	time	value	time	value
1	True	1	2.2	2	“aaa”	1	True	2	True	3	3.3
3	True	2	2.2	4	“bbb”	2	False	3	True	6	6.6
5	False	3	2.1	6	“ccc”	4	True	4	True	8	8.8
7	True	4	2.0	8	“ddd”	5	False	6	True	9	9.9

A set of time-series data

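There are two ways to display it:

Default way

Two columns are created to store the timestamp and the full path of the device: time (LongType) and delta_object (StringType).

time : Timestamp, LongType
delta_object : Delta_object ID, StringType

Next, a column is created for each measurement to store its data. The SparkSQL table structure is as follows: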

time(LongType)	delta_object(StringType)	status(BooleanType)	temperature(FloatType)	hardware(StringType)
1	root.ln.wf01.wt01	True	2.2	null
1	root.ln.wf02.wt02	True	null	null
2	root.ln.wf01.wt01	null	2.2	null
2	root.ln.wf02.wt02	False	null	“aaa”
2	root.sgcc.wf03.wt01	True	null	null
3	root.ln.wf01.wt01	True	2.1	null
3	root.sgcc.wf03.wt01	True	3.3	null
4	root.ln.wf01.wt01	null	2.0	null
4	root.ln.wf02.wt02	True	null	“bbb”
4	root.sgcc.wf03.wt01	True	null	null
5	root.ln.wf01.wt01	False	null	null
5	root.ln.wf02.wt02	False	null	null
5	root.sgcc.wf03.wt01	True	null	null
6	root.ln.wf02.wt02	null	null	“ccc”
6	root.sgcc.wf03.wt01	null	6.6	null
7	root.ln.wf01.wt01	True	null	null
8	root.ln.wf02.wt02	null	null	“ddd”
8	root.sgcc.wf03.wt01	null	8.8	null
9	root.sgcc.wf03.wt01	null	9.9	null

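Unfold the delta_object column

Expand the device column into multiple columns by splitting on ".", ignoring the root layer "root". This makes richer aggregation operations convenient. To use this display form, set the parameter "delta_object_name" in the table creation statement (see Example 5 in Section 5.1 of this manual). In this example, the parameter "delta_object_name" is set to "root.device.turbine". The number of path layers must match one to one. A column is then created for each layer of the device path except the "root" layer; the column name is the name given in the parameter, and the value is the name of the corresponding layer of the device. Next, a column is created for each measurement to store its data.

Then the SparkSQL table structure is as follows: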

time(LongType)	group(StringType)	field(StringType)	device(StringType)	status(BooleanType)	temperature(FloatType)	hardware(StringType)
1	ln	wf01	wt01	True	2.2	null
1	ln	wf02	wt02	True	null	null
2	ln	wf01	wt01	null	2.2	null
2	ln	wf02	wt02	False	null	“aaa”
2	sgcc	wf03	wt01	True	null	null
3	ln	wf01	wt01	True	2.1	null
3	sgcc	wf03	wt01	True	3.3	null
4	ln	wf01	wt01	null	2.0	null
4	ln	wf02	wt02	True	null	“bbb”
4	sgcc	wf03	wt01	True	null	null
5	ln	wf01	wt01	False	null	null
5	ln	wf02	wt02	False	null	null
5	sgcc	wf03	wt01	True	null	null
6	ln	wf02	wt02	null	null	“ccc”
6	sgcc	wf03	wt01	null	6.6	null
7	ln	wf01	wt01	True	null	null
8	ln	wf02	wt02	null	null	“ddd”
8	sgcc	wf03	wt01	null	8.8	null
9	sgcc	wf03	wt01	null	9.9	null
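
The unfolding rule can be sketched in plain Scala: split both the parameter value and the device path on ".", drop the leading "root" layer, and pair the remaining layers one to one. This is only an illustration of the rule, not the connector's code. UnfoldDeltaObjectSketch and unfold are hypothetical names, and the parameter value "root.group.field.device" used in the usage note is an assumption chosen to match the column names of the table above.

```scala
// Sketch only: illustrates the layer-by-layer pairing, not the connector's code.
object UnfoldDeltaObjectSketch {
  // Pairs each layer of devicePath (after "root") with the column name found at
  // the same position in deltaObjectName. Both arguments are assumed to start
  // with "root". Returns Left when the layer counts do not match one to one.
  def unfold(deltaObjectName: String,
             devicePath: String): Either[String, Seq[(String, String)]] = {
    val names  = deltaObjectName.split('.').drop(1) // column names, "root" dropped
    val layers = devicePath.split('.').drop(1)      // device layers, "root" dropped
    if (names.length != layers.length)
      Left(s"layer count mismatch: $deltaObjectName vs $devicePath")
    else
      Right(names.zip(layers).toList)
  }
}
```

For instance, unfold("root.group.field.device", "root.ln.wf01.wt01") pairs group with ln, field with wf01, and device with wt01, while a parameter with fewer layers than the device path produces a Left instead of a partial row.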

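TsFile-Spark-Connector can display one or more TsFiles as a table in SparkSQL. It also allows users to specify a single directory or use wildcards to match multiple directories. If there are multiple TsFiles, the union of the measurements in all TsFiles is kept in the table, and measurements with the same name are assumed to have the same data type by default. Note that if measurements share a name but differ in data type, TsFile-Spark-Connector cannot guarantee the correctness of the result.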
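
Because a type conflict between same-named measurements is not detected for you, it can be worth checking the schemas before loading a directory. The sketch below shows one way such a pre-check might look in plain Scala. It is an assumption-laden illustration: SchemaUnionSketch, Schema, and union are hypothetical names, each file's schema is assumed to be already available as a map from measurement name to TsFile type name, and nothing here calls the connector.

```scala
// Sketch only: a hypothetical pre-check, not something the connector provides.
object SchemaUnionSketch {
  // Schema of one file: measurement name -> TsFile data type name.
  type Schema = Map[String, String]

  // Union of all measurements across the given schemas.
  // Left holds the names whose data types disagree between files.
  def union(schemas: Seq[Schema]): Either[Set[String], Schema] = {
    // Collect every type seen for each measurement name.
    val typesByName: Map[String, Set[String]] =
      schemas.flatten.groupBy(_._1).map { case (name, pairs) =>
        name -> pairs.map(_._2).toSet
      }
    val conflicts = typesByName.collect {
      case (name, types) if types.size > 1 => name
    }.toSet
    if (conflicts.nonEmpty) Left(conflicts)
    else Right(typesByName.map { case (name, types) => name -> types.head })
  }
}
```

A Left result names the measurements to fix or exclude before reading the files together; a Right result is the merged schema that the combined table would be expected to have.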

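Writing stores a DataFrame as one or more TsFiles. By default, two columns must be included: time and delta_object. The remaining columns are used as measurements. To write the second table structure back to a TsFile, set the "delta_object_name" parameter (see Section 5.1 of this manual).

Appendix B: Old Note

Note: Check the jar packages in the root directory of your Spark and replace libthrift-0.9.3.jar and libfb303-0.9.3.jar with libthrift-0.9.1.jar and libfb303-0.9.1.jar, respectively.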