TiSpark Quick Start

So that you can quickly try out TiSpark, a TiDB cluster installed via TiUP integrates Spark and the TiSpark jar package by default.

Deployment information

  • Spark is deployed by default in the spark directory under the TiDB instance deployment directory.
  • The TiSpark jar package is deployed by default in the jars folder of the Spark deployment directory: spark/jars/tispark-${name_with_version}.jar (a quick probe to verify this is sketched after this list).
  • The TiSpark sample data and import scripts can be downloaded via the TiSpark sample data link; after extraction, they are located in:

    tispark-sample-data/
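If you want to confirm that the bundled spark-shell can actually see the TiSpark jar, one simple probe is to resolve TiSpark's extensions class from inside the shell. This is only a sketch: it assumes a TiSpark 2.x jar, which ships org.apache.spark.sql.TiExtensions as its Spark SQL extensions entry point.

    // Run inside spark-shell. If the TiSpark jar under spark/jars/ is on the
    // classpath, this resolves the class; otherwise it throws
    // ClassNotFoundException. Assumes TiSpark 2.x, which provides
    // org.apache.spark.sql.TiExtensions.
    Class.forName("org.apache.spark.sql.TiExtensions")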

Environment preparation

Install the JDK on the TiDB instance

From the Oracle JDK official download page, download the current latest release of JDK 1.8. The version downloaded in this example is jdk-8u144-linux-x64.tar.gz.

Extract the package and set the environment variables according to your JDK deployment directory by editing the ~/.bashrc file, for example:

    export JAVA_HOME=/home/pingcap/jdk1.8.0_144
    export PATH=$JAVA_HOME/bin:$PATH

Verify that the JDK works:

    java -version

    java version "1.8.0_144"
    Java(TM) SE Runtime Environment (build 1.8.0_144-b01)
    Java HotSpot(TM) 64-Bit Server VM (build 25.144-b01, mixed mode)
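Once spark-shell is running (see the usage section below), you can also double-check which JVM Spark actually picked up. This is plain Scala rather than anything TiSpark-specific, and the expected values are simply the ones from this example setup.

    // Run inside spark-shell: confirm the JVM that Spark is running on.
    System.getProperty("java.version")        // expect "1.8.0_144" in this example
    sys.env.getOrElse("JAVA_HOME", "<unset>") // expect /home/pingcap/jdk1.8.0_144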

Import the sample data

Assume the TiDB cluster is started, and that one TiDB instance serves at IP 192.168.0.2 on port 4000, with user name root and an empty password.

    wget http://download.pingcap.org/tispark-sample-data.tar.gz && \
    tar -zxvf tispark-sample-data.tar.gz && \
    cd tispark-sample-data

Modify the TiDB login information in sample_data.sh, for example:

    mysql --local-infile=1 -h 192.168.0.2 -P 4000 -u root < dss.ddl

Run the script:

    ./sample_data.sh

Note:

The machine that runs the script must have the MySQL client installed; CentOS users can install it with yum -y install mysql.

Log in to TiDB and verify that the data contains the TPCH_001 database and the following tables:

    mysql -uroot -P4000 -h192.168.0.2

    show databases;

    +--------------------+
    | Database           |
    +--------------------+
    | INFORMATION_SCHEMA |
    | PERFORMANCE_SCHEMA |
    | TPCH_001           |
    | mysql              |
    | test               |
    +--------------------+
    5 rows in set (0.00 sec)

    use TPCH_001;

    Reading table information for completion of table and column names
    You can turn off this feature to get a quicker startup with -A
    Database changed

    show tables;

    +--------------------+
    | Tables_in_TPCH_001 |
    +--------------------+
    | CUSTOMER           |
    | LINEITEM           |
    | NATION             |
    | ORDERS             |
    | PART               |
    | PARTSUPP           |
    | REGION             |
    | SUPPLIER           |
    +--------------------+
    8 rows in set (0.00 sec)
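Once spark-shell is started (see the next section), the same catalog should also be visible from the Spark side. A minimal cross-check, assuming a TiSpark 2.x setup in which TiDB databases are exposed directly to Spark SQL:

    // Run inside spark-shell: list the databases and tables that TiSpark
    // exposes from TiDB. Assumes TiSpark 2.x catalog behavior.
    spark.sql("show databases").show()
    spark.sql("use TPCH_001")
    spark.sql("show tables").show()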

Usage examples

Enter the Spark deployment directory and start spark-shell:

    cd spark &&
    bin/spark-shell

Then query TiDB tables just as you would with native Spark:

    scala> spark.sql("select count(*) from lineitem").show

The result is:

    +--------+
    |count(1)|
    +--------+
    |   60175|
    +--------+
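The bundled spark-shell is already wired to the cluster, so no extra configuration was needed above. If you instead want to run the same query from a standalone Spark application, a minimal sketch looks like the following. The PD address 192.168.0.2:2379 is an assumption for illustration (substitute your cluster's actual PD endpoints), and the configuration keys assume TiSpark 2.x.

    import org.apache.spark.sql.SparkSession

    object TiSparkCount {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("TiSparkCount")
          // Load TiSpark's Spark SQL extensions so TiDB tables become visible.
          .config("spark.sql.extensions", "org.apache.spark.sql.TiExtensions")
          // PD endpoint of the TiDB cluster; 2379 is PD's default client port,
          // and this host is assumed for illustration.
          .config("spark.tispark.pd.addresses", "192.168.0.2:2379")
          .getOrCreate()

        spark.sql("use TPCH_001")
        spark.sql("select count(*) from lineitem").show()

        spark.stop()
      }
    }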

Next, run a slightly more complex Spark SQL query:

    scala> spark.sql(
             """select
               |   l_returnflag,
               |   l_linestatus,
               |   sum(l_quantity) as sum_qty,
               |   sum(l_extendedprice) as sum_base_price,
               |   sum(l_extendedprice * (1 - l_discount)) as sum_disc_price,
               |   sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge,
               |   avg(l_quantity) as avg_qty,
               |   avg(l_extendedprice) as avg_price,
               |   avg(l_discount) as avg_disc,
               |   count(*) as count_order
               |from
               |   lineitem
               |where
               |   l_shipdate <= date '1998-12-01' - interval '90' day
               |group by
               |   l_returnflag,
               |   l_linestatus
               |order by
               |   l_returnflag,
               |   l_linestatus
             """.stripMargin).show

The result is:

    +------------+------------+---------+--------------+--------------+
    |l_returnflag|l_linestatus|  sum_qty|sum_base_price|sum_disc_price|
    +------------+------------+---------+--------------+--------------+
    |           A|           F|380456.00|  532348211.65|505822441.4861|
    |           N|           F|  8971.00|   12384801.37| 11798257.2080|
    |           N|           O|742802.00| 1041502841.45|989737518.6346|
    |           R|           F|381449.00|  534594445.35|507996454.4067|
    +------------+------------+---------+--------------+--------------+

    (continued)

    +-----------------+---------+------------+--------+-----------+
    |       sum_charge|  avg_qty|   avg_price|avg_disc|count_order|
    +-----------------+---------+------------+--------+-----------+
    | 526165934.000839|25.575155|35785.709307|0.050081|      14876|
    |  12282485.056933|25.778736|35588.509684|0.047759|        348|
    |1029418531.523350|25.454988|35691.129209|0.049931|      29181|
    | 528524219.358903|25.597168|35874.006533|0.049828|      14902|
    +-----------------+---------+------------+--------+-----------+
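The same TPC-H Q1 aggregation can also be written with the DataFrame API instead of SQL text. A sketch, assuming the session is connected to TPCH_001 as above:

    import org.apache.spark.sql.functions._

    // DataFrame equivalent of the Spark SQL query above. Runs inside
    // spark-shell after spark.sql("use TPCH_001").
    val q1 = spark.table("lineitem")
      .where(col("l_shipdate") <= date_sub(lit("1998-12-01").cast("date"), 90))
      .groupBy("l_returnflag", "l_linestatus")
      .agg(
        sum("l_quantity").as("sum_qty"),
        sum("l_extendedprice").as("sum_base_price"),
        sum(col("l_extendedprice") * (lit(1) - col("l_discount"))).as("sum_disc_price"),
        sum(col("l_extendedprice") * (lit(1) - col("l_discount")) * (lit(1) + col("l_tax"))).as("sum_charge"),
        avg("l_quantity").as("avg_qty"),
        avg("l_extendedprice").as("avg_price"),
        avg("l_discount").as("avg_disc"),
        count(lit(1)).as("count_order")
      )
      .orderBy("l_returnflag", "l_linestatus")

    q1.show()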

For more examples, see pingcap/tispark-test.