Compared with Kylin 3.x, Kylin 4.0 ships a brand-new Spark build engine and Parquet storage, which makes it possible to deploy Kylin without a Hadoop environment. Compared with deploying Kylin 3.x on top of AWS EMR, deploying Kylin 4.0 directly on AWS EC2 instances has the following advantages:
1. Lower cost. AWS EC2 nodes are cheaper than AWS EMR nodes.
2. More flexibility. On EC2 nodes, users are free to choose which services and components to install and deploy.
3. Hadoop-free. The Hadoop ecosystem is fairly heavyweight and takes real maintenance effort; removing Hadoop brings the deployment closer to cloud-native.

After implementing build and query support for Spark Standalone mode, we tried a Hadoop-free deployment of Kylin 4.0 on AWS EC2 instances and successfully built a Cube and ran queries against it.

Environment Preparation

  • Launch AWS EC2 Linux instances as needed
  • Create an Amazon RDS for MySQL instance as the metastore for Kylin and Hive
  • Use S3 as Kylin storage
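If you prefer to provision these resources from the command line, a minimal AWS CLI sketch might look like the following (the bucket name, region, DB identifier, instance class, username and password are placeholders to adapt to your account):

    # create the S3 bucket used as Kylin storage
    aws s3 mb s3://my-kylin-bucket --region us-east-1
    # create a small MySQL instance on Amazon RDS for the Hive and Kylin metastores
    aws rds create-db-instance \
        --db-instance-identifier kylin-metastore \
        --db-instance-class db.t3.micro \
        --engine mysql \
        --allocated-storage 20 \
        --master-username admin \
        --master-user-password 'your-password'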

Component Versions

The versions listed here are the ones we used in our tests. If you need to deploy with other versions, feel free to swap them in, as long as the component versions remain compatible with each other.

  • JDK 1.8
  • Hive 2.3.9
  • Zookeeper 3.4.13
  • Kylin 4.0 for Spark 3
  • Spark 3.1.1
  • Hadoop 3.2.0 (no need to start it)

Installation Steps

1 Configure environment variables

  • Set the environment variables and make them take effect

    vim /etc/profile
    # append the following at the end of the profile file
    export JAVA_HOME=/usr/local/java/jdk1.8.0_291
    export JRE_HOME=${JAVA_HOME}/jre
    export HADOOP_HOME=/etc/hadoop/hadoop-3.2.0
    export HIVE_HOME=/etc/hadoop/hive
    export CLASSPATH=.:${JAVA_HOME}/lib:${JRE_HOME}/lib
    export PATH=$HIVE_HOME/bin:$HIVE_HOME/conf:${HADOOP_HOME}/bin:${JAVA_HOME}/bin:$PATH
    # after saving the file, run the following command
    source /etc/profile

2 Install JDK 1.8

  • Download JDK 1.8 to the prepared EC2 instance and unpack it into /usr/local/java:

    mkdir /usr/local/java
    tar -xvf java-1.8.0-openjdk.tar -C /usr/local/java
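    To confirm the JDK is in place (assuming the archive unpacks into the jdk1.8.0_291 directory that JAVA_HOME points to in step 1), a quick check:

    /usr/local/java/jdk1.8.0_291/bin/java -version
    # should print something like: openjdk version "1.8.0_291"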

3 Configure Hadoop

  • Download Hadoop and unpack it

    wget https://archive.apache.org/dist/hadoop/common/hadoop-3.2.0/hadoop-3.2.0.tar.gz
    mkdir /etc/hadoop
    tar -xvf hadoop-3.2.0.tar.gz -C /etc/hadoop
  • Copy the jars needed to connect to S3 into the Hadoop classpath; otherwise you may hit ClassNotFound-style errors

    cd /etc/hadoop
    cp hadoop-3.2.0/share/hadoop/tools/lib/aws-java-sdk-bundle-1.11.375.jar hadoop-3.2.0/share/hadoop/common/lib/
    cp hadoop-3.2.0/share/hadoop/tools/lib/hadoop-aws-3.2.0.jar hadoop-3.2.0/share/hadoop/common/lib/
  • Edit core-site.xml to configure the AWS credentials and endpoint; an example is shown below

    <?xml version="1.0" encoding="UTF-8"?>
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
    <!--
    Licensed under the Apache License, Version 2.0 (the "License");
    you may not use this file except in compliance with the License.
    You may obtain a copy of the License at
    http://www.apache.org/licenses/LICENSE-2.0
    Unless required by applicable law or agreed to in writing, software
    distributed under the License is distributed on an "AS IS" BASIS,
    WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    See the License for the specific language governing permissions and
    limitations under the License. See accompanying LICENSE file.
    -->
    <!-- Put site-specific property overrides in this file. -->
    <configuration>
      <property>
        <name>fs.s3a.access.key</name>
        <value>SESSION-ACCESS-KEY</value>
      </property>
      <property>
        <name>fs.s3a.secret.key</name>
        <value>SESSION-SECRET-KEY</value>
      </property>
      <property>
        <name>fs.s3a.endpoint</name>
        <value>s3.$REGION.amazonaws.com</value>
      </property>
    </configuration>
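    With the jars in place and core-site.xml filled in, S3 connectivity can optionally be verified from the Hadoop client without starting any Hadoop daemons (the bucket name is a placeholder):

    # a successful (possibly empty) listing confirms that credentials, endpoint and jars are picked up
    $HADOOP_HOME/bin/hadoop fs -ls s3a://my-kylin-bucket/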

4 Install Hive

  • Download Hive and unpack it

    wget https://downloads.apache.org/hive/hive-2.3.9/apache-hive-2.3.9-bin.tar.gz
    tar -xvf apache-hive-2.3.9-bin.tar.gz -C /etc/hadoop
    mv /etc/hadoop/apache-hive-2.3.9-bin /etc/hadoop/hive
  • Edit the Hive configuration file with vim ${HIVE_HOME}/conf/hive-site.xml. Start the Amazon RDS for MySQL database beforehand and obtain its connection URI, username and password.

    Note: make sure the VPC and security groups are configured so that the EC2 instance can reach the database.

    An example hive-site.xml looks like this:

    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?><!--
    Licensed to the Apache Software Foundation (ASF) under one or more
    contributor license agreements. See the NOTICE file distributed with
    this work for additional information regarding copyright ownership.
    The ASF licenses this file to You under the Apache License, Version 2.0
    (the "License"); you may not use this file except in compliance with
    the License. You may obtain a copy of the License at
    http://www.apache.org/licenses/LICENSE-2.0
    Unless required by applicable law or agreed to in writing, software
    distributed under the License is distributed on an "AS IS" BASIS,
    WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    See the License for the specific language governing permissions and
    limitations under the License.
    --><configuration>
    <!-- Hive Execution Parameters -->
    <property>
      <name>javax.jdo.option.ConnectionPassword</name>
      <value>password</value>
      <description>password to use against metastore database</description>
    </property>
    <property>
      <name>javax.jdo.option.ConnectionURL</name>
      <value>jdbc:mysql://host-name:3306/hive?createDatabaseIfNotExist=true</value>
      <description>JDBC connect string for a JDBC metastore</description>
    </property>
    <property>
      <name>javax.jdo.option.ConnectionDriverName</name>
      <value>com.mysql.jdbc.Driver</value>
      <description>Driver class name for a JDBC metastore</description>
    </property>
    <property>
      <name>javax.jdo.option.ConnectionUserName</name>
      <value>admin</value>
      <description>Username to use against metastore database</description>
    </property>
    <property>
      <name>hive.metastore.schema.verification</name>
      <value>false</value>
      <description>
      Enforce metastore schema version consistency.
      True: Verify that version information stored in metastore matches with one from Hive jars. Also disable automatic
      schema migration attempt. Users are required to manually migrate schema after Hive upgrade which ensures
      proper metastore schema migration. (Default)
      False: Warn if the version information stored in metastore doesn't match with one from in Hive jars.
      </description>
    </property>
    </configuration>
  • Initialize the Hive metastore

    # download the mysql-jdbc jar and place it under $HIVE_HOME/lib
    cp mysql-connector-java-5.1.47.jar $HIVE_HOME/lib
    $HIVE_HOME/bin/schematool -dbType mysql -initSchema
    mkdir $HIVE_HOME/logs
    nohup $HIVE_HOME/bin/hive --service metastore >> $HIVE_HOME/logs/hivemetastorelog.log 2>&1 &

    Note: if you hit the following error during this step:

    java.lang.NoSuchMethodError: com.google.common.base.Preconditions.checkArgument(ZLjava/lang/String;Ljava/lang/Object;)V

    it is caused by a mismatch between the guava version bundled with Hive 2 and the guava version bundled with Hadoop 3. Replace the guava jar under $HIVE_HOME/lib with the guava jar from $HADOOP_HOME/share/hadoop/common/lib/.
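    A minimal sketch of that replacement, assuming Hive 2.3.9 ships guava-14.0.1.jar (check the actual file names on both sides before deleting anything):

    # inspect the guava versions shipped by Hive and by Hadoop
    ls $HIVE_HOME/lib/guava-*.jar $HADOOP_HOME/share/hadoop/common/lib/guava-*.jar
    # remove Hive's older guava and copy Hadoop's newer one into Hive's classpath
    rm $HIVE_HOME/lib/guava-14.0.1.jar
    cp $HADOOP_HOME/share/hadoop/common/lib/guava-*.jar $HIVE_HOME/lib/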

  • To avoid jar conflicts later on, remove some Spark- and Scala-related jars from Hive's classpath

    rm $HIVE_HOME/lib/spark-* $HIVE_HOME/spark_jar
    rm $HIVE_HOME/lib/jackson-module-scala_2.11-2.6.5.jar

    Note: only the conflicting jars we ran into during our tests are listed here. If you hit similar jar conflicts, use the classpath to determine which jars clash and remove them (a quick way to list candidates is sketched below). When the same jar is present in conflicting versions, we recommend keeping the version found on Spark's classpath.
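    For example, the Spark/Scala-related jars still carried by Hive can be listed and compared against what sits under $SPARK_HOME/jars once Spark is unpacked in the next step:

    ls $HIVE_HOME/lib | grep -Ei 'spark|scala'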

5 Deploy Spark Standalone

  • Download Spark 3.1.1 and unpack it

    wget http://archive.apache.org/dist/spark/spark-3.1.1/spark-3.1.1-bin-hadoop3.2.tgz
    tar -xvf spark-3.1.1-bin-hadoop3.2.tgz -C /etc/hadoop
    mv /etc/hadoop/spark-3.1.1-bin-hadoop3.2 /etc/hadoop/spark
    export SPARK_HOME=/etc/hadoop/spark
  • Copy the jars needed to connect to S3, plus the MySQL JDBC driver

    cp $HADOOP_HOME/share/hadoop/tools/lib/hadoop-aws-3.2.0.jar $SPARK_HOME/jars
    cp $HADOOP_HOME/share/hadoop/tools/lib/aws-java-sdk-bundle-1.11.375.jar $SPARK_HOME/jars
    cp mysql-connector-java-5.1.47.jar $SPARK_HOME/jars
  • Copy the Hive configuration file

    cp $HIVE_HOME/conf/hive-site.xml $SPARK_HOME/conf
  • Start the Spark master and worker (the scripts live under sbin, not bin)

    $SPARK_HOME/sbin/start-master.sh
    $SPARK_HOME/sbin/start-worker.sh spark://hostname:7077
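    After both scripts return, it is worth confirming that the standalone cluster is really up: the Master and Worker processes should show up in jps, and the Master Web UI listens on port 8080 by default, where the worker should be listed as ALIVE:

    jps | grep -E 'Master|Worker'
    curl -s http://localhost:8080 | grep -i worker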

6 Deploy a Zookeeper pseudo-cluster

  • Download the Zookeeper package and unpack it

    wget http://archive.apache.org/dist/zookeeper/zookeeper-3.4.13/zookeeper-3.4.13.tar.gz
    tar -xvf zookeeper-3.4.13.tar.gz -C /etc/hadoop
    mv /etc/hadoop/zookeeper-3.4.13 /etc/hadoop/zookeeper
  • Modify the Zookeeper configuration files to start a three-node pseudo-cluster

    cp /etc/hadoop/zookeeper/conf/zoo_sample.cfg /etc/hadoop/zookeeper/conf/zoo1.cfg
    cp /etc/hadoop/zookeeper/conf/zoo_sample.cfg /etc/hadoop/zookeeper/conf/zoo2.cfg
    cp /etc/hadoop/zookeeper/conf/zoo_sample.cfg /etc/hadoop/zookeeper/conf/zoo3.cfg
  • Edit each of the three configuration files in turn, adding the following content. The server.N lines are identical in all three files, while dataDir, dataLogDir and clientPort must be unique per instance (the values below are for zoo1.cfg; use the zk2/zk3 directories and different client ports, e.g. 2182 and 2183, for the other two):

    server.1=localhost:2287:3387
    server.2=localhost:2288:3388
    server.3=localhost:2289:3389
    dataDir=/tmp/zookeeper/zk1/data
    dataLogDir=/tmp/zookeeper/zk1/log
    clientPort=2181
  • Create the required directories and myid files (each myid file contains only that instance's server id: 1 for zk1, 2 for zk2, 3 for zk3)

    mkdir -p /tmp/zookeeper/zk1/data
    mkdir -p /tmp/zookeeper/zk1/log
    mkdir -p /tmp/zookeeper/zk2/data
    mkdir -p /tmp/zookeeper/zk2/log
    mkdir -p /tmp/zookeeper/zk3/data
    mkdir -p /tmp/zookeeper/zk3/log
    echo 1 > /tmp/zookeeper/zk1/data/myid
    echo 2 > /tmp/zookeeper/zk2/data/myid
    echo 3 > /tmp/zookeeper/zk3/data/myid
  • Start the Zookeeper cluster

    /etc/hadoop/zookeeper/bin/zkServer.sh start /etc/hadoop/zookeeper/conf/zoo1.cfg
    /etc/hadoop/zookeeper/bin/zkServer.sh start /etc/hadoop/zookeeper/conf/zoo2.cfg
    /etc/hadoop/zookeeper/bin/zkServer.sh start /etc/hadoop/zookeeper/conf/zoo3.cfg
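  • (Optional) Check the status of the three instances with the same config-file argument used for start; one instance should report itself as leader and the other two as follower

    /etc/hadoop/zookeeper/bin/zkServer.sh status /etc/hadoop/zookeeper/conf/zoo1.cfg
    /etc/hadoop/zookeeper/bin/zkServer.sh status /etc/hadoop/zookeeper/conf/zoo2.cfg
    /etc/hadoop/zookeeper/bin/zkServer.sh status /etc/hadoop/zookeeper/conf/zoo3.cfg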

7 Start Kylin

  • Download the Kylin 4.0 binary package and unpack it

    wget https://mirror-hk.koddos.net/apache/kylin/apache-kylin-4.0.0/apache-kylin-4.0.0-bin.tar.gz
    tar -xvf apache-kylin-4.0.0-bin.tar.gz -C /etc/hadoop
    export KYLIN_HOME=/etc/hadoop/apache-kylin-4.0.0-bin
    mkdir $KYLIN_HOME/ext
    cp mysql-connector-java-5.1.47.jar $KYLIN_HOME/ext
  • Edit the configuration file with vim $KYLIN_HOME/conf/kylin.properties

    kylin.metadata.url=kylin_metadata@jdbc,url=jdbc:mysql://hostname:3306/kylin,username=root,password=password,maxActive=10,maxIdle=10
    kylin.env.zookeeper-connect-string=hostname
    kylin.engine.spark-conf.spark.master=spark://hostname:7077
    kylin.engine.spark-conf.spark.submit.deployMode=client
    kylin.env.hdfs-working-dir=s3://bucket/kylin
    kylin.engine.spark-conf.spark.eventLog.dir=s3://bucket/kylin/spark-history
    kylin.engine.spark-conf.spark.history.fs.logDirectory=s3://bucket/kylin/spark-history
    kylin.query.spark-conf.spark.master=spark://hostname:7077
  • Run bin/kylin.sh start

  • Kylin may hit ClassNotFound-style errors at startup. They can be resolved as follows (using the same aws-java-sdk-bundle and hadoop-aws jars copied in step 3); restart Kylin afterwards:

    # download commons-collections-3.2.2.jar
    cp commons-collections-3.2.2.jar $KYLIN_HOME/tomcat/webapps/kylin/WEB-INF/lib/
    # download commons-configuration-1.3.jar
    cp commons-configuration-1.3.jar $KYLIN_HOME/tomcat/webapps/kylin/WEB-INF/lib/
    cp $HADOOP_HOME/share/hadoop/common/lib/aws-java-sdk-bundle-1.11.375.jar $KYLIN_HOME/tomcat/webapps/kylin/WEB-INF/lib/
    cp $HADOOP_HOME/share/hadoop/common/lib/hadoop-aws-3.2.0.jar $KYLIN_HOME/tomcat/webapps/kylin/WEB-INF/lib/
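    Once Kylin is up, the Web UI should be reachable at http://<hostname>:7070/kylin (default account ADMIN / KYLIN). If the page does not come up, $KYLIN_HOME/logs/kylin.log is the first place to look:

    # follow the Kylin log during startup
    tail -f $KYLIN_HOME/logs/kylin.log
    # quick check that the Web UI answers (expect an HTTP status code such as 200 or 302)
    curl -s -o /dev/null -w "%{http_code}\n" http://localhost:7070/kylin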