Spark on Kubernetes

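Kubernetes has supported native Apache Spark applications since v1.8. This requires a Spark build with Kubernetes support, such as v2.2.0-kubernetes-0.4.0. With such a build, jobs can be submitted to Kubernetes directly with the spark-submit command. For example, to compute pi: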

    bin/spark-submit \
      --deploy-mode cluster \
      --class org.apache.spark.examples.SparkPi \
      --master k8s://https://<k8s-apiserver-host>:<k8s-apiserver-port> \
      --kubernetes-namespace default \
      --conf spark.executor.instances=5 \
      --conf spark.app.name=spark-pi \
      --conf spark.kubernetes.driver.docker.image=kubespark/spark-driver:v2.2.0-kubernetes-0.4.0 \
      --conf spark.kubernetes.executor.docker.image=kubespark/spark-executor:v2.2.0-kubernetes-0.4.0 \
      local:///opt/spark/examples/jars/spark-examples_2.11-2.2.0-k8s-0.4.0.jar
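
When the same submission has to be repeated against several clusters or image tags, it can help to assemble the argument list in code rather than by hand. The following is a minimal sketch of that idea, not part of Spark itself: the function name build_spark_submit and the API server address k8s.example.internal:6443 are hypothetical, while the flags and configuration keys are the ones used in the command above.

```python
import shlex


def build_spark_submit(master_url, namespace, image_tag, app_jar,
                       main_class="org.apache.spark.examples.SparkPi",
                       executors=5, app_name="spark-pi"):
    """Return the spark-submit argument list for a cluster-mode Kubernetes job."""
    return [
        "bin/spark-submit",
        "--deploy-mode", "cluster",
        "--class", main_class,
        "--master", "k8s://" + master_url,
        "--kubernetes-namespace", namespace,
        "--conf", "spark.executor.instances=%d" % executors,
        "--conf", "spark.app.name=" + app_name,
        "--conf", "spark.kubernetes.driver.docker.image=kubespark/spark-driver:" + image_tag,
        "--conf", "spark.kubernetes.executor.docker.image=kubespark/spark-executor:" + image_tag,
        app_jar,
    ]


# Hypothetical API server address; substitute your own host and port.
argv = build_spark_submit(
    master_url="https://k8s.example.internal:6443",
    namespace="default",
    image_tag="v2.2.0-kubernetes-0.4.0",
    app_jar="local:///opt/spark/examples/jars/spark-examples_2.11-2.2.0-k8s-0.4.0.jar",
)

# Print the command in a form that can be pasted into a shell.
print(" ".join(shlex.quote(arg) for arg in argv))
```

To actually run the job, the list could be passed to subprocess.run(argv, check=True) from the Spark distribution directory.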

Or use the Python version:

    bin/spark-submit \
      --deploy-mode cluster \
      --master k8s://https://<k8s-apiserver-host>:<k8s-apiserver-port> \
      --kubernetes-namespace <k8s-namespace> \
      --conf spark.executor.instances=5 \
      --conf spark.app.name=spark-pi \
      --conf spark.kubernetes.driver.docker.image=kubespark/spark-driver-py:v2.2.0-kubernetes-0.4.0 \
      --conf spark.kubernetes.executor.docker.image=kubespark/spark-executor-py:v2.2.0-kubernetes-0.4.0 \
      --jars local:///opt/spark/examples/jars/spark-examples_2.11-2.2.0-k8s-0.4.0.jar \
      --py-files local:///opt/spark/examples/src/main/python/sort.py \
      local:///opt/spark/examples/src/main/python/pi.py 10
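
Both example programs estimate pi with a Monte Carlo method: they draw random points in a square, count how many fall inside the inscribed unit circle, and scale the ratio by 4. The sketch below shows the same computation in plain Python on a single machine, without Spark. The function name estimate_pi and the fixed seed are choices made for this illustration only; in the Spark examples the sampling is spread across executors.

```python
import random


def estimate_pi(samples, seed=0):
    """Estimate pi by sampling points uniformly in the square [-1, 1] x [-1, 1]."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(samples):
        x = rng.uniform(-1.0, 1.0)
        y = rng.uniform(-1.0, 1.0)
        # Count the point if it lies inside the unit circle.
        if x * x + y * y <= 1.0:
            inside += 1
    # The circle covers pi/4 of the square, so scale the hit ratio by 4.
    return 4.0 * inside / samples


print(estimate_pi(200000))
```

The estimate gets closer to pi as the number of samples grows, which is why the work is a natural fit for parallel executors.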

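Deploying Spark on Kubernetes

The Kubernetes examples repository on GitHub provides a detailed Spark deployment guide. Because its steps are fairly involved, this section simplifies parts of it so that fewer settings are needed during installation.

Prerequisites

  • A Kubernetes cluster; see the cluster deployment guide
  • kube-dns running normally

Create a namespace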

namespace-spark-cluster.yaml

    apiVersion: v1
    kind: Namespace
    metadata:
      name: "spark-cluster"
      labels:
        name: "spark-cluster"

    $ kubectl create -f examples/staging/spark/namespace-spark-cluster.yaml
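
Every manifest in this section refers to the same namespace name, so one way to avoid typos is to generate the manifest from a single variable. The sketch below builds the same Namespace object as JSON, which kubectl accepts alongside YAML. The function name namespace_manifest is an assumption made for this illustration.

```python
import json


def namespace_manifest(name):
    """Return a Namespace manifest equivalent to namespace-spark-cluster.yaml."""
    return {
        "apiVersion": "v1",
        "kind": "Namespace",
        "metadata": {"name": name, "labels": {"name": name}},
    }


manifest = namespace_manifest("spark-cluster")

# Emit JSON on stdout so it can be redirected to a file or piped to kubectl.
print(json.dumps(manifest, indent=2))
```

If the script were saved as, say, make_namespace.py (a hypothetical file name), its output could be piped into kubectl create -f - instead of pointing kubectl at the YAML file.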

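The original guide switches the kubectl context to the spark-cluster namespace at this point. For convenience we skip that step and instead set namespace: spark-cluster in every manifest that follows.

Deploy the Master service

Create a replication controller to run the Spark Master service.

spark-master-controller.yaml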

    kind: ReplicationController
    apiVersion: v1
    metadata:
      name: spark-master-controller
      namespace: spark-cluster
    spec:
      replicas: 1
      selector:
        component: spark-master
      template:
        metadata:
          labels:
            component: spark-master
        spec:
          containers:
            - name: spark-master
              image: gcr.io/google_containers/spark:1.5.2_v1
              command: ["/start-master"]
              ports:
                - containerPort: 7077
                - containerPort: 8080
              resources:
                requests:
                  cpu: 100m

    $ kubectl create -f spark-master-controller.yaml

Create the master service

spark-master-service.yaml

    kind: Service
    apiVersion: v1
    metadata:
      name: spark-master
      namespace: spark-cluster
    spec:
      ports:
        - port: 7077
          targetPort: 7077
          name: spark
        - port: 8080
          targetPort: 8080
          name: http
      selector:
        component: spark-master

    $ kubectl create -f spark-master-service.yaml

Check that the Master is running

    $ kubectl get pod -n spark-cluster
    spark-master-controller-qtwm8 1/1 Running 0 6d

    $ kubectl logs spark-master-controller-qtwm8 -n spark-cluster
    17/08/07 02:34:54 INFO Master: Registered signal handlers for [TERM, HUP, INT]
    17/08/07 02:34:54 INFO SecurityManager: Changing view acls to: root
    17/08/07 02:34:54 INFO SecurityManager: Changing modify acls to: root
    17/08/07 02:34:54 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); users with modify permissions: Set(root)
    17/08/07 02:34:55 INFO Slf4jLogger: Slf4jLogger started
    17/08/07 02:34:55 INFO Remoting: Starting remoting
    17/08/07 02:34:55 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkMaster@spark-master:7077]
    17/08/07 02:34:55 INFO Utils: Successfully started service 'sparkMaster' on port 7077.
    17/08/07 02:34:55 INFO Master: Starting Spark master at spark://spark-master:7077
    17/08/07 02:34:55 INFO Master: Running Spark version 1.5.2
    17/08/07 02:34:56 INFO Utils: Successfully started service 'MasterUI' on port 8080.
    17/08/07 02:34:56 INFO MasterWebUI: Started MasterWebUI at http://10.2.6.12:8080
    17/08/07 02:34:56 INFO Utils: Successfully started service on port 6066.
    17/08/07 02:34:56 INFO StandaloneRestServer: Started REST server for submitting applications on port 6066
    17/08/07 02:34:56 INFO Master: I have been elected leader! New state: ALIVE
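
The two lines that matter most in this log are the one announcing the master URL and the one reporting that the master was elected leader. A small script can look for both instead of reading the log by eye. This is only a sketch: the function name master_status is hypothetical, and the patterns are taken from the Spark 1.5.2 log output shown above, so other Spark versions may word these messages differently.

```python
import re

# Patterns copied from the master log output shown above.
MASTER_URL_RE = re.compile(r"Starting Spark master at (spark://\S+)")
LEADER_MARKER = "I have been elected leader! New state: ALIVE"


def master_status(log_text):
    """Return (master_url, is_alive) extracted from Spark master log output."""
    match = MASTER_URL_RE.search(log_text)
    url = match.group(1) if match else None
    return url, LEADER_MARKER in log_text


sample = """\
17/08/07 02:34:55 INFO Master: Starting Spark master at spark://spark-master:7077
17/08/07 02:34:55 INFO Master: Running Spark version 1.5.2
17/08/07 02:34:56 INFO Master: I have been elected leader! New state: ALIVE
"""

print(master_status(sample))
```

In practice the log text would come from the kubectl logs command rather than from an inline sample.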

Once the master has been created and is running, we can inspect the state of the Spark cluster through Spark's web UI. To reach it, we deploy a specialized proxy.

spark-ui-proxy-controller.yaml

    kind: ReplicationController
    apiVersion: v1
    metadata:
      name: spark-ui-proxy-controller
      namespace: spark-cluster
    spec:
      replicas: 1
      selector:
        component: spark-ui-proxy
      template:
        metadata:
          labels:
            component: spark-ui-proxy
        spec:
          containers:
            - name: spark-ui-proxy
              image: elsonrodriguez/spark-ui-proxy:1.0
              ports:
                - containerPort: 80
              resources:
                requests:
                  cpu: 100m
              args:
                - spark-master:8080
              livenessProbe:
                httpGet:
                  path: /
                  port: 80
                initialDelaySeconds: 120
                timeoutSeconds: 5

    $ kubectl create -f spark-ui-proxy-controller.yaml

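Expose the proxy through a service. The original guide uses the LoadBalancer type; here we switch to NodePort. If your Kubernetes cluster runs on a cloud provider, you can follow the original approach instead.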

spark-ui-proxy-service.yaml

    kind: Service
    apiVersion: v1
    metadata:
      name: spark-ui-proxy
      namespace: spark-cluster
    spec:
      ports:
        - port: 80
          targetPort: 80
          nodePort: 30080
      selector:
        component: spark-ui-proxy
      type: NodePort

    $ kubectl create -f spark-ui-proxy-service.yaml

After deployment, you can use kubectl proxy to view the state of your Spark cluster.

    $ kubectl proxy --port=8001

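The UI is then available at http://localhost:8001/api/v1/proxy/namespaces/spark-cluster/services/spark-master:8080. This stops working as soon as kubectl proxy is interrupted. Because we configured a NodePort earlier, the UI is also reachable on port 30080 of any node, for example http://10.201.2.34:30080. Here 10.201.2.34 is one of the nodes in the cluster; replace it with the address of one of your own nodes.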
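
Checking the NodePort from a script can be quicker than opening a browser on each node. The sketch below builds the URL from a node address and port and offers an optional reachability probe using only the Python standard library. The helper names node_port_url and is_reachable are assumptions made for this illustration, and the node IP is the example address from the text.

```python
import urllib.error
import urllib.request


def node_port_url(node_ip, node_port):
    """Build the URL of a NodePort service exposed on the given node."""
    return "http://%s:%d" % (node_ip, node_port)


def is_reachable(url, timeout=5.0):
    """Return True if an HTTP GET to url succeeds with a 2xx or 3xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, or an HTTP error status.
        return False


ui_url = node_port_url("10.201.2.34", 30080)
print(ui_url)

# is_reachable(ui_url) sends a real request, so call it only from a machine
# that can reach the node.
```

The same helper works for any other NodePort service in the cluster by changing the port number.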

Deploy the Spark workers

Make sure the Master is running before you continue.

spark-worker-controller.yaml

    kind: ReplicationController
    apiVersion: v1
    metadata:
      name: spark-worker-controller
      namespace: spark-cluster
    spec:
      replicas: 2
      selector:
        component: spark-worker
      template:
        metadata:
          labels:
            component: spark-worker
        spec:
          containers:
            - name: spark-worker
              image: gcr.io/google_containers/spark:1.5.2_v1
              command: ["/start-worker"]
              ports:
                - containerPort: 8081
              resources:
                requests:
                  cpu: 100m

    $ kubectl create -f spark-worker-controller.yaml
    replicationcontroller "spark-worker-controller" created

Check the status of the pods with kubectl:

    $ kubectl get pod -n spark-cluster
    spark-master-controller-qtwm8 1/1 Running 0 6d
    spark-worker-controller-4rxrs 1/1 Running 0 6d
    spark-worker-controller-z6f21 1/1 Running 0 6d
    spark-ui-proxy-controller-d4br2 1/1 Running 4 6d
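
With several pods in the namespace, it can be handy to flag any that are not in the Running state automatically. The following sketch parses the text output of kubectl get pod, assuming the column order name, ready, status, restarts, age seen above. The function name not_running is hypothetical.

```python
def not_running(get_pod_output):
    """Return the names of pods whose STATUS column is not 'Running'."""
    bad = []
    for line in get_pod_output.splitlines():
        fields = line.split()
        # Skip blank lines and the NAME/READY/STATUS header row, if present.
        if len(fields) < 3 or fields[0] == "NAME":
            continue
        name, status = fields[0], fields[2]
        if status != "Running":
            bad.append(name)
    return bad


sample = """\
spark-master-controller-qtwm8 1/1 Running 0 6d
spark-worker-controller-4rxrs 1/1 Running 0 6d
spark-worker-controller-z6f21 1/1 Running 0 6d
spark-ui-proxy-controller-d4br2 1/1 Running 4 6d
"""

print(not_running(sample))
```

An empty result means every listed pod reports Running. A high restart count, such as the proxy's, is still worth a look in the pod logs.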

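You can also check through the web UI service created above.

At this point the Spark cluster is essentially up and running.

Create the Zeppelin UI

With the Zeppelin UI we can run jobs directly from a web notebook.
For details, see "Zeppelin UI Spark architecture".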

zeppelin-controller.yaml

    kind: ReplicationController
    apiVersion: v1
    metadata:
      name: zeppelin-controller
      namespace: spark-cluster
    spec:
      replicas: 1
      selector:
        component: zeppelin
      template:
        metadata:
          labels:
            component: zeppelin
        spec:
          containers:
            - name: zeppelin
              image: gcr.io/google_containers/zeppelin:v0.5.6_v1
              ports:
                - containerPort: 8080
              resources:
                requests:
                  cpu: 100m

    $ kubectl create -f zeppelin-controller.yaml
    replicationcontroller "zeppelin-controller" created

Then deploy a service for it in the same way.

zeppelin-service.yaml

    kind: Service
    apiVersion: v1
    metadata:
      name: zeppelin
      namespace: spark-cluster
    spec:
      ports:
        - port: 80
          targetPort: 8080
          nodePort: 30081
      selector:
        component: zeppelin
      type: NodePort

    $ kubectl create -f zeppelin-service.yaml

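Note that the NodePort is set to 30081, so the Zeppelin UI is likewise reachable on port 30081 of any node.

To access pyspark from the command line (remember to replace the pod name with your own):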

    $ kubectl exec -it zeppelin-controller-8f14f -n spark-cluster pyspark
    Python 2.7.9 (default, Mar 1 2015, 12:57:24)
    [GCC 4.9.2] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    17/08/14 01:59:22 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
    Welcome to
          ____              __
         / __/__  ___ _____/ /__
        _\ \/ _ \/ _ `/ __/  '_/
       /__ / .__/\_,_/_/ /_/\_\   version 1.5.2
          /_/

    Using Python version 2.7.9 (default, Mar 1 2015 12:57:24)
    SparkContext available as sc, HiveContext available as sqlContext.
    >>>

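You can now use Spark. Corrections are welcome if you find any mistakes.

Common Zeppelin issues

  • The Zeppelin image is very large, so pulling it takes some time. The image size is being worked on; see issue #17231 for details.
  • On GKE, kubectl port-forward can be unstable. If Zeppelin shows its status as Disconnected, port-forward has probably failed and needs to be restarted; see #12179 for details.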

References