Using OSS Select to Accelerate Spark Data Queries (Part 2)


-rw-r--r-- root/root 67758 2018-10-30 16:11 spark-2.2.0-oss-select-0.1.0-SNAPSHOT/jettison-1.1.jar
-rw-r--r-- root/root 57264 2018-10-30 16:11 spark-2.2.0-oss-select-0.1.0-SNAPSHOT/json-20170516.jar
-rw-r--r-- root/root 890168 2018-10-30 16:11 spark-2.2.0-oss-select-0.1.0-SNAPSHOT/jaxb-impl-2.2.3-1.jar
-rw-r--r-- root/root 458739 2018-10-30 16:11 spark-2.2.0-oss-select-0.1.0-SNAPSHOT/jersey-core-1.9.jar
-rw-r--r-- root/root 147952 2018-10-30 16:11 spark-2.2.0-oss-select-0.1.0-SNAPSHOT/jersey-json-1.9.jar
-rw-r--r-- root/root 788137 2018-10-30 16:11 spark-2.2.0-oss-select-0.1.0-SNAPSHOT/aliyun-java-sdk-ecs-4.2.0.jar
-rw-r--r-- root/root 153115 2018-10-30 16:11 spark-2.2.0-oss-select-0.1.0-SNAPSHOT/jdom-1.1.jar
-rw-r--r-- root/root 65437 2018-10-31 14:41 spark-2.2.0-oss-select-0.1.0-SNAPSHOT/aliyun-oss-select-spark_2.11-0.1.0-SNAPSHOT.jar

  • Go to the ${CDH_HOME}/lib/spark/jars directory and run the following commands:
     
    [root@cdh-master jars]# pwd
    /opt/cloudera/parcels/CDH/lib/spark/jars
    [root@cdh-master jars]# rm -f aliyun-sdk-oss-2.8.3.jar
    [root@cdh-master jars]# ln -s ../../../jars/aliyun-oss-select-spark_2.11-0.1.0-SNAPSHOT.jar aliyun-oss-select-spark_2.11-0.1.0-SNAPSHOT.jar
    [root@cdh-master jars]# ln -s ../../../jars/aliyun-java-sdk-core-3.4.0.jar aliyun-java-sdk-core-3.4.0.jar
    [root@cdh-master jars]# ln -s ../../../jars/aliyun-java-sdk-ecs-4.2.0.jar aliyun-java-sdk-ecs-4.2.0.jar
    [root@cdh-master jars]# ln -s ../../../jars/aliyun-java-sdk-ram-3.0.0.jar aliyun-java-sdk-ram-3.0.0.jar
    [root@cdh-master jars]# ln -s ../../../jars/aliyun-java-sdk-sts-3.0.0.jar aliyun-java-sdk-sts-3.0.0.jar
    [root@cdh-master jars]# ln -s ../../../jars/aliyun-sdk-oss-3.3.0.jar aliyun-sdk-oss-3.3.0.jar
    [root@cdh-master jars]# ln -s ../../../jars/jdom-1.1.jar jdom-1.1.jar
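The repeated `ln -s` commands above can also be expressed as a loop. The following is only a minimal sketch: `link_jars` is a hypothetical helper (not part of CDH or the OSS Select package), and it assumes the jars have already been copied into the shared jars directory that `../../../jars` resolves to.

```shell
# Sketch of a hypothetical helper that mirrors the manual "ln -s" steps.
# Usage: link_jars <spark_jars_dir> <relative_source_dir> <jar>...
# The body runs in a subshell, so the caller's working directory is unchanged.
link_jars() (
  dest_dir=$1
  rel_src=$2
  shift 2
  cd "$dest_dir" || exit 1
  for jar in "$@"; do
    # Stop early if a jar is missing, rather than leaving a dangling link.
    if [ ! -f "$rel_src/$jar" ]; then
      echo "missing: $rel_src/$jar" >&2
      exit 1
    fi
    # -s creates a symbolic link; -f replaces a link left by an earlier run.
    ln -sf "$rel_src/$jar" "$jar"
  done
)

# Example call (assumes CDH_HOME is set as in the text):
# link_jars "${CDH_HOME}/lib/spark/jars" ../../../jars \
#   aliyun-oss-select-spark_2.11-0.1.0-SNAPSHOT.jar \
#   aliyun-sdk-oss-3.3.0.jar jdom-1.1.jar
```

Keeping the link targets relative, as the original commands do, means the links remain valid if the parcel directory is reached through a different absolute path.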
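  • Comparison test. Test environment: the comparison runs Spark on YARN with 4 Node Manager nodes; each node can run at most 4 containers, and each container is allocated 1 core and 2 GB of memory.
    Test data: 630 MB in total, with 3 columns: name, company, and age.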
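As a quick sanity check on these figures, the arithmetic can be sketched as follows. The per-file size of 63079930 bytes is an assumption based on reading the size column of the file listing that way; the container total is an upper bound, since YARN also needs a container for the application master.

```shell
# Assumed inputs, taken from the test description:
# 4 Node Managers x 4 containers, 1 core and 2 GB per container,
# 10 part files of 63079930 bytes each (assumed reading of the listing).
nodes=4
containers_per_node=4
mem_per_container_gb=2
files=10
bytes_per_file=63079930

total_containers=$((nodes * containers_per_node))
total_mem_gb=$((total_containers * mem_per_container_gb))
total_bytes=$((files * bytes_per_file))
total_mb=$((total_bytes / 1000000))   # decimal megabytes, rounded down

echo "containers=$total_containers cores=$total_containers memory_gb=$total_mem_gb"
echo "data_bytes=$total_bytes data_mb=$total_mb"
```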
     
    [root@cdh-master jars]# hadoop fs -ls oss://select-test-sz/people/
    Found 10 items
    -rw-rw-rw-   1   63079930 2018-10-30 17:03 oss://select-test-sz/people/part-00000
    -rw-rw-rw-   1   63079930 2018-10-30 17:03 oss://select-test-sz/people/part-00001
    -rw-rw-rw-   1   63079930 2018-10-30 17:05 oss://select-test-sz/people/part-00002
    -rw-rw-rw-   1   63079930 2018-10-30 17:05 oss://select-test-sz/people/part-00003
    -rw-rw-rw-   1   63079930 2018-10-30 17:06 oss://select-test-sz/people/part-00004
    -rw-rw-rw-   1   63079930 2018-10-30 17:12 oss://select-test-sz/people/part-00005
    -rw-rw-rw-   1   63079930 2018-10-30 17:14 oss://select-test-sz/people/part-00006
    -rw-rw-rw-   1   63079930 2018-10-30 17:14 oss://select-test-sz/people/part-00007
    -rw-rw-rw-   1   63079930 2018-10-30 17:15 oss://select-test-sz/people/part-00008
    -rw-rw-rw-   1   63079930 2018-10-30 17:16 oss://select-test-sz/people/part-00009

    Go to ${CDH_HOME}/lib/spark/ and start spark-shell, then query the data once with OSS Select and once without it:
     
    [root@cdh-master spark]# ./bin/spark-shell
    WARNING: User-defined SPARK_HOME (/opt/cloudera/parcels/CDH-6.0.1-1.cdh6.0.1.p0.590678/lib/spark) overrides detected (/opt/cloudera/parcels/CDH/lib/spark).
    WARNING: Running spark-class from user-defined location.
    Setting default log level to "WARN".
    To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
    Spark context Web UI available at http://x.x.x.x:4040
    Spark context available as 'sc' (master = yarn, app id = application_1540887123331_0008).
    Spark session available as 'spark'.
    Welcome to
          ____              __
         / __/__  ___ _____/ /__
        _\ \/ _ \/ _ `/ __/  '_/
       /___/ .__/\_,_/_/ /_/\_\   version 2.2.0-cdh6.0.1
          /_/

    Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_152)
    Type in expressions to have them evaluated.
    Type :help for more information.

    scala> val sqlContext = spark.sqlContext
