高动态数据集分析

今天的数据是动态的和由应用程序驱动的。一个新的数据类型和新的数据模型,生成数据集,是商业应用的发展趋势,比如网络,社交,移动和互联网的新时代的增长。这些应用程序被快速迭代,这些数据模型和半结构化,无 Schema 结合,并快速发展。半结构化数据模型可以是复杂的/嵌套的,无 Schema 的,并且能够在每一行中具有不同的字段,并不断地发展,以满足业务需求的不断增加和删除。

本教程向你展示如何来查询动态数据,例如 JSON,并在几分钟之内从任何数据类型中,得到结果。在本例中使用的数据集是从 Yelp 的签到数据,它具有以下结构:

  1. check-in
  2. {
  3. 'type': 'checkin',
  4. 'business_id': (encrypted business id),
  5. 'checkin_info': {
  6. '0-0': (number of checkins from 00:00 to 01:00 on all Sundays),
  7. '1-0': (number of checkins from 01:00 to 02:00 on all Sundays),
  8. ...
  9. '14-4': (number of checkins from 14:00 to 15:00 on all Thursdays),
  10. ...
  11. '23-6': (number of checkins from 23:00 to 00:00 on all Saturdays)
  12. }, # if there was no checkin for a hour-day block it will not be in the dataset
  13. }

值得在这段底部重复评论:

  1. # if there was no checkin for a hour-day block it will not be in the dataset.

元素的名称,在 checkin_info 看到预期的结果之前可以每行的变化。数据,虽然简单,但是是高动态的数据。在你会在 Hadoop 上使用任何 SQL 技术,对数据进行分析时,无需首先在数据集上建立一个扁平的关系结构。

第一步:首先下载 Drill,如果你还没有下载,那么请先在你的机器上下载。

  1. http://drill.apache.org/download/
  2. tar -xvf apache-drill-1.4.0.tar

安装 Drill 在你的本机上(或者是本机的虚拟机上)。你不需要 Hadoop。

第二步:启动 Drill Shell。

  1. bin/drill-embedded

第三步:使用 SQL 去分析数据

首先,让我们先看看数据集:

  1. 0: jdbc:drill:zk=local> SELECT * FROM dfs.`/users/nrentachintala/Downloads/yelp/yelp_academic_dataset_checkin.json` limit 2;
  2. +------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------+------------------------+
  3. | checkin_info | type | business_id |
  4. +------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------+------------------------+
  5. | {"3-4":1,"13-5":1,"6-6":1,"14-5":1,"14-6":1,"14-2":1,"14-3":1,"19-0":1,"11-5":1,"13-2":1,"11-6":2,"11-3":1,"12-6":1,"6-5":1,"5-5":1,"9-2":1,"9-5":1,"9-6":1,"5-2":1,"7-6":1,"7-5":1,"7-4":1,"17-5":1,"8-5":1,"10-2":1,"10-5":1,"10-6":1} | checkin | JwUE5GmEO-sH1FuwJgKBlQ |
  6. | {"6-6":2,"6-5":1,"7-6":1,"7-5":1,"8-5":2,"10-5":1,"9-3":1,"12-5":1,"15-3":1,"15-5":1,"15-6":1,"16-3":1,"10-0":1,"15-4":1,"10-4":1,"8-2":1} | checkin | uGykseHzyS5xAMWoN6YUqA |
  7. +------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------+------------------------+

注意:该文档输出只是一个示例,Drill 的输出结果并不是对齐的。

你可以直接查询 JSON 文件。Schema 的定义在 Hive 中并不是必需的。checkin_info 列在第一行和第二行是不同的。

Drill 提供了一个函数叫 KVGEN(Key 和 Value 的生成器),它用于工作在复杂数据,在包含动态的 maps 集合和未知的元素名称,比如 checkin_info。KVGEN 将动态的 Map 转化为键值对,键代表动态元素名称的数组。

让我们应用 KVGEN 到 checkin_info 元素中去生成键值对。

  1. 0: jdbc:drill:zk=local> SELECT KVGEN(checkin_info) checkins FROM dfs.`/users/nrentachintala/Downloads/yelp/yelp_academic_dataset_checkin.json` LIMIT 2;
  2. +------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
  3. | checkins |
  4. +------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
  5. | [{"key":"3-4","value":1},{"key":"13-5","value":1},{"key":"6-6","value":1},{"key":"14-5","value":1},{"key":"14-6","value":1},{"key":"14-2","value":1},{"key":"14-3","value":1},{"key":"19-0","value":1},{"key":"11-5","value":1},{"key":"13-2","value":1},{"key":"11-6","value":2},{"key":"11-3","value":1},{"key":"12-6","value":1},{"key":"6-5","value":1},{"key":"5-5","value":1},{"key":"9-2","value":1},{"key":"9-5","value":1},{"key":"9-6","value":1},{"key":"5-2","value":1},{"key":"7-6","value":1},{"key":"7-5","value":1},{"key":"7-4","value":1},{"key":"17-5","value":1},{"key":"8-5","value":1},{"key":"10-2","value":1},{"key":"10-5","value":1},{"key":"10-6","value":1}] |
  6. | [{"key":"6-6","value":2},{"key":"6-5","value":1},{"key":"7-6","value":1},{"key":"7-5","value":1},{"key":"8-5","value":2},{"key":"10-5","value":1},{"key":"9-3","value":1},{"key":"12-5","value":1},{"key":"15-3","value":1},{"key":"15-5","value":1},{"key":"15-6","value":1},{"key":"16-3","value":1},{"key":"10-0","value":1},{"key":"15-4","value":1},{"key":"10-4","value":1},{"key":"8-2","value":1}] |
  7. +------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

Drill 还提供了其他的函数去操作复杂的数据,该函数叫做 “Flatten”,它用于拆分键值对集合从 KVGEN 中返回的结果当中。

  1. 0: jdbc:drill:zk=local> SELECT FLATTEN(KVGEN(checkin_info)) checkins FROM dfs.`/users/nrentachintala/Downloads/yelp/yelp_academic_dataset_checkin.json` LIMIT 20;
  2. +--------------------------+
  3. | checkins |
  4. +--------------------------+
  5. | {"key":"3-4","value":1} |
  6. | {"key":"13-5","value":1} |
  7. | {"key":"6-6","value":1} |
  8. | {"key":"14-5","value":1} |
  9. | {"key":"14-6","value":1} |
  10. | {"key":"14-2","value":1} |
  11. | {"key":"14-3","value":1} |
  12. | {"key":"19-0","value":1} |
  13. | {"key":"11-5","value":1} |
  14. | {"key":"13-2","value":1} |
  15. | {"key":"11-6","value":2} |
  16. | {"key":"11-3","value":1} |
  17. | {"key":"12-6","value":1} |
  18. | {"key":"6-5","value":1} |
  19. | {"key":"5-5","value":1} |
  20. | {"key":"9-2","value":1} |
  21. | {"key":"9-5","value":1} |
  22. | {"key":"9-6","value":1} |
  23. | {"key":"5-2","value":1} |
  24. | {"key":"7-6","value":1} |
  25. +--------------------------+

你可以快速的获取数据通过应用 KVGEN 和 FLATTEN 函数于数据集上,不需要花时间去定义 Schema 和数据存储的中间格式。

在输出的扁平数据中,你使用标准的 SQL 功能,例如 filters,aggregates 和 sort。让我们先看一些例子。

获取 Yelp 数据集中的总评论中,从 checkin-ins 记录中

  1. 0: jdbc:drill:zk=local> SELECT SUM(checkintbl.checkins.`value`) AS TotalCheckins FROM (
  2. . . . . . . . . . . . > SELECT FLATTEN(KVGEN(checkin_info)) checkins FROM dfs.`/users/nrentachintala/Downloads/yelp/yelp_academic_dataset_checkin.json` ) checkintbl
  3. . . . . . . . . . . . > ;
  4. +---------------+
  5. | TotalCheckins |
  6. +---------------+
  7. | 4713811 |
  8. +---------------+

获取星期天晚上的数据

  1. 0: jdbc:drill:zk=local> SELECT SUM(checkintbl.checkins.`value`) AS SundayMidnightCheckins FROM (
  2. . . . . . . . . . . . > SELECT FLATTEN(KVGEN(checkin_info)) checkins FROM dfs.`/users/nrentachintala/Downloads/yelp/yelp_academic_dataset_checkin.json` ) checkintbl WHERE checkintbl.checkins.key='23-0';
  3. +------------------------+
  4. | SundayMidnightCheckins |
  5. +------------------------+
  6. | 8575 |
  7. +------------------------+

获取一周中每天的数据

  1. 0: jdbc:drill:zk=local> SELECT `right`(checkintbl.checkins.key,1) WeekDay,sum(checkintbl.checkins.`value`) TotalCheckins from (
  2. . . . . . . . . . . . > select flatten(kvgen(checkin_info)) checkins FROM dfs.`/users/nrentachintala/Downloads/yelp/yelp_academic_dataset_checkin.json` ) checkintbl GROUP BY `right`(checkintbl.checkins.key,1) ORDER BY TotalCheckins;
  3. +------------+---------------+
  4. | WeekDay | TotalCheckins |
  5. +------------+---------------+
  6. | 1 | 545626 |
  7. | 0 | 555038 |
  8. | 2 | 555747 |
  9. | 3 | 596296 |
  10. | 6 | 735830 |
  11. | 4 | 788073 |
  12. | 5 | 937201 |
  13. +------------+---------------+

获取一天中每小时的数据

  1. 0: jdbc:drill:zk=local> SELECT SUBSTR(checkintbl.checkins.key,1,strpos(checkintbl.checkins.key,'-')-1) AS HourOfTheDay ,SUM(checkintbl.checkins.`value`) TotalCheckins FROM (
  2. . . . . . . . . . . . > SELECT FLATTEN(KVGEN(checkin_info)) checkins FROM dfs.`/users/nrentachintala/Downloads/yelp/yelp_academic_dataset_checkin.json` ) checkintbl GROUP BY SUBSTR(checkintbl.checkins.key,1,strpos(checkintbl.checkins.key,'-')-1) ORDER BY TotalCheckins;
  3. +--------------+---------------+
  4. | HourOfTheDay | TotalCheckins |
  5. +--------------+---------------+
  6. | 3 | 20357 |
  7. | 4 | 21076 |
  8. | 2 | 28116 |
  9. | 5 | 33842 |
  10. | 1 | 45467 |
  11. | 6 | 54174 |
  12. | 0 | 74127 |
  13. | 7 | 96329 |
  14. | 23 | 102009 |
  15. | 8 | 130091 |
  16. | 22 | 140338 |
  17. | 9 | 162913 |
  18. | 21 | 211949 |
  19. | 10 | 220687 |
  20. | 15 | 261384 |
  21. | 14 | 276188 |
  22. | 16 | 292547 |
  23. | 20 | 293783 |
  24. | 13 | 328373 |
  25. | 11 | 338675 |
  26. | 17 | 374186 |
  27. | 19 | 385381 |
  28. | 12 | 399797 |
  29. | 18 | 422022 |
  30. +--------------+---------------+

总结

本教程中,你浏览结构化和半结构化的数据时,没有任何的 Schema 管理和 ETL。