Yelp 数据集分析
Apache Drill 是开源项目发展速度最快的项目之一,社区每月会发布一个版本。关键的不同点是 Drill 具有其敏捷性和灵活性。在会议上提出 SQL On Hadoop,用于完成低延时的性能指标,Drill 允许用户在没有 ETL 或事先没有定义 Schema 的情况下,完成数据的分析。数据可以为任意格式,例如 text,JSON 或是 Parquet。数据可以是简单类型,例如:string,integer,date 或是更加复杂结构数据,例如嵌套的 Map 集合和数组。数据可以在任何文件系统中存在,本地或是分布式中,例如 HDFS 或是 S3。Drill,有一个“无 Schema”的途径,它能够让你只需要几分钟,就能从你的数据中获得想要的值。
让我们快速的通过需要安装 Drill 和需要运行 Yelp 数据集的步骤。从 Yelp 中下载公开的示例数据集并将其格式化为 JSON。
安装和启动 Drill
下载 Apache Drill 到你本地机器
在本地进行 Drill 实验,按照 快速指导的步骤进行实验。
另外,如果你想扩展你的环境,你可以将 Drill 部署在分布式环境下。
下面,让我们尝试用一些 SQL 示例,去理解 Drill 如何简易的分析数据。
注释:
你需要替换你本地路径的 Yelp 数据集在角括号中,针对你运行的每个查询。
在 Drill 中查询数据
1.可视化 Yelp 的业务数据
0: jdbc:drill:zk=local> !set maxwidth 10000
0: jdbc:drill:zk=local> select * from
dfs.`<path-to-yelp-dataset>/yelp/yelp_academic_dataset_business.json`
limit 1;
+------------------------+----------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------+--------------------------------+---------+--------------+-------------------+-------------+-------+-------+-----------+--------------------------------------------------------------------------------------------------------------------------------------------------------------+----------+---------------+
| business_id | full_address | hours | open | categories | city | review_count | name | longitude | state | stars | latitude | attributes | type | neighborhoods |
+------------------------+----------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------+--------------------------------+---------+--------------+-------------------+-------------+-------+-------+-----------+--------------------------------------------------------------------------------------------------------------------------------------------------------------+----------+---------------+
| vcNAWiLM4dR7D2nwwJ7nCA | 4840 E Indian School Rd Ste 101, Phoenix, AZ 85018 | fill in{"Tuesday":{"close":"17:00","open":"08:00"},"Friday":{"close":"17:00","open":"08:00"},"Monday":{"close":"17:00","open":"08:00"},"Wednesday":{"close":"17:00","open":"08:00"},"Thursday":{"close":"17:00","open":"08:00"},"Sunday":{},"Saturday":{}} | true | ["Doctors","Health & Medical"] | Phoenix | 7 | Eric Goldberg, MD | -111.983758 | AZ | 3.5 | 33.499313 | {"By Appointment Only":true,"Good For":{},"Ambience":{},"Parking":{},"Music":{},"Hair Types Specialized In":{},"Payment Types":{},"Dietary Restrictions":{}} | business | [] |
+------------------------+----------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------+--------------------------------+---------+--------------+-------------------+-------------+-------+-------+-----------+--------------------------------------------------------------------------------------------------------------------------------------------------------------+----------+---------------+
注释:
本文件以示例为目的对 Drill 结果进行输出。这种输出是不对称的(即:不能完全匹配)。
你可以直接查询自定义的文件,例如 JSON,Parquet 和 text 文件。它们不需要创建和定义元数据在 Hive 的元数据中。
2.进一步探索业务数据集
在数据集中的浏览总数:
0: jdbc:drill:zk=local> select sum(review_count) as totalreviews
from dfs.`/<path-to-yelp-dataset>/yelp/yelp_academic_dataset_business.json`;
+--------------+
| totalreviews |
+--------------+
| 1236445 |
+--------------+
国家和城市总数的评论:
0: jdbc:drill:zk=local> select state, city, count(*) totalreviews
from dfs.`/<path-to-yelp-dataset>/yelp/yelp_academic_dataset_business.json`
group by state, city order by count(*) desc limit 10;
+------------+------------+--------------+
| state | city | totalreviews |
+------------+------------+--------------+
| NV | Las Vegas | 12021 |
| AZ | Phoenix | 7499 |
| AZ | Scottsdale | 3605 |
| EDH | Edinburgh | 2804 |
| AZ | Mesa | 2041 |
| AZ | Tempe | 2025 |
| NV | Henderson | 1914 |
| AZ | Chandler | 1637 |
| WI | Madison | 1630 |
| AZ | Glendale | 1196 |
+------------+------------+--------------+
每一个企业星级评定的平均次数
0: jdbc:drill:zk=local> select stars,trunc(avg(review_count)) reviewsavg
from dfs.`/<path-to-yelp-dataset>/yelp/yelp_academic_dataset_business.json`
group by stars order by stars desc;
+------------+------------+
| stars | reviewsavg |
+------------+------------+
| 5.0 | 8.0 |
| 4.5 | 28.0 |
| 4.0 | 48.0 |
| 3.5 | 35.0 |
| 3.0 | 26.0 |
| 2.5 | 16.0 |
| 2.0 | 11.0 |
| 1.5 | 9.0 |
| 1.0 | 4.0 |
+------------+------------+
具有高评论数的顶级企业(1000)
0: jdbc:drill:zk=local> select name, state, city, `review_count` from
dfs.`/<path-to-yelp-dataset>/yelp/yelp_academic_dataset_business.json`
where review_count > 1000 order by `review_count` desc limit 10;
+-------------------------------+-------------+------------+---------------+
| name | state | city | review_count |
+-------------------------------+-------------+------------+---------------+
| Mon Ami Gabi | NV | Las Vegas | 4084 |
| Earl of Sandwich | NV | Las Vegas | 3655 |
| Wicked Spoon | NV | Las Vegas | 3408 |
| The Buffet | NV | Las Vegas | 2791 |
| Serendipity 3 | NV | Las Vegas | 2682 |
| Bouchon | NV | Las Vegas | 2419 |
| The Buffet at Bellagio | NV | Las Vegas | 2404 |
| Bacchanal Buffet | NV | Las Vegas | 2369 |
| The Cosmopolitan of Las Vegas | NV | Las Vegas | 2253 |
| Aria Hotel & Casino | NV | Las Vegas | 2224 |
+-------------------------------+-------------+----------------------------+
星期六开放和关闭时间为数不多的企业
0: jdbc:drill:zk=local> select b.name, b.hours.Saturday.`open`,
b.hours.Saturday.`close`
from
dfs.`/<path-to-yelp-dataset>/yelp/yelp_academic_dataset_business.json`
b limit 10;
+----------------------------+------------+------------+
| name | EXPR$1 | EXPR$2 |
+----------------------------+------------+------------+
| Eric Goldberg, MD | 08:00 | 17:00 |
| Pine Cone Restaurant | null | null |
| Deforest Family Restaurant | 06:00 | 22:00 |
| Culver's | 10:30 | 22:00 |
| Chang Jiang Chinese Kitchen| 11:00 | 22:00 |
| Charter Communications | null | null |
| Air Quality Systems | null | null |
| McFarland Public Library | 09:00 | 20:00 |
| Green Lantern Restaurant | 06:00 | 02:00 |
| Spartan Animal Hospital | 07:30 | 18:00 |
+----------------------------+------------+------------+
需要注意的是,Drill 是如何进行多层嵌套的。
3.从业务数据集中获取设备
注意,在 Yelp 的业务数据集中的列属性每一行都是不同的,代表企业可以拥有单独的设备。Drill 可以快速的,容易的去访问这些数据集并改变它们的 Schema。
首先,改变 Drill 的工作在所有的 text 模式中(这样我们可以看一看所有的数据)。
0: jdbc:drill:zk=local> alter system set `store.json.all_text_mode` = true;
+------------+-----------------------------------+
| ok | summary |
+------------+-----------------------------------+
| true | store.json.all_text_mode updated. |
+------------+-----------------------------------+
然后,我们去查询数据的属性。
0: jdbc:drill:zk=local> select attributes from dfs.`/<path-to-yelp-dataset>/yelp/yelp_academic_dataset_business.json` limit 10;
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| attributes |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| {"By Appointment Only":"true","Good For":{},"Ambience":{},"Parking":{},"Music":{},"Hair Types Specialized In":{},"Payment Types":{},"Dietary Restrictions":{}} |
| {"Take-out":"true","Good For":{"dessert":"false","latenight":"false","lunch":"true","dinner":"false","breakfast":"false","brunch":"false"},"Caters":"false","Noise Level":"averag |
| {"Take-out":"true","Good For":{"dessert":"false","latenight":"false","lunch":"false","dinner":"false","breakfast":"false","brunch":"true"},"Caters":"false","Noise Level":"quiet" |
| {"Take-out":"true","Good For":{},"Takes Reservations":"false","Delivery":"false","Ambience":{},"Parking":{"garage":"false","street":"false","validated":"false","lot":"true","val |
| {"Take-out":"true","Good For":{},"Ambience":{},"Parking":{},"Has TV":"false","Outdoor Seating":"false","Attire":"casual","Music":{},"Hair Types Specialized In":{},"Payment Types |
| {"Good For":{},"Ambience":{},"Parking":{},"Music":{},"Hair Types Specialized In":{},"Payment Types":{},"Dietary Restrictions":{}} |
| {"Good For":{},"Ambience":{},"Parking":{},"Music":{},"Hair Types Specialized In":{},"Payment Types":{},"Dietary Restrictions":{}} |
| {"Good For":{},"Ambience":{},"Parking":{},"Wi-Fi":"free","Music":{},"Hair Types Specialized In":{},"Payment Types":{},"Dietary Restrictions":{}} |
| {"Take-out":"true","Good For":{"dessert":"false","latenight":"false","lunch":"false","dinner":"true","breakfast":"false","brunch":"false"},"Noise Level":"average" |
| {"Good For":{},"Ambience":{},"Parking":{},"Music":{},"Hair Types Specialized In":{},"Payment Types":{},"Dietary Restrictions":{}} |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
注释:
本文件以示例为目的对 Drill 结果进行输出。这种输出是不对称的(即:不能完全匹配)。
关闭所有的文本模式,这样我们就可以继续执行数据的算术运算。
0: jdbc:drill:zk=local> alter system set `store.json.all_text_mode` = false;
+-------+------------------------------------+
| ok | summary |
+-------+------------------------------------+
| true | store.json.all_text_mode updated. |
+-------+------------------------------------+
4.在业务数据集中探索餐厅
数据集中的餐厅数
0: jdbc:drill:zk=local> select count(*) as TotalRestaurants from dfs.`/<path-to-yelp-dataset>/yelp/yelp_academic_dataset_business.json` where true=repeated_contains(categories,'Restaurants');
+------------------+
| TotalRestaurants |
+------------------+
| 14303 |
+------------------+
顶级餐厅的评论数量
0: jdbc:drill:zk=local> select name,state,city,`review_count` from dfs.`/<path-to-yelp-dataset>/yelp/yelp_academic_dataset_business.json` where true=repeated_contains(categories,'Restaurants') order by `review_count` desc limit 10;
+------------------------+-------+-----------+--------------+
| name | state | city | review_count |
+------------------------+-------+-----------+--------------+
| Mon Ami Gabi | NV | Las Vegas | 4084 |
| Earl of Sandwich | NV | Las Vegas | 3655 |
| Wicked Spoon | NV | Las Vegas | 3408 |
| The Buffet | NV | Las Vegas | 2791 |
| Serendipity 3 | NV | Las Vegas | 2682 |
| Bouchon | NV | Las Vegas | 2419 |
| The Buffet at Bellagio | NV | Las Vegas | 2404 |
| Bacchanal Buffet | NV | Las Vegas | 2369 |
| Hash House A Go Go | NV | Las Vegas | 2201 |
| Mesa Grill | NV | Las Vegas | 2004 |
+------------------------+-------+-----------+--------------+
顶级餐厅的种类清淡列表
0: jdbc:drill:zk=local> select name,repeated_count(categories) as categorycount, categories from dfs.`/<path-to-yelp-dataset>/yelp/yelp_academic_dataset_business.json` where true=repeated_contains(categories,'Restaurants') order by repeated_count(categories) desc limit 10;
+---------------------------------+---------------+---------------------------------------------------------------------------------------------------------------------------------------------------+
| name | categorycount | categories |
+---------------------------------+---------------+---------------------------------------------------------------------------------------------------------------------------------------------------+
| Binion's Hotel & Casino | 10 | ["Arts &,Entertainment","Restaurants","Bars","Casinos","Event,Planning &,Services","Lounges","Nightlife","Hotels &,Travel","American] |
| Stage Deli | 10 | ["Arts &,Entertainment","Food","Hotels","Desserts","Delis","Casinos","Sandwiches","Hotels,& Travel","Restaurants","Event Planning &,Services"] |
| Jillian's | 9 | ["Arts &,Entertainment","American (Traditional)","Music,Venues","Bars","Dance,Clubs","Nightlife","Bowling","Active,Life","Restaurants"] |
| Hotel Chocolat | 9 | ["Coffee &,Tea","Food","Cafes","Chocolatiers &,Shops","Specialty Food","Event Planning &,Services","Hotels & Travel","Hotels","Restaurants"] |
| Hotel du Vin & Bistro Edinburgh | 9 | ["Modern,European","Bars","French","Wine,Bars","Event Planning &,Services","Nightlife","Hotels &,Travel","Hotels","Restaurants"] |
| Elixir | 9 | ["Arts &,Entertainment","American (Traditional)","Music,Venues","Bars","Cocktail,Bars","Nightlife","American (New)","Local,Flavor","Restaurants"] |
| Tocasierra Spa and Fitness | 8 | ["Beauty &,Spas","Gyms","Medical Spas","Health &,Medical","Fitness & Instruction","Active,Life","Day Spas","Restaurants"] |
| Costa Del Sol At Sunset Station | 8 | ["Steakhouses","Mexican","Seafood","Event,Planning & Services","Hotels &,Travel","Italian","Restaurants","Hotels"] |
| Scottsdale Silverado Golf Club | 8 | ["Fashion","Shopping","Sporting,Goods","Active Life","Golf","American,(New)","Sports Wear","Restaurants"] |
| House of Blues | 8 | ["Arts & Entertainment","Music Venues","Restaurants","Hotels","Event Planning & Services","Hotels & Travel","American (New)","Nightlife"] |
+---------------------------------+---------------+---------------------------------------------------------------------------------------------------------------------------------------------------+
注释:
本文件以示例为目的对 Drill 结果进行输出。这种输出是不对称的(即:不能完全匹配)。
评论数第一品种统计
0: jdbc:drill:zk=local> select categories[0], count(categories[0]) as categorycount
from dfs.`/<path-to-yelp-dataset>/yelp_academic_dataset_business.json`
group by categories[0]
order by count(categories[0]) desc limit 10;
+----------------------+---------------+
| EXPR$0 | categorycount |
+----------------------+---------------+
| Food | 4294 |
| Shopping | 1885 |
| Active Life | 1676 |
| Bars | 1366 |
| Local Services | 1351 |
| Mexican | 1284 |
| Hotels & Travel | 1283 |
| Fast Food | 963 |
| Arts & Entertainment | 906 |
| Hair Salons | 901 |
+----------------------+---------------+
5.探索 Yelp 的评论于数据集并结合业务数据
接下来,让我们来一起看看 Yelp 评论数据集的内容。
0: jdbc:drill:zk=local> select *
from dfs.`/<path-to-yelp-dataset>/yelp/yelp_academic_dataset_review.json` limit 1;
+---------------------------------+------------------------+------------------------+-------+------------+----------------------------------------------------------------------+--------+------------------------+
| votes | user_id | review_id | stars | date | text | type | business_id |
+---------------------------------+------------------------+------------------------+-------+------------+----------------------------------------------------------------------+--------+------------------------+
| {"funny":0,"useful":2,"cool":1} | Xqd0DzHaiyRqVH3WRG7hzg | 15SdjuK7DmYqUAj6rjGowg | 5 | 2007-05-17 | dr. goldberg offers everything i look for in a general practitioner. | review | vcNAWiLM4dR7D2nwwJ7nCA |
+---------------------------------+------------------------+------------------------+-------+------------+----------------------------------------------------------------------+--------+------------------------+
评论最酷的顶级企业
请注意,我们结合Yelp业务数据集,对 Yelp 的评论数据进行全面的评论统计,它涵盖每个评论的细节信息。
0: jdbc:drill:zk=local> Select b.name
from dfs.`/<path-to-yelp-dataset>/yelp/yelp_academic_dataset_business.json` b
where b.business_id in (SELECT r.business_id
FROM dfs.`/<path-to-yelp-dataset>/yelp/yelp_academic_dataset_review.json` r
GROUP BY r.business_id having sum(r.votes.cool) > 2000
order by sum(r.votes.cool) desc);
+-------------------------------+
| name |
+-------------------------------+
| Earl of Sandwich |
| XS Nightclub |
| The Cosmopolitan of Las Vegas |
| Wicked Spoon |
+-------------------------------+
结合业务数据和评论数据集去创建一个视图
注意 Drill 的视图是一个轻量级的,并且可以在本地文件系统中创建。在独立模式中有一个 dfs.tmp 工作空间。我们可以用它来创建视图(或者你也可以定义在本地或分布式文件系统的工作空空间)。如果你想用物理的方式替代逻辑方式用于持久化,你可以用创建表(CREATE TABLE AS
)的语法
0: jdbc:drill:zk=local> create or replace view dfs.tmp.businessreviews as
Select b.name,b.stars,b.state,b.city,r.votes.funny,r.votes.useful,r.votes.cool, r.`date`
from dfs.`/<path-to-yelp-dataset>/yelp/yelp_academic_dataset_business.json` b, dfs.`/<path-to-yelp-dataset>/yelp/yelp_academic_dataset_review.json` r
where r.business_id=b.business_id
+------------+-----------------------------------------------------------------+
| ok | summary |
+------------+-----------------------------------------------------------------+
| true | View 'businessreviews' created successfully in 'dfs.tmp' schema |
+------------+-----------------------------------------------------------------+
让我们从视图中获得记录的总数。
0: jdbc:drill:zk=local> select count(*) as Total from dfs.tmp.businessreviews;
+------------+
| Total |
+------------+
| 1125458 |
+------------+
除了这些问题,你可以使用 Drill 的 SQL 功能得到许多深刻的见解。如果你不乐意去手动编写查询,你可以使用一个 BI 分析工具,例如 Tableau/MicroStrategy 使用 ODBC/JDBC 驱动 去查询原始文件,Hive,HBase 数据或 Drill 创建的视图。
Apache Drill 提供了一个自由和灵活的探索数据的方式,是之前我们从未见过的 SQL 技术。社区正在使用嵌套数据并且能够在即将发布的版本中支持改变 Schema。
FLATTEN 函数能够用于动态的,半结构化数据,以便我们能够使用更加深入的 SQL 功能。下面是一个示例查询:
为每个业务得到一个扁平的类别列表:
0: jdbc:drill:zk=local> select name, flatten(categories) as category
from dfs.`/<path-to-yelp-dataset>/yelp/yelp_academic_dataset_business.json` limit 20;
+-----------------------------+---------------------------------+
| name | category |
+-----------------------------+---------------------------------+
| Eric Goldberg, MD | Doctors |
| Eric Goldberg, MD | Health & Medical |
| Pine Cone Restaurant | Restaurants |
| Deforest Family Restaurant | American (Traditional) |
| Deforest Family Restaurant | Restaurants |
| Culver's | Food |
| Culver's | Ice Cream & Frozen Yogurt |
| Culver's | Fast Food |
| Culver's | Restaurants |
| Chang Jiang Chinese Kitchen | Chinese |
| Chang Jiang Chinese Kitchen | Restaurants |
| Charter Communications | Television Stations |
| Charter Communications | Mass Media |
| Air Quality Systems | Home Services |
| Air Quality Systems | Heating & Air Conditioning/HVAC |
| McFarland Public Library | Libraries |
| McFarland Public Library | Public Services & Government |
| Green Lantern Restaurant | American (Traditional) |
| Green Lantern Restaurant | Restaurants |
| Spartan Animal Hospital | Veterinarians |
+-----------------------------+---------------------------------+
顶级的品种评论
0: jdbc:drill:zk=local> select celltbl.catl, count(celltbl.catl) categorycnt
from (select flatten(categories) catl from dfs.`/yelp_academic_dataset_business.json` ) celltbl
group by celltbl.catl
order by count(celltbl.catl) desc limit 10 ;
+------------------+-------------+
| catl | categorycnt |
+------------------+-------------+
| Restaurants | 14303 |
| Shopping | 6428 |
| Food | 5209 |
| Beauty & Spas | 3421 |
| Nightlife | 2870 |
| Bars | 2378 |
| Health & Medical | 2351 |
| Automotive | 2241 |
| Home Services | 1957 |
| Fashion | 1897 |
+------------------+-------------+
敬请期待社区发布更多的版本和活动。