Hive练习
题目1、
该题目的两个核心思路:
1、自连接
2、倒推法现有这么一批数据,现要求出:
每个用户截止到每月为止的最大单月访问次数和累计到该月的总访问次数三个字段的意思:
用户名,月份,访问次数A,2015-01,5
A,2015-01,15
B,2015-01,5
A,2015-01,8
B,2015-01,25
A,2015-01,5
A,2015-02,4
A,2015-02,6
B,2015-02,10
B,2015-02,5
A,2015-03,16
A,2015-03,22
B,2015-03,23
B,2015-03,10
B,2015-03,11最后结果展示:用户 月份 最大访问次数 总访问次数 当月访问次数
A 2015-01 33 33 33
A 2015-02 33 43 10
A 2015-03 38 81 38
B 2015-01 30 30 30
B 2015-02 30 45 15
B 2015-03 44 89 44
第一步
select name,mon,sum(num) from hive1
group by name,mon order by name,mon;当月访问次数:
A 2015-01 33
A 2015-02 10
A 2015-03 38
B 2015-01 30
B 2015-02 15
B 2015-03 44
第二步:
select name,mon,sum(num)
from hive1 b,
(select name,mon,sum(num) from hive1 group by name,mon order by name,mon)a
where b.mon <= a.mon and a.name=b.name
group by a.name,a.mon;总访问次数:
name mon _c2
A 2015-01 33
A 2015-02 43
A 2015-03 81
B 2015-01 30
B 2015-02 45
B 2015-03 89
第三步:
select name,mon,max(b.s)
from
(select name,mon,sum(num) s from hive1 group by name,mon order by name,mon)b,
(select name,mon,sum(num) from hive1 group by name,mon order by name,mon)a
where b.mon <= a.mon and a.name=b.name
group by a.name,a.mon;最大访问次数:
name mon _c2
A 2015-01 33
A 2015-02 33
A 2015-03 38
B 2015-01 30
B 2015-02 30
B 2015-03 44
第四步:
select aa.name,aa.mon,cc.ss,aa.tt,bb.mm
from
(select name,mon,sum(num) tt
from hive1 b,
(select name,mon,sum(num) from hive1 group by name,mon order by name,mon)a
where b.mon <= a.mon and a.name=b.name
group by a.name,a.mon)aa
join
(select name,mon,sum(num) mm from hive1 group by name,mon order by name,mon)bb
on aa.name=bb.name and aa.mon=bb.mon
join
(select name,mon,max(b.s) ss
from
(select name,mon,sum(num) s from hive1 group by name,mon order by name,mon)b,
(select name,mon,sum(num) from hive1 group by name,mon order by name,mon)a
where b.mon <= a.mon and a.name=b.name
group by a.name,a.mon)cc
on aa.name=cc.name and aa.mon=cc.mon;连接三张表:
aa.name aa.mon cc.ss aa.tt bb.mm
A 2015-01 33 33 33
A 2015-02 33 43 10
A 2015-03 38 81 38
B 2015-01 30 30 30
B 2015-02 30 45 15
B 2015-03 44 89 44
题目2、
核心思路: 行 转 列
1、// 建表语句:
CREATE TABLE `course2` (`id` int(11) NOT NULL AUTO_INCREMENT PRIMARY KEY,`sid` int(11) DEFAULT NULL,`course` varchar(255) DEFAULT NULL,`score` int(11) DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8;// 插入数据
// 字段解释:id, 学号, 课程, 成绩
INSERT INTO `course2` VALUES (1, 1, 'yuwen', 43);
INSERT INTO `course2` VALUES (2, 1, 'shuxue', 55);
INSERT INTO `course2` VALUES (3, 2, 'yuwen', 77);
INSERT INTO `course2` VALUES (4, 2, 'shuxue', 88);
INSERT INTO `course2` VALUES (5, 3, 'yuwen', 98);
INSERT INTO `course2` VALUES (6, 3, 'shuxue', 65);求:所有数学课程成绩 大于 语文课程成绩的学生的学号
select * from course2;
1 1 yuwen 43
2 1 shuxue 55
3 2 yuwen 77
4 2 shuxue 88
5 3 yuwen 98
6 3 shuxue 65
第一步:select sid,
case course when "yuwen" then score else 0 end as yuwen,
case course when "shuxue" then score else 0 end as shuxue,
case course when "yingyu" then score else 0 end as yingyu
from course2;1 43 0 0
1 0 55 0
2 77 0 0
2 0 88 0
3 98 0 0
3 0 65 0
第二步:
select sid,
sum(case course when "yuwen" then score else 0 end) as yuwen,
sum(case course when "shuxue" then score else 0 end) as shuxue,
sum(case course when "yingyu" then score else 0 end) as yingyu
from course2 group by sid;1 43 55 0
2 77 88 0
3 98 65 0
第三步:
select sid from (
select sid,
sum(case course when "yuwen" then score else 0 end) as yuwen,
sum(case course when "shuxue" then score else 0 end) as shuxue,
sum(case course when "yingyu" then score else 0 end) as yingyu
from course2 group by sid
) b where b.shuxue > b.yuwen;1
2
题目3、
思路:分组,跟字符串切分
2014010114
2014010216
2014010317
2014010410
2014010506
2012010609
2012010732
2012010812
2012010919
2012011023
2001010116
2001010212
2001010310
2001010411
2001010529
2013010619
2013010722
2013010812
2013010929
2013011023
2008010105
2008010216
2008010337
2008010414
2008010516
2007010619
2007010712
2007010812
2007010999
2007011023
2010010114
2010010216
2010010317
2010010410
2010010506
2015010649
2015010722
2015010812
2015010999
2015011023
比如:2010012325表示在2010年01月23日的气温为25度。现在要求使用hive,计算每一年出现过的最大气温的日期+温度。
要计算出每一年的最大气温。我用
select substr(data,1,4),max(substr(data,9,2)) from table2 group by substr(data,1,4);
出来的是 年份 + 温度 这两列数据例如 2015 99但是如果我是想select 的是:具体每一年最大气温的那一天 + 温度 。例如 20150109 99
请问该怎么执行hive语句。。
group by 只需要substr(data,1,4),
但是select substr(data,1,8),又不在group by 的范围内。
是我陷入了思维死角。一直想不出所以然。。求大神指点一下。
在select 如果所需要的。不在group by的条件里。这种情况如何去分析?
解题:
select substr(data,1,4) y,max(substr(data,9,2)) m from hive2 group by
substr(data,1,4);
2001 29
2007 99
2008 37
2010 17
2012 32
2013 29
2014 17
2015 99
select substr(data,1,4) y,substr(data,5,4) d,max(substr(data,9,2)) m from
hive2 group by substr(data,1,4),substr(data,5,4);
2001 0101 16
2001 0102 12
2001 0103 10
2001 0104 11
2001 0105 29
2007 0106 19
2007 0107 12
2007 0108 12
2007 0109 99
2007 0110 23
2008 0101 05
2008 0102 16
2008 0103 37
2008 0104 14
2008 0105 16
2010 0101 14
2010 0102 16
2010 0103 17
2010 0104 10
2010 0105 06
2012 0106 09
2012 0107 32
2012 0108 12
2012 0109 19
2012 0110 23
2013 0106 19
2013 0107 22
2013 0108 12
2013 0109 29
2013 0110 23
2014 0101 14
2014 0102 16
2014 0103 17
2014 0104 10
2014 0105 06
2015 0106 49
2015 0107 22
2015 0108 12
2015 0109 99
2015 0110 23
两个连接:
select concat(a.y,b.d),b.m from
(select substr(data,1,4) y,max(substr(data,9,2)) m from hive2 group by
substr(data,1,4)) a
join
(select substr(data,1,4) y,substr(data,5,4) d,max(substr(data,9,2)) m from
hive2 group by substr(data,1,4),substr(data,5,4)) b
on a.y=b.y and a.m=b.m;
20010105 29
20070109 99
20080103 37
20100103 17
20120107 32
20130109 29
20140103 17
20150109 99
题目四:
现有一份以下格式的数据:
表示有id为1,2,3的学生选修了课程a,b,c,d,e,f中其中几门:id course
1,a
1,b
1,c
1,e
2,a
2,c
2,d
2,f
3,a
3,b
3,c
3,e编写Hive的HQL语句来实现以下结果:
表中的1表示选修,表中的0表示未选修id a b c d e f
1 1 1 1 0 1 0
2 1 0 1 1 0 1
3 1 1 1 0 1 0
第一步:
select id,concat_ws(',',collect_list(course)) from hive4 group by id;1 a ,b ,c ,e
2 a ,c ,d ,f
3 a ,b ,c ,e
第二步:
select a.id,
if(a.c like '%a%',1,0)a,
if(a.c like '%b%',1,0)b,
if(a.c like '%c%',1,0)c,
if(a.c like '%d%',1,0)d,
if(a.c like '%e%',1,0)e,
if(a.c like '%f%',1,0)f
from
(select id,concat_ws(',',collect_list(course))c from hive4 group by id)a;1 1 1 1 0 1 0
2 1 0 1 1 0 1
3 1 1 1 0 1 0
题目五:
现有如下格式的一份数据:店铺,月份,金额
shop,mon,money
a,01,150
a,01,200
b,01,1000
b,01,800
c,01,250
c,01,220
b,01,6000
a,02,2000
a,02,3000
b,02,1000
b,02,1500
c,02,350
c,02,280
a,03,350
a,03,250需求:编写Hive的HQL语句求出每个店铺的当月销售额和累计到当月的总销售额
每个店铺的当月销售额:
select shop,mon,sum(money) from hive3 group by shop,mon;a 01 350
a 02 5000
a 03 600
b 01 7800
b 02 2500
c 01 470
c 02 630
累计到当月的总销售额:
select a.shop, a.mon,sum(b.money) from hive3 b,
(select shop,mon,sum(money) from hive3 group by shop,mon) a
where a.shop=b.shop and a.mon >= b.mon group by a.shop,a.mon;a 01 350
a 02 5350
a 03 5950
b 01 7800
b 02 10300
c 01 470
c 02 1100
连接在一起:
select aa.*,bb.ss from
(select shop,mon,sum(money) from hive3 group by shop,mon)aa,
(select a.shop, a.mon,sum(b.money)ss from hive3 b,
(select shop,mon,sum(money) from hive3 group by shop,mon) a
where a.shop=b.shop and a.mon >= b.mon group by a.shop,a.mon)bb
where aa.shop=bb.shop and aa.mon=bb.mon;a 01 350 350
a 02 5000 5350
a 03 600 5950
b 01 7800 7800
b 02 2500 10300
c 01 470 470
c 02 630 1100
Hive练习
题目1、
该题目的两个核心思路:
1、自连接
2、倒推法现有这么一批数据,现要求出:
每个用户截止到每月为止的最大单月访问次数和累计到该月的总访问次数三个字段的意思:
用户名,月份,访问次数A,2015-01,5
A,2015-01,15
B,2015-01,5
A,2015-01,8
B,2015-01,25
A,2015-01,5
A,2015-02,4
A,2015-02,6
B,2015-02,10
B,2015-02,5
A,2015-03,16
A,2015-03,22
B,2015-03,23
B,2015-03,10
B,2015-03,11最后结果展示:用户 月份 最大访问次数 总访问次数 当月访问次数
A 2015-01 33 33 33
A 2015-02 33 43 10
A 2015-03 38 81 38
B 2015-01 30 30 30
B 2015-02 30 45 15
B 2015-03 44 89 44
第一步
select name,mon,sum(num) from hive1
group by name,mon order by name,mon;当月访问次数:
A 2015-01 33
A 2015-02 10
A 2015-03 38
B 2015-01 30
B 2015-02 15
B 2015-03 44
第二步:
select name,mon,sum(num)
from hive1 b,
(select name,mon,sum(num) from hive1 group by name,mon order by name,mon)a
where b.mon <= a.mon and a.name=b.name
group by a.name,a.mon;总访问次数:
name mon _c2
A 2015-01 33
A 2015-02 43
A 2015-03 81
B 2015-01 30
B 2015-02 45
B 2015-03 89
第三步:
select name,mon,max(b.s)
from
(select name,mon,sum(num) s from hive1 group by name,mon order by name,mon)b,
(select name,mon,sum(num) from hive1 group by name,mon order by name,mon)a
where b.mon <= a.mon and a.name=b.name
group by a.name,a.mon;最大访问次数:
name mon _c2
A 2015-01 33
A 2015-02 33
A 2015-03 38
B 2015-01 30
B 2015-02 30
B 2015-03 44
第四步:
select aa.name,aa.mon,cc.ss,aa.tt,bb.mm
from
(select name,mon,sum(num) tt
from hive1 b,
(select name,mon,sum(num) from hive1 group by name,mon order by name,mon)a
where b.mon <= a.mon and a.name=b.name
group by a.name,a.mon)aa
join
(select name,mon,sum(num) mm from hive1 group by name,mon order by name,mon)bb
on aa.name=bb.name and aa.mon=bb.mon
join
(select name,mon,max(b.s) ss
from
(select name,mon,sum(num) s from hive1 group by name,mon order by name,mon)b,
(select name,mon,sum(num) from hive1 group by name,mon order by name,mon)a
where b.mon <= a.mon and a.name=b.name
group by a.name,a.mon)cc
on aa.name=cc.name and aa.mon=cc.mon;连接三张表:
aa.name aa.mon cc.ss aa.tt bb.mm
A 2015-01 33 33 33
A 2015-02 33 43 10
A 2015-03 38 81 38
B 2015-01 30 30 30
B 2015-02 30 45 15
B 2015-03 44 89 44
题目2、
核心思路: 行 转 列
1、// 建表语句:
CREATE TABLE `course2` (`id` int(11) NOT NULL AUTO_INCREMENT PRIMARY KEY,`sid` int(11) DEFAULT NULL,`course` varchar(255) DEFAULT NULL,`score` int(11) DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8;// 插入数据
// 字段解释:id, 学号, 课程, 成绩
INSERT INTO `course2` VALUES (1, 1, 'yuwen', 43);
INSERT INTO `course2` VALUES (2, 1, 'shuxue', 55);
INSERT INTO `course2` VALUES (3, 2, 'yuwen', 77);
INSERT INTO `course2` VALUES (4, 2, 'shuxue', 88);
INSERT INTO `course2` VALUES (5, 3, 'yuwen', 98);
INSERT INTO `course2` VALUES (6, 3, 'shuxue', 65);求:所有数学课程成绩 大于 语文课程成绩的学生的学号
select * from course2;
1 1 yuwen 43
2 1 shuxue 55
3 2 yuwen 77
4 2 shuxue 88
5 3 yuwen 98
6 3 shuxue 65
第一步:select sid,
case course when "yuwen" then score else 0 end as yuwen,
case course when "shuxue" then score else 0 end as shuxue,
case course when "yingyu" then score else 0 end as yingyu
from course2;1 43 0 0
1 0 55 0
2 77 0 0
2 0 88 0
3 98 0 0
3 0 65 0
第二步:
select sid,
sum(case course when "yuwen" then score else 0 end) as yuwen,
sum(case course when "shuxue" then score else 0 end) as shuxue,
sum(case course when "yingyu" then score else 0 end) as yingyu
from course2 group by sid;1 43 55 0
2 77 88 0
3 98 65 0
第三步:
select sid from (
select sid,
sum(case course when "yuwen" then score else 0 end) as yuwen,
sum(case course when "shuxue" then score else 0 end) as shuxue,
sum(case course when "yingyu" then score else 0 end) as yingyu
from course2 group by sid
) b where b.shuxue > b.yuwen;1
2
题目3、
思路:分组,跟字符串切分
2014010114
2014010216
2014010317
2014010410
2014010506
2012010609
2012010732
2012010812
2012010919
2012011023
2001010116
2001010212
2001010310
2001010411
2001010529
2013010619
2013010722
2013010812
2013010929
2013011023
2008010105
2008010216
2008010337
2008010414
2008010516
2007010619
2007010712
2007010812
2007010999
2007011023
2010010114
2010010216
2010010317
2010010410
2010010506
2015010649
2015010722
2015010812
2015010999
2015011023
比如:2010012325表示在2010年01月23日的气温为25度。现在要求使用hive,计算每一年出现过的最大气温的日期+温度。
要计算出每一年的最大气温。我用
select substr(data,1,4),max(substr(data,9,2)) from table2 group by substr(data,1,4);
出来的是 年份 + 温度 这两列数据例如 2015 99但是如果我是想select 的是:具体每一年最大气温的那一天 + 温度 。例如 20150109 99
请问该怎么执行hive语句。。
group by 只需要substr(data,1,4),
但是select substr(data,1,8),又不在group by 的范围内。
是我陷入了思维死角。一直想不出所以然。。求大神指点一下。
在select 如果所需要的。不在group by的条件里。这种情况如何去分析?
解题:
select substr(data,1,4) y,max(substr(data,9,2)) m from hive2 group by
substr(data,1,4);
2001 29
2007 99
2008 37
2010 17
2012 32
2013 29
2014 17
2015 99
select substr(data,1,4) y,substr(data,5,4) d,max(substr(data,9,2)) m from
hive2 group by substr(data,1,4),substr(data,5,4);
2001 0101 16
2001 0102 12
2001 0103 10
2001 0104 11
2001 0105 29
2007 0106 19
2007 0107 12
2007 0108 12
2007 0109 99
2007 0110 23
2008 0101 05
2008 0102 16
2008 0103 37
2008 0104 14
2008 0105 16
2010 0101 14
2010 0102 16
2010 0103 17
2010 0104 10
2010 0105 06
2012 0106 09
2012 0107 32
2012 0108 12
2012 0109 19
2012 0110 23
2013 0106 19
2013 0107 22
2013 0108 12
2013 0109 29
2013 0110 23
2014 0101 14
2014 0102 16
2014 0103 17
2014 0104 10
2014 0105 06
2015 0106 49
2015 0107 22
2015 0108 12
2015 0109 99
2015 0110 23
两个连接:
select concat(a.y,b.d),b.m from
(select substr(data,1,4) y,max(substr(data,9,2)) m from hive2 group by
substr(data,1,4)) a
join
(select substr(data,1,4) y,substr(data,5,4) d,max(substr(data,9,2)) m from
hive2 group by substr(data,1,4),substr(data,5,4)) b
on a.y=b.y and a.m=b.m;
20010105 29
20070109 99
20080103 37
20100103 17
20120107 32
20130109 29
20140103 17
20150109 99
题目四:
现有一份以下格式的数据:
表示有id为1,2,3的学生选修了课程a,b,c,d,e,f中其中几门:id course
1,a
1,b
1,c
1,e
2,a
2,c
2,d
2,f
3,a
3,b
3,c
3,e编写Hive的HQL语句来实现以下结果:
表中的1表示选修,表中的0表示未选修id a b c d e f
1 1 1 1 0 1 0
2 1 0 1 1 0 1
3 1 1 1 0 1 0
第一步:
select id,concat_ws(',',collect_list(course)) from hive4 group by id;1 a ,b ,c ,e
2 a ,c ,d ,f
3 a ,b ,c ,e
第二步:
select a.id,
if(a.c like '%a%',1,0)a,
if(a.c like '%b%',1,0)b,
if(a.c like '%c%',1,0)c,
if(a.c like '%d%',1,0)d,
if(a.c like '%e%',1,0)e,
if(a.c like '%f%',1,0)f
from
(select id,concat_ws(',',collect_list(course))c from hive4 group by id)a;1 1 1 1 0 1 0
2 1 0 1 1 0 1
3 1 1 1 0 1 0
题目五:
现有如下格式的一份数据:店铺,月份,金额
shop,mon,money
a,01,150
a,01,200
b,01,1000
b,01,800
c,01,250
c,01,220
b,01,6000
a,02,2000
a,02,3000
b,02,1000
b,02,1500
c,02,350
c,02,280
a,03,350
a,03,250需求:编写Hive的HQL语句求出每个店铺的当月销售额和累计到当月的总销售额
每个店铺的当月销售额:
select shop,mon,sum(money) from hive3 group by shop,mon;a 01 350
a 02 5000
a 03 600
b 01 7800
b 02 2500
c 01 470
c 02 630
累计到当月的总销售额:
select a.shop, a.mon,sum(b.money) from hive3 b,
(select shop,mon,sum(money) from hive3 group by shop,mon) a
where a.shop=b.shop and a.mon >= b.mon group by a.shop,a.mon;a 01 350
a 02 5350
a 03 5950
b 01 7800
b 02 10300
c 01 470
c 02 1100
连接在一起:
select aa.*,bb.ss from
(select shop,mon,sum(money) from hive3 group by shop,mon)aa,
(select a.shop, a.mon,sum(b.money)ss from hive3 b,
(select shop,mon,sum(money) from hive3 group by shop,mon) a
where a.shop=b.shop and a.mon >= b.mon group by a.shop,a.mon)bb
where aa.shop=bb.shop and aa.mon=bb.mon;a 01 350 350
a 02 5000 5350
a 03 600 5950
b 01 7800 7800
b 02 2500 10300
c 01 470 470
c 02 630 1100