hive总结(二)

hive 常用函数及其查询

  • hive常用的自带函数及查询

    1. show functions;

      查询hive自带的函数

    2. —查看某一个函数的具体用法;

      desc function extended 函数名称;
      栗子:desc function extended max;

    3. 聚合函数

      max(),min(),sum(),count(),avg()…

      eg:基于emp,dept表计算每个部分的工资总数,并且显示部门的名称

  • 语法:

    • 使用group by

      • 如果字段出现在select中,那么这些字段必须出现在group by里面,除非你用聚合函数修饰

        1
        2
        3
        4
        5
        6
        select deptno,sum(sal) from emp e inner join dept d on e.deptno=d.deptno group by e.deptno;
        select e.deptno,d.dname,sum(sal) from emp e inner join dept d on e.deptno=d.deptno group by e.deptno;
        FAILED: SemanticException [Error 10025]: Line 1:16 Expression not in GROUP BY key 'dname'
        select e.deptno,d.dname,sum(sal) from emp e inner join dept d on e.deptno=d.deptno group by e.deptno,d.dname;
    • having条件 –使用where就会报错

      1
      2
      3
      4
      select e.deptno,dname,sum(sal) from emp e inner join dept d on e.deptno=d.deptno group by e.deptno,d.dname where e.deptno=10;
      FAILED: ParseException line 1:99 missing EOF at 'where' near 'deptno'
      select e.deptno,dname,sum(sal) from emp e inner join dept d on e.deptno=d.deptno group by e.deptno,d.dname having e.deptno=10;
    • rand() –取随机数,范围0-1

      1
      2
      3
      4
      5
      6
      7
      语法: rand()
      rand(int exer)
      select rand(); 0.6549622205135005
      select rand(); 0.5992520662072582
      select rand(1); 0.7308781907032909
      select rand(2); 0.7311469360199058
      返回值是:double类型
    • concat() –字段拼接

      1
      2
      3
      4
      5
      6
      7
      8
      9
      10
      11
      12
      13
      14
      15
      16
      select concat(empno,ename) noname from emp;
      noname
      7369SMITH
      7499ALLEN
      7521WARD
      7566JONES
      7654MARTIN
      7698BLAKE
      7782CLARK
      7788SCOTT
      7839KING
      7844TURNER
      7876ADAMS
      7900JAMES
      7902FORD
      7934MILLER
- substr()  --字段抽取

     
1
2
3
select substr(hiredate,1,4) from emp;
select substr(hiredate,6,2) from emp;
第一位数表示:从第几位开始截取,第二位数字代表:步长
- day() --获取日期 > select day(hiredate) from emp; - month() --获取月份 - hour() --获取小时 - 时间戳:从1970年1月1号- > unix_timestamp 北京时间转化成时间戳 `select unix_timestamp("2017-03-30 08:00:00") 1490832000` > from_unixtime 时间戳转化为北京时间 `select from_unixtime(1490832000) 2017-03-30 08:00:00` - 应用场景
1
2
3
4
5
6
浏览网页的停留时间
开始浏览 2017/10/16 08:00:00
离开时间 2017/10/16 08:10:00
unix_timestamp(2017/10/16 08:10:00) - unix_timestamp(2017/10/16 08:00:00)
得到停留的时间戳
1490832000/1000/3600
- cast() - `select cast(1490854394123/1000 as int) ==>1490854394` - case when ...then ...else ...end
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
eg:根据emp表计算每个员工一个月的薪资 sal+ com
select empno,sal+comm from emp;
7369 NULL
7499 1900.0
7521 1750.0
7566 NULL
7654 2650.0
7698 NULL
7782 NULL
7788 NULL
7839 NULL
7844 1500.0
7876 NULL
7900 NULL
7902 NULL
7934 NULL
select empno,case when comm is null then sal+0 else sal+comm end from emp;
empno _c1
7369 800.0
7499 1900.0
7521 1750.0
7566 2975.0
7654 2650.0
7698 2850.0
7782 2450.0
7788 3000.0
7839 5000.0
7844 1500.0
7876 1100.0
7900 950.0
7902 3000.0
7934 1300.0
eg:根据emp表,对于sal小于500为down,小于800且大于500定义为middle,大于800的定义为up
select empno case when sal<500 then 'down' when sal>500 and sal<800 then 'middle' else 'up' end from emp;
- 【join】
- join:两个表进行连接,把两张表的一行数据整合成一行数据
- 等值(inner)join :(inner) join ..on ...(两张表中存在相同的字段)
- 栗子:`select e.empno,d.dname from emp e (inner) join dept d on e.depyno=d.deptno;`
- 左连接left join:已左边的表为主表
- 栗子:
```sql
select e.empno,e.ename,d.deptno,d.dname from emp e left join dept d on e.deptno=d.deptno;
e.empno e.ename d.deptno d.dname
7369 SMITH 20 RESEARCH
7499 ALLEN 30 SALES
7521 WARD 30 SALES
7566 JONES 20 RESEARCH
7654 MARTIN 30 SALES
7698 BLAKE 30 SALES
7782 CLARK 10 ACCOUNTING
7788 SCOTT 20 RESEARCH
7839 KING 10 ACCOUNTING
7844 TURNER 30 SALES
7876 ADAMS 20 RESEARCH
7900 JAMES 30 SALES
7902 FORD 20 RESEARCH
7934 MILLER 10 ACCOUNTING
- 右连接right join:以右边的表为主表 - 栗子:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
select e.empno,e.ename,d.deptno,d.dname from emp e right join dept d on e.deptno=d.deptno;
e.empno e.ename d.deptno d.dname
7782 CLARK 10 ACCOUNTING
7839 KING 10 ACCOUNTING
7934 MILLER 10 ACCOUNTING
7369 SMITH 20 RESEARCH
7566 JONES 20 RESEARCH
7788 SCOTT 20 RESEARCH
7876 ADAMS 20 RESEARCH
7902 FORD 20 RESEARCH
7499 ALLEN 30 SALES
7521 WARD 30 SALES
7654 MARTIN 30 SALES
7698 BLAKE 30 SALES
7844 TURNER 30 SALES
7900 JAMES 30 SALES
NULL NULL 40 OPERATIONS
- 全连接full join
1
select e.empno,e.ename,d.deptno,d.dname from emp e full join dept d on e.deptno=d.deptno;
  • map join –小表join 大表 a,b

    • select /* + mapjoin(b) */ a.key,a.value from a join b on a.key=b.key
  • Hive中的order by,sort by ,distribute by ,cluster by 的讲解
    升序(asc) 降序(desc)

    1
    2
    3
    4
    5
    6
    In order to change the average load for a reducer (in bytes):
    set hive.exec.reducers.bytes.per.reducer=<number>
    In order to limit the maximum number of reducers:
    set hive.exec.reducers.max=<number>
    In order to set a constant number of reducers:
    set mapreduce.job.reduces=<number>
    1. order by:对全局数据的排序,只存在一个reduce

      1
      select * from emp order by empno desc; 不写desc (默认就是升序)
    2. sort by 对每一个reduce内部进行排序,全局结果来说并没有

      1
      2
      set mapreduce.job.reduces =3;
      insert overwrite local directory '/opt/hivedata/sortby' select * from emp sort by empno;
    3. distribute by 分布式排序,类似于mapreduce中的分区功能,对数据进行分区,结合sort by

      1
      2
      3
      4
      5
      insert overwrite local directory '/opt/hivedata/distributeby' select * from emp distribute by deptno sort by empno;
      ```
      4. cluster by 特殊情况 当distribute by 和sort by 修饰同一个字段的时候可以直接使用
      ```sql
      insert overwrite local directory '/opt/hivedata/clusterby' select * from emp cluster by empno;
  • hive的窗口和分析函数

    • 对于分组之后的数据进行处理
    1. 查询部门编号10的所有员工,按照薪资进行降序排列

      1
      select * from emp where deptno=10 order by sal desc;
    2. 将每个部门薪资最高的那个人显示在最后一列 max() over()

      1
      2
      3
      4
      5
      6
      7
      8
      9
      10
      11
      12
      13
      14
      15
      16
      17
      select empno,ename,deptno,sal,max(sal) over(partition by deptno order by sal desc) as max_sal from emp;
      empno ename deptno sal max_sal
      7698 BLAKE 30 2850.0 2850.0
      7499 ALLEN 30 1600.0 2850.0
      7844 TURNER 30 1500.0 2850.0
      7521 WARD 30 1250.0 2850.0
      7654 MARTIN 30 1250.0 2850.0
      7900 JAMES 30 950.0 2850.0
      7839 KING 10 5000.0 5000.0
      7782 CLARK 10 2450.0 5000.0
      7934 MILLER 10 1300.0 5000.0
      7788 SCOTT 20 3000.0 3000.0
      7902 FORD 20 3000.0 3000.0
      7566 JONES 20 2975.0 3000.0
      7876 ADAMS 20 1100.0 3000.0
      7369 SMITH 20 800.0 3000.0
    3. 将每个部门最后一列显示一个唯一的ID编号

      1
      2
      3
      4
      5
      6
      7
      8
      9
      10
      11
      12
      13
      14
      15
      16
      17
      18
      select empno,ename,deptno,sal,row_number() over(partition by deptno order by sal desc) as rn from emp;
      empno ename deptno sal rn
      7698 BLAKE 30 2850.0 1
      7499 ALLEN 30 1600.0 2
      7844 TURNER 30 1500.0 3
      7521 WARD 30 1250.0 4
      7654 MARTIN 30 1250.0 5
      7900 JAMES 30 950.0 6
      7839 KING 10 5000.0 1
      7782 CLARK 10 2450.0 2
      7934 MILLER 10 1300.0 3
      7788 SCOTT 20 3000.0 1
      7902 FORD 20 3000.0 2
      7566 JONES 20 2975.0 3
      7876 ADAMS 20 1100.0 4
      7369 SMITH 20 800.0 5
    4. 获取每个部门薪资最高的前两位

      1
      2
      3
      4
      5
      6
      7
      8
      9
      10
      select empno,ename,deptno,sal from(select empno,ename,deptno,sal,row_number() over(partition by deptno order by sal desc) as rn from emp) tmp where rn < 3
      empno ename deptno sal
      7698 BLAKE 30 2850.0
      7499 ALLEN 30 1600.0
      7839 KING 10 5000.0
      7782 CLARK 10 2450.0
      7788 SCOTT 20 3000.0
      7902 FORD 20 3000.0
  • 自定义函数

    • UDF函数(User-Defined-Function) 一进一出;
      ==》转换大小写

    • UDAF函数(User-Defined Aggregation Function) 多进一出;
      max(),min(),avg()…..

    • UDTF(User-Defined Table-Generation Function) 一进多出;
      解析IP地址

      1
      2
      3
      4
      5
      6
      7
      8
      9
      UDF函数的编写规范:
      1、自定义函数编写的时候必须要继承一个类,UDF
      2、必须实现一个或者多个evalucate方法
      3、该方法必须有返回值
      4、可以返回null
      5、建议使用hadoop的数据类型,Text
      编写自定义函数的准备工作
      安装maven,导入hive的jar包,替换仓库。
坚持原创技术分享,您的支持将鼓励我继续创作!