- Hadoop数据仓库实战
- Edited by 肖睿, 兰伟, 廖春琼
1.3.3 A First Taste of Hive
WordCount, the classic MapReduce example, counts how many times each word appears in a text file. The official WordCount source code is shown below.
package org.apache.hadoop.examples;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class WordCount {

  public static class TokenizerMapper
       extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1); // VALUEOUT
    private Text word = new Text();                            // KEYOUT

    public void map(Object key, Text value, Context context
                    ) throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  public static class IntSumReducer
       extends Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context
                       ) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
    if (otherArgs.length < 2) {
      System.err.println("Usage: wordcount <in> [<in>...] <out>");
      System.exit(2);
    }
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    for (int i = 0; i < otherArgs.length - 1; ++i) {
      FileInputFormat.addInputPath(job, new Path(otherArgs[i]));
    }
    FileOutputFormat.setOutputPath(job,
        new Path(otherArgs[otherArgs.length - 1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
As the source code above shows, even a very simple MapReduce program runs to more than 60 lines. So how many lines of code does it take to implement WordCount with Hive?
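For reference, you would not normally type those 60-plus lines yourself: a compiled WordCount ships with Hadoop in the examples jar. Once an input directory exists in HDFS (such as the /data/wordcount directory created in step (1) below), an invocation along the following lines runs it; the jar path, version suffix, and output directory here are illustrative assumptions, not commands from the book:

# Run the bundled WordCount example (adjust the jar path/version for your installation)
$ hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar \
    wordcount /data/wordcount /data/wordcount-mr-out
# View the result
$ hdfs dfs -cat /data/wordcount-mr-out/part-r-00000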
(1) Prepare the data file mywords.txt.
$ vi mywords.txt
# Enter the following words, separated by spaces within each line and newlines between lines
Hello World
Hello Hive
Hello Hadoop
# Upload the file to HDFS
$ hdfs dfs -mkdir -p /data/wordcount
$ hdfs dfs -put mywords.txt /data/wordcount
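A quick sanity check (not part of the book's steps) confirms that the file actually landed in HDFS before moving on:

$ hdfs dfs -ls /data/wordcount
$ hdfs dfs -cat /data/wordcount/mywords.txt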
(2) Start the Hive CLI and write the HQL statements that implement WordCount.
$ hive
# Create the table
hive> create external table lines(line string);
# Load the data
hive> load data inpath '/data/wordcount' overwrite into table lines;
# Query and count
hive> select word,count(*) as wc from lines lateral view explode(split(line,' ')) t1 as word group by word;
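Before looking at the output, it is worth unpacking the last statement: split(line, ' ') turns each line into an array of words, and explode, applied through the lateral view, turns that array into one row per word, which the group by then counts. If you want to inspect the intermediate step on its own, a query along these lines (an optional illustration, not one of the book's steps) shows the exploded rows:

# Show one row per word, before any aggregation
hive> select explode(split(line,' ')) as word from lines;
# With the sample file above this should return six rows:
# Hello, World, Hello, Hive, Hello, Hadoop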
(3) The execution output is as follows.
hive> create external table lines(line string);
OK
Time taken: 4.742 seconds
hive> load data inpath '/data/wordcount' overwrite into table lines;
Loading data to table default.lines
Table default.lines stats: [numFiles=1, totalSize=36]
OK
Time taken: 0.514 seconds
hive> select word,count(*) as wc from lines lateral view explode(split(line,' ')) t1 as word group by word;
Query ID = root_20181129142424_2948e218-ddd7-4e7c-803e-bcd3d21db21f
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
14:24:46,642 Stage-1 map = 0%, reduce = 0%
14:24:55,246 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 4.41 sec
14:25:04,715 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 9.13 sec
MapReduce Total cumulative CPU time: 9 seconds 130 msec
Ended Job = job_1543310100051_0033
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1  Reduce: 1  Cumulative CPU: 9.13 sec  HDFS Read: 9637  HDFS Write: 32  SUCCESS
Total MapReduce CPU Time Spent: 9 seconds 130 msec
OK
Hadoop  1
Hello   3
Hive    1
World   1
Time taken: 47.78 seconds, Fetched: 4 row(s)
hive>
As the session above shows, WordCount takes only three HQL statements, an enormous gain in development efficiency over MapReduce. But which executes more efficiently: the MapReduce WordCount source code or the Hive implementation of WordCount?
As mentioned earlier, HQL is ultimately translated into MapReduce for execution, so in terms of the result the two are essentially the same. The HQL translation step does add some overhead, however, so the hand-written WordCount source code executes more efficiently. Viewed from the perspective of development efficiency, though, giving up this small amount of Hive performance is entirely worthwhile.
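To see this translation for yourself, Hive's EXPLAIN statement prints the execution plan a query compiles to without running it. The sketch below is illustrative; the exact plan text varies by Hive version:

# Show the compiled plan instead of running the query
hive> explain
    > select word,count(*) as wc
    > from lines lateral view explode(split(line,' ')) t1 as word
    > group by word;

In the plan, Stage-1 is a map/reduce stage: the map side runs the UDTF (explode) and a partial GROUP BY, and the reduce side performs the final count aggregation, essentially the same division of labor as TokenizerMapper and IntSumReducer in the hand-written version.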