Apache Accumulo MapReduce Example

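This example uses MapReduce and Accumulo to compute word counts for a set of documents. It does so with a map-only MapReduce job and an Accumulo table configured with a combiner: the job writes an entry for each word it reads, and the combiner on the table sums those entries into a total per word.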
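
The map side of this design can be pictured with a small sketch that uses no Hadoop or Accumulo classes. Everything here is an assumption for illustration: the class WordCountMapSketch, the Update type standing in for an Accumulo mutation, splitting on whitespace, and the value "1" per word are not taken from the example's source. Splitting on whitespace is at least consistent with the scan output later in this document, where "time." and "tserver," keep their punctuation.

```java
import java.util.ArrayList;
import java.util.List;

public class WordCountMapSketch {
    // Hypothetical stand-in for an Accumulo mutation: one row, one column family, one value.
    public static final class Update {
        public final String row;
        public final String family;
        public final String value;

        public Update(String row, String family, String value) {
            this.row = row;
            this.family = family;
            this.value = value;
        }
    }

    // Emits one update per word occurrence. No counting happens here;
    // adding up the "1" values is left to the combiner on the table.
    public static List<Update> map(String line) {
        List<Update> out = new ArrayList<>();
        for (String word : line.split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(new Update(word, "count", "1"));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        for (Update u : map("the quick the")) {
            System.out.println(u.row + " " + u.family + " " + u.value);
        }
    }
}
```

Because the mapper never aggregates, no reduce phase is needed. Repeated words simply produce repeated entries for the same row and column family.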

To run this example you will need a directory in HDFS containing text files. The Accumulo README is used here to show how to run it.

$ hadoop fs -copyFromLocal $ACCUMULO_HOME/README /user/username/wc/Accumulo.README
$ hadoop fs -ls /user/username/wc
Found 1 items
-rw-r--r--   2 username supergroup       9359 2009-07-15 17:54 /user/username/wc/Accumulo.README

The first step is to create a table with a combiner on the column family count.

$ ./bin/accumulo shell -u username -p password
Shell - Apache Accumulo Interactive Shell
- version: 1.5.0
- instance name: instance
- instance id: 00000000-0000-0000-0000-000000000000
- 
- type 'help' for a list of available commands
- 
username@instance> createtable wordCount
username@instance wordCount> setiter -class org.apache.accumulo.core.iterators.user.SummingCombiner -p 10 -t wordCount -majc -minc -scan
SummingCombiner interprets Values as Longs and adds them together.  A variety of encodings (variable length, fixed length, or string) are available
----------> set SummingCombiner parameter all, set to true to apply Combiner to every column, otherwise leave blank. if true, columns option will be ignored.: false
----------> set SummingCombiner parameter columns, <col fam>[:<col qual>]{,<col fam>[:<col qual>]} escape non-alphanum chars using %<hex>.: count
----------> set SummingCombiner parameter lossy, if true, failed decodes are ignored. Otherwise combiner will error on failed decodes (default false): <TRUE|FALSE>: false 
----------> set SummingCombiner parameter type, <VARLEN|FIXEDLEN|STRING|fullClassName>: STRING
username@instance wordCount> quit
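
To make the effect of the settings above concrete, here is a minimal sketch of what a summing combiner with the STRING type does to the values stored under one key: it reads each value as a decimal string, adds them, and writes the total back as a string. This is not Accumulo's SummingCombiner code; the class StringSumSketch and its combine method are hypothetical, and overflow and decode-failure handling (the lossy option) are left out.

```java
import java.util.Arrays;
import java.util.List;

public class StringSumSketch {
    // Parses every value as a decimal long, sums them, and re-encodes the total as a string.
    // Simplified: no overflow protection and no handling of values that fail to parse.
    public static String combine(List<String> values) {
        long sum = 0;
        for (String v : values) {
            sum += Long.parseLong(v);
        }
        return Long.toString(sum);
    }

    public static void main(String[] args) {
        // Three separate writes of "1" for the same word collapse into one total.
        String total = combine(Arrays.asList("1", "1", "1"));

        // Combining an earlier partial total with a later write gives the same result.
        String partial = combine(Arrays.asList("1", "1"));
        String later = combine(Arrays.asList(partial, "1"));

        System.out.println(total + " " + later);
    }
}
```

Since addition gives the same total whether values are combined all at once or in stages, the same iterator can be attached at the minc, majc, and scan scopes, as the setiter command does.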

After creating the table, run the word count MapReduce job.

$ bin/tool.sh lib/accumulo-examples-simple.jar org.apache.accumulo.examples.simple.mapreduce.WordCount -i instance -z zookeepers  --input /user/username/wc wordCount -u username -p password

11/02/07 18:20:11 INFO input.FileInputFormat: Total input paths to process : 1
11/02/07 18:20:12 INFO mapred.JobClient: Running job: job_201102071740_0003
11/02/07 18:20:13 INFO mapred.JobClient:  map 0% reduce 0%
11/02/07 18:20:20 INFO mapred.JobClient:  map 100% reduce 0%
11/02/07 18:20:22 INFO mapred.JobClient: Job complete: job_201102071740_0003
11/02/07 18:20:22 INFO mapred.JobClient: Counters: 6
11/02/07 18:20:22 INFO mapred.JobClient:   Job Counters 
11/02/07 18:20:22 INFO mapred.JobClient:     Launched map tasks=1
11/02/07 18:20:22 INFO mapred.JobClient:     Data-local map tasks=1
11/02/07 18:20:22 INFO mapred.JobClient:   FileSystemCounters
11/02/07 18:20:22 INFO mapred.JobClient:     HDFS_BYTES_READ=10487
11/02/07 18:20:22 INFO mapred.JobClient:   Map-Reduce Framework
11/02/07 18:20:22 INFO mapred.JobClient:     Map input records=255
11/02/07 18:20:22 INFO mapred.JobClient:     Spilled Records=0
11/02/07 18:20:22 INFO mapred.JobClient:     Map output records=1452

After the MapReduce job completes, query the Accumulo table to see the word counts.

$ ./bin/accumulo shell -u username -p password
username@instance> table wordCount
username@instance wordCount> scan -b the
the count:20080906 []    75
their count:20080906 []    2
them count:20080906 []    1
then count:20080906 []    1
there count:20080906 []    1
these count:20080906 []    3
this count:20080906 []    6
through count:20080906 []    1
time count:20080906 []    3
time. count:20080906 []    1
to count:20080906 []    27
total count:20080906 []    1
tserver, count:20080906 []    1
tserver.compaction.major.concurrent.max count:20080906 []    1
...
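
The scan above starts at row "the" and continues in sorted order because the table keeps its rows sorted. A rough model of that behaviour, assuming a java.util.TreeMap as a stand-in for the sorted table, is shown below. The class ScanBeginSketch, the method scanFrom, and the row "tablet" are made up for illustration; the other counts are copied from the output above. For plain ASCII rows, Java's String ordering matches byte-wise lexicographic order.

```java
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

public class ScanBeginSketch {
    // Returns the view of all rows at or after beginRow, in sorted order,
    // which is roughly what "scan -b <row>" displays.
    public static SortedMap<String, Long> scanFrom(TreeMap<String, Long> table, String beginRow) {
        return table.tailMap(beginRow, true);
    }

    public static void main(String[] args) {
        TreeMap<String, Long> table = new TreeMap<>();
        table.put("tablet", 4L); // made-up row that sorts before "the"
        table.put("the", 75L);
        table.put("their", 2L);
        table.put("to", 27L);

        for (Map.Entry<String, Long> e : scanFrom(table, "the").entrySet()) {
            System.out.println(e.getKey() + " count " + e.getValue());
        }
    }
}
```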

Another example to look at is org.apache.accumulo.examples.simple.mapreduce.UniqueColumns. It computes the unique set of columns in a table and shows how a MapReduce job can read a table's files directly from HDFS.