Apache Accumulo™

Title: Apache Accumulo Shard Example

Notice: Licensed to the Apache Software Foundation (ASF) under one or more contributor license agreements. See the NOTICE file distributed with this work for additional information regarding copyright ownership. The ASF licenses this file to you under the Apache License, Version 2.0 (the “License”); you may not use this file except in compliance with the License. You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an “AS IS” BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

Accumulo has an iterator called the intersecting iterator which supports querying a term index that is partitioned by document, or “sharded”. This example shows how to use the intersecting iterator through these four programs:

  • Index.java - Indexes a set of text files into an Accumulo table.
  • Query.java - Finds documents containing a given set of terms.
  • Reverse.java - Reads the index table and writes a map of documents to terms into another table.
  • ContinuousQuery.java - Uses the table populated by Reverse.java to select N random terms per document, then continuously queries randomly chosen sets of those terms.

To run these example programs, first create two tables in the Accumulo shell, as shown below.

username@instance> createtable shard
username@instance shard> createtable doc2term

After creating the tables, index some files. The following command indexes all of the Java files in the Accumulo source code.

$ cd /local/username/workspace/accumulo/
$ find core/src server/src -name "*.java" | xargs ./bin/accumulo org.apache.accumulo.examples.simple.shard.Index -i instance -z zookeepers -t shard -u username -p password --partitions 30
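As a rough illustration of what a document-partitioned index looks like, the following Java sketch tokenizes a document, assigns it to one of N partitions, and lists the (row, term, document id) entries that would be stored for it. It uses no Accumulo classes. The class name `ShardIndexSketch`, the `\W+` split pattern, the hash-based partition rule, and the zero-padded hex row format are all assumptions made for illustration and may differ from what Index.java actually does.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class ShardIndexSketch {

  // Hypothetical row id format: the partition number as zero-padded hex.
  public static String partitionRow(String doc, int numPartitions) {
    // floorMod keeps the result in [0, numPartitions) even for negative hash codes.
    int partition = Math.floorMod(doc.hashCode(), numPartitions);
    return String.format("%08x", partition);
  }

  // Returns one {row, term, docId} triple per distinct lowercased term in the document.
  public static List<String[]> indexEntries(String docId, String doc, int numPartitions) {
    String row = partitionRow(doc, numPartitions);
    Set<String> terms = new LinkedHashSet<>();
    for (String token : doc.split("\\W+")) {
      if (!token.isEmpty()) {
        terms.add(token.toLowerCase());
      }
    }
    List<String[]> entries = new ArrayList<>();
    for (String term : terms) {
      entries.add(new String[] {row, term, docId});
    }
    return entries;
  }

  public static void main(String[] args) {
    String doc = "public class Foo { public static void main() {} }";
    for (String[] e : indexEntries("core/src/Foo.java", doc, 30)) {
      System.out.println(e[0] + " " + e[1] + " " + e[2]);
    }
  }
}
```

Because every term of a document lands in the same row, a query for several terms can be answered inside a single partition without combining data from other partitions.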

The following command queries the index to find all files containing ‘foo’ and ‘bar’.

$ ./bin/accumulo org.apache.accumulo.examples.simple.shard.Query -i instance -z zookeepers -t shard -u username -p password foo bar
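The idea behind such a multi-term query can be sketched without Accumulo: within each partition, intersect the document lists of all requested terms, then collect the matches from every partition. The sketch below is an in-memory stand-in only. `ShardQuerySketch` and its methods are hypothetical names, and the real intersecting iterator runs on the tablet servers rather than on client-side maps.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

public class ShardQuerySketch {

  // partition row -> term -> sorted document ids; an in-memory stand-in for the shard table.
  private final Map<String, Map<String, TreeSet<String>>> shards = new TreeMap<>();

  public void add(String row, String term, String docId) {
    shards.computeIfAbsent(row, r -> new TreeMap<>())
        .computeIfAbsent(term, t -> new TreeSet<>())
        .add(docId);
  }

  // Intersects the document lists of all terms inside each partition, then gathers the matches.
  public List<String> query(String... terms) {
    List<String> matches = new ArrayList<>();
    if (terms.length == 0) {
      return matches;
    }
    for (Map<String, TreeSet<String>> shard : shards.values()) {
      TreeSet<String> candidates = null;
      for (String term : terms) {
        TreeSet<String> docs = shard.getOrDefault(term, new TreeSet<>());
        if (candidates == null) {
          candidates = new TreeSet<>(docs); // copy so the index itself is not modified
        } else {
          candidates.retainAll(docs);
        }
        if (candidates.isEmpty()) {
          break; // no document in this partition can match
        }
      }
      matches.addAll(candidates);
    }
    return matches;
  }

  public static void main(String[] args) {
    ShardQuerySketch index = new ShardQuerySketch();
    index.add("00000000", "foo", "a.java");
    index.add("00000000", "bar", "a.java");
    index.add("00000000", "foo", "b.java");
    index.add("00000001", "foo", "c.java");
    index.add("00000001", "bar", "c.java");
    System.out.println(index.query("foo", "bar"));
  }
}
```

Since each document is stored in exactly one partition, the per-partition results never overlap and can simply be concatenated.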

Before running ContinuousQuery, run Reverse to populate the doc2term table.

$ ./bin/accumulo org.apache.accumulo.examples.simple.shard.Reverse -i instance -z zookeepers --shardTable shard --doc2Term doc2term -u username -p password
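Conceptually, this step inverts the index: entries keyed by term become entries keyed by document. A minimal in-memory sketch follows, assuming each index entry is a hypothetical {row, term, document id} triple. `ReverseSketch` is an illustrative name, and the actual table layouts used by Reverse.java are not shown in this document.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

public class ReverseSketch {

  // Turns {row, term, docId} index entries into a map of docId -> terms.
  public static Map<String, TreeSet<String>> reverse(List<String[]> shardEntries) {
    Map<String, TreeSet<String>> docToTerms = new TreeMap<>();
    for (String[] entry : shardEntries) {
      String term = entry[1];
      String docId = entry[2];
      docToTerms.computeIfAbsent(docId, d -> new TreeSet<>()).add(term);
    }
    return docToTerms;
  }

  public static void main(String[] args) {
    List<String[]> entries = Arrays.asList(
        new String[] {"00000000", "foo", "a.java"},
        new String[] {"00000000", "bar", "a.java"},
        new String[] {"00000001", "foo", "c.java"});
    // Prints each document followed by the terms it contains.
    reverse(entries).forEach((doc, terms) -> System.out.println(doc + " " + terms));
  }
}
```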

Below, ContinuousQuery is run with 5 terms. It selects 5 random terms from each document, then repeatedly picks one of those 5-term sets at random and queries for it. For each query it prints the terms, the number of matching documents, and the query time in seconds.

$ ./bin/accumulo org.apache.accumulo.examples.simple.shard.ContinuousQuery -i instance -z zookeepers --shardTable shard --doc2Term doc2term -u username -p password --terms 5
[public, core, class, binarycomparable, b] 2  0.081
[wordtodelete, unindexdocument, doctablename, putdelete, insert] 1  0.041
[import, columnvisibilityinterpreterfactory, illegalstateexception, cv, columnvisibility] 1  0.049
[getpackage, testversion, util, version, 55] 1  0.048
[for, static, println, public, the] 55  0.211
[sleeptime, wrappingiterator, options, long, utilwaitthread] 1  0.057
[string, public, long, 0, wait] 12  0.132
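The behavior described above can be approximated by a short, self-contained sketch: choose N random terms per document, then repeatedly pick one term set and time a lookup. Everything here is an assumption for illustration: the class and method names are hypothetical, documents with fewer than N terms are skipped, the lookup is a plain in-memory scan instead of an Accumulo query, and the loop runs a fixed number of times rather than continuously.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

public class ContinuousQuerySketch {

  // Picks numTerms random terms from every document that has at least that many terms.
  public static List<List<String>> selectTerms(
      Map<String, Set<String>> docToTerms, int numTerms, Random rand) {
    List<List<String>> termSets = new ArrayList<>();
    for (Set<String> terms : docToTerms.values()) {
      if (terms.size() < numTerms) {
        continue; // assumption: documents with too few terms are skipped
      }
      List<String> shuffled = new ArrayList<>(terms);
      Collections.shuffle(shuffled, rand);
      termSets.add(new ArrayList<>(shuffled.subList(0, numTerms)));
    }
    return termSets;
  }

  // Stand-in for an index query: counts the documents that contain all of the terms.
  public static int countMatches(Map<String, Set<String>> docToTerms, List<String> terms) {
    int count = 0;
    for (Set<String> docTerms : docToTerms.values()) {
      if (docTerms.containsAll(terms)) {
        count++;
      }
    }
    return count;
  }

  public static void main(String[] args) {
    Map<String, Set<String>> docToTerms = new TreeMap<>();
    docToTerms.put("a.java", new TreeSet<>(Arrays.asList("public", "class", "foo", "static")));
    docToTerms.put("b.java", new TreeSet<>(Arrays.asList("public", "class", "bar", "long")));
    docToTerms.put("c.java", new TreeSet<>(Arrays.asList("import", "public", "static", "void")));

    Random rand = new Random();
    List<List<String>> termSets = selectTerms(docToTerms, 3, rand);
    for (int i = 0; i < 5; i++) { // bounded here; the real program keeps going
      List<String> terms = termSets.get(rand.nextInt(termSets.size()));
      long start = System.nanoTime();
      int matches = countMatches(docToTerms, terms);
      double seconds = (System.nanoTime() - start) / 1e9;
      System.out.printf("%s %d  %.3f%n", terms, matches, seconds);
    }
  }
}
```

Because every term set is drawn from a real document, each query matches at least one document, which is consistent with the counts in the sample output above.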