Accumulo 2.x Documentation  >>  Development  >>  Iterators

Iterators

Accumulo SortedKeyValueIterators, commonly referred to as Iterators for short, are server-side programming constructs that allow users to implement custom retrieval or computational purpose within Accumulo TabletServers. The name rightly brings forward similarities to the Java Iterator interface; however, Accumulo Iterators are more complex than Java Iterators. Notably, in addition to the expected methods to retrieve the current element and advance to the next element in the iteration, Accumulo Iterators must also support the ability to “move” (seek) to an specified point in the iteration (the Accumulo table). Accumulo Iterators are designed to be concatenated together, similar to applying a series of transformations to a list of elements. Accumulo Iterators can duplicate their underlying source to create multiple “pointers” over the same underlying data (which is extremely powerful since each stream is sorted) or they can merge multiple Iterators into a single view. In this sense, a collection of Iterators operating in tandem is close to a tree-structure than a list, but there is always a sense of a flow of Key-Value pairs through some Iterators. Iterators are not designed to act as triggers nor are they designed to operate outside of the purview of a single table.

Understanding how TabletServers invoke the methods on a SortedKeyValueIterator can be obtuse as the actual code is buried within the implementation of the TabletServer; however, it is generally unnecessary to have a strong understanding of this as the interface provides clear definitions about what each action each method should take. This chapter aims to provide a more detailed description of how Iterators are invoked, some best practices and some common pitfalls.

Instantiation

To invoke an Accumulo Iterator inside of the TabletServer, the Iterator class must be on the classpath of every TabletServer. For production environments, it is common to place a JAR file which contains the Iterator in lib/. In development environments, it is convenient to instead place the JAR file in lib/ext/ as JAR files in this directory are dynamically reloaded by the TabletServers alleviating the need to restart Accumulo while testing an Iterator. Advanced classloader features which enable other types of filesystems and per-table classpath configurations (as opposed to process-wide classpaths). These features are not covered here, but elsewhere in the user manual.

Accumulo references the Iterator class by name and uses Java reflection to instantiate the Iterator. This means that Iterators must have a public no-args constructor.

Interface

A normal implementation of the SortedKeyValueIterator defines functionality for the following methods:

void init(SortedKeyValueIterator<Key,Value> source, Map<String,String> options, IteratorEnvironment env) throws IOException;

boolean hasTop();

void next() throws IOException;

void seek(Range range, Collection<ByteSequence> columnFamilies, boolean inclusive) throws IOException;

Key getTopKey();

Value getTopValue();

SortedKeyValueIterator<Key,Value> deepCopy(IteratorEnvironment env);

init

The init method is called by the TabletServer after it constructs an instance of the Iterator. This method should clear/reset any internal state in the Iterator and prepare it to process data. The first argument, the source, is the Iterator “below” this Iterator (where the client is at “top” and the Iterator for files in HDFS are at the “bottom”). The “source” Iterator provides the Key-Value pairs which this Iterator will operate upon.

The second argument, a Map of options, is made up of options provided by the user, options set in the table’s configuration, and/or options set in the containing namespace’s configuration. These options allow for Iterators to dynamically configure themselves on the fly. If no options are used in the current context (a Scan or Compaction), the Map will be empty. An example of a configuration item for an Iterator could be a pattern used to filter Key-Value pairs in a regular expression Iterator.

The third argument, the IteratorEnvironment, is a special object which provides information to this Iterator about the context in which it was invoked. Commonly, this information is not necessary to inspect. For example, if an Iterator knows that it is running in the context of a full-major compaction (reading all of the data) as opposed to a user scan (which may strongly limit the number of columns), the Iterator might make different algorithmic decisions in an attempt to optimize itself.

seek

The seek method is likely the most confusing method on the Iterator interface. The purpose of this method is to advance the stream of Key-Value pairs to a certain point in the iteration (the Accumulo table). It is common that before the implementation of this method returns some additional processing is performed which may further advance the current position past the startKey of the Range. This, however, is dependent on the functionality the iterator provides. For example, a filtering iterator would consume a number Key-Value pairs which do not meets its criteria before seek returns. The important condition for seek to meet is that this Iterator should be ready to return the first Key-Value pair, or none if no such pair is available, when the method returns. The Key-Value pair would be returned by getTopKey and getTopValue, respectively, and hasTop should return a boolean denoting whether or not there is a Key-Value pair to return.

The arguments passed to seek are as follows:

The TabletServer first provides a Range, an object which defines some collection of Accumulo Keys, which defines the Key-Value pairs that this Iterator should return. Each Range has a startKey and endKey with an inclusive flag for both. While this Range is often similar to the Range(s) set by the client on a Scanner or BatchScanner, it is not guaranteed to be a Range that the client set. Accumulo will split up larger ranges and group them together based on Tablet boundaries per TabletServer. Iterators should not attempt to implement any custom logic based on the Range(s) provided to seek and Iterators should not return any Keys that fall outside of the provided Range.

The second argument, a Collection<ByteSequence>, is the set of column families which should be retained or excluded by this Iterator. The third argument, a boolean, defines whether the collection of column families should be treated as an inclusion collection (true) or an exclusion collection (false).

It is likely that all implementations of seek will first make a call to the seek method on the “source” Iterator that was provided in the init method. The collection of column families and the boolean include argument should be passed down as well as the Range. Somewhat commonly, the Iterator will also implement some sort of additional logic to find or compute the first Key-Value pair in the provided Range. For example, a regular expression Iterator would consume all records which do not match the given pattern before returning from seek.

It is important to retain the original Range passed to this method to know when this Iterator should stop reading more Key-Value pairs. Ignoring this typically does not affect scans from a Scanner, but it will result in duplicate keys emitting from a BatchScanner if the scanned table has more than one tablet. Best practice is to never emit entries outside the seek range.

next

The next method is analogous to the next method on a Java Iterator: this method should advance the Iterator to the next Key-Value pair. For implementations that perform some filtering or complex logic, this may result in more than one Key-Value pair being inspected. This method alters some internal state that is exposed via the hasTop, getTopKey, and getTopValue methods.

The result of this method is commonly caching a Key-Value pair which getTopKey and getTopValue can later return. While there is another Key-Value pair to return, hasTop should return true. If there are no more Key-Value pairs to return from this Iterator since the last call to seek, hasTop should return false.

hasTop

The hasTop method is similar to the hasNext method on a Java Iterator in that it informs the caller if there is a Key-Value pair to be returned. If there is no pair to return, this method should return false. Like a Java Iterator, multiple calls to hasTop (without calling next) should not alter the internal state of the Iterator.

getTopKey and getTopValue

These methods simply return the current Key-Value pair for this iterator. If hasTop returns true, both of these methods should return non-null objects. If hasTop returns false, it is undefined what these methods should return. Like hasTop, multiple calls to these methods should not alter the state of the Iterator.

Users should take caution when either

  1. caching the Key/Value from getTopKey/getTopValue, for use after calling next on the source iterator. In this case, the cached Key/Value object is aliased to the reference returned by the source iterator. Iterators may reuse the same Key/Value object in a next call for performance reasons, changing the data that the cached Key/Value object references and resulting in a logic bug.
  2. modifying the Key/Value from getTopKey/getTopValue. If the source iterator reuses data stored in the Key/Value, then the source iterator may use the modified data that the Key/Value references. This may/may not result in a logic bug.

In both cases, copying the Key/Value’s data into a new object ensures iterator correctness. If neither case applies, it is safe to not copy the Key/Value. The general guideline is to be aware of who else may use Key/Value objects returned from getTopKey/getTopValue.

deepCopy

The deepCopy method is similar to the clone method from the Java Cloneable interface. Implementations of this method should return a new object of the same type as the Accumulo Iterator instance it was called on. Any internal state from the instance deepCopy was called on should be carried over to the returned copy. The returned copy should be ready to have seek called on it. The SortedKeyValueIterator interface guarantees that init will be called on an iterator before deepCopy and that init will not be called on the iterator returned by deepCopy.

Typically, implementations of deepCopy call a copy-constructor which will initialize internal data structures. As with seek, it is common for the IteratorEnvironment argument to be ignored as most Iterator implementations can be written without the explicit information the environment provides.

In the analogy of a series of Iterators representing a tree, deepCopy can be thought of as early programming assignments which implement their own tree data structures. deepCopy calls copy on its sources (the children), copies itself, attaches the copies of the children, and then returns itself.

TabletServer invocation of Iterators

The following code is a general outline for how TabletServers invoke Iterators.

List<KeyValue> batch;
Range range = getRangeFromClient();
while (!overSizeLimit(batch)) {
    SortedKeyValueIterator source = getSystemIterator();

    for (String clzName : getUserIterators()) {
        Class<?> clz = Class.forName(clzName);
        SortedKeyValueIterator iter = (SortedKeyValueIterator) clz.newInstance();
        iter.init(source, opts, env);
        source = iter;
    }

    // read a batch of data to return to client
    // the last iterator, the "top"
    SortedKeyValueIterator topIter = source;
    topIter.seek(getRangeFromUser(), ...)

    while (topIter.hasTop() && !overSizeLimit(batch)) {
        key = topIter.getTopKey()
        val = topIter.getTopValue()
        batch.add(new KeyValue(key, val)
        if (systemDataSourcesChanged()) {
            // code does not show isolation case, which will
            // keep using same data sources until a row boundary is hit
            range = new Range(key, false, range.endKey(), range.endKeyInclusive());
            break;
        }
    }
}
//return batch of key values to client

Additionally, the obtuse “re-seek” case can be outlined as the following:

// Given the above
List<KeyValue> batch = getNextBatch();

// Store off lastKeyReturned for this client
lastKeyReturned = batch.get(batch.size() - 1).getKey();

// thread goes away (client stops asking for the next batch).

// Eventually client comes back
// Setup as before...
Range userRange = getRangeFromUser();
Range actualRange = new Range(lastKeyReturned, false, userRange.getEndKey(), userRange.isEndKeyInclusive());

// Use the actualRange, not the user provided one
topIter.seek(actualRange);

Isolation

Accumulo provides a feature which clients can enable to prevent the viewing of partially applied mutations within the context of rows. If a client is submitting multiple column updates to rows at a time, isolation would ensure that a client would either see all of updates made to that row or none of the updates (until they are all applied).

When using Isolation, there are additional concerns in iterator design. A scan time iterator in accumulo reads from a set of data sources. While an iterator is reading data it has an isolated view. However, after it returns a key/value it is possible that accumulo may switch data sources and re-seek the iterator. This is done so that resources may be reclaimed. When the user does not request isolation this can occur after any key is returned. When a user enables Isolation, this will only occur after a new row is returned, in which case it will re-seek to the very beginning of the next possible row.

Abstract Iterators

A number of Abstract implementations of Iterators are provided to allow for faster creation of common patterns. The most commonly used abstract implementations are the Filter and Combiner classes. When possible these classes should be used instead as they have been thoroughly tested inside Accumulo itself.

Filter

The Filter abstract Iterator provides a very simple implementation which allows implementations to define whether or not a Key-Value pair should be returned via an accept(Key, Value) method.

Filters are extremely simple to implement; however, when the implementation is filtering a large percentage of Key-Value pairs with respect to the total number of pairs examined, it can be very inefficient. For example, if a Filter implementation can determine after examining part of the row that no other pairs in this row will be accepted, there is no mechanism to efficiently skip the remaining Key-Value pairs. Concretely, take a row which is comprised of 1000 Key-Value pairs. After examining the first 10 Key-Value pairs, it is determined that no other Key-Value pairs in this row will be accepted. The Filter must still examine each remaining 990 Key-Value pairs in this row. Another way to express this deficiency is that Filters have no means to leverage the seek method to efficiently skip large portions of Key-Value pairs.

As such, the Filter class functions well for filtering small amounts of data, but is inefficient for filtering large amounts of data. The decision to use a Filter strongly depends on the use case and distribution of data being filtered.

Combiner

The Combiner class is another common abstract Iterator. Similar to the Combiner interface define in Hadoop’s MapReduce framework, implementations of this abstract class reduce multiple Values for different versions of a Key (Keys which only differ by timestamps) into one Key-Value pair. Combiners provide a simple way to implement common operations like summation and aggregation without the need to implement the entire Accumulo Iterator interface.

One important consideration when choosing to design a Combiner is that the “reduction” operation is often best represented when it is associative and commutative. Operations which do not meet these criteria can be implemented; however, the implementation can be difficult.

A second consideration is that a Combiner is not guaranteed to see every Key-Value pair which differ only by timestamp every time it is invoked. For example, if there are 5 Key-Value pairs in a table which only differ by the timestamps 1, 2, 3, 4, and 5, it is not guaranteed that every invocation of the Combiner will see 5 timestamps. One invocation might see the Values for Keys with timestamp 1 and 4, while another invocation might see the Values for Keys with the timestamps 1, 2, 4 and 5.

Finally, when configuring an Accumulo table to use a Combiner, be sure to disable the Versioning Iterator or set the Combiner at a priority less than the Combiner (the Versioning Iterator is added at a priority of 20 by default). The Versioning Iterator will filter out multiple Key-Value pairs that differ only by timestamp and return only the Key-Value pair that has the largest timestamp.

Combiner Applications

Many applications can benefit from the ability to aggregate values across common keys. This can be done via Combiner iterators and is similar to the Reduce step in MapReduce. This provides the ability to define online, incrementally updated analytics without the overhead or latency associated with batch-oriented MapReduce jobs.

All that is needed to aggregate values of a table is to identify the fields over which values will be grouped, insert mutations with those fields as the key, and configure the table with a combining iterator that supports the summarizing operation desired.

The only restriction on an combining iterator is that the combiner developer should not assume that all values for a given key have been seen, since new mutations can be inserted at anytime. This precludes using the total number of values in the aggregation such as when calculating an average, for example.

An interesting use of combining iterators within an Accumulo table is to store feature vectors for use in machine learning algorithms. For example, many algorithms such as k-means clustering, support vector machines, anomaly detection, etc. use the concept of a feature vector and the calculation of distance metrics to learn a particular model. The columns in an Accumulo table can be used to efficiently store sparse features and their weights to be incrementally updated via the use of an combining iterator.

Best practices

Because of the flexibility that the SortedKeyValueIterator interface provides, it doesn’t directly disallow many implementations which are poor design decisions. The following are some common recommendations to follow and pitfalls to avoid in Iterator implementations.

Avoid special logic encoded in Ranges

Commonly, granular Ranges that a client passes to an Iterator from a Scanner or BatchScanner are unmodified. If a Range falls within the boundaries of a Tablet, an Iterator will often see that same Range in the seek method. However, there is no guarantee that the Range will remain unaltered from client to server. As such, Iterators should never make assumptions about the current state/context based on the Range.

The common failure condition is referred to as a “re-seek”. In the context of a Scan, TabletServers construct the “stack” of Iterators and batch up Key-Value pairs to send back to the client. When a sufficient number of Key-Value pairs are collected, it is common for the Iterators to be “torn down” until the client asks for the next batch of Key-Value pairs. This is done by the TabletServer to add fairness in ensuring one Scan does not monopolize the available resources. When the client asks for the next batch, the implementation modifies the original Range so that servers know the point to resume the iteration (to avoid returning duplicate Key-Value pairs). Specifically, the new Range is created from the original but is shortened by setting the startKey of the original Range to the Key last returned by the Scan, non-inclusive.

seeking backwards

The ability for an Iterator to “skip over” large blocks of Key-Value pairs is a major tenet behind Iterators. By seek‘ing when it is known that there is a collection of Key-Value pairs which can be ignored can greatly increase the speed of a scan as many Key-Value pairs do not have to be deserialized and processed.

While the seek method provides the Range that should be used to seek the underlying source Iterator, there is no guarantee that the implementing Iterator uses that Range to perform the seek on its “source” Iterator. As such, it is possible to seek to any Range and the interface has no assertions to prevent this from happening.

Since Iterators are allowed to seek to arbitrary Keys, it also allows Iterators to create infinite loops inside Scans that will repeatedly read the same data without end. If an arbitrary Range is constructed, it should construct a completely new Range as it allows for bugs to be introduced which will break Accumulo.

Thus, seek’s should always be thought of as making “forward progress” in the view of the total iteration. The startKey of a Range should always be greater than the current Key seen by the Iterator while the endKey of the Range should always retain the original endKey (and endKey inclusivity) of the last Range seen by your Iterator’s implementation of seek.

Take caution in constructing new data in an Iterator

Implementations of Iterator might be tempted to open BatchWriters inside of an Iterator as a means to implement triggers for writing additional data outside of their client application. The lifecycle of an Iterator is not managed in such a way that guarantees that this is safe nor efficient. Specifically, there is no way to guarantee that the internal ThreadPool inside of the BatchWriter is closed (and the thread(s) are reaped) without calling the close() method. close‘ing and recreating a BatchWriter after every Key-Value pair is also prohibitively performance limiting to be considered an option.

The only safe way to generate additional data in an Iterator is to alter the current Key-Value pair. For example, the WholeRowIterator serializes the all of the Key-Values pairs that fall within each row. A safe way to generate more data in an Iterator would be to construct an Iterator that is “higher” (at a larger priority) than the WholeRowIterator, that is, the Iterator receives the Key-Value pairs which are a serialization of many Key-Value pairs. The custom Iterator could deserialize the pairs, compute some function, and add a new Key-Value pair to the original collection, re-serializing the collection of Key-Value pairs back into a single Key-Value pair.

Any other situation is likely not guaranteed to ensure that the caller (a Scan or a Compaction) will always see all intended data that is generated.

Final things to remember

Some simple recommendations/points to keep in mind:

Method call order

On an instance of an Iterator: init is always called before seek, seek is always called before hasTop, getTopKey and getTopValue will not be called if hasTop returns false.

Teardown

As mentioned, instance of Iterators may be torn down inside of the server transparently. When a complex collection of iterators is performing some advanced functionality, they will not be torn down until a Key-Value pair is returned out of the “stack” of Iterators (and added into the batch of Key-Values to be returned to the caller). Being torn-down is equivalent to a new instance of the Iterator being creating and deepCopy being called on the new instance with the old instance provided as the argument to deepCopy. References to the old instance are removed and the object is lazily garbage collected by the JVM.

Compaction-time Iterators

When Iterators are configured to run during compactions, at the minc or majc scope, these Iterators sometimes need to make different assertions than those who only operate at scan time. Iterators won’t see the delete entries; however, Iterators will not necessarily see all of the Key-Value pairs in ever invocation. Because compactions often do not rewrite all files (only a subset of them), it is possible that the logic take this into consideration.

For example, a Combiner that runs over data at during compactions, might not see all of the values for a given Key. The Combiner must recognize this and not perform any function that would be incorrect due to the missing values.

Testing

The Iterator test harness is generalized testing framework for Accumulo Iterators that can identify common pitfalls in user-created Iterators.

Find documentation for all releases in the archive