Apache Accumulo 1.7.0

18 May 2015

Apache Accumulo 1.7.0 is a significant release that includes many important milestone features which expand the functionality of Accumulo. These include features related to security, availability, and extensibility. Nearly 700 JIRA issues were resolved in this version. Approximately two-thirds were bugs and one-third were improvements.

In the context of Accumulo’s Semantic Versioning guidelines, this is a “minor version”. This means that new APIs have been created, some deprecations may have been added, but no deprecated APIs have been removed. Code written against 1.6.x should work against 1.7.0, likely binary-compatible but definitely source-compatible. As always, the Accumulo developers take API compatibility very seriously and have invested much time to ensure that we meet the promises set forth to our users.

Major Changes

Updated Minimum Requirements

Apache Accumulo 1.7.0 comes with an updated set of minimum requirements.

  • Java 7 is required; Java 6 support has been dropped.
  • Hadoop 2.2.0 or greater is required; Hadoop 1.x support has been dropped.
  • ZooKeeper 3.4.x or greater is required.

Client Authentication with Kerberos

Kerberos is the de-facto means to provide strong authentication across Hadoop and other related components. Kerberos requires a centralized key distribution center to authenticate users who have credentials provided by an administrator. When Hadoop is configured for use with Kerberos, all users must provide Kerberos credentials to interact with the filesystem, launch YARN jobs, or even view certain web pages.

While Accumulo has long supported operating on Kerberos-enabled HDFS, it still required Accumulo users to use password-based authentication to authenticate with Accumulo. ACCUMULO-2815 added support for allowing Accumulo clients to use the same Kerberos credentials to authenticate to Accumulo that they would use to authenticate to other Hadoop components, instead of a separate user name and password just for Accumulo.

This authentication leverages Simple Authentication and Security Layer (SASL) and GSSAPI to support Kerberos authentication over the existing Apache Thrift-based RPC infrastructure that Accumulo employs.

These additions represent a significant forward step for Accumulo, bringing its client-authentication up to speed with the rest of the Hadoop ecosystem. This results in a much more cohesive authentication story for Accumulo that resonates with the battle-tested cell-level security and authorization model already familiar to Accumulo users.

More information on configuration, administration, and application of Kerberos client authentication can be found in the Kerberos chapter of the Accumulo User Manual.
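
As a minimal sketch of what Kerberos-authenticated client code might look like (the principal, keytab path, instance name, and ZooKeeper hosts below are placeholders, and the client must point at a SASL-enabled instance):

// Log in to Kerberos from a keytab; no Accumulo-specific password is involved
UserGroupInformation.loginUserFromKeytab("user@EXAMPLE.COM", "/path/to/user.keytab");
// Build a KerberosToken from the current Kerberos login
KerberosToken token = new KerberosToken();
// Enable SASL on the client side and connect as the Kerberos principal
ClientConfiguration clientConf = ClientConfiguration.loadDefault()
    .withInstance("accumulo").withZkHosts("zk1:2181,zk2:2181").withSasl(true);
Connector conn = new ZooKeeperInstance(clientConf).getConnector(token.getPrincipal(), token);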

Data-Center Replication

In previous releases, Accumulo only operated within the constraints of a single installation. Because a single instance of Accumulo often consists of many nodes, and Accumulo's design scales (near) linearly across many nodes, it is typical that one Accumulo instance is run per physical installation or data center. ACCUMULO-378 introduces support in Accumulo to automatically copy data from one Accumulo instance to another.

This data-center replication feature is primarily applicable to users wishing to implement a disaster recovery strategy. Data can be automatically copied from a primary instance to one or more other Accumulo instances. In contrast to normal Accumulo operation, in which ingest and query are strongly consistent, data-center replication is a lazy, eventually consistent operation. This is desirable for replication, as it prevents additional latency for ingest operations on the primary instance. Additionally, the implementation of this feature can sustain prolonged outages between the primary instance and replicas without any administrative overhead.

The Accumulo User Manual contains a new chapter on replication which details the design and implementation of the feature, explains how users can configure replication, and describes special cases to consider when choosing to integrate the feature into a user application.
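
As a rough sketch only (given a Connector conn to the primary instance; the table-level property names below are taken to follow the replication chapter and should be verified there, the peer name "dr-site" and remote table id "2" are placeholders, and the peer itself must be configured separately):

// Enable replication on a table and point it at a previously configured peer
conn.tableOperations().setProperty("events", "table.replication", "true");
conn.tableOperations().setProperty("events", "table.replication.target.dr-site", "2");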

User-Initiated Compaction Strategies

Per-table compaction strategies were added in 1.6.0 to provide custom logic to decide which files are involved in a major compaction. In 1.7.0, the ability to specify a compaction strategy for a user-initiated compaction was added in ACCUMULO-1798. This allows surgical compactions on a subset of tablet files. Previously, a user-initiated compaction would compact all files in a tablet.

In the Java API, this new feature can be accessed in the following way:

// Obtain a Connector to the Accumulo instance (e.g. via ZooKeeperInstance.getConnector)
Connector conn = ...
CompactionStrategyConfig csConfig = new CompactionStrategyConfig(strategyClassName).setOptions(strategyOpts);
CompactionConfig compactionConfig = new CompactionConfig().setCompactionStrategy(csConfig);
conn.tableOperations().compact(tableName, compactionConfig);

In ACCUMULO-3134, the shell’s compact command was modified to enable selecting which files to compact based on size, name, and path. Options were also added to the compact command for setting RFile options on the compaction output, which can be useful for testing. For example, a single tablet could be compacted using snappy compression.

The following is an example shell command that compacts all files less than 10MB, provided the tablet has at least two files that meet this criterion. If a tablet had a 100MB, 50MB, 7MB, and 5MB file, then the 7MB and 5MB files would be compacted. If a tablet had only a 100MB and a 5MB file, then nothing would be done, because there are not at least two files meeting the selection criteria.

compact -t foo --min-files 2 --sf-lt-esize 10M

The following is an example shell command that compacts all bulk imported files in a table.

compact -t foo --sf-ename I.*

These options provide a convenient way to select files and compact them using a specialized compaction strategy. Options were also added to the shell to specify an arbitrary compaction strategy; specifying an arbitrary strategy is mutually exclusive with the file selection and file creation options, since those options are unique to the specialized compaction strategy provided. See compact --help in the shell for the available options.

API Clarification

The declared API in 1.6.x was incomplete. Some important classes like ColumnVisibility were not declared as Accumulo API. Significant work was done under ACCUMULO-3657 to correct the API statement and clean up the API to be representative of all classes which users are intended to interact with. The expanded and simplified API statement is in the README.

In some places in the API, non-API types were used. Ideally, public API members would only use public API types. A tool called APILyzer was created to find all API members that used non-API types. Many of the violations found by this tool were deprecated to clearly communicate that a non-API type was used. One example is a public API method that returned a class called KeyExtent. KeyExtent was never intended to be in the public API because it contains code related to Accumulo internals. KeyExtent and the API methods returning it have since been deprecated. These were replaced with a new class for identifying tablets that does not expose internals. Deprecating a type like this from the API makes the API more stable while also making it easier for contributors to change Accumulo internals without impacting the API.

The changes in ACCUMULO-3657 also included an Accumulo API regular expression for use with checkstyle. Starting with 1.7.0, projects building on Accumulo can use this checkstyle rule to ensure they are only using Accumulo’s public API. The regular expression can be found in the README.

Performance Improvements

Configurable Threadpool Size for Assignments

During start-up, the Master quickly assigns tablets to Tablet Servers. However, Tablet Servers load those assigned tablets one at a time. In 1.7, the servers will be more aggressive, and will load tablets in parallel, so long as they do not have mutations that need to be recovered.

ACCUMULO-1085 allows the size of the threadpool used in the Tablet Servers for assignment processing to be configurable.

Group-Commit Threshold as a Factor of Data Size

When ingesting data into Accumulo, the majority of time is spent in the write-ahead log. As such, this is a common place that optimizations are added. One optimization is known as “group-commit”. When multiple clients are writing data to the same Accumulo tablet, it is not efficient for each of them to synchronize the WAL, flush their updates to disk for durability, and then release the lock. The idea of group-commit is that multiple writers can queue the write for their mutations to the WAL and then wait for a sync that will satisfy the durability constraints of their batch of updates. This has a drastic improvement on performance, since many threads writing batches concurrently can “share” the same fsync.

In previous versions, Accumulo controlled the frequency with which this group-commit sync was performed as a factor of the number of clients writing to Accumulo. This was both confusing to configure correctly and encouraged sub-par performance with few write threads. ACCUMULO-1950 introduced a new configuration property, tserver.total.mutation.queue.max, which defines the amount of data that is queued before a group-commit is performed, in a way that is agnostic of the number of writers. This new property is much easier to reason about than the previous (now deprecated) tserver.mutation.queue.max. Users who have set tserver.mutation.queue.max in the past are encouraged to start using the new tserver.total.mutation.queue.max property.
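
As a small sketch, the new threshold can be set programmatically like any other system property (given a Connector conn with administrative permissions; the 100M value is purely illustrative, and if the property is not mutable in ZooKeeper on a given installation it can be set in accumulo-site.xml instead):

// Queue up to 100M of mutations across all writers before forcing a group-commit sync
conn.instanceOperations().setProperty("tserver.total.mutation.queue.max", "100M");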

Other Improvements

Balancing Groups of Tablets

By default, Accumulo evenly spreads each table’s tablets across a cluster. In some situations, it is advantageous for query or ingest to evenly spread groups of tablets within a table. For ACCUMULO-3439, a new balancer was added to evenly spread groups of tablets to optimize performance. This blog post provides more details about when and why users may want to leverage this feature.
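
As a hedged sketch (given a Connector conn, and assuming the RegexGroupBalancer added for ACCUMULO-3439 together with its table.custom option keys, all of which should be verified against the 1.7.0 javadoc; the table name and regex are placeholders), a table could be configured to balance tablets grouped by the first two characters of their end rows:

// Assign the group-aware balancer to the table and define how tablets map to groups
conn.tableOperations().setProperty("mytable", "table.balancer",
    "org.apache.accumulo.server.master.balancer.RegexGroupBalancer");
conn.tableOperations().setProperty("mytable", "table.custom.balancer.group.regex.pattern", "(..).*");
conn.tableOperations().setProperty("mytable", "table.custom.balancer.group.regex.default", "none");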

User-specified Durability

Accumulo constantly tries to balance durability with performance. Guaranteeing durability of every write to Accumulo is very difficult in a massively-concurrent environment that requires high throughput. One common area of focus is the write-ahead log, since it must eventually call fsync on the local filesystem to guarantee that data written is durable in the face of unexpected power failures. In some cases where durability can be sacrificed, either due to the nature of the data itself or redundant power supplies, ingest performance improvements can be attained.

Prior to 1.7, a user could only configure the level of durability for individual tables. With the implementation of ACCUMULO-1957, the durability can be specified by the user when creating a BatchWriter, giving users control over durability at the level of the individual writes. Every Mutation written using that BatchWriter will be written with the provided durability. This can result in substantially faster ingest rates when the durability can be relaxed.
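
For example, a minimal sketch of writing with relaxed durability (given a Connector conn; the table name and cell contents are placeholders):

// FLUSH flushes the write-ahead log without forcing a full sync; NONE would skip the log entirely
BatchWriterConfig config = new BatchWriterConfig().setDurability(Durability.FLUSH);
BatchWriter writer = conn.createBatchWriter("metrics", config);
Mutation m = new Mutation("row1");
m.put("family", "qualifier", new Value("value".getBytes()));
writer.addMutation(m);
writer.close();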

waitForBalance API

When creating a new Accumulo table, the next step is typically adding splits to that table before starting ingest. This can be extremely important, since a table without any splits will be hosted on a single tablet server and create an ingest bottleneck until the table begins to split naturally. Adding many splits before ingesting ensures that a table is distributed across many servers and results in high throughput when ingest first starts.

Adding splits to a table has long been a synchronous operation, but the assignment of those splits was asynchronous. A large number of splits could be processed, but it was not guaranteed that the resulting tablets would be evenly distributed, leading to the same problem as having an insufficient number of splits. ACCUMULO-2998 adds a new method to InstanceOperations which allows users to wait for all tablets to be balanced. This method lets users wait until tablets are appropriately distributed so that ingest can run at full-bore immediately.
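
A minimal sketch of this pattern (given a Connector conn; the table name and split points are placeholders):

// Pre-split a new table, then block until the tablets are spread across the cluster
conn.tableOperations().create("ingest_table");
SortedSet<Text> splits = new TreeSet<Text>();
for (char c = 'a'; c <= 'z'; c++)
  splits.add(new Text(String.valueOf(c)));
conn.tableOperations().addSplits("ingest_table", splits);
conn.instanceOperations().waitForBalance();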

Hadoop Metrics2 Support

Accumulo has long had its own metrics system implemented using Java MBeans. This enabled metrics to be reported by Accumulo services, but consumption by other systems often required use of an additional tool like jmxtrans to read the metrics from the MBeans and send them to some other system.

ACCUMULO-1817 replaces this custom metrics system in Accumulo with Hadoop Metrics2. Metrics2 has a number of benefits, the most notable of which is eliminating the need for an additional process to send metrics to common metrics storage and visualization tools. With Metrics2 support, Accumulo can send its metrics directly to common tools like Ganglia and Graphite.

For more information on enabling Hadoop Metrics2, see the Metrics Chapter in the Accumulo User Manual.

Distributed Tracing with HTrace

HTrace has recently started gaining traction as a standalone project, especially with its adoption in HDFS. Accumulo has long had distributed tracing support via its own “Cloudtrace” library, but this wasn’t intended for use outside of Accumulo.

ACCUMULO-898 replaces Accumulo’s Cloudtrace code with HTrace. This has the benefit of adding timings (spans) from HDFS into Accumulo spans automatically.

Users who inspect traces via the Accumulo Monitor (or another system) will begin to see timings from HDFS during operations like Major and Minor compactions when running with at least Apache Hadoop 2.6.0.

VERSIONS file present in binary distribution

In the pre-built binary distribution, or in distributions built by users from the official source release, users will now see a VERSIONS file in the lib/ directory alongside the Accumulo server-side jars. Because the created tarball strips the versions from the jar file names, it previously required extra work to determine the version of each dependent jar (typically by inspecting the jar’s manifest).

ACCUMULO-2863 adds a VERSIONS file to the lib/ directory which contains the Maven groupId, artifactId, and version (GAV) information for each jar file included in the distribution.

Per-Table Volume Chooser

The VolumeChooser interface is a server-side extension point that allows user tables to provide custom logic in choosing where its files are written when multiple HDFS instances are available. By default, a randomized volume chooser implementation is used to evenly balance files across all HDFS instances.

Previously, this VolumeChooser logic was instance-wide, meaning it affected all tables. This is potentially undesirable, as it might unintentionally impact other users in a multi-tenant system. ACCUMULO-3177 introduces a new per-table property which supports configuration of a VolumeChooser, ensuring that a custom chooser only affects the intended subset of tables when multiple HDFS instances are available.
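
As a hedged sketch (given a Connector conn, and assuming, consistent with the custom-property section below, that the per-table chooser is configured through a table.custom. property; the exact key should be confirmed in the user manual, and the class name is a placeholder for a user-provided VolumeChooser implementation):

// Point a single table at a custom VolumeChooser without affecting other tables
conn.tableOperations().setProperty("mytable", "table.custom.volume.chooser",
    "com.example.accumulo.MyVolumeChooser");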

Table and namespace custom properties

In order to avoid errors caused by mis-typed configuration properties, Accumulo has been strict about which configuration properties can be set. However, this prevented users from setting arbitrary properties that could be used by custom balancers, compaction strategies, volume choosers, and iterators. Under ACCUMULO-2841, the ability to set arbitrary table and namespace properties was added; such properties must use the table.custom. prefix. The changes made in ACCUMULO-3177 and ACCUMULO-3439 leverage this new feature.
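
For example (given a Connector conn; the table name, key, and value are placeholders, and only the table.custom. prefix is required):

// An arbitrary property that a custom iterator, balancer, or compaction strategy can read
conn.tableOperations().setProperty("mytable", "table.custom.myapp.schemaVersion", "2");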

Notable Bug Fixes

SourceSwitchingIterator Deadlock

An instance of SourceSwitchingIterator, the Accumulo iterator which transparently manages whether data for a tablet is read from memory (the in-memory map) or from disk (HDFS, after a minor compaction), was found deadlocked in a production system.

This deadlock prevented the scan and the minor compaction from ever successfully completing without restarting the tablet server. ACCUMULO-3745 fixes the inconsistent synchronization inside of the SourceSwitchingIterator to prevent this deadlock from happening in the future.

The only mitigation for this bug was to restart the deadlocked tablet server.

Table flush blocked indefinitely

While running the Accumulo RandomWalk distributed test, it was observed that all activity in Accumulo had stopped and that an Accumulo metadata table tablet was offline. The system first tried to flush a user tablet, but the metadata table was not online (likely due to the agitation process, which stops and starts Accumulo processes during the test). After this, a call to load the metadata tablet was queued but could not complete until the earlier flush call finished. Thus, a deadlock occurred.

This deadlock happened because the synchronous flush call could not complete before the load tablet call completed, while the load tablet call could not run because of the connection caching performed in Accumulo’s RPC layer to reduce the number of sockets needed to send data. ACCUMULO-3597 prevents this deadlock by forcing the use of a non-cached connection for the RPC message requesting that a metadata tablet be loaded.

While this fix does use additional network resources, the concern is minimal because the number of metadata tablets is typically very small relative to the total number of tablets in the system.

The only mitigation for this bug was to restart the hung tablet server.

Testing

Each unit and functional test only runs on a single node, while the RandomWalk and Continuous Ingest tests run on any number of nodes. Agitation refers to randomly restarting Accumulo processes and Hadoop DataNode processes, and, in HDFS High-Availability instances, forcing NameNode fail-over.

During testing, multiple Accumulo developers noticed some stability issues with HDFS using Apache Hadoop 2.6.0 when restarting Accumulo processes and HDFS DataNodes. The developers investigated these issues as part of the normal release testing procedures but were unable to find a definitive cause. Users who wish to track any future developments can watch ACCUMULO-2388. One possible workaround is to increase general.rpc.timeout in the Accumulo configuration from 120s to 240s.

OS           | Hadoop | Nodes          | ZooKeeper | HDFS HA | Tests
Gentoo       | N/A    | 1              | N/A       | No      | Unit and Integration Tests
Gentoo       | 2.6.0  | 1 (2 TServers) | 3.4.5     | No      | 24hr CI w/ agitation and verification, 24hr RW w/o agitation
CentOS 6.6   | 2.6.0  | 3              | 3.4.6     | No      | 24hr RW w/ agitation, 24hr CI w/o agitation, 72hr CI w/ and w/o agitation
Amazon Linux | 2.6.0  | 20 (m1.large)  | 3.4.6     | No      | 24hr CI w/o agitation
