Apache Accumulo 2.0.0

05 Sep 2017

Apache Accumulo 2.0.0 is a significant release. These release notes are still a work in progress. The notes are fairly complete in terms of features added to 2.0, but some features may still be missing. Also, the notes will be updated in the future to point to documentation for new features in addition to issues.

Notable Changes

New API for creating connections to Accumulo

A fluent API for clients to build Connector objects was introduced in ACCUMULO-4784. The new API deprecates ClientConfiguration and introduces its own properties file called accumulo-client.properties that ships with the Accumulo tarball. See the client documentation for more information on how to use the new API.

Hadoop 3 and Java 8.

Accumulo 2.x expects at least Java 8 and Hadoop 3. It is built against Java 8 and Hadoop 3 and the binary tarball is targeted to work with a Java 8 and Hadoop 3 system. See ACCUMULO-4826 , #531 , and ACCUMULO-4299 .

Simplified Accumulo scripts and configuration files

Accumulo’s scripts and configuration were refactored in ACCUMULO-4490 to make Accumulo easier to use. The number of scripts in the bin directory of the Accumulo release tarball has been reduced from 20 scripts to the four scripts below:

  • accumulo - mostly left alone except for improved usage
  • accumulo-service - manage Accumulo processes as services
  • accumulo-cluster - manage Accumulo on cluster. Replaces start-all.sh and stop-all.sh
  • accumulo-util - combines many utility scripts into one script.

Read this blog post for more information on this change.

New Bulk Import API

A new bulk import API was added in 2.0 that has very different implementation. This new API supports the following new functionality.

  • Bulk import to an offline table.
  • Load plans that specify where files go in a table which avoids opening the files for inspection.
  • Inspection of file on the client side. Inspection of all files is done before the FATE operation starts. This results in less namenode operations and fail-fast for bad files (no longer need a fail directory).
  • A new improved algorithm to load files into tablets. This new algorithm scans the metadata table and makes asynchronous load calls to all tablets. This queues load operations on all tablets at around the same time. The async RPC calls and beforehand inspection make the bulk load FATE operation much shorter.

The shell command for doing bulk load supports the old and new API. To use the new API from the shell simply omit the failure directory argument. TODO link to javadoc for new API. See #436 , #472 , and #570 .

Summaries

Summaries enables continually generating statistics about a table with user defined functions. This feature can inform a user about what is in their table and be used by compaction strategies to make decisions. For example, using this feature it would be possible to compact all tablets where deletes are more than 25% of the data.

Scan Executors

Scan executors support prioritizing and dedicating scan resources. Each executor has a configurable number of threads and an optional custom prioritizer. Tables can be configured in a flexible way to dispatch scans to different executors.

Official Accumulo docker image was created

An official Accumulo docker images was created in ACCUMULO-4706 to make it easier for users to run Accumulo in Docker. To support running in Docker, a few changes were made to Accumulo:

  • The --upload-accumulo-site option was added to accumulo init to set properties in accumulo-site.xml to Zookeeper during initialization.
  • The -o <key>=<value> option was added to the accumulo command to override configuration that could not be set in Zookeeper.

Updated and improved Accumulo documentation

Accumulo’s documentation has been refactored with the following improvements:

  • Documentation source now lives in accumulo-website repo so changes are now immediately viewable.
  • Improved navigation using a new sidebar
  • Better linking to Javadocs, between documentation pages, and to configuration properties.

Accumulo’s documentation was also reviewed and changes were made to improve accuracy and remove out of date documentation.

Moved Accumulo Examples to its own repo

The Accumulo examples were moved out the accumulo repo to the accumulo-examples repo which has the following benefits:

  • The Accumulo examples are no longer released with Accumulo and can be continuously improved.
  • The Accumulo API version used by the examples can be updated right before Accumulo is released to test for any changes to the API that break semver.

Simplified Accumulo logging configuration

The log4j configuration of Accumulo services was improved in ACCUMULO-4588 with the following changes:

  • Logging is now configured using standard log4j JVM property ‘log4j.configuration’ in accumulo-env.sh.
  • Tarball ships with fewer log4j config files (3 rather than 6) which are all log4j properties files.
  • Log4j XML can still be used by editing accumulo-env.sh
  • Removed auditLog.xml and added audit log configuration to log4j-service properties files
  • Accumulo conf/ directory no longer has an examples/ directory. Configuration files ship in conf/ and are used by default.
  • Accumulo monitor by default will bind to 0.0.0.0 but will advertise hostname looked up in Java for log forwarding
  • Switched to use full hostnames rather than short hostnames for logging

Removed comparison of Value with byte[] in Value.equals()

Replaced the ability to use Value.equals(byte[]) to check if the contents of a Value object was equal to a given byte array in ACCUMULO-4726. To perform that check, you must now use the newly added Value.contentEquals(byte[]) method. This corrects the behavior of the equals method so that it conforms to the API contract documented in the javadoc inherited from its superclass. However, it will break any code that was relying on the undocumented and broken behavior to compare Value objects with byte arrays. Such comparisons will now always return false instead of true, even if the contents are equal.

Other Notable Changes

  • ACCUMULO-3652 - Replaced string concatenation in log statements with slf4j where applicable. Removed tserver TLevel logging class.
  • ACCUMULO-4733 - Introduced a new configuration option, crypto.security.provider, to allow for a configurable security provider for crypto. If unset, it will default to the system level Provider list.
  • ACCUMULO-4737 - Renamed configuration option crypto.cipher.algorithm.name to crypto.cipher.key.algorithm.name. The option now functions as described in the user manual. Introduced a new configuration option, crypto.wal.cipher.suite, to allow for a separate configuration for WAL files. If unset, this will default to the value for crypto.cipher.suite.
  • ACCUMULO-4449 - Removed ‘slave’ terminology and replaced with ‘tserver’ in most cases. The former ‘slaves’ config file is now named ‘tservers’. Added checks to scripts to fail if ‘slaves’ file is present.
  • ACCUMULO-4808 - Can now create table with splits and offline. Specifying splits at table creation time can be much faster than adding splits after creation.
  • ACCUMULO-4463 - Caching is now pluggable.
  • ACCUMULO-4177 - New built in cache implementation based on TinyLFU.
  • ACCUMULO-4376 ACCUMULO-4746 - Mutation and Key Fluent APIs allow easy mixing of types. For example a family of type String and qualifier of type byte[] is much easier to write using this new API.
  • ACCUMULO-4771 - The Accumulo monitor was completely rewritten.
  • ACCUMULO-4732 - Specify iterators and locality groups at table creation time.
  • ACCUMULO-4612 - Use percentages for memory related configuration.
  • ACCUMULO-1787 - Two tier compaction strategy. Support compacting small files with snappy and large files with gzip.
  • #536 - Removed mock Accumulo.
  • #438 - Added support for ZStandard compression
  • #404 - Added basic Grafana dashboard example. TODO better link?

Upgrading

View the Upgrading Accumulo documentation for guidance.

Testing

View all releases in the archive