Apache Accumulo 2.0.0

02 Aug 2019

Apache Accumulo 2.0.0 contains significant changes from 1.9 and earlier versions. It is the first major release since the project adopted semver and is the culmination of more than 3 years' worth of work by more than 40 contributors from the Accumulo community. The following release notes highlight some of the changes. If anything is missing from this list, please contact the developers to have it included.

Notable Changes

New API for creating connections to Accumulo

A fluent API for creating Accumulo clients was introduced in ACCUMULO-4784 and #634. The Connector and ZooKeeperInstance objects have been deprecated and replaced by AccumuloClient, which is created from the Accumulo entry point. The new API also deprecates ClientConfiguration and introduces its own properties file, accumulo-client.properties, which ships with the Accumulo tarball. The new API has the following benefits over the old API:

  • All connection information can be specified in a properties file to create the client. This was not possible with the old API.
  • The new API does not require a ZooKeeperInstance to be created before creating a client.
  • The new client is closeable and does not rely on shared static resource management.
  • Clients can be created using a new Java builder, a Properties object, or accumulo-client.properties.
  • Clients can now be created with default settings for BatchWriter, Scanner, etc.
  • Scanners can be created with default authorizations. #744

See the client documentation for more information on how to use the new API.
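As a rough illustration of the properties-based option, the sketch below loads connection settings into a java.util.Properties object using only the JDK. The key names are assumed to match those in the shipped accumulo-client.properties, and the values, the class name ClientPropsSketch, and the helper load are placeholders invented for this example. The hand-off to the Accumulo builder is left as a comment so the code runs without the Accumulo jars.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical helper class, not part of Accumulo.
final class ClientPropsSketch {

  // Stand-in for the contents of accumulo-client.properties. The key names are
  // assumed to match the shipped file; every value is a placeholder.
  static final String EXAMPLE_FILE = String.join("\n",
      "instance.name=myinstance",
      "instance.zookeepers=zk1:2181,zk2:2181",
      "auth.type=password",
      "auth.principal=myuser",
      "auth.token=mypassword");

  // Parses properties-file text into a Properties object.
  static Properties load(String text) {
    Properties props = new Properties();
    try {
      props.load(new StringReader(text));
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return props;
  }

  public static void main(String[] args) {
    Properties props = load(EXAMPLE_FILE);
    System.out.println("instance = " + props.getProperty("instance.name"));
    // With the Accumulo 2.0 client jar on the classpath, this object would be
    // handed to the new builder, roughly:
    //   try (AccumuloClient client = Accumulo.newClient().from(props).build()) { ... }
  }
}
```

Because the client is closeable, a try-with-resources block like the one in the comment is the natural way to scope its lifetime.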

Hadoop 3, Java 8 & 11

Accumulo 2.x expects at least Java 8 and Hadoop 3. It is built against Java 8 and Hadoop 3, and the binary tarball is targeted to work with a Java 8 and Hadoop 3 system. See ACCUMULO-4826, #531, and ACCUMULO-4299. Running with Java 11 is also supported, but Java 11 is not required.

Simplified Accumulo scripts and configuration files

Accumulo’s scripts and configuration were refactored in ACCUMULO-4490 to make Accumulo easier to use. The number of scripts in the bin directory of the Accumulo release tarball has been reduced from 20 scripts to the four scripts below:

  • accumulo - mostly left alone except for improved usage
  • accumulo-service - manage Accumulo processes as services
  • accumulo-cluster - manage Accumulo on a cluster; replaces start-all.sh and stop-all.sh
  • accumulo-util - combines many utility scripts into one script

Read this blog post for more information on this change.

New Bulk Import API

A new bulk import API with a very different implementation was added in 2.0. The new API supports the following new functionality:

  • Bulk import to an offline table.
  • Load plans that specify where files go in a table which avoids opening the files for inspection.
  • Inspection of files on the client side. All files are inspected before the FATE operation starts. This results in fewer namenode operations and fail-fast behavior for bad files (a failure directory is no longer needed).
  • A new improved algorithm to load files into tablets. This new algorithm scans the metadata table and makes asynchronous load calls to all tablets. This queues load operations on all tablets at around the same time. The async RPC calls and beforehand inspection make the bulk load FATE operation much shorter.

The shell command for bulk loading supports both the old and new API. To use the new API from the shell, simply omit the failure directory argument. From Java, use the new fluent API. See #436, #472, and #570.

Summaries

Summaries enable continually generating statistics about a table with user-defined functions. This feature can inform users about what is in their table and can be used by compaction strategies to make decisions. For example, using this feature it would be possible to compact all tablets where deletes are more than 25% of the data. Another example use case is optimizing filtering compactions by enabling smart selection of files with pertinent data. Examples of filtering compactions are age-off and removal of non-compliant data.
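The 25% delete rule mentioned above comes down to a ratio check over summary counters. The sketch below shows only that arithmetic, using plain Java collections. The counter names "deletes" and "total", the class DeleteRatioSketch, and its method are assumptions made for illustration, and the way a real compaction strategy obtains summary data is not shown.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not an Accumulo class.
final class DeleteRatioSketch {
  // Assumed counter names for a summary that counts deletes and all entries.
  static final String DELETES = "deletes";
  static final String TOTAL = "total";

  // Returns true when deletes make up strictly more than `threshold` of all entries.
  static boolean exceedsDeleteThreshold(Map<String, Long> stats, double threshold) {
    long total = stats.getOrDefault(TOTAL, 0L);
    if (total == 0) {
      return false; // nothing summarized, so nothing to compact
    }
    long deletes = stats.getOrDefault(DELETES, 0L);
    return (double) deletes / total > threshold;
  }

  public static void main(String[] args) {
    Map<String, Long> stats = new HashMap<>();
    stats.put(DELETES, 300L);
    stats.put(TOTAL, 1000L);
    // 300 of 1000 entries are deletes, which is above the 25% threshold.
    System.out.println(exceedsDeleteThreshold(stats, 0.25));
  }
}
```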

Scan Executors

Scan executors support prioritizing and dedicating scan resources. Each executor has a configurable number of threads and an optional custom prioritizer. Tables can be configured in a flexible way to dispatch scans to different executors.
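To make the dispatch idea concrete, here is a toy model built on java.util.concurrent: named executors with fixed thread counts, and a per-table mapping that routes scans to one of them. It is a sketch of the concept only. None of the class or method names come from Accumulo's SPI, and the optional prioritizer is not modelled.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Toy model only: these names are invented for illustration.
final class ScanDispatchSketch {
  private static final String DEFAULT = "default";
  private final Map<String, ExecutorService> executors = new HashMap<>();
  private final Map<String, String> tableToExecutor = new HashMap<>();

  ScanDispatchSketch(int defaultThreads) {
    addExecutor(DEFAULT, defaultThreads);
  }

  // Each executor owns a fixed, configurable number of threads.
  void addExecutor(String name, int threads) {
    executors.put(name, Executors.newFixedThreadPool(threads));
  }

  // Routes all scans of `table` to the named executor.
  void dispatchTable(String table, String executorName) {
    if (!executors.containsKey(executorName)) {
      throw new IllegalArgumentException("unknown executor: " + executorName);
    }
    tableToExecutor.put(table, executorName);
  }

  // Tables without an explicit mapping fall back to the default executor.
  String executorFor(String table) {
    return tableToExecutor.getOrDefault(table, DEFAULT);
  }

  void submitScan(String table, Runnable scan) {
    executors.get(executorFor(table)).execute(scan);
  }

  void shutdown() {
    executors.values().forEach(ExecutorService::shutdown);
  }
}
```

In this model, giving a low-priority table its own single-threaded executor keeps its scans from consuming threads that other tables depend on.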

SPI package

All new pluggable components introduced in 2.0 were placed under a new SPI package. The SPI package is analyzed by Apilyzer at build time to ensure plugins only use SPI and API types. This prevents plugins from using internal Accumulo types that are inherently unstable over time. Plugins created before 2.0 do use internal types and are less stable. The new pluggable interfaces should be much more stable.

Official Accumulo Docker image was created

An official Accumulo Docker image was created in ACCUMULO-4706 to make it easier for users to run Accumulo in Docker. To support running in Docker, a few changes were made to Accumulo:

  • The --upload-accumulo-site option was added to accumulo init to upload the properties in accumulo-site.xml to ZooKeeper during initialization.
  • The -o <key>=<value> option was added to the accumulo command to override configuration that could not be set in ZooKeeper.

Updated and improved Accumulo documentation

Accumulo’s documentation has been refactored with the following improvements:

  • Documentation source now lives in accumulo-website repo so changes are now immediately viewable.
  • Improved navigation using a new sidebar
  • Better linking to Javadocs, between documentation pages, and to configuration properties.

Accumulo’s documentation was also reviewed and changes were made to improve accuracy and remove out of date documentation.

Moved Accumulo Examples to its own repo

The Accumulo examples were moved out of the accumulo repo to the accumulo-examples repo, which has the following benefits:

  • The Accumulo examples are no longer released with Accumulo and can be continuously improved.
  • The Accumulo API version used by the examples can be updated right before Accumulo is released to test for any changes to the API that break semver.

Simplified Accumulo logging configuration

The log4j configuration of Accumulo services was improved in ACCUMULO-4588 with the following changes:

  • Logging is now configured using standard log4j JVM property ‘log4j.configuration’ in accumulo-env.sh.
  • Tarball ships with fewer log4j config files (3 rather than 6) which are all log4j properties files.
  • Log4j XML can still be used by editing accumulo-env.sh
  • Removed auditLog.xml and added audit log configuration to log4j-service properties files
  • Accumulo conf/ directory no longer has an examples/ directory. Configuration files ship in conf/ and are used by default.
  • The Accumulo monitor binds to 0.0.0.0 by default but advertises the hostname looked up in Java for log forwarding
  • Switched to use full hostnames rather than short hostnames for logging

Removed comparison of Value with byte[] in Value.equals()

The ability to use Value.equals(byte[]) to check whether the contents of a Value object equal a given byte array was removed in ACCUMULO-4726. To perform that check, use the newly added Value.contentEquals(byte[]) method instead. This change makes the equals method conform to the API contract documented in the javadoc inherited from its superclass. However, it breaks any code that relied on the undocumented and broken behavior to compare Value objects with byte arrays: such comparisons now always return false, even when the contents are equal.
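The difference between the two checks can be shown with a small stand-in class. ValueSketch below is not Accumulo's Value; it is a hypothetical wrapper around a byte array that follows the same contract as described above: equals is true only for another instance with the same contents, while contentEquals compares against a raw byte array.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical stand-in for illustration; not org.apache.accumulo.core.data.Value.
final class ValueSketch {
  private final byte[] bytes;

  ValueSketch(byte[] bytes) {
    this.bytes = bytes.clone();
  }

  // Follows the Object.equals contract: a byte[] is never equal to a ValueSketch.
  @Override
  public boolean equals(Object other) {
    return other instanceof ValueSketch
        && Arrays.equals(bytes, ((ValueSketch) other).bytes);
  }

  @Override
  public int hashCode() {
    return Arrays.hashCode(bytes);
  }

  // Explicit content comparison against a raw byte array.
  boolean contentEquals(byte[] other) {
    return Arrays.equals(bytes, other);
  }

  public static void main(String[] args) {
    byte[] raw = "abc".getBytes(StandardCharsets.UTF_8);
    ValueSketch value = new ValueSketch(raw);
    System.out.println(value.equals(raw));        // false: different types
    System.out.println(value.contentEquals(raw)); // true: same bytes
  }
}
```

Code being upgraded to 2.0 would replace any value.equals(someByteArray) call with value.contentEquals(someByteArray).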

Other Notable Changes

  • ACCUMULO-3652 - Replaced string concatenation in log statements with slf4j where applicable. Removed tserver TLevel logging class.
  • ACCUMULO-4449 - Removed ‘slave’ terminology and replaced with ‘tserver’ in most cases. The former ‘slaves’ config file is now named ‘tservers’. Added checks to scripts to fail if ‘slaves’ file is present.
  • ACCUMULO-4808 - Tables can now be created with splits and in an offline state. Specifying splits at table creation time can be much faster than adding splits after creation.
  • ACCUMULO-4463 - Caching is now pluggable.
  • ACCUMULO-4177 - New built in cache implementation based on TinyLFU.
  • ACCUMULO-4376 ACCUMULO-4746 - Mutation and Key fluent APIs allow easy mixing of types. For example, a family of type String and a qualifier of type byte[] are much easier to write using this new API.
  • ACCUMULO-4771 - The Accumulo monitor was completely rewritten.
  • ACCUMULO-4732 - Specify iterators and locality groups at table creation time.
  • ACCUMULO-4612 - Use percentages for memory related configuration.
  • ACCUMULO-1787 - Two tier compaction strategy. Support compacting small files with snappy and large files with gzip.
  • #560 - Provide new Crypto interface & impl
  • #536 - Removed mock Accumulo.
  • #438 - Added support for ZStandard compression
  • #404 - Added basic Grafana dashboard example.
  • #1102 #1100 #1037 - Removed lock contention in different areas. These locks caused threads working on unrelated tasks to impede each other.
  • #1033 - Optimized the default compaction strategy. In some cases Accumulo would rewrite data O(N^2) times over repeated compactions. With this change the amount of rewriting is always logarithmic.
  • Many performance improvements mentioned in the 1.9.X release notes are also available in 2.0.
  • #813 #905 - Scanners now close server-side sessions on close.
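As an illustration of the percentage-based memory settings from ACCUMULO-4612 listed above, the sketch below resolves a value such as "33%" to a byte count. It assumes the percentage is taken relative to the JVM's maximum heap; that assumption, the class MemoryPercentSketch, and the suffix handling are all part of this example rather than a description of Accumulo's actual parser.

```java
// Hypothetical helper, not an Accumulo class.
final class MemoryPercentSketch {

  // Converts "25%", "2K", "64M", "1G", or a plain byte count into bytes.
  // Percentages are resolved against maxHeapBytes (an assumption of this sketch).
  static long toBytes(String value, long maxHeapBytes) {
    String v = value.trim();
    if (v.endsWith("%")) {
      long pct = Long.parseLong(v.substring(0, v.length() - 1));
      if (pct < 1 || pct > 100) {
        throw new IllegalArgumentException("percent out of range: " + value);
      }
      return maxHeapBytes * pct / 100;
    }
    char last = Character.toUpperCase(v.charAt(v.length() - 1));
    long multiplier;
    switch (last) {
      case 'K': multiplier = 1L << 10; break;
      case 'M': multiplier = 1L << 20; break;
      case 'G': multiplier = 1L << 30; break;
      default: return Long.parseLong(v); // no suffix: already a byte count
    }
    return Long.parseLong(v.substring(0, v.length() - 1)) * multiplier;
  }

  public static void main(String[] args) {
    long maxHeap = Runtime.getRuntime().maxMemory();
    System.out.println("33% of max heap = " + toBytes("33%", maxHeap) + " bytes");
  }
}
```

The appeal of a percentage is that the same configuration keeps working when the JVM heap size changes, whereas a fixed byte count has to be retuned by hand.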

Upgrading

View the Upgrading Accumulo documentation for guidance.
