Apache Accumulo 1.8.0

06 Sep 2016

Apache Accumulo 1.8.0 is a significant release that includes many important milestone features which expand the functionality of Accumulo. These include features related to security, availability, and extensibility. Over 350 JIRA issues were resolved in this version. This includes over 200 bug fixes and 71 improvements and 4 new features. See JIRA for the complete list.

Below are resources for this release:

In the context of Accumulo’s Semantic Versioning guidelines, this is a “minor version”. This means that new APIs have been created, some deprecations may have been added, but no deprecated APIs have been removed. Code written against 1.7.x should work against 1.8.0 – binary compatibility has been preserved with one exception of an already-deprecated Mock Accumulo utility class. As always, the Accumulo developers take API compatibility very seriously and have invested much time to ensure that we meet the promises set forth to our users.

Major Changes

Speed up WAL roll overs

Performance of writing mutations is improved by refactoring the bookeeping required for Write-Ahead Log (WAL) files and by creating a standby WAL for faster switching when the log is full. This was a substantial refactor in the way WALs worked, but smoothes overall ingest performance in addition to provides a increase in write speed as shown by the simple test below. The top entry is before ACCUMULO-3423 and the bottom graph is after the refactor.

Graph of WAL speed up after ACCUMULO-3423

User level API for RFile

Previously the only public API available to write RFiles was via the AccumuloFileOutputFormat. There was no way to read RFiles in the public API. ACCUMULO-4165 exposes a brand new public API for reading and writing RFiles as well as cleans up some of the internal APIs.

Suspend Tablet assignment for rolling restarts

When a tablet server dies, Accumulo attempted to reassign the tablets as quickly as possible to maintain availability. A new configuration property table.suspend.duration (with a default of zero seconds) now controls how long to wait before reassigning a tablet from a dead tserver. The property is configurable via the Accumulo shell, so you can set it, do a rolling restart, and then set it back to 0. A new state as introduced, TableState.SUSPENDED to support this feature. By default, metadata tablet reassignment is not suspended, but that can also be changed with the master.metadata.suspendable property that is false by default. Root tablet assignment can not be suspended. See ACCUMULO-4353 for more info.

Run multiple Tablet Servers on one node

ACCUMULO-4328 introduces the capability of running multiple tservers on a single node. This is intended for nodes with a large amounts of memory and/or disk. This feature is disabled by default. There are several related tickets: ACCUMULO-4072, ACCUMULO-4331 and ACCUMULO-4406. Note that when this is enabled, the names of the log files change. Previous log file names were defined in the generic_logger.xml as ${org.apache.accumulo.core.application}_{org.apache.accumulo.core.ip.localhost.hostname}.log. The files will now include the instance id after the application with ${org.apache.accumulo.core.application}_${instance}_${org.apache.accumulo.core.ip.localhost.hostname}.log.

For example: tserver_host.domain.com.log will become tserver_1_host.domain.log when multiple TabletServers are run per host. The same change also applies to the debug logs provided in the example configurations. The log names do not change if this feature is not used.

Rate limiting Major Compactions

Major Compactions can significantly increase the amount of load on TabletServers. ACCUMULO-4187 restricts the rate at which data is read and written when performing major compactions. This has a direct effect on the IO load caused by major compactions with a similar effect on the CPU utilization. This behavior is controlled by a new property tserver.compaction.major.throughput with a defaults of 0B which disables the rate limiting.

Table Sampling

Queryable sample data was added by ACCUMULO-3913. This allows users to configure a pluggable function to generate sample data. At scan time, the sample data can optionally be scanned. Iterators also have access to sample data. Iterators can access all data and sample data, this allows an iterator to use sample data for query optimizations. The new user level RFile API supports writing RFiles with sample data for bulk import.

A simple configurable sampler function is included with Accumulo. This sampler uses hashing and can be configured to use a subset of Key fields. For example if it was desired to have entire rows in the sample, then this sampler would be configured to hash+mod the row. Then when a row is selected for the sample, all of its columns and all of its updates will be in the sample data. Another scenario is one in which a document id is in the column qualifier. In this scenario, one would either want all data related to a document in the sample data or none. To achieve this, the sample could be configured to hash+mod on the column qualifier. See the sample Readme example and javadocs on the new APIs for more information.

For sampling to work, all tablets scanned must have pre-generated sample data that was generated in the same way. If this is not the case then scans will fail. For existing tables, samples can be generated by configuring sampling on the table and compacting the table.

Upgrade to Apache Thrift 0.9.3

Accumulo relies on Apache Thrift to implement remote procedure calls between Accumulo services. Ticket ACCUMULO-4077 updates our dependency to 0.9.3. See the Apache Thrift 0.9.3 Release Notes for details on the changes to Thrift. NOTE: The Thrift 0.9.3 Java library is not compatible other versions of Thrift. Applications running against Accumulo 1.8 must use Thrift 0.9.3. Different versions of Thrift on the classpath will not work.

Iterator Test Harness

Users often write a new iterator without fully understanding its limits and lifetime. Previously, Accumulo did not provide any means in which a user could test iterators to catch common issues that only become apparent in multi-node production deployments. ACCUMULO-626 provides a framework and a collection of initial tests which can be used to simulate common issues with Iterators that only appear in production deployments. This test harness can be used directly by users as a supplemental tool to unit tests and integration tests with MiniAccumuloCluster.

Please see the Accumulo User Manual chapter on Iterator Testing for more information

Default port for Monitor changed to 9995

Previously, the default port for the monitor was 50095. You will need to update your links to point to port 9995. The default port for the GC process was also changed from 50091 to 9998, although this an RPC port used internally and automatically discovered. These default ports were changed because the previous defaults fell in the Linux Ephemeral port range. This means that the operating system, when a port in this range was unusued, would allocate this port for dynamic network communication. This has the side-effect of temporal bind issues when trying to start these services (as the operating system might have allocated them elsewhere). By moving these defaults out of the ephemeral range, we can guarantee that the Monitor and GC will reliably start. These values are still configurable by setting monitor.port.clientand gc.port.client in the accumulo-site.xml.

Other Notable Changes

  • ACCUMULO-1055 Configurable maximum file size for merging minor compactions
  • ACCUMULO-1124 Optimization of RFile index
  • ACCUMULO-2883 API to fetch current tablet assignments
  • ACCUMULO-3871 Support for running integration tests in MapReduce
  • ACCUMULO-3920 Deprecate the MockAccumulo class and remove usage in our tests
  • ACCUMULO-4339 Make hadoop-minicluster optional dependency of acccumulo-minicluster
  • ACCUMULO-4318 BatchWriter, ConditionalWriter, and ScannerBase now extend AutoCloseable
  • ACCUMULO-4326 Value constructor now accepts Strings (and Charsequences)
  • ACCUMULO-4354 Bump dependency versions to include gson, jetty, and sl4j
  • ACCUMULO-3735 Bulk Import status page on the monitor
  • ACCUMULO-4066 Reduced time to processes conditional mutations.
  • ACCUMULO-4164 Reduced seek time for cached data.

Testing

Each unit and functional test only runs on a single node, while the RandomWalk and Continuous Ingest tests run on any number of nodes. Agitation refers to randomly restarting Accumulo processes and Hadoop Datanode processes, and, in HDFS High-Availability instances, forcing NameNode failover.

OS/Environment Hadoop Nodes ZooKeeper HDFS HA Tests
CentOS7/openJDK7/EC2; 3 m3.xlarge leaders, 8 d2.xlarge workers 2.6.4 11 3.4.8 No 24 HR Continuous Ingest without Agitation.
CentOS7/openJDK7/EC2; 3 m3.xlarge leaders, 8 d2.xlarge workers 2.6.4 11 3.4.8 No 16 HR Continuous Ingest with Agitation.
CentOS7/openJDK7/OpenStack VMs (16G RAM 2cores 2disk3; 1 leader, 5 workers HDP 2.5 (Hadoop 2.7) 7 HDP 2.5 (ZK 3.4) No 24 HR Continuous Ingest without Agitation.
CentOS7/openJDK7/OpenStack VMs (16G RAM 2cores 2disk3; 1 leader, 5 workers HDP 2.5 (Hadoop 2.7) 7 HDP 2.5 (ZK 3.4) No 24 HR Continuous Ingest with Agitation.

View all releases in the archive