Using Azure Data Lake Gen2 storage as a data store for Accumulo
Author: Karthick Narendran
Date: 15 Oct 2019
Accumulo can store its files in Azure Data Lake Storage Gen2 using the ABFS (Azure Blob File System) driver. As with the earlier S3 blog post, the write-ahead logs and Accumulo metadata can be stored in HDFS and everything else in Gen2 storage, using the volume chooser feature introduced in Accumulo 2.0. The configurations referenced in this post are specific to Accumulo 2.0 and Hadoop 3.2.0.
Hadoop setup
For the ABFS client to talk to Gen2 storage, it requires one of the authentication mechanisms listed here. This post covers Azure Managed Identity, formerly known as Managed Service Identity or MSI. This feature provides Azure services with an automatically managed identity in Azure AD, and it avoids credentials or other sensitive information having to be stored in code or configs/JCEKS. Plus, it comes free with Azure AD.
At least the following should be added to Hadoop’s core-site.xml on each node.
<property>
  <name>fs.azure.account.auth.type</name>
  <value>OAuth</value>
</property>
<property>
  <name>fs.azure.account.oauth.provider.type</name>
  <value>org.apache.hadoop.fs.azurebfs.oauth2.MsiTokenProvider</value>
</property>
<property>
  <name>fs.azure.account.oauth2.msi.tenant</name>
  <value>TenantID</value>
</property>
<property>
  <name>fs.azure.account.oauth2.client.id</name>
  <value>ClientID</value>
</property>
See the ABFS documentation for more information on Hadoop Azure support.
To get the hadoop command to work with ADLS Gen2, set the following entries in hadoop-env.sh. As Gen2 storage is TLS-enabled by default, it is important to use the native OpenSSL implementation of TLS.
export HADOOP_OPTIONAL_TOOLS="hadoop-azure"
export HADOOP_OPTS="-Dorg.wildfly.openssl.path=<path/to/OpenSSL/libraries> ${HADOOP_OPTS}"
To verify the location of the OpenSSL libraries, run the whereis libssl command on the host.
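As a quick sanity check, the hadoop command should now be able to reach the Gen2 file system directly; listing the root exercises the MSI OAuth setup end to end. The file system and storage account names below are placeholders, as elsewhere in this post.

hadoop fs -ls abfss://<file_system>@<storage_account_name>.dfs.core.windows.net/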
Accumulo setup
For each node in the cluster, modify accumulo-env.sh to add the Azure storage jars to the classpath. Your versions may differ depending on your Hadoop version; the following versions were included with Hadoop 3.2.0.
CLASSPATH="${conf}:${lib}/*:${HADOOP_CONF_DIR}:${ZOOKEEPER_HOME}/*:${HADOOP_HOME}/share/hadoop/client/*"
CLASSPATH="${CLASSPATH}:${HADOOP_HOME}/share/hadoop/tools/lib/azure-data-lake-store-sdk-2.2.9.jar"
CLASSPATH="${CLASSPATH}:${HADOOP_HOME}/share/hadoop/tools/lib/azure-keyvault-core-1.0.0.jar"
CLASSPATH="${CLASSPATH}:${HADOOP_HOME}/share/hadoop/tools/lib/hadoop-azure-3.2.0.jar"
CLASSPATH="${CLASSPATH}:${HADOOP_HOME}/share/hadoop/tools/lib/wildfly-openssl-1.0.4.Final.jar"
CLASSPATH="${CLASSPATH}:${HADOOP_HOME}/share/hadoop/common/lib/jaxb-api-2.2.11.jar"
CLASSPATH="${CLASSPATH}:${HADOOP_HOME}/share/hadoop/common/lib/jaxb-impl-2.2.3-1.jar"
CLASSPATH="${CLASSPATH}:${HADOOP_HOME}/share/hadoop/common/lib/commons-lang3-3.7.jar"
CLASSPATH="${CLASSPATH}:${HADOOP_HOME}/share/hadoop/common/lib/httpclient-4.5.2.jar"
CLASSPATH="${CLASSPATH}:${HADOOP_HOME}/share/hadoop/common/lib/jackson-core-asl-1.9.13.jar"
CLASSPATH="${CLASSPATH}:${HADOOP_HOME}/share/hadoop/common/lib/jackson-mapper-asl-1.9.13.jar"
export CLASSPATH
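To verify the jars actually made it onto Accumulo’s classpath on a given node, the accumulo classpath command prints the resolved classpath:

accumulo classpath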
Include -Dorg.wildfly.openssl.path in JAVA_OPTS in accumulo-env.sh as shown below. This Java property is an optional performance enhancement for TLS.
JAVA_OPTS=("${ACCUMULO_JAVA_OPTS[@]}"
'-XX:+UseConcMarkSweepGC'
'-XX:CMSInitiatingOccupancyFraction=75'
'-XX:+CMSClassUnloadingEnabled'
'-XX:OnOutOfMemoryError=kill -9 %p'
'-XX:-OmitStackTraceInFastThrow'
'-Djava.net.preferIPv4Stack=true'
'-Dorg.wildfly.openssl.path=/usr/lib64'
"-Daccumulo.native.lib.path=${lib}/native")
Set the following in accumulo.properties and then run accumulo init, but don’t start Accumulo.
instance.volumes=hdfs://<namenode>/accumulo
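Because instance.volumes points only at HDFS at this stage, everything init creates lands there. A quick way to confirm, assuming the same placeholder name node:

accumulo init
hadoop fs -ls hdfs://<namenode>/accumulo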
After running accumulo init, we need to configure Accumulo to store its write-ahead logs in HDFS and everything else in Gen2. Set the following in accumulo.properties.
instance.volumes=hdfs://<namenode>/accumulo,abfss://<file_system>@<storage_account_name>.dfs.core.windows.net/accumulo
general.volume.chooser=org.apache.accumulo.server.fs.PreferredVolumeChooser
general.custom.volume.preferred.default=abfss://<file_system>@<storage_account_name>.dfs.core.windows.net/accumulo
general.custom.volume.preferred.logger=hdfs://<namenode>/accumulo
Run accumulo init --add-volumes to initialize the Azure DLS Gen2 volume. Doing this in two steps avoids putting any Accumulo metadata files in Gen2 during init.
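If the volume was added successfully, the Gen2 file system should now contain Accumulo’s instance metadata, but no table data yet; a quick check using the same placeholders:

hadoop fs -ls abfss://<file_system>@<storage_account_name>.dfs.core.windows.net/accumulo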
Copy accumulo.properties to all nodes and start Accumulo.
Individual tables can be configured to store their files in HDFS by setting the table property table.custom.volume.preferred. This should be set for the metadata table in case it splits, using the following Accumulo shell command.
config -t accumulo.metadata -s table.custom.volume.preferred=hdfs://<namenode>/accumulo
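To confirm the override took effect, the shell’s config command can filter the table’s configuration by property name:

config -t accumulo.metadata -f volume.preferred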
Accumulo example
The following Accumulo shell session shows an example of writing data to Gen2 and reading it back. It also shows scanning the metadata table to verify the data is stored in Gen2.
root@muchos> createtable gen2test
root@muchos gen2test> insert r1 f1 q1 v1
root@muchos gen2test> insert r1 f1 q2 v2
root@muchos gen2test> flush -w
2019-10-16 08:01:00,564 [shell.Shell] INFO : Flush of table gen2test completed.
root@muchos gen2test> scan
r1 f1:q1 [] v1
r1 f1:q2 [] v2
root@muchos gen2test> scan -t accumulo.metadata -c file
4< file:abfss://<file_system>@<storage_account_name>.dfs.core.windows.net/accumulo/tables/4/default_tablet/F00000gj.rf [] 234,2
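The write-ahead logs, meanwhile, should remain on HDFS. Assuming the default layout, where each volume keeps its logs under a wal directory, they can be listed with:

hadoop fs -ls -R hdfs://<namenode>/accumulo/wal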
These instructions will help you configure Accumulo to use Azure Data Lake Gen2 storage along with HDFS. With this setup, we were able to successfully run the continuous ingest test. Going forward, we’ll experiment more in this space with ADLS Gen2 and update this post as we learn more.