Running Accumulo on Fedora 25
Authors: Christopher Tubbs and Mike Miller
Date: 19 Dec 2016
Reviewers: Keith Turner, Mike Walch
Apache Accumulo has been available in Fedora since F20. Recently, the Fedora packages were updated to Accumulo version 1.6.6, with improvements to the default configuration and launch scripts to provide a good out-of-box experience. This post discusses the basic setup procedures for running Accumulo on the latest version, Fedora 25.
Prepare the system
WARNING: Before you start, be sure you’ve got plenty of free disk space. Otherwise, you could run into this bug or see other problems.
These instructions assume you’re using Fedora 25, fully up-to-date (sudo dnf --refresh upgrade).
Install packages
Fedora provides a meta-package to install Accumulo and all of its dependencies.
It’s a good idea to install the JDK, so you’ll have access to the jps command, and tuned, for setting system performance tuning parameters from a profile. It’s also a good idea to ensure the optional Hadoop native libraries are installed, and that you have a good editor (replace vim with your preferred editor):
sudo dnf install accumulo java-1.8.0-openjdk-devel tuned vim hadoop-common-native
It is possible to install only a specific Accumulo service. For the single-node setup, almost everything is needed. For the multi-node setup, it might make more sense to be selective about which services you install on each node (for example, to only install accumulo-tserver).
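For example, on a dedicated worker node you might install just the tablet server package (the exact set of packages you need depends on your node layout):

```shell
# On a worker-only node, install just the tablet server service
sudo dnf install accumulo-tserver
```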
Set up tuned
(Optional) tuned can optimize your server settings, adjusting things like vm.swappiness. To set up tuned, do:
sudo systemctl start tuned.service # start service
sudo tuned-adm profile network-latency # pick a good profile
sudo tuned-adm active # verify the selected profile
sudo systemctl enable tuned.service # auto-start on reboots
Set up ZooKeeper
You’ll need to set up ZooKeeper, regardless of whether you’ll be running a single node or many. So, let’s create its configuration file (the defaults are fine):
sudo cp /etc/zookeeper/zoo_sample.cfg /etc/zookeeper/zoo.cfg
Now, let’s start ZooKeeper (and set it to run on reboot):
sudo systemctl start zookeeper.service
sudo systemctl enable zookeeper.service
Note that the default port for ZooKeeper is 2181. Remember the hostname of the node where ZooKeeper is running; it is referred to as <zk-dns-name> later.
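As a quick sanity check, you can use one of ZooKeeper’s four-letter-word commands to confirm it is answering on port 2181 (this assumes the nc utility is available):

```shell
# ZooKeeper replies "imok" if it is serving requests
echo ruok | nc <zk-dns-name> 2181
```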
Running a single node
Configure Accumulo
To run on a single node, you don’t need to run HDFS. Accumulo can use the local filesystem as a volume instead. By default, it uses /tmp/accumulo. Let’s change that to something which will survive a reboot:
sudo vim /etc/accumulo/accumulo-site.xml
Change the value of the instance.volumes property from file:///tmp/accumulo to file:///var/tmp/accumulo (or another preferred location) in the configuration file.
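The edited property in /etc/accumulo/accumulo-site.xml should end up looking something like this:

```xml
<property>
  <name>instance.volumes</name>
  <value>file:///var/tmp/accumulo</value>
</property>
```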
While you are editing the Accumulo configuration file, you should also change the default instance.secret from DEFAULT to something else. You can also change the credentials used by the tracer service now. If you use the root user, you’ll have to set its password to the same one you’ll use later when you initialize Accumulo. If you use another user name, you’ll have to create that user later.
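As a sketch, in Accumulo 1.6 the tracer’s credentials are controlled by the trace.user and trace.token.property.password properties; the user name and password below are placeholders:

```xml
<property>
  <name>trace.user</name>
  <value>tracer</value>
</property>
<property>
  <name>trace.token.property.password</name>
  <value>tracer-password</value>
</property>
```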
Configure Hadoop client
Hadoop’s default local filesystem handler isn’t very good at ensuring files are
written to disk when services are stopped. So, let’s use a better filesystem
implementation for file:// locations. This implementation may not be as
robust as a full HDFS instance, but it’s more reliable than the default. Even
though you’re not going to be running HDFS, the Hadoop client code used in
Accumulo can still be configured by modifying Hadoop’s configuration file:
sudo vim /etc/hadoop/core-site.xml
Add a new property:
<property>
  <name>fs.file.impl</name>
  <value>org.apache.hadoop.fs.RawLocalFileSystem</value>
</property>
Initialize Accumulo
Now, initialize Accumulo. You’ll need to do this as the accumulo user, because the Accumulo services run as the accumulo user. This user is created automatically by the RPMs if it doesn’t exist when the RPMs are installed. If you already have a user and/or group by this name, it will probably not be a problem, but be aware that this user will have permissions for the server configuration files. To initialize Accumulo as a specific user, use sudo -u:
sudo -u accumulo accumulo init
As expected, this command will fail if ZooKeeper is not running, or if the destination volume (file:///var/tmp/accumulo) already exists.
Start Accumulo services
Now that Accumulo is initialized, you can start its services:
sudo systemctl start accumulo-{master,tserver,gc,tracer,monitor}.service
Enable the services to start at boot:
sudo systemctl enable accumulo-{master,tserver,gc,tracer,monitor}.service
Running multiple nodes
Amazon EC2 setup
For a multi-node setup, the authors tested these instructions with a Fedora 25 Cloud AMI on Amazon EC2 with the following characteristics:

- us-east-1 availability zone
- ami-e5757bf2 (latest in us-east-1 at time of writing)
- HVM virtualization type
- gp2 disk type
- 64GB EBS root volume (no additional storage)
- m4.large and m4.xlarge instance types (tested on both)
- 3 nodes
For this setup, you should have a name service configured properly. For
convenience, we used the EC2 provided internal DNS, with internal IP addresses.
Make sure the nodes can communicate with each other using these names. If you’re using EC2, this means making sure they are in the same security group, and the security group has an inbound rule for “All traffic” with the source set to itself (sg-xxxxxxxx).
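If you manage the security group from the AWS CLI rather than the console, a self-referencing rule like that can be added with authorize-security-group-ingress (the group ID is a placeholder):

```shell
# Allow all traffic between instances in the same security group
aws ec2 authorize-security-group-ingress \
  --group-id sg-xxxxxxxx \
  --protocol -1 \
  --source-group sg-xxxxxxxx
```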
The default user is fedora for the Fedora Cloud AMIs. For the best experience, don’t forget to make sure the nodes are fully up-to-date (sudo dnf --refresh upgrade).
Configure and run Hadoop
Configuring HDFS is the primary difference between the single and multi-node setup. For both Hadoop and Accumulo, you can edit the configuration files on one machine, and copy them to the others.
Pick a server to be the NameNode and identify its DNS name (<namenode-dns-name>). Edit Hadoop’s configuration to set the default filesystem name to this location:
sudo vim /etc/hadoop/core-site.xml
Set the value of the fs.default.name property to hdfs://<namenode-dns-name>:8020.
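In /etc/hadoop/core-site.xml, that property looks like this (with the placeholder left in):

```xml
<property>
  <name>fs.default.name</name>
  <value>hdfs://<namenode-dns-name>:8020</value>
</property>
```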
Distribute copies of the changed configuration files to each node.
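One way to distribute an edited file is a small scp/ssh loop; the host names below are placeholders, and this assumes your user can sudo on each node:

```shell
# Push the edited core-site.xml to the other nodes
for host in node2 node3; do
  scp /etc/hadoop/core-site.xml "$host":/tmp/core-site.xml
  ssh "$host" 'sudo mv /tmp/core-site.xml /etc/hadoop/core-site.xml'
done
```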
Now, format the NameNode. You’ll need to do this as the hdfs user on the NameNode instance:
sudo -u hdfs hdfs namenode -format
On the NameNode, start the NameNode service and enable it on reboot:
sudo systemctl start hadoop-namenode.service
sudo systemctl enable hadoop-namenode.service
On each DataNode, start the DataNode service and enable it on reboot:
sudo systemctl start hadoop-datanode.service
sudo systemctl enable hadoop-datanode.service
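Once the services are up, you can confirm that each DataNode has registered with the NameNode:

```shell
# Should report one live entry per DataNode, with capacity and usage
sudo -u hdfs hdfs dfsadmin -report
```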
Configure and run Accumulo
Update Accumulo’s configuration to use this HDFS filesystem:
sudo vim /etc/accumulo/accumulo-site.xml
Change the value of instance.volumes to hdfs://<namenode-dns-name>:8020/accumulo in the configuration file. Don’t forget to also change the default instance.secret and the trace user’s credentials, if necessary. Also, since you will have multiple nodes, you cannot use localhost:2181 for ZooKeeper, so set instance.zookeeper.host to <zk-dns-name>:2181.
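Together, the changed properties in accumulo-site.xml look something like this (placeholders left in):

```xml
<property>
  <name>instance.volumes</name>
  <value>hdfs://<namenode-dns-name>:8020/accumulo</value>
</property>
<property>
  <name>instance.zookeeper.host</name>
  <value><zk-dns-name>:2181</value>
</property>
```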
Distribute copies of the changed configuration files to each node.
With HDFS now running, make sure Accumulo has permission to create its directory in HDFS, and initialize Accumulo:
sudo -u hdfs hdfs dfs -chmod 777 /
sudo -u accumulo accumulo init
After Accumulo has created its directory structure, you can change the permissions for the root back to what they were:
sudo -u hdfs hdfs dfs -chmod 755 /
Note: we only do this because this is a developer/testing environment; temporarily opening up permissions on the root of HDFS is not recommended in production.
Now, you can start Accumulo.
On the NameNode, start all the Accumulo services and enable on reboot:
sudo systemctl start accumulo-{master,tserver,gc,tracer,monitor}.service
sudo systemctl enable accumulo-{master,tserver,gc,tracer,monitor}.service
On each DataNode, start just the tserver and enable it on reboot:
sudo systemctl start accumulo-tserver.service
sudo systemctl enable accumulo-tserver.service
Watching and using Accumulo
Run the shell
Run a shell as Accumulo’s root user (the instance name and root password are the ones you selected during the initialize step above):
accumulo shell -u root -zh <zk-dns-name>:2181 -zi <instanceName>
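Once connected, a quick smoke test might create a table, write a single cell, and scan it back (the table and cell names below are arbitrary):

```
createtable hellotable
insert row1 fam qual val1
scan
deletetable -f hellotable
```

The scan should print the cell you just inserted before the table is dropped.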
View the monitor pages
You should also be able to view the NameNode monitor page and the Accumulo monitor pages. If you are running this in EC2, you can view these over an SSH tunnel using the NameNode’s public IP address. If you didn’t give this node a public IP address, you can allocate one in EC2 and associate it with this node:
ssh -L50070:localhost:50070 -L50095:localhost:50095 <user>@<host>
Replace <user> with your username (probably fedora if using the Fedora AMI), and <host> with the public IP or hostname for your EC2 instance. Now, in your local browser, you should be able to navigate to these addresses on your localhost: Hadoop monitor (http://localhost:50070) and Accumulo monitor (http://localhost:50095).
Debugging commands
Check the status of a service:
sudo systemctl status <ServiceName>.service
Check running Java processes:
sudo jps -ml
Check the system logs for a specific service within the last 10 minutes:
sudo journalctl -u <ServiceName> --since '10 minutes ago'
Check listening ports:
sudo netstat -tlnp
Check DNS name for a given IP address:
getent hosts <ipaddress> # OR
hostname -A
Perform forward and reverse DNS lookups:
sudo dnf install bind-utils
dig +short <hostname> # forward DNS lookup
dig +short -x <ipaddress> # reverse DNS lookup
Find the instance ID for your instance name:
zkCli.sh -server <host>:2181 # replace <host> with your ZooKeeper server DNS name
> get /accumulo/instances/<name> # replace <name> with your instance name
> quit
If the NameNode is listening on the loopback address, you’ll probably need to restart the service manually, as well as any Accumulo services which failed. This is a known issue with Hadoop:
sudo systemctl restart hadoop-namenode.service
Some helpful rpm commands:
rpm -q -i <installed-package-name> # to see info about an installed package
rpm -q -i -p <rpm-file-name> # to see info about an rpm file
rpm -q --provides <installed-package-name> # see what a package provides
rpm -q --requires <installed-package-name> # see what a package requires
rpm -q -l <installed-package-name> # list package files
rpm -q --whatprovides <file> # find rpm which owns <file>
rpm -q --whatrequires 'mvn(groupId:artifactId)' # find rpm which requires maven coords
Helping out
Feel free to get involved with the Fedora or Fedora EPEL (for RHEL/CentOS users) packaging. Contact the Fedora maintainers (user at fedoraproject dot org) for the Accumulo packages to see how you can help: patching bugs, adapting the upstream packages to the Fedora packaging standards, testing updates, maintaining dependency packages, and more.