Accumulo 2.x Documentation >> Troubleshooting >> Basic Troubleshooting
Basic Troubleshooting
Edit this pageGeneral
The tablet server does not seem to be running!? What happened?
Accumulo is a distributed system. It is supposed to run on remote
equipment, across hundreds of computers. Each program that runs on
these remote computers writes down events as they occur, into a local
file. By default, this is defined in conf/accumulo-env.sh
as ACCUMULO_LOG_DIR
.
Look in the $ACCUMULO_LOG_DIR/tserver*.log
file. Specifically, check the end of the file.
The tablet server did not start and the debug log does not exist! What happened?
When the individual programs are started, the stdout and stderr output
of these programs are stored in .out
and .err
files in
$ACCUMULO_LOG_DIR
. Often, when there are missing configuration
options, files or permissions, messages will be left in these files.
Probably a start-up problem. Look in $ACCUMULO_LOG_DIR/tserver*.err
Accumulo is not working, what’s wrong?
There’s a small web server that collects information about all the components that make up a running Accumulo instance. It will highlight unusual or unexpected conditions.
Point your browser to the monitor (typically the manager host, on port 9995). Is anything red or yellow?
My browser is reporting connection refused, and I cannot get to the monitor
The monitor program’s output is also written to .err and .out files in
the $ACCUMULO_LOG_DIR
. Look for problems in this file if the
$ACCUMULO_LOG_DIR/monitor*.log
file does not exist.
The monitor program is probably not running. Check the log files for errors.
My browser hangs trying to talk to the monitor.
Your browser needs to be able to reach the monitor program. Often large clusters are firewalled, or use a VPN for internal communications. You can use SSH to proxy your browser to the cluster, or consult with your system administrator to gain access to the server from your browser.
It is sometimes helpful to use a text-only browser to sanity-check the monitor while on the machine running the monitor:
$ links http://localhost:9995
Verify that you are not firewalled from the monitor if it is running on a remote host.
The monitor responds, but there are no numbers for tservers and tables. The summary page says the manager is down.
The monitor program gathers all the details about the manager and the tablet servers through the manager. It will be mostly blank if the manager is down. Check for a running manager.
The ZooKeeper information is not available on the Overview page.
The monitor uses the ZooKeeper stat
four-letter-word command to retrieve information.
The ZooKeeper configuration may require explicitly listing the stat
command in the four-letter-word whitelist.
I’ve lost the Accumulo root password, now what?
Running accumulo init --reset-security
will prompt you for a new root password. CAUTION: this command will delete all existing users. You will need to re-create all other users and set permissions accordingly. Running the accumulo admin dumpConfig
command will output current configuration, including current users, which may aid in this process.
Accumulo Processes
My tablet server crashed! The logs say that it lost its zookeeper lock.
Tablet servers reserve a lock in zookeeper to maintain their ownership over the tablets that have been assigned to them. Part of their responsibility for keeping the lock is to send zookeeper a keep-alive message periodically. If the tablet server fails to send a message in a timely fashion, zookeeper will remove the lock and notify the tablet server. If the tablet server does not receive a message from zookeeper, it will assume its lock has been lost, too. If a tablet server loses its lock, it kills itself: everything assumes it is dead already.
Investigate why the tablet server did not send a timely message to zookeeper.
I need to decommission a node. How do I stop the tablet server on it?
Use the admin command:
$ accumulo admin stop hostname:9997
2013-07-16 13:15:38,403 [util.Admin] INFO : Stopping server 12.34.56.78:9997
I cannot login to a tablet server host, and the tablet server will not shut down. How can I kill the server?
Sometimes you can kill a “stuck” tablet server by deleting its lock in zookeeper:
$ accumulo org.apache.accumulo.server.util.TabletServerLocks --list
127.0.0.1:9997 TSERV_CLIENT=127.0.0.1:9997
$ accumulo org.apache.accumulo.server.util.TabletServerLocks -delete 127.0.0.1:9997
$ accumulo org.apache.accumulo.server.util.TabletServerLocks -list
127.0.0.1:9997 null
You can find the manager and instance id for any accumulo instances using the same zookeeper instance:
$ accumulo org.apache.accumulo.server.util.ListInstances
INFO : Using ZooKeepers localhost:2181
Instance Name | Instance ID | Manager
---------------------+--------------------------------------+-------------------------------
"test" | 6140b72e-edd8-4126-b2f5-e74a8bbe323b | 127.0.0.1:9999
One of my Accumulo processes died. How do I bring it back?
The easiest way to bring all services online for an Accumulo instance is to run the accumulo-cluster
script.
$ accumulo-cluster start
This process will check the process listing, using jps
on each host before attempting to restart a service on the given host.
Typically, this check is sufficient except in the face of a hung/zombie process. For large clusters, it may be
undesirable to ssh to every node in the cluster to ensure that all hosts are running the appropriate processes and accumulo-service
may be of use.
$ ssh host_with_dead_process
$ accumulo-service tserver start
My process died again. Should I restart it via cron
or tools like supervisord
?
A repeatedly dying Accumulo process is a sign of a larger problem. Typically, these problems are due to a misconfiguration of Accumulo or over-saturation of resources. Blind automation of any service restart inside of Accumulo is generally an undesirable situation as it is indicative of a problem that is being masked and ignored. Accumulo processes should be stable on the order of months and not require frequent restart.
Accumulo Clients
Accumulo is not showing me any data!
Is your client configured with authorizations that match your visibilities? See the Authorizations documentation for help.
What are my visibilities?
Use the rfile-info tool on a representative file to get some idea of the visibilities in the underlying data.
Note that the use of rfile-info
is an administrative tool and can only
be used by someone who can access the underlying Accumulo data. It
does not provide the normal access controls in Accumulo.
Ingest
Why does my ingest rate periodically go down during heavy ingest?
Periods of zero or low ingest rates can be caused by Java garbage collection pauses in tablet servers. This problem can be mitigated by enabling native maps in tablet servers.
HDFS
Accumulo reads and writes to the Hadoop Distributed File System. Accumulo needs this file system available at all times for normal operations.
Accumulo is having problems “getting a block blk_1234567890123”. How do I fix it?
This troubleshooting guide does not cover HDFS, but in general, you want to make sure that all the datanodes are running and an fsck check finds the file system clean:
$ hadoop fsck /accumulo
You can use:
$ hadoop fsck /accumulo/path/to/corrupt/file -locations -blocks -files
to locate the block references of individual corrupt files and use those references to search the name node and individual data node logs to determine which servers those blocks have been assigned and then try to fix any underlying file system issues on those nodes.
On a larger cluster, you may need to increase the number of Xcievers for HDFS DataNodes:
<property>
<name>dfs.datanode.max.xcievers</name>
<value>4096</value>
</property>
Verify HDFS is healthy, check the datanode logs.
Zookeeper
The accumulo init
command is hanging. It says something about talking to zookeeper.
Zookeeper is also a distributed service. You will need to ensure that it is up. You can run the zookeeper command line tool to connect to any one of the zookeeper servers:
$ zkCli.sh -server zoohost
...
[zk: zoohost:2181(CONNECTED) 0]
It is important to see the word CONNECTED
! If you only see
CONNECTING
you will need to diagnose zookeeper errors.
Check to make sure that zookeeper is up, and that
accumulo.properties
has been pointed to
your zookeeper server(s).
Zookeeper is running, but it does not say CONNECTED
Zookeeper processes talk to each other to elect a leader. All updates go through the leader and propagate to a majority of all the other nodes. If a majority of the nodes cannot be reached, zookeeper will not allow updates. Zookeeper also limits the number connections to a server from any other single host. By default, this limit can be as small as 10 and can be reached in some everything-on-one-machine test configurations.
You can check the election status and connection status of clients by
asking the zookeeper nodes for their status. You connect to zookeeper
and ask it with the four-letter stat
command:
$ nc zoohost 2181
stat
Zookeeper version: 3.4.5-1392090, built on 09/30/2012 17:52 GMT
Clients:
/127.0.0.1:58289[0](queued=0,recved=1,sent=0)
/127.0.0.1:60231[1](queued=0,recved=53910,sent=53915)
Latency min/avg/max: 0/5/3008
Received: 1561459
Sent: 1561592
Connections: 2
Outstanding: 0
Zxid: 0x621a3b
Mode: standalone
Node count: 22524
Check zookeeper status, verify that it has a quorum, and has not exceeded maxClientCnxns.