public abstract class InputFormatBase<K,V> extends Object implements org.apache.hadoop.mapred.InputFormat<K,V>
This InputFormat class allows MapReduce jobs to use Accumulo as the source of K,V pairs.
Subclasses must implement InputFormat.getRecordReader(InputSplit, JobConf, Reporter) to provide a RecordReader for K,V.
A static base class, RecordReaderBase, is provided to retrieve Accumulo Key/Value pairs, but one must implement its RecordReader.next(Object, Object) to transform them to the desired generic types K,V.
See AccumuloInputFormat for an example implementation.
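A minimal subclass might look like the following sketch. It assumes the org.apache.accumulo.core.client.mapred package and 1.5-era RecordReaderBase internals (a protected scannerIterator over Key/Value entries and an initialize(InputSplit, JobConf) hook); the class name and the row/value mapping are purely illustrative:

```java
import java.io.IOException;
import java.util.Map.Entry;

import org.apache.accumulo.core.client.mapred.InputFormatBase;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Value;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

// Hypothetical subclass: emits (row, cell value) as Text pairs instead of raw Key/Value.
public class RowValueInputFormat extends InputFormatBase<Text,Text> {

  @Override
  public RecordReader<Text,Text> getRecordReader(InputSplit split, JobConf job,
      Reporter reporter) throws IOException {
    RecordReaderBase<Text,Text> reader = new RecordReaderBase<Text,Text>() {
      @Override
      public boolean next(Text key, Text value) throws IOException {
        if (!scannerIterator.hasNext())
          return false; // no more entries in this split
        Entry<Key,Value> entry = scannerIterator.next();
        key.set(entry.getKey().getRow());        // row id -> map input key
        value.set(entry.getValue().toString());  // cell value -> map input value
        return true;
      }

      @Override
      public Text createKey() {
        return new Text();
      }

      @Override
      public Text createValue() {
        return new Text();
      }
    };
    reader.initialize(split, job); // assumed initialization hook; may differ by version
    return reader;
  }
}
```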
Modifier and Type | Class and Description |
---|---|
static class | InputFormatBase.RangeInputSplit Deprecated. since 1.5.2; use RangeInputSplit instead. |
protected static class | InputFormatBase.RecordReaderBase<K,V> |
Modifier and Type | Field and Description |
---|---|
protected static org.apache.log4j.Logger | log |
Constructor and Description |
---|
InputFormatBase() |
Modifier and Type | Method and Description |
---|---|
static void | addIterator(org.apache.hadoop.mapred.JobConf job, IteratorSetting cfg) Encode an iterator on the input for this job. |
static void | fetchColumns(org.apache.hadoop.mapred.JobConf job, Collection<org.apache.accumulo.core.util.Pair<org.apache.hadoop.io.Text,org.apache.hadoop.io.Text>> columnFamilyColumnQualifierPairs) Restricts the columns that will be mapped over for this job. |
protected static boolean | getAutoAdjustRanges(org.apache.hadoop.mapred.JobConf job) Determines whether a configuration has auto-adjust ranges enabled. |
protected static Set<org.apache.accumulo.core.util.Pair<org.apache.hadoop.io.Text,org.apache.hadoop.io.Text>> | getFetchedColumns(org.apache.hadoop.mapred.JobConf job) Gets the columns to be mapped over from this job. |
protected static String | getInputTableName(org.apache.hadoop.mapred.JobConf job) Gets the table name from the configuration. |
protected static Instance | getInstance(org.apache.hadoop.mapred.JobConf job) Initializes an Accumulo Instance based on the configuration. |
protected static List<IteratorSetting> | getIterators(org.apache.hadoop.mapred.JobConf job) Gets a list of the iterator settings (for iterators to apply to a scanner) from this configuration. |
protected static org.apache.log4j.Level | getLogLevel(org.apache.hadoop.mapred.JobConf job) Gets the log level from this configuration. |
protected static String | getPrincipal(org.apache.hadoop.mapred.JobConf job) Gets the user name from the configuration. |
protected static List<Range> | getRanges(org.apache.hadoop.mapred.JobConf job) Gets the ranges to scan over from a job. |
protected static Authorizations | getScanAuthorizations(org.apache.hadoop.mapred.JobConf job) Gets the authorizations to set for the scans from the configuration. |
org.apache.hadoop.mapred.InputSplit[] | getSplits(org.apache.hadoop.mapred.JobConf job, int numSplits) Read the metadata table to get tablets and match up ranges to them. |
protected static org.apache.accumulo.core.client.impl.TabletLocator | getTabletLocator(org.apache.hadoop.mapred.JobConf job) Initializes an Accumulo TabletLocator based on the configuration. |
protected static byte[] | getToken(org.apache.hadoop.mapred.JobConf job) Gets the password from the configuration. |
protected static String | getTokenClass(org.apache.hadoop.mapred.JobConf job) Gets the serialized token class from the configuration. |
protected static Boolean | isConnectorInfoSet(org.apache.hadoop.mapred.JobConf job) Determines if the connector has been configured. |
protected static boolean | isIsolated(org.apache.hadoop.mapred.JobConf job) Determines whether a configuration has isolation enabled. |
protected static boolean | isOfflineScan(org.apache.hadoop.mapred.JobConf job) Determines whether a configuration has the offline table scan feature enabled. |
static void | setAutoAdjustRanges(org.apache.hadoop.mapred.JobConf job, boolean enableFeature) Controls the automatic adjustment of ranges for this job. |
static void | setConnectorInfo(org.apache.hadoop.mapred.JobConf job, String principal, AuthenticationToken token) Sets the connector information needed to communicate with Accumulo in this job. |
static void | setInputTableName(org.apache.hadoop.mapred.JobConf job, String tableName) Sets the name of the input table, over which this job will scan. |
static void | setLocalIterators(org.apache.hadoop.mapred.JobConf job, boolean enableFeature) Controls the use of the ClientSideIteratorScanner in this job. |
static void | setLogLevel(org.apache.hadoop.mapred.JobConf job, org.apache.log4j.Level level) Sets the log level for this job. |
static void | setMockInstance(org.apache.hadoop.mapred.JobConf job, String instanceName) Configures a MockInstance for this job. |
static void | setOfflineTableScan(org.apache.hadoop.mapred.JobConf job, boolean enableFeature) Enable reading offline tables. |
static void | setRanges(org.apache.hadoop.mapred.JobConf job, Collection<Range> ranges) Sets the input ranges to scan for this job. |
static void | setScanAuthorizations(org.apache.hadoop.mapred.JobConf job, Authorizations auths) Sets the Authorizations used to scan. |
static void | setScanIsolation(org.apache.hadoop.mapred.JobConf job, boolean enableFeature) Controls the use of the IsolatedScanner in this job. |
static void | setZooKeeperInstance(org.apache.hadoop.mapred.JobConf job, String instanceName, String zooKeepers) Configures a ZooKeeperInstance for this job. |
protected static boolean | usesLocalIterators(org.apache.hadoop.mapred.JobConf job) Determines whether a configuration uses local iterators. |
protected static void | validateOptions(org.apache.hadoop.mapred.JobConf job) Check whether a configuration is fully configured to be used with an Accumulo InputFormat. |
public static void setConnectorInfo(org.apache.hadoop.mapred.JobConf job, String principal, AuthenticationToken token) throws AccumuloSecurityException
Sets the connector information needed to communicate with Accumulo in this job.
WARNING: The serialized token is stored in the configuration and shared with all MapReduce tasks. It is BASE64 encoded to provide a charset safe conversion to a string, and is not intended to be secure.
Parameters:
  job - the Hadoop job instance to be configured
  principal - a valid Accumulo user name (user must have Table.CREATE permission)
  token - the user's password
Throws:
  AccumuloSecurityException
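A usage sketch, assuming PasswordToken as the AuthenticationToken implementation; the principal and password are placeholders:

```java
import org.apache.accumulo.core.client.AccumuloSecurityException;
import org.apache.accumulo.core.client.mapred.InputFormatBase;
import org.apache.accumulo.core.client.security.tokens.PasswordToken;
import org.apache.hadoop.mapred.JobConf;

public class ConnectorInfoExample {
  public static void main(String[] args) throws AccumuloSecurityException {
    JobConf job = new JobConf();
    // The token is BASE64-serialized into the configuration and visible to
    // every task, so treat the configuration itself as sensitive.
    InputFormatBase.setConnectorInfo(job, "mapreduce_user",
        new PasswordToken("placeholder-password"));
  }
}
```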
protected static Boolean isConnectorInfoSet(org.apache.hadoop.mapred.JobConf job)
Determines if the connector has been configured.
Parameters:
  job - the Hadoop context for the configured job
See Also:
  setConnectorInfo(JobConf, String, AuthenticationToken)
protected static String getPrincipal(org.apache.hadoop.mapred.JobConf job)
Gets the user name from the configuration.
Parameters:
  job - the Hadoop context for the configured job
See Also:
  setConnectorInfo(JobConf, String, AuthenticationToken)
protected static String getTokenClass(org.apache.hadoop.mapred.JobConf job)
Gets the serialized token class from the configuration.
Parameters:
  job - the Hadoop context for the configured job
See Also:
  setConnectorInfo(JobConf, String, AuthenticationToken)
protected static byte[] getToken(org.apache.hadoop.mapred.JobConf job)
Gets the password from the configuration.
Parameters:
  job - the Hadoop context for the configured job
See Also:
  setConnectorInfo(JobConf, String, AuthenticationToken)
public static void setZooKeeperInstance(org.apache.hadoop.mapred.JobConf job, String instanceName, String zooKeepers)
Configures a ZooKeeperInstance for this job.
Parameters:
  job - the Hadoop job instance to be configured
  instanceName - the Accumulo instance name
  zooKeepers - a comma-separated list of zookeeper servers

public static void setMockInstance(org.apache.hadoop.mapred.JobConf job, String instanceName)
Configures a MockInstance for this job.
Parameters:
  job - the Hadoop job instance to be configured
  instanceName - the Accumulo instance name

protected static Instance getInstance(org.apache.hadoop.mapred.JobConf job)
Initializes an Accumulo Instance based on the configuration.
Parameters:
  job - the Hadoop context for the configured job
See Also:
  setZooKeeperInstance(JobConf, String, String), setMockInstance(JobConf, String)
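A sketch of pointing a job at an instance; the instance name and ZooKeeper quorum are placeholders:

```java
import org.apache.accumulo.core.client.mapred.InputFormatBase;
import org.apache.hadoop.mapred.JobConf;

public class InstanceConfigExample {
  public static void main(String[] args) {
    JobConf job = new JobConf();
    // Point the job at a live instance by name and ZooKeeper quorum...
    InputFormatBase.setZooKeeperInstance(job, "myInstance",
        "zk1.example.com:2181,zk2.example.com:2181");
    // ...or, for tests, swap in an in-memory instance instead:
    // InputFormatBase.setMockInstance(job, "testInstance");
  }
}
```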
public static void setLogLevel(org.apache.hadoop.mapred.JobConf job, org.apache.log4j.Level level)
Sets the log level for this job.
Parameters:
  job - the Hadoop job instance to be configured
  level - the logging level

protected static org.apache.log4j.Level getLogLevel(org.apache.hadoop.mapred.JobConf job)
Gets the log level from this configuration.
Parameters:
  job - the Hadoop context for the configured job
See Also:
  setLogLevel(JobConf, Level)
public static void setInputTableName(org.apache.hadoop.mapred.JobConf job, String tableName)
Sets the name of the input table, over which this job will scan.
Parameters:
  job - the Hadoop job instance to be configured
  tableName - the name of the input table

protected static String getInputTableName(org.apache.hadoop.mapred.JobConf job)
Gets the table name from the configuration.
Parameters:
  job - the Hadoop context for the configured job
See Also:
  setInputTableName(JobConf, String)
public static void setScanAuthorizations(org.apache.hadoop.mapred.JobConf job, Authorizations auths)
Sets the Authorizations used to scan. Must be a subset of the user's authorizations. Defaults to the empty set.
Parameters:
  job - the Hadoop job instance to be configured
  auths - the user's authorizations

protected static Authorizations getScanAuthorizations(org.apache.hadoop.mapred.JobConf job)
Gets the authorizations to set for the scans from the configuration.
Parameters:
  job - the Hadoop context for the configured job
See Also:
  setScanAuthorizations(JobConf, Authorizations)
public static void setRanges(org.apache.hadoop.mapred.JobConf job, Collection<Range> ranges)
Sets the input ranges to scan for this job.
Parameters:
  job - the Hadoop job instance to be configured
  ranges - the ranges that will be mapped over

protected static List<Range> getRanges(org.apache.hadoop.mapred.JobConf job) throws IOException
Gets the ranges to scan over from a job.
Parameters:
  job - the Hadoop context for the configured job
Throws:
  IOException - if the ranges have been encoded improperly
See Also:
  setRanges(JobConf, Collection)
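A sketch of configuring ranges; the row keys are illustrative:

```java
import java.util.Arrays;

import org.apache.accumulo.core.client.mapred.InputFormatBase;
import org.apache.accumulo.core.data.Range;
import org.apache.hadoop.mapred.JobConf;

public class RangesExample {
  public static void main(String[] args) {
    JobConf job = new JobConf();
    InputFormatBase.setRanges(job, Arrays.asList(
        new Range("row_a", "row_m"),  // rows "row_a" through "row_m"
        new Range("row_z")));         // the single row "row_z"
  }
}
```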
public static void fetchColumns(org.apache.hadoop.mapred.JobConf job, Collection<org.apache.accumulo.core.util.Pair<org.apache.hadoop.io.Text,org.apache.hadoop.io.Text>> columnFamilyColumnQualifierPairs)
Restricts the columns that will be mapped over for this job.
Parameters:
  job - the Hadoop job instance to be configured
  columnFamilyColumnQualifierPairs - a pair of Text objects corresponding to column family and column qualifier. If the column qualifier is null, the entire column family is selected. An empty set is the default and is equivalent to scanning all columns.

protected static Set<org.apache.accumulo.core.util.Pair<org.apache.hadoop.io.Text,org.apache.hadoop.io.Text>> getFetchedColumns(org.apache.hadoop.mapred.JobConf job)
Gets the columns to be mapped over from this job.
Parameters:
  job - the Hadoop context for the configured job
See Also:
  fetchColumns(JobConf, Collection)
public static void addIterator(org.apache.hadoop.mapred.JobConf job, IteratorSetting cfg)
Encode an iterator on the input for this job.
Parameters:
  job - the Hadoop job instance to be configured
  cfg - the configuration of the iterator

protected static List<IteratorSetting> getIterators(org.apache.hadoop.mapred.JobConf job)
Gets a list of the iterator settings (for iterators to apply to a scanner) from this configuration.
Parameters:
  job - the Hadoop context for the configured job
See Also:
  addIterator(JobConf, IteratorSetting)
public static void setAutoAdjustRanges(org.apache.hadoop.mapred.JobConf job, boolean enableFeature)
Controls the automatic adjustment of ranges for this job.
By default, this feature is enabled.
Parameters:
  job - the Hadoop job instance to be configured
  enableFeature - the feature is enabled if true, disabled otherwise
See Also:
  setRanges(JobConf, Collection)
protected static boolean getAutoAdjustRanges(org.apache.hadoop.mapred.JobConf job)
Determines whether a configuration has auto-adjust ranges enabled.
Parameters:
  job - the Hadoop context for the configured job
See Also:
  setAutoAdjustRanges(JobConf, boolean)
public static void setScanIsolation(org.apache.hadoop.mapred.JobConf job, boolean enableFeature)
Controls the use of the IsolatedScanner in this job.
By default, this feature is disabled.
Parameters:
  job - the Hadoop job instance to be configured
  enableFeature - the feature is enabled if true, disabled otherwise

protected static boolean isIsolated(org.apache.hadoop.mapred.JobConf job)
Determines whether a configuration has isolation enabled.
Parameters:
  job - the Hadoop context for the configured job
See Also:
  setScanIsolation(JobConf, boolean)
public static void setLocalIterators(org.apache.hadoop.mapred.JobConf job, boolean enableFeature)
Controls the use of the ClientSideIteratorScanner in this job. Enabling this feature will cause the iterator stack to be constructed within the Map task, rather than within the Accumulo TServer. To use this feature, all classes needed for those iterators must be available on the classpath for the task.
By default, this feature is disabled.
Parameters:
  job - the Hadoop job instance to be configured
  enableFeature - the feature is enabled if true, disabled otherwise

protected static boolean usesLocalIterators(org.apache.hadoop.mapred.JobConf job)
Determines whether a configuration uses local iterators.
Parameters:
  job - the Hadoop context for the configured job
See Also:
  setLocalIterators(JobConf, boolean)
public static void setOfflineTableScan(org.apache.hadoop.mapred.JobConf job, boolean enableFeature)
Enable reading offline tables. By default, this feature is disabled and only online tables are scanned. This will make the map reduce job directly read the table's files. If the table is not offline, then the job will fail. If the table comes online during the map reduce job, it is likely that the job will fail.
To use this option, the map reduce user will need access to read the Accumulo directory in HDFS.
Reading the offline table will create the scan-time iterator stack in the map process, so any iterators that are configured for the table will need to be on the mapper's classpath.
One way to use this feature is to clone a table, take the clone offline, and use the clone as the input table for a map reduce job. If you plan to map reduce over the data many times, it may be better to compact the table, clone it, take it offline, and use the clone for all map reduce jobs. The reason to do this is that compaction will reduce each tablet in the table to one file, and it is faster to read from one file.
There are two possible advantages to reading a table's files directly out of HDFS. First, you may see better read performance. Second, it will support speculative execution better. When reading an online table, speculative execution can put more load on an already slow tablet server.
By default, this feature is disabled.
Parameters:
  job - the Hadoop job instance to be configured
  enableFeature - the feature is enabled if true, disabled otherwise

protected static boolean isOfflineScan(org.apache.hadoop.mapred.JobConf job)
Determines whether a configuration has the offline table scan feature enabled.
Parameters:
  job - the Hadoop context for the configured job
See Also:
  setOfflineTableScan(JobConf, boolean)
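The clone-then-offline workflow described above might be sketched as follows, assuming an already-authenticated Connector and illustrative table names:

```java
import java.util.Collections;

import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.mapred.InputFormatBase;
import org.apache.hadoop.mapred.JobConf;

public class OfflineScanSetup {
  // conn is an existing, authenticated Connector; table names are illustrative.
  public static void configure(Connector conn, JobConf job) throws Exception {
    // Clone the live table (flushing first), then take the clone offline.
    conn.tableOperations().clone("events", "events_clone", /* flush */ true,
        Collections.<String,String> emptyMap(), Collections.<String> emptySet());
    conn.tableOperations().offline("events_clone");
    // Point the job at the offline clone and enable the offline scan feature.
    InputFormatBase.setInputTableName(job, "events_clone");
    InputFormatBase.setOfflineTableScan(job, true);
  }
}
```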
protected static org.apache.accumulo.core.client.impl.TabletLocator getTabletLocator(org.apache.hadoop.mapred.JobConf job) throws TableNotFoundException
Initializes an Accumulo TabletLocator based on the configuration.
Parameters:
  job - the Hadoop context for the configured job
Throws:
  TableNotFoundException - if the table name set on the configuration doesn't exist

protected static void validateOptions(org.apache.hadoop.mapred.JobConf job) throws IOException
Check whether a configuration is fully configured to be used with an Accumulo InputFormat.
Parameters:
  job - the Hadoop context for the configured job
Throws:
  IOException - if the context is improperly configured

public org.apache.hadoop.mapred.InputSplit[] getSplits(org.apache.hadoop.mapred.JobConf job, int numSplits) throws IOException
Read the metadata table to get tablets and match up ranges to them.
Specified by:
  getSplits in interface org.apache.hadoop.mapred.InputFormat<K,V>
Throws:
  IOException
Copyright © 2011-2016 The Apache Software Foundation. All Rights Reserved.