public abstract class InputFormatBase<K,V> extends AbstractInputFormat<K,V>

This InputFormat class allows MapReduce jobs to use Accumulo as the source of K,V pairs. Subclasses must implement InputFormat.createRecordReader(InputSplit, TaskAttemptContext) to provide a RecordReader for K,V. A static base class, RecordReaderBase, is provided to retrieve Accumulo Key/Value pairs, but one must implement its RecordReader.nextKeyValue() to transform them to the desired generic types K,V. See AccumuloInputFormat for an example implementation.
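As a sketch of what such a subclass might look like, the following hypothetical implementation exposes each entry's row as the map key and its value bytes as the map value. It assumes the protected RecordReaderBase members scannerIterator, currentK, currentV, and numKeysRead from the Accumulo 1.x API; the class name RowValueInputFormat is invented for illustration.

```java
import java.io.IOException;
import java.util.Map;

import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Value;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;

// Hypothetical subclass: maps each Accumulo entry to (row, value bytes).
public class RowValueInputFormat extends InputFormatBase<Text,Text> {
  @Override
  public RecordReader<Text,Text> createRecordReader(InputSplit split, TaskAttemptContext context) {
    return new RecordReaderBase<Text,Text>() {
      @Override
      public boolean nextKeyValue() throws IOException, InterruptedException {
        if (scannerIterator.hasNext()) {
          Map.Entry<Key,Value> entry = scannerIterator.next();
          currentK = entry.getKey().getRow();          // row portion of the Key
          currentV = new Text(entry.getValue().get()); // raw value bytes
          numKeysRead++;
          return true;
        }
        return false;
      }
    };
  }
}
```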
Modifier and Type | Class and Description |
---|---|
static class | InputFormatBase.RangeInputSplit: Deprecated. since 1.5.2; use RangeInputSplit instead. |
protected static class | InputFormatBase.RecordReaderBase<K,V> |

Nested classes/interfaces inherited from class AbstractInputFormat: AbstractInputFormat.AbstractRecordReader<K,V>
Fields inherited from class AbstractInputFormat: CLASS, log
Constructor and Description |
---|
InputFormatBase() |
Modifier and Type | Method and Description |
---|---|
static void | addIterator(org.apache.hadoop.mapreduce.Job job, IteratorSetting cfg): Encode an iterator on the single input table for this job. |
static void | fetchColumns(org.apache.hadoop.mapreduce.Job job, Collection<org.apache.accumulo.core.util.Pair<org.apache.hadoop.io.Text,org.apache.hadoop.io.Text>> columnFamilyColumnQualifierPairs): Restricts the columns that will be mapped over for this job for the default input table. |
protected static boolean | getAutoAdjustRanges(org.apache.hadoop.mapreduce.JobContext context): Determines whether a configuration has auto-adjust ranges enabled. |
protected static Set<org.apache.accumulo.core.util.Pair<org.apache.hadoop.io.Text,org.apache.hadoop.io.Text>> | getFetchedColumns(org.apache.hadoop.mapreduce.JobContext context): Gets the columns to be mapped over from this job. |
protected static String | getInputTableName(org.apache.hadoop.mapreduce.JobContext context): Gets the table name from the configuration. |
protected static List<IteratorSetting> | getIterators(org.apache.hadoop.mapreduce.JobContext context): Gets a list of the iterator settings (for iterators to apply to a scanner) from this configuration. |
protected static List<Range> | getRanges(org.apache.hadoop.mapreduce.JobContext context): Gets the ranges to scan over from a job. |
protected static org.apache.accumulo.core.client.impl.TabletLocator | getTabletLocator(org.apache.hadoop.mapreduce.JobContext context): Deprecated. since 1.6.0 |
static boolean | isBatchScan(org.apache.hadoop.mapreduce.JobContext context): Determines whether a configuration has the BatchScanner feature enabled. |
protected static boolean | isIsolated(org.apache.hadoop.mapreduce.JobContext context): Determines whether a configuration has isolation enabled. |
protected static boolean | isOfflineScan(org.apache.hadoop.mapreduce.JobContext context): Determines whether a configuration has the offline table scan feature enabled. |
static void | setAutoAdjustRanges(org.apache.hadoop.mapreduce.Job job, boolean enableFeature): Controls the automatic adjustment of ranges for this job. |
static void | setBatchScan(org.apache.hadoop.mapreduce.Job job, boolean enableFeature): Controls the use of the BatchScanner in this job. |
static void | setInputTableName(org.apache.hadoop.mapreduce.Job job, String tableName): Sets the name of the input table, over which this job will scan. |
static void | setLocalIterators(org.apache.hadoop.mapreduce.Job job, boolean enableFeature): Controls the use of the ClientSideIteratorScanner in this job. |
static void | setOfflineTableScan(org.apache.hadoop.mapreduce.Job job, boolean enableFeature): Enable reading offline tables. |
static void | setRanges(org.apache.hadoop.mapreduce.Job job, Collection<Range> ranges): Sets the input ranges to scan for the single input table associated with this job. |
static void | setScanIsolation(org.apache.hadoop.mapreduce.Job job, boolean enableFeature): Controls the use of the IsolatedScanner in this job. |
protected static boolean | usesLocalIterators(org.apache.hadoop.mapreduce.JobContext context): Determines whether a configuration uses local iterators. |
Methods inherited from class AbstractInputFormat: getAuthenticationToken, getClientConfiguration, getInputTableConfig, getInputTableConfigs, getInstance, getLogLevel, getPrincipal, getScanAuthorizations, getSplits, getTabletLocator, getToken, getTokenClass, isConnectorInfoSet, setConnectorInfo, setConnectorInfo, setLogLevel, setMockInstance, setScanAuthorizations, setZooKeeperInstance, setZooKeeperInstance, validateOptions
protected static String getInputTableName(org.apache.hadoop.mapreduce.JobContext context)
Gets the table name from the configuration.
Parameters:
context - the Hadoop context for the configured job
See Also:
setInputTableName(Job, String)

public static void setInputTableName(org.apache.hadoop.mapreduce.Job job, String tableName)
Sets the name of the input table, over which this job will scan.
Parameters:
job - the Hadoop job instance to be configured
tableName - the table to use when the tablename is null in the write call

public static void setRanges(org.apache.hadoop.mapreduce.Job job, Collection<Range> ranges)
Sets the input ranges to scan for the single input table associated with this job.
Parameters:
job - the Hadoop job instance to be configured
ranges - the ranges that will be mapped over

protected static List<Range> getRanges(org.apache.hadoop.mapreduce.JobContext context) throws IOException
Gets the ranges to scan over from a job.
Parameters:
context - the Hadoop context for the configured job
Throws:
IOException
See Also:
setRanges(Job, Collection)
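A minimal sketch of configuring ranges for a job; the table name "events" and the row keys are hypothetical, and a concrete subclass such as AccumuloInputFormat is assumed for the static calls.

```java
import java.util.ArrayList;
import java.util.List;

import org.apache.accumulo.core.data.Range;
import org.apache.hadoop.mapreduce.Job;

public class RangeSetup {
  // Configure a job to scan two ranges of a hypothetical "events" table.
  public static void configure(Job job) {
    AccumuloInputFormat.setInputTableName(job, "events");
    List<Range> ranges = new ArrayList<Range>();
    ranges.add(new Range("2018-01-01", "2018-01-31")); // all rows between two keys, inclusive
    ranges.add(Range.exact("2018-02-14"));             // a single exact row
    AccumuloInputFormat.setRanges(job, ranges);
  }
}
```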
public static void fetchColumns(org.apache.hadoop.mapreduce.Job job, Collection<org.apache.accumulo.core.util.Pair<org.apache.hadoop.io.Text,org.apache.hadoop.io.Text>> columnFamilyColumnQualifierPairs)
Restricts the columns that will be mapped over for this job for the default input table.
Parameters:
job - the Hadoop job instance to be configured
columnFamilyColumnQualifierPairs - a pair of Text objects corresponding to column family and column qualifier. If the column qualifier is null, the entire column family is selected. An empty set is the default and is equivalent to scanning all columns.

protected static Set<org.apache.accumulo.core.util.Pair<org.apache.hadoop.io.Text,org.apache.hadoop.io.Text>> getFetchedColumns(org.apache.hadoop.mapreduce.JobContext context)
Gets the columns to be mapped over from this job.
Parameters:
context - the Hadoop context for the configured job
See Also:
fetchColumns(Job, Collection)
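As a sketch, the following restricts a scan to one specific column plus one entire column family; the family and qualifier names are hypothetical, and a concrete subclass such as AccumuloInputFormat is assumed.

```java
import java.util.ArrayList;
import java.util.Collection;

import org.apache.accumulo.core.util.Pair;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;

public class ColumnSetup {
  // Family and qualifier names here are hypothetical.
  public static void configure(Job job) {
    Collection<Pair<Text,Text>> columns = new ArrayList<Pair<Text,Text>>();
    columns.add(new Pair<Text,Text>(new Text("attributes"), new Text("name"))); // one specific column
    columns.add(new Pair<Text,Text>(new Text("metadata"), null));               // null qualifier: whole family
    AccumuloInputFormat.fetchColumns(job, columns);
  }
}
```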
public static void addIterator(org.apache.hadoop.mapreduce.Job job, IteratorSetting cfg)
Encode an iterator on the single input table for this job.
Parameters:
job - the Hadoop job instance to be configured
cfg - the configuration of the iterator

protected static List<IteratorSetting> getIterators(org.apache.hadoop.mapreduce.JobContext context)
Gets a list of the iterator settings (for iterators to apply to a scanner) from this configuration.
Parameters:
context - the Hadoop context for the configured job
See Also:
addIterator(Job, IteratorSetting)
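A sketch of attaching a server-side filter iterator to the job's scans; the priority, iterator name, and row pattern are hypothetical, and a concrete subclass such as AccumuloInputFormat is assumed.

```java
import org.apache.accumulo.core.client.IteratorSetting;
import org.apache.accumulo.core.iterators.user.RegExFilter;
import org.apache.hadoop.mapreduce.Job;

public class IteratorSetup {
  // Attach a row-regex filter; priority, name, and pattern are hypothetical.
  public static void configure(Job job) {
    IteratorSetting regex = new IteratorSetting(50, "rowFilter", RegExFilter.class);
    RegExFilter.setRegexs(regex, "2018-.*", null, null, null, false); // match on row only
    AccumuloInputFormat.addIterator(job, regex);
  }
}
```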
public static void setAutoAdjustRanges(org.apache.hadoop.mapreduce.Job job, boolean enableFeature)
Controls the automatic adjustment of ranges for this job.
By default, this feature is enabled.
Parameters:
job - the Hadoop job instance to be configured
enableFeature - the feature is enabled if true, disabled otherwise
See Also:
setRanges(Job, Collection)

protected static boolean getAutoAdjustRanges(org.apache.hadoop.mapreduce.JobContext context)
Determines whether a configuration has auto-adjust ranges enabled. This is always assumed to be true when setBatchScan(Job, boolean) is true.
Parameters:
context - the Hadoop context for the configured job
See Also:
setAutoAdjustRanges(Job, boolean)
public static void setScanIsolation(org.apache.hadoop.mapreduce.Job job, boolean enableFeature)
Controls the use of the IsolatedScanner in this job.
By default, this feature is disabled.
Parameters:
job - the Hadoop job instance to be configured
enableFeature - the feature is enabled if true, disabled otherwise

protected static boolean isIsolated(org.apache.hadoop.mapreduce.JobContext context)
Determines whether a configuration has isolation enabled.
Parameters:
context - the Hadoop context for the configured job
See Also:
setScanIsolation(Job, boolean)
public static void setLocalIterators(org.apache.hadoop.mapreduce.Job job, boolean enableFeature)
Controls the use of the ClientSideIteratorScanner in this job. Enabling this feature will cause the iterator stack to be constructed within the Map task, rather than within the Accumulo TServer. To use this feature, all classes needed for those iterators must be available on the classpath for the task.
By default, this feature is disabled.
Parameters:
job - the Hadoop job instance to be configured
enableFeature - the feature is enabled if true, disabled otherwise

protected static boolean usesLocalIterators(org.apache.hadoop.mapreduce.JobContext context)
Determines whether a configuration uses local iterators.
Parameters:
context - the Hadoop context for the configured job
See Also:
setLocalIterators(Job, boolean)
public static void setOfflineTableScan(org.apache.hadoop.mapreduce.Job job, boolean enableFeature)
Enable reading offline tables. By default, this feature is disabled and only online tables are scanned. This will make the MapReduce job directly read the table's files. If the table is not offline, then the job will fail. If the table comes online during the MapReduce job, it is likely that the job will fail.
To use this option, the MapReduce user will need access to read the Accumulo directory in HDFS.
Reading the offline table will create the scan-time iterator stack in the map process, so any iterators that are configured for the table will need to be on the mapper's classpath.
One way to use this feature is to clone a table, take the clone offline, and use the clone as the input table for a MapReduce job. If you plan to map reduce over the data many times, it may be better to compact the table, clone it, take it offline, and use the clone for all MapReduce jobs. The reason to do this is that compaction will reduce each tablet in the table to one file, and it is faster to read from one file.
There are two possible advantages to reading a table's files directly out of HDFS. First, you may see better read performance. Second, it will support speculative execution better: when reading an online table, speculative execution can put more load on an already slow tablet server.
Parameters:
job - the Hadoop job instance to be configured
enableFeature - the feature is enabled if true, disabled otherwise

protected static boolean isOfflineScan(org.apache.hadoop.mapreduce.JobContext context)
Determines whether a configuration has the offline table scan feature enabled.
Parameters:
context - the Hadoop context for the configured job
See Also:
setOfflineTableScan(Job, boolean)
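The compact-clone-offline workflow described above can be sketched as follows; the table names are hypothetical, conn is assumed to be an already-authenticated Connector, and a concrete subclass such as AccumuloInputFormat is assumed for the static calls.

```java
import org.apache.accumulo.core.client.Connector;
import org.apache.hadoop.mapreduce.Job;

public class OfflineScanSetup {
  // Table names are hypothetical; "conn" is an authenticated Connector.
  public static void configure(Connector conn, Job job) throws Exception {
    conn.tableOperations().compact("events", null, null, true, true);  // optional: one file per tablet
    conn.tableOperations().clone("events", "events_mr", true, null, null);
    conn.tableOperations().offline("events_mr", true);                 // wait until fully offline

    AccumuloInputFormat.setInputTableName(job, "events_mr");
    AccumuloInputFormat.setOfflineTableScan(job, true);
  }
}
```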
public static void setBatchScan(org.apache.hadoop.mapreduce.Job job, boolean enableFeature)
Controls the use of the BatchScanner in this job. Using this feature will group Ranges by their source tablet, producing an InputSplit per tablet rather than per Range. This batching helps to reduce overhead when querying a large number of small ranges (for example, when doing quad-tree decomposition for spatial queries).
In order to achieve good locality of InputSplits, this option always clips the input Ranges to tablet boundaries. This may result in one input Range contributing to several InputSplits.
Note that the value of setAutoAdjustRanges(Job, boolean) is ignored and is assumed to be true when the BatchScan option is enabled.
This configuration is incompatible with:
setOfflineTableScan(org.apache.hadoop.mapreduce.Job, boolean)
setLocalIterators(org.apache.hadoop.mapreduce.Job, boolean)
setScanIsolation(org.apache.hadoop.mapreduce.Job, boolean)
By default, this feature is disabled.
Parameters:
job - the Hadoop job instance to be configured
enableFeature - the feature is enabled if true, disabled otherwise

public static boolean isBatchScan(org.apache.hadoop.mapreduce.JobContext context)
Determines whether a configuration has the BatchScanner feature enabled.
Parameters:
context - the Hadoop context for the configured job
See Also:
setBatchScan(Job, boolean)
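The many-small-ranges case above might be configured as follows; cellIds is a hypothetical list of quad-tree cell identifiers, and a concrete subclass such as AccumuloInputFormat is assumed.

```java
import java.util.ArrayList;
import java.util.List;

import org.apache.accumulo.core.data.Range;
import org.apache.hadoop.mapreduce.Job;

public class BatchScanSetup {
  // "cellIds" is a hypothetical list of quad-tree cell identifiers.
  public static void configure(Job job, List<String> cellIds) {
    List<Range> lookups = new ArrayList<Range>();
    for (String cellId : cellIds)
      lookups.add(Range.exact(cellId));
    AccumuloInputFormat.setRanges(job, lookups);
    AccumuloInputFormat.setBatchScan(job, true); // one InputSplit per tablet; auto-adjust is implied
  }
}
```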
@Deprecated
protected static org.apache.accumulo.core.client.impl.TabletLocator getTabletLocator(org.apache.hadoop.mapreduce.JobContext context) throws TableNotFoundException
Deprecated. since 1.6.0
Initializes an Accumulo TabletLocator based on the configuration.
Parameters:
context - the Hadoop context for the configured job
Throws:
TableNotFoundException - if the table name set on the configuration doesn't exist

Copyright © 2011–2018 The Apache Software Foundation. All rights reserved.