Class InputFormatBase<K,V>
- All Implemented Interfaces:
org.apache.hadoop.mapred.InputFormat<K,V>
- Direct Known Subclasses:
AccumuloInputFormat, AccumuloRowInputFormat
This InputFormat class allows MapReduce jobs to use Accumulo as the source of K,V pairs.
Subclasses must implement InputFormat.getRecordReader(InputSplit, JobConf, Reporter) to provide a RecordReader for K,V.
A static base class, RecordReaderBase, is provided to retrieve Accumulo Key/Value pairs, but one must implement its RecordReader.next(Object, Object) to transform them to the desired generic types K,V.
See AccumuloInputFormat for an example implementation.
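As an illustration of this contract, the sketch below (not part of the published API documentation) subclasses InputFormatBase, wraps RecordReaderBase, and converts each Accumulo entry into Hadoop Text objects. The class name RowTextInputFormat and the row-to-key, value-to-text mapping are assumptions chosen only for the example.

import java.io.IOException;
import java.util.Map;

import org.apache.accumulo.core.client.mapred.InputFormatBase;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Value;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

// Illustrative only: emits each entry's row as the key and its value bytes as text.
public class RowTextInputFormat extends InputFormatBase<Text,Text> {
  @Override
  public RecordReader<Text,Text> getRecordReader(InputSplit split, JobConf job, Reporter reporter)
      throws IOException {
    RecordReaderBase<Text,Text> reader = new RecordReaderBase<Text,Text>() {
      @Override
      public boolean next(Text key, Text value) throws IOException {
        if (scannerIterator.hasNext()) {                       // entry iterator supplied by the base class
          Map.Entry<Key,Value> entry = scannerIterator.next();
          key.set(entry.getKey().getRow());                    // row id becomes the map key
          value.set(entry.getValue().toString());              // value bytes rendered as text
          return true;
        }
        return false;
      }
      @Override
      public Text createKey() { return new Text(); }
      @Override
      public Text createValue() { return new Text(); }
    };
    reader.initialize(split, job);
    return reader;
  }
}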
-
Nested Class Summary
Modifier and Type / Class / Description
- static class InputFormatBase.RangeInputSplit
Deprecated. Use RangeInputSplit instead.
- protected static class InputFormatBase.RecordReaderBase<K,V>
- Nested classes/interfaces inherited from class org.apache.accumulo.core.client.mapred.AbstractInputFormat
AbstractInputFormat.AbstractRecordReader<K,V>
-
Field Summary
Fields inherited from class org.apache.accumulo.core.client.mapred.AbstractInputFormat
CLASS, log
-
Constructor Summary
InputFormatBase()
-
Method Summary
Modifier and Type / Method / Description
- static void addIterator(org.apache.hadoop.mapred.JobConf job, IteratorSetting cfg): Encode an iterator on the input for this job.
- static void fetchColumns(org.apache.hadoop.mapred.JobConf job, Collection<org.apache.accumulo.core.util.Pair<org.apache.hadoop.io.Text,org.apache.hadoop.io.Text>> columnFamilyColumnQualifierPairs): Restricts the columns that will be mapped over for this job.
- protected static boolean getAutoAdjustRanges(org.apache.hadoop.mapred.JobConf job): Determines whether a configuration has auto-adjust ranges enabled.
- protected static Set<org.apache.accumulo.core.util.Pair<org.apache.hadoop.io.Text,org.apache.hadoop.io.Text>> getFetchedColumns(org.apache.hadoop.mapred.JobConf job): Gets the columns to be mapped over from this job.
- protected static String getInputTableName(org.apache.hadoop.mapred.JobConf job): Gets the table name from the configuration.
- protected static List<IteratorSetting> getIterators(org.apache.hadoop.mapred.JobConf job): Gets a list of the iterator settings (for iterators to apply to a scanner) from this configuration.
- protected static List<Range> getRanges(org.apache.hadoop.mapred.JobConf job): Gets the ranges to scan over from a job.
- protected static org.apache.accumulo.core.client.impl.TabletLocator getTabletLocator(org.apache.hadoop.mapred.JobConf job): Deprecated. since 1.6.0
- static boolean isBatchScan(org.apache.hadoop.mapred.JobConf job): Determines whether a configuration has the BatchScanner feature enabled.
- protected static boolean isIsolated(org.apache.hadoop.mapred.JobConf job): Determines whether a configuration has isolation enabled.
- protected static boolean isOfflineScan(org.apache.hadoop.mapred.JobConf job): Determines whether a configuration has the offline table scan feature enabled.
- static void setAutoAdjustRanges(org.apache.hadoop.mapred.JobConf job, boolean enableFeature): Controls the automatic adjustment of ranges for this job.
- static void setBatchScan(org.apache.hadoop.mapred.JobConf job, boolean enableFeature): Controls the use of the BatchScanner in this job.
- static void setInputTableName(org.apache.hadoop.mapred.JobConf job, String tableName): Sets the name of the input table, over which this job will scan.
- static void setLocalIterators(org.apache.hadoop.mapred.JobConf job, boolean enableFeature): Controls the use of the ClientSideIteratorScanner in this job.
- static void setOfflineTableScan(org.apache.hadoop.mapred.JobConf job, boolean enableFeature): Enable reading offline tables.
- static void setRanges(org.apache.hadoop.mapred.JobConf job, Collection<Range> ranges): Sets the input ranges to scan for this job.
- static void setSamplerConfiguration(org.apache.hadoop.mapred.JobConf job, SamplerConfiguration samplerConfig): Causes input format to read sample data.
- static void setScanIsolation(org.apache.hadoop.mapred.JobConf job, boolean enableFeature): Controls the use of the IsolatedScanner in this job.
- protected static boolean usesLocalIterators(org.apache.hadoop.mapred.JobConf job): Determines whether a configuration uses local iterators.
Methods inherited from class org.apache.accumulo.core.client.mapred.AbstractInputFormat
getAuthenticationToken, getClassLoaderContext, getClientConfiguration, getInputTableConfig, getInputTableConfigs, getInstance, getLogLevel, getPrincipal, getScanAuthorizations, getSplits, getTabletLocator, isConnectorInfoSet, setClassLoaderContext, setConnectorInfo, setConnectorInfo, setLogLevel, setMockInstance, setScanAuthorizations, setZooKeeperInstance, setZooKeeperInstance, validateOptions
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Methods inherited from interface org.apache.hadoop.mapred.InputFormat
getRecordReader
-
Constructor Details
-
InputFormatBase
public InputFormatBase()
-
-
Method Details
-
setInputTableName
public static void setInputTableName(org.apache.hadoop.mapred.JobConf job, String tableName)
Sets the name of the input table, over which this job will scan.
- Parameters:
job
- the Hadoop job instance to be configured
tableName
- the name of the table over which this job will scan
- Since:
- 1.5.0
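For orientation, a minimal configuration sketch follows. The instance name, ZooKeeper host, principal, token, authorizations, and table name are placeholders, and the connection calls are the methods inherited from AbstractInputFormat listed above.

// Illustrative only: placeholder instance, hosts, credentials, and table name.
void configureInput(JobConf job) throws Exception {
  job.setInputFormat(AccumuloInputFormat.class);
  AccumuloInputFormat.setConnectorInfo(job, "mr_user", new PasswordToken("secret"));
  AccumuloInputFormat.setZooKeeperInstance(job,
      ClientConfiguration.loadDefault().withInstance("myInstance").withZkHosts("zk1:2181"));
  AccumuloInputFormat.setScanAuthorizations(job, new Authorizations("public"));
  AccumuloInputFormat.setInputTableName(job, "my_table"); // the table this job scans over
}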
-
getInputTableName
protected static String getInputTableName(org.apache.hadoop.mapred.JobConf job)
Gets the table name from the configuration.
- Parameters:
job
- the Hadoop context for the configured job
- Returns:
- the table name
- Since:
- 1.5.0
- See Also:
-
setRanges
public static void setRanges(org.apache.hadoop.mapred.JobConf job, Collection<Range> ranges)
Sets the input ranges to scan for this job. If not set, the entire table will be scanned.
- Parameters:
job
- the Hadoop job instance to be configured
ranges
- the ranges that will be mapped over
- Since:
- 1.5.0
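A hedged example of supplying ranges; the row boundaries below are placeholders, and any Collection of Range objects will do.

// Illustrative only: restrict the scan to two placeholder row intervals.
// given: org.apache.hadoop.mapred.JobConf job
List<Range> ranges = new ArrayList<>();
ranges.add(new Range("row_0000", "row_0999"));
ranges.add(new Range("row_5000", "row_5999"));
AccumuloInputFormat.setRanges(job, ranges); // the static configuration methods are reachable via subclasses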
-
getRanges
protected static List<Range> getRanges(org.apache.hadoop.mapred.JobConf job) throws IOException
Gets the ranges to scan over from a job.
- Parameters:
job
- the Hadoop context for the configured job
- Returns:
- the ranges
- Throws:
IOException
- if the ranges have been encoded improperly
- Since:
- 1.5.0
- See Also:
-
fetchColumns
public static void fetchColumns(org.apache.hadoop.mapred.JobConf job, Collection<org.apache.accumulo.core.util.Pair<org.apache.hadoop.io.Text,org.apache.hadoop.io.Text>> columnFamilyColumnQualifierPairs)
Restricts the columns that will be mapped over for this job.
- Parameters:
job
- the Hadoop job instance to be configured
columnFamilyColumnQualifierPairs
- a pair of Text objects corresponding to column family and column qualifier. If the column qualifier is null, the entire column family is selected. An empty set is the default and is equivalent to scanning all columns.
- Since:
- 1.5.0
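A hedged example; the column family and qualifier names are placeholders.

// Illustrative only: fetch one specific column plus one whole column family.
// given: org.apache.hadoop.mapred.JobConf job
Collection<Pair<Text,Text>> columns = new ArrayList<>();
columns.add(new Pair<>(new Text("attributes"), new Text("height"))); // family + qualifier
columns.add(new Pair<>(new Text("metadata"), null));                 // null qualifier selects the family
AccumuloInputFormat.fetchColumns(job, columns);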
-
getFetchedColumns
protected static Set<org.apache.accumulo.core.util.Pair<org.apache.hadoop.io.Text,org.apache.hadoop.io.Text>> getFetchedColumns(org.apache.hadoop.mapred.JobConf job)
Gets the columns to be mapped over from this job.
- Parameters:
job
- the Hadoop context for the configured job
- Returns:
- a set of columns
- Since:
- 1.5.0
- See Also:
-
addIterator
public static void addIterator(org.apache.hadoop.mapred.JobConf job, IteratorSetting cfg)
Encode an iterator on the input for this job.
- Parameters:
job
- the Hadoop job instance to be configured
cfg
- the configuration of the iterator
- Since:
- 1.5.0
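A hedged sketch using the built-in VersioningIterator; the priority, iterator name, and option value are arbitrary choices for illustration.

// Illustrative only: push a server-side iterator down to this job's scans.
// given: org.apache.hadoop.mapred.JobConf job
IteratorSetting cfg = new IteratorSetting(50, "vers",
    "org.apache.accumulo.core.iterators.user.VersioningIterator");
cfg.addOption("maxVersions", "1");            // keep only the newest version of each key
AccumuloInputFormat.addIterator(job, cfg);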
-
getIterators
protected static List<IteratorSetting> getIterators(org.apache.hadoop.mapred.JobConf job)
Gets a list of the iterator settings (for iterators to apply to a scanner) from this configuration.
- Parameters:
job
- the Hadoop context for the configured job
- Returns:
- a list of iterators
- Since:
- 1.5.0
- See Also:
-
setAutoAdjustRanges
public static void setAutoAdjustRanges(org.apache.hadoop.mapred.JobConf job, boolean enableFeature)
Controls the automatic adjustment of ranges for this job. This feature merges overlapping ranges, then splits them to align with tablet boundaries. Disabling this feature will cause exactly one Map task to be created for each specified range.
By default, this feature is enabled.
- Parameters:
job
- the Hadoop job instance to be configured
enableFeature
- the feature is enabled if true, disabled otherwise
- Since:
- 1.5.0
- See Also:
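If every configured Range should map to exactly one Map task, the feature can be switched off, as in this brief hedged sketch:

// given: org.apache.hadoop.mapred.JobConf job
// Default is enabled (merge overlapping ranges, then split on tablet boundaries).
AccumuloInputFormat.setAutoAdjustRanges(job, false); // one Map task per configured Range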
-
getAutoAdjustRanges
protected static boolean getAutoAdjustRanges(org.apache.hadoop.mapred.JobConf job)
Determines whether a configuration has auto-adjust ranges enabled. Must be enabled when setBatchScan(JobConf, boolean) is true.
- Parameters:
job
- the Hadoop context for the configured job
- Returns:
- false if the feature is disabled, true otherwise
- Since:
- 1.5.0
- See Also:
-
setScanIsolation
public static void setScanIsolation(org.apache.hadoop.mapred.JobConf job, boolean enableFeature)
Controls the use of the IsolatedScanner in this job.
By default, this feature is disabled.
- Parameters:
job
- the Hadoop job instance to be configured
enableFeature
- the feature is enabled if true, disabled otherwise
- Since:
- 1.5.0
-
isIsolated
protected static boolean isIsolated(org.apache.hadoop.mapred.JobConf job)
Determines whether a configuration has isolation enabled.
- Parameters:
job
- the Hadoop context for the configured job
- Returns:
- true if the feature is enabled, false otherwise
- Since:
- 1.5.0
- See Also:
-
setLocalIterators
public static void setLocalIterators(org.apache.hadoop.mapred.JobConf job, boolean enableFeature)
Controls the use of the ClientSideIteratorScanner in this job. Enabling this feature will cause the iterator stack to be constructed within the Map task, rather than within the Accumulo TServer. To use this feature, all classes needed for those iterators must be available on the classpath for the task.
By default, this feature is disabled.
- Parameters:
job
- the Hadoop job instance to be configured
enableFeature
- the feature is enabled if true, disabled otherwise
- Since:
- 1.5.0
-
usesLocalIterators
protected static boolean usesLocalIterators(org.apache.hadoop.mapred.JobConf job)
Determines whether a configuration uses local iterators.
- Parameters:
job
- the Hadoop context for the configured job
- Returns:
- true if the feature is enabled, false otherwise
- Since:
- 1.5.0
- See Also:
-
setOfflineTableScan
public static void setOfflineTableScan(org.apache.hadoop.mapred.JobConf job, boolean enableFeature)
Enable reading offline tables. By default, this feature is disabled and only online tables are scanned. This will make the map reduce job directly read the table's files. If the table is not offline, then the job will fail. If the table comes online during the map reduce job, it is likely that the job will fail.
To use this option, the map reduce user will need access to read the Accumulo directory in HDFS.
Reading the offline table will create the scan time iterator stack in the map process. So any iterators that are configured for the table will need to be on the mapper's classpath.
One way to use this feature is to clone a table, take the clone offline, and use the clone as the input table for a map reduce job. If you plan to map reduce over the data many times, it may be better to compact the table, clone it, take it offline, and use the clone for all map reduce jobs. The reason to do this is that compaction will reduce each tablet in the table to one file, and it is faster to read from one file. A sketch of this clone-and-offline workflow appears after this method's details.
There are two possible advantages to reading a table's files directly out of HDFS. First, you may see better read performance. Second, it will support speculative execution better. When reading an online table, speculative execution can put more load on an already slow tablet server.
By default, this feature is disabled.
- Parameters:
job
- the Hadoop job instance to be configured
enableFeature
- the feature is enabled if true, disabled otherwise
- Since:
- 1.5.0
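The clone-and-offline workflow described above might look like the following sketch; the Connector, table names, and flush flag are placeholders.

// Illustrative only: clone the source table, take the clone offline, then scan the clone.
// given: org.apache.accumulo.core.client.Connector conn and org.apache.hadoop.mapred.JobConf job
String source = "my_table";
String clone = "my_table_mrclone";
conn.tableOperations().clone(source, clone, true /* flush first */,
    Collections.<String,String>emptyMap(), Collections.<String>emptySet());
conn.tableOperations().offline(clone);
AccumuloInputFormat.setInputTableName(job, clone);
AccumuloInputFormat.setOfflineTableScan(job, true); // read the clone's files directly from HDFS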
-
isOfflineScan
protected static boolean isOfflineScan(org.apache.hadoop.mapred.JobConf job)
Determines whether a configuration has the offline table scan feature enabled.
- Parameters:
job
- the Hadoop context for the configured job
- Returns:
- true if the feature is enabled, false otherwise
- Since:
- 1.5.0
- See Also:
-
setBatchScan
public static void setBatchScan(org.apache.hadoop.mapred.JobConf job, boolean enableFeature)
Controls the use of the BatchScanner in this job. Using this feature will group Ranges by their source tablet, producing an InputSplit per tablet rather than per Range. This batching helps to reduce overhead when querying a large number of small ranges (for example, when doing quad-tree decomposition for spatial queries).
In order to achieve good locality of InputSplits this option always clips the input Ranges to tablet boundaries. This may result in one input Range contributing to several InputSplits.
Note that the value of setAutoAdjustRanges(JobConf, boolean) is ignored and is assumed to be true when the BatchScan option is enabled.
This configuration is incompatible with:
setOfflineTableScan(JobConf, boolean)
setLocalIterators(JobConf, boolean)
setScanIsolation(JobConf, boolean)
By default, this feature is disabled.
- Parameters:
job
- the Hadoop job instance to be configured
enableFeature
- the feature is enabled if true, disabled otherwise
- Since:
- 1.7.0
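For example, when a job supplies many small ranges (such as the quad-tree decomposition case mentioned above), batching them per tablet reduces split overhead; a hedged sketch:

// given: org.apache.hadoop.mapred.JobConf job and a large List<Range> smallRanges
AccumuloInputFormat.setRanges(job, smallRanges);
AccumuloInputFormat.setBatchScan(job, true); // one InputSplit per tablet instead of per Range
// Incompatible with setOfflineTableScan, setLocalIterators, and setScanIsolation.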
-
isBatchScan
public static boolean isBatchScan(org.apache.hadoop.mapred.JobConf job)
Determines whether a configuration has the BatchScanner feature enabled.
- Parameters:
job
- the Hadoop context for the configured job
- Since:
- 1.7.0
- See Also:
-
setSamplerConfiguration
public static void setSamplerConfiguration(org.apache.hadoop.mapred.JobConf job, SamplerConfiguration samplerConfig)
Causes the input format to read sample data. If the sample data was created using a different configuration, or a table's sampler configuration changes while reading data, then the input format will throw an error.
- Parameters:
job
- the Hadoop job instance to be configured
samplerConfig
- the sampler configuration that the sample must have been created with in order for reading sample data to succeed
- Since:
- 1.8.0
- See Also:
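A hedged sketch follows; RowSampler with hasher/modulus options is one common sampler configuration, and the option values are placeholders. Whatever configuration is passed here must match the one the table's sample data was generated with.

// given: org.apache.hadoop.mapred.JobConf job
SamplerConfiguration samplerConfig =
    new SamplerConfiguration("org.apache.accumulo.core.client.sample.RowSampler");
samplerConfig.addOption("hasher", "murmur3_32");
samplerConfig.addOption("modulus", "7");
AccumuloInputFormat.setSamplerConfiguration(job, samplerConfig);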
-
getTabletLocator
@Deprecated
protected static org.apache.accumulo.core.client.impl.TabletLocator getTabletLocator(org.apache.hadoop.mapred.JobConf job) throws TableNotFoundException
Deprecated. since 1.6.0
Initializes an Accumulo TabletLocator based on the configuration.
- Parameters:
job
- the Hadoop context for the configured job
- Returns:
- an Accumulo tablet locator
- Throws:
TableNotFoundException
- if the table name set on the job doesn't exist
- Since:
- 1.5.0
-