Class AccumuloFileOutputFormat
public class AccumuloFileOutputFormat
extends org.apache.hadoop.mapreduce.lib.output.FileOutputFormat<Key,Value>
This class allows MapReduce jobs to write output in the Accumulo data file format. Care should be taken to write only sorted data (sorted by Key), as this is an important requirement of Accumulo data files.
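As a minimal sketch of meeting that requirement (the class name is hypothetical): MapReduce sorts map output by key, so when Key is the map output key class, an identity reducer emits pairs in exactly the order this format expects.

import java.io.IOException;

import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Value;
import org.apache.hadoop.mapreduce.Reducer;

// Identity reducer: keys arrive sorted from the shuffle, so writing the
// pairs through unchanged yields the sorted stream Accumulo files require.
public class SortedKeyValueReducer extends Reducer<Key,Value,Key,Value> {
  @Override
  protected void reduce(Key key, Iterable<Value> values, Context context)
      throws IOException, InterruptedException {
    for (Value value : values) {
      context.write(key, value);
    }
  }
}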
The output path to be created must be specified via setOutputPath(Job, Path), which is inherited from FileOutputFormat.setOutputPath(Job, Path). Other methods from FileOutputFormat are not supported and may be ignored or cause failures. Using other Hadoop configuration options that affect the behavior of the underlying files directly in the Job's configuration may work, but are not directly supported at this time.
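A sketch of a typical job setup, assuming the Accumulo 1.x package org.apache.accumulo.core.client.mapreduce and a placeholder output path:

import java.io.IOException;

import org.apache.accumulo.core.client.mapreduce.AccumuloFileOutputFormat;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Value;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class RFileJobSetup {
  public static Job configure(Configuration conf) throws IOException {
    Job job = Job.getInstance(conf, "write-accumulo-files");
    job.setJarByClass(RFileJobSetup.class);
    // Reducers must emit Accumulo Key/Value pairs, in sorted order.
    job.setOutputKeyClass(Key.class);
    job.setOutputValueClass(Value.class);
    job.setOutputFormatClass(AccumuloFileOutputFormat.class);
    // The one required FileOutputFormat setting: the directory to create.
    FileOutputFormat.setOutputPath(job, new Path("/tmp/accumulo-file-output"));
    return job;
  }
}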
Nested Class Summary
Nested classes/interfaces inherited from class org.apache.hadoop.mapreduce.lib.output.FileOutputFormat
org.apache.hadoop.mapreduce.lib.output.FileOutputFormat.Counter
Field Summary
protected static final org.apache.log4j.Logger log

Fields inherited from class org.apache.hadoop.mapreduce.lib.output.FileOutputFormat
BASE_OUTPUT_NAME, COMPRESS, COMPRESS_CODEC, COMPRESS_TYPE, OUTDIR, PART
Constructor Summary
AccumuloFileOutputFormat()

Method Summary
protected static org.apache.accumulo.core.conf.AccumuloConfiguration
getAccumuloConfiguration(org.apache.hadoop.mapreduce.JobContext context)
Deprecated, since 1.7.0. This method returns a type that is not part of the public API and is not guaranteed to be stable.

org.apache.hadoop.mapreduce.RecordWriter<Key,Value>
getRecordWriter(org.apache.hadoop.mapreduce.TaskAttemptContext context)

static void
setCompressionType(org.apache.hadoop.mapreduce.Job job, String compressionType)
Sets the compression type to use for data blocks.

static void
setDataBlockSize(org.apache.hadoop.mapreduce.Job job, long dataBlockSize)
Sets the size for data blocks within each file. Data blocks are a span of key/value pairs stored in the file that are compressed and indexed as a group.

static void
setFileBlockSize(org.apache.hadoop.mapreduce.Job job, long fileBlockSize)
Sets the size for file blocks in the file system; file blocks are managed, and replicated, by the underlying file system.

static void
setIndexBlockSize(org.apache.hadoop.mapreduce.Job job, long indexBlockSize)
Sets the size for index blocks within each file; smaller blocks mean a deeper index hierarchy within the file, while larger blocks mean a shallower one.

static void
setReplication(org.apache.hadoop.mapreduce.Job job, int replication)
Sets the file system replication factor for the resulting file, overriding the file system default.

static void
setSampler(org.apache.hadoop.mapreduce.Job job, SamplerConfiguration samplerConfig)
Specify a sampler to be used when writing out data.

Methods inherited from class org.apache.hadoop.mapreduce.lib.output.FileOutputFormat
checkOutputSpecs, getCompressOutput, getDefaultWorkFile, getOutputCommitter, getOutputCompressorClass, getOutputName, getOutputPath, getPathForWorkFile, getUniqueFile, getWorkOutputPath, setCompressOutput, setOutputCompressorClass, setOutputName, setOutputPath
Field Details

log
protected static final org.apache.log4j.Logger log
Constructor Details

AccumuloFileOutputFormat
public AccumuloFileOutputFormat()
Method Details

getAccumuloConfiguration
@Deprecated
protected static org.apache.accumulo.core.conf.AccumuloConfiguration getAccumuloConfiguration(org.apache.hadoop.mapreduce.JobContext context)
Deprecated, since 1.7.0. This method returns a type that is not part of the public API and is not guaranteed to be stable. The method was deprecated to discourage its use.
This helper method provides an AccumuloConfiguration object constructed from the Accumulo defaults and overridden with Accumulo properties that have been stored in the Job's configuration.
Parameters:
context - the Hadoop context for the configured job
Since:
1.5.0
setCompressionType
public static void setCompressionType(org.apache.hadoop.mapreduce.Job job, String compressionType)
Sets the compression type to use for data blocks. Specifying a compression type may require additional libraries to be available to your Job.
Parameters:
job - the Hadoop job instance to be configured
compressionType - one of "none", "gz", "lzo", or "snappy"
Since:
1.5.0
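For example, continuing with a Job configured as in the sketch above ("job" is the hypothetical variable from that sketch): gzip is bundled with Hadoop, while "snappy" and "lzo" depend on extra codec libraries being available on the task classpath.

// Use gzip for data-block compression; no additional libraries required.
AccumuloFileOutputFormat.setCompressionType(job, "gz");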
setDataBlockSize
public static void setDataBlockSize(org.apache.hadoop.mapreduce.Job job, long dataBlockSize)
Sets the size for data blocks within each file. Data blocks are a span of key/value pairs stored in the file that are compressed and indexed as a group. Making this value smaller may increase seek performance, but at the cost of increasing the size of the indexes (which can also affect seek performance).
Parameters:
job - the Hadoop job instance to be configured
dataBlockSize - the block size, in bytes
Since:
1.5.0
setFileBlockSize
public static void setFileBlockSize(org.apache.hadoop.mapreduce.Job job, long fileBlockSize)
Sets the size for file blocks in the file system; file blocks are managed, and replicated, by the underlying file system.
Parameters:
job - the Hadoop job instance to be configured
fileBlockSize - the block size, in bytes
Since:
1.5.0
setIndexBlockSize
public static void setIndexBlockSize(org.apache.hadoop.mapreduce.Job job, long indexBlockSize)
Sets the size for index blocks within each file; smaller blocks mean a deeper index hierarchy within the file, while larger blocks mean a shallower one. This can affect the performance of queries.
Parameters:
job - the Hadoop job instance to be configured
indexBlockSize - the block size, in bytes
Since:
1.5.0
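To illustrate the three block-size settings together (continuing the same hypothetical "job"; the values are placeholders, not tuning recommendations):

// Smaller data blocks: finer-grained seeks, at the cost of larger indexes.
AccumuloFileOutputFormat.setDataBlockSize(job, 256 * 1024);
// Larger index blocks: a shallower index hierarchy within the file.
AccumuloFileOutputFormat.setIndexBlockSize(job, 128 * 1024);
// File system (e.g. HDFS) block size for the produced files.
AccumuloFileOutputFormat.setFileBlockSize(job, 512L * 1024 * 1024);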
setReplication
public static void setReplication(org.apache.hadoop.mapreduce.Job job, int replication)
Sets the file system replication factor for the resulting file, overriding the file system default.
Parameters:
job - the Hadoop job instance to be configured
replication - the number of replicas for produced files
Since:
1.5.0
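Continuing the sketch, with a placeholder replica count:

// Request 5 replicas instead of the file system default (commonly 3 on HDFS).
AccumuloFileOutputFormat.setReplication(job, 5);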
setSampler
public static void setSampler(org.apache.hadoop.mapreduce.Job job, SamplerConfiguration samplerConfig)
Specify a sampler to be used when writing out data. This will result in the output file having sample data.
Parameters:
job - the Hadoop job instance to be configured
samplerConfig - the configuration for creating sample data in the output file
Since:
1.8.0
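A sketch using the RowSampler shipped with Accumulo; the "hasher" and "modulus" options belong to Accumulo's hash-based samplers, and the values here are placeholders:

import org.apache.accumulo.core.client.sample.RowSampler;
import org.apache.accumulo.core.client.sample.SamplerConfiguration;

// Keep roughly 1 in 1009 rows in the sample, hashing each row with murmur3.
SamplerConfiguration samplerConfig =
    new SamplerConfiguration(RowSampler.class.getName());
samplerConfig.addOption("hasher", "murmur3_32");
samplerConfig.addOption("modulus", "1009");
AccumuloFileOutputFormat.setSampler(job, samplerConfig);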
getRecordWriter
public org.apache.hadoop.mapreduce.RecordWriter<Key,Value> getRecordWriter(org.apache.hadoop.mapreduce.TaskAttemptContext context) throws IOException
Specified by:
getRecordWriter in class org.apache.hadoop.mapreduce.lib.output.FileOutputFormat<Key,Value>
Throws:
IOException