Class AccumuloFileOutputFormat

java.lang.Object
org.apache.hadoop.mapreduce.OutputFormat<K,V>
org.apache.hadoop.mapreduce.lib.output.FileOutputFormat<Key,Value>
org.apache.accumulo.core.client.mapreduce.AccumuloFileOutputFormat

public class AccumuloFileOutputFormat extends org.apache.hadoop.mapreduce.lib.output.FileOutputFormat<Key,Value>
This class allows MapReduce jobs to write output in the Accumulo data file format.
Care should be taken to write only sorted data (sorted by Key), as this is an important requirement of Accumulo data files.

The output path to be created must be specified via FileOutputFormat.setOutputPath(Job, Path). Other methods inherited from FileOutputFormat are not supported and may be ignored or cause failures. Using other Hadoop configuration options that directly affect the behavior of the underlying files may work, but is not directly supported at this time.
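As a sketch of a typical setup, a job that writes Accumulo files might be configured as follows. This assumes the job's reducer emits Key/Value pairs in sorted order; the class name and output path below are illustrative, not prescribed by the API.

```java
import org.apache.accumulo.core.client.mapreduce.AccumuloFileOutputFormat;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Value;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class BulkFileJob {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance();
        job.setJobName("write-accumulo-files"); // illustrative name

        // The job's map/reduce logic must emit keys in sorted order.
        job.setOutputKeyClass(Key.class);
        job.setOutputValueClass(Value.class);
        job.setOutputFormatClass(AccumuloFileOutputFormat.class);

        // setOutputPath is the only FileOutputFormat method that is supported.
        FileOutputFormat.setOutputPath(job, new Path("/tmp/bulk-output")); // illustrative path

        // Optional tuning; all values here are illustrative.
        AccumuloFileOutputFormat.setCompressionType(job, "gz");
        AccumuloFileOutputFormat.setReplication(job, 3);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

The compression and replication calls are optional; omitting them leaves the Accumulo and file system defaults in effect.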

  • Nested Class Summary

    Nested classes/interfaces inherited from class org.apache.hadoop.mapreduce.lib.output.FileOutputFormat

    org.apache.hadoop.mapreduce.lib.output.FileOutputFormat.Counter
  • Field Summary

    Fields

    protected static final org.apache.log4j.Logger log

    Fields inherited from class org.apache.hadoop.mapreduce.lib.output.FileOutputFormat

    BASE_OUTPUT_NAME, COMPRESS, COMPRESS_CODEC, COMPRESS_TYPE, OUTDIR, PART
  • Constructor Summary

    Constructors

    AccumuloFileOutputFormat()
  • Method Summary

    protected static org.apache.accumulo.core.conf.AccumuloConfiguration getAccumuloConfiguration(org.apache.hadoop.mapreduce.JobContext context)
    Deprecated. Since 1.7.0. This method returns a type that is not part of the public API and is not guaranteed to be stable.

    org.apache.hadoop.mapreduce.RecordWriter<Key,Value> getRecordWriter(org.apache.hadoop.mapreduce.TaskAttemptContext context)

    static void setCompressionType(org.apache.hadoop.mapreduce.Job job, String compressionType)
    Sets the compression type to use for data blocks.

    static void setDataBlockSize(org.apache.hadoop.mapreduce.Job job, long dataBlockSize)
    Sets the size for data blocks within each file. Data blocks are a span of key/value pairs stored in the file that are compressed and indexed as a group.

    static void setFileBlockSize(org.apache.hadoop.mapreduce.Job job, long fileBlockSize)
    Sets the size for file blocks in the file system; file blocks are managed, and replicated, by the underlying file system.

    static void setIndexBlockSize(org.apache.hadoop.mapreduce.Job job, long indexBlockSize)
    Sets the size for index blocks within each file; smaller blocks mean a deeper index hierarchy within the file, while larger blocks mean a shallower one.

    static void setReplication(org.apache.hadoop.mapreduce.Job job, int replication)
    Sets the file system replication factor for the resulting file, overriding the file system default.

    static void setSampler(org.apache.hadoop.mapreduce.Job job, SamplerConfiguration samplerConfig)
    Specifies a sampler to be used when writing out data.

    Methods inherited from class org.apache.hadoop.mapreduce.lib.output.FileOutputFormat

    checkOutputSpecs, getCompressOutput, getDefaultWorkFile, getOutputCommitter, getOutputCompressorClass, getOutputName, getOutputPath, getPathForWorkFile, getUniqueFile, getWorkOutputPath, setCompressOutput, setOutputCompressorClass, setOutputName, setOutputPath

    Methods inherited from class java.lang.Object

    clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
  • Field Details

    • log

      protected static final org.apache.log4j.Logger log
  • Constructor Details

    • AccumuloFileOutputFormat

      public AccumuloFileOutputFormat()
  • Method Details

    • getAccumuloConfiguration

      @Deprecated protected static org.apache.accumulo.core.conf.AccumuloConfiguration getAccumuloConfiguration(org.apache.hadoop.mapreduce.JobContext context)
      Deprecated.
      Since 1.7.0. This method returns a type that is not part of the public API and is not guaranteed to be stable; it was deprecated to discourage its use.
      This helper method provides an AccumuloConfiguration object constructed from the Accumulo defaults, and overridden with Accumulo properties that have been stored in the Job's configuration.
      Parameters:
      context - the Hadoop context for the configured job
      Since:
      1.5.0
    • setCompressionType

      public static void setCompressionType(org.apache.hadoop.mapreduce.Job job, String compressionType)
      Sets the compression type to use for data blocks. Specifying a compression may require additional libraries to be available to your Job.
      Parameters:
      job - the Hadoop job instance to be configured
      compressionType - one of "none", "gz", "lzo", or "snappy"
      Since:
      1.5.0
    • setDataBlockSize

      public static void setDataBlockSize(org.apache.hadoop.mapreduce.Job job, long dataBlockSize)
      Sets the size for data blocks within each file.
      Data blocks are a span of key/value pairs stored in the file that are compressed and indexed as a group.

      Making this value smaller may increase seek performance, but at the cost of increasing the size of the indexes (which can also affect seek performance).

      Parameters:
      job - the Hadoop job instance to be configured
      dataBlockSize - the block size, in bytes
      Since:
      1.5.0
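As a sketch of the tradeoff described above, the block-size setters could be used together like this (the values are illustrative, not recommendations):

```java
import org.apache.accumulo.core.client.mapreduce.AccumuloFileOutputFormat;
import org.apache.hadoop.mapreduce.Job;

public class BlockSizeTuning {
    static void tune(Job job) {
        // Smaller data blocks give finer-grained seeks, at the cost of a larger index.
        AccumuloFileOutputFormat.setDataBlockSize(job, 64L * 1024);
        // A larger index block size keeps the index hierarchy shallower.
        AccumuloFileOutputFormat.setIndexBlockSize(job, 128L * 1024);
    }
}
```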
    • setFileBlockSize

      public static void setFileBlockSize(org.apache.hadoop.mapreduce.Job job, long fileBlockSize)
      Sets the size for file blocks in the file system; file blocks are managed, and replicated, by the underlying file system.
      Parameters:
      job - the Hadoop job instance to be configured
      fileBlockSize - the block size, in bytes
      Since:
      1.5.0
    • setIndexBlockSize

      public static void setIndexBlockSize(org.apache.hadoop.mapreduce.Job job, long indexBlockSize)
      Sets the size for index blocks within each file; smaller blocks mean a deeper index hierarchy within the file, while larger blocks mean a shallower one. This can affect query performance.
      Parameters:
      job - the Hadoop job instance to be configured
      indexBlockSize - the block size, in bytes
      Since:
      1.5.0
    • setReplication

      public static void setReplication(org.apache.hadoop.mapreduce.Job job, int replication)
      Sets the file system replication factor for the resulting file, overriding the file system default.
      Parameters:
      job - the Hadoop job instance to be configured
      replication - the number of replicas for produced files
      Since:
      1.5.0
    • setSampler

      public static void setSampler(org.apache.hadoop.mapreduce.Job job, SamplerConfiguration samplerConfig)
      Specifies a sampler to be used when writing out data. The resulting output file will contain sample data generated by the configured sampler.
      Parameters:
      job - the Hadoop job instance to be configured
      samplerConfig - the configuration for creating sample data in the output file
      Since:
      1.8.0
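For example, row-based sampling could be configured roughly as follows. This sketch assumes Accumulo's built-in RowSampler with its "hasher" and "modulus" options; the specific option values are illustrative.

```java
import org.apache.accumulo.core.client.mapreduce.AccumuloFileOutputFormat;
import org.apache.accumulo.core.client.sample.RowSampler;
import org.apache.accumulo.core.client.sample.SamplerConfiguration;
import org.apache.hadoop.mapreduce.Job;

public class SamplerSetup {
    static void configure(Job job) {
        // Sample roughly 1 in 1009 rows, hashing each row with murmur3_32.
        SamplerConfiguration samplerConfig =
            new SamplerConfiguration(RowSampler.class.getName())
                .addOption("hasher", "murmur3_32")
                .addOption("modulus", "1009");
        AccumuloFileOutputFormat.setSampler(job, samplerConfig);
    }
}
```

Using the same sampler configuration as the destination table allows the written files' sample data to be used by scans over that table.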
    • getRecordWriter

      public org.apache.hadoop.mapreduce.RecordWriter<Key,Value> getRecordWriter(org.apache.hadoop.mapreduce.TaskAttemptContext context) throws IOException
      Specified by:
      getRecordWriter in class org.apache.hadoop.mapreduce.lib.output.FileOutputFormat<Key,Value>
      Throws:
      IOException