Class AbstractHashSampler

All Implemented Interfaces:
Sampler

Direct Known Subclasses:
RowColumnSampler, RowSampler

public abstract class AbstractHashSampler extends Object implements Sampler
A base class that can be used to create Samplers based on hashing. This class offers consistent options for configuring the hash function. The subclass decides which parts of the key to hash.

This class supports two options passed into init(SamplerConfiguration). The first option is hasher, which specifies a hashing algorithm. Valid values for this option are md5, sha1, and murmur3_32. If you are not sure, choose murmur3_32.

The second option is modulus, which accepts any positive integer as its value.

Any data where hash(data) % modulus == 0 will be selected for the sample.
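
The selection rule above can be illustrated with a small standalone sketch. It uses only the JDK, not the Accumulo classes: MD5 from java.security.MessageDigest stands in for the configured hasher, and reducing the digest to an int by reading its first four bytes in little-endian order is an assumption of this sketch. The names ModulusSampleSketch, hashToInt, and selected are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class ModulusSampleSketch {

  // Reduces an MD5 digest to an int using its first four bytes, little-endian.
  // How the real class reduces a digest to an int is not stated in this page;
  // this choice is an assumption made for illustration.
  static int hashToInt(byte[] data) {
    try {
      byte[] d = MessageDigest.getInstance("MD5").digest(data);
      return (d[0] & 0xff) | ((d[1] & 0xff) << 8) | ((d[2] & 0xff) << 16) | ((d[3] & 0xff) << 24);
    } catch (NoSuchAlgorithmException e) {
      // MD5 is a required algorithm on every Java platform, so this is unexpected.
      throw new IllegalStateException(e);
    }
  }

  // Applies the documented rule: data is selected when hash(data) % modulus == 0.
  static boolean selected(byte[] data, int modulus) {
    if (modulus <= 0) {
      throw new IllegalArgumentException("modulus must be a positive integer");
    }
    return hashToInt(data) % modulus == 0;
  }

  public static void main(String[] args) {
    int modulus = 1009;
    int total = 100_000;
    int kept = 0;
    for (int i = 0; i < total; i++) {
      if (selected(("row" + i).getBytes(StandardCharsets.UTF_8), modulus)) {
        kept++;
      }
    }
    // With a well-mixed hash, roughly total / modulus items should be kept.
    System.out.println(kept + " of " + total + " items selected");
  }
}
```

Because the decision depends only on the hashed bytes and the modulus, the same data is always either in or out of the sample. A larger modulus gives a smaller sample, and a modulus of 1 selects everything.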

  • Field Details

    • REQUIRED_SAMPLER_OPTIONS

      protected static final Set<String> REQUIRED_SAMPLER_OPTIONS
  • Constructor Details

    • AbstractHashSampler

      public AbstractHashSampler()
  • Method Details

    • validateOptions

      public void validateOptions(Map<String,String> config)
      Subclasses with options should override this method to validate subclass options while also calling super.validateOptions(config) to validate base class options.
      Specified by:
      validateOptions in interface Sampler
      Parameters:
      config - Sampler options configuration to validate. Validates option and value.
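
As a rough illustration of the constraints the class description places on the base-class options, the standalone sketch below checks a hasher and a modulus value. It is not the class's actual implementation: HashOptionCheckSketch, validate, and isValid are hypothetical names, and treating both options as required is an assumption of this sketch.

```java
import java.util.Map;
import java.util.Set;

class HashOptionCheckSketch {

  // Hasher names listed as valid in the class description.
  static final Set<String> HASHERS = Set.of("md5", "sha1", "murmur3_32");

  // Throws IllegalArgumentException when an option is missing or malformed.
  // Treating both options as required is an assumption of this sketch.
  static void validate(Map<String, String> options) {
    String hasher = options.get("hasher");
    if (hasher == null || !HASHERS.contains(hasher)) {
      throw new IllegalArgumentException("hasher must be one of " + HASHERS + ", got: " + hasher);
    }
    String modulus = options.get("modulus");
    if (modulus == null) {
      throw new IllegalArgumentException("modulus is required");
    }
    // Integer.parseInt throws NumberFormatException, a subclass of
    // IllegalArgumentException, when the value is not an integer.
    if (Integer.parseInt(modulus) <= 0) {
      throw new IllegalArgumentException("modulus must be a positive integer, got: " + modulus);
    }
  }

  // Convenience wrapper that reports validity as a boolean.
  static boolean isValid(Map<String, String> options) {
    try {
      validate(options);
      return true;
    } catch (IllegalArgumentException e) {
      return false;
    }
  }

  public static void main(String[] args) {
    System.out.println(isValid(Map.of("hasher", "murmur3_32", "modulus", "1009"))); // true
    System.out.println(isValid(Map.of("hasher", "sha256", "modulus", "1009")));     // false
    System.out.println(isValid(Map.of("hasher", "md5", "modulus", "0")));           // false
  }
}
```

A subclass that adds its own options would perform its own checks in addition to delegating to super.validateOptions(config), as described above.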
    • isValidOption

      @Deprecated(since="2.1.0") protected boolean isValidOption(String option)
      Deprecated. Since 2.1.0, replaced by validateOptions(Map).
      Subclasses with options should override this method and return true if the option is valid for the subclass or if super.isValidOption(opt) returns true.
    • init

      public void init(SamplerConfiguration config)
      Subclasses with options should override this method and call super.init(config).
      Specified by:
      init in interface Sampler
      Parameters:
      config - Configuration options for a sampler.
    • hash

      protected abstract void hash(DataOutput hasher, Key k) throws IOException
      Subclasses must override this method and hash some portion of the key.
      Parameters:
      hasher - Data written to this will be used to compute the hash for the key.
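
To make "hash some portion of the key" concrete, the standalone sketch below writes either the row alone, or the row plus column family, to a java.io.DataOutput. SimpleKey and KeyHasher are hypothetical stand-ins introduced for illustration. A real subclass would receive an Accumulo Key and the DataOutput supplied by this class.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class HashPortionSketch {

  // Hypothetical stand-in for a key with a row and a column family.
  static final class SimpleKey {
    final byte[] row;
    final byte[] family;

    SimpleKey(String row, String family) {
      this.row = row.getBytes(StandardCharsets.UTF_8);
      this.family = family.getBytes(StandardCharsets.UTF_8);
    }
  }

  // Hypothetical interface mirroring the shape of hash(DataOutput, Key).
  interface KeyHasher {
    void hash(DataOutput hasher, SimpleKey k) throws IOException;
  }

  // Feeds only the row to the hash, so every entry in a row shares one decision.
  static final KeyHasher ROW_ONLY = (hasher, k) -> hasher.write(k.row);

  // Feeds the row and the column family, so the decision can differ per family.
  static final KeyHasher ROW_AND_FAMILY = (hasher, k) -> {
    hasher.write(k.row);
    hasher.write(k.family);
  };

  // Captures the bytes a KeyHasher would contribute to the hash computation.
  static byte[] hashedBytes(KeyHasher h, SimpleKey k) {
    try {
      ByteArrayOutputStream buf = new ByteArrayOutputStream();
      DataOutputStream out = new DataOutputStream(buf);
      h.hash(out, k);
      out.flush();
      return buf.toByteArray();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    SimpleKey a = new SimpleKey("user123", "profile");
    SimpleKey b = new SimpleKey("user123", "orders");
    // Same row, different family: identical input under ROW_ONLY.
    System.out.println(Arrays.equals(hashedBytes(ROW_ONLY, a), hashedBytes(ROW_ONLY, b)));             // true
    System.out.println(Arrays.equals(hashedBytes(ROW_AND_FAMILY, a), hashedBytes(ROW_AND_FAMILY, b))); // false
  }
}
```

Keys that contribute identical bytes always receive the same sampling decision, so the portion of the key a subclass writes determines the granularity of the sample.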
    • accept

      public boolean accept(Key k)
      Specified by:
      accept in interface Sampler
      Parameters:
      k - A key that was written to an rfile.
      Returns:
      true if the key (and its associated value) should be stored in the rfile's sample; false if it should not be included.