Apache Accumulo Regex Group Balancer Example
For some data access patterns, it's important to spread groups of tablets within a table out evenly. Accumulo has a balancer that can do this using a regular expression to group tablets. This example shows how this balancer spreads four groups of tablets within a table evenly across 17 tablet servers.
The commands below create a table and add splits. For this example, all tablets whose split points start with the same two digits should be on different tservers. This gives us four groups of tablets: 01, 02, 03, and 04.
root@accumulo> createtable testRGB
root@accumulo testRGB> addsplits -t testRGB 01b 01m 01r 01z 02b 02m 02r 02z 03b 03m 03r 03z 04a 04b 04c 04d 04e 04f 04g 04h 04i 04j 04k 04l 04m 04n 04o 04p
root@accumulo testRGB> tables -l
accumulo.metadata => !0
accumulo.replication => +rep
accumulo.root => +r
testRGB => 2
trace => 1
After adding the splits, look at the tablet locations in the metadata table.
root@accumulo testRGB> scan -t accumulo.metadata -b 2; -e 2< -c loc
2;01b loc:34a5f6e086b000c [] ip-10-1-2-25:9997
2;01m loc:34a5f6e086b000c [] ip-10-1-2-25:9997
2;01r loc:14a5f6e079d0011 [] ip-10-1-2-15:9997
2;01z loc:14a5f6e079d000f [] ip-10-1-2-13:9997
2;02b loc:34a5f6e086b000b [] ip-10-1-2-26:9997
2;02m loc:14a5f6e079d000c [] ip-10-1-2-28:9997
2;02r loc:14a5f6e079d0012 [] ip-10-1-2-27:9997
2;02z loc:14a5f6e079d0012 [] ip-10-1-2-27:9997
2;03b loc:14a5f6e079d000d [] ip-10-1-2-21:9997
2;03m loc:14a5f6e079d000e [] ip-10-1-2-20:9997
2;03r loc:14a5f6e079d000d [] ip-10-1-2-21:9997
2;03z loc:14a5f6e079d000e [] ip-10-1-2-20:9997
2;04a loc:34a5f6e086b000b [] ip-10-1-2-26:9997
2;04b loc:14a5f6e079d0010 [] ip-10-1-2-17:9997
2;04c loc:14a5f6e079d0010 [] ip-10-1-2-17:9997
2;04d loc:24a5f6e07d3000c [] ip-10-1-2-16:9997
2;04e loc:24a5f6e07d3000d [] ip-10-1-2-29:9997
2;04f loc:24a5f6e07d3000c [] ip-10-1-2-16:9997
2;04g loc:24a5f6e07d3000a [] ip-10-1-2-14:9997
2;04h loc:14a5f6e079d000c [] ip-10-1-2-28:9997
2;04i loc:34a5f6e086b000d [] ip-10-1-2-19:9997
2;04j loc:34a5f6e086b000d [] ip-10-1-2-19:9997
2;04k loc:24a5f6e07d30009 [] ip-10-1-2-23:9997
2;04l loc:24a5f6e07d3000b [] ip-10-1-2-22:9997
2;04m loc:24a5f6e07d30009 [] ip-10-1-2-23:9997
2;04n loc:24a5f6e07d3000b [] ip-10-1-2-22:9997
2;04o loc:34a5f6e086b000a [] ip-10-1-2-18:9997
2;04p loc:24a5f6e07d30008 [] ip-10-1-2-24:9997
2< loc:24a5f6e07d30008 [] ip-10-1-2-24:9997
Below, the scan output above has been regrouped to show which tablet groups are on each tserver. The four tablets in group 03 are on only two tservers; ideally, those tablets would be spread across four tservers. Note that the default tablet (2<) is categorized as group 04 below.
ip-10-1-2-13:9997 01
ip-10-1-2-14:9997 04
ip-10-1-2-15:9997 01
ip-10-1-2-16:9997 04 04
ip-10-1-2-17:9997 04 04
ip-10-1-2-18:9997 04
ip-10-1-2-19:9997 04 04
ip-10-1-2-20:9997 03 03
ip-10-1-2-21:9997 03 03
ip-10-1-2-22:9997 04 04
ip-10-1-2-23:9997 04 04
ip-10-1-2-24:9997 04 04
ip-10-1-2-25:9997 01 01
ip-10-1-2-26:9997 02 04
ip-10-1-2-27:9997 02 02
ip-10-1-2-28:9997 02 04
ip-10-1-2-29:9997 04
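The regrouping above can be reproduced with a few lines of code. The following is a minimal sketch and not part of Accumulo: the class `TabletGroupSummary` and its methods are hypothetical names. It assumes each scan line has the form `row loc:session [] host:port` as in the output above, takes the group from the two characters after the `;` in the row, and assigns the default tablet to a caller-supplied group.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper (not an Accumulo class): summarizes "scan -c loc" output by tserver.
class TabletGroupSummary {

    // Maps a metadata row such as "2;03b" to its two-character group.
    // Rows without an end row (the default tablet, "2<") get the default group.
    static String groupOf(String metadataRow, String defaultGroup) {
        int sep = metadataRow.indexOf(';');
        if (sep < 0 || metadataRow.length() < sep + 3) {
            return defaultGroup;
        }
        return metadataRow.substring(sep + 1, sep + 3);
    }

    // Builds a map, sorted by tserver address, of the groups hosted on each tserver.
    static Map<String, List<String>> summarize(List<String> scanLines, String defaultGroup) {
        Map<String, List<String>> byServer = new TreeMap<>();
        for (String line : scanLines) {
            String[] fields = line.trim().split("\\s+");
            if (fields.length < 4) {
                continue; // not a "row loc:session [] host:port" line
            }
            String server = fields[fields.length - 1];
            byServer.computeIfAbsent(server, k -> new ArrayList<>())
                    .add(groupOf(fields[0], defaultGroup));
        }
        return byServer;
    }

    public static void main(String[] args) {
        // A few lines copied from the scan output above.
        List<String> lines = Arrays.asList(
            "2;03b loc:14a5f6e079d000d [] ip-10-1-2-21:9997",
            "2;03m loc:14a5f6e079d000e [] ip-10-1-2-20:9997",
            "2;03r loc:14a5f6e079d000d [] ip-10-1-2-21:9997",
            "2;03z loc:14a5f6e079d000e [] ip-10-1-2-20:9997",
            "2;04p loc:24a5f6e07d30008 [] ip-10-1-2-24:9997",
            "2< loc:24a5f6e07d30008 [] ip-10-1-2-24:9997");
        summarize(lines, "04").forEach((server, groups) ->
            System.out.println(server + " " + String.join(" ", groups)));
    }
}
```

Run on the six sample lines, this lists group 03 twice on each of ip-10-1-2-20 and ip-10-1-2-21, which is the imbalance described above.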
To remedy this situation, the RegexGroupBalancer is configured with the commands below. The configured regular expression selects the first two digits of a tablet's end row as the group id. Tablets whose end row does not match, as well as the default tablet, are assigned to group 04.
root@accumulo testRGB> config -t testRGB -s table.custom.balancer.group.regex.pattern=(\\d\\d).*
root@accumulo testRGB> config -t testRGB -s table.custom.balancer.group.regex.default=04
root@accumulo testRGB> config -t testRGB -s table.balancer=org.apache.accumulo.server.master.balancer.RegexGroupBalancer
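To make the effect of these settings concrete, the sketch below mimics the grouping rule described above using only `java.util.regex`. It is an illustration, not the balancer's source: the class `RegexGroupSketch` is a hypothetical name, a plain `String` stands in for the tablet's end row, `null` stands in for the default tablet, and it assumes the doubled backslashes in the shell command are escaping, so the effective pattern is `(\d\d).*`.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration of regex-based grouping (not an Accumulo class).
class RegexGroupSketch {
    private final Pattern pattern;
    private final String defaultGroup;

    RegexGroupSketch(String regex, String defaultGroup) {
        this.pattern = Pattern.compile(regex);
        this.defaultGroup = defaultGroup;
    }

    // Returns the first capture group if the whole end row matches the pattern;
    // otherwise (no match, or no end row at all) returns the default group.
    String groupOf(String endRow) {
        if (endRow == null) {
            return defaultGroup; // models the default tablet, which has no end row
        }
        Matcher m = pattern.matcher(endRow);
        if (m.matches() && m.groupCount() == 1) {
            return m.group(1);
        }
        return defaultGroup;
    }

    public static void main(String[] args) {
        RegexGroupSketch groups = new RegexGroupSketch("(\\d\\d).*", "04");
        System.out.println(groups.groupOf("01b")); // 01
        System.out.println(groups.groupOf("03z")); // 03
        System.out.println(groups.groupOf("abc")); // 04 (no match, default group)
        System.out.println(groups.groupOf(null));  // 04 (default tablet)
    }
}
```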
After waiting a little while for the balancer to migrate tablets, look at the tablet locations again. The tablets have been redistributed.
root@accumulo testRGB> scan -t accumulo.metadata -b 2; -e 2< -c loc
2;01b loc:34a5f6e086b000a [] ip-10-1-2-18:9997
2;01m loc:34a5f6e086b000c [] ip-10-1-2-25:9997
2;01r loc:14a5f6e079d0011 [] ip-10-1-2-15:9997
2;01z loc:14a5f6e079d000f [] ip-10-1-2-13:9997
2;02b loc:34a5f6e086b000b [] ip-10-1-2-26:9997
2;02m loc:14a5f6e079d000c [] ip-10-1-2-28:9997
2;02r loc:34a5f6e086b000d [] ip-10-1-2-19:9997
2;02z loc:14a5f6e079d0012 [] ip-10-1-2-27:9997
2;03b loc:24a5f6e07d3000d [] ip-10-1-2-29:9997
2;03m loc:24a5f6e07d30009 [] ip-10-1-2-23:9997
2;03r loc:14a5f6e079d000d [] ip-10-1-2-21:9997
2;03z loc:14a5f6e079d000e [] ip-10-1-2-20:9997
2;04a loc:34a5f6e086b000b [] ip-10-1-2-26:9997
2;04b loc:34a5f6e086b000c [] ip-10-1-2-25:9997
2;04c loc:14a5f6e079d0010 [] ip-10-1-2-17:9997
2;04d loc:14a5f6e079d000e [] ip-10-1-2-20:9997
2;04e loc:24a5f6e07d3000d [] ip-10-1-2-29:9997
2;04f loc:24a5f6e07d3000c [] ip-10-1-2-16:9997
2;04g loc:24a5f6e07d3000a [] ip-10-1-2-14:9997
2;04h loc:14a5f6e079d000c [] ip-10-1-2-28:9997
2;04i loc:14a5f6e079d0011 [] ip-10-1-2-15:9997
2;04j loc:34a5f6e086b000d [] ip-10-1-2-19:9997
2;04k loc:14a5f6e079d0012 [] ip-10-1-2-27:9997
2;04l loc:14a5f6e079d000f [] ip-10-1-2-13:9997
2;04m loc:24a5f6e07d30009 [] ip-10-1-2-23:9997
2;04n loc:24a5f6e07d3000b [] ip-10-1-2-22:9997
2;04o loc:34a5f6e086b000a [] ip-10-1-2-18:9997
2;04p loc:14a5f6e079d000d [] ip-10-1-2-21:9997
2< loc:24a5f6e07d30008 [] ip-10-1-2-24:9997
Once again, the data above is transformed to make it easier to see which groups are on each tserver. The transformed data below shows that all groups are now evenly spread: no tserver hosts more than one tablet from any group.
ip-10-1-2-13:9997 01 04
ip-10-1-2-14:9997 04
ip-10-1-2-15:9997 01 04
ip-10-1-2-16:9997 04
ip-10-1-2-17:9997 04
ip-10-1-2-18:9997 01 04
ip-10-1-2-19:9997 02 04
ip-10-1-2-20:9997 03 04
ip-10-1-2-21:9997 03 04
ip-10-1-2-22:9997 04
ip-10-1-2-23:9997 03 04
ip-10-1-2-24:9997 04
ip-10-1-2-25:9997 01 04
ip-10-1-2-26:9997 02 04
ip-10-1-2-27:9997 02 04
ip-10-1-2-28:9997 02 04
ip-10-1-2-29:9997 03 04
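A listing like the one above can also be checked mechanically. The sketch below uses one possible definition of "evenly spread", which is an assumption of this example rather than the balancer's own criterion: no tserver may host more than ceil(groupSize / serverCount) tablets of any group. The class `GroupSpreadCheck` is a hypothetical name, and only tservers that appear in the map are counted.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical check (not an Accumulo class) for an even per-group spread.
class GroupSpreadCheck {

    // True if, for every group, no tserver hosts more than
    // ceil(groupSize / serverCount) of that group's tablets.
    static boolean isEvenlySpread(Map<String, List<String>> groupsByServer) {
        int serverCount = groupsByServer.size();
        if (serverCount == 0) {
            return true;
        }
        // Total number of tablets in each group across all tservers.
        Map<String, Integer> groupSize = new HashMap<>();
        for (List<String> groups : groupsByServer.values()) {
            for (String g : groups) {
                groupSize.merge(g, 1, Integer::sum);
            }
        }
        // Compare each tserver's per-group count with the allowed maximum.
        for (List<String> groups : groupsByServer.values()) {
            Map<String, Integer> local = new HashMap<>();
            for (String g : groups) {
                local.merge(g, 1, Integer::sum);
            }
            for (Map.Entry<String, Integer> e : local.entrySet()) {
                int limit = (groupSize.get(e.getKey()) + serverCount - 1) / serverCount;
                if (e.getValue() > limit) {
                    return false;
                }
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Four tservers taken from the listings before and after rebalancing.
        Map<String, List<String>> before = new TreeMap<>();
        before.put("ip-10-1-2-20:9997", Arrays.asList("03", "03"));
        before.put("ip-10-1-2-21:9997", Arrays.asList("03", "03"));
        before.put("ip-10-1-2-23:9997", Arrays.asList("04", "04"));
        before.put("ip-10-1-2-29:9997", Arrays.asList("04"));

        Map<String, List<String>> after = new TreeMap<>();
        after.put("ip-10-1-2-20:9997", Arrays.asList("03", "04"));
        after.put("ip-10-1-2-21:9997", Arrays.asList("03", "04"));
        after.put("ip-10-1-2-23:9997", Arrays.asList("03", "04"));
        after.put("ip-10-1-2-29:9997", Arrays.asList("03", "04"));

        System.out.println(isEvenlySpread(before)); // false
        System.out.println(isEvenlySpread(after));  // true
    }
}
```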
If you need this functionality but a regular expression does not meet your needs, extend GroupBalancer, which lets you specify a partitioning function in Java. Use the RegexGroupBalancer source as an example.
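As a rough illustration of a partitioning function that a single capture group cannot express, the sketch below merges the two-digit prefixes into two groups by parity. Everything here is hypothetical: `ParityPartitionerSketch`, the group names `odd` and `even`, and the use of a plain `String` end row in place of Accumulo's tablet extent type. In a real balancer this logic would live in a GroupBalancer subclass, and the method to override should be taken from the RegexGroupBalancer source.

```java
import java.util.function.Function;

// Hypothetical partitioning function (not an Accumulo class).
class ParityPartitionerSketch {

    private static boolean isAsciiDigit(char c) {
        return c >= '0' && c <= '9';
    }

    // Returns a function mapping an end row to "odd" or "even" based on its
    // two-digit prefix; rows without such a prefix get the default group.
    static Function<String, String> partitioner(final String defaultGroup) {
        return endRow -> {
            if (endRow == null || endRow.length() < 2
                    || !isAsciiDigit(endRow.charAt(0)) || !isAsciiDigit(endRow.charAt(1))) {
                return defaultGroup; // default tablet or non-matching end row
            }
            int prefix = Integer.parseInt(endRow.substring(0, 2));
            return (prefix % 2 == 0) ? "even" : "odd";
        };
    }

    public static void main(String[] args) {
        Function<String, String> groupOf = partitioner("even");
        System.out.println(groupOf.apply("01b")); // odd
        System.out.println(groupOf.apply("02m")); // even
        System.out.println(groupOf.apply("03z")); // odd
        System.out.println(groupOf.apply(null));  // even (default group)
    }
}
```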