Kata number five is about Bloom Filters. No, that isn’t anything to do with flowers, though of course it could be used for that purpose. Dave Thomas’ description will quite suffice. If you’re not familiar with them, go read it now. I’ll wait.
Oh, you’re back. Good. Now that you have a good understanding of what they do and how, here’s my implementation. It uses each pair of bytes out of an MD5 digest, as an individual hash. (For the security geeks out there, yes I know MD5 has been declared too weak to rely on for security, but this is just a simple demo.)
#! /usr/bin/ruby
# Dave Thomas Code Kata #5: Bloom Filter
# See http://codekata.pragprog.com/2007/01/kata_five_bloom.html
# for problem -- but not data, he's taken down the word list.
# Solution by Dave Aronson
require 'digest/md5'
class Bloom_Filter
  FixnumSize = 32
  def initialize
    # WARNING WILL ROBINSON, DANGER, DANGER!
    # There's probably some way to tell Ruby "give me this size chunk
    # of RAM" but I haven't stumbled across it yet.  We need 2^16 bits
    # cuz we're going to take the hash 2 bytes at a time; divide by
    # FixnumSize cuz that's the storage unit for our bits.
    @bytes = Array.new((2 ** 16) / FixnumSize, 0)
  end
  def add_datum(datum)
    hash = Digest::MD5.hexdigest datum
    # could also convert hash to arbitrary-size integer en masse
    # and then take byte pairs as binary... this way seemed easier....
    (hash.length / 2).times { |pair| set_bit hash[pair*2 .. pair*2+1].hex }
  end
  def check_bit(bit_num)
    (@bytes[bit_num / FixnumSize] & (1 << (bit_num % FixnumSize))) != 0
  end
  def check_datum(datum)
    hash = Digest::MD5.hexdigest datum
    (hash.length / 2).times do |subhash|
      return false if ! check_bit hash[subhash * 2 .. subhash * 2 + 1].hex
    end
    true
  end
  def load_data(data)
    data.each { |datum| add_datum datum }
  end
  def set_bit(bit_num)
    @bytes[bit_num / FixnumSize] |= (1 << bit_num % FixnumSize)
  end
end
# TESTS
require 'test/unit'
class TC_MyTest < Test::Unit::TestCase
  def setup
    @words = [ 'zero', 'one', 'two', 'three', 'four', 'five',
                       'six', 'seven', 'eight', 'nine', 'ten' ]
    @filter = Bloom_Filter.new
    @filter.load_data @words
  end
  def helper(lang_words)
    lang_words.each do |word|
      assert_equal @words.member?(word), @filter.check_datum(word),
                   "error: checking #{word} failed!"
    end
  end
  def test_finding_same_words
    helper @words
  end
  def test_find_commonality_with_english
    helper ['zero', 'one',    'two',    'three', 'four',   'five',
                    'six',    'seven',  'eight', 'nine',   'ten',]
  end
  # given how the helper works, this is really meaningless,
  # included just for illustration
  def test_finding_nothing
    helper []
  end
  def test_finding_empty
    helper ['']
  end
  def test_finding_space
    helper [' ']
  end
  def test_find_commonality_with_french
    helper ['zero', 'un',     'deux',   'trois', 'quatre', 'cinq',
                    'six',    'sept',   'huite', 'neuf',   'dix']
  end
  def test_find_commonality_with_spanish
    helper ['zero', 'uno',   'dos',    'tres',  'cuatro', 'cinco',
                    'seis',  'siete',  'ocho',  'neuve',  'dies']
  end
  def test_find_commonality_with_german
    helper ['zero', 'eins',   'zwei',   'drei',  'fier',   'funf',
                    'sex',    'sieben', 'ocht',  'neun',   'zehn']
  end
  def test_find_commonality_with_japanese
    helper ['zero', 'ichi',   'ni',     'san',   'yon',    'go',
                    'ryoku',  'shichi', 'hachi', 'kyu',    'ju']
  end
  def test_find_commonality_with_multiple
    helper ['onetwo', 'one two three']
  end
  def test_find_commonality_with_nonsense
    helper ['ooee', 'ooahah', 'tingtang', 'wallawallabingbang',
            'flibbertygibbet', 'asdfasdf', 'qwertyuiop', '@#$%^&*!']
  end
end
So now, as usual, it’s the Peanut Gallery’s turn. What would you do differently, and why? Would you use a completely different algorithm? A different way to allocate the memory, and set and check the bits? Would you test it differently? Or is it (ha!) perfect?