
Using Bloom Filters to Avoid Repetition

I have a site, DailyCryptogram, for folks who like to solve cryptograms. It has a database of just over 10,000 quotations from about 3,000 authors. One of the features of the site is that it displays a random quotation every two minutes. One would think that with 10,000 quotations you would almost never see a quote repeated. Experience shows, however, that it happens fairly often. Also, since some authors have a single quote and others have as many as fifty, you seldom see many of the authors. Now, I like Woody Allen and Yogi Berra as much as anybody, but I would like to see quotes from others too.

I was trying to figure out a simple way to avoid repetition when I discovered Bloom Filters. Conceived by Burton Howard Bloom in 1970, when memory was expensive, they offer a probabilistic method of determining whether a member exists in a set. False positives are possible, but false negatives are not. In other words, we can be certain that a member is not in a set, but can only say that a member is “likely” to be in it.
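To make those semantics concrete, here is a from-scratch toy filter. The site itself uses Jason Davies’s bloomfilter module (shown later); the class name and the hash function below are my own illustration, not the site’s code:

```javascript
// Toy Bloom filter, for illustration only. m bits, k hash functions.
class ToyBloomFilter {
  constructor(m, k) {
    this.m = m;
    this.k = k;
    this.bits = new Uint8Array(m); // one byte per bit, for simplicity
  }
  // FNV-1a-style seeded string hash; not cryptographic, just well spread.
  hash(str, seed) {
    let h = 2166136261 ^ seed;
    for (let i = 0; i < str.length; i++) {
      h = Math.imul(h ^ str.charCodeAt(i), 16777619);
    }
    return (h >>> 0) % this.m;
  }
  add(str) {
    for (let s = 0; s < this.k; s++) this.bits[this.hash(str, s)] = 1;
  }
  // All k bits set => "probably in the set"; any bit clear => definitely not.
  test(str) {
    for (let s = 0; s < this.k; s++) {
      if (!this.bits[this.hash(str, s)]) return false;
    }
    return true;
  }
}

const bf = new ToyBloomFilter(512 * 256, 16);
bf.add("Yogi Berra");
console.log(bf.test("Yogi Berra"));  // true, guaranteed: no false negatives
console.log(bf.test("Woody Allen")); // false (a false positive is possible, but vanishingly unlikely here)
```

Note that a member can be added but never removed, which is exactly why expiring the whole filter (as the site does via Redis TTLs) is the natural way to “reset” it.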

This seems like a good method if we are using it to exclude cryptograms and/or authors that have been previously selected. We may occasionally discard one that really hasn’t been used but we will never display one twice.

There really is no need to use such a filter for authors, because we could easily just keep track of the ids used, but for quotations it is quite useful. Quotations often differ only in punctuation, phrasing, or a misspelled word. If we choose the key to be not the entire quotation but just the “important words”, we can avoid repetition. When you read two quotations you can easily recognize similarity, but who has time to read 10,000 quotes looking for similar ones? After some trial and error, choosing the key to be a sorted array of the words at least four letters long works well. Restricting the key to words longer than five letters often eliminated quotations which were clearly not the same. I suppose that may be because we have a lot of quotes from celebrities whose vocabulary consists primarily of four-letter words.

Implementation

We have three requirements:

  1. Never repeat an author on the same day.
  2. Never show the same quotation in any four-day period.
  3. Deliver the quotation in less than two seconds.

Those requirements are arbitrary but should provide a reasonable user experience. In the code those values are configurable, but some care must be taken. Since we deliver a quotation every two minutes, we deliver 720 per day. If we choose a very long period before repeats are allowed, we will likely degrade performance significantly in the later days, because the pool of candidates is smaller and a randomly chosen quote is much more likely to fail the filter test.

Architecture

[Architecture diagram: Viewer]

Each object (except the Browser and Mongo) is a Docker container hosted at Joyent. The Mongo database is hosted at MongoLabs.

The external client (via a cron task) sends a request via a socket to the socket server. The socket server broadcasts a message that a quote is requested. The Bloom Filter application is listening for those messages.

Process in the Bloom Filter application:

  1. Retrieve a random quote from the Mongo database.
  2. Retrieve the existing Bloom Filter for Authors, which is stored in a Redis database.
  3. Test the author against the filter. If it tests positive, go back to step 1.
  4. Retrieve the existing Bloom Filter for Quotes.
  5. Prepare the key.
  6. Test the quote against the filter. If it tests positive, return to step 1.
  7. Send the quote via socket to the socket server.
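The steps above can be sketched as a single retry loop. Everything here is a stand-in: pickQuote, setFilter, and MAXRETRIES are names of my own invention, the quote source is a stub, and in-memory Sets play the part of the Redis-backed filters (a real Bloom filter exposes the same test/add shape):

```javascript
const MAXRETRIES = 20; // assumption: the real limit is configurable

// Steps 1-7 from the list above, with the data sources passed in.
async function pickQuote(getRandomQuote, authorFilter, quoteFilter, makeKey) {
  for (let retries = 0; retries < MAXRETRIES; retries++) {
    const q = await getRandomQuote();           // step 1
    if (authorFilter.test(q.author)) continue;  // steps 2-3: author already shown today
    const key = makeKey(q.quote);               // steps 4-5
    if (quoteFilter.test(key)) continue;        // step 6: quote shown recently
    authorFilter.add(q.author);
    quoteFilter.add(key);
    return q;                                   // step 7: hand off to the socket server
  }
  return null; // pool too picked-over; caller should log and give up
}

// In-memory stand-in with the same test/add interface as a Bloom filter:
const setFilter = () => {
  const seen = new Set();
  return { test: (v) => seen.has(v), add: (v) => seen.add(v) };
};
```

With two stub quotes cycling, the first call returns the first quote, and the second call skips it (its author and key are now in the filters) and returns the second.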

The socket server then broadcasts the quote. The browser page on the cryptogram site is listening for those messages and renders the quote on the page.

The code

We make use of the node module bloomfilter by Jason Davies.

Fetching (and creating it if it doesn’t exist) the filter:

'use strict';
var Promise = require("bluebird"),
	BloomFilter = require("bloomfilter").BloomFilter,
	unboundSlice = Array.prototype.slice,
	slice = Function.prototype.call.bind(unboundSlice),   // so we can slice a typed Array

	createFilter = (app, filterName)=>new Promise((resolve, reject)=> {
		let bf = new BloomFilter(
			512 * 256, // number of bits to allocate.
			16        // number of hash functions.
		);
		let array = slice(bf.buckets);
		let json = JSON.stringify(array);
		let expireTime = Number(app.get("nconf").get(filterName + "_EXPIRE_TIME"));
		app.get("client").setAsync(filterName, json)
			.then(()=>app.get("client").expireAsync(filterName, expireTime))
			.then(()=>resolve(bf))
			.catch(reject);
	});
module.exports = (app, filterName)=>new Promise((resolve, reject)=> {
	app.get("client").getAsync(filterName)
		.then((redisString)=> {
			if (!redisString) {
				createFilter(app, filterName)
					.then(resolve)
					.catch(reject);
			} else {
				resolve(new BloomFilter(JSON.parse(redisString), 16));
			}
		})
		.catch(reject);
});

One needs to tune the number of bits to allocate: too many and the filter becomes too large; too few and you get too many false positives.
In order to store the filter in Redis we have to stringify it. The “client” is a redis client promisified with Bluebird.
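Stringifying is also why the code binds slice at the top of the file: bf.buckets is a typed array, and JSON.stringify turns a typed array into an object of numeric keys rather than an array. Copying it into a real Array first gives a clean round trip. A self-contained sketch (the Int32Array literal here just stands in for real filter buckets):

```javascript
// Borrow Array.prototype.slice so it can copy a typed array into a real Array.
const unboundSlice = Array.prototype.slice;
const slice = Function.prototype.call.bind(unboundSlice);

const buckets = new Int32Array([7, 0, -1, 42]); // stand-in for bf.buckets

// Stringifying the typed array directly produces an object, not an array:
JSON.stringify(buckets);                     // '{"0":7,"1":0,"2":-1,"3":42}'
// Via slice we get a plain array, which round-trips cleanly:
const json = JSON.stringify(slice(buckets)); // '[7,0,-1,42]'
const restored = new Int32Array(JSON.parse(json));
```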

Testing the quote

let re = new RegExp('[?.;,\'!]', 'g');  // punctuation treated as word breaks
let getKey = (quote)=>quote.toLowerCase().replace(re, ' ').split(/\s/).reduce((pv, cv)=> {
	if (cv.length > 3) {
		pv.push(cv);
	}
	return pv;
}, []).sort().join(" ");
      ....
let testVal = getKey(theQuote.quote);
if (quoteFilter.test(testVal)) {
	retries++;
	if (retries >= MAXRETRIES) {
		app.logger.log("info", `Failed to find quote after ${retries}`);
	}
	doit();
} else {
	quoteFilter.add(testVal);
	require("../index").setQuoteFilter(app, "QuoteFilter", quoteFilter)
		.then(()=> {
			foundGood = true;
			app.logger.log("info", `Sent ${theQuote.quote} after ${retries}`);
			emitQuote(theQuote);
		})
		.catch((err)=> {
			retries = MAXRETRIES + 1;
			app.logger.log("error", err);
		});
}

The key is generated by converting the string to lower case, replacing punctuation with spaces, splitting the quote into an array of words, reducing that array to words of more than three characters, sorting the array, and joining it back into a space-separated string.

If the key tests positive we start over; otherwise we add the key to the filter and emit the quote.
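To see the key in action, here is a standalone version of getKey (rewritten with filter, which is equivalent to the reduce above, and with the comma included in the punctuation class) applied to two invented quotes that differ in word order and punctuation:

```javascript
// Standalone key function: lower-case, strip punctuation, keep words longer
// than three letters, sort, and re-join.
const re = /[?.;,'!]/g;
const getKey = (quote) =>
  quote.toLowerCase()
    .replace(re, ' ')
    .split(/\s+/)
    .filter((w) => w.length > 3)
    .sort()
    .join(' ');

// Two variants of the same thought (invented examples) collapse to one key:
const a = getKey("Imagination is more important than knowledge.");
const b = getKey("More important than knowledge is imagination!");
// a === b === "imagination important knowledge more than"
```

Reading the quotes, a human instantly sees they are the same; the sorted-word key lets the filter see it too.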

Results

I have just begun to test, but it looks very promising. The time from the cron task emitting the request for a quote until the quote is rendered in the browser is roughly a second.
The cron container is in the Eastern data center at Joyent; the remaining containers are in the Southwest data center. Considering that we make a minimum of four calls to Redis and one to Mongo, I think the performance is acceptable.

Further uses

I think the filters could also be used to accept or reject new quotations as they are added. When the quotations come from a variety of sources, it is likely that the same quotation will appear in slightly different forms. Without some filtering mechanism you wind up annoying users by presenting two cryptograms which they can easily recognize as being the same, because the human brain is very good at ignoring the irrelevant.

They could also be used when the user requests a new cryptogram to play. That would involve keeping a separate filter per user, stored with no expiration. Of course, in that scenario we would really prefer sub-second performance.

Summary

Although we live in the age of cheap memory and seemingly infinite storage capacity, I think we would be wise to consider some of the techniques used almost 50 years ago, when code absolutely had to be efficient.

Please add comments and ideas on how to apply Bloom Filters.
