Tidbits: Multiple String Pattern Matching

Friday, January 25, 2008

Multiple String Pattern Matching

The Mailinator guys blogged about how they were using a modified Aho-Corasick style multiple string pattern matching algorithm to index 185 emails/s.

Aho-Corasick takes all the search strings and builds them into a Trie so that it can scan the whole document for N strings in once pass.

The problem is that Aho-Corasick doesn’t support more advanced constructs such as string jumping made possible by Boyer-Moore.

Boyer-Moore says that it would be more intelligent to search from the end of a word. This way when you find a mismatch it can figure out a ‘jump’ and skip characters! This has the counter intuitive property of being able to index the document FASTER when the word your searching for gets longer.

Crazy huh.

Now if only you could support Boyer-Moore jumping with Aho-Corasick style single pass indexing.

Turns out that’s already done.

Wu-Manber is another algorithm which combines the advantages of both Aho-Corasick and can index the document once as well as skip over words in a jump table.

Spinn3r uses Aho-Corasick for one of our HTML indexing component. It turns out that Wu-Manber won’t give us much of a speed boost because HTML comments begin and end with only three characters. It’s a 3x performance gain to migrate to Wu-Manber but our Aho-Corasick code is proven and in production.

We’re also no longer CPU bound so this isn’t on my agenda to fix anytime soon.