Modified Porter Algorithm
Introduction
A few years ago, I modified the Porter algorithm for a Bible concordance program I was writing for PalmOS. Well, it's about time I admitted to myself that the program will never get done, so I'm releasing the algorithm to the public.
Description
Programmers already familiar with information retrieval techniques will immediately recognize the name of "Porter's algorithm". What I describe here is some simple modifications to Porter's algorithm that are intended for use with a Bible search utility, though this may well work with other older English texts as well.
The idea behind a stemming algorithm is to strip suffixes off words when searching, because words with the same root often have similar meanings (e.g., "carry", "carrying", and "carrieth" all have related meanings). When using a stemming algorithm, any search for, e.g., "carry" would match any verse containing "carry" or "carrying" or "carrieth". Likewise, any search for "carrieth" would also match any verse containing any of those words.
My starting point is Porter's modifications to his own algorithm (improving on it since its publication). He describes those changes on his web page. From there, I've made my own changes to make the Porter algorithm work better with the Bible text.
Changes made to the Porter algorithm
- Added support for -eeth and -eest
1b: (m>0) eeth -> ee (m>0) eest -> ee - Added support for suffixes -est, -eth, -ingly, and -edst
1b: (*v*) est -> (*v*) eth -> (*v*) ingly -> (*v*) edst -> - Changed ousli -> ous to ousli -> ou and ousness -> ous to ousness -> ou
2: (m>0) ousli -> ou 2: (m>0) ousness -> ou
- Added support for suffix -antly
2: (m>0) antli -> ant
Test Page
I have implemented the modified Porter algorithm in JavaScript, with a test page that will stem any arbitrary word, show all words in the Bible with the same stemming result (i.e., list of related words), and also display all the rules in the modified Porter's algorithm that were used during that stemming. Note: this page references ~21K of JavaScript code, so it may take a while to load, especially over a dialup connection.