I added text search to Iron Money in early March and had a few requirements: I wanted to implement the AND and OR operators, I wanted to support quotes for exact matches, I wanted to support a “not” operator (“-”), I didn’t want any stop words, and I needed to be able to search encrypted text. Here are some examples of queries I wanted to support:
- a b [both a and b]
- a or b [a or b]
- “a b” [the string “a b”]
- -a b [strings that contain b but not a]
I decided that implementing my own text search would be my best option with those requirements. I wasn’t able to find any good articles about implementing text search, thus I’ve decided to write up my thoughts here. I can’t guarantee that this is the best solution for you, but given the above contraints it was the best option for Iron Money.
The first thing I do is parse the string like a CSV with PHP’s str_getcsv(). str_getcsv() is a function for parsing CSV strings; I set the delimiter to the space character and leave the enclosure to the default ‘”‘. If you’re writing this in another language, any CSV parser should do the trick; just make sure it splits the search text by spaces and respects quote values.
After I get an array of strings with str_getcsv(), I run through that array and add it to a brand new array (we’ll call it $parts). As I run through the strings, I check to see if the string starts with ‘-”‘; this indicates that the next string in the array is going to be a NOT value. Once I get to the next string, I append the rest of the NOT value to the previous string that was added to the $parts array and continue with the rest of the array. We end up with an array of $parts that has an array of strings.
With $parts in hand, I check each string in $parts to find if any of the strings are the word “or”. If the string is “or”, then I know that the string before and after the “or” belong to that “or”; thus, both of them are added to a special $or_conditions array, and I remove the three strings from the $parts array.
At this point we have two arrays: $parts has all of our AND conditions, while $or_conditions has all of our OR conditions. My goal, however, is to have an array of all of the conditions that a string must meet; this includes values that it needs to contain, values that it should not contain, and a list of OR conditions which the string must meet. Thus, I create a $conditions array which will contain all of the conditions a string must meet to match the search.
We continue along by creating a new array ($condition) with two values: a $contains array and a $does_not_contain array. We run through all of the strings in $parts; remember, all of the strings in the $parts array are AND conditions. We check each string in $parts to see if it starts with the “-” character; if it does, we put the string in $does_not_contain; otherwise, we put the string in $contains. After running through all of the strings in $parts, we’ll end up with a single $condition array that we add to the $conditions array; it represents all of the AND conditions.
Now we move on to the $or_conditions array. For each of the $or_conditions, we’ll create a new $condition that has a $contains array, a $does_not_contain array, and a value called “operator” that indicates that this is an OR condition. We use the same process as before to modify the $condition array, and then add it to the $conditions array.
At this point in time, you have a $conditions array that has a list of conditions a string must meet to match the search. Each value of the $conditions array has a $contains and $does_not_contain array, and it might have an “operator” value to indicate if it’s an OR condition.
Now, we get all of the potential candidates to search through. In Iron Money’s case, we have an array of transactions that we need to sift through; if the “name” or “description” attribute matches the text search, then the candidate is returned in the search results.
To determine whether or not a value matches the search conditions, we run through the $conditions array that we created earlier. For each $condition, we check to see if the candidate strings match the $condition—if the strings have the $contains text and don’t have the $does_not_contain text. If the candidate is a match, we add it to the search results.
Manually implementing text search can be a real pain; I think it’s almost always better to use existing, proven solutions for text-matching needs. However, if you’re in a situation like I was with Iron Money, hopefully the above will be useful when implementing a robust text-matching algorithm.
0 Responses to “Implementing Basic Text Search”