Abdul Basit's Blog: September 2013

Search....the final frontier....

Whoops... let's start over. Search: it's hard to imagine life without it now, and for very good reason. The amount of readily accessible information has grown enormously over the past 10 years, and one has to be able to quickly sift through all that data to find what's relevant.

Search comes in all shapes and sizes. Whether you're using email, browsing the web, or even using local search on Windows, you're always in need of it. Of course, when it comes to implementation, there is a plethora of techniques one can use. And I'm not just talking about databases: we now have full-text search engines like Apache Lucene... and I don't even know what Google uses under the hood.

And then we have security. Security is very real and has its place in the food chain. There are many facets to security, but I'm specifically going to talk about data encryption.

Alas... data encryption itself is a pretty huge topic, and the algorithms that exist are pretty sophisticated. Some can even encrypt the same data into a different ciphertext every time, and still decrypt it back to the original!

So that's that, now on to the topic at hand. Search and security are like water and oil... well, maybe I'm being a bit dramatic, but at the very least you'll need to rethink how to get search going once the data is encrypted. Don't believe me? Let's look at an example.

Let's say you're building a database model for some customers. Usually, a customer will have attributes like a first name, last name, date of birth and perhaps an address.

Search becomes pretty simple:

select * from customer where firstname = 'abdul';
select * from customer where dob < '01/01/2000';
select * from customer where address ilike '%wellington%';

Super. The memory footprint is low, and if you're using indexes the response is very quick (except perhaps the last query, since a pattern with a leading wildcard generally can't use an ordinary B-tree index).

Now let's encrypt the data. Oh wait... you can't execute ANY of the above verbatim! Fortunately, we have some solutions.

If security is high on your radar, you'll probably want to use a variable hash algorithm (meaning that for the same text input you get a different output every time; strictly speaking this is randomized encryption rather than hashing, since you need to get the data back). In that case the best solution really is to go row by row through the database, bring each row to the application, decrypt it, and do your comparison there. Very poor efficiency, but it'll get the job done.
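To make the row-by-row idea concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the `encrypt`/`decrypt` pair is a toy cipher built from the standard library just to show the "different output every time" behaviour (a real project would use a vetted library instead), and `rows` stands in for whatever your database query returns.

```python
import hashlib
import secrets

KEY = secrets.token_bytes(32)  # hypothetical application key


def _keystream(key: bytes, nonce: bytes, length: int) -> bytes:
    """Toy keystream from SHA-256(key + nonce + counter). Illustration only."""
    out = b""
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + nonce + counter.to_bytes(4, "big")).digest()
        counter += 1
    return out[:length]


def encrypt(plaintext: str, key: bytes = KEY) -> bytes:
    """Randomized: a fresh nonce makes every ciphertext different."""
    data = plaintext.encode("utf-8")
    nonce = secrets.token_bytes(16)
    stream = _keystream(key, nonce, len(data))
    return nonce + bytes(a ^ b for a, b in zip(data, stream))


def decrypt(ciphertext: bytes, key: bytes = KEY) -> str:
    """Split off the nonce, rebuild the keystream, and undo the XOR."""
    nonce, body = ciphertext[:16], ciphertext[16:]
    stream = _keystream(key, nonce, len(body))
    return bytes(a ^ b for a, b in zip(body, stream)).decode("utf-8")


def search_rows(rows, predicate):
    """Decrypt each row in the application and keep the ids that match."""
    return [row_id for row_id, blob in rows if predicate(decrypt(blob))]


# Stand-in for "select id, address from customer" on an encrypted column.
rows = [
    (1, encrypt("10 Wellington Drive")),
    (2, encrypt("5 Queen Street")),
    (3, encrypt("22 Wellington Road")),
]

# The equivalent of: address ilike '%wellington%'
matches = search_rows(rows, lambda address: "wellington" in address.lower())
print(matches)
```

The cost is plain to see: every search touches and decrypts every row, so the work grows with the size of the table no matter how selective the query is.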

Another solution would be to keep the decrypted data in an in-memory store of sorts: perhaps an in-memory database, a Lucene index, or some custom data structure. Security folks will come back with concerns that if a memory dump is taken, the data can be exposed. Sure, the risk is there. Like I said, if security is high up in the project's priorities, it'll probably be a difficult sell.

As far as variable hash algorithms go, that's all I've got. All of the example queries above can be done in code once the data has been decrypted.

If you can relax the constraint of variable hashes and use a "same hash" algorithm (deterministic encryption, where the same input always produces the same output), the story changes quite a bit. Now you can encrypt the search text in the application and push the encrypted search string to the database. For the equal and not-equal operators, this will work brilliantly. You can index the columns too, to get very fast results.
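Here is a rough sketch of that equality search in Python. Note the substitution: instead of a deterministic cipher I'm using a keyed HMAC-SHA256 digest as the "same output every time" value, which can't be decrypted, so it would sit in its own column next to the properly encrypted data. The key, the `firstname_token` column and the helper names are all made up for the example, and SQLite stands in for the real database.

```python
import hashlib
import hmac
import sqlite3

# Hypothetical key; in practice it would come from a key store, not source code.
INDEX_KEY = b"demo-key-not-for-production"


def blind_token(value: str, key: bytes = INDEX_KEY) -> str:
    """Deterministic keyed digest: the same input and key give the same token."""
    normalized = value.strip().lower()
    return hmac.new(key, normalized.encode("utf-8"), hashlib.sha256).hexdigest()


conn = sqlite3.connect(":memory:")
conn.execute("create table customer (id integer primary key, firstname_token text)")
# An ordinary index works, because the lookup is a plain equality comparison.
conn.execute("create index idx_customer_firstname on customer (firstname_token)")
conn.executemany(
    "insert into customer (id, firstname_token) values (?, ?)",
    [(1, blind_token("Abdul")), (2, blind_token("Sarah")), (3, blind_token("abdul"))],
)


def find_by_firstname(name: str):
    """Tokenize the search text in the application, then compare in the database."""
    cur = conn.execute(
        "select id from customer where firstname_token = ? order by id",
        (blind_token(name),),
    )
    return [row[0] for row in cur.fetchall()]


equal_ids = find_by_firstname("abdul")
print(equal_ids)
```

Lower-casing before tokenizing is a design choice that makes the match case-insensitive. The flip side of deterministic output is that anyone who can read the column can tell which rows share a value, which is exactly the constraint being relaxed here.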

But that's pretty much all you can do. Haha! In particular, comparison operators like less than or greater than cannot work. You are, after all, working with garbage strings whose ordering has nothing to do with the ordering of the original values.

Comparison operations like the ones above typically stem from some kind of business rule. Perhaps, in our example, we need to wait for the customer to turn 18 before they are eligible for a certain product. Then you can have a scheduled job that sets particular indicators based on those rules. Essentially, you'll need to replace the less than/greater than with some form of equal to.
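As a sketch of what such a scheduled job might compute, assume it runs in the application, where the date of birth has already been decrypted. The `is_adult` flag, the dict layout and the function names are hypothetical; the point is only that the range check happens once in code and the database is left with something it can test with equal to.

```python
from datetime import date


def is_adult(dob: date, today: date, threshold: int = 18) -> bool:
    """True once the customer has reached the threshold age."""
    years = today.year - dob.year
    # Birthday not reached yet this year: one year less.
    if (today.month, today.day) < (dob.month, dob.day):
        years -= 1
    return years >= threshold


def refresh_indicators(customers, today: date):
    """Set a plain 'is_adult' indicator on each (already decrypted) customer."""
    for customer in customers:
        customer["is_adult"] = is_adult(customer["dob"], today)
    return customers


customers = [
    {"id": 1, "dob": date(1995, 9, 18)},  # turns 18 on the day the job runs
    {"id": 2, "dob": date(1995, 9, 19)},  # turns 18 the day after
    {"id": 3, "dob": date(2001, 1, 1)},
]
refresh_indicators(customers, today=date(2013, 9, 18))

# The query side becomes an equality test, e.g. where is_adult = true.
eligible = [c["id"] for c in customers if c["is_adult"]]
print(eligible)
```

Keep in mind that the indicator itself reveals a little about the encrypted value, so it is worth deciding whether it should be stored in the clear or protected as well.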

And then, finally, we have our good friend like/ilike. Here, even if you use a same hash algorithm, you're still stuck, because your search phrase is usually only a part of the stored phrase, which is why you're using the like clause to begin with.

Don't despair, good people. The notion remains the same: we need to replace the like clause with some form of equal to operator. But how?

Enter string tokenization. All you really do is break the string into a set of tokens. Let's say someone lives at 10 Wellington Drive. Then the tokens can be 10, wellington, drive. You can go crazy with slicing and dicing: start with substrings of, say, 3 characters and work up to the whole string. So wellington becomes wel, ell, lli, ..., well, elli, ..., wellington. Now encrypt all these tokens (with a same hash algorithm, so that equal tokens match) and associate them with the id of the customer. When a user types well as the search phrase, you encrypt it the same way, do an equal to operation on the tokens, get the associated ids and voila!
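The tokenization can be sketched in a few lines of Python. As before, treat the details as assumptions rather than the exact code from my project: the tokens are protected with a keyed HMAC-SHA256 digest (deterministic, so equal tokens match), a dict stands in for the token table, and the key and function names are invented for the example.

```python
import hashlib
import hmac

TOKEN_KEY = b"demo-key-not-for-production"  # hypothetical key
MIN_LEN = 3  # shortest substring worth indexing


def ngrams(word: str, min_len: int = MIN_LEN) -> set:
    """All substrings of at least min_len characters; short words stay whole."""
    word = word.lower()
    if len(word) < min_len:
        return {word}
    return {
        word[i:i + n]
        for n in range(min_len, len(word) + 1)
        for i in range(len(word) - n + 1)
    }


def protect(token: str) -> str:
    """Deterministic keyed digest, so equal tokens produce equal stored values."""
    return hmac.new(TOKEN_KEY, token.encode("utf-8"), hashlib.sha256).hexdigest()


def add_customer(index: dict, customer_id: int, address: str) -> None:
    """Associate every protected token of the address with the customer id."""
    for word in address.split():
        for token in ngrams(word):
            index.setdefault(protect(token), set()).add(customer_id)


def search(index: dict, phrase: str) -> list:
    """Equality lookup per search word; all words must match the same customer."""
    words = phrase.lower().split()
    if not words:
        return []
    result = None
    for word in words:
        ids = index.get(protect(word), set())
        result = ids if result is None else result & ids
    return sorted(result)


index = {}  # stand-in for a (token, customer_id) table
add_customer(index, 1, "10 Wellington Drive")
add_customer(index, 2, "5 Queen Street")
add_customer(index, 3, "22 Wellington Road")

print(search(index, "well"))
print(search(index, "Wellington Drive"))
```

The price is storage: a 10-letter word like wellington alone produces 36 tokens, so a minimum length and perhaps a maximum are worth tuning. Search terms shorter than the minimum only match whole short words such as 10.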

I am sure there are other ways in which these problems can be solved. This is one solution that I've applied to a project, and it works quite well. Needless to say, as technology evolves, I am sure more work will be done in this area. Until then...


Wednesday 18 September 2013

Search and Security
