Introducing i4o (indexes for objects) – the first easy to use mechanism in the .NET framework that allows you to declaratively create indexes on your objects – indexes that LINQ can use to make your LINQ query operations up to 1000 times faster than without indexing. i4o makes the idea of "indexed LINQ" an easy to implement reality, not just theory. I first wrote about Indexed LINQ back in January. I have taken the idea much, much farther since.
First of all, lets go back to understanding what problem this particular technology solves. Put simply, you need indexes for objects for the same reason you need indexes on databases – it makes accessing any particular piece of data faster. The idea has been around ever since someone got a bright idea to put an index in the back of a book – that is, sequential search is always slower than associative search. In other words, if we know a little something about what we are looking for, we don’t have to look through the whole damn pile to find it.
Relational Databases, since they tend to be stored on disk, have been using indexing for almost their entire history in order for them to have any sort of reasonable performance. If you have ever done any kind of performance optimization on a program, one of the first things you always look at are whether the database it uses is properly indexed – especially if it has queries or stored procedures that take a long time to execute. A well designed index inserted into a database can, and very frequently does, create an order of magnitude performance improvement.
Enter LINQ. We now have a tool that allows us to do relational style queries over objects in memory. And this is great – since we can compress an idea in LINQ much more concisely that we could without LINQ (i.e. select * from collection where someproperty=somevalue, is simpler than foreach something in somecollection, if something.someproperty=somevalue, yield return something). Simply, LINQ allows us to use set syntax, right in the language. This is a good thing – which should greatly reduce the amount of code we need to write for a lot of different kinds of problems.
The problem is that all this LINQ code we have now, will increase the amount of set operations programmers will do. Set operations – such as joins – are great, because they allow for more efficient expression. Unfortunately, they also will create a huge performance nightmare if there is no indexing mechanism. To get a vivid demonstration of this problem, and it’s solution, create two tables with 1M random records each, then try to do an inner join based on an ID or string field in either table. Then do the same thing with an index.
There are others at Microsoft who have started to think about this problem (http://blogs.msdn.com/kfarmer/archive/2006/03/15/552615.aspx, http://weblogs.asp.net/brianbec/archive/2006/03/15/440293.aspx). I think those guys are moving in the right direction… but ultimatley, people will need to be able to put indexes on fields for their objects in the same way that they do with a database – declare which field is going to be indexed, and have the where or join clauses automatically figure out how to use the index. That is where i4o comes in.
After you reference i4o, there are only 2 things you need to do in order to be using indexed LINQ. Step 1 – put the [Indexable()] attribute on your property (of a class) you want to index. Step 2 – when you want to use a collection of that class, use IndexableCollection<T> where T is the class with Indexable() on one or more properties. There is no step 3 :). If you use LINQ on the indexable collection now, it will return results based on any present indexes, rather than interating through the whole set, if your query is based on an equality test for one of the indexable fields.
So, how does it work? The key to indexable collection is that it is aware of the Indexable attribute on the class it is a collection of. In it’s constructor, it creates blank indexes for each property that has the indexable attribute. In its add (and eventually it’s remove), it adds the item to each applicable index. Normal indexing rules apply – whereas just like in a database, the more heavily you index something, you incur more overhead on the add, but save the time later on the search.
The real fun comes in when it is time to actually use the index. To do that, we use Extension methods over the Where and Join methods for IEnumerable. By evaluating the expression tree sent to either method, we can determine if the left hand side of the delegate sent to us is an indexable property. If so, we use the index rather than searching through the colleciton sequentially for the item. We then evaluate the right side of the expression tree, get it’s hash code, and use that hash code in our index lookup to find our items that might match the index. We then look at all the items where the hashcode matches – and run the standard packaged operator (be it where or join) on the result.
The first version of i4o is available now. It supports the where and join clauses (group join is implemented, but not tested), and has a sample that comes along with the default installation that demonstrates the performance increase you get just from using it with a where clause (i.e. Finding a common name thorugh 1M records in <10 ms with an IndexableCollection<T> versus over >100 ms using just Collection<T>). The where clause sample demonstrates the performance increase in a simple case – when you start doing joins with this (coming soon … scans N items where N is the size of the collection, join scans the product of the lengths in either collection) – the theoretical performance improvement goes up by yet another order of magnitude.
Here is a screenshot of the tests from the demo code in action:
If you have any questions, comments, or concerns … or would like to contribute to the project, please email me at aaron.c.erickson@gmail.com .