Generic Collections, LINQ, and the Indicies

Here is the situation.  You have a generic collection class that inherits from Collection, or some other base that allows the class to be a generic collection.  Furthermore, you are doing the equivalent of that in LINQ, with the idea that you intend to be able to quickly and efficiently do object based queries.

Of course, we know that in order to do fast and efficient queries over very large sets of data, you typically need to have an index on the field you intend to put in the “where clause” of the query.  Without an index, such an operation becomes a O(N) operation, whereas with an index, it is an O(C) operation.  To put it another way, there are very good reasons why people put indexes on database tables, and those reasons do not change just because you are dealing with objects in memory.

So… the question is, how will LINQ deal with the need for indicies?

To speculate about this, lets first discuss how we would do this outside of the LINQ world.  Lets say I have a class called Student, and a related collection class called Students that inherits from Collection.  Now, I want to be able to quickly locate a student by Social Security number.  To do something like that, and make it efficient, I would implement an internal list in the form of a private member variable called _studentIndex that is of type Dictionary.

To make the index populate seamlessly, I would override Add as follows:

new public void Add(Student student)

{

  base.Add(student);

  if (!_studentIndex.ContainsKey(student.SSN))

    _studentIndex.Add(student.SSN,student);

}

In the Item and Remove overrides, I would do a similar type of operation (lets assume that SSN is unique for purposes here).

Then, to retrieve a Student by SSN quickly, it becomes fairly trivial:

public Student FindByStudent(string SSN)

{

  if (_studentIndex.ContainsKey(SSN))

    return(_studentIndex[SSN]);

  else

    return null;

}

Now, fast forward to LINQ.  Say, you do the equivalent operation when you have where Student.SSN == “000-11-2222′.  How will LINQ know to put an index on SSN.  While you can designate a primary key on the object, there is a good chance that SSN might not be the primary key (i.e. especially if you are in a case, like most universities are, where usage of SSN as a primary key through the system is considered a very bad idea).  In order to do efficient queries LINQ will need some mechanism to designate an index.

I would love for someone more experienced at LINQ than I am to tell me how indicies are supposed to happen.  If they are not there, well, who knows, perhaps we have a nice idea for future enhancement (i.e. perhaps there is a way to put an attribute on a property that would designate it as an index).

Update: A little further research – some people from Microsoft have already been playing with the idea:

http://weblogs.asp.net/brianbec/archive/2006/03/15/440293.aspx

http://blogs.msdn.com/kfarmer/archive/2006/03/15/552615.aspx

It looks like… technically, you can do it, but setting it up is far from trivial (i.e. like in SQL server, where you simply create an index and the query knows to use it if available).

Advertisements
Generic Collections, LINQ, and the Indicies

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s