Working with the FreeDB dataset in Voron

Jan 08 2014

Working with the FreeDB dataset in Voron

time to read 26 min | 5099 words

Reminder, the FreeDB data set is a 3.32 million records. Containing most of the albums that came out in the past few decades. We created the following Voron database to handle it:

   1: _storageEnvironment = new StorageEnvironment(StorageEnvironmentOptions.ForPath("FreeDB"));

   2: using (var tx = _storageEnvironment.NewTransaction(TransactionFlags.ReadWrite))

   3: {

   4:     _storageEnvironment.CreateTree(tx, "albums");

   5:     _storageEnvironment.CreateTree(tx, "ix_diskids");

   6:     _storageEnvironment.CreateTree(tx, "ix_artists");

   7:     _storageEnvironment.CreateTree(tx, "ix_titles");

   8:     tx.Commit();

   9: }

The albums tree contains the actual information about the album, as a json string. And the ix_* trees contains back references to it. They are our indexes. For what it is worth, you might want to note that this is pretty much how most RDBMS implements their indexing. We’re now working at a pretty low level.

Note also that we need to define those trees whenever we start the database.

Now, let us do some queries, shall we?

We are going to start with the simplest option, given a disk id, give me the matching disk. Because disk ids are only nearly unique, we have the possibility of multiple results returning.

Side note, the DB now is 9.00 GB in size.

Let us see how we query it. It pains me to write it, but I created a “repository like” interface, because Voron is way too low level for us to expose to user code. This is actually one of the few places where a repository like interface is good. Because it hides the extra complexity and the rigidity of the structure is justified.

The interface looks like:

Now, let us see how we actually implement this guy. We’ll start from the simplest thing, doing a search by a disk id, which is a nearly unique value that identify a disk.

   1: public IEnumerable<Disk> FindByDiskId(string diskId)

   2: {

   3:     using (var tx = _storageEnvironment.NewTransaction(TransactionFlags.Read))

   4:     {

   5:         var dix = tx.GetTree("ix_diskids");

   6:         var albums = tx.GetTree("albums");

7:

   8:         using (var it = dix.MultiRead(tx, diskId))

   9:         {

  10:             if (it.Seek(Slice.BeforeAllKeys) == false)

  11:                 yield break;

  12:             do

  13:             {

  14:                 var readResult = albums.Read(tx, it.CurrentKey);

  15:                 using (var stream = readResult.Reader.AsStream())

  16:                 {

  17:                     yield return _serializer.Deserialize<Disk>(new JsonTextReader(new StreamReader(stream)));

  18:                 }

  19:             } while (it.MoveNext());

  20:         }

21:

  22:         tx.Commit();

  23:     }

  24: }

Let us go over this code in detail. We are using a read transaction, because we aren’t doing any writes.

We are using two trees here, the ix_diskids, which is the index on the disks, and the albums tree, which contains the actual data.

On line 8, we do a multi read. This is done because a single disk id value may belong to multiple albums.

Lines 10 and 11 are needed in case there are no results. And the line 14 is where the important thing happen. The result of the MultiRead is an iterator that contains all the ids of the albums with this disk id. We then read it from the albums tree, deserialize it and hand it to the user. Pretty simple, overall.

Now, let us go to the more complex scenario, where we want to do a search by artist or album title.

   1: public IEnumerable<Disk> FindByArtist(string prefix)

   2: {

   3:     return FindByMultiValueIterator(prefix, "ix_artists");

   4: }

5:

   6: public IEnumerable<Disk> FindByAlbumTitle(string prefix)

   7: {

   8:     return FindByMultiValueIterator(prefix, "ix_titles");

   9: }

10:

  11: private IEnumerable<Disk> FindByMultiValueIterator(string prefix, string treeIndexName)

  12: {

  13:     using (var tx = _storageEnvironment.NewTransaction(TransactionFlags.Read))

  14:     {

  15:         var dix = tx.GetTree(treeIndexName);

  16:         var albums = tx.GetTree("albums");

17:

  18:         using (var multiValueIterator = dix.Iterate(tx))

  19:         {

  20:             multiValueIterator.RequiredPrefix = prefix.ToLower();

  21:             if (multiValueIterator.Seek(multiValueIterator.RequiredPrefix) == false)

  22:                 yield break;

  23:             do

  24:             {

  25:                 using (var albumsIterator = multiValueIterator.CreateMutliValueIterator())

  26:                 {

  27:                     if (albumsIterator.Seek(Slice.BeforeAllKeys) == false)

  28:                         continue;

  29:                     do

  30:                     {

  31:                         var readResult = albums.Read(tx, albumsIterator.CurrentKey);

  32:                         using (var stream = readResult.Reader.AsStream())

  33:                         {

  34:                             yield return _serializer.Deserialize<Disk>(new JsonTextReader(new StreamReader(stream)));

  35:                         }

  36:                     } while (albumsIterator.MoveNext());

  37:                 }

  38:             } while (multiValueIterator.MoveNext());

  39:         }

40:

  41:         tx.Commit();

  42:     }

  43: }

You can see that in both cases, we handle it the same, because the actual behavior is the same. We don’t want to do just an exact match. If we wanted that, we could use the exact same logic as we did in FindByDiskId. But we want to do something more, we want to be able to search by prefix, not just by exact match. That means that we have to iterate over the tree. The only difference between FindByAlbumTitle and FindByArtist is the tree index that they use.

We start out as before, iterating over the index (line 18). Note that in line 20, we defined a required prefix, and we use the lower case form of the prefix. We also entered that to the index as lower case. This is how we are able to get case insensitive searches.

Line 21 actually take us to the beginning of all the entries greater or equal to the prefix, and by setting RequiredPrefix we ensure that we can’t go to any entry that is doesn’t have this prefix. This is an interesting example, because we are now iterating over all the index entries that have the same prefix. But the values of the index is also a list. That is why we do in line 25. Get all the values that match a particular entry with that particular prefix.

The value of that is the actual id that we use to go into the albums tree. And there you have in, non trivial search with Voron.

Tweet Share Share 9 comments

Tags:

raven

Comments

13 Dec 2013
16:34 PM

Khalid Abuhakmeh

So queries based on ids are pretty straight forward, but how do you perform range queries with a key value store like Voron?

In my mind is it as simple as taking the least common denominator between point A and Point B? For Example if you want to query between two dates (12/1/2013 to 12/15/2013 U.S.), you would start at the key 20131201 and go to 20131215? Then you would just iterate each key. I am assuming the index is ordered based on your property and that it has to be ordered in some natural way (string, numeric, etc...).

I would like to see posts on how to do other kinds of queries.

14 Dec 2013
20:11 PM

Ayende Rahien

Khalid, You get an iterator, you Seek to the right location, then you continue iterating until you are past the last thing you want to see.

14 Dec 2013
20:11 PM

Ayende Rahien

Khalid, As for other queries, just tell me what you want, I'll add them.

08 Jan 2014
10:46 AM

tobi

I'd be interested to see how you implement Unicode aware case-insensitive indexing. I think you'd need to use the SortKey class and I'm not sure about its performance.

08 Jan 2014
11:52 AM

Jesús López

I think FindByDiskId has a bug on line 14. it.CurrentKey contains the DiskId, but you need the AlbumId with is in the value not in the key.

09 Jan 2014
08:27 AM

Ayende Rahien

Tobi, The problem with using SortKey is that you don't get any promises about the stability of the format. In particular, the docs explicitly calls this out: ". If an application serializes a SortKey object, the application must regenerate all of the sort keys when there is a new version of the .NET Framework." Because it uses LCMapString it is also vulnerable for different Windows versions, culture settings, etc.

What we would generally do is just call ToLower and use that as the result.

09 Jan 2014
08:28 AM

Ayende Rahien

Jesus, No, there isn't a bug. We are doing a multi read, so the values in the iterator are actually the values, not the keys.

09 Jan 2014
15:42 PM

Jesús López

it.CurrentKey doesn't look like a value, it looks like a key.

09 Jan 2014
15:53 PM

Ayende Rahien

Jesus, Yes, that is an unforutnate side affect of the way muti value items work. The values are stored in keys.

Comment preview

Comments have been closed on this topic.

Oren Eini

Oren Eini

CEO of RavenDB