Big Data SearchBinary Search of Textual Data

time to read 29 min | 5758 words

The index I created for the exercise is just a text file, sorted by the indexed key. When doing a search by a human, that make it very easy to work with. Much easier than trying to work with a binary file, it also helps debugging.

However, it does make it running a binary search on the data a bit harder. Mostly because there isn’t a nice way to say “give me the #th line”. Instead, I wrote the following:

   1: public void SetPositionToLineAt(long position)
   2: {
   3:     // now we need to go back until we either get to the start of the file
   4:     // or find a \n character
   5:     const int bufferSize = 128;
   6:     _buffer.Capacity = Math.Max(bufferSize, _buffer.Capacity);
   7:  
   8:     var charCount = _encoding.GetMaxCharCount(bufferSize);
   9:     if (charCount > _charBuf.Length)
  10:         _charBuf = new char[Utils.NearestPowerOfTwo(charCount)];
  11:  
  12:     while (true)
  13:     {
  14:         _input.Position = position - (position < bufferSize ? 0 : bufferSize);
  15:         var read = ReadToBuffer(bufferSize);
  16:         var buffer = _buffer.GetBuffer();
  17:         var chars = _encoding.GetChars(buffer, 0, read, _charBuf, 0);
  18:         for (int i = chars - 1; i >= 0; i--)
  19:         {
  20:             if (_charBuf[i] == '\n')
  21:             {
  22:                 _input.Position = position - (bufferSize - i) + 1;
  23:                 return;
  24:             }
  25:         }
  26:         position -= bufferSize;
  27:         if (position < 0)
  28:         {
  29:             _input.Position = 0;
  30:             return;
  31:         }
  32:     }
  33: }

This code starts at an arbitrary byte position, and go backward until it find the new line character ‘\n’. This give me the ability to go to a rough location and get the line oriented input.

Once I have that, the rest is pretty easy. Here is the binary search:

   1: while (lo <= hi)
   2: {
   3:     position = (lo + hi) / 2;
   4:     _reader.SetPositionToLineAt(position);
   5:  
   6:     bool? result;
   7:     do
   8:     {
   9:         result = _reader.ReadOneLine();
  10:     } while (result == null); // skip empty lines
  11:  
  12:     if (result == false)
  13:         yield break; // couldn't find anything
  14:  
  15:     var entry = _reader.Current.Values[0];
  16:     match = Utils.CompareArraySegments(expectedIndexEntry, entry);
  17:  
  18:     if (match == 0)
  19:     {
  20:         break;
  21:     }
  22:     if (match > 0)
  23:         lo = position + _reader.Current.Values.Sum(x => x.Count) + 1;
  24:     else
  25:         hi = position - 1;
  26: }
  27:  
  28: if (match != 0)
  29: {
  30:     // no match
  31:     yield break;
  32: }

The idea is that this positions us on the location of the index that has an entry with a value that is equal to what we are searched on.

We then write the following to actually get the data from the actual data file:

   1: // we have a match, now we need to return all the matches
   2: _reader.SetPositionToLineAt(position);
   3:  
   4: while(true)
   5: {
   6:     bool? result;
   7:     do
   8:     {
   9:         result = _reader.ReadOneLine();
  10:     } while (result == null); // skip empty lines
  11:  
  12:     if(result == false)
  13:         yield break; // end of file
  14:  
  15:     var entry = _reader.Current.Values[0];
  16:     match = Utils.CompareArraySegments(expectedIndexEntry, entry);
  17:     if (match != 0)
  18:         yield break; // out of the valid range we need
  19:  
  20:     _buffer.SetLength(0);
  21:     _data.Position = Utils.ToInt64(_reader.Current.Values[1]);
  22:  
  23:     while (true)
  24:     {
  25:         var b = _data.ReadByte();
  26:         if (b == -1)
  27:             break;
  28:         if (b == '\n')
  29:         {
  30:             break;
  31:         }
  32:         _buffer.WriteByte((byte)b);
  33:     }
  34:  
  35:     yield return _encoding.GetString(_buffer.GetBuffer(), 0, (int)_buffer.Length);
  36: }

As you can see, we are moving forward in the index file, reading one line at a time. Then we take the second value, the position of the relevant line in the data file, and read that.

We continue to do so as long as the indexed value is the same. Pretty simple, all told. But it comes with its own set of problems. I’ll discuss that in my next post.

More posts in "Big Data Search" series:

  1. (24 Jan 2014) Sorting randomness
  2. (23 Jan 2014) Sorting Optimizations
  3. (22 Jan 2014) The index format is horrible
  4. (20 Jan 2014) Binary Search of Textual Data
  5. (17 Jan 2014) Setting up