Big Data Search: Binary Search of Textual Data
The index I created for the exercise is just a text file, sorted by the indexed key. When a human is doing a search, that makes it very easy to work with, much easier than trying to work with a binary file. It also helps debugging.
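For illustration only, you can picture an index line as the indexed key followed by the byte offset of the matching record in the data file (this is just a mental model; the actual format is the subject of a later post in this series):

    michael     1024
    michelle    2048
    oren        4096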
However, it does make running a binary search on the data a bit harder, mostly because there isn't a nice way to say "give me the #th line". Instead, I wrote the following:
public void SetPositionToLineAt(long position)
{
    // now we need to go back until we either get to the start of the file
    // or find a \n character
    const int bufferSize = 128;
    _buffer.Capacity = Math.Max(bufferSize, _buffer.Capacity);

    var charCount = _encoding.GetMaxCharCount(bufferSize);
    if (charCount > _charBuf.Length)
        _charBuf = new char[Utils.NearestPowerOfTwo(charCount)];

    while (true)
    {
        _input.Position = position - (position < bufferSize ? 0 : bufferSize);
        var read = ReadToBuffer(bufferSize);
        var buffer = _buffer.GetBuffer();
        var chars = _encoding.GetChars(buffer, 0, read, _charBuf, 0);
        for (int i = chars - 1; i >= 0; i--)
        {
            if (_charBuf[i] == '\n')
            {
                _input.Position = position - (bufferSize - i) + 1;
                return;
            }
        }
        position -= bufferSize;
        if (position < 0)
        {
            _input.Position = 0;
            return;
        }
    }
}
This code starts at an arbitrary byte position and goes backward until it finds the newline character '\n'. This gives me the ability to go to a rough location and get line-oriented input.
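For a sense of the mechanics in isolation, here is a minimal standalone sketch of the same backward-scan idea over a plain Stream. It works byte by byte, assumes an ASCII-compatible encoding such as UTF-8 (where '\n' is a single, unambiguous byte), and all the names here are mine rather than the actual reader's:

using System;
using System.IO;

static class LineSeeker
{
    // Position the stream at the first byte of the line containing 'position':
    // read a small chunk that ends at 'position' and scan it backward for '\n'.
    public static void SeekToLineStart(Stream input, long position)
    {
        const int bufferSize = 128;
        var buffer = new byte[bufferSize];

        while (position > 0)
        {
            var start = Math.Max(0, position - bufferSize);
            input.Position = start;
            var read = input.Read(buffer, 0, (int)(position - start));

            for (var i = read - 1; i >= 0; i--)
            {
                if (buffer[i] == (byte)'\n')
                {
                    // the line starts one byte past the previous newline
                    input.Position = start + i + 1;
                    return;
                }
            }
            position = start; // no newline in this chunk, keep going back
        }
        input.Position = 0; // reached the start of the file
    }
}

Calling SeekToLineStart(file, pos) and then doing a normal line read yields the full line that pos landed in the middle of.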
Once I have that, the rest is pretty easy. Here is the binary search:
while (lo <= hi)
{
    position = (lo + hi) / 2;
    _reader.SetPositionToLineAt(position);

    bool? result;
    do
    {
        result = _reader.ReadOneLine();
    } while (result == null); // skip empty lines

    if (result == false)
        yield break; // couldn't find anything

    var entry = _reader.Current.Values[0];
    match = Utils.CompareArraySegments(expectedIndexEntry, entry);

    if (match == 0)
    {
        break;
    }
    if (match > 0)
        lo = position + _reader.Current.Values.Sum(x => x.Count) + 1;
    else
        hi = position - 1;
}

if (match != 0)
{
    // no match
    yield break;
}
The idea is that this positions us on the location in the index that has an entry with a value equal to what we are searching for.
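To see the whole shape in one place, here is a condensed sketch of binary searching by raw byte offsets, reusing the hypothetical SeekToLineStart helper from the sketch above and assuming the key is the first tab-separated field on each line. The real code above compares raw byte segments (Utils.CompareArraySegments) rather than strings:

using System.Collections.Generic;
using System.IO;
using System.Text;

static class TextBinarySearch
{
    // Returns the byte offset of a line whose key equals 'key', or -1 if none.
    public static long FindLine(FileStream file, string key)
    {
        long lo = 0, hi = file.Length - 1;
        while (lo <= hi)
        {
            var mid = (lo + hi) / 2;
            LineSeeker.SeekToLineStart(file, mid);
            var lineStart = file.Position;

            var line = ReadLine(file);
            if (line == null)
                return -1; // ran off the end of the file

            var cmp = string.CompareOrdinal(key, line.Split('\t')[0]);
            if (cmp == 0)
                return lineStart;
            if (cmp > 0)
                lo = file.Position;  // search after the line we just read
            else
                hi = lineStart - 1;  // search before this line
        }
        return -1;
    }

    // Reads bytes up to the next '\n' (or end of file) and decodes them.
    public static string ReadLine(FileStream file)
    {
        var bytes = new List<byte>();
        int b;
        while ((b = file.ReadByte()) != -1 && b != '\n')
            bytes.Add((byte)b);
        if (bytes.Count == 0 && b == -1)
            return null;
        return Encoding.UTF8.GetString(bytes.ToArray());
    }
}

The point to notice is that lo and hi are byte positions, not line numbers; snapping the midpoint to a line start is what makes the search line-oriented.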
We then write the following to get the data from the actual data file:
// we have a match, now we need to return all the matches
_reader.SetPositionToLineAt(position);

while (true)
{
    bool? result;
    do
    {
        result = _reader.ReadOneLine();
    } while (result == null); // skip empty lines

    if (result == false)
        yield break; // end of file

    var entry = _reader.Current.Values[0];
    match = Utils.CompareArraySegments(expectedIndexEntry, entry);
    if (match != 0)
        yield break; // out of the valid range we need

    _buffer.SetLength(0);
    _data.Position = Utils.ToInt64(_reader.Current.Values[1]);

    while (true)
    {
        var b = _data.ReadByte();
        if (b == -1)
            break;
        if (b == '\n')
        {
            break;
        }
        _buffer.WriteByte((byte)b);
    }

    yield return _encoding.GetString(_buffer.GetBuffer(), 0, (int)_buffer.Length);
}
As you can see, we are moving forward in the index file, reading one line at a time. Then we take the second value, the position of the relevant line in the data file, and read that.
We continue to do so as long as the indexed value is the same. Pretty simple, all told. But it comes with its own set of problems, which I'll discuss in my next post.
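In the same standalone style, here is a sketch of that "scan forward while the key still matches" loop, assuming index lines of the form key, tab, byte offset into the data file, and reusing the hypothetical helpers from the earlier sketches. This is my illustration of the pattern, not the post's reader code:

using System.Collections.Generic;
using System.IO;

static class MatchScanner
{
    // Starting from a byte offset the binary search landed on, yield every
    // data-file record whose index key equals 'key'.
    public static IEnumerable<string> ReadMatches(
        FileStream index, FileStream data, long matchStart, string key)
    {
        LineSeeker.SeekToLineStart(index, matchStart);
        string line;
        while ((line = TextBinarySearch.ReadLine(index)) != null)
        {
            var parts = line.Split('\t');
            if (parts[0] != key)
                yield break; // walked past the run of equal keys

            // the second field is the byte offset of the record in the data file
            data.Position = long.Parse(parts[1]);
            yield return TextBinarySearch.ReadLine(data);
        }
    }
}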
More posts in "Big Data Search" series:
- (24 Jan 2014) Sorting randomness
- (23 Jan 2014) Sorting Optimizations
- (22 Jan 2014) The index format is horrible
- (20 Jan 2014) Binary Search of Textual Data
- (17 Jan 2014) Setting up
Comments
@Ayende, Correct me if I'm wrong, but this sort of binary search would require O(log N) file seeks. Wouldn't it be better to store the ranges of the files we merged and binary search through them to narrow the search in the index file? This would reduce file seeks, which are expensive, because instead of a 15TB region we would only need to search a small part of it.
Bartosz, I am not sure that I am following. Yes, this would require O(log N) seeks, but what do you mean by ranges of the files we merged? Again, there are no guarantees for different ranges in different files.
@Ayende, since you merge the files into one index, and this index is sorted, you should be able to store some sorted values in memory along with their offsets in the index file. When issuing a query you could start by binary searching those values; that should give you a smaller range to search in the index file.

For example:

Assume that we have two data sets, [A,B,C,D] and [A,C,D,F]. Since all we have to do to merge them is to maintain two pointers, we can easily determine where one file range starts and ends.

So:

After the merge we would have [A,A,B,C,C,D,F], and the ranges would be [A,D,F]; one A is omitted as we already have an index entry for that value. If another set that we needed to merge looked like [A,A,B,B], then the range set would be [A,B,D,F].

The downside is that we might end up with lots of indexed values like that, but this could be optimized to hold just 10K ranges.

I don't know if my explanation makes any sense; I guess the best way to show it would be through code.
Bartosz, I don't understand. After I merged the files, there is just one big index file. Why do I need anything else about the previous iterations?
@Ayende, what this operation is meant to do is to partition this one big indexed file.
Given that:
[A,A,B,C,C,D,F] is our big index file. If we want to search for C, we just binary search the range set [A,D,F]; since C is between A and D, we issue a binary search over the index file starting with lo = A and hi = D.
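Something like this rough sketch (all the names are made up): sample every Nth key from the merged index into memory, binary search the samples first, and only then touch the file:

using System;

class SparseRangeIndex
{
    private readonly string[] _keys;   // every Nth key from the index, sorted
    private readonly long[] _offsets;  // byte offset of each sampled key

    public SparseRangeIndex(string[] keys, long[] offsets)
    {
        _keys = keys;
        _offsets = offsets;
    }

    // Narrows the [lo, hi] byte range of the index file that can contain 'key',
    // so the on-disk binary search starts much closer to its target.
    public void NarrowRange(string key, long fileLength, out long lo, out long hi)
    {
        var i = Array.BinarySearch(_keys, key, StringComparer.Ordinal);
        if (i < 0)
            i = ~i - 1; // nearest sample at or before 'key'

        lo = i < 0 ? 0 : _offsets[i];
        hi = i + 1 < _offsets.Length ? _offsets[i + 1] : fileLength;
    }
}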
Bartosz, That isn't really going to save you anything in the real world. You basically want to cache the first level or two of the binary search. But given that the OS is already going to cache those internally anyway, I don't see any reason to do that.
@Ayende, Actually it depends on the data, but ideally this would save around 30% of the seeks. By caching internally, are you referring to read-ahead from disk into the cache?
Bartosz, Since we are always going to be hitting roughly the same spots (midway parts into the file), the OS is going to cache those in its own memory, and we can avoid doing an actual hardware seek.
@Ayende, Oh I See, new thing learned thanks! :)
Always full of interesting information. Maybe this is some mangling specific to The Old Reader, which I use to read your posts, but your code listings always get broken out into individual line blocks, which makes them a little harder to read.
(Btw, this is exactly what the Unix "look" command does.)