Oren Eini

CEO of RavenDB

a NoSQL Open Source Document Database

Get in touch with me:

oren@ravendb.net +972 52-548-6969

Posts: 7,546
|
Comments: 51,161
Privacy Policy · Terms
filter by tags archive
time to read 3 min | 577 words

I realized lately that I wanted to know a lot more about exactly how Lucene is storing data on disk. Oh, I know the general stuff about segments and files, etc. But I wanted to know the actual bits & bytes. So I started tracing into Lucene and trying to figure out what it is doing.

And, by the way, the only thing that the Lucene.NET codebase is missing is this sign:

image

At any rate, this is how Lucene writes the segment file. Note that this is done in a CRC32 signed file:

image

And the info write method is:

image

Today, I would probably use a JSON file for something like that (bonus point, you know if it is corrupted and it is human readable), but this code was written in 2001, so that explains it.

This is the format of the format of a segment file, and the segments.gen file is generated using:

image

Moving on to actually writing data, I created ten Lucene documents and wrote them. Then just debugged through the code to see what will happen. It started by creating _0.fdx and _0.fdt files. The .fdt is for fields, the fdx is for field indexes.

Both of those files are used when writing the stored fields. This is the empty operation, writing an unstored field.

image

This is how fields are actually stored:

image

And then it ends up in:

image

Note that this particular data goes in the fdt file, while the fdx appears to be a quick way to go from a known document id to the relevant position in the fdx file.

As I was going through the code, I did some searches, and found a very detailed explanation of the actual file format in the docs. That is really nice and quite informative, however, just seeing how the “let us take the documents and make them searchable” part is quite interesting. Lucene has a lot of chains of responsibilities going through. And it is also quite interesting to see the design choices that were made.

Unfortunately, Lucene is very much wedded to its file format, and making changes to it isn’t going to be possible, which is a shame, since it impacts quite a lot of the way Lucene works in general.

time to read 3 min | 487 words

An interesting question that keeps popping up is how to re-reduce the results of a map/reduce. That is really nice feature on the surface, but it has a lot of implications, for example, when / how you run the 2nd reduce, can you chain only 1 time, or multiple times , what happens when there are a lot of reduce results, etc.

But most of the time, what people want is to be able to do aggregation on the map/reduce results without too much hassle, and they don’t have a lot of aggregated results or they are fine with waiting for them if they are very large. And we have a really nice solution for that scenario.

You start by defining the base map/reduce operation, like so:

image

Note that we need to also output the fields that we care about reducing further. In this case, we start by reducing to postal code, but we keep the city, region and country options as well.

Then, we define a transformer. Note that this is a special transformer, in that it has a group by in it, and it takes some parameters from outside.

image

Using those two together, we can now get the following results…

Raw map/reduce output:

image

With loc = City, we get:

image

With loc = Country, we get:

image

Tada, we have reduced further the result of a map/reduce operation. Now, this is subject to the usual limitations of RavenDB paging, in that it will only go through the only 1024 results. That can be a problem, but that is why RavenDB has the Streaming API.

You can use streaming on a map/reduce index with a transformer (and even apply parameters on top of that). That end up giving you the ability to run a re-reduction on top of a map/reduce index regardless of size.

Of course, on very large result sets, that can take quite a while, but that is expected and usually fine. For that matter, if you need to, you can chain the stream into a bulk insert, and get the re-reduction in that manner.

time to read 10 min | 1909 words

Rafal  an Ben Foster commented on my previous post with some ideas on how to deal with incremental updates to map/reduce indexes. Rafal said:

Actually, it's quite simple if you can 'reverse' the mapping operation (for given key find all documents matching that key): you just delete aggregate record with specified key and run incremental map-reduce on all matching documents. In today's example, you would delete the aggregate with key='oren' and then run map reduce with a query:

db.items.mapReduce(map,reduce, { out: {reduce: ‘distinct_item_names’}, query: {name: 'oren' } });

And Ben said:

It's worth mentioning that I was able to get the MongoDB map-reduce collections updating automatically (insert/update/delete) by monitoring the MongoDB OpLog …

…and listen for new documents in the OpLog which could then be used to re-execute an incremental Map-Reduce.

And while this looks right, this actually can’t possibly work. I’ll start from Rafal’s suggestion first. He suggest just issuing the following set of commands whenever we delete something from the database:

   1: db.distinct_item_names.remove({name: 'oren' } });
   2: db.items.mapReduce(map,reduce, { out: {reduce: ‘distinct_item_names’}, query: {name: 'oren' } });

And yes, that will actually work, as long as you are careful to never do this concurrently. Because if you do run this concurrently… well, the best you can hope is no data, but the liker scenario is data corruption.

But this actually gets better, deletes are annoying, but they are a relatively simple case to process. You have updates to deal with too. We’ll assume that we are watching the oplog to get notified when this happens. Here is an MongoDB oplog entry

   1: {
   2:   "ts": {
   3:     "t": 1286821984000,
   4:     "i": 1
   5:   },
   6:   "h": "1633487572904743924",
   7:   "op": "u",
   8:   "ns": "items",
   9:   "o2": {
  10:     "_id": "4cb35859007cc1f4f9f7f85d"
  11:   },
  12:   "o": {
  13:     "$set": {
  14:       "Name": "Eini"
  15:     }
  16:   }
  17: }

As you can see, we an update operation (op: u) on a specific document (o2._id) with the specified update (o.$set). That is really great, and it is utterly useless for our purposes. In this case, we updated the name from Oren to Eini, so we would like to be able to run this:

   1: db.distinct_item_names.remove({name: 'oren' } });
   2: db.distinct_item_names.remove({name: eini' } });
   3: db.items.mapReduce(map,reduce, { out: {reduce: ‘distinct_item_names’}, query: {name: 'oren' } });
   4: db.items.mapReduce(map,reduce, { out: {reduce: ‘distinct_item_names’}, query: {name: eini' } });

Except that we don’t have any way to get the old value out from the oplog. And this still isn’t going to work concurrently.

But let us say that we decided to have a watcher process monitor the oplog somehow, and it will ensure no concurrency of those requests. Now you have to deal with fun issues like: “what happens if the watcher process recycle?”  How do you keep your place in the oplog (and remember, the oplog is capped, stuff you haven’t seen might be removed if they are beyond the specified size.

And… to be frank, once we have done all of that, this is still the easy part. One of the reasons that you want to do this work in the first place is to deal with large amount of data. But you cannot assume that you’ll have even distribution of the data.

One bug request that came against the RavenDB map/reduce implementation was a map/reduce index on the US Census data. That is ~300 million documents, and the index the user wanted to build was a map/reduce group by the state. You have states like California, with more than 30 million people in it, and you realize that you don’t want to have to re-do the map/reduce over the entire 30+ million documents that you have there. In RavenDB, under this scenario, you’ll have to issue about 3,073 operations, by the way. Versus the 30 millions you would need for this approach.

So yeah, “incremental” map/reduce can’t handle concurrent work, can’t handle deletes, can’t handle updates, and definitely shouldn’t be used on large data sets. And that is after you went to the trouble of setting up the watcher process, monitoring the oplog, etc.

Or, you can use RavenDB and you get a true incremental map/reduce without having to worry about any of that.

time to read 6 min | 1119 words

Ben Foster has a really cool article showing some of the similarities and differences between MongoDB & RavenDB with regards to their map/reduce implementation.

However, there is a very important distinction that was missed. Map/reduce operations are run online in MongoDB, that means that for large collections, map/reduce is going to be very expensive. MongoDB has the option of taking the result of a map/reduce operation and writing it to a collection, so you don’t need to run map/reduce jobs all the time. However, that is a snapshot view of the data, not a live view. Ben mentioned that you can do something called incremental map/reduce, but that isn’t actually really good idea at all.

Let us look at the following sequence of operations:

   1: db.items.insert({name: 'oren', ts: 1 });
   2: db.items.insert({name: 'ayende', ts: 2});
   3:  
   4: var map = function Map() { emit(this.name,null); };
   5: var reduce = function(key, val) { return key; };
   6:  
   7: db.items.mapReduce(map,reduce, { out: 'distinct_item_names' });

This creates two items, and give me the distinct names in a separate collection. Now, let us see how that works with updates…

   1: db.items.insert({name: 'eini', ts: 3 });
   2:  
   3: db.items.mapReduce(map,reduce, { out: {reduce: 'distinct_item_names'}, query: {ts: {$gt: 2} } });

This is actually nice, mongo is able to merge the previous results with the new results, so you only have to do the work on the new data. But this has several implications:

  • You have to kick something like ‘ts’ property around to check for new stuff. And you have to _udpate_ that ts property on every update.
  • You have to run this on a regular basis yourself, mongo won’t do that for you.
  • It can’t work with deletes.

It is the last part that is really painful:

   1: db.items.remove({name: 'oren'});

Now, there is just no way for you to construct a map/reduce job that would remove the name when it is gone.

This sort of thing works very nicely when what you want is to just append stuff. That is easy. It is PITA when we are talking about actually using it for live data, that can change and be modified.

Contrast that with the map/reduce implementation in RavenDB:

  • No need to manually maintain state, the database does it for you.
  • No need to worry about updates & deletes, the database does it for you.
  • No need to schedule map/reduce job updates, database does it for you.
  • Map/reduce queries are very fast, regardless of data size.

To be frank, the map/reduce implementation in RavenDB is complex, and pretty much all of it comes down to the fact that we don’t do stupid stuff like run a map/reduce operation on a large database on every query, and that we support edge cases scenarios like data that is actually updated or deleted.

Naturally I’m biased, but it seems to me that trying to use map/reduce in Mongo just means that you have to do a lot of hand holding yourself, while with RavenDB, we take care of everything and leaving you to actually do stuff.

time to read 2 min | 344 words

I’m currently working on a project using Voron (although only incidentally), and I was horrified to get the FatalExecutionEngineException that was discussed previously. Naturally, I assumed that this is something that I did wrong in Voron.

image

After a lot of work, I managed to narrow it down to… not Voron. To be rather more exact, it is Voron that is causing it, but it isn’t Voron’s fault.

The actual issue is the fact that I’ve accidently using Json.NET to serialize a Stream that I got from Voron. And Voron is giving us an UnmanagedMemoryStream. I kept thinking that I was doing something wrong with releasing memory, but as it turns out, here is a very small repro:

 unsafe static void Main(string[] args)
 {
     JsonConvert.SerializeObject(new Foo { Ptr = (byte*)0 });

 }

 public unsafe class Foo
 {
     public byte* Ptr { get; set; }
 }

And that is enough to get the above mentioned error.

What actually happens is that Json.NET is using dynamic IL generation to optimize accessing properties. And it just doesn’t know how to handle pointer properties. What it ended up doing is to generate invalid IL, which resulted in a crash when we tried to actually use it.

Nothing to do with Voron at all, just a class that has a pointer property that was attempting serialization.  Nasty bug, and very annoying to try to figure out.

time to read 1 min | 105 words

As we are getting ready for the RavenDB Conference, our team is actually starting to show the fruits of all of this hard work. For example, take a look at the new studio replication UI configuration. There are a lot of things like that, and I am really pumped to see all the disparate things that we have been doing over the past year come together and join up into a single and very pretty picutre.

image

Support Triage

time to read 1 min | 132 words

We got a call to the office.

We have a huge problem with our system, you need to come and help us right away. This is a critical system and we need immediate response.

That kinda of annoying, of course, but it is all part of the service. So, in order to log the appropriate items into our system, we asked:

What is your order id? And what is your support contract number?

And the answer was:

Oh, that was handled by another department, I’m not sure.

So we asked them to figure that out and send it to us, and waited. The call came at noon, but 7 PM, I sent them an email.

The reply I got back was:

We’ll try to find the order details tomorrow.

I guess it isn’t so huge, critical and immediate problem any longer…

time to read 1 min | 99 words

For a long time, xUnit’s dev has resisted adding support for skipping a test dynamically. You could create your own Fact class and handle that yourself, but that was quite a lot of work, for something very specific.

In RavenDB, we had the need to dynamically decide whatever the test can run based on the actual test situation, so I decided to add this to our xUnit fork. This turned out to be really simple to do. Just three lines of code Smile

https://github.com/ayende/xunit/commit/82accb4c850a3938187ac334fb73d6e81dc921e3#diff-3c2b9f2cb8392f32456d0bf81151b59fR57

time to read 2 min | 392 words

noname

As you probably know, we have the RavenDB Conference coming up in a few weeks. This is the very first conference that we are doing, and to say that we are excited would be a major understatement. We are going to have speakers talk about RavenDB from all angles. From the development team (including yours truly) to the operation guys behind RavenHQ as well as customers and consultants that can share their real world experience about how to best utilize RavenDB.

The idea is to have an intimate gathering, and for the very first time, to actually have direct way to talk with all of our users, as well as share a lot of the stuff that happens behind the scenes. Not to mention, we crave feedback, and several key features of RavenDB were actually proposed by the community and then implemented by us.

I was reviewing the registration numbers for the conference and it looks like we have made a… miscalculation. The idea with the conference was to charge just enough to ensure a commitment to come, to avoid the plague of free conferences where many people sign up and most never come. We even offset things so you can pay 89$ for the conference and you get a 90$ coupon for RavenDB. On the side, since we are already there, we also setup an additional 3 days course for an in-depth look at RavenDB.

But we had a lot more demand for the conference and the course than we thought we would have. We currently have just 2 places open for the course, and about 15 – 20 places open for the conference.

Since the course is more expensive, that was a the kind of surprise that is actually kinda nice to have: “Wait, you mean that they are giving us more money than we thought they would…”

At any rate, I find myself in the strange situation of encouraging people to go and buy the cheaper product, simply because we are really going to have no more room for the course soon.

You can register here, I’m looking forward to seeing you all.

FUTURE POSTS

  1. Partial writes, IO_Uring and safety - about one day from now
  2. Configuration values & Escape hatches - 5 days from now
  3. What happens when a sparse file allocation fails? - 7 days from now
  4. NTFS has an emergency stash of disk space - 9 days from now
  5. Challenge: Giving file system developer ulcer - 12 days from now

And 4 more posts are pending...

There are posts all the way to Feb 17, 2025

RECENT SERIES

  1. Challenge (77):
    20 Jan 2025 - What does this code do?
  2. Answer (13):
    22 Jan 2025 - What does this code do?
  3. Production post-mortem (2):
    17 Jan 2025 - Inspecting ourselves to death
  4. Performance discovery (2):
    10 Jan 2025 - IOPS vs. IOPS
View all series

Syndication

Main feed Feed Stats
Comments feed   Comments Feed Stats
}