RavenDB Performance Tweaks
Continuing my efforts to make RavenDB faster, I finally settled on loading all the users from Stack Overflow into Raven as a good, small benchmark. This is important, because I want something big enough to give me meaningful stats, but small enough that I wouldn't have to wait too long for results.
The users data from Stack Overflow is ~180,000 documents, but it loads in just a few minutes, so it is a very convenient dataset.
With that, I was able to identify a major bottleneck:
The AddTask takes a lot of time, mostly because it needs to update a JSON field, which means loading it, updating it, and then saving it back. In many cases, it is the same value that we keep parsing over & over again. I introduced a JSON cache, which saves the parsing cost. While introducing it, I was also able to apply it in a few other places, which should hopefully improve performance there as well.
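To give a rough idea of the approach, here is a minimal sketch of such a cache, assuming we key on the raw JSON text; the JsonCache class and its shape are illustrative only, not Raven's actual implementation:

using System.Collections.Concurrent;
using Newtonsoft.Json.Linq;

// Illustrative sketch: cache parsed documents by their raw JSON text,
// so the same value is not re-parsed on every call.
public static class JsonCache
{
    private static readonly ConcurrentDictionary<string, JObject> Cache =
        new ConcurrentDictionary<string, JObject>();

    public static JObject Parse(string json)
    {
        // Parse once per distinct JSON string, then hand out a deep copy
        // so callers can mutate the result without corrupting the cache.
        return (JObject)Cache.GetOrAdd(json, JObject.Parse).DeepClone();
    }
}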
Current speed, when inserting ~183,000 documents (over HTTP on the local host) in batches of 128 documents, is 128 milliseconds per batch. I consider this pretty good. Note that this is with indexing enabled, but not counting how long it takes for the background indexes to finish processing.
When measuring that as well, after inserting hundreds of thousands of documents, it took another half a second (note that my measuring granularity was 0.5 seconds) to finish updating the indexes in the background. I think it is time to try throwing the entire Stack Overflow data dump at Raven and see how fast it can handle it.
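For context, the benchmark loop is roughly shaped like the sketch below, assuming the standard RavenDB client API of the time; the User type and the InsertInBatches helper are placeholders, not the actual benchmark code:

using System.Collections.Generic;
using Raven.Client.Document;

// Sketch only: insert documents over HTTP in batches of 128,
// one SaveChanges (one HTTP round-trip) per batch.
public static class BulkInsertSketch
{
    public static void InsertInBatches(IEnumerable<User> users)
    {
        using (var store = new DocumentStore { Url = "http://localhost:8080" })
        {
            store.Initialize();

            var batch = new List<User>(128);
            foreach (var user in users)
            {
                batch.Add(user);
                if (batch.Count < 128)
                    continue;

                SaveBatch(store, batch);
                batch.Clear();
            }
            if (batch.Count > 0)
                SaveBatch(store, batch); // flush the final partial batch
        }
    }

    private static void SaveBatch(DocumentStore store, List<User> batch)
    {
        using (var session = store.OpenSession())
        {
            foreach (var doc in batch)
                session.Store(doc);
            session.SaveChanges();
        }
    }
}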
Comments
Yeah, I temporarily added a value cache for my serializer as well, since certain things, like serializing dates, are very expensive. Obviously caching gives you an instant perf boost, but as I was trying to compare benchmarks with other serializers, it was unfair to tailor my serializer to the benchmark data (i.e. the Northwind database in a tight loop), so I ended up taking it out. It also means I no longer have to worry about running out of, or cleaning up, memory.
Would you consider using BSON instead of JSON? BSON is specifically designed for processing performance - having lengths at the start of documents (and strings) makes scanning quite efficient. I know the JSON.NET library made significant performance improvements to their BSON stuff over the last month or so.
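For reference, writing BSON with Json.NET looks roughly like this; the helper below is only an illustration, not code from either library:

using System.IO;
using Newtonsoft.Json;
using Newtonsoft.Json.Bson;

// Illustrative helper: serialize any object to BSON with Json.NET's BsonWriter.
public static class BsonExample
{
    public static byte[] ToBson(object value)
    {
        var serializer = new JsonSerializer();
        using (var stream = new MemoryStream())
        using (var writer = new BsonWriter(stream))
        {
            serializer.Serialize(writer, value);
            writer.Flush();
            return stream.ToArray(); // length-prefixed documents, cheap to scan
        }
    }
}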
Karl,
I am not that worried about the parsing cost; I am more worried about the re-parsing cost, which is something different.
@Karl
I also considered using BSON at one point. The problems I had with it were that it is less human readable/writeable, and that you can't dump it directly from the data store onto the response stream, which is very useful in Ajax apps.
I find BSON kind of lives in between JSON (slow, human readable/writeable, version-able, and interoperable) and Protocol Buffers (very fast, fairly interoperable, but more rigid and not human readable).
@Ken Egozi
Hey, thanks for the validation, I hope it's proven useful!
With all the de/serializers out there, I really thought .NET was missing something that was faster, smaller, more readable, and more resilient than JSON and XML, which is why I decided to create the hybrid JSON+CSV (aka JSV) format. As you've noticed, the CSV-style escaping is both quicker to parse and ends up being more readable.
Basically, values are not just strings or numbers (as they are in JSON); they are simply string-serialized values, which are CSV-escaped only if they contain any special chars. As a bonus of controlling the source code, I can ensure that it works just as well cross-platform on Mono.
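As a rough illustration of the difference (the sample type below is made up, and the JSV output shown is approximate):

using ServiceStack.Text;

public static class JsvExample
{
    // Made-up sample type for illustration.
    public class User
    {
        public int Id { get; set; }
        public string Name { get; set; }
    }

    public static string Serialize()
    {
        // JSON would produce:   {"Id":1,"Name":"Hello, World"}
        // JSV produces roughly: {Id:1,Name:"Hello, World"}
        // Field names and simple values stay unquoted; only the value
        // containing a comma needs CSV-style double-quote escaping.
        return TypeSerializer.SerializeToString(new User { Id = 1, Name = "Hello, World" });
    }
}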
In my experience, having written Json.NET, BSON is faster to read and JSON is faster to write.
If you do want to improve JSON read performance, that perf test pretty much nails it: parsing strings is the slowest part. I've made it as fast as I know how, but I'm sure someone could do it faster.
@James Newton-King
I think this is where CSV-style double-quote escaping proves effective performance-wise. I'm able to parse strings without a 'stack state', so I can parse simply by incrementing through a char array. I also don't need to decode any un-escaped values and can use them as-is.
Having said that, I believe the perf cost in deserialization is also largely due to instantiating and populating objects. I'm sure you're avoiding any runtime reflection, but you may not be using 'generic static optimizations' to also eliminate type or cached-delegate lookups, which I've found to provide a noticeable perf increase.
Feel free to examine the source code to see what I mean. The source code for the TypeSerializer is open source under the liberal new BSD licence, quite short, and hopefully easy to read. It all starts from this class:
code.google.com/.../TypeSerializer.cs
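For readers unfamiliar with the pattern, here is a generic sketch of what a 'generic static optimization' looks like; it is not the actual TypeSerializer code, just an illustration of resolving the per-type delegate once in a static field:

using System;

// Sketch of the "generic static optimization": the CLR runs the static
// initializer once per closed generic type, so the hot path pays no
// dictionary lookup or reflection cost after the first use of SerializerCache<T>.
public static class SerializerCache<T>
{
    public static readonly Func<T, string> Serialize = BuildSerializer();

    private static Func<T, string> BuildSerializer()
    {
        // Hypothetical: build the per-type delegate once, e.g. via
        // reflection or expression trees, and cache it here.
        return value => value.ToString();
    }
}

// Usage: SerializerCache<int>.Serialize(42) involves no type or delegate
// lookup on the hot path.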
For the ultimate tips on perf optimizations, I would also follow @Marc Gravell's .NET Protocol Buffers implementation:
http://code.google.com/p/protobuf-net/
He is notorious for doing every optimization known to .NET-land to squeeze out every ounce of performance possible.