RavenDB Sharding: Enabling shards for an existing database
A question came up in the mailing list: how do we enable sharding for an existing database? I'll deal with data migration in this scenario in a later post.
The scenario is that we have a very successful application, and we are starting to feel the need to move the data to multiple shards. Currently all the data is sitting on the RVN1 server. We want to add RVN2 and RVN3 to the mix. For this post, we'll assume that we have the notion of Customers and Invoices.
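For reference, here is a minimal sketch of those two classes as assumed throughout this post (the property names match the sharding configuration shown below; everything else is illustrative):

public class Customer
{
    public string Id { get; set; }
    public string Name { get; set; }
    public string Region { get; set; } // used as the shard key below
}

public class Invoice
{
    public string Id { get; set; }
    public string Customer { get; set; } // the customer document id, e.g. "customers/1"
    public decimal Amount { get; set; }
}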
Previously, we accessed the database using a simple document store:
var documentStore = new DocumentStore
{
    Url = "http://RVN1:8080",
    DefaultDatabase = "Shop"
};
Now, we want to move to a sharded environment, so we want to write it like this. Existing data is going to stay where it is, and new data will be sharded according to geographical location.
var shards = new Dictionary<string, IDocumentStore>
{
    {"Origin", new DocumentStore {Url = "http://RVN1:8080", DefaultDatabase = "Shop"}}, // existing data
    {"ME", new DocumentStore {Url = "http://RVN2:8080", DefaultDatabase = "Shop_ME"}},
    {"US", new DocumentStore {Url = "http://RVN3:8080", DefaultDatabase = "Shop_US"}},
};

var shardStrategy = new ShardStrategy(shards)
    .ShardingOn<Customer>(c => c.Region)
    .ShardingOn<Invoice>(i => i.Customer);

var documentStore = new ShardedDocumentStore(shardStrategy).Initialize();
This wouldn't actually work; we are going to have to do a bit more. To start with, what happens when we don't have a 1:1 match between region and shard? That is where the translator becomes relevant:
.ShardingOn<Customer>(c => c.Region, region =>
{
    switch (region)
    {
        case "Middle East":
            return "ME";
        case "USA":
        case "United States":
        case "US":
            return "US";
        default:
            return "Origin";
    }
})
We basically say that we map several region values into a single shard. But that isn't enough. Newly saved documents are going to have the shard prefix, so saving a new customer and invoice in the US shard will show up with ids like US/customers/1 and US/invoices/1.
But existing data doesn't have this prefix, since it was created without sharding.
So we need to take some extra steps to let RavenDB know about them. We do this using the following two functions:
Func<string, string> potentialShardToShardId = val =>
{
    var start = val.IndexOf('/');
    if (start == -1)
        return val;
    var potentialShardId = val.Substring(0, start);
    if (shards.ContainsKey(potentialShardId))
        return potentialShardId;
    // this is probably an old id, let us use it.
    return "Origin";
};

Func<string, string> regionToShardId = region =>
{
    switch (region)
    {
        case "Middle East":
            return "ME";
        case "USA":
        case "United States":
        case "US":
            return "US";
        default:
            return "Origin";
    }
};
We can then register our sharding configuration like so:
var shardStrategy = new ShardStrategy(shards)
    .ShardingOn<Customer, string>(c => c.Region, potentialShardToShardId, regionToShardId)
    .ShardingOn<Invoice, string>(x => x.Customer, potentialShardToShardId, regionToShardId);
That takes care of handling both new and old ids, and lets RavenDB understand how to query things in an optimal fashion. For example, a query on all invoices for 'customers/1' will only hit the RVN1 server.
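For instance, a minimal sketch of such a query (assuming the Customer/Invoice classes from above):

using (var session = documentStore.OpenSession())
{
    // Invoice is sharded on its Customer property, so RavenDB can resolve
    // "customers/1" (no shard prefix) to the Origin shard and send this
    // query to RVN1 only, instead of fanning out to all three servers.
    var invoices = session.Query<Invoice>()
        .Where(i => i.Customer == "customers/1")
        .ToList();
}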
However, we aren't done yet. New customers that don't belong to the Middle East or the US will still go to the old server, and we don't want any modification to their ids there. We can tell RavenDB how to handle that like so:
var defaultModifyDocumentId = shardStrategy.ModifyDocumentId;
shardStrategy.ModifyDocumentId = (convention, shardId, documentId) =>
{
    if (shardId == "Origin")
        return documentId;
    return defaultModifyDocumentId(convention, shardId, documentId);
};
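To illustrate the effect (a sketch; the exact generated ids depend on your id generation conventions):

using (var session = documentStore.OpenSession())
{
    session.Store(new Customer { Region = "US" });     // goes to Shop_US, id like US/customers/2
    session.Store(new Customer { Region = "Europe" }); // falls through to Origin, id stays like customers/3
    session.SaveChanges();
}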
That is almost the end. There is one final issue we need to deal with: the old documents, created before we used sharding, don't have the required sharding metadata. We can fix that using a store listener. So we have:
var documentStore = new ShardedDocumentStore(shardStrategy);
documentStore.RegisterListener(new AddShardIdToMetadataStoreListener());
documentStore.Initialize();
Where the listener looks like this:
public class AddShardIdToMetadataStoreListener : IDocumentStoreListener
{
    public bool BeforeStore(string key, object entityInstance, RavenJObject metadata, RavenJObject original)
    {
        if (metadata.ContainsKey(Constants.RavenShardId) == false)
        {
            metadata[Constants.RavenShardId] = "Origin"; // the default shard id
        }
        return false; // we didn't modify the entity itself, only the metadata
    }

    public void AfterStore(string key, object entityInstance, RavenJObject metadata)
    {
    }
}
And that is it. I know that there seems to be quite a lot going on here, but it can basically be broken down into three actions that we take:
- Modify the existing metadata to add the sharding server id via the listener.
- Modify the document id convention so documents on the old server won't get a shard prefix (optional).
- Modify the sharding configuration so we’ll understand that documents without a shard prefix actually belong to the Origin shard.
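Putting it all together, the final setup reads roughly like this (just a condensed recap of the snippets above, no new API):

var shardStrategy = new ShardStrategy(shards)
    .ShardingOn<Customer, string>(c => c.Region, potentialShardToShardId, regionToShardId)
    .ShardingOn<Invoice, string>(x => x.Customer, potentialShardToShardId, regionToShardId);

var defaultModifyDocumentId = shardStrategy.ModifyDocumentId;
shardStrategy.ModifyDocumentId = (convention, shardId, documentId) =>
    shardId == "Origin" ? documentId : defaultModifyDocumentId(convention, shardId, documentId);

var documentStore = new ShardedDocumentStore(shardStrategy);
documentStore.RegisterListener(new AddShardIdToMetadataStoreListener());
documentStore.Initialize();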
And that is pretty much it.
More posts in "RavenDB Sharding" series:
- (22 May 2015) Adding a new shard to an existing cluster, splitting the shard
- (21 May 2015) Adding a new shard to an existing cluster, the easy way
- (18 May 2015) Enabling shards for existing database
Comments
Idsa,
1) You are correct, I implicitly assumed that in the post.
2) I absolutely agree that it is much simpler to import/export everything that way. It would result in a cleaner system, without legacy patches. When you can't take the system offline, however, you have the option of doing it the hard way.
3) That is great.
4) The Load issue you found is a pretty obscure thing: a transformer that returns an id with the same value. This is fixed now, but it isn't something that you would generally notice, which is why we missed it.
5) It is possible to add new nodes to a shard, yes. And we have plans to do sharding at a deeper level in a future version.
We had an established database in production. Our large documents and expensive indexes (the fault of our less than ideal architecture, not Raven's) were choking the server. (Why is another story.) We have a very bursty load pattern once a week, and we gained so many new users last year that the one RavenDB server couldn't keep up.
Last fall (2014) we did what was described here, plus wrote a custom ShardResolutionStrategy (because we didn't have a good natural key that was evenly split across the shards). It took 3 developers 3 days, plus one day of ops—creating the new shard instances, manually moving documents to the new shards, etc. It was quick enough that we pulled it off before the next load spike, and it worked beautifully; the shards have been running smoothly ever since.
I was very impressed with the process, overall. It ended up being easier than I thought it would be.
That being said, there are some rough spots that, if addressed, would make the process a lot nicer.
All that said, RavenDB is one of the better single-to-sharded data store stories I've been involved with. So kudos Hibernating Rhinos, and thanks for continuing to improve it even more.
1) Can you give us some examples, so we can create something that is a bit more real world?
2) That is basically what the smuggler is doing here. And the idea is that you can use a transform script to move and transform the data. Having a custom tool for something like this doesn't make a lot of sense, since the logic in the resharding is usually different from the actual sharding logic.
3) That is likely not going to happen. The sharding resolution is in your code, and the studio isn't aware that this exists. We have plans for a more complete sharding impl on the server side, which may have this, but that would be for 4.0 only, and isn't much beyond some rough drafts now.
4) That is actually by design. The example about the version is a very good one. We don't have a way to report that in a sane manner. A server has a single version, but a cluster may have many. How do you report that on the same interface?
5) What tools do you refer to? Usually they can be made to work by using IDocumentStore, but there are things that work directly with the internal state, which obviously cannot be made to work.
RE example shard resolution strategy: The PotentialShardsFor method has to (potentially) check whether it is a query- or key-based lookup, then whether it is a lookup by a known EntityType or not, whether it is a direct load versus a query with an IN operator, formatting differences between a single value versus CSV values when loading with an array of IDs, etc. So I mean the tree of possible (or at least common) scenarios for all the ways a document or query can be requested over the client API and put into the fairly loosely structured ShardRequestData object.
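To make that concrete, here is a rough sketch of the kind of branching being described, as a fragment of an IShardResolutionStrategy implementation (the ShardRequestData members used here are from memory of the 3.x client and should be treated as assumptions; potentialShardToShardId is the prefix parser from the post above):

public IList<string> PotentialShardsFor(ShardRequestData requestData)
{
    // Key-based lookup: session.Load("US/customers/1"), or Load with an array of ids.
    if (requestData.Keys != null && requestData.Keys.Count > 0)
    {
        return requestData.Keys
            .Select(potentialShardToShardId) // parse the "US/"-style prefix, if any
            .Distinct()
            .ToList();
    }

    // Query-based lookup: all we have is the entity type and the query itself,
    // so unless we can parse something useful out of it, we fan out to every shard.
    return shards.Keys.ToList();
}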
RE resharding / moving documents between shards and "the logic in the resharding is usually different from the actual sharding logic": I figured this tool, let's call it a 'rebalancing' tool, would iterate over all documents in every shard, apply the new ShardResolutionStrategy to each, check the current document's metadata for the current shard key, and if it is missing or different from the new shard key for that document, atomically (via DTC) move the document to the new shard. In that case, it wouldn't matter what the old logic was; it only would need to know the new logic. (To be a little safer, it might require that the shards be offline so the clients don't have to deal with the mess of supporting both shard strategies at the same time. Same with the 'duplicate document' client exception during the move.) It might not be the most perfect design to check the shard strategy for every document, but if it was done while offline (possibly at the server level? something it can't do yet, but see #3 below...) it could be a decent trade-off for a very useful sharding tool that would make many scenarios a lot easier.
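In pseudocode, the proposed rebalancing pass would look something like this (no such tool exists; IterateAllDocuments, ShardIdFor, and MoveDocument are made-up helper names standing in for the steps described above):

foreach (var shard in shards)
{
    foreach (var doc in IterateAllDocuments(shard.Value)) // hypothetical: stream every document in the shard
    {
        var currentShardId = doc.Metadata.Value<string>(Constants.RavenShardId);
        var newShardId = newStrategy.ShardIdFor(doc);      // hypothetical: apply only the *new* strategy
        if (newShardId != currentShardId)
            MoveDocument(doc, shard.Value, shards[newShardId]); // hypothetical: the atomic move
    }
}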
RE client admin UI knowing about shards: Server-based shard knowledge could help in v4. But for now, ignoring that possible feature, in v3 there is a disconnect between the two clients' knowledge: code clients have to know about shards, but the UI client can't know about the shard strategy. Perhaps the shard strategy code needs to be either on the server (as you suggested might come in v4) or written in a language that all systems can understand, like the Patch Requests do with their limited but RavenDB-defined set of JavaScript. That way the shard code can be added to the UI as an upload, a server-based configuration, or a server plugin DLL--any way for the admin UI to understand the relationship of the shards just like the clients do.
The way my company has been working around this for now is to build our own admin tools directly in the app. But doesn't that defeat the point of the RavenDB admin UI then? Shouldn't it know about and be able to operate with core RavenDB features to start with? Why do client RavenDB libraries have a ShardedDocumentStore, but the admin UI does not? I know the UI is currently written with the concept that it administers a single server in isolation. If that's the final purpose of the admin UI, then I guess you can just shrug off this problem. But if you want to treat the RavenDB admin UI as an enterprise level administration and query/debug access tool for a RavenDB "system" then it will need to start thinking of servers as nodes in a larger system, instead of just being an HTML/JS web page hosted under and connected to a single isolated server.
Switching the admin UI to think about something beyond a single Raven instance is a big task and possibly outside your business plan, so I'm not sure if you want to do that or not. But it is the next step in the enterprise tool chain in my opinion.
RE enabling all features when sharding: I imagine the client API would have to be updated to support returning an array or map of results from each shard. It would require a little bit more work on the consuming client code to process those results (unlike the current way where queries across shards are automatically re-assembled from separate HTTP requests into one data list returned from the RavenDB client API--which I also have a few problems with when it comes to debugging and paging... but that's another topic). But it would still be easier to consume than dealing with the internal ShardedDocumentStore shard list and state (and lack of individual shard DocumentStore initialization) that any consumer code has to use now.
RE tooling that breaks when using ShardedDocumentStore: The specific tools that broke for us were New Relic monitoring and Glimpse.RavenDB for debugging/profiling. Both of those tools can only be initialized with ProductName.SomeMethodToInitialize(DocumentStore). I assume many others would be in the same boat, because IDocumentStore doesn't define several properties, methods, and listeners that are provided in the DocumentStore only but are needed for various inspection tools like these.
I tried to switch Glimpse.RavenDB (http://www.nuget.org/packages/Glimpse.RavenDb) to use IDocumentStore, but the lack of some of the RavenDB listener hooks by a ShardedDocumentStore was a blocker. I think JsonRequestFactory.LogRequest, SessionCreatedInternal, and a couple others were some of the ones that could be attached to but were never called by ShardedDocumentStore--only by DocumentStores but not the ones inside the shard list in a ShardedDocumentStore. (I forget exactly which listeners still worked and which didn't.) So in this particular case, adding the missing listeners to the ShardedDocumentStore would make it a lot easier. But the point remains that it is a significant barrier to working with the RavenDB client stores abstractly when IDocumentStore interface doesn't define enough to cover all the available features.
Also, internal tools broke that we wrote to issue database commands to update indexes and return status information. Of course, I would assume the status information would still need to be per server and not aggregated across the shards. But since the indexes have to exist on every shard--otherwise the entire request fails for all shards (another quirk that has bitten us, and one I still haven't decided is a good safe-by-default rule or not)--aggregating index updates across shards, for example, would make complete sense to do with one command that is automatically applied to all applicable shards. Going through every database command and deciding which make sense across shards and which don't may get a little awkward though, so I'm sure there is a lot of room for discussion on this suggestion.
This comment is getting too long... We can move this to the discussion group if you have more questions about my thoughts; let me know if you would like me to.
Oops, regarding the "tools that broke when using a ShardedDocumentStore", I meant just Glimpse.RavenDB and our custom admin tools to run various database commands -- not New Relic.
1) We will probably change that, see: http://issues.hibernatingrhinos.com/issue/RavenDB-3488
2) We are NOT going to use DTC. To start with, it is going to create locks and cause a whole lot of issues. The problem with doing things this way is that you are going to generate a LOT of work on large systems, and you want to shard when you have a large system, after all. That means a big downtime.
3) That is a 4.0 feature that we are considering.
Note that tools like Glimpse.RavenDB need to be initialized for _each individual document store_, not on the ShardedDocumentStore.
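In other words (a sketch, using the same placeholder initializer name as the comment above; the cast assumes each shard entry is a concrete DocumentStore, as in the post's setup):

foreach (IDocumentStore shard in shards.Values)
{
    // Attach the profiler to each concrete shard store; the ShardedDocumentStore
    // wrapper doesn't raise the hooks these tools rely on.
    ProductName.SomeMethodToInitialize((DocumentStore)shard);
}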
And yes, I agree that this should probably go to the mailing list.