The role of domain model with CQRS / Event Sourcing
I had some really interesting discussions while I was at CodeMash, and a few of them touched on modeling concerns with non-trivial architectures. In particular, I was asked about my opinion on the role of OR/M in systems that mostly do CQRS, event processing, etc.
This is a deep question, because on first glance, your requirements from the database are pretty much just:
INSERT INTO Events(EventId, AggregateId, Time, EventJson) VALUES (…)
There isn’t really the need to do anything more interesting than that. The other side of that is a set of processes that operate on top of these event streams and produce read models that are very simple to consume as well. There isn’t any complexity in the data architecture at all, and joy to the world, etc, etc.
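As a sketch of how little the storage layer has to do, here is a minimal in-memory stand-in for that Events table, plus a projector that folds the stream into a read model. All names (`append_event`, `rebuild_read_model`, the event types) are illustrative, not from any particular system:

```python
import json
from datetime import datetime, timezone

# In-memory stand-in for the Events table described above.
events = []

def append_event(event_id, aggregate_id, event_type, payload):
    # Mirrors: INSERT INTO Events(EventId, AggregateId, Time, EventJson) VALUES (...)
    events.append({
        "EventId": event_id,
        "AggregateId": aggregate_id,
        "Time": datetime.now(timezone.utc).isoformat(),
        "EventJson": json.dumps({"type": event_type, **payload}),
    })

def rebuild_read_model():
    # A projector: fold the event stream into a simple read model.
    shifts = {}
    for e in events:
        data = json.loads(e["EventJson"])
        if data["type"] == "NurseScheduled":
            shifts.setdefault(e["AggregateId"], []).append(data["nurse"])
        elif data["type"] == "ShiftCancelled":
            shifts[e["AggregateId"]].remove(data["nurse"])
    return shifts

append_event("e1", "shift-2024-03-01-night", "NurseScheduled", {"nurse": "Dana"})
append_event("e2", "shift-2024-03-01-night", "NurseScheduled", {"nurse": "Alex"})
append_event("e3", "shift-2024-03-01-night", "ShiftCancelled", {"nurse": "Alex"})
print(rebuild_read_model())  # {'shift-2024-03-01-night': ['Dana']}
```

Note that the projector can be thrown away and re-run from scratch at any time, which is exactly what makes the data side of this architecture feel so simple.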
This is true, to an extent. But that is only because you have moved a critical component of your system elsewhere: the beating heart of your business. The logic, the rules, the things that make a system more than just a dumb repository of strings and numbers.
But first, let me make sure that we are on roughly the same page. In such a system, we have:
- Commands – that cannot return a value (but will synchronously fail if invalid). These mutate the state of the system in some manner.
- Events – represent something that has (already) happened. Cannot be rejected by the system, even if they represent invalid state. The state of the system can be completely rebuilt from replaying these events.
- Queries – that cannot mutate the state
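To make that contract concrete, here is a minimal sketch of the three message kinds under those rules. The class and method names are my own invention, not from any particular framework:

```python
class CommandRejected(Exception):
    """Commands return no value, but fail synchronously when invalid."""

class SchedulingSystem:
    def __init__(self):
        # Events are facts that already happened: they are never rejected,
        # and the state below could be rebuilt by replaying them from scratch.
        self.events = []

    def execute(self, command):
        # Command: mutates state, returns nothing, may fail synchronously.
        if not command.get("nurse"):
            raise CommandRejected("a nurse must be specified")
        self.events.append({"type": "NurseScheduled", **command})

    def shift_roster(self, shift):
        # Query: reads state, never mutates it.
        return [e["nurse"] for e in self.events if e["shift"] == shift]

s = SchedulingSystem()
s.execute({"nurse": "Dana", "shift": "Mon night"})
print(s.shift_roster("Mon night"))  # ['Dana']
```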
I’m mixing here two separate architectures, Command Query Responsibility Segregation and Event Sourcing. They aren’t the same, but they often go hand in hand, and it makes sense to talk about them together.
And because it is always easier for me to talk in concrete, rather than abstract, terms, I want to discuss a system I worked on over a decade ago. That system was basically a clinic management system, and the part that I want to talk about today was the staff scheduling option.
Scheduling shifts is a huge deal, even before we get to the part where it directly impacts how much money you get at the end of the month. There are a lot of rules, regulations, union contracts, agreements and a bunch of other stuff that relate to it. So this is a pretty complex area, and when you approach it, you need to do so with the due consideration that it deserves. When we want to apply CQRS/ES to it, we can consider the following factors:
The aggregates that we have are:
- The open schedule for two months from now. This is mutable, being worked on by the head nurse and constantly changing.
- The proposed schedule for next month. This one is closed, and changes only rarely, usually because of big stuff (someone being fired, etc).
- The planned schedule for the current month. Frozen, cannot be changed.
- The actual schedule for the current month. This is changed if someone doesn’t show up for their shift, is sick, etc.
You can think of the first three as various stages of a PlannedSchedule, but the ActualSchedule is something different entirely. There are rules around how much divergence you can have between the planned and actual schedules, which impacts compensation for the people involved, for example.
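As an illustration of the kind of divergence rule involved, here is a hypothetical check. The symmetric-difference metric and the 20% threshold are my assumptions for the sketch, not the actual regulation:

```python
# Hypothetical planned-vs-actual divergence check; the 20% threshold
# and the metric itself are assumptions for illustration only.
MAX_DIVERGENCE = 0.20

def divergence(planned, actual):
    """planned / actual: sets of (nurse, shift) assignments."""
    changed = len(planned ^ actual)  # symmetric difference: assignments that differ
    return changed / max(len(planned), 1)

planned = {("Dana", "Mon"), ("Alex", "Tue"), ("Sam", "Wed"), ("Lee", "Thu")}
actual  = {("Dana", "Mon"), ("Alex", "Tue"), ("Sam", "Wed"), ("Kim", "Thu")}

print(divergence(planned, actual))  # 0.5
# Over the threshold: the compensation rules for the people involved kick in.
assert divergence(planned, actual) > MAX_DIVERGENCE
```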
Speaking of which, we haven’t yet talked about:
- Nurses / doctors / staff – who are assigned to shifts.
- Clinics – a nurse may work in several different locations at different times.
There is a lot of other stuff that I’m ignoring here, because it would complicate the picture even further, but that is enough for now. For example, regardless of the shifts that a person was assigned to and showed up for, they may have worked more hours (had to come to a meeting, drove to a client) and that complicates payroll, but that doesn’t matter for the scheduling.
I want to focus on two actions in this domain. First, the act of the head nurse scheduling a staff member for a particular shift. And second, the ClockedOut event, which happens when a staff member completes a shift.
The ScheduleAt command places a nurse at a given shift in the schedule, which seems fairly simple on its face. However, the act of processing the command is actually really complex. Here are some of the things that you have to do:
- Ensure that this nurse isn’t scheduled for another shift, either concurrently or too close to another shift at a different location.
- Ensure that the nurse doesn’t work with X (because issues).
- Ensure that the role the nurse has matches the required parameters for the schedule.
- Ensure that the number of double shifts in a time period is limited.
The last one, in particular, is a sinkhole of time, because at the same time another business rule says that we must give each nurse N shifts in a time period, and yet another dictates how to deal with competing preferences, etc.
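A hypothetical sketch of a few of those checks follows. Every name here, and the 8-hour travel gap between locations, are assumptions for illustration, not the actual business rules of the system:

```python
from datetime import datetime, timedelta

# Assumed rule for the sketch: shifts at different locations must be
# at least 8 hours apart. The real rules were far more involved.
MIN_GAP_BETWEEN_LOCATIONS = timedelta(hours=8)

class ScheduleViolation(Exception):
    pass

def schedule_at(schedule, nurse, shift):
    """schedule maps a nurse's name to the list of shifts assigned so far."""
    # Rule: the nurse's role must match the shift requirements.
    if shift["role"] not in nurse["roles"]:
        raise ScheduleViolation("role does not match the shift requirements")
    for other in schedule.get(nurse["name"], []):
        # Rule: no concurrent shifts.
        if shift["start"] < other["end"] and other["start"] < shift["end"]:
            raise ScheduleViolation("already scheduled for a concurrent shift")
        # Rule: enough travel time between different locations.
        if other["location"] != shift["location"]:
            gap = (shift["start"] - other["end"] if shift["start"] >= other["end"]
                   else other["start"] - shift["end"])
            if gap < MIN_GAP_BETWEEN_LOCATIONS:
                raise ScheduleViolation("too close to a shift at another location")
    schedule.setdefault(nurse["name"], []).append(shift)

nurse = {"name": "Dana", "roles": ["RN"]}
schedule = {}
day = datetime(2024, 3, 1)
schedule_at(schedule, nurse, {"start": day.replace(hour=6), "end": day.replace(hour=14),
                              "location": "Clinic A", "role": "RN"})
try:
    schedule_at(schedule, nurse, {"start": day.replace(hour=16), "end": day.replace(hour=23),
                                  "location": "Clinic B", "role": "RN"})
except ScheduleViolation as e:
    print(e)  # too close to a shift at another location
```

Even this toy version hints at why the command handler is where the complexity lives: each rule needs to see other shifts, other nurses, contracts, and preferences.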
So at this point, we have ScheduleAtCommand.Execute(), and we need to apply logic: complex, changing, business-critical logic.
And at this point, for that particular part of the system, I want to have a full domain model, abstracted persistence, and the ability to just put my head down and focus on solving the business problem.
The same applies to the ClockedOut event. Part of processing it means that we have to look at the nurse’s employment contract, count the amount of overtime worked, compute the total number of hours worked in a pay period, apply rules from the clinic to the time worked, apply clauses from the employment contract to the work, etc. Again, this gets very complex very fast. For example, if you have a shift from 10 PM – 6 AM, how do you compute overtime? For that matter, if this is on the last day of the month, when do you compute overtime? And what pay period do you apply it to?
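One way to reason about the 10 PM – 6 AM question is to split the shift at midnight and attribute each part to its own day. The split-at-midnight policy here is my assumption; a real contract might attribute the whole shift to the day it started, or do something else entirely:

```python
from datetime import datetime, timedelta

def split_at_midnight(start, end):
    """Split a shift into per-calendar-day segments."""
    parts = []
    while start.date() != end.date():
        midnight = datetime(start.year, start.month, start.day) + timedelta(days=1)
        parts.append((start, midnight))
        start = midnight
    parts.append((start, end))
    return parts

# A 10 PM - 6 AM shift on the last day of March straddles two pay periods:
parts = split_at_midnight(datetime(2024, 3, 31, 22), datetime(2024, 4, 1, 6))
for s, e in parts:
    print(s.date(), (e - s).total_seconds() / 3600, "hours")
# 2024-03-31 2.0 hours  -> March pay period
# 2024-04-01 6.0 hours  -> April pay period
```

Which period the overtime premium lands in is then a separate contract question layered on top of this split, which is exactly the kind of rule that keeps accumulating in this part of the system.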
Here, too, I want to have a fully fleshed out model, which can operate in the problem space freely.
In other words, a CQRS/ES architecture is going to have the domain model (and some sort of OR/M) in the middle, doing the most interesting things and tackling the heart of the complexity.
Comments
A nice thing about going with ES is that you can (if you want) keep a live domain model in memory without needing to map it to a relational (or NoSQL) model (and thus compromise it).
Harry, That is only possible if your model is fairly small. As soon as it gets to a reasonable size, the cost of reconstructing it from the event streams becomes prohibitively expensive, even if you can hold it all in memory.
I think you can get to a pretty unreasonable size these days: Hetzner can go up to 768GB of RAM, especially if you have some redundancy (and some binary snapshotting).
It's a different paradigm, with different trade offs, but it can keep your model _pure_.
Harry, I think you are missing something here. Let's say that your model is 128GB in size total, which can comfortably fit in that amount of RAM with room to spare for all the actual application logic / services, etc.
Now, if you are building that from events, you are likely going to read more events than the size of the data. Let's say that you need to read 512GB of events (but the size may very well be multiple TBs). I'll go further and assume you are able to read at a rate of 50MB / sec. This is the actual rate, including reading the data from disk, parsing if needed, processing into the model, etc. That means that it will take you about 3 hours just to read the data into memory. That is three hours in which you cannot do anything. Most businesses cannot actually sustain 3 hours of downtime if the server restarted, so this is what I mean by reasonable size. In practice, if it takes you more than a few minutes to build the model, that isn't a viable strategy. That probably puts the top range of valid options at around ~10GB of raw events.
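The arithmetic in this comment checks out; as a quick back-of-the-envelope:

```python
# Back-of-the-envelope check of the rebuild-time figures above.
event_stream_bytes = 512 * 1024**3   # 512 GB of raw events to replay
processing_rate = 50 * 1024**2       # 50 MB/sec (read + parse + apply)

rebuild_seconds = event_stream_bytes / processing_rate
print(round(rebuild_seconds / 3600, 1))  # 2.9 -> about 3 hours of downtime

# And the ~10 GB "viable" ceiling works out to a few minutes:
print(round(10 * 1024**3 / processing_rate / 60, 1))  # 3.4 minutes
```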
I think that's a valid example, but it's from the extreme end of the spectrum (a single ubernormous system).
You can mitigate:
- a machine going down, by having multiple servers (e.g. LMAX do this)
- the size of each model, by some sensible sharding/partitioning (which you can put off until it's a problem)
- the rebuild time, by snapshotting the state, so you reload from there
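The snapshotting mitigation can be sketched roughly like this: persist the folded state every N events, and on restart replay only the tail. The names and the toy fold function here are illustrative:

```python
# Sketch of snapshot + tail-replay; the fold function is a toy stand-in.
SNAPSHOT_EVERY = 3

def apply(state, event):
    # Toy projection: count shifts worked per nurse.
    state[event["nurse"]] = state.get(event["nurse"], 0) + 1
    return state

events = [{"nurse": n} for n in ["a", "b", "a", "c", "a"]]

# Normal operation: take a snapshot after every SNAPSHOT_EVERY events.
snapshot, snapshot_version = {}, 0
for i, e in enumerate(events[:SNAPSHOT_EVERY], 1):
    snapshot = apply(snapshot, e)
    snapshot_version = i

# Restart: load the snapshot, then replay only the events after it.
state = dict(snapshot)
for e in events[snapshot_version:]:
    state = apply(state, e)
print(state)  # {'a': 3, 'b': 1, 'c': 1}
```

The rebuild cost then scales with the tail since the last snapshot rather than with the full history, which is what makes the in-memory approach plausible at larger sizes.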
Heck you could even use battery backed RAM!
I built an admittedly mickey-mouse-sized (compared to what we are talking about) system along these principles, and it did take a few minutes to spin back up, every so often, but the business impact was negligible.
It's a sliding scale of pros vs cons. Saying "all domains need an ORM" or "all domains should be in-mem" are both overly strict doctrines, but I think people should be aware of their options.
I suspect that as a (the?) major NHibernate/Raven expert you may not feel like there's any friction introduced by ORM, but I found it very freeing to work without worrying about that aspect.
Harry, Pretty much all your options require that you introduce a much higher degree of complexity into your system. Unless you are building something that is really specific, I would say that you want to offload your persistence needs. Note that you can do that using something like Redis and use its replication to keep everything in memory, but that still requires some coordination.
The problem I have with your approach is that you are optimizing for development, and not for production usage, which is going to be a PITA to fix later on.
That sounds like it should be outside the bounded context of scheduling though. Assuming scheduling is an aggregate root, it's unlikely that nurse role details would come from state read from projecting scheduling events. So clearly the complexity of much of the business logic goes outside of the scheduling context, and something needs to orchestrate responses from various other aggregate roots in your domain. Once the orchestrating logic gives the all clear, ScheduleAtCommand is much simpler, as its responsibility is smaller.
Dan, That just moves the cheese. Something needs to execute the business logic for nurse scheduling, and there you want the rich model.
The other thing to consider with a ScheduleAtCommand.Execute() is that, if I'm making multiple changes at once, I could quite correctly have invalid state violating some of the rules, because a subsequent change, coming in the same batch, makes everything alright again. A naïve implementation won't allow for this, since it would validate the rules, and thus fail, for each change. Reminds me of how there's no way to temporarily disable FK constraints in a SQL Server database transaction so that you can change a few things and, only at the commit, have the FK constraints verified.
Ian, Yes, that is a common issue. It is usually something that you deal with using a Warning during normal modifications, and you have a business rule that states that you must fix all Warnings before you can mark a schedule as final.
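That Warning approach might look roughly like this: rule violations during editing are recorded rather than rejected, and finalizing is blocked until all of them are resolved. This is a sketch with invented names, assuming a simple double-booking rule:

```python
class Schedule:
    def __init__(self):
        self.assignments = []
        self.warnings = []

    def assign(self, nurse, shift):
        # Accept the change even if it is temporarily invalid...
        self.assignments.append((nurse, shift))
        # ...then re-validate the whole schedule, recording violations
        # as warnings instead of failing the command.
        self.warnings = [
            f"{n} is double-booked on {s}"
            for n, s in self.assignments
            if sum(1 for n2, s2 in self.assignments if n2 == n and s2 == s) > 1
        ]

    def finalize(self):
        # The business rule: all warnings must be fixed before finalizing.
        if self.warnings:
            raise ValueError(f"cannot finalize: {len(self.warnings)} warning(s)")
        return "final"

s = Schedule()
s.assign("Dana", "Mon night")
s.assign("Dana", "Mon night")  # invalid mid-edit: a warning, not a rejection
print(len(s.warnings))         # 2
s.assignments.pop()            # fix it...
s.assign("Alex", "Mon night")  # ...by swapping in someone else
print(s.finalize())            # final
```

This keeps the head nurse's editing workflow fluid while still guaranteeing the invariants hold on the finalized schedule, much like deferring constraint checks to commit time.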