Entities Associations: Point in Time vs. Current Associations
Having just finished giving three courses (2 on RavenDB and 1 on NHibernate), you might want to say that I have a lot of data stuff on my mind. Teaching is always a pleasure to me, and one of the major reasons for that is that I get to actually learn a lot whenever I teach.
In this case, in all three courses, we run into an issue with modeling associations. For the sake of the example, let us talk about employees and paychecks. You can see the model below:
Do you note the blue lines? Those represent Employee reference, but while they are both referencing the same employee, they are actually quite different associations.
The Manager association is a Current Association. It is just a pointer to the managing employee. What does this means?
Let us say that the manager of a certain employee changed her name. In that scenario, when we look at the current employee record, we should see the updated employee manager name. In this case, we are always interested in the current status.
On the other hand when looking at the paycheck PaidTo reference to an employee, we have something all together different. We have a reference no to the current employee record, but to the employee record as it was at a certain point in time. If the employee in question change his name, that paycheck was issued to Mr. Version One, not to Mr. Version Two, even though the name has been changed.
when dealing with associations, it is important to distinguish between the two options, as each require different way of working with the association.
Comments
This is really an interesting scenario, probably the easiest way to resolve is denormalizing the PaidTo into Paychecks, writing all relevant information of related Employee at the moment of paycheck. This will duplicate information but it is easier to manage, the employee has an Employee Id only, while paychecks can have Employee id but at the same time "EmployeeName" and all other information he want to retain even if the related employee changes.
I just posted about this a few weeks ago - exact same scenario. How do you actually handle something this? Do you keep every version of the employee that has ever existed with an "IsCurrent" field? Or do you copy all the relevant info (like name, address, department, manager - anything that can change with time) from the employee table(s) to the paycheck table(s)?
http://jarrettmeyer.com/blog/2012/05/17/understanding-database-denormalization/
Jarrett, I would copy all the relevant information.
looking at that schema I'd say give yEd graph editor a shot...
You're always cheating with your "point-in-time/snapshot" denormalizations because it fits the model of RavenDB nicely.
In the world I live in I have to provide what the customer asks for. I don't get to redefine the specification to be denormalization friendly. The real world is mostly normalized and interconnected. We can't just snapshot everything because its convenient to us developers.
Tobi, I gave two distinct examples here, I gave the point in time and the current view. The intent was to show the different aspects of the association, even though it is supposedly for the same feature.
Very glad you're touching on the subject of Temporal Data! It happens to be one of my core areas of interest. I started a blog on the subject awhile back, and will be expanding the posts soon.
As Gian Maria pointed out, one of the simplest ways to handle this would be through denormalization. The PayCheck class would certainly be an appropriate place to copy the employee's name, pay rate, etc. However this isn't always the most efficient thing to do, and sometimes it can lead to problems. For example, what if you needed to go back and make a correction? Could you be assured that the corrected data was paid at the correct (old) pay rate?
Sometimes it works out better to keep the entire history of data associated with the original entity. Through trial and error, I've found one of the best ways to handle this is with the Temporal Property pattern (http://martinfowler.com/eaaDev/TemporalProperty.html). It has very good alignment with DDD and NoSQL. I'm doing this in RavenDB right now. It is a little harder to implement in relational databases, but it can be done.
The other pattern often looked at is the Temporal Object pattern (http://martinfowler.com/eaaDev/TemporalObject.html). This is pretty much what happens when you use something like the RavenDB versioning bundle. It is also the general approach that many start with when building a history table in a relational database, or when adding "valid from" and "valid to" dates to a relation table. The biggest problem with Temporal Object is that it doesn't align well with the DDD concept of an Entity. Entities are defined by the fact that they have a single identity (an "ID" field). Temporal Objects require a compound identity of a primary key and a version. In other words, there are multiple instances persisted for any given identity, and that causes a lot of problems.
So I would recommend sticking with Temporal Property, or with careful use of denormalization, or perhaps both. The biggest down-side to Temporal Property is that it makes querying difficult. For example, Map/Reduce indexes in RavenDB don't quite work out well. But this is where denormalization has its advantages. Choose carefully.
I will eventually post a full implementation of Temporal Property (as a generic DDD value object) to my blog. If anyone needs it sooner, feel free to email me.
Let's say we're using EmployeeId as the Identity of the of an Employee. A Paycheck represents a point in time association. I like the idea of copying the Employee information from the Paycheck table at the time the check was issued. If the Employee changes their name, or address over time, then you shouldn't allow the Paycheck record to be updated. Even if you're presenting a projection of the Paycheck with Employee information. The check was issued to "Jane MaidenName", not to "Jane MarriedName".
To me the Paycheck record is immutable and even it's projections. Only if you implemented some kind of effective dating on the Employee record to represent how the Employee changes over time would it be safe to just link the Paycheck to the Employee.
I view it as the same way we denormalize for Address. We typically don't store CityId, StateId, ZipCodeId, we tend to store City, State, Zip.
Technical concerns aside nearly every application I have seen the user would prefer a full bitemporal model for all data.
The example you gave, how could you find out who a person's manager was last year, also when was the manager association actually updated in the application? The are two distinct aspects of temporal data for an employee's manager.
Some SQL databases such as IBM's DB2 already have built in support for full bitemporal verioning along with the SQL language support proposed back in the 90's for querying temporal data.
Just as most developer's demand fully version source control, most business application user's would demand it as-well if solutions existed for their business data.
http://en.wikipedia.org/wiki/Temporal_database
Richard Snodgrass has done extensive work in the area and has written a few books on the subject although focused on SQL implementations: http://www.cs.arizona.edu/people/rts/tdbbook.pdf
Another great resource for those going down the temporal data rabbit hole with relational databases (such TimeBase, or now DB2) is here:
http://temporaldata.com
There's a growing community around this subject as a whole. There's isn't a lot of NoSQL representation within that community yet.
Sorry, I meant TimeDB, not TimeBase. :)
Well, the two choices are you copy all data for each pay cheque, or entities follow an append-only model where any change to an employee (and it's relative structure) are created as new, with a reference to the predecessor. (original gets a reference end date.)
Either case you get duplication of data to satisfy the temporal nature of the data. Duplicate a bit of it every time you create a pay cheque, or duplicate the whole of it if and only when you change the employee. The ugly bit in the second scenario is fetching a complete history since you likely want to consider pay cheques not only for the latest revision of the employee, but all previous revisions.
I don't think NoSQL makes the situation any better or worse overall, it just changes the angle you look at the problem.
I'm sorry guys, but I can't see the problem. Can't the Paycheck just have the Employee ID, so it will always point to the Employee record?
If temporal data is a core requirement of an application then isn't event sourcing the way to go? That way you can reconstruct the data at any point in time and defer decisions about denormalisation until you need it to improve performance. I appreciate that there is a cost in terms of application complexity but, significant database migrations with temporal relational data (e.g. envers) look scary!
Calil, How can you see the name of the paycheck as it was when the paycheck was generated if you keep just the employee id?
Suppose someone got married and changed her family name. You look up the paycheck and it shows her new name. Are there legal reasons why you have to have the name like it was before the marriage? Or is it okay because it refers to the same person?
Bob, You have to show the name she had _when the paycheck was issued_. You can't show the new name.
Ayende, with the last questions I understand the problem. Well, if there is a business requirement that issues the temporal state of the data, so I agree that the best solution is to copy the required information from the Employee entity to the Paycheck, which should include the Employee ID, so it would be possible to browse from a paycheck record to the employee record.
An example I use when describing this temporal problem is think of building a contract system. You're partnering with another company and you have a contract. Let's use a marketing contract, where there is money exchanged for leads. So you have a Contract, that has Terms. A Term might be "pay me $5 per lead", or "pay me 5 cents per click". Now you setup this contract but after time you want to renegotiate. The Contract hasn't changed, it's still between two companies, but the terms of the contract have changed. The Term represents the temporal nature of the contract.
It doesn't even have to be a change to the terms. Let's say you setup the contract to be "for the first month I want $10 per lead, but we'll transition to $5 per lead for the remainder of the contract".
It's interesting that you use the paychecks example because we're writing a payroll system and run into this scenario quite a bit.
It's even more interesting when you start looking at the values used to derive the PayCheck values.
We essentially take historical copies of the rows in the table whenever anything changes, but it's definitely something I think about front and centre when defining relationships now.
Nice post.
I've used the temporal property pattern (pointed out above) to great effect as well. This was for a medical application so we had to reliably know the state of everything at certain points in time.
I think we made the right decision to do this but I still (several years after working on it) question whether copying data would have been simpler.
The only real problem was the complexity of the SQL required (join hell). For day to day use this was hidden behind a custom ORM but reports were a bitch. It performed great with the right index applied.
Another way to think about this type of problem, is to "validate" your property assumptions by asking: does the name of the property (or it's description) accurately describe the data being stored?
In this example, "employee name" is not actually the most accurate description of the information being stored about a paycheck - "name on paycheck" is more accurate, which will tell you that this is information about the check, not the employee. Same for salary, which is more accurately "amount on check", which is clearly information about the check.
In other words, try to describe the data as it relates to the entity you're working with - from "the check's point of view", if you will :-)
Comment preview