Oren Eini

CEO of RavenDB

a NoSQL Open Source Document Database

Get in touch with me:

oren@ravendb.net +972 52-548-6969

Posts: 7,546
|
Comments: 51,161
Privacy Policy · Terms
filter by tags archive
time to read 6 min | 1128 words

imageFor some reason, the moment that peole start working on Enterprise Solutions, there is a... tendency to assume that because we are going to build a big and complex application, we can afford to ignore best practices and proven practices.

That is just wrong. And it applies in many manners to many parts of the application, but it never applies to something as obiviously as it applies to the database.

This is just the results of a very short research:

  • The metadata driven approach, AKA the 5 tables DB
  • The Blob
  • The  5th normalized form DB
  • The 5th denormalized form DB
  • The table a day DB
  • The obsfuscated database - create table tbl15 (fld_a int, fld_b int, fld_g nvarchar(255) )

And those are just of the database structure, we haven't yet arrived to the database behavior yet. Here we have interesting approachs such as rolling your own "Transactions" table, requiring 40 joins to get a simple value, using nvarchar as the ultimate extensibility, etc.

So, how do we approach building a database that we can actually show in public? Let us start with thinking about the constraints that we have for the application. The database needs to support...

  • OLTP for the application itself.
  • Reports.
  • Performance.
  • ETL processes to take data in and out of the sytem.
  • Large amount of data.
  • Large amount of entities.

Notice how we don't have any suggestion about integrating with the application at the database level. That is a fairly common antipattern.

We are going to try to keep a one to one mapping between the entity and the table structure, because that will make it significantly easier to work with the system. One to one means that the Account Name translate to an Account.Name in the entity model, and Accounts.Name in the database model.

 

image

image We probably want the usual auditing suspects, such as Id, CreatedAt, ModifiedAt, CreatedBy, ModifiedBy, Owner, OrganizationUnit. Maybe a few more, if this makes sense. Probably not too much, because that can get to be a problem.

Accessing and manipulating this in the application layer it going to be trivial, so I am not really going to cover that in depth.

What I do want to talk about is how we expose it to the outside world? And by that I mean reports and ETL processes.

We have several options to try to do that. We can just let the reports and ETL processes to read from the database tables directly. This is the simplest approach, I think.

Other options include views, stored procedures, and other DB thingies that we can use. I have seen systems where an entity was composed of several tables, and the report was done off of a view that joined it all together.

The underlying problem is that I have versioning to consider. I am going to dedicate a full post to the problem of building upgradable user customizable applications, so I'll delay it to then, but let us just say that radically changing the database schema between version will be painful for the users. The usual way to handle that is to only make promises for the views, and not for the tables themselves. That is a good way to handle that, in my opinion.

I would suggest putting those in a separate schema, to make it clearer that the seperation is imortant. This also gives you the ability to later do a view to table refactoring, and maintain that with triggers. This is usually a bad idea, but for performance reasons, it may very well be a good solution.

ETL processes can use the same mechanism that reports use to read data from the database efficently, but nothing writes to the database except the application. At one point I wrote a change DB password util that run every two hours, it would change the database password and update the relevant config files.

I think that you can guess what the driving force to that where, no?

Getting data into the database can be done through application services (not neccesarily web services, btw). A simple example would be API similar to this one:

void Insert(params Account[] accounts);
void Update(params Account[] accounts);
void Delete(params Account[] accounts);

This API explicitly allows for bulk operations, so it can be very nice to work with, instead of having to do things on per-row basis, which basically kills performance.

How to get good performance from this system is another hard question. In this case, I would usually recommend on getting a good DBA to look at the perfrormance charactaristics of the application and optimize the database structure if needed. But, a much easier solution to performance problems in the database server is to not hit the DB server, but use caching. Distributed caching solutions, like Memcached, NCache, etc are really good way of handling that.

No business logic in the DB! This is important, if you put business logic in the DB, you have to get to the DB in order to execute the business logic. This kills scalablity, hurts the ability to understand the solution, and in general makes life miserable all around.

Reports are an interesting problem. How do you deal with security, for instance? Consider the security infrastructure that I already presented. This security infrastructure should also come with a database function that you can use like this:

SELECT * FROM Accounts 
WHERE IsAllowed(Accounts.EntitySecurityKey, @UserId, 'Account.View')

Or this:

SELECT * FROM Accounts
WHERE Accounts.EntitySecurityKey IN (
	SELECT EntitySecurityId FROM GetAllowed(@UserId, 'Account.View') 
)

Both of which provides really easy manner to get security for the reports. If we wanted to enforce that, we can force the report writer to write somtehing like this:

SELECT * FROM GetAllowedAccounts(@UserId, 'Account.View')

We can probably get away with assuming that 'Account.View' is the default operation, so it is even shorter. Among other things, this actually have valid performance characteristics.

This post it turning out to be a "don't make stupid mistakes" post, because I don't think that I am writing anything new here. About how to avoid making stupid mistake, that is fairly simple as well. Get a good DBA (that is left as an excersize for the reader), give him/her a big stick and encourage good design through superior firepower.

time to read 13 min | 2408 words

imageIf deployment means a week with little sleep and beepers going off just as you nod to sleep, then you are doing it wrong. I am saying that as someone whose standard deployment procedure has two steps, "svn up" and "build.cmd". Actually, I lied, I have a single step, deploy.cmd, which runs them both for me.

Now, you can't always get this procedure to work, this is especially true if you need to deploy several different parts of the same application at the same time. The classic example is making a change that requires modifying the database as well. More interesting examples consist of several layers of the same application, web servers, application servers, dedicated tasks servers and the database servers, all of whom need to be touched in order to deploy the new version of the application.

One of my customers takes the simple approach of shutting down everything and systematically upgrading all the systems when the application itself is offline. That is probably the simplest approach, you don't have to worry about concurrent state and multi versioning, but it is also the one that is harder to the business to deal with.

In fact, service level agreements may very well limit or prevent doing that. Here is a simple table of the common SLA and their meaning, in terms of system maintenance:

Uptime Time for offline maintenance, monthly
99% 7 hours, 12 minutes
99.9% 43 minutes
99.99% 4 minutes
99.999% 26 seconds

The last number is the one we should pay attention to, it is the magic 5 nines that SLA often revolves around. Now, SLA often talks about unplanned downtime, and in many systems, you can safely take it down outside of business hours and tinker as much as you like with it. Those are not the kind of systems that I want to talk about today. The system that you can take down safely are fairly easy to upgrade in a safe manner, as was previously mentioned.

The systems that are more interesting are the ones that you want to be able to update on the fly, without a painfully long deployment cycle. This is directly related to the ability to meet required SLA, but it also has another critically important factor. Developer productivity. If making a change in the application is something that requires stopping the application, updating files, starting it up again and then being hit with the initial, it is painful, annoying and will lead developers to jump through all sorts of really nasty hops along the way.

So easy deployment is critical both for the developer and for production, but the needs of each are vastly different. For developers, having it work "most of the time" is fine, having it consume more memory is also fine, because the application is not going to run for days on end. For production, it must work all the time, and you must watch your resource consumption very carefully.

A typical hot deployment for production will usually look like this:

image

This assumes a typical production scenario in which you have duplicates of everything, for fail over and performance.

This is just an extension of the previous cold deployment, since we are taking servers down, just not the application as a whole.

This assumes that we actually can take vertical parts of the application down, which is not always the case, of course.

But, basically, we start with both servers operational, move all traffic to the second server, and then perform a cold deploy to the first set of servers.

Then we move all traffic to the new server and perform cold deployment on the second server.

In the end, this usually means that you can perform the entire thing upgrade without loss of service (although you may suffer loss of capacity.

As I said, this is probably the ideal scenario for perform hot deployments, but it makes quite a few assumptions that are not always valid. The main one is that the application is structured vertically in such a way that we can close and open parts of it and there are no critical points along the way.

I can think of one common scenario where this is not the case, several web servers sitting on top of the same database server, for instance. So, while this is a good approach, we probably want to consider others as well.

Another option is to design the system with multiply concurrent versions from the get go. This gives us the following diagram:

image

We have several versions running at the same time, and we can migrate users slowly from one version to the next. This also gives us a nice rollback strategy if we have a botched version.

The problem with this approach is that it is a complicated procedure. This is especially true if the application is composed of several independent services that needs to be upgraded separately. Again, database and web servers are common example, but even web services that needs to be upgraded requires some interesting mechanics when you have enough of them.

I am going to talk about multi versioned databases in the next post in the series, so let us just take as a given that this is possible, and consider the implications.

If I can run multiply versions at the same time, this means that I can probably propagate a change throughout a cluster of servers without having to worry about the change over semantics. (The time between the first server getting the change and the last one getting it).

Before we go on with production deployments, I want to talk a bit about developer deployments. What do I mean by that?

Well, deployment is probably the wrong term here, but this is the way most of the enterprise platforms make you go about it.  And that is a problem. Deploying your changes just in order to run them is a problematic issue. It is problematic because it interferes with make-a-change/run approach, which is critically important for developer productivity.

As mentioned above, any multi step process is going to be painful as hell over any significant development. I often have to go through several dozens of change/run before I can say that I have finished with a particular feature.

There is a strong incentive to make it as easy as possible, and as it turns out, we actually have the tools in place to do it very easily indeed. I have produced a screen cast that talks about hot deployment of compiled code, and applying the same principals to code in its textual form is demonstrated here.

image image Here is the directory structure that I envisioned for the platform. As you can guess, this is backed by source control, but more importantly, it is alive (queue mad laugher here).

But what do I mean by that? I mean that any change whatsoever that is done to this directories will be automatically picked up and incorporated by the system as soon as the change is detected.

This supports very nicely the idea of change & run mentality. But what actually goes into these folders?

The parts that are important to us are the entities, controllers and views. For discussing those, we will take the Account entity as an example.

For the system, an Account entity is composed of the following files:

image

There is a reason why I choose to use boo as the primary language for extending the platform. Not just because of personal bias (which exists), but because it makes it so much easier to deal with quite a few issues.

One of them is the ability to define my own syntax, so the content of the Account.boo file would be similar to this:

entity Account:
    AccountId as int
    Name as string

The ability to define entity as a keyword means that I don't need to do anything more to define any persistence concerns, even though I intend to use this as an Active Record class, it is all handled by the framework.

I do intend to allow extension using compiled code, that is why the binaries folder is there for, and you can certainly define C# classes, but the main idea here is to allow easy editing of the whole thing, which means that compilation is not necessarily a good thing.

So, after this explanation, let us go back a bit and talk about what deployment means in this scenario? Well, the first thing that it means is that once a change is detected, you want to recompile the file and keep on moving without breaking your stride. Brail itself is a good example of it. Brail templates are instantly updated if changed, but they are usually compiled (thus supposedly faster). From the developer perspective, there isn't any difference between the two. It works very well in practice, and the productivity boost means that it is encourage the small steps approach. All in all, I am quite happy with it.

I am going to leave the technical details aside for now, let us just say that it is equally easy to do in both source and binary form, and you can either see the webcast or check the post about each one.

There are a few things that we should be worried about, however, mainly, recompiling files all over the place will cause an assembly leak, which can have an adverse affect on memory consumption. Here we get to the some interesting design decisions. Batching compilation will help greatly here, so we can grab all the controllers and compile them into a single assembly, then recompile the changes into separated assemblies. This is the solution used by Brail, for instance.

This seems like it can cause problems, and in theory, it will, but in practice, there are several mitigating factors:

  • During development, we can expect the application lifetime to be short enough that assembly leakage is rarely an issue, if it is, there is a small cost to restarting the application.
  • On production, we will rarely expect to have a lot of churn, so we can handle the extra memory requirement (in the order of a single megabyte or so, in most cases).

More advance scenarios calls for either AppDomain restart (the way ASP.Net does it) or a separate AppDomain that can be safely loaded. Personally, I think that this would make the situation much harder, and I would like to avoid it if possible. The simplest solution works, in my experience.

What this all means is that a developer can go and make a change to the Account controller, hit the page and immediately get the changes made. Deployment now means that we commit to the development branch, merge to the production branch, and then we request the production system to update itself from the source. A form of CI process is a valid idea in this scenario, and will probably be the default approach to updating changes in a single system scenario. We have to have a way to disable that, because we may want to upgrade only some of the servers at a time.

This leaves us with the following issues:

imageDebugging - How can we debug scripts? Especially dynamically changing scripts? Despite what some of the Ruby guys say, I want to have the best debugger that I can available. A platform with no or poor debugging support has a distinct disadvantage.

As it turns out, the problem isn't that big. The Visual Studio debugger is a really smart one, and it is capable of handle this in most cases with aplomb. And since Boo code is compiled to IL, VS has few issues with debugging it. For the more complex scenarios, you can use C# and just direct the build path to the binaries folder.

In any case, debugging would usually involve "Attach to process" rather than "Start with debugger", but this is something that I can live with (Attach is faster anyway).

Database modifications - let us just hand wave the way we are going to handle that for a minute. We will just assume we can do something like UpdateSchema() and it will happen, we still need to think about the implications of that.

Now we need to think about how we are going to handle that when we update an entity. Do we want this update to the schema to be automatic, or do we want it to happen as a result of user input? Furthermore, changing the entity basically invalidate the ability to call the database, so how do we handle that?

Do we disable the usage of this entity until the DB is updated? Do we just let the user to run into errors? What do we do for production, for that matter? I definitely not going to let my system run with DDL permission for production, so that is another problem.

I can give arguments for each of those, but a decision still has to be reached. Right now I think that the following set of policies would serve well enough:

  • For production, all database changes must be handled after an explicit user action. Probably by going to a page and entering the credentials of a user that can execute DDL statements on the database.
  • For development, we will support the same manual approach, but we will also have a policy to auto update the database on entity change.

We are still in somewhat muddy water with regards to deploying to production with regards to changes that affects the entire system, to wit, breaking database changes. I am going to discuss that in the next installment, this one got just a tad out of hand, I am afraid.

time to read 11 min | 2133 words

I have been asked how I would design a security infrastructure for my vision of an enterprise platform, and here is an initial draft of the ideas.

As anything in this series, no actual code was written down to build them. What I am doing is going through the steps that I would usually go before I actually sit down and implement something.

While most systems goes for the Users & Roles metaphor, I have found that this is rarely a valid approach in real enterprise scenarios. You often want to do more than just the users & roles, such as granting and revoking permissions from individuals, business logic based permissions, etc.

What are the requirements for this kind of an infrastructure?

  • Performant
  • Human understandable
  • Flexible
  • Ability to specify permissions using the following scheme:
    • On a
      • Group
      • Individual users
    • Based on
      • Entity Type
      • Specific Entity
      • Entity group

Let us give a few scenarios and then go over how we are going to solve them, shall we?

  1. A helpdesk representative can view account data, cannot edit it. The helpdesk representative also cannot view the account's projected revenue.
  2. Only managers can handle accounts marked as "Special Care"
  3. A team leader can handle all the cases handled by members in the team, team members can handle only their own cases.

The security infrastructure revolves around this interface:

image

The IsAllowed purpose should be clear, I believe, but let us talk a bit about the AddPermissionsToQuery part, shall we?

Once upon a time, I built a system that had a Security Service, that being a separate system running on a different machine. That meant that in order to find out if the user had permission to perform some action, I had to send the security service the entity type, id and the requested operation. This worked, but it was problematic when we wanted to display the user more than a single entity at a time. Because the system was external, we couldn't involve it in the query directly, which meant that we had to send the entire result set to the external service for filtering. Beyond the performance issue, there is another big problem, we had no way to reliability perform paged queries, the service could decide to chop up 50% of the returned results, and we would need to compensate for that somehow. That wasn't fun, let me tell you that.

So, the next application that I built, I used a different approach. Instead of an external security service, I had an internal one, and I could send all my queries through it. The security service would enhance the query so permissions would be observed, and everything just worked. It was very good to observe. In that case, we had a lot of methods that did it, because we had a custom security infrastructure. In this case, I think we can get away with a single AddPermissionsToQuery method, since the security infrastructure in place is standardize.

Now, why do we have a Why method there? Pretty strange method, that one, no?

Well, yes, it is. But this is also something that came up through painful experience. In any security system of significant complexity, you would have to ask yourself questions such as: "Why does this user see this information" and "Why can't I see this information" ?

I remember once getting a Priority Bug that some users were not seeing information that they should see, and I sat there and looked at it, and couldn't figure out how they got to that point. After we gave up understanding on our own, we started debugging it, and we reproduced the "error" on our machines. After stepping through it for ten or twenty times, it suddenly hit me, the system was doing exactly what it was supposed to do. I stepped over the line that did it in each and every one of the times that I debugged it, but I never noticed it.

You really want transparency in such a system, because "Access Denied" is about the second most annoying error to debug, if the system will give you no further information.

Now, I am going to show you the table structure, this is not fixed in stone, and don't try to read too much into seeing a table model here. It simply make it easier to follow the connections that a class diagram would.

image

Let us go over some of the concepts that we have here, shall we?

Users & Groups should be immediately obvious, let us focus for a moment on the Operations and Permissions. What is an operation? Operation is an action that can happen in the application. Examples of operations are:

  • Account.View
  • Account.Edit
  • Account.ProjectedRevenue.View
  • Account.ProjectedRevenue.Edit
  • Account.Assign
  • Account.SendEmail

As you can see, we have a fairly simple convention here. [Entity].[Action] and [Entity].[Field].[Action], this allows me to specify granular permissions in a very easy to grok fashion. The above mentioned operations are entity-based operations, they operate on a single entity instance at a time. We also have feature-based operations, such as:

  • Features.HelpDesk
  • Features.CustomerPortal

Those operate without an object to verify on, and are a way to turn on/off permissions for an entire section of the application. Since some operations are naturally grouped together, we also have relations between operations, so we will have the "Account" operation, which will include the "Account.Edit", "Account.View" as children. If you are granted the Account operation on an entity, you automatically get the "Account.Edit" and "Account.View" on the entity as well.

This makes the design somewhat more awkward, because now we need to go through two levels of operations to find the correct one, but it is not a big deal, since we are going to be smart about how we do it.

Permissions are the set of allowed / revoked permissions for an operation on an EntitySecurityKey (will be immediately explained) which is associated with Group, User or EntityGroup.

A simple example may be something like:

  • For User "Ayende", Allow "Account" on the "Account Entity" EntitySecurityKey, Importance 1
  • For Group "Managers", Revoke "Case.Edit" on "Case Entity" EntitySecurityKey, Importance 1
  • For Group "Users", Revoke "Account.Edit" on "Important Accounts Entity Group" EntitySecurityKey, Importance 1
  • For Group "Managers", Allow "Account.Edit" on "Important Accounts Entity Group" EntitySecurityKey, Importance 10
  • For User "Bob from Northwind", Revoke "Account" on "Northwind Account"  EntitySecurityKey, Importance 1

The algorithm for IsAllowed(account, "Account.Edit", user) is something like this, get all the operations relevant to the current entity, default to deny access, then check operations. Revoke operation gets a +1, so it is more important than an Allow operation in the same level. Or in pseudo code (ie, doesn't really handle all the complexity involved):

bool isAllowed = false;
int isAllowedImportance = 0;
foreach(Operation operation in GetAllOperationsForUser(user, operationName, entity.EntitySecurityKey))
{
	bool importance = operation.Importance;
	if(operation.Allow == false)
		importance + 1; 
	if ( isAllowedimportance <  )
	{
		isAllowed = operation.Allow;
		isAllowedimportance = operation.Importance;
	}
}
return isAllowed;

As you had probably noticed already, we have the notion of an Entity Security Key, what is that?

Well, when you define an entity you also need to define its default security, this way, you can specify who can view and edit it. Then, we we create an entity, its EntitySecurityKey is copied from the default one. If we want to set special permissions on a specific entity, we will create a copy of all the current permissions on the entity type, and then edit that, under a different EntitySecurityKey, which is related to its parent.

All the operations in the child EntitySecurityKey are automatically more important then the ones in the parent EntitySecurityKey, regardless of the important score that the parent operations has.

In addition to all of that, we also have the concept of an EntityGroup to consider. Permissions can be granted and revoked on an Entity Group, and those are applicable to all the entities that are member in this group. This way, business logic that touches permissions doesn't need to be scattered all over the place, when a state change affects the permissions on an entity, it is added or removed to an entity group, which has a well known operations defined on it.

Now that you probably understand the overall idea, let us talk about what problem do we have with this approach.

Performance

The security scheme is complex, and of the top of my head, given all the variables, I can't really think of a single query that will answer it for me. The solution for that, like in all things, it to not solve the complex problem, but to break it down to easier problems.

The first thing that we want to consider is what kind of question are we asking the security system. Right now, I am thinking that the IsAllowed method should have the following signatures:

public bool IsAllowed(Operation, User, Entity);
public bool IsAllowed(Operation, User);

This means that the question that we will always ask is "Does 'User' have 'Operation' on 'Entity'?", and "Does 'User' have 'Operation'?". The last is applicable for feature based operations only, of course.

So, given that this is the question we have, how can we answer this efficiently? Let us try to take the above mentioned table structure and de-normalize it to make queries more efficient. My first attempt is this:

image

This allows you to very easily query by the above semantics, and get all the required information in a single go.

A lot of the rules that I have previously mentioned will already be calculated in advance when we write to this table, so we have a far simpler scenario when we come to check the actual permissions.

For instance, the EntitySecurityKey that we send is always the one on the Entity, so the DenormalizedPermissions table will always have the permissions from the parent EntitySecurityKey copied with pre calculated values.

Since everything is based around the EntitySecurityKey, we also have a very simple time when it comes to updating this table.

All we need to do it rebuilt the permissions for this particular EntitySecurityKey.

This makes things much easier, all around.

 

Querying

What this means, in turn, is that we have the following query to issue when we come to check permissions:

SELECT dp.Allow, dp.Importance FROM DenormalizedPermission dp
WHERE       dp.EntitySecurityKey = :EntitySecurityKey
AND         dp.Operation = :Operation
AND         (dp.User = :User OR dp.Group IN (@UserGroups)
                  OR EntityGroup IN (@EntityGroups) )

All we need to do before the query is to find out all the groups that the user belongs to, directly or indirectly, and all the Entity Groups that the entity belongs to.

When it comes down to check a feature-base operation, we can issue the same query, sans the EntitySecurityKey, and we are done.

Another important consideration is the ability to cache this sort of query. Since we will probably make a lot of those, and since we are probably also going to want to have immediate response to changes in security, caching is important, and write-through caching layer can do wonder for making this optimized.

What is missing

Just to note: this is not complete, I can think of several scenarios that this has no answer for, from the Owner can do things other cannot to supporting permissions if the organization unit is identical for the entity and the user. However, adding those is fairly easy to build within the system, all we need to do is define an action that would add the owner's permissions explicitly to the entity, and remove it when they are changed. The same can be done for entities in an organization unit, you would have the group of users in Organization Unit Foo and the Entity Group of entities in Organization Unit Foo, which will have a permission set for that group.

Final thoughts

This turned out to be quite a bit longer than anticipated, waiting expectantly for you, dear reader, to tell me how grossly off I am.

Next topics:

  • Hot deployments and distributed deployments
  • A database that doesn't make you cry
  • Supporting upgrades
  • Platform I/O - integration with the rest of the enterprise

FUTURE POSTS

  1. Partial writes, IO_Uring and safety - about one day from now
  2. Configuration values & Escape hatches - 4 days from now
  3. What happens when a sparse file allocation fails? - 6 days from now
  4. NTFS has an emergency stash of disk space - 8 days from now
  5. Challenge: Giving file system developer ulcer - 11 days from now

And 4 more posts are pending...

There are posts all the way to Feb 17, 2025

RECENT SERIES

  1. Challenge (77):
    20 Jan 2025 - What does this code do?
  2. Answer (13):
    22 Jan 2025 - What does this code do?
  3. Production post-mortem (2):
    17 Jan 2025 - Inspecting ourselves to death
  4. Performance discovery (2):
    10 Jan 2025 - IOPS vs. IOPS
View all series

Syndication

Main feed Feed Stats
Comments feed   Comments Feed Stats
}