Maarten Mullender Microsoft Corporation
Summary: Microsoft Architect Maarten Mullender examines the CRUD (Create, Read, Update, Delete) model, and determines that CRUD requires a deep understanding of its limits, specifically when dealing with concurrency issues. (11 printed pages)
After my article on Dealing with Concurrency: Designing Interaction Between Services and Their Agents and my talks at TechEd in Amsterdam, I needed to express my thoughts on CRUD (Create, Read, Update, and Delete). I concluded that article with the following recommendations:
When I wrote "Limit optimistic updates," one of the things I meant to imply is to limit the use of CRUD, because you shouldn't use CRUD if you don't have to. Here, I'll elaborate on this and discuss those cases where it makes sense to use it, because despite its limitations using replication (CRUD) is a great way to go for a certain set of problems.
Hearing people talking about CRUD makes me cringe; I feel the urge to react. So, let me get that off my chest first.
Let me try to explain what I mean with CRUD and what not. CRUD to me means that the entities are simply read, created, updated and deleted. CRUD works with entities and those entities are changed by retrieving data from a service, modifying the data by setting properties, and then sending the data back to the service for update. CRUD is data-oriented. I don't ask for a specific business action, I just ask the service to create an entity, or delete an entity, or I ask it to update the entity by giving the service the new data for that entity. If more needs to be done, the service derives from the data changes what action to execute.
In contrast, most order processing is not CRUD, or at least not according to my definition. For example, an order can be created offline and then sent (replicated if you will) to a service for processing. Processing of that order will affect many of the related entities. The service may update the customer information, potentially changing more than just the year-to-date totals. For instance, the customer might have reached the critical order mass and be upgraded, updating properties used for price and discount calculations; products may or may not be available; delivery dates may or may not have been realistic; and so forth. These changes are important to both parties, but with CRUD, the customer's copy of the order would not reflect them.
Regardless of the physical implementation, an order has the semantics of requesting the service accept and process a new order. Such an order is a complex request for a complex business transaction and I regard such a request as a "journal." A journal being a set of business actions combined in a single request document.
The reason for my allergic reaction to CRUD may stem from the way I see some developers misuse Datasets. Such a developer will write a service that returns a Dataset to the user, where it is modified and sent back. The service or even a stored-procedure checks the before and after image, the timestamp or the version ouch! That is a very crude way of checking the validity of the request Will my request fail to execute if someone changed something irrelevant in this record, and can it execute even though someone changed related data? Am I prevented from changing the telephone number because someone else added an e-mail address? Can I still add the order line even though someone changed the price or the product description?
Instead, why not have the request itself specifically state what it entails and specify when the update should be executed or refused? For example, rather than "Buy when the version of the information retrieved in an earlier request was 30256917," have the condition be "Buy when the price is at most x." Or why not instead of: "the old address was y, the new address is z" have a request that specifies the reason for change. Did the customer move? Then we better start a complete process. Or did we just correct a typo in the contact information?
Let's take an example that clearly does not work and analyze why it doesn't work. I keep my local check register and I transfer money from my U.S. checking account to my Dutch Checking account. If I specify that change by specifying the changed values of my checking account:Account NumberAccountBalanceInterest Rate From654321-90U.S. checkingUSD 1200.00Rate 3.41% To654321-90U.S. checkingUSD 1000.00Rate 3.41%
And the changes of my Dutch checking account:Account NumberAccountBalanceInterest Rate From654321-10Dutch checkingEUR 45000.00Rate 3.56% To654321-10Dutch checking EUR 45244.09Rate 3.56%
This fails because it:
Is there a better way to do this? A much better way to do this is to send the data to the bank as a series of requests framed as business actions such as: "Add USD 200 to my Dutch checking." This way, there is only one source for the data and the two do not need to be rectified. Also, I let the business logic check for the other dependencies, such as the conversion rate.
Yet a lot of solutions have been built using the principle of CRUD and at least some of them were very successful. CRUD has been used successfully for entity aggregation as well as for offline scenarios. Why did the designers choose CRUD?
Replication is the main reason for designers to use CRUD. Ease of interface design and implementation is another. People want to use replication as offered by e-mail systems and databases. Even though I will not go into the various ways to configure databases for data replication, I will look into how to use replication to keep two or more systems synchronized while changes may occur on each. As we are all aware, in computer science there are only three numbers: zero, one and more. Hence, when I write about replication between two systems, feel free to extrapolate this to more.
When designing for replication, it is wise to assign ownership of the data. We often talk about "single master." The essence is that for each piece of data a single owner exists, making conflict resolution easier. This applies to services that have overlapping data, such as the customer information in an ERP system and in a CRM system.
When replicating between multiple clients and a service, the service obviously owns the data.
However, when changes can be made on multiple client machines, this single master concept doesn't help much. Making sure that changes stem from only one source, be it a system or a real person, helps avoiding conflict. And it helps providing clear error messages in case a conflict occurs after all.
When two systems, each with its own data store and each with its own business logic, need to synchronize their information I can, using CRUD, either replicate at the data level or I can replicate at the service level.
This type of replication can be seen between databases, between file systems and in email systems. This type of replication implies CRUD. It uses the changes to the datathe entitiesfor the replication. When replicating at the data level, the business logic in the services on either side have no easy way of protecting the consistency of the data; the data will just be replicated and the business logic will only execute when a conflict is detected, if at all. Since the data in the system gets changed without much scrutiny, for a workable solution, this implies a high degree of trust and leads to tight coupling between the systems.
Database-to-databaselevel replication has the advantage that you can use standard mechanisms such as database replication. This immediately leads to a couple of other advantages. These mechanisms are typically fast, they sometimes support synchronization through firewalls and, the design and coding effort is greatly reduced.
The alternative is to provide some additional logic to walk through the changed data on both sides and use the service's published interface to replicate the changes. Normally this takes more development work and may carry a performance penalty. When you choose this method, you are not necessarily constrained to using CRUD and you may use any mechanism you want.
Even though service-to-service replication needn't go through CRUD, often entity services will offer CRUD semantics because in general they make replication mechanisms easier to build. When a request comes in, the before image of the entity is retrieved, the changes are applied, the after image is retrieved and both before and after image are stored in a change log for later use during the replication. This can easily be provided as a piece of reusable code in the service facade.
I can make our example even more complex by combining these two mechanisms. I can do this by adding a data store in the middle. This would be useful in a scenario in which I use Microsoft Outlook 2003 as the front end to a CRM system for instance. Outlook would have the data in its local store, for offline use and Exchange Server 2003 would sit in the middle and synchronize the data with the CRM system on the one side and with Outlook on the other. Outlook and Exchange Server would replicate the data using their own established mechanisms and Exchange and the CRM service would replicate the entities through the business logic of the CRM system. I will get back to this scenario later.
By using CRUD, you choose for an optimistic concurrency mechanism. You retrieve an entity, you change it and then you send it back. If the entity was changed in the service in the mean time, you get an optimistic failure. CRUD does not work well, when there is a good chance of a concurrency conflict, nor when it is hard notifying the user of such a conflict.
However, concurrency is not always an issue. The contact information in a customer relations management (CRM) system will not change often. The appointments in my calendar are typically only changed by me. Even if our administrator has complete change rights to my calendar, the chance of both of us changing the same appointment is slim, because of the common business practice that she just does not change my calendar unless she has to.
This is a good example of how in many cases, concurrency issues are guarded better by business practice or habit, policies and process than by the software. The software should support these and possibly enforce these, but without them, the software enforcement makes little sense.
Using CRUD for replication solutions seems very reasonable. So can we use CRUD or can't we? Let's compare the CRUD-based service interfaces with a few other cases and see if we can draw parallels.
In object-oriented design (OOD) a class typically has a set of properties. These properties are accessors (Get/Set) to the private data for that class. Properties are generally used to keep the internal state consistentthey are not used to trigger complex business logic. The Set methods typically contain little code and they typically have little or no effect on other classes and instances. Otherwise, changing the value of a property would lead to a cascading effect of property changes and the system would become unmanageable. This means that not all properties can be set by all levels of the design. The inventory quantity of a product cannot be changed simply by a user or even by another class. Usually the design will have a posting method that handles incoming goods, outgoing goods, or an inventory correction. It makes sure that the auditing requirements are met and that the general ledger and the cost accounting will be updated etc. The SetInventoryQuantity property will be private, not public, and it will definitely not be exposed outside of the component.
Using CRUD compares to using public properties and using business actions compares to using public methods. Only CRUD may change multiple properties at once. An update will retrieve the set of properties of an entity (or object), change one or more of these properties and send the property-set back to the service for update. (Luckily it allows us to send all properties at once otherwise performance alone would make CRUD unworkable.)
Property changes that have side effects (effects that transcend keeping internal state consistent) are not CRUD candidates. For these, you should define requests that convey the business meaning of those requests as well as the conditions under which to execute them. The inventory quantity cannot be changed through CRUD; it can only be changed indirectly, for instance, by submitting an incoming goods order.
In working with databases, CRUD is the normal way of dealing with data, regardless of whether or not we offer CRUD at the service interface. If you offer a CRUD service interface, the service will in most cases take the property changes and map them to logical requests that are executed and, deep down in the service, use CRUD to do the database updates. If you do not use a CRUD interface, but offer business actions such as "upgrade the customer" or "set the new quantity for inventory," the service will execute the request and then, again, deep inside the bowels of that service, still use CRUD on the database.
But there are a few notable differences between talking to a database and talking to a service, even when talking to a service using CRUD:
Data in the database is typically normalized. That means that properties aren't duplicated and consequently a single update typically suffices to maintain consistency. Services often offer a denormalized view on the entities. But, if we look at a single service I don't think it's a major difference. A well-designed service could (and should) provide the logic to keep the data consistent. It becomes different if the same information is held in multiple services, for instance changes to my bank accounts in the above example. Then logic external to these services has to keep the services synchronized. In our CRM example, the contact information may be stored both in the CRM system and in the e-mail system.
I think all three points lead to the same conclusions. Services are not databases, but by only providing CRUD interfaces from a user's perspective they offer a poorer form of database-like behavior. Restricting yourself to CRUD is like designing classes that only expose properties and no methods. And, it is like designing a service that acts like a database. There is a very narrow range of problems for which this is acceptable. Offering only CRUD is not sufficient in cases where you change multiple entity instances in one business transaction and require consistency between these entities. In such cases, services should offer business actions and business logic and be given the information needed to execute those business actions and ensure that business logic.
Think about it this way. A service encapsulates entities and it protects consistency through business logic. If the original business request affects multiple entities, a service can only provide complete service if it gets enough information about the original request. Then it can execute business logic, apply that business logic even across multiple entities, and maintain both information consistency and transactional consistency. CRUD does not offer this.
Now, this may all be true, but is it important? As is always the case, sometimes it is and sometimes it isn't. Architecture and design are all about making trade offs. So, when is it important to expose business actions instead of CRUD?
Why did the account value increase EUR 244.09? Was it a deposit? Transfer from another account? Payroll payment? Is this amount taxable?
Because it uses more attributes than needed. The chance of conflict increases steeply with the number of attributes specified. Why can't I add the Dutch translation of the product description, when someone else added the German translation?
Many transactions depend on values in other entities. You cannot set the home currency balance of a foreign currency account without making sure the rate didn't change.
Booking a trip where it is important to have a flight, hotel and car, or booking an order where both products and service are required at a specific time are examples of requests that require consistency.
The simplest examples of concurrency issues are the general ledger and the stock-on-hand information. They are hardly ever, if at all, exposed through CRUD. Not because of dependencies, but because of the concurrent nature of updates. The Venetians already knew this in the fourteenth century and that is why they invented postings and journals.
The contact information in a CRM system will not change often. The appointments in my calendar are typically only changed by me. Even if our administrator has complete change rights to my calendar, the chance of both of us changing the same appointment is slim, because of the common business practice that she just does not change my calendar unless she has to. In many cases, concurrency issues are guarded better by business practice or habit, policies and process than by the software. The software should support these and possibly enforce these, but without them, the software enforcement makes little sense.
To summarize this: Only use CRUD when concurrency issues are rare and can be resolved.
Concurrency issues may be rare because:
You can reduce concurrency issues by minimizing changes to entities and by creating new instances instead. Think of keeping multiple address records and itemization of totals like in general ledger or stock on hand.
Granularity or chattiness is neither an argument for nor against either approach. When the service exposes methods that accept complex requests and even combinations of requests (as in the journal pattern in Dealing with Concurrency), requests may even be less chatty than when using CRUD. On the other hand, the synchronization mechanisms are often well tuned, so there shouldn't be much difference between the two approaches. And both approaches are obviously infinitely better than sending individual set-property requests.
At Microsoft, the people in the field use a CRM system, but for most of their work they only use a subset of what that system has to offer. They keep their contacts up to date; they add opportunities and activities and keep these up to date as well; and they read the customer information. They would like to do that in the environment they use to communicate with the customers and with their colleagues and they would like to be able to do that while at the customer site. What if we were to provide this to them via Outlook? Outlook clearly only supports replication with Exchange, so that all changes made to local data would be replicated to the Exchange server and from there into the CRM service. While limited to CRUD, the solution would be limited in functionality, but I believe it would be a good, even recommended, solution.
Let's look at the entities involved:
More complex cases don't have to be offered through Outlook. The original solution provides these cases. Most users, and especially occasional users, will have sufficient functionality through Outlook and can hook up to the corporate network in the seldom case they need more. More demanding users can still use the original solution and have more overhead, but also more functionality.
Until here I've discussed CRUD as an all-or-nothing approach. It doesn't have to be that way. That would only be the case if you were to rely entirely on standard replication mechanisms. A good way of combining approaches is to replicate information from the service and to queue requests to the service. (You can optionally change the local data to reflect those requests, but if those changes are ignored and overwritten after the requests have been processed by the service these local changes don't affect the solution.) The business actions in the queue will then be processed and these actions can follow all recommendations made in Dealing with Concurrency.
Instead of entering the requests in a queuing system, I can store the requests in a database table that I use as a queue and have them replicated. This would reduce my infrastructure requirements by using one mechanism (database replication) instead of two (replication and queuing). By doing this, I use replication as a messaging mechanism. Processing the replicated requests is exactly the same as processing requests that have been sent to the service using mechanisms such as SOAP, only the transport mechanism is different.
A variation on this is to model the requests as entities. The order is a good example of this. The order is essentially a request to the service. Modeling the order as an entity has the additional advantage that it becomes part of the business system and that it now can be tracked and audited. Another example would be the address change of the customer. Instead of designing a request for address change, I could design an address change form that has all the required information. The form would state the reason, such as correcting incorrect information, a new telephone number, or moving to a new location; it would have the desired date of change and so forth. Now, if the customer calls me six months later, I have the original request and the activities that I executed that relate to that request. The address change form is now an entity in my system. I can use many of the standard replication mechanisms to replicate orders and address change forms. They are new instances in my system or database and they do not introduce concurrency issues. Note however, that neither the local order entry nor the local address changes need to have any effect on the other local data. Processing the request (or new entity) by the service may affect the other entities and those changes will then be replicated back to the local store.
Suppose you want to write a solution for a service technician. The technician needs to fill out hours on the road, hours spent servicing, diagnosing and repairing. He or she needs to specify the material used, the condition of the devices, and such. The information about the customers, the devices, the maintenance contract per device, the prices for the materials, all can be replicated to the notebook. However the per customer service information the technician enters consists of more than just an invoice, it can be a complex set of requests to the service that may contain a few problems on synchronization that need to be resolved before the complete set is accepted.
CRUD is not all bad, but it requires a deep understanding of its limits and when to use it. You cannot blindly apply it in all situations requiring multiple copies of data. You must make sure that your problem does not contain concurrency issues or require complex business requests. Even in these cases, CRUD can sometimes be used with other methods as a way of streamlining the solution.