Monday 15 March 2010
Several years ago (I think it was 2002) I was involved in a project to create a Business Intelligence solution for a Short Message Service (SMS) system.
The solution inserted a copy of the log records from the SMS system into an Oracle database and used the data to create reports for the marketing department of the telecom provider (for instance, to find patterns in demographics: at what time of day do most teenagers text their friends?). The system would also be used by the provider's helpdesk to answer questions from end users (for instance, when a text message was sent but not delivered, why was it not delivered?).
The project had already been running for a few months, and BI specialists were working on data models, reporting, user interfaces and the like. I was asked to look at the infrastructural aspects. One of the first questions I asked was how many log records the system was supposed to insert into the Oracle database and process. The answer was stunning.
10,000 records per second.
The system was supposed to insert 10,000 records into an Oracle database each and every second, 24/7. Of course the next question was how they were going to do this. The answer was also stunning.
By inserting them.
The project team had no clue that this was quite a challenge. When I looked for information on the maximum speed at which records could be inserted into an Oracle database, I found that the highest speed reported at the time was around 1,000 inserts per second: ten times too slow for us. I suggested performing a proof of concept to find out how fast we could insert records in our database setup. The outcome: 500 inserts per second. A bit disappointing, and a clear threat to the project.
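A proof of concept like this does not have to be elaborate. A minimal sketch of the naive row-by-row approach, in Python with the cx_Oracle driver (the connection string, table and columns are hypothetical), would look something like this:

import time
import cx_Oracle  # Oracle driver for Python

# Hypothetical connection string and table; adjust to your own setup.
conn = cx_Oracle.connect("scott/tiger@dbhost/SMSDB")
cur = conn.cursor()

ROWS = 10000
start = time.time()
for i in range(ROWS):
    # One INSERT statement per row: the approach the project assumed.
    cur.execute(
        "INSERT INTO sms_log (id, payload) VALUES (:1, :2)",
        (i, "test message"),
    )
conn.commit()
elapsed = time.time() - start
print(f"{ROWS / elapsed:.0f} inserts per second")

The value of running something like this against the actual target hardware and database configuration is that it gives you a real number instead of a figure from a vendor sheet.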
We solved the problem partly with some fancy Oracle tricks, and in the end, using the same proof of concept setup, we reached an acceptable 5,000 inserts per second (possibly a world record at the time).
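I won't detail the tricks here, but a common way to get an order-of-magnitude speed-up is array binding: sending many rows per round trip with executemany and committing once per batch instead of per row. A sketch under the same hypothetical setup as above:

import time
import cx_Oracle

# Same hypothetical table and connection string as before.
conn = cx_Oracle.connect("scott/tiger@dbhost/SMSDB")
cur = conn.cursor()

ROWS = 100000
BATCH = 1000
rows = [(i, "test message") for i in range(ROWS)]

start = time.time()
for offset in range(0, ROWS, BATCH):
    # executemany binds a whole array of rows and sends it in one
    # round trip, which is where most of the speed-up comes from.
    cur.executemany(
        "INSERT INTO sms_log (id, payload) VALUES (:1, :2)",
        rows[offset:offset + BATCH],
    )
conn.commit()  # a single commit instead of one per row
elapsed = time.time() - start
print(f"{ROWS / elapsed:.0f} inserts per second")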
The point I want to make is that the project apparently needed an architect to show them the pitfall in the solution, and that a proof of concept was needed to find out how the solution would behave. Such a proof of concept should have been one of the first things done in the project.
I have very good experience with using proofs of concept in projects. A proof of concept should be used to test the most challenging parts of your solution early in the project. This is not a natural thing to do: most people start with the part of the project they feel most familiar with, and the more challenging parts are usually addressed at a later stage. But these challenging parts need to be addressed anyway and could delay the project or even halt it. A proof of concept shows this at a time when not too much money has been spent yet, and shows the project team and the customer that the project's highest risk is being taken care of.
Monday 01 March 2010
In general, much attention is paid to the creation of consistent backups. Specific backup tools and backup agents are configured to ensure databases are flushed to disk to provide a consistent backup.
Of course this is very important. An inconsistent database can fail to start correctly after a restore; an example is an index that is out of sync with its underlying tables.
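For Oracle, one common mechanism behind such tools is hot backup mode, which keeps datafile copies recoverable while the database stays open. A minimal sketch (credentials and file paths are hypothetical; real backup agents typically use RMAN instead):

import shutil
import cx_Oracle

# Hypothetical credentials and paths; adjust to your own environment.
conn = cx_Oracle.connect("sys/secret@dbhost/SMSDB", mode=cx_Oracle.SYSDBA)
cur = conn.cursor()

# In backup mode, datafile copies taken now can be made consistent
# again at restore time by applying the redo logs.
cur.execute("ALTER DATABASE BEGIN BACKUP")
try:
    shutil.copy("/u01/oradata/SMSDB/users01.dbf", "/backup/users01.dbf")
finally:
    # Leave backup mode as soon as possible; redo volume is higher
    # while it is active.
    cur.execute("ALTER DATABASE END BACKUP")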
Database transactions must be handled correctly as well before making a backup, to prevent a failing application after a restore. An example is a purchase order that states a product came in, while the corresponding change in the financial administration is missing.
The question is whether it is useful to force consistency in backups at a higher level. I think this is a very expensive exercise (if possible at all) with little benefit. Nowadays systems hardly ever work in isolation; usually they are part of a chain of internal and external systems. Purchase orders come in via order intake in a SAP system, but can also come in through resellers' web services or via an Internet site. These systems are connected, so it seems to make sense to back them up in an integral way.
However, this is hardly possible. To create an integral backup, all systems must be in a consistent state, not only internally but also between each other. This is only possible when all connected systems are stopped. Not only is this usually not feasible (an Internet site cannot be taken offline to create a backup), it is also very time-consuming. If one of the systems in the chain cannot be stopped due to a long-running transaction, all other systems have to wait for it before they can be backed up.
I am not even mentioning chains running across several companies. When transactions run between companies, all companies must stop working to create a chain-consistent backup.
Let’s first find out why a backup is made at all in large systems (I am not talking about restoring a lost user file). A backup can be used to restore systems that cannot be repaired any other way. This means that the choice to perform a restore has great business impact. A restore must therefore be handled with great care.
It is important to have backups that are consistent within one application, to prevent the problems stated above. Usually backups are one or more days old. If, after a restore, the system were started without additional measures, unwanted side effects could occur. For instance: if just before the backup was made a message from a business partner was received and the acknowledgement had not yet been sent, the acknowledgement would be sent again to the business partner after the restore of the system a day later. You can probably come up with more examples.
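One way to soften this particular side effect (my sketch, not part of the original setup) is to make acknowledgements idempotent: derive the acknowledgement ID deterministically from the inbound message, so the business partner can recognise and discard a duplicate that arrives after a restore:

import hashlib

# Persisted durably on the receiving side in reality; a set is
# enough for this sketch.
seen_acks = set()

def ack_id(message_id):
    # Deterministic: the same inbound message always produces the same
    # acknowledgement ID, even when the sender is restored from backup.
    return hashlib.sha256(("ack:" + message_id).encode()).hexdigest()

def receive_ack(message_id):
    """Return True for a new acknowledgement, False for a duplicate."""
    ack = ack_id(message_id)
    if ack in seen_acks:
        return False  # duplicate caused by a restore; safe to discard
    seen_acks.add(ack)
    return True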
Examples like these show that restoring a system from backup is always a delicate process that must be handled with great caution, especially in a chain of multiple applications or even companies. It is therefore of no use to have a consistent backup over a set of applications: problems will arise anyway when a restore is performed without additional measures.
A consistent database backup and consistency within each application are therefore enough.