04 September 2012

Traditional wisdom would suggest that you shouldn’t bite off more than you can chew. Instead, you should know your limits and keep within them, or at least, if you are going to cross the line, do so with very small steps. I can see the logic to this argument, and I agree that a little knowledge can be a dangerous thing. However, a willingness to go beyond the comfort zone is often a defining characteristic of the people who have made some of the most critical breakthroughs in history.

The reason I mention writing a database is simply because that is what I have done. For various reasons I wasn’t able to find anything to scratch my particular itch, and so I set about making the world a better place. Well, really I set about making my life a happier place, which is altogether more selfish but also very beneficial, as I’ll explain shortly.

Because I had a very specific problem that I wanted to solve, I had a pretty big advantage. I knew what I needed (or at least what I thought I needed) to make the pain go away and I had the incentive that by coming up with a solution I wouldn’t have to bang my head against the wall each morning. So far so good, but how do you actually go about building a database?

The answer: Just build it.

Now you might think that this would mean a lot of problems and a whole platoon of dead ends. You’d be right. Just as I thought I had one problem solved, it highlighted a glaringly obvious error (at least, obvious now) that I had overlooked earlier. There’s no way I would have picked up the problem by myself or with a pen and paper. It was working (or in my case not working) code that showed me the path.

The answer, I think, lies in iteration. I don’t think anyone (well, apart from a select few) really starts a project knowing how it will end and how it will work. If they did, it wouldn’t be development. Even the big software companies spend millions trying to create good software - and they don’t always get it right. Sometimes they get it very, very wrong. So if they can’t do it, why should we assume that we can? By getting our hands dirty and getting working code as soon as possible, the potential issues reveal themselves very quickly.

I guess you could call that organic development. Each task follows logically from the one before. Many people disagree with this approach and say that software should be properly designed. Don’t get me wrong: I’m not suggesting that no design work should be done or that getting out a pen and a pad of paper won’t help your efforts - it will. What I am saying is that trying to design the complete system from scratch, when you really don’t have the experience or expertise to do so, is asking for a world of pain. You are taking a theoretical design and trying to make it work in the real world. In other words, you’re trying to bend reality to meet your theoretical model. Sometimes your model is very close to reality and it works well. Other times - well…

That’s why I think the design should be an overview. It is a map describing the lay of the land, but it is not the land itself. A map may depict the battlefield, but the realities on the ground are not always reflected in the map itself.

As a case in point, I have spent the evening watching a database restore itself. Actually, I should rephrase that and say try to restore itself, because so far it is not having much success. This got me thinking about my database and how I would recover data in CakeDB.

CakeDB is effectively append-only, so barring serious disk failure it’s not likely that there would be corruption in the middle of a stream; the more likely case is data missing off the end (i.e. an incomplete record). But what if there were a disk failure and a 100MB hole appeared in the middle of the database file? What then?

In that case I’d be fairly stuck. The current file format for CakeDB packs records one after the other. That’s all well and good, but it assumes that after it finishes reading record A, it will find record B. With corrupt data this might not be the case, and once you hit corrupt data, how do you tell where it ends? You can no longer rely on the positioning in the file, after all.
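To make that concrete, here is a minimal sketch of a reader over length-prefixed records. The layout is an assumption for illustration (a 4-byte length followed by the payload), not CakeDB’s actual on-disk format, but it shows the problem: every record’s position is defined only by the records that came before it.

```python
import struct

def read_records(path):
    """Read length-prefixed records packed one after the other.

    Assumed layout for illustration (not CakeDB's real format):
    each record is a 4-byte big-endian length followed by that
    many bytes of payload.
    """
    records = []
    with open(path, "rb") as f:
        while True:
            header = f.read(4)
            if len(header) < 4:
                break  # clean end of file (or a truncated final record)
            (length,) = struct.unpack(">I", header)
            payload = f.read(length)
            if len(payload) < length:
                break  # incomplete record at the tail of the file
            records.append(payload)
    return records
```

A single corrupted length field sends the reader off to some arbitrary offset, and everything it parses from that point onwards is garbage.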

The first option is to use the index, as that can get you close enough to most parts of the file, or at least limit your loss to 1,000 records (depending on the index step). But what if your index went down the tube at the same time? If that happened, well, it would be something of a nightmare. You could step through from the top of the file until you find corruption and then start at the bottom of the file and read up - but what if you have two small bits of corruption, one near the start and one near the end of the file? You’d miss a huge chunk in the middle.
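Assuming the index does survive, a salvage pass might look something like the sketch below. The offsets list and the read_chunk helper are hypothetical, but the idea is the one above: try each index-delimited chunk in turn, and if a chunk is corrupt, throw it away and carry on, losing at most one index step’s worth of records.

```python
def recover_with_index(path, offsets, read_chunk):
    """Salvage readable chunks using a sparse index of byte offsets.

    `offsets` is assumed to hold one byte offset per 1,000 records
    (a hypothetical index format), and `read_chunk(path, start, end)`
    is assumed to read the records between two offsets, raising
    ValueError if it hits corrupt data.
    """
    recovered, lost_chunks = [], 0
    bounds = list(offsets) + [None]  # None = read to the end of the file
    for start, end in zip(bounds, bounds[1:]):
        try:
            recovered.extend(read_chunk(path, start, end))
        except ValueError:
            lost_chunks += 1  # at most one index step (~1,000 records) lost
    return recovered, lost_chunks
```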

To get around that, I’m going to borrow an idea from video encoding (and probably countless other places). Every X records, I will insert a small header. This header will allow CakeDB to get its bearings should the data be corrupted: it simply skips along until the next complete header and tries to read from there.
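Here is a rough sketch of the idea. The marker bytes, the interval and the helper names are all assumptions for illustration rather than CakeDB’s real format: a recognisable magic sequence is written every N records, and after hitting corruption the reader scans forward byte by byte until it finds the next marker.

```python
import struct

SYNC = b"\xde\xad\xbe\xef"   # assumed 4-byte magic marker
EVERY = 1000                 # insert a marker every 1,000 records ("X")

def write_record(f, payload, count):
    """Append one record, preceded by a sync marker every EVERY records."""
    if count % EVERY == 0:
        f.write(SYNC)
    f.write(struct.pack(">I", len(payload)))
    f.write(payload)

def resync(f):
    """After hitting corruption, scan forward for the next sync marker.

    Returns True with the file positioned just past the marker, or
    False if no marker remains (the rest of the file is a write-off).
    """
    window = f.read(len(SYNC))
    while len(window) == len(SYNC):
        if window == SYNC:
            return True
        window = window[1:] + f.read(1)  # slide forward one byte
    return False
```

The normal read path also has to recognise and skip the marker when it lands between records, and because the marker could appear by chance inside a payload, a real implementation would want a longer marker (or to validate the record that follows) before trusting the resync point.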

Is that a good strategy? Maybe, maybe not. It’s simple, easy to implement, and will likely meet my needs for now. Will it still be useful next week? Possibly not, but if it starts to show its limitations I will be intimately aware of what they are, and so I can design a solution accordingly.

So, use the design to provide you with a strategy and then get your hands dirty to figure out the tactics. Above all don’t be afraid to play in the mud - you can always take a shower afterwards!


