I’m experimenting with a different blogging system, so it will take a while for everything to get back up to speed.

I wrote a paper in the latter half of last year that discusses how streams of data look and feel in the real world. We often describe streams in terms of “messages per second” or “megabits per second”, but this only gives us an understanding of the stream at a per-second level. More interesting, I think, is what happens at the subsecond level.

For example, if you had a stream that transmits 1,000 messages per second on average, what does that really mean? Most developers I have spoken to would take this to mean that they will receive roughly one message every millisecond. So far so good. It falls down, though, when that assumption becomes a design decision: software gets built and optimized to process one message every millisecond, even though the data at a subsecond level might not look anything like that.

What if the stream actually sends 900 messages in the first half of a second and then 100 messages evenly spaced throughout the rest of the second? That’s still 1,000 messages per second, but the impact on a system is significantly different. It’s really important to be able to understand how the stream operates at the message or “atomic” level.
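To make that concrete, here’s a small sketch (the numbers mirror the example above; it’s not anything from the paper) of what happens to a consumer that can only handle one message per millisecond when the same 1,000 messages arrive uniformly versus in a burst.

```python
# A minimal sketch: compare a uniform stream with a bursty one that has the
# same per-second average, and see what each does to a consumer that can
# only drain one message per millisecond.

def max_backlog(arrival_times_ms, service_rate_per_ms=1):
    """Return the largest queue depth seen by a consumer that drains
    `service_rate_per_ms` messages each millisecond."""
    queue = 0
    worst = 0
    for ms in range(1000):
        # messages arriving during this millisecond
        queue += sum(1 for t in arrival_times_ms if ms <= t < ms + 1)
        # consumer drains at its fixed rate
        queue = max(0, queue - service_rate_per_ms)
        worst = max(worst, queue)
    return worst

# 1,000 messages spread evenly across one second: one every millisecond.
uniform = [float(i) for i in range(1000)]

# 900 messages in the first 500 ms, then 100 spread over the last 500 ms.
bursty = [i * (500 / 900) for i in range(900)] + [500 + i * 5 for i in range(100)]

print("uniform stream, max backlog:", max_backlog(uniform))  # stays at zero
print("bursty stream,  max backlog:", max_backlog(bursty))   # builds a queue of hundreds
```

Both streams are “1,000 messages per second”, but the bursty one leaves the consumer hundreds of messages behind at its worst point, which is exactly the kind of behaviour a per-second figure hides.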

There are also different types of stream, or at least different ways that a stream can and will be consumed. Designs can use this to their advantage in order to optimize for a particular workload. Does an application need to consume an entire stream, or can it drop data? Does it need to process it in real time, or can it store it to disk for later processing? All of these things have an impact on how the system can, and potentially should, be designed.
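As a purely illustrative sketch of those two consumption styles (none of this reflects how CakeDB itself is implemented), a lossy real-time consumer might drop old messages when it falls behind, while a store-first consumer persists everything and processes it later, off the hot path:

```python
from collections import deque

def handle(message):
    pass  # placeholder for whatever work the application actually does

class LossyConsumer:
    """Keeps only the most recent messages; older ones silently fall off."""
    def __init__(self, capacity=1024):
        self.buffer = deque(maxlen=capacity)  # bounded: oldest entries are dropped

    def receive(self, message):
        self.buffer.append(message)

    def process(self):
        while self.buffer:
            handle(self.buffer.popleft())

class StoreFirstConsumer:
    """Writes every message to disk; processing happens later."""
    def __init__(self, path="stream.log"):
        self._out = open(path, "a")
        self.path = path

    def receive(self, message):
        self._out.write(message + "\n")  # a real system would batch/flush carefully

    def process_later(self):
        self._out.close()
        with open(self.path) as f:
            for line in f:
                handle(line.rstrip("\n"))
```

Which of these shapes is acceptable depends entirely on the application, which is why the consumption pattern deserves as much attention as the raw rate.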

The paper is called “Adapting CakeDB to Integrate High-Pressure Big Data Streams with Low-Pressure Systems” and you can find it on Google Scholar.