InfluxDB With Cascaded Downsampling

InfluxDB as Round-Robin Database replacement

Time-series databases are useful beasts. They are particularly useful for storing and visualising measurements.

When storing measurement data, a particularly useful property is the granularity decay. Most recent measurement data is kept with the highest granularity, and as time passes, the data is aggregated and downsampled in progressively coarser steps.

The venerable tools, RRD and Graphite (or more accurately, its Carbon/Whisper storage), require you to configure the granularity, compaction and retention upfront. InfluxDB doesn't.

If you want the same kind of retention logic and granularity decay with InfluxDB, there are a few hoops to jump through. Oddly enough, configuring such a setup is not really documented.

Data storage and downsampling in InfluxDB

Retention time and data granularity are tied to retention policies, which are used to specify how long the stored data is kept around. However, they say nothing about what this data should look like.

As time-series data comes in, it gets stored in, and is retained according to, the DEFAULT retention policy. (Yep, such a thing always exists. Even if you didn't create it.)

When storing and accessing data, InfluxDB uses $database.$retention_policy.$datapoint_name as the full data point path. Incidentally, $database/$retention_policy is also an on-disk path under the main data directory.

We might as well call them buckets.

So, internally InfluxDB writes the incoming data points to the DEFAULT bucket. The retention policy is just a fancy way of saying that data will be expired and deleted from the bucket once it is older than the retention time.

What has this got to do with downsampling?

We're getting to that.

The standard use case for downsampling is that all data, across all time series dimensions, is continuously being aggregated according to configured granularity decay rules. So far nothing in the above has dealt with this aspect.

The trick with InfluxDB is that we can create individual buckets with progressively longer and longer retention periods. And finally, we tell InfluxDB how to populate these buckets with data. Until then, only the DEFAULT bucket will be written to.

Step 1: Choose your retention periods

Let's go with something truly extreme. Don't try this at home. [1]

We want the following:

  • 1-second granularity for 60 hours
  • 10-second granularity for 8 days
  • 30-second granularity for 15 days
  • 2-minute granularity for 45 days
  • 5-minute granularity for 120 days
  • 15-minute granularity for 220 days, and
  • 1-hour granularity for 800 days
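
To get a feel for what these tiers cost, here is a rough Python sketch that counts how many points a single series would hold in each tier. The tier list is copied from above. The helper names are made up for illustration, and the count assumes exactly one point per granularity step. Real disk usage depends on compression and on how many series there are.

```python
# Rough estimate: points kept per series in each tier (retention / granularity).
# Assumes exactly one point per step; all names here are illustrative only.

SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400}

# (granularity, retention) pairs taken from the list above
TIERS = [
    ("1s", "60h"),
    ("10s", "8d"),
    ("30s", "15d"),
    ("2m", "45d"),
    ("5m", "120d"),
    ("15m", "220d"),
    ("1h", "800d"),
]


def to_seconds(duration):
    """Convert a string such as '15m' or '800d' to seconds."""
    return int(duration[:-1]) * SECONDS[duration[-1]]


def points_per_series(granularity, retention):
    """Number of points one series holds in a tier at steady state."""
    return to_seconds(retention) // to_seconds(granularity)


if __name__ == "__main__":
    total = 0
    for granularity, retention in TIERS:
        count = points_per_series(granularity, retention)
        total += count
        print(f"{granularity:>4} for {retention:>5}: {count:>8} points")
    print(f"all tiers combined: {total} points per series")
    # For comparison: 1-second data kept flat for the full 800 days
    print(f"1s for 800d flat:   {points_per_series('1s', '800d')} points")
```

Under these assumptions the seven tiers together hold 435,600 points per series, against 69,120,000 if 1-second data were kept flat for 800 days.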

All the data will be coming from collectd.

Step 2: Create named retention policies

Now that we know how long we want to store data, and how we want it to decay, it's time to get down and dirty.

Create a text file with the following contents:

CREATE DATABASE collectd
CREATE RETENTION POLICY "1s_for_60h" ON collectd DURATION 60h REPLICATION 1 DEFAULT
CREATE RETENTION POLICY "10s_for_8d" ON collectd DURATION 8d REPLICATION 1
CREATE RETENTION POLICY "30s_for_15d" ON collectd DURATION 15d REPLICATION 1
CREATE RETENTION POLICY "2m_for_45d" ON collectd DURATION 45d REPLICATION 1
CREATE RETENTION POLICY "5m_for_120d" ON collectd DURATION 120d REPLICATION 1
CREATE RETENTION POLICY "15m_for_220d" ON collectd DURATION 220d REPLICATION 1
CREATE RETENTION POLICY "1h_for_800d" ON collectd DURATION 800d REPLICATION 1

And run it with InfluxDB: influx < textfile
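
If you would rather not type the statements by hand, a small script can generate the text file. This is only a sketch: the helper is a hypothetical convenience and not part of InfluxDB. It assumes the "<granularity>_for_<duration>" naming scheme and marks the first, finest tier as DEFAULT.

```python
# Generate the CREATE statements from a list of (granularity, duration) tiers.
# The helper is hypothetical; only the emitted statements are fed to InfluxDB.

TIERS = [
    ("1s", "60h"),
    ("10s", "8d"),
    ("30s", "15d"),
    ("2m", "45d"),
    ("5m", "120d"),
    ("15m", "220d"),
    ("1h", "800d"),
]


def retention_policy_statements(database, tiers):
    """Return one CREATE statement per line; the first tier becomes DEFAULT."""
    statements = [f"CREATE DATABASE {database}"]
    for index, (granularity, duration) in enumerate(tiers):
        name = f"{granularity}_for_{duration}"
        statement = (
            f'CREATE RETENTION POLICY "{name}" ON {database} '
            f"DURATION {duration} REPLICATION 1"
        )
        if index == 0:
            # Incoming data lands in the finest-grained bucket
            statement += " DEFAULT"
        statements.append(statement)
    return statements


if __name__ == "__main__":
    print("\n".join(retention_policy_statements("collectd", TIERS)))
```

Redirect the output to a file and feed that file to influx as shown above.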

At this point we have the data buckets in place, but data is still only being stored in the DEFAULT bucket.

NOTE: There has to be a DEFAULT. That is the only bucket where incoming data is written to.

Step 3: Tell InfluxDB how to generate the downsampled data

As we have already learned, the out-of-the-box behaviour of InfluxDB is to only write data points to the DEFAULT bucket. However, we expect the RRD/Graphite semantics - at least they are intuitive.

InfluxDB has a concept of CONTINUOUS QUERY. We can think of them as time-based triggers. A continuous query runs at specified time intervals, reads data from one RETENTION POLICY bucket and writes - likely modified - data to another.
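
To make the idea concrete, here is a conceptual Python sketch of what a mean-based downsampling step does to one series. It is not InfluxDB code and says nothing about how InfluxDB implements this internally. It only mimics grouping points into fixed time windows and averaging them. The function name and the (timestamp, value) tuple layout are assumptions for illustration.

```python
# Conceptual sketch of "SELECT mean(value) ... GROUP BY time(10s)".
# Not InfluxDB code; it only mimics the windowing on a plain list of points.
from collections import defaultdict


def downsample_mean(points, interval):
    """Average (timestamp, value) points into fixed windows of `interval` seconds.

    Each output timestamp is the start of its window.
    """
    windows = defaultdict(list)
    for timestamp, value in points:
        window_start = timestamp - (timestamp % interval)
        windows[window_start].append(value)
    return [
        (start, sum(values) / len(values))
        for start, values in sorted(windows.items())
    ]


if __name__ == "__main__":
    # Twenty one-second samples with values 0..19
    raw = [(second, float(second)) for second in range(20)]
    print(downsample_mean(raw, 10))  # [(0, 4.5), (10, 14.5)]
```

Twenty 1-second points collapse into two 10-second points. Cascading means feeding that output into the next, coarser step.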

We have the missing piece of the puzzle.

In order to generate the downsampled data, we will need to create continuous queries that progressively aggregate all time-series data from one bucket to another.

So, we can create a file with contents like this:

CREATE CONTINUOUS QUERY "cq_10s_for_8d" ON "collectd" BEGIN SELECT mean(*) INTO "collectd"."10s_for_8d".:MEASUREMENT FROM /.*/ GROUP BY time(10s),* END
CREATE CONTINUOUS QUERY "cq_30s_for_15d" ON "collectd" BEGIN SELECT mean(*) INTO "collectd"."30s_for_15d".:MEASUREMENT FROM collectd."10s_for_8d"./.*/ GROUP BY time(30s),* END
CREATE CONTINUOUS QUERY "cq_2m_for_45d" ON "collectd" BEGIN SELECT mean(*) INTO "collectd"."2m_for_45d".:MEASUREMENT FROM collectd."30s_for_15d"./.*/ GROUP BY time(2m),* END
CREATE CONTINUOUS QUERY "cq_5m_for_120d" ON "collectd" BEGIN SELECT mean(*) INTO "collectd"."5m_for_120d".:MEASUREMENT FROM collectd."2m_for_45d"./.*/ GROUP BY time(5m),* END
[... and so on ...]

And run it: influx -database collectd < myqueries.txt
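
The cascade follows a regular pattern, so the query file can be generated as well. The sketch below is a hypothetical helper, not an InfluxDB tool. It assumes the same tier list and naming scheme as the retention policies, lets the first query read from the DEFAULT bucket, and has every later query read from the tier just above it.

```python
# Generate the cascaded CREATE CONTINUOUS QUERY statements.
# The helper is hypothetical; the statement layout mirrors the file above.

TIERS = [
    ("1s", "60h"),
    ("10s", "8d"),
    ("30s", "15d"),
    ("2m", "45d"),
    ("5m", "120d"),
    ("15m", "220d"),
    ("1h", "800d"),
]


def continuous_query_statements(database, tiers):
    """Return one continuous query per tier, skipping the finest tier."""
    names = [f"{granularity}_for_{duration}" for granularity, duration in tiers]
    statements = []
    for index in range(1, len(tiers)):
        granularity = tiers[index][0]
        target = names[index]
        if index == 1:
            # First level: no explicit source, so it reads the DEFAULT bucket
            source = "/.*/"
        else:
            # Later levels read from the next finer bucket
            source = f'{database}."{names[index - 1]}"./.*/'
        statements.append(
            f'CREATE CONTINUOUS QUERY "cq_{target}" ON "{database}" BEGIN '
            f'SELECT mean(*) INTO "{database}"."{target}".:MEASUREMENT '
            f"FROM {source} GROUP BY time({granularity}),* END"
        )
    return statements


if __name__ == "__main__":
    print("\n".join(continuous_query_statements("collectd", TIERS)))
```

Seven tiers produce six queries, since the finest tier is filled directly by the incoming data.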

If we look at the first two continuous queries, we can see that there is a slight syntactical difference. The first aggregation level reads data from the DEFAULT bucket, and the subsequent ones read from their respective higher granularity buckets.

However, internally the created queries are stored like this:

cq_10s_for_8d   CREATE CONTINUOUS QUERY cq_10s_for_8d ON collectd BEGIN SELECT mean(*) INTO collectd."10s_for_8d".:MEASUREMENT FROM collectd."1s_for_60h"./.*/ GROUP BY time(10s), * END
cq_30s_for_15d  CREATE CONTINUOUS QUERY cq_30s_for_15d ON collectd BEGIN SELECT mean(*) INTO collectd."30s_for_15d".:MEASUREMENT FROM collectd."10s_for_8d"./.*/ GROUP BY time(30s), * END
cq_2m_for_45d   CREATE CONTINUOUS QUERY cq_2m_for_45d ON collectd BEGIN SELECT mean(*) INTO collectd."2m_for_45d".:MEASUREMENT FROM collectd."30s_for_15d"./.*/ GROUP BY time(2m), * END
cq_5m_for_120d  CREATE CONTINUOUS QUERY cq_5m_for_120d ON collectd BEGIN SELECT mean(*) INTO collectd."5m_for_120d".:MEASUREMENT FROM collectd."2m_for_45d"./.*/ GROUP BY time(5m), * END
[... and so on ...]

The first query has been created with the name of the DEFAULT bucket as the data source, even though we didn't specify it.

After this operation, we can inspect the files on the disk and see how the storage buckets and continuous queries behave:

fluxhost:/var/lib/influxdb% find data -path '*/collectd/*.tsm'
data/collectd/10s_for_8d/13/000000005-000000002.tsm
data/collectd/10s_for_8d/18/000000001-000000001.tsm
data/collectd/autogen/2/000000020-000000002.tsm
data/collectd/30s_for_15d/14/000000002-000000002.tsm
data/collectd/2m_for_45d/15/000000001-000000001.tsm
data/collectd/1s_for_60h/17/000000002-000000002.tsm
data/collectd/1s_for_60h/9/000000002-000000002.tsm
data/collectd/1s_for_60h/12/000000014-000000003.tsm

Result: Downsampled data for all collected series

It took a few somewhat unintuitive steps, but we have created a progressively decaying time-series storage in InfluxDB.

At the time of writing, the above sequence has not been really documented. Official docs explain how to build the individual RETENTION POLICY and CONTINUOUS QUERY elements, but not really how they should be intuitively tied together.

Footnotes

  1. Most time-series setups store their highest granularity data at 10- or 20-second interval and start to decay it after just a few hours. Higher granularity with long retention period will explode the storage requirements.

Engineering challenges at elsewhere

Engineering challenges at Smarkets

A few days ago I posted a long-in-the-making article about the common engineering problems and constant challenges at the $dayjob blog. There are a number of repeating themes and questions that come up when doing interviews, so I figured I might as well answer them all in one place, and with sufficient detail.

After all, why not?

Presentation slides - Professionally Paranoid Infrastructure

DC4420 Presentation: Infrastructure Design For Professionally Paranoid

I gave a surprisingly long presentation in April at the dc4420 monthly event. What was supposed to be at most 40-ish minutes followed by a short Q&A ended up taking more than an hour thanks to lively discussion during the different segments.

This was an overview of infrastructural design constraints for a betting exchange, especially focused on finding practical ways to address the regulatory demands. At the cross-section between gambling and FinTech the environment comes with some fairly unique externally imposed requirements.

The slides are a revised version of the presentation material, reworked to be better suited for distribution.

You can find the slides here: Infrastructure Design for the Professionally Paranoid

Presentation slides - size_t Does Matter

DC4420 Presentation: size_t Does Matter

I gave a short presentation in October at the dc4420 monthly event. The talk was about the simple theory and practice behind hash extension attacks.

You can find the slides here: size_t Does Matter

Quoted on the Finnish Phoenix

BBC on the Finnish Phoenix

Well that was unexpected. BBC just quoted me.

High Up the Northern Line (Barnet branch)

Woodside Park

Way, way up in the north-west, along the Barnet branch of the Northern line, lies an airy and leafy region. On the map the place looks like any other suburb. On the ground the vibe is quite different.

The small brook passing through the area is bigger, and more freely flowing, than one would expect. Possibly thanks to that, the air is constantly moving, ever so slightly. Everything around here is decently kept, and the appearances are a mix of modern-at-their-time and genuinely modern.

While there is the unavoidable air traffic, it doesn't overwhelm. Overall, the place feels a bit like Forest Hill, only calmer.

A place for a family, if you can afford it.

Walking south along the tube line one soon gets into ...

West Finchley

The general feel in and around West Finchley is somewhat more mixed. Also, while the place gives an air of having been slightly less well kept, it's still safe and inviting.

If it wasn't for the contrast against Woodside Park, this would still be a first-class family neighbourhood. Now it feels like this is where the reasonably affluent have overflowed when the north became too expensive.

As one goes further south, the airy feel takes on an increasingly compact tone.

The high street is a nice experience. It's not too crowded, and feels like a nice village-like place.

Going even further down south, there is ...

Finchley Central

This is the forking point of the Barnet branch.

For someone with a Finnish origin, it can be described succinctly: Kamppi before the high-end overhaul.

As long as you can find a place slightly off the busiest roads, all of these places should still be reasonable locations to settle down with a family.

Eltham

Towards the south-east edge of Greenwich, in a region that once may have been part of Blackheath, lies the village of Eltham. From the surface it looks very nice indeed.

Patios and front porches remain unwalled, with a fair number hosting tarp-covered scooters. It appears that the village is neither inhabited nor raided by kleptomaniacs.

Reality sinks in when walking around the village. The first thing you notice is the constant lack of silence. Anywhere even remotely close to the train station, one cannot escape a chronic blare of heavy traffic. The ever-present thrum originates from a major bypass, which feeds on the proximity to Blackwall Tunnel. For a Finn: imagine living next to Länsiväylä, with a 24/7 rush hour, and you get the idea. (There's also a surprising amount of air traffic, as if the place required an insurance against momentary lapses of automotive cacophony.)

The second is that above all, the whole place feels compressed. Eltham village must once have been a very desirable place to live in, to cram that many people into such confined spaces.

There is one final observation I made in Eltham. While it is certainly a place of well-off families, it's not abundantly so. Cars are not the expensive models - in general. Houses are neatly maintained but not lavish - in general. And as already pointed out, the plots are tiny - in general. Walking around the village, one sees the occasional corner plot with a lavish mansion, usually with a Jag or Bentley parked in front. These uncanny locations have one thing in common: they are nestled in very special spots. As you approach (and pass) these plots, you will soon realise that right at these spots the overall noise level from the heavy traffic somehow gets muted. If this is a result of urban planning, it is the work of a disturbed genius.

In Eltham, the properly rich occupy the quiet places, while everyone else gets to drown in the noise.

Leytonstone

Right at the fork of the eastern Central Line lies a village that should, by all accounts, be nice. The place is next to Epping Forest, and has a very well functioning tube connection (considering it is located at the border of zones 3 & 4, the connections are outright spectacular).

But as any statistician or computer engineer will attest, real-world data rarely conforms to an ideal model. The place is not scary, but it is quite run-down. The high street is dominated by betting and pawn shops, interleaved with pound stores. Geographically thinking, the place has every reason to be prosperous, but from the looks of it, the only people prospering from the neighbourhood may be slumlords.

And even then, that is not quite true. There is an almost schizophrenic geographical division in the outlooks of both the properties and the people living in them. The houses right next to Epping Forest or Bush Wood are airier and very well kept. But walk just half a block away from the green spaces, and you could have entered a different town altogether. At least the general air of being cramped remains constant.

The final argument against living with a family in Leytonstone comes from how the houses are kept. While many of the plots have their small gardens, the standard accessory for garden walls is nothing less than a spool of freaking razor wire. People don't tend to fake things like that.

The pub near the tube station is nice though. I can't shake a suspicion that given a chance, people merely opt to visit the place. It's not scary, but it's not really inviting either.

Chigwell

Chigwell (and areas nearby)

The Central line punches through a big chunk of London. At the north-east end it splits into two branches: the main line, and a fascinating curved loop. The village of Chigwell is at the north end of this loop.

Technically the place is not in London. Coming from London, if you find yourself in Chigwell, you've crossed the border of Essex about 2 km earlier.

One of the first things welcoming a visitor near the tube station is a plant nursery. A large one at that. An omen, promising affordable land prices, but one that soon proves to be a lie.

The buildings in Chigwell are a happily mixed bunch. They range from imitation colonial style to classic, almost idyllic British cottages. They seem to range from quite old to very recent, but for some reason the overall style is never too unlike everything else nearby. The happiest finding was that there is very little feeling of picket fences. The lion's share of the houses are quite individual, and it is very rare for more than 4 buildings in a row to share an identical mold.

Somewhat related to this is the proliferation of "Hands off Chigwell" campaign posters on windows. At first one can't really understand why the locals would oppose the addition of 1200 new houses. But then the realisation sinks in: almost all the houses in and around the village are obviously results of individual builders doing what they have felt was appropriate. A mega-constructor, building 1200 houses all at once, could not help but carpet-bomb the entire region with practically identical, soulless constructs.

That could well ruin the identity of Chigwell.

A short walk due east, one encounters the small village of Grange Hill.

Grange Hill

There is a quirky division in how houses look in Grange Hill. The lots next to the tube station are all generally clean and appear comfortable. The same applies to the buildings inside the Central Line arc.

But go just a few hundred metres outside the region cordoned off by the tube, and the look changes. Where one expects to see a variety of houses, one is suddenly surprised by a collection of all too similar, crunched-up looking locations. Almost like a baby Titan had laid out his toys just to see how many duplicates there were. In a neighbourhood otherwise so charming, a sudden spasm of uninvitingness is even more striking thanks to the incidental contrast.

On a very positive note the cars in and around Grange Hill show a welcome lack of lavishness. At least people are not paying more for their rides than their homes.

Hither Green, revisited

Last fall when I first visited Hither Green, it felt like a non-place. There weren't many stores, and many of the commercial spaces were empty. In just six months, it has changed - for the better.

The place feels alive. The previously empty spaces have new businesses in them, servicing families. On my previous visit, there was a feeling of desolation around the train station. This too has changed.

It's almost as if the proximity to Lewisham is no longer a problem or a hindrance, but an asset. It may be a pre-emptive Crossrail effect - when the Canary Wharf terminal opens in 2018, everything within a decent distance becomes highly desirable. With the DLR, Canary Wharf is just 15 minutes away from Lewisham.

This may be more than just the usual gentrification playing out.