InfluxDB as Round-Robin Database replacement
Time-series databases are useful beasts. They are especially handy for storing and visualising measurements.
When storing measurement data, a particularly useful property is granularity decay: the most recent measurement data is kept at the highest granularity, and as time passes, the data is aggregated and downsampled in progressively coarser steps.
The venerable tools, RRD and Graphite (or more accurately, its Carbon/Whisper storage), require you to configure upfront how granularity, compaction and retention are handled. InfluxDB doesn't.
If you want the same kind of retention logic and granularity decay with InfluxDB, there are a few hoops to jump through. Oddly enough, configuring such a setup is not really documented.
Data storage and downsampling in InfluxDB
Retention time and data granularity are tied to retention policies, which specify how long the stored data is kept around. However, they say nothing about what this data should look like.
As time-series data comes in, it gets stored in, and is retained according to, the DEFAULT retention policy. (Yep, such a thing always exists, even if you didn't create it.)
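We can see this for ourselves on a fresh database. On InfluxDB 1.x the automatically created policy is called autogen, keeps data forever, and carries the default flag. (The database name below is just an example, and the exact output format varies between versions.)

> CREATE DATABASE mydb
> SHOW RETENTION POLICIES ON mydb
name    duration shardGroupDuration replicaN default
----    -------- ------------------ -------- -------
autogen 0s       168h0m0s           1        true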
When storing and accessing data, InfluxDB uses $database.$retention_policy.$datapoint_name as the full data point path. Incidentally, $database/$retention_policy is also an on-disk path under the main data directory. We might as well call them buckets.
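For illustration, a query that spells out the full path could look like this; cpu_value is a hypothetical measurement name, as the actual names depend on what your collectors send:

SELECT mean("value") FROM "collectd"."autogen"."cpu_value" WHERE time > now() - 1h GROUP BY time(1m)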
So, internally InfluxDB writes the incoming data points to the DEFAULT bucket. The retention policy is just a fancy way of saying that data will be expired and deleted from the bucket once it is older than the retention time.
What has this got to do with downsampling?
We're getting to that.
The standard use case for downsampling is that all data, across all time-series dimensions, is continuously aggregated according to the configured granularity decay rules. Nothing in the above has dealt with this aspect yet.
The trick with InfluxDB is that we can create individual buckets with progressively longer retention periods, and finally tell InfluxDB how to populate these buckets with data. Until then, only the DEFAULT bucket will be written to.
Step 1: Choose your retention periods
Let's go with something truly extreme. Don't try this at home.1
We want the following:
- 1-second granularity for 60 hours
- 10-second granularity for 8 days
- 30-second granularity for 15 days
- 2-minute granularity for 45 days
- 5-minute granularity for 120 days
- 15-minute granularity for 220 days, and
- 1-hour granularity for 800 days
All the data will be coming from collectd.
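As a sketch of the plumbing: InfluxDB 1.x ships with a native collectd listener, and collectd's network plugin can ship metrics to it. The hostname, port and types.db path below are assumptions; adjust them for your installation.

# influxdb.conf: enable the built-in collectd listener
[[collectd]]
  enabled = true
  bind-address = ":25826"
  database = "collectd"
  typesdb = "/usr/share/collectd/types.db"

# collectd.conf: send metrics to the InfluxDB listener
LoadPlugin network
<Plugin network>
  Server "influxdb.example.com" "25826"
</Plugin>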
Step 2: Create named retention policies
Now that we know how long we want to store data, and how we want it to decay, it's time to get down and dirty.
Create a text file with the following contents:
CREATE DATABASE collectd
CREATE RETENTION POLICY "1s_for_60h" ON collectd DURATION 60h REPLICATION 1 DEFAULT
CREATE RETENTION POLICY "10s_for_8d" ON collectd DURATION 8d REPLICATION 1
CREATE RETENTION POLICY "30s_for_15d" ON collectd DURATION 15d REPLICATION 1
CREATE RETENTION POLICY "2m_for_45d" ON collectd DURATION 45d REPLICATION 1
CREATE RETENTION POLICY "5m_for_120d" ON collectd DURATION 120d REPLICATION 1
CREATE RETENTION POLICY "15m_for_220d" ON collectd DURATION 220d REPLICATION 1
CREATE RETENTION POLICY "1h_for_800d" ON collectd DURATION 800d REPLICATION 1
And run it with InfluxDB: influx < textfile
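As a quick sanity check, listing the policies should now show the new buckets, with the default flag on 1s_for_60h instead of autogen. (The output below is illustrative; shard group durations and formatting vary by version.)

$ influx -execute 'SHOW RETENTION POLICIES ON collectd'
name          duration   shardGroupDuration replicaN default
----          --------   ------------------ -------- -------
autogen       0s         168h0m0s           1        false
1s_for_60h    60h0m0s    24h0m0s            1        true
10s_for_8d    192h0m0s   24h0m0s            1        false
...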
At this point we have the data buckets in place, but data is still only being stored in the DEFAULT bucket.
NOTE: There has to be a DEFAULT policy. That is the only bucket that incoming data is written to.
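Should the wrong policy end up as the default, the flag can be moved afterwards with ALTER RETENTION POLICY; for example, to make the highest-granularity bucket from step 2 the write target:

ALTER RETENTION POLICY "1s_for_60h" ON collectd DEFAULT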
Step 3: Tell InfluxDB how to generate the downsampled data
As we have already learned, the out-of-the-box behaviour of InfluxDB is to only write data points to the DEFAULT bucket. However, we expect the RRD/Graphite semantics - at least those are intuitive.
InfluxDB has a concept of a CONTINUOUS QUERY. We can think of them as time-based triggers: a continuous query runs at specified time intervals, reads data from one bucket and writes - likely modified - data to another. This is the missing piece of the puzzle.
In order to generate the downsampled data, we will need to create continuous queries that progressively aggregate all time-series data from one bucket to another.
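The general shape of such a trigger is the following skeleton; all angle-bracketed names are placeholders, and the concrete queries for our decay chain come next:

CREATE CONTINUOUS QUERY "<name>" ON "<database>" BEGIN SELECT mean(*) INTO "<database>"."<target_policy>".:MEASUREMENT FROM "<source_policy>"./.*/ GROUP BY time(<interval>), * END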
So, we can create a file with contents like this:
CREATE CONTINUOUS QUERY "cq_10s_for_8d" ON "collectd" BEGIN SELECT mean(*) INTO "collectd"."10s_for_8d".:MEASUREMENT FROM /.*/ GROUP BY time(10s), * END
CREATE CONTINUOUS QUERY "cq_30s_for_15d" ON "collectd" BEGIN SELECT mean(*) INTO "collectd"."30s_for_15d".:MEASUREMENT FROM collectd."10s_for_8d"./.*/ GROUP BY time(30s), * END
CREATE CONTINUOUS QUERY "cq_2m_for_45d" ON "collectd" BEGIN SELECT mean(*) INTO "collectd"."2m_for_45d".:MEASUREMENT FROM collectd."30s_for_15d"./.*/ GROUP BY time(2m), * END
CREATE CONTINUOUS QUERY "cq_5m_for_120d" ON "collectd" BEGIN SELECT mean(*) INTO "collectd"."5m_for_120d".:MEASUREMENT FROM collectd."2m_for_45d"./.*/ GROUP BY time(5m), * END
[... and so on ...]
And run it: influx -database collectd < myqueries.txt
If we look at the first two continuous queries, we can see that there is a slight syntactical difference. The first aggregation level reads data from the DEFAULT bucket, and the subsequent ones read from their respective higher-granularity buckets.
However, internally the created queries are stored like this (as SHOW CONTINUOUS QUERIES reveals):
cq_10s_for_8d   CREATE CONTINUOUS QUERY cq_10s_for_8d ON collectd BEGIN SELECT mean(*) INTO collectd."10s_for_8d".:MEASUREMENT FROM collectd."1s_for_60h"./.*/ GROUP BY time(10s), * END
cq_30s_for_15d  CREATE CONTINUOUS QUERY cq_30s_for_15d ON collectd BEGIN SELECT mean(*) INTO collectd."30s_for_15d".:MEASUREMENT FROM collectd."10s_for_8d"./.*/ GROUP BY time(30s), * END
cq_2m_for_45d   CREATE CONTINUOUS QUERY cq_2m_for_45d ON collectd BEGIN SELECT mean(*) INTO collectd."2m_for_45d".:MEASUREMENT FROM collectd."30s_for_15d"./.*/ GROUP BY time(2m), * END
cq_5m_for_120d  CREATE CONTINUOUS QUERY cq_5m_for_120d ON collectd BEGIN SELECT mean(*) INTO collectd."5m_for_120d".:MEASUREMENT FROM collectd."2m_for_45d"./.*/ GROUP BY time(5m), * END
[... and so on ...]
The first query has been created with the name of the DEFAULT bucket as the data source, even though we didn't specify it.
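Once the first aggregation intervals have passed, the downsampled buckets can be queried directly. Again, cpu_value is only a placeholder measurement name:

SELECT * FROM "collectd"."10s_for_8d"."cpu_value" WHERE time > now() - 1h ORDER BY time DESC LIMIT 3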
After this operation, we can inspect the files on the disk and see how the storage buckets and continuous queries behave:
fluxhost:/var/lib/influxdb% find data -path '*/collectd/*.tsm'
data/collectd/10s_for_8d/13/000000005-000000002.tsm
data/collectd/10s_for_8d/18/000000001-000000001.tsm
data/collectd/autogen/2/000000020-000000002.tsm
data/collectd/30s_for_15d/14/000000002-000000002.tsm
data/collectd/2m_for_45d/15/000000001-000000001.tsm
data/collectd/1s_for_60h/17/000000002-000000002.tsm
data/collectd/1s_for_60h/9/000000002-000000002.tsm
data/collectd/1s_for_60h/12/000000014-000000003.tsm
Result: Downsampled data for all collected series
It took a few somewhat unintuitive steps, but we have created a progressively decaying time-series storage in InfluxDB.
At the time of writing, the above sequence has not really been documented. The official docs explain how to build the individual RETENTION POLICY and CONTINUOUS QUERY elements, but not really how they should be intuitively tied together.
Footnotes
- Most time-series setups store their highest-granularity data at 10- or 20-second intervals and start to decay it after just a few hours. High granularity with a long retention period will explode the storage requirements. ↩