Perception is almost everything

Misunderstood Indexing

Holding the number-one spot on the Corruption Perceptions Index tells you nothing about how well a country is doing.

It merely highlights how bad the situation is even for the runner-up.

Dear online surveillance addicts

Ground rules for acceptable ads online

The following is an edit of a piece originally written in November 2015.

The original Google search ads were the pinnacle of non-intrusive online advertising. They were out of the way, clearly marked as ads - and hence could be visually filtered out. They were pure text, so they could be neatly included as elements on the rendered page. And they always targeted an INTEREST. Not an individual.

I will take that as the minimum acceptable advertising behaviour. I'm not implying it's perfect, but at least it gives us a clear set of ground rules. With that in mind, my ideal, non-intrusive ad mechanism builds on the following rules:

  • Ads must never be inline to page content.
  • Even when clearly out of the way, ads must not be allowed to mimic page content; they must be clearly marked as ads.
  • Text only.
  • I might accept an image within the ad, provided it is always served from the content provider's own system.
  • As an extension to the previous point: if the served image size would exceed a notable fraction of the page size, it must not be included in the output.
  • No user tracking of any kind.
  • No third-party JavaScript. Ever.
  • At most 15% of the display real estate may be used by ads, including the padding in the UI. (It all counts as space denied to content.)
  • Not allowed to affect page content load times. Ad material must be included at the end of the page code. If your service pushes ads from a separate internal system, hard timeouts must be imposed: if that system cannot serve an ad within the allotted time, the frontend must never be forced to wait. You just missed an ad impression. Tough.
  • If clicking an ad takes a user through a bounce page, all identifiable information about the user must be stripped. The bounce page or redirect must not impose any further page loading delay.
  • No beacons.

Breaking even one of the rules automatically disqualifies you.

If you, as an advertiser, find these rules unacceptable - well, then we are in mutual disagreement. I find your ads equally unacceptable and will treat them as a form of cancer.

However, as a genuine service to the user... please allow users to search for the ads that have been displayed to them. Preferably by display context. I would be glad to return to a subject at a later date and search for something I remember seeing earlier.

The above set of rules is still not ideal, but anything that behaved according to them would at least be palatable.

Waste not, want not

Visions of future past

In our lifetime we will have seen the Western countries not just close but barricade their borders. Should an unannounced ship approach, carrying desperate human beings fleeing the wars and the devastation, we won't even be allowed to think about accepting them.

The ships will be, not stopped and turned around, but torpedoed and sunk on sight. The drowned will be harvested for feed and fertiliser.

My only consolation is that I am old enough that I won't necessarily witness all of it.

So many edges

Random musings, part [REDACTED]

Software is like a diamond ...

... the better it glistens, the more edges there are.

... the hardest substance on the planet, yet it can shatter from a single impact.

... creating one can destroy your tools.

... no matter how well it wears, it'll still burn.

Your passwords have been compromised

Another blog post at $dayjob

So I wrote another one at work. After explaining to various parties how and why password cracking attempts happen, I felt it was prudent to write the whole thing down for future reference outside the corporate walls.

With that in mind, your passwords have almost certainly been compromised

TL;DR: use high-entropy passwords, a password manager, and proper two-factor authentication.

Presentation slides - Intended Consequences

DC4420 Presentation: Intended Consequences

Faithful to habit, I found myself giving a presentation at dc4420. Lessons in usability and the search for sanity gave rise to a talk on how to deal with conflicting auditing demands.

The talk was geared towards industry long-timers who have, for reasons only marginally within their control, found themselves appeasing externalities and ticking boxes. I wanted to highlight that hope is not lost.

You can find the slides here: Intended Consequences

Evolution of debugging prowess

Evolution and progression of debugging prowess

When things break in mysterious ways, developers tend to go through a familiar series of increasingly demanding steps. As experience, skills and even personal networks grow, we can find ourselves diving ever deeper into the following chain:

  1. "It must be in my code." -- hours of debugging
  2. "Okay, it must be somewhere in our codebase." -- days of intense debugging and code spelunking
  3. "It HAS TO be in the third party libraries" -- days of issue tracker excavations and never-before-enabled profiling runs
  4. "It can't possibly be in stdlib..." -- more of the same, but now profiling the core runtime libraries
  5. "Please let this not be a compiler bug" -- we become intensely familiar with mailing list archives
  6. "I will not debug drivers. I will not debug drivers. I will not debug drivers."
  7. "I don't even know anyone who could help me figure out the kernel innards."
  8. "NOBODY understands filesystems!"
  9. "What do you mean 'firmware edge case'?"
  10. "Where is my chip lab grade oscilloscope?"

How far down have you found yourself?

InfluxDB With Cascaded Downsampling

InfluxDB as Round-Robin Database replacement

Time-series databases are useful beasts, particularly for storing and visualising measurements.

When storing measurement data, a particularly useful property is the granularity decay. Most recent measurement data is kept with the highest granularity, and as time passes, the data is aggregated and downsampled in progressively coarser steps.

The venerable tools, RRD and Graphite (or more accurately, Graphite's Carbon/Whisper storage), require you to configure upfront how granularity, compaction and retention are handled. InfluxDB doesn't.

If you want the same kind of retention logic and granularity decay with InfluxDB, there are a few hoops to jump through. Oddly enough, configuring such a setup is not really documented.

Data storage and downsampling in InfluxDB

Retention time and data granularity are tied to retention policies, which specify how long the stored data is kept around. However, they say nothing about what this data should look like.

As time-series data comes in, it gets stored in, and is retained according to, the DEFAULT retention policy. (Yep, such a thing always exists. Even if you didn't create it.)

When storing and accessing data, InfluxDB uses $database.$retention_policy.$datapoint_name as the full data point path. Incidentally, $database/$retention_policy is also an on-disk path under the main data directory.

We might as well call them buckets.

So, internally InfluxDB writes the incoming data points to the DEFAULT bucket. The retention policy is just a fancy way of saying that data will be expired and deleted from the bucket once it is older than the retention time.
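
To make the path concrete, here is a hedged sketch. Assuming the stock DEFAULT policy (named autogen, as the on-disk paths later in this post show) and a hypothetical collectd-fed measurement called cpu_value, a fully qualified query looks like this:

SELECT * FROM "collectd"."autogen"."cpu_value" LIMIT 5

Reading from a downsampled bucket later on simply means swapping the retention policy segment of that path.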

What has this got to do with downsampling?

We're getting to that.

The standard use case for downsampling is that all data, across all time-series dimensions, is continuously aggregated according to the configured granularity decay rules. So far, nothing above has dealt with this aspect.

The trick with InfluxDB is that we can create individual buckets with progressively longer and longer retention periods. And finally, we tell InfluxDB how to populate these buckets with data. Until then, only the DEFAULT bucket will be written to.

Step 1: Choose your retention periods

Let's go with something truly extreme. Don't try this at home.1

We want the following:

  • 1-second granularity for 60 hours
  • 10-second granularity for 8 days
  • 30-second granularity for 15 days
  • 2-minute granularity for 45 days
  • 5-minute granularity for 120 days
  • 15-minute granularity for 220 days, and
  • 1-hour granularity for 800 days

All the data will be coming from collectd.
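
As an aside, one hedged way to get collectd data into InfluxDB is the database's native collectd listener in influxdb.conf. The exact keys, defaults and types.db location vary between InfluxDB and collectd versions, so treat this as a sketch rather than a drop-in config:

[[collectd]]
  enabled = true
  # collectd's default network plugin port
  bind-address = ":25826"
  # must match the database created in the next step
  database = "collectd"
  typesdb = "/usr/share/collectd/types.db"

On the collectd side, the network plugin then just needs to point at this address.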

Step 2: Create named retention policies

Now that we know how long we want to store data, and how we want it to decay, it's time to get down and dirty.

Create a text file with the following contents:

CREATE DATABASE collectd
CREATE RETENTION POLICY "1s_for_60h" ON collectd DURATION 60h REPLICATION 1 DEFAULT
CREATE RETENTION POLICY "10s_for_8d" ON collectd DURATION 8d REPLICATION 1
CREATE RETENTION POLICY "30s_for_15d" ON collectd DURATION 15d REPLICATION 1
CREATE RETENTION POLICY "2m_for_45d" ON collectd DURATION 45d REPLICATION 1
CREATE RETENTION POLICY "5m_for_120d" ON collectd DURATION 120d REPLICATION 1
CREATE RETENTION POLICY "15m_for_220d" ON collectd DURATION 220d REPLICATION 1
CREATE RETENTION POLICY "1h_for_800d" ON collectd DURATION 800d REPLICATION 1

And run it with InfluxDB: influx < textfile

At this point we have the data buckets in place, but data is still only being stored in the DEFAULT bucket.

NOTE: There has to be a DEFAULT. That is the only bucket that incoming data is written to.
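
As a quick sanity check (a hedged example; the exact output columns vary by InfluxDB version), the buckets can be listed from the influx shell:

SHOW RETENTION POLICIES ON collectd

The 1s_for_60h policy should be flagged as the default, with autogen and the six downsampled buckets listed alongside it.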

Step 3: Tell InfluxDB how to generate the downsampled data

As we have already learned, the out-of-the-box behaviour of InfluxDB is to write data points only to the DEFAULT bucket. However, we expect the RRD/Graphite semantics - at least they are intuitive.

InfluxDB has a concept of a CONTINUOUS QUERY. We can think of these as time-based triggers. A continuous query runs at specified time intervals, reads data from one RETENTION POLICY bucket and writes - likely modified - data to another.

We have the missing piece of the puzzle.

In order to generate the downsampled data, we will need to create continuous queries that progressively aggregate all time-series data from one bucket to another.

So, we can create a file with contents like this:

CREATE CONTINUOUS QUERY "cq_10s_for_8d" ON "collectd" BEGIN SELECT mean(*) INTO "collectd"."10s_for_8d".:MEASUREMENT FROM /.*/ GROUP BY time(10s),* END
CREATE CONTINUOUS QUERY "cq_30s_for_15d" ON "collectd" BEGIN SELECT mean(*) INTO "collectd"."30s_for_15d".:MEASUREMENT FROM collectd."10s_for_8d"./.*/ GROUP BY time(30s),* END
CREATE CONTINUOUS QUERY "cq_2m_for_45d" ON "collectd" BEGIN SELECT mean(*) INTO "collectd"."2m_for_45d".:MEASUREMENT FROM collectd."30s_for_15d"./.*/ GROUP BY time(2m),* END
CREATE CONTINUOUS QUERY "cq_5m_for_120d" ON "collectd" BEGIN SELECT mean(*) INTO "collectd"."5m_for_120d".:MEASUREMENT FROM collectd."2m_for_45d"./.*/ GROUP BY time(5m),* END
[... and so on ...]

And run it: influx -database collectd < myqueries.txt

If we look at the first two continuous queries, we can see that there is a slight syntactic difference: the first aggregation level reads data from the DEFAULT bucket, and the subsequent ones read from their respective higher-granularity buckets.

However, internally the created queries are stored like this:

cq_10s_for_8d   CREATE CONTINUOUS QUERY cq_10s_for_8d ON collectd BEGIN SELECT mean(*) INTO collectd."10s_for_8d".:MEASUREMENT FROM collectd."1s_for_60h"./.*/ GROUP BY time(10s), * END
cq_30s_for_15d  CREATE CONTINUOUS QUERY cq_30s_for_15d ON collectd BEGIN SELECT mean(*) INTO collectd."30s_for_15d".:MEASUREMENT FROM collectd."10s_for_8d"./.*/ GROUP BY time(30s), * END
cq_2m_for_45d   CREATE CONTINUOUS QUERY cq_2m_for_45d ON collectd BEGIN SELECT mean(*) INTO collectd."2m_for_45d".:MEASUREMENT FROM collectd."30s_for_15d"./.*/ GROUP BY time(2m), * END
cq_5m_for_120d  CREATE CONTINUOUS QUERY cq_5m_for_120d ON collectd BEGIN SELECT mean(*) INTO collectd."5m_for_120d".:MEASUREMENT FROM collectd."2m_for_45d"./.*/ GROUP BY time(5m), * END
[... and so on ...]

The first query has been created with the name of the DEFAULT bucket as the data source, even though we didn't specify it.

After this operation, we can inspect the files on disk and see how the storage buckets and continuous queries behave:

fluxhost:/var/lib/influxdb% find data -path '*/collectd/*.tsm'
data/collectd/10s_for_8d/13/000000005-000000002.tsm
data/collectd/10s_for_8d/18/000000001-000000001.tsm
data/collectd/autogen/2/000000020-000000002.tsm
data/collectd/30s_for_15d/14/000000002-000000002.tsm
data/collectd/2m_for_45d/15/000000001-000000001.tsm
data/collectd/1s_for_60h/17/000000002-000000002.tsm
data/collectd/1s_for_60h/9/000000002-000000002.tsm
data/collectd/1s_for_60h/12/000000014-000000003.tsm

Result: Downsampled data for all collected series

It took a few somewhat unintuitive steps, but we have created progressively decaying time-series storage in InfluxDB.
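
Note that nothing picks the right bucket at query time; the caller selects the decay level via the retention policy segment of the path. A hedged sketch, again using a hypothetical cpu_value measurement with a value field:

SELECT mean("value") FROM "collectd"."1s_for_60h"."cpu_value" WHERE time > now() - 1h GROUP BY time(10s)
SELECT mean("value") FROM "collectd"."5m_for_120d"."cpu_value" WHERE time > now() - 90d GROUP BY time(1h)

The first reads the last hour from the full-resolution DEFAULT bucket; the second pulls 90 days from the 5-minute bucket. A dashboard that wants RRD-like behaviour has to make the same choice per panel or per zoom level.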

At the time of writing, the above sequence is not really documented. The official docs explain how to build the individual RETENTION POLICY and CONTINUOUS QUERY elements, but not how they are meant to be tied together.

Footnotes

  1. Most time-series setups store their highest-granularity data at 10- or 20-second intervals and start to decay it after just a few hours. Higher granularity combined with a long retention period will explode the storage requirements.
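
     To put rough numbers on it: at 1-second granularity a single series accumulates 86,400 points per day, or roughly 69 million points over 800 days, whereas at 1-hour granularity the same 800 days amounts to only about 19,200 points per series.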

Engineering challenges at elsewhere

Engineering challenges at Smarkets

A few days ago I posted a long-in-the-making article on the $dayjob blog about our common engineering problems and constant challenges. A number of recurring themes and questions come up during interviews, so I figured I might as well answer them all in one place, and in sufficient detail.

After all, why not?

Presentation slides - Professionally Paranoid Infrastructure

DC4420 Presentation: Infrastructure Design for the Professionally Paranoid

I gave a surprisingly long presentation in April at the dc4420 monthly event. What was supposed to be at most 40-ish minutes followed by a short Q&A ended up taking more than an hour, thanks to lively discussion during the different segments.

This was an overview of the infrastructure design constraints for a betting exchange, especially focused on finding practical ways to address regulatory demands. At the intersection of gambling and FinTech, the environment comes with some fairly unique, externally imposed requirements.

The slides are a revised version of the presentation material, better suited for distribution.

You can find the slides here: Infrastructure Design for the Professionally Paranoid