Stacking Heuristics

26 October, 2020

There's nothing like a good heuristic -- in his bestseller Thinking, Fast and Slow, Nobel Prize winner Daniel Kahneman points out their almost unreasonable effectiveness. In many cases, naively adding up the output of several scoring systems results in an ensemble model that is very hard to beat, even using way more resources and expert knowledge. In the field of machine learning, heuristics are sometimes a bit of a taboo -- often, engineers prefer to have a deep learning model learn some unknown function instead of applying simplified domain knowledge (which is, in essence, what heuristics usually are). The underlying motivation for this is typically the Idea of the ideal setup: one in which data is infinitely clean and abundant, and in which models train effortlessly at the slightest mention of a gradient (any similarity to Plato's Theory of Forms is purely coincidental).
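To make "naively adding up the output of several scoring systems" concrete, here is a minimal Python sketch. The applicant data and the three scoring rules are hypothetical and invented purely for illustration; the only point is that every score gets the same weight and nothing is learned.

```python
def equal_weight_score(item, scorers):
    """Sum the outputs of several scoring functions, all weighted equally."""
    return sum(scorer(item) for scorer in scorers)


# Hypothetical scoring rules (assumption: each returns a rating from 1 to 5).
scorers = [
    lambda a: a["punctuality"],
    lambda a: a["technical_skill"],
    lambda a: a["communication"],
]

# Hypothetical applicants, made up for this example.
applicants = {
    "A": {"punctuality": 4, "technical_skill": 2, "communication": 5},
    "B": {"punctuality": 3, "technical_skill": 5, "communication": 2},
}

# Rank applicants by their summed score, highest first.
ranking = sorted(
    applicants,
    key=lambda name: equal_weight_score(applicants[name], scorers),
    reverse=True,
)
print(ranking)
```

There are no weights to tune and no training data involved, which is exactly what makes this kind of ensemble so cheap to build.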

Reality, in comparison, is pretty harsh: not only do models frequently fail to do their thing (well, rather learn their thing) for the smallest of reasons, often the problems start even before that, with data being scarce and poorly organised. One way to handle this is to make a model smaller, reducing its number of parameters and with it its ability to overfit to the unfavourable aspects of the data. Funnily enough, this reduction in complexity brings the model closer and closer to the exact thing it was supposed to keep at bay: a heuristic. How? Think about the simplest heuristic you can imagine. Probably, that would be an approach where no matter the situation, your decision or output is fixed. In parallel, the simplest possible machine-based model would also have a constant output, i.e. $f(x_0, \dots, x_i) = c$. Take the complexity up a notch, and our first stop is something akin to a linear model. In both this machine model and the human heuristic, each input element translates to a first-order term, without complex combinations of inputs popping up just yet. At this point, what we've built is the base unit of a neural network: a neuron.
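The progression from constant output to linear model to neuron can be sketched in a few lines of Python. This is an illustrative sketch only: the function names, weights and inputs are made up, and the sigmoid activation in `neuron` is an assumption, since the text does not name a particular non-linearity.

```python
import math


def constant_model(inputs, c=1.0):
    """Simplest possible model: ignores its inputs and always returns c."""
    return c


def linear_model(inputs, weights, bias=0.0):
    """One weight per input, summed up; no interactions between inputs."""
    return sum(w * x for w, x in zip(weights, inputs)) + bias


def neuron(inputs, weights, bias=0.0):
    """The linear model followed by a sigmoid (the activation is an assumption)."""
    return 1.0 / (1.0 + math.exp(-linear_model(inputs, weights, bias)))


# Made-up inputs and weights for illustration.
x = [0.5, -1.0, 2.0]
w = [0.2, 0.4, 0.1]

print(constant_model(x))   # same output whatever x is
print(linear_model(x, w))  # weighted sum of the inputs
print(neuron(x, w))        # weighted sum squashed into the range (0, 1)
```

Read the weights as "how much do I care about this input" and the linear model is just a rule of thumb written down as arithmetic.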

If there's one thing we love doing with these little computational units, it's stacking them. Historically, this hasn't always worked well, as getting layer upon layer to learn anything substantial requires efficient backpropagation, and even then ample computational power on top of that. Today, our cure for deep networks' learning troubles is mostly powered by near-obscene stacks of GPUs, bringing computational horsepower that was out of reach a decade ago. It's no surprise that when we stack our own, human heuristics (which we love doing just as much), we run into many of the same obstacles: with each assumption or simplification applied in sequence, our ability to adapt and learn from outcomes decreases dramatically. Of course, we can try to go the deep learning route and throw more resources at our self-optimisation, but without a human equivalent of Moore's law, any such attempt is inherently doomed to fail. The impact of this failure can be seen in almost every dysfunctional system -- it's in the failure of education systems to promote true equality, but equally in the impact of organisational silos on any SMB. Even in personal relations, with everyone optimising their behaviour with the best of intentions, things still go wrong due to a lack of sync. In summary, people are horrible at acting like neurons in a neural network, and yet very stubbornly keep trying to.

So what do you do when you realise you're not cut out for backpropagation? Moving away from our own local optimisation requires a paradigm shift -- one towards radical systems thinking: using a holistic approach to address our challenges rather than breaking them down into isolated parts to be handled separately. As systems thinking is an active attitude, it's not just hard to achieve, but also very easy to forget about when things get busy. A typical example of this shift can be found in agile methodologies. The goal of many of the rituals in these processes, such as daily standups and backlog refinement sessions, is to spread knowledge and promote collaboration, so as to optimise solutions globally rather than locally. While this adds hours of overhead to any given week, it also saves us the cost of building the wrong thing and having to learn the same lesson twice. That's why, as a team that's all about machine learning, we won't be acting like a neural network anytime soon.
