279 – Garbage in, garbage out?
As the developer of various decision tools, I’ve lost track of the number of times I’ve heard somebody say, in a grave, authoritative tone, “a model is only as good as the information you feed into it”. Or, more pithily, “garbage in, garbage out”. It’s a truism, of course, but the implications for decision makers may not be quite what you think.
The value of the information generated by a decision tool depends, of course, on the quality of input data used to drive the tool. Usually, the outputs from a decision tool are less valuable when there is poor-quality information about the inputs than when there is good information.
But what should we conclude from that? Does it mean, for example, that if you have poor-quality input information you may just as well make decisions in a very simple ad hoc way and not worry about weighing up the decision options systematically? (In other words, is it not worth using a decision tool?) And does it mean that it is more important to put effort into collecting better input data than into improving the decision process?
No, these things do not follow from having poor input data. Here’s why.
Imagine a manager looking at 100 projects and trying to choose which 10 projects to give money to. Let’s compare a situation where input data quality is excellent with one where it is poor.
From simulating hundreds of thousands of decisions like this, I’ve found that systematic decision processes that are consistent with best-practice principles for decision making (see Pannell 2013) do a reasonable job of selecting the best projects even when there are random errors introduced to the input data. On the other hand, simple ad hoc decision processes that ignore the principles often result in very poor decisions, whether the input data is good, bad or indifferent.
Not every decision made using a sound decision process is correct but, on average, such decisions are markedly better than quick-and-dirty ones. So “garbage in, garbage out” is misleading. If you look across a large number of decisions (which is what you should do), then a better description for a good decision tool could be “garbage in, not-too-bad out”. On the other hand, the most apt description for a poor decision process could be “treasure or garbage in, garbage out”.
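The kind of experiment described above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the simulation behind the results reported here: the three variables (value, adoption, probability of success), their distributions, the lognormal error model and the `ad_hoc_score` rule (rescale value to 0-1 and add the scores) are all hypothetical choices made for the example. The sound process multiplies the variables to estimate each project's benefit; both processes fund the 10 highest-scoring projects out of 100.

```python
import random

N_PROJECTS = 100   # projects on the table
N_FUNDED = 10      # projects that can be funded


def make_projects(rng):
    """True (value, adoption, success probability) per project; distributions are assumed."""
    return [
        (rng.lognormvariate(0.0, 1.0), rng.uniform(0.0, 1.0), rng.uniform(0.1, 1.0))
        for _ in range(N_PROJECTS)
    ]


def add_noise(projects, rng, sd):
    """Multiply every input by an independent lognormal error; sd = 0 means perfect data."""
    def err():
        return rng.lognormvariate(0.0, sd) if sd > 0 else 1.0
    return [
        (v * err(), min(1.0, a * err()), min(1.0, p * err()))
        for v, a, p in projects
    ]


def sound_score(project, _max_value):
    """Multiply the variables: estimated benefit = value x adoption x success probability."""
    v, a, p = project
    return v * a * p


def ad_hoc_score(project, max_value):
    """Hypothetical ad hoc rule: rescale value to 0-1 and add the three scores."""
    v, a, p = project
    return v / max_value + a + p


def select(estimates, score):
    """Indices of the N_FUNDED projects with the highest score."""
    max_value = max(v for v, _, _ in estimates)
    ranked = sorted(range(len(estimates)),
                    key=lambda i: score(estimates[i], max_value), reverse=True)
    return ranked[:N_FUNDED]


def share_of_best(projects, chosen):
    """True benefit of the chosen set as a share of the best achievable set."""
    true_benefit = [v * a * p for v, a, p in projects]
    best = sorted(true_benefit, reverse=True)[:N_FUNDED]
    return sum(true_benefit[i] for i in chosen) / sum(best)


def simulate(score, noise_sd, trials=300, seed=1):
    """Average share of the best achievable benefit over many simulated funding rounds."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        projects = make_projects(rng)
        estimates = add_noise(projects, rng, noise_sd)
        total += share_of_best(projects, select(estimates, score))
    return total / trials


if __name__ == "__main__":
    for label, sd in (("good data", 0.0), ("poor data", 0.5)):
        print(label,
              "| sound:", round(simulate(sound_score, sd), 2),
              "| ad hoc:", round(simulate(ad_hoc_score, sd), 2))
```

Each number printed is the share of the best achievable benefit that a process captures, averaged over many funding rounds. How large the gaps are depends entirely on the assumed distributions and error sizes, so treat the output as a way to explore the argument rather than as evidence for it.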
An interesting question is, if you are using a good process, why don’t random errors in the input data make a bigger difference to the outcomes of the decisions? Here are some reasons.
Firstly, poorer quality input data only matters if it results in different decisions being made, such as a different set of 10 projects being selected. In practice, over a large number of decisions, the differences caused by input data uncertainty are not as large as you might expect. For example, in the project-selection problem, there are several reasons why data uncertainty may have only a modest impact on which projects are selected:
- Uncertainty doesn’t mean that the input data for all projects is wildly inaccurate. Some are wildly inaccurate, but some, by chance, are only slightly inaccurate, and some are in between. The good projects with slightly inaccurate data still get selected.
- Even if the data is moderately or highly inaccurate, it doesn’t necessarily mean that a good project will miss out on funding. Some good projects look worse than they should do as a result of the poor input data, but others are actually favoured by the data inaccuracies, so of course they still get selected. These data errors that reinforce the right decisions are not a problem.
- Some projects are so outstanding that they still seem worth investing in even when the data used to analyse them is somewhat inaccurate.
- When ranking projects, there are a number of different variables to consider (e.g. values, behaviour change, risks, etc.). There is likely to be uncertainty about all of these to some extent, but the errors won’t necessarily reinforce each other. In some cases, the estimate of one variable will be too high, while the estimate of another variable will be too low, such that the errors cancel out and the overall assessment of the project is about right.
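The cancelling-out in the last point can be illustrated with a small sketch. It assumes, purely for illustration, that a project's score is a product of three variables and that each estimate carries a multiplicative error, so the errors add on a log scale; the error spread `SD` is an arbitrary choice. When the three errors are independent, the spread of the combined error grows with the square root of the number of variables rather than in proportion to it.

```python
import random
import statistics

rng = random.Random(42)
SD = 0.5           # assumed spread of the log-error on each variable
N_VARS = 3         # e.g. value, behaviour change, risk
N_DRAWS = 20_000

# Independent errors: each variable's log-error is drawn separately, so some
# estimates are too high while others are too low.
independent = [sum(rng.gauss(0.0, SD) for _ in range(N_VARS)) for _ in range(N_DRAWS)]

# Reinforcing errors: a single draw is applied to all variables, so every
# estimate for a project is too high or too low together.
reinforcing = [N_VARS * rng.gauss(0.0, SD) for _ in range(N_DRAWS)]

sd_independent = statistics.stdev(independent)   # theory: SD * sqrt(3), about 0.87
sd_reinforcing = statistics.stdev(reinforcing)   # theory: SD * 3 = 1.5
print(round(sd_independent, 2), round(sd_reinforcing, 2))
```

Under these assumptions the overall assessment is noticeably less uncertain than it would be if every error pushed in the same direction.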
So input data uncertainty means that some projects that should be selected miss out, but many good projects continue to be selected.
Even where there is a change in project selection, some of the projects that come in are only slightly less beneficial than the ones that go out. Not all, but some.
Putting all that together, inaccuracy in input data only changes the selection of projects for those projects that: happen to have the most highly inaccurate input data; are not favoured by the data inaccuracies; are not amongst the most outstanding projects anyway; and do not have multiple errors that cancel out. Further, the changes in project selection that do occur only matter for the subset of incoming projects that are much worse than the projects they displace. Many of the projects that are mistakenly selected due to poor input data are not all that much worse than the projects they displace. So input data uncertainty is often not such a serious problem for decision making as you might think. As long as the numbers we use are more-or-less reasonable, results from decision making can be pretty good.
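A small worked example with made-up numbers shows the same logic. The project names, benefits and estimates below are hypothetical. Project A's estimate is badly wrong but it is outstanding enough to be funded anyway, B is favoured by its error, and the one mistaken swap (D in, C out) costs very little.

```python
def top_k(scores, k):
    """Names of the k highest-scoring projects."""
    return set(sorted(scores, key=scores.get, reverse=True)[:k])


# Hypothetical true benefits and error-laden estimates for eight projects.
true_benefit = {"A": 95, "B": 60, "C": 52, "D": 50, "E": 30, "F": 22, "G": 15, "H": 8}
estimate = {"A": 70, "B": 75, "C": 45, "D": 55, "E": 40, "F": 15, "G": 25, "H": 10}

K = 3
ideal = top_k(true_benefit, K)    # {"A", "B", "C"}
chosen = top_k(estimate, K)       # {"A", "B", "D"}

displaced = ideal - chosen        # good project that misses out: {"C"}
incoming = chosen - ideal         # project funded by mistake: {"D"}

value_ideal = sum(true_benefit[p] for p in ideal)      # 207
value_chosen = sum(true_benefit[p] for p in chosen)    # 205
loss_share = 1 - value_chosen / value_ideal            # just under 1%

print(sorted(displaced), sorted(incoming), round(loss_share, 4))
```

Every estimate here is off by 10 to 25 units, yet only one of the three funding decisions changes and less than 1% of the achievable benefit is lost.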
To me, the most surprising outcome from my analysis of these issues was the answer to the second question: is it more important to put effort into collecting better input data than into improving the decision process?
As I noted earlier, the answer seems to be “no”. For the project-choice problem described above, the “no” is a very strong one. In fact, I found that if you start with a poor-quality decision process, inconsistent with the principles I’ve outlined in Pannell (2013), there is almost no benefit to be gained by improving the quality of input data. I’m sure there are many scientists who would feel extremely uncomfortable with that result, but it does make intuitive sense when you think about it. If a decision process is so poor that its results are only slightly related to the best possible decisions, then of course better information won’t help much.
Further reading
Pannell, D.J. and Gibson, F.L. (2014). Testing metrics to prioritise environmental projects, Australian Agricultural and Resource Economics Society Conference (58th), February 5-7, 2014, Port Macquarie, Australia.
Pannell, D.J. (2013). Ranking environmental projects, Working Paper 1312, School of Agricultural and Resource Economics, University of Western Australia.
David,
Thank you for a stimulating post, as usual. I would point out that there is one reason for getting better data: so that the political process can’t easily use the lack of data or conflicting data as a reason for inaction. As we have seen in the U.S., they have been able to do that with climate change, but I think that we have gotten to a turning point. That is less relevant in cases where the organization has to choose between alternative projects for an agreed-upon goal.
Laura
Very interesting. I have always been fascinated with economists’ obsession with designing elaborate policy schemes (e.g., non-linear pricing schedules to address adverse selection) in an attempt to recover DWL triangles, which are often quite small in comparison to overall market surplus. I expect you would agree with me that at some point it is no longer worth improving the decision tool because the gain in efficiency is likely to be more than offset by the higher cost associated with more complex decision criteria. You probably already discussed this issue in another posting.
Hi Jim
I agree totally. Some welfare gains that are theoretically possible are too small to be worth pursuing. Actually, that relates to Pannell Discussion 277.
Hi David
Interesting comment. I want to make two points, one of pedantry and another of perception.
On pedantry, I would just remind you that data is a plural word. It comes from the Latin datum, or fact, of which the Latin plural is data, or facts.
On perception, I would not want anyone reading your post to jump to the conclusion that logical, sound decision making necessarily requires some formal quantitative decision model or tool. I can envisage situations in which there is very little information on which to rely, or where the information is enormously unreliable. In such cases I would be happier if decision makers confessed their ignorance and then proceeded to go through a formal and analytical, but not necessarily quantitative, process to come to a decision, as opposed to shoving unreliable data into some mathematical model.