Monday, 18 September 2017

Getting the question right

I have written quite a few posts in which I analyzed a dataset using some particular mathematical model. Obviously, the model chosen is of some importance here — different models might give different outcomes (although, hopefully not). However, the choice of model is actually determined by the original question being asked of the data — we need to match the question and the appropriate model.

This raises the important issue of getting the question right. This is especially true if we are trying to relate causes and effects. For example: is the causal factor the presence of something, or the absence of something else? Sherlock Holmes is famous for drawing Inspector Gregory's attention to "the curious incident of the dog in the night-time." It turned out that the important thing was that the dog did nothing, under circumstances when a guard dog should clearly have done something. Holmes solved the crime by asking a question about an absence, not a presence.

As an example from the wine world, consider the following graph. It shows the recent time-course of the share of the global wine export market held by each of five countries. The data are taken from Kym Anderson & Nanda R. Aryal (2015) Growth and Cycles in Australia’s Wine Industry: a Statistical Compendium, 1843 to 2013, with additions listed by the AAWE.

Global export percentages for the top five countries

We could ask any number of questions about these data. For example, we could ask about the general increase across the five countries since 1990, and whether it can be sustained. However, the most obvious question is likely to be about the time-course pattern for Australia, which seems to be dramatically different to the other four countries. But should that question be about the sudden increase that occurs from 2000 onwards, or the sudden decrease that occurs after 2005? Which pattern do we try to explain?

The second question (which seems to be the one that the Australian wine industry has been asking) would ask about why the "good times" suddenly crashed in 2005, and what the industry might do about it. On the other hand, the first question might ask about why the increase occurred in the first place, assuming that the subsequent decrease is simply a "return to normal" after a short-term aberration.

Let's look at how we might analyze Question 1. This next graph shows the Australian data compared to the average time-course of the other four wine-exporting countries (i.e. excluding Australia).

The red line shows a very straightforward increase in export percentage through time. We might treat this line as a possible model of the "expected" pattern of growth, and then try to explain why the pattern for Australia does not fit in with it. This would be one way of answering Question 1. To do so, we would apply a mathematical model to the red line, and then see how that model compares to the Australian data.

The next graph shows the fit of a simple polynomial model to the average data, as indicated by the red dashed line. This model fits the data extremely well, accounting for 98% of the variation in the average data.
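For readers who want to try this kind of fit themselves, here is a minimal sketch of fitting a polynomial by least squares and calculating the percentage of variation it accounts for (R-squared). The degree of the polynomial is not stated above, so a quadratic is assumed here. The year offsets and export shares are invented purely for illustration; they are not the Anderson & Aryal data, and the function names are my own.

```python
def fit_quadratic(xs, ys):
    """Least-squares fit of y = a*x^2 + b*x + c; returns (a, b, c)."""
    # Normal equations (X^T X) beta = X^T y for the basis [x^2, x, 1].
    rows = [[x * x, x, 1.0] for x in xs]
    A = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    v = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(3)]
    # Augmented matrix, solved by Gaussian elimination with partial pivoting.
    M = [A[i] + [v[i]] for i in range(3)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, 3):
            factor = M[r][col] / M[col][col]
            for k in range(col, 4):
                M[r][k] -= factor * M[col][k]
    # Back-substitution.
    beta = [0.0, 0.0, 0.0]
    for i in (2, 1, 0):
        s = M[i][3] - sum(M[i][k] * beta[k] for k in range(i + 1, 3))
        beta[i] = s / M[i][i]
    return tuple(beta)


def r_squared(ys, fitted):
    """Proportion of the variation in ys accounted for by the fitted values."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - f) ** 2 for y, f in zip(ys, fitted))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot


# HYPOTHETICAL data: years since 1990 and an average export share (%).
years_since_1990 = [0, 5, 10, 15, 20, 25]
average_share = [1.0, 1.8, 2.5, 3.1, 3.5, 3.8]

a, b, c = fit_quadratic(years_since_1990, average_share)
fitted = [a * t * t + b * t + c for t in years_since_1990]
print("coefficients:", a, b, c)
print("R-squared:", r_squared(average_share, fitted))
```

Measuring time as years since 1990, rather than as calendar years, keeps the numbers small and the arithmetic well behaved.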

We can, of course, now use this model to explore possible forecasts for future export growth. For example, the model forecasts that the Average export percentage will peak at 4.2%, which will occur in c. 2024. A reasonable goal for an exporting country might therefore be to capture 4-5% of the market, and to consider itself to have done well if it exceeds this level.
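A forecast peak of this kind follows directly from the shape of the fitted curve. As a sketch, assuming the model is a quadratic that opens downwards, the peak sits at the vertex of the parabola. The coefficients below are hypothetical: they were chosen only so that the peak lands on the figures quoted above, and they are not the fitted values behind the graph.

```python
def quadratic_peak(a, b, c):
    """Return (x, y) at the maximum of y = a*x^2 + b*x + c."""
    if a >= 0:
        raise ValueError("no maximum: the parabola does not open downwards")
    x_peak = -b / (2 * a)           # where the slope 2*a*x + b is zero
    y_peak = c - b * b / (4 * a)    # the curve's value at that point
    return x_peak, y_peak


# HYPOTHETICAL coefficients, with x measured as years since 1990.
a, b, c = -0.002, 0.136, 1.888

t_peak, share_peak = quadratic_peak(a, b, c)
print("peak year:", 1990 + t_peak)
print("peak export share (%):", share_peak)
```

Note that any such forecast extrapolates well beyond the observed data, so it is only as good as the assumption that the same curve continues to apply.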

More to the point, we can compare this model to the Australia data, as shown in the next graph. The blue dashed line is simply the red dashed line raised by 1.2 percentage points (which is the best fit to the Australia data). This reveals that from 2013 onwards the Australian exports were exactly where we would forecast them to be, based on the 1990-1995 data.
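Raising a curve to best fit another set of points is a simple calculation. As a sketch, assuming "best fit" means least squares, the best vertical shift is just the average difference between the observed values and the model's values over the years being compared. Which years to include is a choice for the analyst; the numbers below are invented for illustration only.

```python
def best_vertical_offset(observed, model):
    """Least-squares constant to add to `model` so that it best matches `observed`."""
    if len(observed) != len(model) or not observed:
        raise ValueError("need two non-empty sequences of equal length")
    # Minimising sum((o - (m + d))**2) over d gives d = mean(o - m).
    return sum(o - m for o, m in zip(observed, model)) / len(observed)


# HYPOTHETICAL values (%), for the same set of years in both lists.
model_share = [1.0, 2.0, 3.0]
observed_share = [2.1, 3.3, 4.2]

offset = best_vertical_offset(observed_share, model_share)
shifted_model = [m + offset for m in model_share]
print("offset (percentage points):", offset)
print("shifted model:", shifted_model)
```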

So, answering Question 1 would be quite a reasonable way to tackle these data: they do support the idea that the decrease in Australian export percentage may well be simply a return to "normal" after a short-term aberration. The downwards trend can be seen, not as a crisis, but merely as a correction. These are two quite different interpretations.

Getting the question right is crucial. Data analysis often suffers from what is called confirmation bias, in which we simply try to confirm the assumed answer to our initial question. That is, we look for what the dog did in the night-time, instead of looking for what it did not do — and we often find something that the dog did, no matter how irrelevant it may be!