Big data ignores lessons learned
The article <a href="http://www.ft.com/cms/s/2/21a6e7d8-b479-11e3-a09a-00144feabdc0.html" source="Financial Times">Big data: are we making a big mistake?</a> bursts the bubble of the wide-eyed, overconfident and underinformed techies who think that their giant piles of data will fix everything. It contains many interesting examples, some of which are touched on in the conclusion, cited below: <bq>Uncanny accuracy is easy to overrate if we simply ignore false positives, [...] The claim that causation has been “knocked off its pedestal” is fine if we are making predictions in a stable environment but not if the world is changing (as with Flu Trends) or if we ourselves hope to change it. The promise that “N = All”, and therefore that sampling bias does not matter, is simply not true in most cases that count. As for the idea that “with enough data, the numbers speak for themselves” – that seems hopelessly naive in data sets where spurious patterns vastly outnumber genuine discoveries. “Big data” has arrived, but big insights have not. The challenge now is to solve new problems and gain new answers – without making the same old statistical mistakes on a grander scale than ever.</bq> The upshot is that you may think your data is "big", but it is most likely not big enough. Although sampling bias is diminished compared to smaller datasets, the claims made on the basis of big data are correspondingly bigger, which cancels out the increased confidence. Selectively filtering results to focus on the expected outcome is a pitfall not only of bad statistics, but of bad scientists and engineers as well. The article is a good read for those who can get behind the FT paywall or who haven't used up all of their "free views" for the year.