Anyone who reads about data science
will be aware of the term “big data”. It’s one of those terms that seems to
have multiple definitions, so I thought it was about time to put my own
thoughts down on paper and work out what the term means to me. I also wanted
to know whether traditional statistical ideas remain relevant in a world of
“big data”.
Business Intelligence Sales Pitch
Let’s get this use of the term out of the way first. Big data is used as a buzzword by Business Intelligence vendors to indicate a solution to almost all business problems: if a business could analyse all the data it owns, it could find both the problems it faces and their solutions.
The best response to this use of the term is to ask why bother with big data if you can’t get basic customer service issues right.
Robert Plant uses the example of scheduling service calls:
Ever waited hours, in vain, for a repair service to
arrive at your home? Of course you have. We all have. Chances are you've also
shifted your allegiance away from a company that made you wait like that.
So why do companies spend millions on big data and
big-data-based market research while continuing to ignore the simple things
that make customers happy? Why do they buy huge proprietary databases yet fail
to use plain old scheduling software to tell you precisely when a technician is
going to arrive?
Size
The next obvious meaning of big data is the size of the data sets involved: quite simply, the data sets are big.
The size of big data sets is a consequence of technology. If we go back to the early 1900s, when Fisher, Gosset and others were developing sampling theory, data was collected manually. This necessarily meant that sample sizes were small, at least by today’s standards. For example, when Gosset was developing small-sample techniques (such as Student’s t-test), biometricians were working with samples of a few hundred observations and saw no reason to develop small-sample methods.
Today, computer technology allows the production and storage of data sets that can be many terabytes in size: click data from a website, for example. An ecologist can collect numerous, virtually continuous measurements using digital instruments: temperature, wind speed and direction, humidity, sunlight and so on.
However, it is worth remembering that technology has developed over the centuries, and size is relative. Annie Pettit sums it up well when she says that “there is no such thing as big data, just bigger data sets than you are used to working with”. She continues by observing that to work with those bigger data sets, “you just need the right tools”.
Tools and Technology
The
increasing size of data sets leads us on to tools and technology. When the term
big data is used, what is often being referred to is the technology used to
process data sets that are large in comparison to the computer resources
available. In this context, big data refers to computer software that can
handle data sets that are, for example, larger than the RAM available.
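To make the “larger than RAM” point concrete, here is a minimal sketch, in Python with pandas, of summarising a click-data file in chunks rather than loading it all into memory at once (the file name and column name are hypothetical):

```python
# Minimal sketch: summarise a click-data file too large to load into RAM
# by streaming it in chunks. "clicks.csv" and "user_id" are hypothetical.
import pandas as pd

total_clicks = 0
unique_users = set()

for chunk in pd.read_csv("clicks.csv", chunksize=1_000_000):
    total_clicks += len(chunk)
    unique_users.update(chunk["user_id"].unique())

print(f"{total_clicks:,} click events from {len(unique_users):,} users")
```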
A related
use of big data refers to software that can handle data of differing structures
and sources.
For example,
think of the word “Hadoop” – here’s what the Cloudera website says:
Apache Hadoop was born out of necessity as data from the web exploded, and
grew far beyond the ability of traditional systems to handle it. Hadoop was
initially inspired by papers published by Google outlining its approach to
handling an avalanche of data, and has since become the de facto standard for
storing, processing and analyzing hundreds of terabytes, and even petabytes of
data.
Apache Hadoop is
100% open source, and pioneered a fundamentally new way of storing and
processing data. Instead of relying on expensive, proprietary hardware and
different systems to store and process data, Hadoop enables distributed
parallel processing of huge amounts of data across inexpensive,
industry-standard servers that both store and process the data, and can scale
without limits. With Hadoop, no data is too big. And in today’s hyper-connected
world where more and more data is being created every day, Hadoop’s
breakthrough advantages mean that businesses and organizations can now find
value in data that was recently considered useless.
[http://www.cloudera.com/content/cloudera/en/why-cloudera/hadoop-and-big-data.html]
Here, whilst technologies such as Hadoop enable different analytical
approaches, they are themselves analytic-agnostic. This use of the term big data
refers to the infrastructure for working with very large data sets in a timely
and efficient manner.
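As a toy illustration of the programming model Hadoop popularised (this is not Hadoop code, just a word count written as a map step and a reduce step in a single Python process; Hadoop’s contribution is to run the same two steps in parallel over data spread across a cluster):

```python
# Toy illustration of the map/reduce pattern that Hadoop distributes across
# many servers; here everything runs in a single process.
from collections import Counter
from functools import reduce

documents = [
    "big data is just bigger data",
    "bigger data needs different tools",
]

# Map step: turn each document into partial (word -> count) tallies.
mapped = [Counter(doc.split()) for doc in documents]

# Reduce step: merge the partial tallies into one overall count.
word_counts = reduce(lambda left, right: left + right, mapped, Counter())

print(word_counts.most_common(3))
```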
Data Structure
The ability to work with large volumes of data also enables the
researcher to work with disparate forms of data, or at the very least with
disparate databases or sources. An example would be using geographical location
data to link existing data sets to other data sets.
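The sketch below, in Python with pandas, shows one data source being joined to another on a shared geographic key; the tables, column names and values are entirely invented for illustration:

```python
# Minimal sketch: link two disparate data sources on a shared geographic key.
# The tables, column names and values are invented for illustration.
import pandas as pd

customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "postcode": ["2000", "3000", "2000"],
})

area_income = pd.DataFrame({
    "postcode": ["2000", "3000"],
    "median_income": [62000, 58000],
})

# A left join attaches area-level data to each customer record.
linked = customers.merge(area_income, on="postcode", how="left")
print(linked)
```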
Sampling versus Census
Implicit in many uses of the term big data is the assumption that
there is no need to sample: we can analyse all the data “because we can”.
What are the implications of analysing “all data” rather than a sample?
With big data, we are mostly still sampling; it’s just that the sample sizes
are enormous. We can store and analyse the click data from a website, but it is
still a sample over time (past, present and future).
First, let’s look at why we analyse a sample rather than a census. The
main reason for sampling in a social sciences environment is that taking a
census is either too expensive or not practical.
Once we decide to sample, however, the next decision is what sample
size we need: how many cases should we analyse?
In classical statistics, one issue we need to be aware of as sample size
increases is that weak effects can become statistically significant. This
suggests a possible issue with big data: with huge sample sizes, very weak
relationships or patterns will be flagged as significant, even though they are
not very insightful.
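A quick, purely illustrative simulation (the effect size and sample sizes are arbitrary) shows how a practically negligible difference between two groups becomes “statistically significant” once the sample is large enough:

```python
# Illustrative simulation: a negligible difference in means (0.01 standard
# deviations) becomes statistically significant once n is large enough.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

for n in (100, 10_000, 1_000_000):
    group_a = rng.normal(loc=0.00, scale=1.0, size=n)
    group_b = rng.normal(loc=0.01, scale=1.0, size=n)
    t_stat, p_value = stats.ttest_ind(group_a, group_b)
    print(f"n = {n:>9,}   p-value = {p_value:.4f}")
```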
Signal versus Noise
Big data often includes a lot of noise. This is obvious with data
sources such as social media; anyone who has worked with a Twitter feed will
understand this.
Big data also involves repetition. A continuous digital record of, say,
temperature contains substantial redundant information: the temperature this
second is generally not very different from the temperature 15 seconds later.
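As a small, purely illustrative example (the simulated temperature series below is hypothetical: a slow daily cycle plus a little measurement noise), consecutive readings are so highly correlated that each new one adds very little information:

```python
# Illustrative sketch: consecutive readings in a high-frequency temperature
# record are highly correlated, so each new reading adds little information.
# The simulated series (slow daily cycle plus small noise) is hypothetical.
import numpy as np

rng = np.random.default_rng(0)

seconds = np.arange(86_400)  # one day of per-second readings
temperature = (20 + 5 * np.sin(2 * np.pi * seconds / 86_400)
               + rng.normal(0, 0.05, seconds.size))

lag = 15  # compare readings 15 seconds apart
r = np.corrcoef(temperature[:-lag], temperature[lag:])[0, 1]
print(f"correlation between readings 15 seconds apart: {r:.4f}")
```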
Ability To Predict
The ability to produce better quality predictions is often cited as an
outcome or benefit of big data. There is a well-known example of Target in the
US being able to predict whether its female customers were pregnant.
[http://www.nytimes.com/2012/02/19/magazine/shopping-habits.htm]
This illustrates what I think is one aspect where big data has a
different focus to classical statistics. Classical statistics has always had
the goal of prediction: its aim is to produce insights that can be
generalised to the population. Big data instead seems to have the aim of
classifying individual transactions or individual people so that a customised
transaction can take place. With classical statistics in a social science
setting, the aim is to produce an actionable insight that can be used to inform
policy development.
Case versus Variable
Another differentiating feature of big data is that it increases not only
the number of cases but also the number of variables.
Douglas Merrill makes this point:
This
means that to get more accurate results, you'll need to expand your data set.
There are a couple of ways to scale up the amount of data you are using to make
better predictions:
First,
you can add more cases.
But
the more powerful way is to add signals.
Adding signals (columns) allows you to do two things: First, it can reveal new
relationships, enabling new inferences — with a new variable, you may see a
correlation in the data you never realized before. Second, adding signals makes
your inferences less subject to bias in any number of individual signals. You
add cases, keeping the same signals, to make your understanding of those variables
better. In contrast, you add signals to make it possible to overcome errors in
other signals you rely on.
Although much of the discussion of big data has focused on adding cases (in
fact, the common perception of “big data” is being able to track lots of
transactions), adding signals is what is most likely to transform a business.
The more signals you have, the more new knowledge you can create. For example,
Google uses hundreds of signals to rank web pages.
In the early 1970s, Fair Isaac rose to global prominence as a provider of the standardized FICO score that
supplanted much of the credit officers' role. The standardized score massively
increased credit availability and thus lowered the cost of borrowing. However,
FICO scores have their limits. The scores perform especially poorly for those
without much information in their credit files, or those with relatively bad
credit. It's not FICO's fault — it's the math they use. With fairly few signals
in their models, the FICO score doesn't have the ability to distinguish between
credit risk in a generally high risk group.
The way to address
this is to add more signals. For example, thousands of signals can be used to
analyze an individual's credit risk. This can be everything from excess income
available, to the time an applicant spent on the application, to whether an
applicant's social security number shows up as associated with a dead person.
The more signals used, the more accurate a financial picture a lender can get,
particularly for thin file applicants who need the access to credit and likely
don't have the traditional data points a lender analyzes.
However, whichever way we increase
the size of the data set, by adding cases or by adding signals, we are not
qualitatively changing the way we analyse the data.
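Still, the cases-versus-signals distinction is worth seeing in miniature. The following sketch uses entirely synthetic data and scikit-learn’s logistic regression (the data-generating process and numbers are invented): the outcome depends on ten underlying signals, so a model that only sees two of them stays mediocre however many cases it is given, while a much smaller sample with all ten signals does better.

```python
# Synthetic illustration of adding cases versus adding signals.
# The outcome depends on ten signals; a model that sees only two of them
# cannot improve much, however many rows it is given.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

def make_data(n_rows, n_signals=10):
    X = rng.normal(size=(n_rows, n_signals))
    y = (X.sum(axis=1) + rng.normal(scale=1.0, size=n_rows) > 0).astype(int)
    return X, y

def accuracy(n_rows, n_signals_used):
    X, y = make_data(n_rows)
    X = X[:, :n_signals_used]  # the model only sees the first few columns
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
    model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    return model.score(X_te, y_te)

print("100,000 cases, 2 signals: ", accuracy(100_000, 2))
print("  5,000 cases, 10 signals:", accuracy(5_000, 10))
```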
Summary
So, in summary, what is “big data”? How is it different from classical statistics?
Overall, my view is that “big data”
is not a new way of looking at the world (as might be the case when comparing
Frequentist and Bayesian statistics), nor does it offer new methodologies.
Instead, I prefer to see big data as an incremental step, a response to
changing and evolving technology.
However, there are a number of
issues to consider:
- Sampling versus census
One view is that big data avoids the need to sample. In my view, however, this
is not the case. Big data sets are still samples, albeit very large ones. For
example, click data from a website is a sample up to a point in time, while the
target population is all click data: past, present and future.
The difficulty with very large sample sizes is that very weak effects will
almost always be statistically significant. This increases the chance of
researchers detecting patterns and relationships that are not meaningful and
that do not add to the insights gained from the data.
The issue
here is that the information in a sample does not increase at the same rate as
the amount of data in a sample.
An advantage of large sample sizes is that rare cases or relationships can be
detected more easily. This may well be what happened in the Target example
above, where data mining enabled the company to identify pregnant women and,
indeed, roughly what stage of pregnancy they were in. A sample of even several
thousand Target shoppers (and their transactions) may not have included enough
pregnant women to reveal meaningful patterns in their purchasing.
- Big data has led to increased use and development of data mining and machine learning techniques. This has had the advantage of expanding the number of tools available to statisticians and data scientists.
The black-box use of machine learning algorithms can run into the classical problem of multiple comparisons: the more comparisons that are made, the more likely it is that a statistically significant effect will appear by chance alone (a short simulation of this is sketched at the end of this post).
- One difference big data is making is at the transactional level. With small data (in a social science context), the findings of a survey would be generalised to the target population and used to inform, for example, policy development. With big data, the focus shifts to classifying individual transactions or individual people so that a customised transaction can take place.
- Finally, it’s clear that the technical aspects of big data are beneficial in allowing larger data sets to be processed and disparate data sources to be combined. But this in itself is not a theoretical or methodological breakthrough. We’ve seen that sort of progress before: at one time researchers had to do their calculations by hand, and then mechanical calculators were developed. What we are seeing with big data is a continuation of evolving technology.
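Finally, here is the short, purely illustrative simulation of the multiple comparisons problem promised above: the outcome and all two hundred predictors are pure noise, yet testing every predictor still throws up “significant” relationships by chance alone.

```python
# Illustrative simulation of the multiple comparisons problem: none of the
# predictors is related to the outcome, yet some test as "significant".
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

n_obs, n_vars = 1_000, 200
outcome = rng.normal(size=n_obs)               # pure noise
predictors = rng.normal(size=(n_obs, n_vars))  # pure noise, unrelated to outcome

false_positives = sum(
    stats.pearsonr(predictors[:, j], outcome)[1] < 0.05
    for j in range(n_vars)
)
print(f"{false_positives} of {n_vars} unrelated variables test significant at p < 0.05")
```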