
Wednesday, December 12, 2012

What is Big Data?



Anyone who reads about data science will be aware of the term “big data”. It’s one of those terms that seems to have multiple definitions, so I thought it was about time to put my own thoughts down on paper and work out a broad understanding of what the term means for myself. I also wanted to know whether traditional statistical ideas remain relevant with “big data”.


Business Intelligence Sales Pitch

Let’s get this use of the term out of the way first. Big data is used as a buzzword by Business Intelligence vendors to suggest a solution to almost all business problems: if a business could analyse all the data it owns, it could find both the problems it faces and their solutions.

The best response to this use of the term big data is to ask why bother about big data if you can’t get basic customer service issues right.

Robert Plant uses the example of scheduling service calls:

Ever waited hours, in vain, for a repair service to arrive at your home? Of course you have. We all have. Chances are you've also shifted your allegiance away from a company that made you wait like that.

So why do companies spend millions on big data and big-data-based market research while continuing to ignore the simple things that make customers happy? Why do they buy huge proprietary databases yet fail to use plain old scheduling software to tell you precisely when a technician is going to arrive?

Size

The next obvious meaning of big data is simply the size of the data sets involved: these are big data sets.

The size of big data sets is a consequence of technology. If we go back to the early 1900s, when Fisher, Gosset and others were developing sampling theory, data was collected manually. This necessarily meant that sample sizes were small (at least by today’s standards). For example, when Gosset was developing small sample techniques (such as Student’s t-test), biometricians were using samples comprising hundreds of observations and saw no reason to develop small sample techniques.

Today, computer technology is allowing the production and storage of data sets that can be many terabytes in size, such as click data from a website. An ecologist can collect numerous virtually continuous measurements using digital instruments: temperature, wind speed and direction, humidity, sunlight and so on.

However, it is worth considering that technology has developed over the centuries, and size is relative. Annie Pettit sums it up well when she says that “there is no such thing as big data, just bigger data sets than you are used to working with”. Annie continues by observing that to work with bigger data sets than you are used to, “you just need the right tools”.


Tools and Technology

The increasing size of data sets leads us on to tools and technology. When the term big data is used, what is often being referred to is the technology used to process data sets that are large in comparison to the computer resources available. In this context, big data refers to computer software that can handle data sets that are, for example, larger than the RAM available.
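
To make this concrete, here is a minimal sketch of the chunked-processing idea in Python. The file name clicks.csv and its url column are hypothetical, and pandas is just one of many tools that can read a file in pieces rather than loading it all into RAM.

```python
# A sketch of processing a file too large to fit in RAM by streaming it
# in chunks. "clicks.csv" and its "url" column are hypothetical.
import pandas as pd
from collections import Counter

click_counts = Counter()
for chunk in pd.read_csv("clicks.csv", chunksize=1_000_000):
    # Only one chunk (one million rows here) is held in memory at a time.
    click_counts.update(chunk["url"].value_counts().to_dict())

print(click_counts.most_common(10))  # the ten most-clicked URLs
```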

A related use of big data refers to software that can handle data of differing structures and sources. 

For example, think of the word “Hadoop” – here’s what the Cloudera website says:

Apache Hadoop was born out of necessity as data from the web exploded, and grew far beyond the ability of traditional systems to handle it. Hadoop was initially inspired by papers published by Google outlining its approach to handling an avalanche of data, and has since become the de facto standard for storing, processing and analyzing hundreds of terabytes, and even petabytes of data.
Apache Hadoop is 100% open source, and pioneered a fundamentally new way of storing and processing data. Instead of relying on expensive, proprietary hardware and different systems to store and process data, Hadoop enables distributed parallel processing of huge amounts of data across inexpensive, industry-standard servers that both store and process the data, and can scale without limits. With Hadoop, no data is too big. And in today’s hyper-connected world where more and more data is being created every day, Hadoop’s breakthrough advantages mean that businesses and organizations can now find value in data that was recently considered useless.
[http://www.cloudera.com/content/cloudera/en/why-cloudera/hadoop-and-big-data.html ]

Whilst technologies such as Hadoop enable different analytical approaches, they are analytic-agnostic. This approach to big data provides the infrastructure to work with large data sets in a timely and efficient manner.
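
The programming model Hadoop popularised is MapReduce. The toy sketch below imitates the map, shuffle and reduce steps in plain Python within a single process; it illustrates the pattern only and is not the Hadoop API itself.

```python
# Toy, single-process illustration of the map / shuffle / reduce pattern
# that Hadoop distributes across many machines: a word count.
from collections import defaultdict

documents = ["big data is just data", "data about data"]

# Map: emit (key, value) pairs from each input record.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle: group the emitted values by key.
grouped = defaultdict(list)
for word, count in mapped:
    grouped[word].append(count)

# Reduce: combine the values for each key.
word_counts = {word: sum(counts) for word, counts in grouped.items()}
print(word_counts)
```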

Data Structure

The ability to work with large volumes of data also enables the researcher to work with disparate forms of data, or at the least with disparate databases and sources. An example would be the ability to access geographical location data and then use it to link existing data sets to other data sets.
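
As a rough illustration, the sketch below links a hypothetical sales table to a hypothetical weather table through a shared postcode column; the data and column names are invented for the example.

```python
# Linking two hypothetical data sources through a shared geographic key.
import pandas as pd

sales = pd.DataFrame({
    "store_id": [1, 2, 3],
    "postcode": ["3000", "3141", "3181"],
    "weekly_sales": [12000, 9500, 7800],
})
weather = pd.DataFrame({
    "postcode": ["3000", "3141", "3181"],
    "mean_temp_c": [18.2, 17.9, 18.0],
})

# The postcode is the link that lets one data set enrich the other.
linked = sales.merge(weather, on="postcode", how="left")
print(linked)
```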

Sampling versus Census

Implicit in many uses of the term big data is the assumption that there is no need to sample: we can analyse all the data “because we can”.

What are the implications of analysing “all data” rather than a sample? With big data, we are mostly still sampling; it’s just that the sample sizes are enormous. We can store and analyse click data from a website, but it still remains a sample over time (past, present and future).

First, let’s look at why we analyse a sample rather than a census. The main reason for sampling in a social science setting is that taking a census is either too expensive or not practical.

Once we decide to sample, however, the next decision is what sample size we need: how many cases should we analyse?

In classical statistics, one issue we need to be aware of as sample size increases is that weak effects can become statistically significant. This suggests a possible issue with big data: with huge sample sizes, very weak relationships or patterns can be identified, and these effects may not be very insightful.
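
A quick simulation illustrates the point. The effect below is fixed at one hundredth of a standard deviation, yet the p-value collapses towards zero once the sample is large enough; the numbers are simulated, not real data.

```python
# Simulated example: a negligible difference in means (1% of a standard
# deviation) becomes statistically significant once n is huge.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
effect = 0.01  # a very weak effect

for n in (100, 10_000, 1_000_000):
    a = rng.normal(0.0, 1.0, n)
    b = rng.normal(effect, 1.0, n)
    result = stats.ttest_ind(a, b)
    print(f"n = {n:>9,}   p-value = {result.pvalue:.4g}")
```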

Signal versus Noise

Big data often includes a lot of noise. This is obvious with data sources such as social media; anyone who has worked with a Twitter feed will understand this.

Big data also involves repetition. A continuous digital record of, say, temperature contains substantial redundant information: the temperature this second is generally not greatly different from the temperature 15 seconds later.
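
The redundancy is easy to demonstrate on a simulated temperature record: consecutive per-second readings are almost perfectly correlated, and averaging them down to one reading per minute discards very little. The daily cycle and noise level below are invented for the illustration.

```python
# A simulated day of per-second temperature readings: a slow daily cycle
# plus a little sensor noise. Consecutive readings are nearly identical.
import numpy as np

rng = np.random.default_rng(1)
seconds = np.arange(24 * 60 * 60)
temp = 15 + 5 * np.sin(2 * np.pi * seconds / 86_400) + rng.normal(0, 0.05, seconds.size)

lag1 = np.corrcoef(temp[:-1], temp[1:])[0, 1]    # correlation between successive seconds
per_minute = temp.reshape(-1, 60).mean(axis=1)   # 86,400 readings -> 1,440 minute averages
print(f"lag-1 correlation: {lag1:.4f}")
print(f"per-minute series keeps {per_minute.size} of {temp.size} values")
```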

Ability To Predict

The ability to produce better quality predictions is often cited as an outcome or benefit of big data. There is a well known example of Target in the US being able to predict whether its female customers were pregnant.

[http://www.nytimes.com/2012/02/19/magazine/shopping-habits.htm ]

This illustrates what I think is one aspect where big data has a different focus to classical statistics. Classical statistics has always had prediction as a goal: its aim is to produce insights that can be generalised to the population. Big data instead seems to have the aim of classifying individual transactions or individual people so that a customised transaction can take place. With classical statistics in a social science setting, the aim is to produce an actionable insight that can be used to inform policy development.

Case versus Variable

Another differentiating feature of big data is that it increases not only the number of cases, but also the number of variables.

Douglas Merrill makes this point:

This means that to get more accurate results, you'll need to expand your data set. There are a couple of ways to scale up the amount of data you are using to make better predictions:

First, you can add more cases.

But the more powerful way is to add signals. Adding signals (columns) allows you to do two things: First, it can reveal new relationships, enabling new inferences — with a new variable, you may see a correlation in the data you never realized before. Second, adding signals makes your inferences less subject to bias in any number of individual signals. You add cases, keeping the same signals, to make your understanding of those variables better. In contrast, you add signals to make it possible to overcome errors in other signals you rely on.

Although much of the discussion of big data has focused on adding cases — in fact, the common perception of "big data" is being able to track lots of transactions — adding signals is most likely to transform a business. The more signals you have, the more new knowledge you can create. For example, Google uses hundreds of signals to rank web pages.

In the early 1970's, Fair Isaac rose to global prominence as a provider of the standardized FICO score that supplanted much of the credit officers' role. The standardized score massively increased credit availability and thus lowered the cost of borrowing. However, FICO scores have their limits. The scores perform especially poorly for those without much information in their credit files, or those with relatively bad credit. It's not FICO's fault — it's the math they use. With fairly few signals in their models, the FICO score doesn't have the ability to distinguish between credit risk in a generally high risk group.

The way to address this is to add more signals. For example, thousands of signals can be used to analyze an individual's credit risk. This can be everything from excess income available, to the time an applicant spent on the application, to whether an applicant's social security number shows up as associated with a dead person. The more signals used, the more accurate a financial picture a lender can get, particularly for thin file applicants who need the access to credit and likely don't have the traditional data points a lender analyzes. 


However, whichever way we increase the size of the data set, whether by adding cases or adding signals, we are not qualitatively changing the way we analyse the data.
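
A small sketch on synthetic data shows the flavour of Merrill’s argument: adding an informative signal (column) can lift predictive accuracy more than simply piling on cases. The model, data and effect sizes below are invented purely for illustration.

```python
# Synthetic comparison: a model with one weak signal versus the same model
# after an informative second signal is added. All numbers are invented.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
n = 5_000
x1 = rng.normal(size=n)   # an existing, weak signal
x2 = rng.normal(size=n)   # a new, stronger signal
y = (0.3 * x1 + 1.5 * x2 + rng.normal(size=n) > 0).astype(int)

def holdout_accuracy(features):
    X_train, X_test, y_train, y_test = train_test_split(features, y, random_state=0)
    model = LogisticRegression().fit(X_train, y_train)
    return model.score(X_test, y_test)

print("one signal: ", holdout_accuracy(x1.reshape(-1, 1)))
print("two signals:", holdout_accuracy(np.column_stack([x1, x2])))
```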

Summary

So, in summary, what is “big data”? How is it different from classical statistics?

Overall, my view is that “big data” is not a new way of looking at the world (as might be the case when comparing frequentist and Bayesian statistics), nor does it offer new methodologies. Instead, I prefer to see big data as an incremental step in one direction, a response to changing and evolving technology.

However, there are a number of issues to consider:


  •  Sampling versus census

One view is that big data avoids the need to sample. However, in my view, this is not the case. Big data sets are still samples, although very large ones. For example, click data from a website is a sample up to a point in time, and the target population is all click data: past, present and future.

The difficulty with very large sample sizes is that very weak effects will almost always be statistically significant. This increases the chance of researchers detecting patterns and relationships that are not meaningful, and which don’t expand the insights gained from the data. The underlying issue is that the information in a sample does not increase at the same rate as the amount of data in the sample (see the sketch at the end of this point).

An advantage of large sample sizes is that rare cases or relationships can be more easily detected. This may well be the case with the Target data mining example in the US, where data mining enabled the retailer to identify pregnant women and, indeed, roughly what stage of pregnancy they were in. A sample of even several thousand Target shoppers (and their transactions) may not have included enough pregnant women to reveal meaningful patterns in their purchasing.
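
A short sketch of the information-versus-data point: the standard error of an estimate shrinks only with the square root of the sample size, so one hundred times more data buys roughly ten times more precision, not one hundred times. The standard deviation below is an assumed placeholder.

```python
# Precision grows with the square root of the sample size, not the sample
# size itself: 100x more data gives only about 10x more precision.
import math

sigma = 1.0  # assumed standard deviation of the quantity being measured
for n in (10_000, 1_000_000):
    print(f"n = {n:>9,}   standard error = {sigma / math.sqrt(n):.5f}")
```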


  • Big data has led to increased use and development of data mining and machine learning techniques. This has had the advantage of expanding the number of tools available to statisticians and data scientists.


The disadvantage with many of these machine learning tools is that users have not gained a good understanding of how they work, and we end up with “black box” models which detect correlations that may or may not be meaningful. The issue here is that this approach allows the practitioner to avoid developing a theory with a hypothesis about a causal relationship, which the researcher can then test.

The black box use of machine learning algorithms can also lead to the classical problem of multiple comparisons: the more comparisons that are made, the more likely a statistically significant effect will turn up by chance alone.
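
A quick back-of-the-envelope calculation shows how fast this escalates at the conventional 5% significance level, assuming independent comparisons.

```python
# Chance of at least one false positive across m independent tests at the
# 5% significance level: 1 - 0.95 ** m.
for m in (1, 10, 100, 1_000):
    print(f"{m:>5} comparisons: P(at least one false positive) = {1 - 0.95 ** m:.3f}")
```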





  • One difference big data is making is at the transactional level. With small data (in a social science context), the findings of a survey would be generalised to the target population and used to inform, for example, policy development.

Big data has enabled organisations to skip broad policy development and instead develop customised responses for individuals. This was the “breakthrough” of the Obama data strategy. But this is not a methodological or theoretical breakthrough; it’s just an adaptation to improving technology.


  • Finally, it’s clear that the technical aspects of big data are beneficial in allowing larger data sets to be processed and disparate data sources to be combined. But this in itself is not a theoretical or methodological breakthrough. We’ve seen that sort of progress before: at one time researchers had to perform their calculations manually, and then mechanical calculators were developed. What we are seeing with big data is a continuation of evolving technology.










