Thursday, December 1, 2016

(small) samples versus alternative (big) data sources



Those of you who have already attended a meetup of the Brussels Data Science Community know that, besides excellent talks, those meetups are fun because of the traditional drinks afterwards. So after the last meetup we were on our way to a bar on the campus of the University of Brussels and I had this chat with @KrisPeeters from Dataminded. Now if you are expecting wild stories about beer and loose women (or loose men for that matter), I'm afraid I'll have to disappoint you. Instead we discussed ... sampling. Kris was questioning whether the typical sample sizes that market research companies work with (say in the hundreds or a few thousand at most) still matter these days, given that we have other sources that give us much larger quantities of data. I told him everything depends on the (business) question the client has.

To start with, we can look at history to answer this question. In 1936 the Literary Digest poll had a sample size in the millions. But that sample wasn’t representative: it was drawn largely from the magazine's own readers, supplemented by telephone directories and car registration lists. They predicted that Republican Alf Landon would beat Democrat Franklin D. Roosevelt. Roosevelt won in one of the largest landslides ever.

A more recent example is a study that claimed that the Dutch are the best non-native English speakers. This was debunked in http://peilingpraktijken.nl/weblog/2016/11/beheersen-nederlanders-de-engelse-taal-echt-het-best/ (Dutch). Even though the sample size was 950,000 (in 72 countries), statistician Jelke Bethlehem, a Dutch national himself, concluded that the sample was not representative and did not support the conclusions that the researchers had drawn.

Of course samples can be, and are, biased as well. But there is a difference: samples are constructed specifically with a research question in mind, and are often designed to be unbiased. Big data and other sources of data are often created for reasons other than research questions. As a consequence big data might have some disadvantages that are not offset by its bigger size.

Take this hypothetical example. Say you have a population consisting of $N=10,000,000$ individuals and you want to estimate the proportion of people that watched a certain TV show. Say that you have an unbiased sample of size $n=1,000$ and that you find that 100 of them watched the television show. So, with 95% confidence, you would estimate $p=0.10$ with a margin of error of $z_{\alpha / 2} \times \sqrt{{pq\over n}}= 1.96 \times \sqrt{{0.1 \times 0.9 \over 1,000}}= 0.01859$, which amounts to a confidence interval in absolute figures from 814,058 to 1,185,942. Suppose your friend has an alternative data source covering $N'=6,000,000$ individuals, so for those you know exactly whether they watched or not, with no sampling error at all and hence no confidence interval (unless you are a Bayesian, but that's another story). Now you know the exact number of people who watched among the 6,000,000. For simplicity's sake assume this is 600,000. To be fair, you know nothing about the remaining $N''=4,000,000$, but you could assume that, since your subpopulation is so big, they will be close to what you already have. This effectively means that you consider the alternative data source as a very large sample of size $n'=6,000,000$. In this case the sample fraction is ${n' \over N}={6,000,000\over 10,000,000}=0.6$, which is pretty high, so you get an additional bonus from the finite population correction, yielding a confidence interval between $p_-=p-z_{\alpha / 2} \times \sqrt{{pq\over n'}} \times \sqrt{{N-n'\over N-1}}=0.09985$ and $p_+=p+z_{\alpha / 2} \times \sqrt{{pq\over n'}} \times \sqrt{{N-n'\over N-1}}=0.10015$. In terms of absolute figures we end up with a confidence interval from 998,482 to 1,001,518, which is considerably more precise than the 814,058 to 1,185,942 we had in the case of $n=1,000$. Of course, the crucial assumption is that we have considered the $n'=6,000,000$ to be representative of the whole population, which will seldom be the case.
Indeed, it is very difficult to set up an unbiased sample, so it is not realistic to hope that an unbiased sample would pop up accidentally. As argued above, big data sources are often created for reasons other than research questions and hence we cannot simply assume they are unbiased.
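
The arithmetic of this example can be checked with a few lines of Python. This is only a minimal sketch: the helper `margin_of_error` and the variable names are my own for illustration, and the sketch applies the finite population correction only to the alternative data source, as in the text above.

```python
import math

Z_95 = 1.96  # normal quantile for a 95% confidence level, as used in the text


def margin_of_error(p, n, N=None, z=Z_95):
    """Margin of error for a proportion.

    When a population size N is given, the finite population
    correction sqrt((N - n) / (N - 1)) is applied.
    """
    se = math.sqrt(p * (1 - p) / n)
    if N is not None:
        se *= math.sqrt((N - n) / (N - 1))
    return z * se


N = 10_000_000
p = 0.10

# Unbiased survey sample of n = 1,000 (no correction, as in the text)
me_sample = margin_of_error(p, 1_000)
# Alternative data source treated as a sample of n' = 6,000,000
me_big = margin_of_error(p, 6_000_000, N=N)

for label, me in [("sample", me_sample), ("alternative source", me_big)]:
    low, high = N * (p - me), N * (p + me)
    print(f"{label}: +/-{me:.5f}, i.e. {low:,.0f} to {high:,.0f} viewers")
```

Note that the narrow interval for the alternative source only measures precision. It says nothing about whether the 6,000,000 are representative.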

The question now becomes: at what point is the bias offset by the increased precision? In this case bias would mean that individuals in our alternative data source are more likely or less likely to watch the television show of interest than is the case in the overall population. Let's call the proportion of people in the alternative data source who watched the television show $p'$. Likewise we will call the proportion of the remaining individuals in the population (those not in the alternative data source) who watched the television show $p''$. We can then define the level of bias in our alternative data source as $p'-p$. Since the number of remaining individuals from the population that are not in the alternative data source is $N''=N-N'$, we know that
$$Np=N'p'+N''p'', $$
which is a rather convoluted way of saying that if your alternative data source has a bias, the remaining part will be biased as well (but in the other direction).
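To make the direction explicit, one can rearrange this identity (a small derivation that uses only the symbols defined above, together with $N=N'+N''$):
$$p''-p=-{N'\over N''}\,(p'-p).$$
With $N'=6,000,000$ and $N''=4,000,000$, a bias of $-0.05$ in the alternative data source thus implies that the complement deviates from $p$ by $+0.075$.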
Let's consider different values of $p'$ going from 0.05 to 0.15, which, with $N'=6,000,000$ and $N''=4,000,000$, corresponds to $p''$ going from 0.175 to 0.025 and to levels of bias going from -0.05 to 0.05. We can then calculate confidence bounds like we did above. In figure 1 the confidence bounds for the alternative data source (in black) are hardly noticeable. We've also plotted the confidence bounds for the sample case of $n=1,000$, assuming no bias (in blue). The confidence interval is obviously much larger. But we also see that as soon as the absolute value of the bias in the alternative data source is larger than about 0.02 (roughly the margin of error of 0.01859 of the sample), the unbiased sample is actually better. (Note that I'm aware that I have loosely interpreted the notions of sample, confidence interval and bias, but I'm just trying to make the point that more is not always better.)
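
The comparison can also be sketched numerically. The snippet below is an illustration, not the code behind the figure: the function names are hypothetical, and the rule "the sample is better once the absolute bias exceeds the sample's margin of error" is my own simplification of the comparison described above.

```python
import math

N, N_alt = 10_000_000, 6_000_000   # population and alternative data source
N_rest = N - N_alt                  # individuals not covered by the source
p = 0.10                            # true population proportion
z = 1.96

# Margin of error of the unbiased sample of n = 1,000
me_sample = z * math.sqrt(p * (1 - p) / 1_000)


def complement_proportion(p_alt):
    """Solve N*p = N'*p' + N''*p'' for p''."""
    return (N * p - N_alt * p_alt) / N_rest


def big_data_interval(p_alt):
    """Interval around p' when the source is treated as a sample of size N'."""
    se = math.sqrt(p_alt * (1 - p_alt) / N_alt)
    me = z * se * math.sqrt((N - N_alt) / (N - 1))
    return p_alt - me, p_alt + me


for p_alt in [0.05, 0.08, 0.09, 0.10, 0.11, 0.12, 0.15]:
    low, high = big_data_interval(p_alt)
    covers_truth = low <= p <= high
    # Assumed decision rule: compare the bias with the sample's margin of error
    sample_is_better = abs(p_alt - p) > me_sample
    print(f"p'={p_alt:.2f}  p''={complement_proportion(p_alt):.3f}  "
          f"bias={p_alt - p:+.2f}  interval=({low:.5f}, {high:.5f})  "
          f"covers p: {covers_truth}  sample better: {sample_is_better}")
```

The interval around a biased $p'$ is so tight that it quickly stops covering the true $p$ at all, while the much wider interval of the unbiased sample still does.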


As said before, samples can be, and are, biased as well, but they are generally designed to be unbiased, while this is seldom the case for other (big) data sources. The crucial thing to realize here is that bias is (to a very large extent) not a function of (the sample) size. Admittedly, by virtue of the equation above, as the fraction of the population covered by the alternative data source gets close to 1, bias becomes less likely to occur, even if the source was not designed for unbiasedness. This is further illustrated in figure 2. For a few possible values of $p$ (0.10, 0.25, 0.50 and 0.75) we have calculated what bias the complement of the alternative data source should show as a function of the fraction that the alternative data source represents in the total population (i.e. the sample fraction $N'/N$) and the bias $p'-p$. The point here is that the range of possible bias is very wide. Only for sample fractions above 0.80 does the sheer relative size of the subpopulation start to limit the possible biases one can encounter, but even then biases can range from -0.1 to 0.1 in the best of cases. Notice that this is even wider than the range we looked at in figure 1.
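
How much room for bias a given sample fraction leaves can be worked out directly from the equation above. The sketch below is not the code behind the figure. It is a hypothetical helper that only uses the constraint that both $p'$ and $p''$ have to stay between 0 and 1.

```python
def feasible_bias_range(p, fraction):
    """Range of bias p' - p that keeps both p' and p'' inside [0, 1].

    fraction is N'/N, the share of the population covered by the source.
    From p = f*p' + (1 - f)*p'' and 0 <= p'' <= 1 it follows that
    (p - (1 - f)) / f <= p' <= p / f.
    """
    lowest_p_alt = max(0.0, (p - (1 - fraction)) / fraction)
    highest_p_alt = min(1.0, p / fraction)
    return lowest_p_alt - p, highest_p_alt - p


for p in [0.10, 0.25, 0.50, 0.75]:
    for fraction in [0.2, 0.4, 0.6, 0.8, 0.95]:
        low, high = feasible_bias_range(p, fraction)
        print(f"p={p:.2f}  N'/N={fraction:.2f}  "
              f"bias between {low:+.3f} and {high:+.3f}")
```

For $p=0.50$ and a sample fraction of 0.80, for instance, this leaves biases between -0.125 and +0.125 arithmetically possible, so size alone rules out very little.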


For most practical cases in market research the fraction covered by the alternative data source(s) can be high, but will seldom be as high as 0.80. In other words, for all practical purposes (in market research) we can safely say that the potential bias $p'-p$ of alternative data source(s) is not a function of size, but rather of design and execution. I believe it is fair to assume that well-designed samples combined with good execution will lead to biases that are generally lower than those of alternative data sources where unbiasedness is not something that is cared about.


Some concluding remarks.

I focused on bias, but with regard to precision the situation is reversed: alternative (big) data sources will generally be much larger than the usual survey sample sizes, leading to much smaller confidence intervals such as those in figure 1. The point of course remains that it does not help you much to have a very tight (i.e. precise) confidence interval if it is around a biased estimate. Of course, sampling error is just one part of the story. Indeed, measurement error is very often much more of an issue than sampling error.

Notice by the way that enriching the alternative data source with a sample of the part of the population that it does not cover does not work in practice because, in all likelihood, the cost of enriching is the same as the cost of covering the whole population. This has to do with the fact that, except for very high sample fractions, precision is not a function of population size $N$ (or in this case $N''$).
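
The last point can be illustrated with the standard sample size formula for a proportion including the finite population correction. This is a sketch under my own assumptions: the function name and the target margin of 0.02 are chosen purely for illustration.

```python
import math


def required_sample_size(p, margin, N, z=1.96):
    """Sample size needed for a given margin of error in a population of
    size N, taking the finite population correction into account."""
    n0 = z**2 * p * (1 - p) / margin**2      # size for an infinite population
    return math.ceil(n0 / (1 + (n0 - 1) / N))


p, margin = 0.10, 0.02
for N in [10_000_000, 4_000_000, 100_000, 5_000]:
    print(f"population {N:>10,}: n = {required_sample_size(p, margin, N):,}")
```

Covering only the remaining $N''=4,000,000$ requires the same sample size as covering all $N=10,000,000$. The required size only drops noticeably once the population becomes small relative to the sample.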

Does that mean that there is no value in those alternative (big) data sources? No, the biggest advantages I see are in granularity and in measurement error. Big data datasets are typically generated by devices and thus have less measurement error, and because of their size they allow for a much more granular analysis. My conclusion is that if your client cares less about representativity and is more interested in granularity, then, very often, larger data sources can be more meaningful than classical (small) samples, but even then you need to be careful when you generalize your findings to the broader population.
