## Thursday, December 1, 2016

### (small) samples versus alternative (big) data sources

Those of you who have already attended a meetup of the Brussels Data Science Community know that, besides excellent talks, those meetups are fun because of the traditional drinks afterwards. So after the last meetup we were on our way to a bar on the campus of the University of Brussels and I had this chat with @KrisPeeters from Dataminded. Now if you are expecting wild stories about beer and loose women (or loose men for that matter), I'm afraid I'll have to disappoint you. Instead we discussed ... sampling. Kris was questioning whether the typical sample sizes that market research companies work with (say in the hundreds, or a few thousand at most) still matter these days, given that we have other sources that give us much larger quantities of data. I told him everything depends on the (business) question the client has.

To start with, we can look at history to answer this question. In 1936 the Literary Digest poll had a sample size in the millions. But that sample wasn’t representative: it was drawn largely from the magazine's own readers, automobile registrations and telephone directories, which over-represented wealthier voters. They predicted that Republican Alf Landon would beat Democrat Franklin D. Roosevelt. Roosevelt won in one of the largest landslides ever.

A more recent example is a study that claimed that the Dutch are the best non-native English speakers. This was debunked in http://peilingpraktijken.nl/weblog/2016/11/beheersen-nederlanders-de-engelse-taal-echt-het-best/ (Dutch). Even though the sample size was 950,000 (in 72 countries), statistician Jelke Bethlehem, a Dutch national himself, concluded that the sample was not representative and did not support the conclusions that the researchers had drawn.

Of course samples can be, and are, biased as well. But there is a difference: samples are constructed specifically with a research question in mind, and are often designed to be unbiased. Big data and other alternative data sources are often created for reasons other than answering research questions. As a consequence big data might have some disadvantages that are not offset by its bigger size.

Take this hypothetical example. Say you have a population consisting of $N=10,000,000$ individuals and you want to estimate the proportion of people that watched a certain TV show. Say that you have an unbiased sample of size $n=1,000$ and that you find that 100 of them watched the television show. So, with 95% confidence, you would estimate $p=0.10$ with a margin of error of $z_{\alpha / 2} \times \sqrt{{pq\over n}}= 1.96 \times \sqrt{{0.1 \times 0.9 \over 1,000}}= 0.01859$, which amounts to a confidence interval in absolute figures from 814,058 to 1,185,942.

Suppose your friend has an alternative data source covering $N'=6,000,000$ individuals, so for those you know exactly whether they watched or not, with no sampling error at all, and hence no confidence interval (unless you are a Bayesian, but that's another story). For simplicity's sake assume that 600,000 of those 6,000,000 watched. To be fair, you know nothing about the remaining $N''=4,000,000$, but you could assume that, since your subpopulation is so big, they will be close to what you already have. This effectively means that you consider the alternative data source as a very large sample of size $n'=6,000,000$. In this case the sample fraction is ${n' \over N}={6,000,000\over 10,000,000}=0.6$, which is pretty high, so you get an additional bonus from the finite population correction, yielding a confidence interval between $p_-=p-z_{\alpha / 2} \times \sqrt{{pq\over n'}} \times \sqrt{{N-n'\over N-1}}=0.09985$ and $p_+=p+z_{\alpha / 2} \times \sqrt{{pq\over n'}} \times \sqrt{{N-n'\over N-1}}=0.10015$. In terms of absolute figures we end up with a confidence interval from 998,482 to 1,001,518, which is considerably more precise than the 814,058 to 1,185,942 we had in the case of $n=1,000$.

Of course, the crucial assumption is that we have considered the $n'=6,000,000$ to be representative of the whole population, which will seldom be the case. Indeed, it is very difficult to set up an unbiased sample, so it is not realistic to hope that an unbiased sample would pop up accidentally. As argued above, big data sources are often created for reasons other than research questions and hence we cannot simply assume they are unbiased.
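
The two confidence intervals in this example can be checked with a few lines of Python. This is only a minimal sketch: the function name `proportion_ci` is made up for illustration, and it uses the normal approximation with $z=1.96$ as in the text. Note that the standard error for the alternative data source is computed with $n'=6,000,000$, which is what reproduces the bounds quoted above.

```python
import math

Z_95 = 1.96  # two-sided 95% normal quantile, as used in the text


def proportion_ci(p, n, N=None, z=Z_95):
    """Return (lower, upper) bounds for a proportion p estimated from n units.

    If the population size N is given, the finite population correction
    sqrt((N - n) / (N - 1)) is applied to the margin of error.
    """
    margin = z * math.sqrt(p * (1 - p) / n)
    if N is not None:
        margin *= math.sqrt((N - n) / (N - 1))
    return p - margin, p + margin


N = 10_000_000  # population size

# Unbiased survey sample: n = 1,000 with 100 viewers observed.
small_lo, small_hi = proportion_ci(0.10, 1_000)

# Alternative data source treated as a sample of n' = 6,000,000.
big_lo, big_hi = proportion_ci(0.10, 6_000_000, N=N)

# Bounds expressed as absolute numbers of viewers in the population.
print(round(small_lo * N), round(small_hi * N))
print(round(big_lo * N), round(big_hi * N))
```

The first printed pair should match the 814,058 to 1,185,942 interval, the second the 998,482 to 1,001,518 interval.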

The question now becomes: at what point is the bias offset by the increased precision? In this case bias would mean that individuals in our alternative data source are more likely or less likely to watch the television show of interest than is the case in the overall population. Let's call the proportion of people in the alternative data source who watched the television show $p'$. Likewise we will call the proportion of the remaining individuals, those in the population but not in the alternative data source, who watched the television show $p''$. We can then define the level of bias in our alternative data source as $p'-p$. Since the number of remaining individuals from the population that are not in the alternative data source is $N''=N-N'$, we know that
$$Np=N'p'+N''p'',$$
which is a rather convoluted way of saying that if your alternative data source has a bias, the remaining part will be biased as well (but in the other direction).
Let's consider different values of $p'$ going from 0.05 to 0.15, which, with $N'=6,000,000$ and $N''=4,000,000$, corresponds to $p''$ going from 0.175 to 0.025, and to levels of bias going from -0.05 to 0.05. We can then calculate confidence bounds like we did above. In figure 1 the confidence bounds for the alternative data source (in black) are hardly noticeable. We've also plotted the confidence bounds for the sample case of $n=1,000$, assuming no bias (in blue). That confidence interval is obviously much wider. But we also see that as soon as the absolute value of the bias in the alternative data source is larger than about 0.02, the unbiased sample is actually better. (Note that I'm aware that I have loosely interpreted the notions of sample, confidence interval and bias, but I'm just trying to make the point that more is not always better.)
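
The relation between $p'$, $p''$ and the bias can be tabulated directly from the equation $Np=N'p'+N''p''$. The sketch below is not the code behind figure 1. The helper name `complement_proportion` is hypothetical, and the rule used to decide when the sample "is better" (absolute bias exceeding the sample's margin of error) is a simplifying assumption in the same loose spirit as the text.

```python
import math

N = 10_000_000      # population size
N_ALT = 6_000_000   # N': individuals covered by the alternative data source
N_REST = N - N_ALT  # N'': individuals not covered
P = 0.10            # population proportion p
Z_95 = 1.96


def complement_proportion(p_alt):
    """Solve N*p = N'*p' + N''*p'' for p'', the proportion in the uncovered part."""
    return (N * P - N_ALT * p_alt) / N_REST


# Margin of error of the unbiased survey sample of n = 1,000 (about 0.0186).
sample_margin = Z_95 * math.sqrt(P * (1 - P) / 1_000)

for p_alt in (0.05, 0.08, 0.09, 0.10, 0.11, 0.12, 0.15):
    bias = p_alt - P
    p_rest = complement_proportion(p_alt)
    # Rough criterion (an assumption of this sketch): the biased source is
    # off by |bias|, the unbiased sample by at most its margin of error.
    sample_wins = abs(bias) > sample_margin
    print(f"p'={p_alt:.2f}  p''={p_rest:.3f}  bias={bias:+.2f}  sample better: {sample_wins}")
```

Under this criterion the break-even point sits at the sample's margin of error, just under 0.02, which is in line with what the figure suggests.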

As said before, samples can be and are biased as well, but they are generally designed to be unbiased, while this is seldom the case for other (big) data sources. The crucial thing to realize here is that bias is (to a very large extent) not a function of (sample) size. Admittedly, by virtue of the equation above, as the fraction of the population covered by the alternative data source gets close to 1, there is less room for bias, even if the source was not designed for unbiasedness. This is further illustrated in figure 2. For a few possible values of $p$ (0.10, 0.25, 0.50 and 0.75) we have calculated what bias the complement of the alternative data source should show as a function of the fraction that the alternative data source represents in the total population (i.e. the sample fraction $N'/N$) and the bias $p'-p$. The point here is that the range of possible bias is very wide. Only for sample fractions above 0.80 does the sheer relative size of the subpopulation start to limit the possible biases one can encounter, but even then biases can range from -0.1 to 0.1 in the best of cases. Notice that this is even wider than the range we looked at in figure 1.
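
One way to see how the sample fraction limits the possible bias is to note that both $p'$ and $p''$ must lie between 0 and 1. Writing $f=N'/N$, the equation above becomes $p=fp'+(1-f)p''$, which bounds $p'$ and therefore the bias $p'-p$. The sketch below is my own reading of that constraint, not the computation behind figure 2, and the function name `bias_range` is hypothetical.

```python
def bias_range(p, fraction):
    """Return the (lowest, highest) bias p' - p that is arithmetically possible.

    Derived from p = f*p' + (1 - f)*p'' with f = N'/N, requiring that both
    p' and p'' stay within [0, 1].
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    # Lowest p' occurs when everyone outside the source watched (p'' = 1).
    lowest_p_alt = max(0.0, (p - (1 - fraction)) / fraction)
    # Highest p' occurs when nobody outside the source watched (p'' = 0).
    highest_p_alt = min(1.0, p / fraction)
    return lowest_p_alt - p, highest_p_alt - p


for f in (0.2, 0.6, 0.8, 0.95):
    lo, hi = bias_range(0.50, f)
    print(f"fraction={f:.2f}  possible bias from {lo:+.3f} to {hi:+.3f}")
```

For $p=0.50$ and a sample fraction of 0.80 this gives a possible bias of roughly plus or minus 0.125, so even a source covering most of the population leaves considerable room for bias.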

For most practical cases in market research the fraction covered by the alternative data source(s) can be high, but will seldom be as high as 0.80. In other words, for all practical purposes (in market research) we can safely say that the potential bias $p'-p$ of alternative data source(s) is not a function of size, but rather of design and execution. I believe it is fair to assume that well-designed samples combined with good execution will generally lead to lower biases than alternative data sources for which unbiasedness was never a concern.

Some concluding remarks.

I focused on bias, but with regard to precision the situation is reversed: alternative (big) data sources will generally be much larger than the usual survey samples, leading to much smaller confidence intervals such as those in figure 1. The point of course remains that it does not help you much to have a very tight (i.e. precise) confidence interval if it is around a biased estimate. And sampling error is just one part of the story. Indeed, measurement error is very often much more of an issue than sampling error.

Notice by the way that enriching the alternative data source with a sample from the part of the population it does not cover does not work in practice because, in all likelihood, the cost of that sample is the same as the cost of a sample covering the whole population. This has to do with the fact that, except for very high sample fractions, precision is not a function of population size $N$ (or in this case $N''$).
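
A quick calculation illustrates why population size hardly matters here. The sketch below solves the margin-of-error formula with finite population correction for the sample size $n$. The target margin of 0.01 and $p=0.10$ are illustrative assumptions, and the function name `required_sample_size` is made up.

```python
import math

Z_95 = 1.96


def required_sample_size(p, margin, N, z=Z_95):
    """Sample size needed to reach a given margin of error in a population of size N.

    Obtained by solving margin = z * sqrt(p*q/n) * sqrt((N - n)/(N - 1)) for n.
    """
    n0 = z ** 2 * p * (1 - p) / margin ** 2  # size needed for an infinite population
    return n0 / (1 + (n0 - 1) / N)


# Sampling only the uncovered part (N'' = 4,000,000) versus everyone (N = 10,000,000).
n_rest = required_sample_size(0.10, 0.01, 4_000_000)
n_full = required_sample_size(0.10, 0.01, 10_000_000)
print(math.ceil(n_rest), math.ceil(n_full))
```

Both cases need roughly 3,450 respondents, so restricting the survey to the uncovered 4,000,000 saves almost nothing compared with surveying the whole population.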

Does that mean that there is no value in those alternative (big) data sources? No, the biggest advantages I see are in granularity and in measurement error. Big data datasets are typically generated by devices and thus have less measurement error, and because of their size they allow for a much more granular analysis. My conclusion is that if your client cares less about representativity and is more interested in granularity, then, very often, larger data sources can be more meaningful than classical (small) samples, but even then you need to be careful when you generalize your findings to the broader population.
