(small) samples versus alternative (big) data sources

December 01, 2016

Those of you who have already attended a meetup of the Brussels Data Science Community know that, besides excellent talks, those meetups are fun because of the traditional drinks afterwards. So after the last meetup we were on our way to a bar on the campus of the University of Brussels and I had this chat with @KrisPeeters from Dataminded. Now if you are expecting wild stories about beer and loose women (or loose men for that matter), I'm afraid I'll have to disappoint you. Instead we discussed ... sampling. Kris was questioning whether the typical sample sizes market research companies work with (say in the hundreds, or a few thousand at most) still matter these days, given that we have other sources that give us much larger quantities of data. I told him that everything depends on the (business) question the client has.

To start with, we can look at history to answer this question. In 1936 the Literary Digest poll had a sample size in the millions. But that sample wasn't representative: it was drawn largely from the magazine's own readers and from lists of telephone and automobile owners. The poll predicted that Republican Alf Landon would beat Democrat Franklin D. Roosevelt. Roosevelt won in one of the largest landslides ever.

A more recent example is a study that claimed that the Dutch are the best non-native English speakers. This was debunked in http://peilingpraktijken.nl/weblog/2016/11/beheersen-nederlanders-de-engelse-taal-echt-het-best/ (in Dutch). Even though the sample size was 950,000 (in 72 countries), statistician Jelke Bethlehem, a Dutch national himself, concluded that the sample was not representative and did not support the conclusions that the researchers had drawn.

Of course, samples can be, and are, biased as well. But there is a difference: samples are constructed specifically with a research question in mind, and are often designed to be unbiased. Big data and other sources of data are often created for reasons other than research questions. As a consequence, big data might have some disadvantages that are not offset by its bigger size.

Take this hypothetical example. Say you have a population of $N=10,000,000$ individuals and you want to estimate the proportion of people who watched a certain TV show. Say that you have an unbiased sample of size $n=1,000$ and that you find that 100 of them watched the show. So you would estimate $p=0.10$ and, with 95% confidence, a margin of error of $z_{\alpha / 2} \times \sqrt{{pq\over n}}= 1.96 \times \sqrt{{0.1 \times 0.9 \over 1,000}}= 0.01859$, which amounts to a confidence interval in absolute figures from 814,058 to 1,185,942.

Suppose your friend has an alternative data source covering $N'=6,000,000$ individuals, so for those you know exactly whether they watched or not, with no sampling error at all, so no confidence interval (unless you are a Bayesian, but that's another story). Now you know the exact number of people who watched among the 6,000,000. For simplicity's sake assume this is 600,000. To be fair, you know nothing about the remaining $N''=4,000,000$, but you could assume that, since your subpopulation is so big, they will be close to what you already have. This effectively means that you treat the alternative data source as a very large sample of size $n'=6,000,000$. In this case the sample fraction is ${n' \over N}={6,000,000\over 10,000,000}=0.6$, which is pretty high, so you get an additional bonus from the finite population correction, yielding a confidence interval between $p_-=p-z_{\alpha / 2} \times \sqrt{{pq\over n'}} \times \sqrt{{N-n'\over N-1}}=0.09985$ and $p_+=p+z_{\alpha / 2} \times \sqrt{{pq\over n'}} \times \sqrt{{N-n'\over N-1}}=0.10015$. In terms of absolute figures we end up with a confidence interval from 998,482 to 1,001,518, which is considerably more precise than the 814,058 to 1,185,942 we had in the case of $n=1,000$.

Of course, the crucial assumption is that we have considered the $n'=6,000,000$ to be representative of the whole population, which will seldom be the case. Indeed, it is very difficult to set up an unbiased sample, so it is not realistic to hope that an unbiased sample would pop up accidentally. As argued above, big data sources are often created for reasons other than research questions, and hence we cannot simply assume they are unbiased.
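The arithmetic of this example can be checked with a few lines of Python. This is only a minimal sketch: the function name `margin_of_error` and its parameters are my own choices for illustration, and it uses the same normal approximation and $z_{\alpha/2}=1.96$ as the text.

```python
import math

Z_95 = 1.96  # normal quantile used in the text for 95% confidence


def margin_of_error(p, n, population=None, z=Z_95):
    """Half-width of the normal-approximation interval for a proportion.

    If `population` is given, the finite population correction
    sqrt((N - n) / (N - 1)) is applied.
    """
    moe = z * math.sqrt(p * (1 - p) / n)
    if population is not None:
        moe *= math.sqrt((population - n) / (population - 1))
    return moe


N = 10_000_000  # population size from the example
p = 0.10        # observed proportion in both scenarios

small = margin_of_error(p, 1_000)                    # classical sample
large = margin_of_error(p, 6_000_000, population=N)  # alternative source, treated as a sample

for label, moe in [("n = 1,000", small), ("n' = 6,000,000", large)]:
    low, high = (p - moe) * N, (p + moe) * N
    print(f"{label}: +/- {moe:.5f}, i.e. {low:,.0f} to {high:,.0f} viewers")
```

Note that the second interval is only meaningful under the assumption that the alternative source behaves like a random sample, which is exactly the assumption questioned here.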

The question now becomes: at what point is the bias offset by the increased precision? In this case bias would mean that individuals in our alternative data source are more likely or less likely to watch the television show of interest than is the case in the overall population. Let's call the proportion of people in the alternative data source who watched the television show $p'$. Likewise, we will call the proportion of the remaining individuals in the population, those not in the alternative data source, who watched the television show $p''$. We can then define the level of bias in our alternative data source as $p'-p$. Since the number of remaining individuals in the population who are not in the alternative data source is $N''=N-N'$, we know that
$$Np=N'p'+N''p'', $$
which is a rather convoluted way of saying that if your alternative data source has a bias, the remaining part will be biased as well (but in the other direction).
Let's consider different values of $p'$ going from 0.05 to 0.15, which, with $N'=6,000,000$ and $N''=4,000,000$, corresponds to $p''$ going from 0.175 to 0.025 and to levels of bias going from -0.05 to 0.05. We can then calculate confidence bounds like we did above. In figure 1 the confidence bounds for the alternative data source (in black) are hardly noticeable. We've also plotted the confidence bounds for the sample case of $n=1,000$, assuming no bias (in blue). That confidence interval is obviously much wider. But we also see that as soon as the absolute value of the bias in the alternative data source is larger than 0.02, the unbiased sample is actually better. (Note that I'm aware that I have loosely interpreted the notions of sample, confidence interval and bias, but I'm just trying to make the point that more is not always better.)
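The trade-off can be sketched numerically as well. The comparison criterion below is my own simplification, not something taken from the figure: for the alternative source I add the absolute bias to its (tiny) margin of error and compare that with the margin of error of the unbiased sample. All function and variable names are hypothetical.

```python
import math

N, N_ALT = 10_000_000, 6_000_000  # population and alternative-source sizes from the text
N_REST = N - N_ALT                # individuals not covered by the alternative source
P_TRUE = 0.10                     # true population proportion in the example
Z_95 = 1.96


def complement_proportion(p_alt):
    """Solve N*p = N'*p' + N''*p'' for p''."""
    return (N * P_TRUE - N_ALT * p_alt) / N_REST


def worst_case_error_alt(p_alt):
    """Assumed criterion: |bias| plus the FPC-corrected margin of error around p'."""
    moe = Z_95 * math.sqrt(p_alt * (1 - p_alt) / N_ALT)
    moe *= math.sqrt((N - N_ALT) / (N - 1))
    return abs(p_alt - P_TRUE) + moe


# Margin of error of the unbiased sample of n = 1,000.
sample_moe = Z_95 * math.sqrt(P_TRUE * (1 - P_TRUE) / 1_000)

for p_alt in [0.05, 0.08, 0.09, 0.10, 0.11, 0.12, 0.15]:
    p_rest = complement_proportion(p_alt)
    err = worst_case_error_alt(p_alt)
    better = "sample" if err > sample_moe else "alternative source"
    print(f"p'={p_alt:.2f}  p''={p_rest:.3f}  bias={p_alt - P_TRUE:+.2f}  "
          f"worst-case error={err:.4f}  -> {better}")
```

Under this criterion the break-even point sits just below a bias of 0.02 in absolute value, in line with the margin of error of 0.01859 computed earlier.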

As said before, samples can be, and are, biased as well, but they are generally designed to be unbiased, while this is seldom the case for other (big) data sources. The crucial thing to realize here is that bias is (to a very large extent) not a function of (sample) size. Admittedly, by virtue of the equation above, as the fraction of the population covered by the alternative data source gets close to 1, there is less room for bias, even if the source was not designed to be unbiased. This is further illustrated in figure 2. For a few possible values of $p$ (0.10, 0.25, 0.50 and 0.75) we have calculated what bias the complement of the alternative data source must show as a function of the fraction that the alternative data source represents in the total population (i.e. the sample fraction $N'/N$) and the bias $p'-p$. The point here is that the range of possible bias is very wide. Only for sample fractions above 0.80 does the sheer relative size of the subpopulation start to limit the possible biases one can encounter, but even then biases can range from -0.1 to 0.1 in the best of cases. Notice that this is even wider than the range we looked at in figure 1.
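To make the limits on the bias concrete, here is a small sketch that derives the feasible range of $p'-p$ from the equation $Np=N'p'+N''p''$, requiring only that both $p'$ and $p''$ stay between 0 and 1. The function name `feasible_bias_range` is hypothetical, and the code is my own reading of the argument, not a reconstruction of how figure 2 was produced.

```python
def feasible_bias_range(p, fraction):
    """Range of bias p' - p that keeps both p' and p'' inside [0, 1].

    With f = N'/N, the identity N*p = N'*p' + N''*p'' gives
    p'' = p - f * (p' - p) / (1 - f).
    """
    if not 0 < fraction < 1:
        raise ValueError("fraction must be strictly between 0 and 1")
    ratio = (1 - fraction) / fraction
    lower = max(-p, -(1 - p) * ratio)  # from p' >= 0 and p'' <= 1
    upper = min(1 - p, p * ratio)      # from p' <= 1 and p'' >= 0
    return lower, upper


for p in (0.10, 0.25, 0.50, 0.75):
    for fraction in (0.20, 0.50, 0.80, 0.95):
        low, high = feasible_bias_range(p, fraction)
        print(f"p={p:.2f}  N'/N={fraction:.2f}  bias between {low:+.3f} and {high:+.3f}")
```

For $p=0.50$ and a fraction of 0.80, for instance, the bias can still lie anywhere between -0.125 and 0.125.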

For most practical cases in market research the fraction covered by the alternative data source(s) can be high, but will seldom be as high as 0.80. In other words, for all practical purposes (in market research) we can safely say that the potential bias $p'-p$ of alternative data source(s) is not a function of size, but rather of design and execution. I believe it is fair to assume that well-designed samples combined with good execution will lead to biases that are generally lower than those of alternative data sources, where unbiasedness is not something that is cared about.

Some concluding remarks.

I focused on bias, but with regard to precision the situation is reversed: alternative (big) data sources will generally be much larger than the usual survey sample sizes, leading to much smaller confidence intervals such as those in figure 1. The point of course remains that it does not help you much to have a very tight (i.e. precise) confidence interval if it is around a biased estimate. Of course, sampling error is just one part of the story. Indeed, measurement error is very often much more of an issue than sampling error.

Notice, by the way, that supplementing your alternative data source with a sample of the part of the population it does not cover does not work in practice because, in all likelihood, the cost of that sample is the same as the cost of a sample covering the whole population. This has to do with the fact that, except for very high sample fractions, precision is not a function of population size $N$ (or in this case $N''$).
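A quick way to see that precision hardly depends on population size is to compute the margin of error of a sample of 1,000 for populations of very different sizes. This is a minimal sketch using the same normal approximation and finite population correction as before; the function name `moe_with_fpc` and the list of population sizes are my own choices.

```python
import math


def moe_with_fpc(p, n, population, z=1.96):
    """Margin of error for a proportion, with finite population correction."""
    fpc = math.sqrt((population - n) / (population - 1))
    return z * math.sqrt(p * (1 - p) / n) * fpc


p, n = 0.10, 1_000
for population in (2_000, 100_000, 4_000_000, 10_000_000):
    moe = moe_with_fpc(p, n, population)
    print(f"N = {population:>10,}  sample fraction = {n / population:.4f}  "
          f"margin of error = {moe:.5f}")
```

Only when the sample fraction becomes very high (here a population of 2,000) does the margin of error shrink noticeably; sampling from the uncovered 4,000,000 is practically as demanding as sampling from all 10,000,000.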

Does that mean that there is no value in those alternative (big) data sources? No, the biggest advantages I see are in granularity and in measurement error. Big data datasets are typically generated by devices, and thus have less measurement error, and because of their size they allow for a much more granular analysis. My conclusion is that if your client cares less about representativity and is more interested in granularity, then, very often, larger data sources can be more meaningful than classical (small) samples. But even then you need to be careful when you generalize your findings to the broader population.
