(small) samples versus alternative (big) data sources
Those of you who have already attended a meetup of the Brussels Data Science Community know that, besides excellent talks, those meetups are fun because of the traditional drinks afterwards. So after the last meetup we were on our way to a bar on the campus of the University of Brussels and I had this chat with @KrisPeeters from Dataminded. Now if you are expecting wild stories about beer and loose women (or loose men for that matter), I'm afraid I'll have to disappoint you. Instead we discussed ... sampling. Kris was questioning whether the typical sample sizes that market research companies work with (say in the hundreds, or a few thousand at most) still matter these days, given that we have other sources that give us much larger quantities of data. I told him everything depends on the (business) question the client has.
To start with, we can look at history to answer this question. In 1936 the Literary Digest poll had a sample size in the millions. But that sample wasn’t representative, because it was drawn largely from the magazine's own readers and from lists of car and telephone owners. They predicted that Republican Alf Landon would beat Democrat Franklin D. Roosevelt. Roosevelt won in one of the largest landslides ever.
A more recent example is a study that claimed that the Dutch are the best non-native English speakers. This was debunked in http://peilingpraktijken.nl/weblog/2016/11/beheersen-nederlanders-de-engelse-taal-echt-het-best/ (Dutch). Even though the sample size was 950,000 (in 72 countries), statistician Jelke Bethlehem, a Dutch national himself, concluded that the sample was not representative and did not support the conclusions that the researchers had drawn.
Of course samples can be, and are, biased as well. But there is a difference: samples are constructed specifically with a research question in mind, and are often designed to be unbiased. Big data or other sources of data are often created for reasons other than research questions. As a consequence big data might have some disadvantages that are not offset by its bigger size.
Take this hypothetical example. Say you have a population consisting of $N=10,000,000$ individuals and you want to estimate the proportion of people that watched a certain TV show. Say that you have an unbiased sample of size $n=1,000$ and that you find that 100 of them watched the television show. So, with 95% confidence, you would estimate $p=0.10$ with a margin of error of $z_{\alpha / 2} \times \sqrt{{pq\over n}}= 1.96 \times \sqrt{{0.1 \times 0.9 \over 1,000}}= 0.01859$, which amounts to a confidence interval in absolute figures from 814,058 to 1,185,942.
Suppose your friend has an alternative data source covering $N'=6,000,000$ individuals, so for those you know exactly whether they watched or not, with no sampling error at all, so no confidence interval (unless you are a Bayesian, but that's another story). Now you know the exact number of people who watched among the 6,000,000. For simplicity's sake assume this is 600,000. To be fair, you know nothing about the remaining $N''=4,000,000$, but you could assume that since your subpopulation is so big, they will be close to what you already have. This effectively means that you consider the alternative data source as a very large sample of size $n'=6,000,000$. In this case the sample fraction is ${n' \over N}={6,000,000\over 10,000,000}=0.6$, which is pretty high, so you get an additional bonus from the finite population correction, yielding a confidence interval between $p_-=p-z_{\alpha / 2} \times \sqrt{{pq\over n'}} \times \sqrt{{N-n'\over N-1}}=0.09985$ and $p_+=p+z_{\alpha / 2} \times \sqrt{{pq\over n'}} \times \sqrt{{N-n'\over N-1}}=0.10015$. In terms of absolute figures we end up with a confidence interval from 998,482 to 1,001,518, which is considerably more precise than the 814,058 to 1,185,942 we had in the case of $n=1,000$. Of course, the crucial assumption is that we have considered the $n'=6,000,000$ to be representative of the whole population, which will seldom be the case.
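The two intervals above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic in the example; the function name `margin_of_error` and the constant `Z_95` are my own labels, not something from a particular library.

```python
import math

Z_95 = 1.96  # two-sided 95% normal quantile, as used in the text


def margin_of_error(p, n, N=None):
    """Margin of error for a proportion.

    When a population size N is given, the finite population correction
    sqrt((N - n) / (N - 1)) is applied as well.
    """
    moe = Z_95 * math.sqrt(p * (1 - p) / n)
    if N is not None:
        moe *= math.sqrt((N - n) / (N - 1))
    return moe


N = 10_000_000
p = 0.10

moe_sample = margin_of_error(p, 1_000)       # small unbiased sample, no correction
moe_big = margin_of_error(p, 6_000_000, N)   # alternative source treated as a sample

for label, moe in [("n = 1,000", moe_sample), ("n' = 6,000,000", moe_big)]:
    low, high = N * (p - moe), N * (p + moe)
    print(f"{label}: +/- {moe:.5f}, i.e. {low:,.0f} to {high:,.0f} viewers")
```

Note that the correction is only applied to the large source, as in the text; for $n=1,000$ out of ten million it would change next to nothing.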
Indeed, it is very difficult to set up an unbiased sample, so it is not realistic to hope that an unbiased sample would pop up accidentally. As argued above, big data sources are often created for reasons other than research questions, and hence we cannot simply assume they are unbiased.
The question now becomes: at what point is the bias offset by the increased precision? In this case bias would mean that individuals in our alternative data source are more likely or less likely to watch the television show of interest than is the case in the overall population. Let's call the proportion of people from the alternative data source who watched the television show $p'$. Likewise we will call the proportion of the remaining individuals from the population, those not in the alternative data source, who watched the television show $p''$. We can then define the level of bias in our alternative data source as $p'-p$. Since the number of remaining individuals from the population that are not in the alternative data source is $N''=N-N'$, we know that
$$Np=N'p'+N''p'', $$
which is a rather convoluted way of saying that if your alternative data source has a bias, the remaining part will be biased as well (but in the other direction).
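Solving the identity above for $p''$ makes the direction and size of that opposite bias explicit. This is only a rearrangement of the equation above, using $N-N''=N'$:

$$p''={Np-N'p' \over N''}, \qquad p''-p=-{N' \over N''}\,(p'-p).$$

With $N'=6,000,000$ and $N''=4,000,000$ the factor $N'/N''$ equals 1.5, so a bias of $+0.05$ in the alternative data source implies a bias of $-0.075$ in the part it does not cover.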
Let's consider different values of $p'$ going from 0.05 to 0.15, which, with $N'=6,000,000$ and $N''=4,000,000$, corresponds to $p''$ going from 0.175 to 0.025 and to levels of bias going from -0.05 to 0.05. We can then calculate confidence bounds like we did above. In figure 1 the confidence bounds for the alternative data source (in black) are hardly noticeable. We've also plotted the confidence bounds for the sample case of $n=1,000$, assuming no bias (in blue). That confidence interval is obviously much wider. But we also see that as soon as the absolute value of the bias in the alternative data source is larger than about 0.02, the unbiased sample is actually better. (Note that I'm aware that I have loosely interpreted the notions of sample, confidence interval and bias, but I'm just trying to make the point that more is not always better.)
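The comparison can be sketched numerically as follows. The criterion used here is my own loose reading of "better", not a formal one: the alternative source's worst-case error is taken as its absolute bias plus its (tiny) margin of error, and that is compared with the margin of error of the unbiased sample. The names `worst_case_error_alt`, `p_alt` and `p_rest` are hypothetical helpers standing in for the quantities $p'$ and $p''$.

```python
import math

Z_95 = 1.96
N, N_alt = 10_000_000, 6_000_000   # population and alternative-source sizes from the text
p = 0.10                           # true population proportion
n = 1_000                          # size of the unbiased sample

# Half-width of the unbiased sample's interval (no finite population correction).
moe_sample = Z_95 * math.sqrt(p * (1 - p) / n)


def worst_case_error_alt(p_alt):
    """Absolute bias plus the half-width of the interval around p_alt."""
    fpc = math.sqrt((N - N_alt) / (N - 1))
    moe_alt = Z_95 * math.sqrt(p_alt * (1 - p_alt) / N_alt) * fpc
    return abs(p_alt - p) + moe_alt


for i in range(5, 16):
    p_alt = i / 100
    # Proportion implied for the part of the population not covered by the source.
    p_rest = (N * p - N_alt * p_alt) / (N - N_alt)
    better = "sample" if worst_case_error_alt(p_alt) > moe_sample else "alternative source"
    print(f"p'={p_alt:.2f}  p''={p_rest:.3f}  bias={p_alt - p:+.2f}  better: {better}")
```

Under this criterion the crossover sits just below an absolute bias of 0.02, because the sample's margin of error is about 0.0186.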
As said before, samples can be, and are, biased as well, but they are generally designed to be unbiased, while this is seldom the case for other (big) data sources. The crucial thing to realize here is that bias is (to a very large extent) not a function of (the sample) size. Granted, by virtue of the equation above, as the fraction of the population covered by the alternative data source gets close to 1, bias is less likely to occur, even if the source was not designed for unbiasedness. This is further illustrated in figure 2. For a few possible values of $p$ (0.10, 0.25, 0.50 and 0.75) we have calculated what bias the complement of the alternative data source should show as a function of the fraction that the alternative data source represents in the total population (i.e. sample fraction $N'/N$) and the bias $p'-p$. The point here is that the range of possible bias is very wide: only for sample fractions above 0.80 does the sheer relative size of the subpopulation start to limit the possible biases one can encounter, but even then biases can range from -0.1 to 0.1 in the best of cases. Notice that this is even wider than the range we looked at in figure 1.
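One way to make "the range of possible bias" concrete is to ask which values of $p'$ are arithmetically possible at all, given that both $p'$ and the implied $p''$ must lie between 0 and 1. The sketch below does that; it is my own illustration of the constraint and not a reconstruction of how figure 2 was produced. The function name `bias_range` and the particular sample fractions in the loop are assumptions.

```python
def bias_range(p, f):
    """Range of bias p' - p that keeps both p' and the implied p'' within [0, 1].

    p is the population proportion and f = N'/N is the fraction of the
    population covered by the alternative data source (0 < f < 1).
    """
    # From p = f * p' + (1 - f) * p'' with 0 <= p'' <= 1:
    p_alt_min = max(0.0, (p - (1 - f)) / f)
    p_alt_max = min(1.0, p / f)
    return p_alt_min - p, p_alt_max - p


for p in (0.10, 0.25, 0.50, 0.75):
    for f in (0.2, 0.6, 0.8, 0.95):
        low, high = bias_range(p, f)
        print(f"p={p:.2f}  N'/N={f:.2f}  bias between {low:+.3f} and {high:+.3f}")
```

For small sample fractions the bounds are simply those of a proportion ($-p$ to $1-p$); they only start to tighten once the source covers a large share of the population.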
For most practical cases in market research the fraction of the alternative data source(s) can be high, but will seldom be as high as 0.80. In other words, for all practical purposes (in market research) we can safely say that the potential bias $p'-p$ of alternative data source(s) is not a function of size, but rather of design and execution. I believe it is fair to assume that well-designed samples combined with good execution will lead to biases that are generally lower than those of alternative data sources where unbiasedness is not something that is cared about.
Some concluding remarks.
I focused on bias, but with regard to precision the situation is reversed: alternative (big) data sources will generally be much larger than the usual survey sample sizes, leading to much smaller confidence intervals such as those in figure 1. The point of course remains that it does not help you much to have a very tight (i.e. precise) confidence interval if it is centered on a biased estimate. Of course, sampling error is just one part of the story. Indeed, measurement error is very often much more of an issue than sampling error.
Notice, by the way, that supplementing your alternative data source with a sample of the part of the population it does not cover does not work in practice because, in all likelihood, the cost of such a sample is the same as the cost of a sample covering the whole population. This has to do with the fact that, except for very high sample fractions, precision is not a function of population size $N$ (or in this case $N''$).
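A quick check of that last claim, reusing the numbers from the example above. The choice of $p=0.10$ and $n=1,000$ is an assumption carried over from that example, and `margin_of_error` is again just an illustrative helper.

```python
import math

Z_95 = 1.96


def margin_of_error(p, n, N):
    """95% margin of error for a proportion, with finite population correction."""
    return Z_95 * math.sqrt(p * (1 - p) / n) * math.sqrt((N - n) / (N - 1))


p, n = 0.10, 1_000
moe_uncovered = margin_of_error(p, n, 4_000_000)   # sample from the uncovered part N''
moe_whole = margin_of_error(p, n, 10_000_000)      # sample from the whole population N

print(f"N'' =  4,000,000: +/- {moe_uncovered:.5f}")
print(f"N   = 10,000,000: +/- {moe_whole:.5f}")
```

The two margins differ only in the sixth decimal place, so a sample of the uncovered 4,000,000 needs to be just as large as a sample of all 10,000,000 to reach the same precision.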
Does that mean that there is no value in those alternative (big) data sources? No, the biggest advantage I see is in granularity and in measurement error. Big data datasets are typically generated by devices and thus have less measurement error, and because of their size they allow for a much more granular analysis. My conclusion is that if your client cares less about representativeness and is more interested in granularity, then, very often, larger data sources can be more meaningful than classical (small) samples, but even then you need to be careful when you generalize your findings to the broader population.