All Things Data Science

Posts

Showing posts with the label statistics

Calibration, weighting and post-stratification in audience measurement

June 22, 2025

Read my post on Substack: Calibration, weighting and post-stratification in audience measurement

The share of blank and invalid votes in the 2024 municipal elections in Flanders has fallen, but not equally sharply everywhere.

February 08, 2025

A few weeks ago the Flemish government announced the publication of the fine-grained voting results of the most recent local and provincial elections. As a data scientist I was immediately interested in what exactly these fine-grained results contained. What you often do first in such a case is simple data exploration rather than starting to model right away. Initially my attention went to the results at the level of counting stations and polling stations, and to how the variation between counting and polling stations within a municipality compares with the variation between municipalities. I quickly noticed that the share of blank and invalid votes had dropped sharply everywhere, but that the size of the drop was strongly geographically determined. First of all, the fact that the share of blank and invalid votes has dropped sharply should not come as a surprise, since compulsory voting in Flanders was abolished from 2024 onwards. I note right away that this was not the case in Bruss...

On Lampedusa, asylum applications, Europe and a chart in De Standaard

October 26, 2013

On Friday 25 October 2013, "De Standaard" published an article under the headline "Vluchtelingen moeten het doen met beloftes" ("Refugees have to make do with promises"). The article itself is fine; it deals with the problem of refugees in Europe, which has risen high on the European agenda because of the disaster off Lampedusa. The chart accompanying the article, however, is not exactly a bull's-eye. The problem with this chart is that it uses the area of circles to compare proportions, and that is particularly difficult. Take the United Kingdom, for example. About half (14,600) of the 28,200 asylum applications are approved. The area of the red circle is therefore about half that of the blue circle. I checked the calculation, and it is fairly accurate, but the average reader will probably not immediately think of that ratio. But fine, the numbers themselves are neatly printed alongside, so even if it does not work well visually, you still have the numbers ...
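The United Kingdom figures quoted above can be checked with a short calculation. This is only a sketch: it assumes that the chart encodes each value as the area of a circle, as described in the post, and the variable names are my own.

```python
import math

# Figures for the United Kingdom as quoted in the post
applications = 28200  # asylum applications (blue circle)
approved = 14600      # approved applications (red circle)

# If values are encoded as circle areas, the area ratio equals the value ratio
area_ratio = approved / applications

# Area grows with the square of the radius, so the radius (and diameter)
# ratio is the square root of the area ratio
radius_ratio = math.sqrt(area_ratio)

print(f"area ratio:   {area_ratio:.3f}")
print(f"radius ratio: {radius_ratio:.3f}")
```

The red circle would be roughly 72% as wide as the blue one while representing only about 52% of its value, which may be part of why readers find such comparisons hard to judge by eye.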

A small experiment with Twitter's language detection algorithm

October 17, 2013

Some time ago I captured quite a lot of geo-located tweets for a spatial statistics project I'm doing. The tweets I collected were all confined to Belgium. One of the things I looked at was the language of the tweets. As you might know, Belgium officially has three languages: Dutch, French and German. Of course, when you analyze a large set of tweets, you can't determine the language manually; on the other hand, blindly relying on Twitter's language detection algorithm doesn't feel good either. That's why I set up a little experiment to assess to what extent Twitter's language detection algorithm can be trusted in the context of my geo-location project. I stress this because I don't have the ambition to make overall judgments on how Twitter takes care of language detection. First, let's look at the languages, as determined by the Twitter language detection algorithm, of the 150,000 or so tweets I collected. The bar chart below shows the frequency of...

De Moivre's equation and the solar panels of Lo-Reninge

September 15, 2013

A few weeks ago I saw an innocent little article on solar panels in the Flemish quality newspaper 'De Standaard', entitled "Niemand maakt meer zonne-energie dan inwoners Lo-Reninge", which roughly translates to "no one produces more solar energy than the inhabitants of Lo-Reninge". The article reports on the production of solar energy by individual households, typically produced by small installations on rooftops. The Flemish authorities support solar energy by subsidizing households that install solar panels. An important part of the subsidies is handled by issuing so-called 'Green certificates' (or renewable energy certificates) per fixed amount of kilowatt-hours produced. See here for more details on solar power in Belgium. De Standaard, citing data from the Flemish Regulator of the Electricity and Gas market (VREG), reported on the number of these certificates issued in 2012 relative to the number of inhabitant...

An introduction to probability theory with Elvis Costello

July 26, 2013

Last week I released a paper entitled "The Generalized $S^3$-problem. A probabilistic view on Elvis Costello's Spectacular Spinning Songbook". You can find the pdf here. The paper is a bit of a parody of statistical papers, so it shouldn't be taken too seriously. But at the same time it gives a very gentle introduction to some concepts of probability theory (Laplace, independence, the birthday paradox, ...). Enjoy!

Are partygoers in Belgium using more cocaine?

July 10, 2013

Last week the Belgian newspaper De Morgen ran an article on drug use amongst Belgian partygoers. The headline of the article was "Partygoers use less cannabis and more cocaine" ("Minder cannabis, meer cocaïne bij feestvierders"). The graph that accompanied the article looked like this: While this is Dutch, the language of drugs is universal, so I'm sure you will have no difficulty in understanding what it says. There are a couple of remarks to make on this graph: First, while there are small grey bars between the 3 groups, Alcohol/Cannabis, Xtc/Cocaine and LSD/GHB/Ketamine, I was initially fooled into thinking they were all using the same Y-axis. They're not, so you need to be careful to take scale into account. Secondly, at first glance there seems to be a drop in cannabis use, but the increase in cocaine that was mentioned in the title is less clear cut (no pun intended). Thirdly, alcohol use seems to decline as well, although this is difficult to j...

Visualization errors do not bother "De Morgen"

June 22, 2013

On Wednesday 19 June 2013, De Morgen published an article with the headline "Crisis deert superrijken niet" ("Crisis does not bother the super-rich"). One of the two charts accompanying the article deserves a closer look. Here is the chart in question: To make the text a little more readable for this blog, I have modified the chart slightly: Note that you have to take into account the length ratios in the first chart. The first thing that stands out is that the lengths of the two smallest bars are not in proportion to the blue numbers (that is, the frequencies). For the highest frequency there is still an excuse, because a so-called scale break is shown there (i.e. the interruption halfway up the bar with the highest frequency). As the chart stands now, a scale break should also have been placed at the 1,068,500, but since the height of the first bar is arbitrary relative to the frequency it represents, two scale breaks in a chart with three ...

Addendum to "Some thoughts on the recent "De Standaard/VRT/TNS" poll"

June 03, 2013

Dear Tim and @_3s_, First of all, thank you for your responses to Some thoughts on the recent "De Standaard/VRT/TNS" poll. I want to add right away that it was not my intention to put Maarten in his place, as Tim writes. On the contrary, I think Maarten had intuitively set up a correct line of reasoning. As for my remark about Maarten's Bayesian reasoning, that was meant more as a joke/compliment. If @_3s_ says that this is not Bayesian, I readily believe him; he is more of a specialist in that than I am. I do think that, in the specific case of the TNS survey, the journalists were right to focus on the decline that was observed for NVA. Of course I too am convinced that, in general, you also have to look at the uncertainty surrounding the point of comparison. It does indeed matter whether that comes from the election result (no sampling error, very small measurement error) or from another opinion pol...

Some thoughts on the recent "De Standaard/VRT/TNS" poll

May 30, 2013

I regularly comment on this blog on the reporting about polls and related topics. I did not do so for the recent DS/VRT poll, because all in all I found that the reporting was not that bad. I have not read all the articles, but in general people did not fixate on small differences, and the relative nature of the results was underlined fairly well. Incidentally, Maarten Lambrechts (@maartenzam) did make a nice overview of the different visual representations of the poll results. So I was not planning to react but, by popular request (well, only @janvandenbulck), here are a few thoughts after all, in particular about a Twitter conversation between @OmbudsDS and @maartencorten. The starting point was the contribution by @OmbudsDS in which he wrote that the reporting on the poll in his newspaper was generally good. One of the arguments was that the reporting focused on the significant decline for the NVA and not...

A reaction on "On a First-name Basis with Success? Your Mom Chose Your Name Wisely."

May 11, 2013

Earlier this week, the Business section of the Flemish quality newspaper 'De Standaard' reported that the shorter the first name, the higher the income (see here). The article showed a picture of Bill Gates, with the caption: "Was using the nickname 'Bill' the key to the success of William Henry Gates?". The newspaper was referring to research carried out by TheLadders, a "job-matching service for career-driven professionals", and reported here. Basically, they analyzed first names and salary levels in data from TheLadders’ nearly 6 million members. The blog is more tongue in cheek than the De Standaard article led us to believe, but it has found its way into social media, being liked and tweeted more than a thousand times, and was picked up by the popular (and sometimes serious) press. There are, however, a few concerns with this research. Let me mention them one by one: The first concern is an obvious one: " Correlation is not caus...

Election fraud detection in Armenia and in Flanders

March 08, 2013

Last week, a tweet by the Dutch political scientist Armen Hakhverdian (@hakhverdian) pointed to an interesting blog post from Fredrik M Sjoberg, a Postdoctoral Scholar at Columbia University – The Harriman Institute. It's a guest post on The Monkey Cage dealing with the recent election in Armenia and the (alleged) election fraud. One of the things he did was a very simple test. He did a $\chi^2$-test based on the assumption that: "In the absence of manipulation of vote totals the last digit should follow a uniform distribution of 10 percent in each of the 0-9 digit categories." He did that for the ruling party at the polling station level, both for 2012 (no fraud allegations) and 2013 (fraud allegations). The results are summarized by the graph below (copied from the original blog post). The $\chi^2$-test for 2013 turns out to be significant at the 0.1 percent level, but non-significant in 2012 ($p$-value = 0.981). Apart from the fact that I'm always a bit worried when ...
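To make the last-digit test concrete, here is a minimal sketch in Python. It is not Sjoberg's code or data: the vote totals are invented for illustration, the function name `last_digit_chi2` is my own, and I compare the statistic with the tabulated $\chi^2$ critical value for 9 degrees of freedom at the 0.1 percent level instead of computing a $p$-value.

```python
from collections import Counter

def last_digit_chi2(vote_totals):
    """Pearson chi-square statistic for uniformity of last digits (df = 9)."""
    counts = Counter(v % 10 for v in vote_totals)
    expected = len(vote_totals) / 10  # 10 percent in each digit category
    return sum((counts.get(d, 0) - expected) ** 2 / expected for d in range(10))

# Hypothetical vote totals, invented for illustration only
uniform_totals = list(range(100, 1100))                    # every last digit occurs 100 times
suspicious_totals = [v - v % 10 for v in uniform_totals]   # every total ends in 0

CRITICAL_0_001 = 27.877  # chi-square critical value for df = 9, alpha = 0.001

for name, totals in [("uniform", uniform_totals), ("suspicious", suspicious_totals)]:
    stat = last_digit_chi2(totals)
    verdict = "reject uniformity" if stat > CRITICAL_0_001 else "do not reject"
    print(f"{name}: chi2 = {stat:.1f} -> {verdict}")
```

Both example sets are deliberately extreme; real polling-station totals would fall somewhere in between, and very small totals can make the uniformity assumption itself questionable.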

What is wrong with De Morgen's porn chart

February 28, 2013

On Thursday 28/02/13, De Morgen published an article, 'Pleidooi voor porno. Maar niet helemaal' ('A plea for porn. But not entirely'), written by Sjoukje Smedts. It is a nuanced article in which all kinds of experts get to have their say about the phenomenon of porn. A fine piece, then ... But not entirely. The accompanying chart, a spider chart, could have been better. First, it is immediately noticeable that the proportions are wrong. The percentage of daily porn viewers is about 10 times smaller than the percentage of people who say they watch porn about once a week. In the spider chart the ratio looks more like one half. It seems that the innermost heptagon is not counted. But even then the 13.3% for 'about once a month' does not seem right. Apart from that, you can ask yourself whether such a spider chart (spider or radar graph) is really the best representation. In principle, the order of the different variables does not matter in a spider chart, and so it is...

A data scientist looks at the Belgian Municipal Elections.

February 18, 2013

Remark: This is the English version of my previous Dutch blog post. After the provincial and municipal elections of the 14th of October in Belgium, media reported several cases of candidates who had received more preference votes than could normally have been expected. The additional votes were attributed to a problem with the touch screens of the voting machines. When voters pressed too long when selecting a party, the system would sometimes register a preference vote for the candidate whose name appeared in the same area of the screen as the party. The figures below illustrate the issue well. The figure on the left is the Parties Screen, i.e. the first screen the voter sees. On that screen the voter selects a party (sometimes called a list). In this case, the voter has selected the PVDA+ party, as indicated by the blue rectangle. The figure on the right-hand side is the candidates screen, i.e. the subsequent screen the voter sees. This screen shows all the c...