Saturday, October 26, 2013

On Lampedusa, asylum applications, Europe and a chart in De Standaard

On Friday 25 October 2013, "De Standaard" published an article under the headline "Vluchtelingen moeten het doen met beloftes" ("Refugees have to make do with promises"). The article itself is fine: it deals with the refugee problem in Europe, which, because of the disaster off Lampedusa, has risen to the top of the European agenda. The chart accompanying the article, however, is not exactly a bullseye.
The problem with this chart is that it uses the areas of circles to compare proportions, and that is notoriously difficult. Take the United Kingdom, for example. Roughly half (14,600) of the 28,200 asylum applications are approved, so the area of the red circle should be about half that of the blue circle. I checked, and it works out fairly well, but the average reader will probably not immediately think in terms of that ratio. Then again, the numbers themselves are printed right next to the circles, so even if the visual doesn't work well, you still have the numbers.

Worse, the chart does not really support the accompanying story. The point of the article is that the southern countries feel they have to carry the largest part of the refugee burden, but that the figures nuance that picture. And indeed, at the bottom of the list we find Malta, Greece and Spain. The article rightly points out that the figures should be seen in the light of each country's population, but the figures in the chart are not given in relative terms. Italy is used as an example, but unfortunately it sits in the upper, better, half of the chart. The article further mentions that France also belongs to the ad hoc coalition, yet that country too sits in the upper half of the chart.
Incidentally, I think the journalist should have indicated why some countries were included in the chart and others were not.

This chart can be done better. I took the figures and added the 2012 population figures from Eurostat. I then expressed the number of asylum applications per million inhabitants. Because areas of circles are so hard to interpret, I opted for a simple bar chart.

The chart is ordered from the highest relative number of granted asylum applications to the lowest (i.e. the green part of each bar); the rejected applications are shown in red. This way the ratio of approved to rejected applications stands out for each country, and it is immediately clear which countries grant relatively many asylum applications (relative to their population) and which do not. The information you do not see here, and which was present in De Standaard's chart, is the absolute numbers. That is a drawback, but on the other hand those numbers were not really the subject of the article.
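For readers who want to reproduce such a chart, here is a minimal R/ggplot2 sketch. Only the two countries whose figures are quoted in this post are filled in, the object and column names are my own, and the population figures are approximate Eurostat 2012 values in millions.

  library(ggplot2)

  # Figures quoted in the post; extend with the other countries from the article
  asylum <- data.frame(
    country    = c("United Kingdom", "Spain"),
    granted    = c(14600, 600),
    rejected   = c(28200 - 14600, 2600 - 600),
    population = c(63.5, 46.8)          # approximate 2012 population, in millions
  )

  # One row per country and outcome, expressed per million inhabitants
  long <- rbind(
    data.frame(country = asylum$country, outcome = "granted",
               per_million = asylum$granted / asylum$population),
    data.frame(country = asylum$country, outcome = "rejected",
               per_million = asylum$rejected / asylum$population)
  )

  # Order the bars by granted applications per million; highest ends up at the top after the flip
  ord <- as.character(asylum$country)[order(asylum$granted / asylum$population)]
  long$country <- factor(as.character(long$country), levels = ord)

  ggplot(long, aes(x = country, y = per_million, fill = outcome)) +
    geom_bar(stat = "identity") +
    coord_flip() +
    scale_fill_manual(values = c(granted = "darkgreen", rejected = "firebrick")) +
    labs(x = NULL, y = "Asylum applications per million inhabitants (2012)")

Reordering the factor levels, rather than the rows of the data, is what controls the order of the bars in ggplot2.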

In the reworked chart you can see that the ad hoc coalition dangles at the very bottom; only Malta arguably has a point, and it sits right at the top of this chart. I think the journalist could have told a stronger story with a better graphical representation.

Apart from that, you can also see that, leaving Malta aside, the northern countries grant relatively more asylum than the southern ones. Note also that Belgium is closer to the Scandinavian countries than to the southern countries. Among the four countries in the tail, you can also clearly see that France and Greece receive far more applications than Spain and Italy.

In the light of that last point, I would like to note that the figures accompanying the article imply that Spain, for instance, approved only 600 asylum applications in 2012. In principle that is possible, for example if there were a freeze on asylum in that country, but that there would have been only 2,600 applications in that year seems very unlikely to me, certainly when you know that elsewhere in the same day's newspaper there is mention of hundreds of boat refugees on that day alone. Admittedly, those were seven hundred refugees picked up in five different rescue operations in Italy, not Spain, but it seems more likely to me that the real number of asylum seekers in Spain, and probably in Italy as well, is much higher than these administrative figures would suggest. There are probably other channels besides this form of application to stay in Spain and Italy, but I leave that to the migration specialists.

Wednesday, October 23, 2013

Managing Data Scientists

With the rise of the 'Data Scientist', a lot has been said about the definition, role, qualifications and skills of the Data Scientist, and how to hire them. A somewhat neglected topic is how to manage data scientists. Indeed, data scientists, by their very nature, are hard to manage.

They love to solve problems, but those problems are not always the business problems you want them to tackle. They are ace players, but they're not always the best team players, and some of them can have difficulty dealing with (higher) management. They can have bright ideas, but they often lose interest when it comes to turning those ideas into a profit-making activity. They will find clever solutions for you, but they don't always excel at making sure a structured process is in place, let alone at the administrative follow-up that comes with it. Some of them were hired as 'rock stars' and have developed an ego to go with that...

On the other hand, they are (sometimes) the 'heroes' of the company, so you need to deal with it; it comes with the territory, as they say. Also, very often you can't apply the usual bag of tricks that 'ordinary managers' can use, simply because those tricks don't always work on them.

If your data scientists are all well behaved in this respect, this blog post is not for you. If you have experienced the issues I described above, read on!

One of the things I picked up early on as a manager was that a good manager should help his people rather than command them. Often I found myself doing things that my reports were asking me to do rather than doing what my manager was asking me to do. Mind you, I would take the general strategy and direction from my manager or the people above her/him, but to make it happen I often found it more useful to listen to what the people who were closer to reality were saying. I would help them become more efficient in achieving my goals, and my goals were generally the goals of my boss. I've always tried to avoid micromanagement and over-reliance on procedures. But I will admit that in some cases I did micromanage and I did emphasize procedures. The thing is that I only did that when a unit was in trouble, not when it was successfully achieving its goals.

Another thing I noticed is that data scientists, but also statisticians and some top coders, often have difficulty accepting orders from managers who don't have technical skills themselves. This does not mean that they would openly disobey, but rather that they would use some technical excuse to do whatever they wanted to do, knowing very well that the manager didn't have the technical knowledge to challenge them. Coming from an IT and statistics background gave me (just enough) credibility to be taken seriously, and that gave me a head start compared to other managers.

But nonetheless, I had my share of problems managing data scientists.  
When I was working for a large market research company a few years ago, I had to work with a lot of statisticians and the like. Some of them were direct reports, some of them indirect, and sometimes, horror of horrors, we were operating in a matrix organization. I believe I had some credit with them because I was able to speak the same (technical) language as they did. But still I had difficulty making sure standard procedures and administrative follow-up were handled correctly. Now, there are two opposite ways to react in such a situation. On the one hand, you can put all your energy into making sure the administrative procedures are followed, or you can let go of administrative follow-up completely. The former will make it very hard for you to get your ace players on board, because they generally hate this stuff, and the latter might cause problems with higher management, might create chaos and is seldom sustainable. So, as with most things in life, the truth is somewhere in the middle. But how do you prioritize?
 
When I tried to explain my view on these things, I found it useful to use the following schema:


This rule helped me focus on the priorities by not trying to force successful people and groups into a very rigid, process-driven structure, while on the other hand it was also a warning to those people and groups that they could only get away with it as long as they were successful. The rule also took some of the fear out of my teams that were in trouble: if they were in trouble but followed the normal procedures, there was no reason to be afraid. On the contrary, I would help them resolve the problem. I'm sure this may have led to situations you might call micromanagement, but at least it was micromanagement applied to dysfunctional groups, and it left the successful ones doing whatever they were doing. Essentially there's nothing new about this rule, and I guess you can't apply it in all situations or in all industries.
But for me, it worked. 



Thursday, October 17, 2013

A small experiment with Twitter's language detection algorithm

Some time ago I captured quite a lot of geo-located tweets for a spatial statistics project I'm doing. The tweets I collected were all confined to Belgium. One of the things I looked at was the language of the tweets. As you might know, Belgium officially has three languages: Dutch, French and German. Of course, when you analyze a large set of tweets you can't manually determine the language, but on the other hand blindly relying on Twitter's language detection algorithm doesn't feel right either.

That's why I set up a little experiment to assess to what extent Twitter's language detection algorithm can be trusted in the context of my geo-location project. I stress that context because I don't have the ambition to make overall judgments on how well Twitter handles language detection.

First, let's look at the languages of the 150,000 or so tweets I collected, as determined by Twitter's language detection algorithm. The bar chart below shows the frequency of each language.




I'm not sure whether this chart is readable enough, so let me guide you through it. The green bars are the three official languages of Belgium: Dutch, French and German. French and Dutch take the top positions; German is in seventh position. Based on population figures you would expect more Dutch posts than French posts, while these data show the opposite. There can be many good reasons for that. To start with the obvious, the Twitter population is not the general population, and hence the distribution of languages can differ as well. Another obvious reason is that tweets can also come from foreigners, tourists for instance. While the sample is large (about 150,000 tweets), I have to rely on Twitter to provide a good sample of all tweets, and I'm not too sure about that. Also, it may be that Dutch-speaking Belgians tweet more in English than their French-speaking counterparts. And finally, it is possible that Twitter's detection algorithm is more successful at detecting some languages than others.

The fact that English (the blue bar) comes in third will not come as a surprise. Turkish is fourth (the top red bar), which can be explained by the relatively large immigrant population from Turkey. The other languages, such as Spanish and Portuguese (the remaining red bars), decrease quite rapidly in frequency. But notice that the scale of the chart is somewhat deceiving: the lower-ranked languages such as Thai and Chinese, which are barely visible in the chart, still represent 40 and 20 tweets respectively. Overall this looks like another example of a power law, where a few languages account for the vast majority of tweets, while a large number of languages are used in the remaining ones.
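For completeness, here is a minimal sketch of how a language frequency chart like the one described above can be produced in R. It assumes the captured tweets sit in a data frame called tweets with the Twitter-detected language code in a column lang; both names are mine, not Twitter's.

  # Count the tweets per detected language and sort from most to least frequent
  lang_freq <- sort(table(tweets$lang), decreasing = TRUE)
  lang_freq

  # Horizontal bar chart; the long tail of rare languages is barely visible,
  # which is exactly the power-law pattern described above
  par(las = 1, mar = c(4, 6, 1, 1))
  barplot(rev(lang_freq), horiz = TRUE, xlab = "Number of tweets", cex.names = 0.7)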

You will have noticed that the fifth most important 'language', the orange bar, is "Undecided": these are the tweets for which the Twitter detection algorithm was not able to determine which language was used. Two other cases stand out (the purple bars on positions 9 and 10): Indonesian and Tagalog. Tagalog is an Austronesian language spoken in the Philippines. In a blog post on the Twitter languages of London, Ed Manley (@EdThink) noticed that Tagalog came in seventh place in London. He writes:
One issue with this approach that I did note was the surprising popularity of Tagalog, a language of the Philippines, which initially was identified as the 7th most tweeted language. On further investigation, I found that many of these classifications included just uses of English terms such as ‘hahahahaha’, ‘ahhhhhhh’ and ‘lololololol’.  I don’t know much about Tagalog but it sounds like a fun language.
Here are the first eight Tagalog-tagged tweets in my dataset:
  • @xxx hahaha!!! 
  • @xxx hahaha 
  • @xxx das ni goe eh? 
  • @xxx hahaha 
  • SUMBARIE ! 
  • Swedish couple named their kid "Brfxxccxxmnpcccclllmmnprxvclmnckssqlbb11116." The name is pronounced "Albin. 
  • #LRT hahahahahaha le salaud 
  • hahah
Basically, what we see in Belgium is very similar to what was observed in London: tweets containing expressions such as 'hahaha' are catalogued as Tagalog. So for my spatial statistics exercise (and for this experiment) I think it is safe to consider both Tagalog and Indonesian as Undecided.
(My thoughts go out to the poor researchers in the Philippines, who must face quite a challenge when they analyze Twitter data. On the other hand, they now have yet another good reason not to touch Twitter data ;-)

Back to the experiment. I took a simple random sample of 100 tweets and asked 4 coders (including myself) to determine in which language each tweet was written. I gave the coders only minimal instructions in an attempt not to influence them too much. I did provide them with a very simple 'coding scheme', based on the most common languages (Dutch, French or English, plus one category covering both the cases where the coder could not determine the language and all other languages). Now, this may sound like a trivial exercise, but a tweet like "I'm at Comme Chez Soi in Brussel" can be seen as English, French or Dutch, depending on how you interpret the instructions.
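As a side note, drawing and distributing such a validation sample takes only a few lines of R. A sketch, again assuming a data frame tweets with columns text and lang (my own names):

  set.seed(2013)                                  # arbitrary seed, for reproducibility
  idx <- sample(nrow(tweets), 100)                # simple random sample of 100 tweets

  # One sheet per coder, showing only the text so Twitter's label can't anchor them
  coding_sheet <- data.frame(text = tweets$text[idx], language = NA)
  write.csv(coding_sheet, "coding_sheet.csv", row.names = FALSE)

  # Keep Twitter's own assessment aside for the comparison later on
  twitter_lang <- tweets$lang[idx]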

This results in a data matrix of 100 rows and 5 columns (i.e. the language assessments of Twitter and of the 4 coders). A data scientist will immediately start thinking about how to analyze this (small) dataset, and there are many ways to do that. Let's start with the obvious, i.e. comparing the Twitter outcome with one of the coders. You can easily represent that in a frequency table:

     EN FR NL WN
  EN 14  2  1  2
  FR  3 34  0  3
  NL  1  0 24  0
  WN  5  5  1  5

The rows represent the language of a tweet according to Twitter (EN = English, FR = French, NL = Dutch and WN = don't know or another language); the columns represent the language according to the first coder. We now have several options. Some people run a chi-square test on this table, but that is not without problems. To start with, testing the hypothesis of independence is not necessarily relevant for assessing the agreement between two coders, and we can get into trouble with zero or near-zero cells and marginals. Either way, here are the results of such a test:

X-squared = 136.6476, df = 9, p-value < 2.2e-16

As the $p$-value is smaller than the usual 0.05, we reject the null hypothesis and thus accept that the two coders are not independent and hence somehow 'related'. Again, this seems a rather weak requirement given the coding task at hand. Also, $\chi^2$ is sensitive to sample size, so simply increasing the number of tweets would eventually lead to significance in case we hadn't reached it at $n=100$.
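For reference, the table and the test above can be reproduced with a few lines of R (the counts are the ones from the table; the object names are mine):

  langs <- c("EN", "FR", "NL", "WN")
  tab <- as.table(matrix(c(14,  2,  1, 2,
                            3, 34,  0, 3,
                            1,  0, 24, 0,
                            5,  5,  1, 5),
                         nrow = 4, byrow = TRUE,
                         dimnames = list(twitter = langs, coder1 = langs)))

  chisq.test(tab)   # X-squared = 136.6476, df = 9, p-value < 2.2e-16
                    # (R also warns that the approximation may be incorrect,
                    #  which echoes the near-zero cell issue mentioned above)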

One alternative is to normalize the $\chi^2$ statistic somehow. There are many ways to do that; one approach is to divide by the sample size $n$ and the number of categories (minus 1). This is called Cramér's V:
$$r_V=\sqrt{{\chi^2 \over n \times \min[R-1, C-1]}}$$,
where $C$ is the number of columns and $R$ the number of rows. Cramér's V is often used in statistics to measure the association between two categorical variables: it is 0 when there is no association at all, and perfect association gives 1. In this example $R=C=4$ because we consider 4 language categories, which results in $r_V=0.6749016$.

Sometimes simpler or at least more obvious approaches are used, such as taking the proportion of the items for which the two coders agreed. If we assume that both coders have used the same number of categories $G=R=C$, we can formalize this with:
$$r_{pca}= {\sum_{i=1}^G f_{ii}\over n}$$.
In the example this results in $r_{pca}=0.77$. So for more than three quarters of the tweets, Twitter and the first coder agree on the language.
The drawback here is that we don't account for chance agreement. Cohen's $\kappa$ is an alternative that does. The general recipe is to subtract from the original statistic its expected value, and to divide by the maximum value of the statistic minus that expected value. For Cohen's $\kappa$ this results in:
$$r_\kappa={r_{pca} - E(r_{pca})\over 1-E(r_{pca})}$$,
with
$$E(r_{pca})={\sum_{i=1}^G{f_{i.}\times f_{.i}\over n}\over n}$$,
in which $f_{i.}$ and $f_{.i}$ are the marginal frequencies. Calculating this for our example yields $r_\kappa=0.6766484$.
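To make these three indices concrete, the following R sketch computes them directly from the same confusion table (object names are mine; the values match those reported above):

  langs <- c("EN", "FR", "NL", "WN")
  tab <- matrix(c(14, 2, 1, 2,  3, 34, 0, 3,  1, 0, 24, 0,  5, 5, 1, 5),
                nrow = 4, byrow = TRUE, dimnames = list(twitter = langs, coder1 = langs))
  n <- sum(tab)

  # Cramér's V: normalized chi-square statistic
  chi2 <- unname(chisq.test(tab)$statistic)
  r_v  <- sqrt(chi2 / (n * (min(dim(tab)) - 1)))        # 0.6749016

  # Proportion of agreement: share of tweets on the diagonal
  r_pca <- sum(diag(tab)) / n                           # 0.77

  # Cohen's kappa: observed agreement corrected for chance agreement
  e_pca   <- sum(rowSums(tab) * colSums(tab) / n) / n   # expected agreement, 0.2887
  r_kappa <- (r_pca - e_pca) / (1 - e_pca)              # 0.6766484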

Yet another interesting alternative is the family of approaches that consider the ${n \choose 2}$ pairs of judgments rather than the $n$ judgments directly. This approach is popular in the cluster analysis and psychometrics literature, with indices such as the Rand Index and all sorts of variations on that index, such as the Hubert and Arabie Adjusted Rand Index. Recently I stumbled upon a very interesting article, "On the Equivalence of Cohen’s Kappa and the Hubert-Arabie Adjusted Rand Index" in the Journal of Classification by Matthijs J. Warrens, that I recommend very strongly.

But one issue that is tackled less often in the literature is the fact that in this type of situation we often have more than one coder or judge. The classical approach is then to calculate all pairwise combinations and take a decision from there.

Incidentally, there are a few areas of research where multiple coders are commonly used, notably qualitative research. Indeed, qualitative research has a long tradition of handling situations where 'subjectivity' can play an important role, very often by, among other things, using multiple coders. The literature on the methodology is quite separate from the mainstream statistical literature, but nonetheless there are some interesting things to learn from that field. One of the popular indices in qualitative research is Krippendorff's $\alpha$.

In content analysis, reliability data refer to a situation in which independent coders assign a value from a set of instructed values to a common set of units of analysis. This overall reliability or agreement is expressed as:
$$ \alpha=1-{D\over E(D)}$$,
in which $D$ is a disagreement measure and $E(D)$ its expectation; the details of the calculation would lead us too far here. A simple example is available on the Wikipedia page.

The index can be used for any number of coders, deals with missing data, and can handle different levels of measurement such as binary, nominal, ordinal, interval, and so on. It is claimed to 'adjust itself to small sample sizes of the reliability data'. It is not clear to me where and to what extent these claims are proven. Nonetheless, in practice this index is used as a single coefficient that allows one to compare reliabilities 'across any numbers of coders and values, different metrics, and unequal sample sizes'.

I used the irr library in R to calculate Krippendorff's $\alpha$ for all 5 coders, which resulted in $0.796$, just below the threshold of 0.8 that is commonly used in the social sciences. So we can't claim that all the coders, including Twitter, agree closely on the language detection task, but on the other hand we are not too far off what would be considered good.
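A minimal sketch of that calculation, assuming the five assessments are stored as the columns of a 100 x 5 data frame lang_codes (a name I made up): kripp.alpha() in the irr package expects a coders-by-units matrix, and I recode the language labels as integers to be on the safe side.

  library(irr)

  # Recode the language labels ("EN", "FR", ...) as integer categories
  codes <- matrix(as.integer(factor(as.matrix(lang_codes))),
                  nrow = nrow(lang_codes))

  # Transpose so that rows are coders and columns are tweets
  kripp.alpha(t(codes), method = "nominal")   # alpha = 0.796 for the data described here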

There were 84 tweets on which all 4 human coders agreed. For 71 of those, Twitter came up with the same language as the human coders. That's about 85%. Not excellent, but not bad either.
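The same data frame can be used to count these unanimous cases; a sketch, with lang_codes and its column names (twitter, coder1, ..., coder4) again being names I invented:

  human     <- lang_codes[, c("coder1", "coder2", "coder3", "coder4")]
  unanimous <- apply(human, 1, function(x) length(unique(x)) == 1)

  sum(unanimous)                                            # 84 tweets in this sample
  agree <- as.character(lang_codes$twitter[unanimous]) ==
           as.character(human$coder1[unanimous])
  mean(agree)                                               # roughly 0.85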

Let's take a look at a few examples where all 4 human coders agreed, but Twitter didn't:

  1. Deze shit is hard
  2. @xxxx Merci belle sœur
  3. @xxxx de domste is soms ook de snelste
  4. Just posted a photo @ Fontein Jubelpark / Fontaine Parc du Cinquantenaire
  5. Mddrrr j'ziar ..!!
  6. @xxxx ADORABLE!
  7. OGBU EH! Samba don wound Tiki Taka. The Champs are back!
  8. I'm at Proxy Delhaize (Sint-Gillis / Saint-Gilles, Brussels)
Examples 1, 4 and 8 seem intrinsically hard because there is no single correct answer, so we can't hold those against Twitter. Examples 2, 3 and 6 seem to be very straightforward cases that Twitter didn't capture. Example 5 was catalogued as French by Twitter, while the human coders put it in the rest/don't-know category.
All in all I believe that the number of obvious mistakes is not too high, although that assessment, of course, depends on the type of application. I can very well imagine that for some applications this is not good enough. 

Based on all the different indices, interpretations and examples, my conclusion is that, for my spatial statistics project, the Twitter language detection algorithm is not perfect but good enough. I will use the language it suggests, but only after regrouping and after making sure that Tagalog and the like are recoded as 'undecided'.