Stack Overflow’s mishap with self-reported data - tabs-vs-spaces debacle
After writing the blog post about Interactive programming for Machine Learning, I wanted to demonstrate in practice how this methodology and toolset help solve real-world problems and speed up prototyping.
Lately I came across a controversial article about developers and money: Developers Who Use Spaces Make More Money Than Those Who Use Tabs. The explanation of this phenomenon did not really resonate with me, and I too eagerly assumed that it would turn out to be a typical mistake caused by confusing correlation with causation.
I thought it would be a good topic to demonstrate Interactive programming in action, so I sat down to analyze the data using R and Python. I gathered some satirical material, like the claim that we need more pirates on our seas because in their time there was no global warming, and prepared some pictures showing what confusing correlation with causation looks like.
You can find more amusing examples here. But something was still not right.
Where SoftwareMill stands out is that, when working for our clients, we are used to evaluating our assumptions as fast as possible. Whenever possible, we steer our clients away from long and expensive solutions; we go beyond code and our own expertise to find simpler and cheaper ones. The same goes here. Before I moved on to create more plots and more advanced analyses, I decided to get the big picture of the entire problem. I assumed that the best way would be to clear my head and start over by filling in the questionnaire myself. It turned out that the answer was not in the data itself, and neither more plots nor more analysis would have helped. But first things first.
Let's go through Stack Overflow's blog post from the beginning to get to the essence of the problem.
We can read in the text that:
There were 28,657 survey respondents who provided an answer to tabs versus spaces and who considered themselves a professional developer (as opposed to a student or former programmer). Within this group, 40.7% use tabs and 41.8% use spaces (with 17.5% using both).
But if we analyze repositories on GitHub, we find that spaces are in fact clearly more common than tabs. You can see one such analysis here, done by a Googler: 400,000 GitHub repositories, 1 billion files, 14 terabytes of code: Spaces or Tabs?
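To make concrete what "a repository uses spaces" means at the file level, here is a minimal sketch that classifies a source text by the character its indented lines start with. It is only an illustration, not the method used in the linked analysis; the function name `classify_indentation` and the four labels are my own assumptions.

```python
from collections import Counter


def classify_indentation(source: str) -> str:
    """Classify a source text by its leading whitespace.

    Hypothetical helper: returns 'tabs', 'spaces', 'mixed', or 'none'.
    Only the first character of each line is inspected.
    """
    counts = Counter()
    for line in source.splitlines():
        if line.startswith("\t"):
            counts["tabs"] += 1
        elif line.startswith(" "):
            counts["spaces"] += 1
    if not counts:
        return "none"   # no indented lines at all
    if len(counts) == 2:
        return "mixed"  # both tab- and space-indented lines present
    return next(iter(counts))
```

Note that this looks at the characters stored in the file, which says nothing about which key the author pressed to produce them.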
So, where do those almost equal proportions of people using tabs and spaces in Stack Overflow's results come from? Let's look closely at the question in the survey:
Tabs or Spaces?
It is clear, right? What represents indentation in your code, tabs or spaces? Now watch this:
If you do not have time to watch it: in the video, one of the characters uses spaces to indent code, but... by pressing the spacebar. What?! OK, let's assume that this is done only for satirical effect, that no one indents code by pressing space, and that no one would read the survey question that way. Or am I wrong? Now let's jump to the least upvoted comments under Stack Overflow's blog post (least upvoted also means less visible to the general audience):
Am a bit confused, what does matter in this study ? is it press TAB versus pressing SPACEBAR on the keyboard, or is it comparing the spaces characters VS tabulations in your code ?
How about (me, for example) using 2 or 4 spaced tabs to prevent repeating keystrokes and let the code formatter convert it to spaces? Tabs are like variables; if my team wants to use 3 spaces per indent, I modify it and... ta-daa
So, does this mean that we need to change how we TEACH coding, and tell the students that they need to use the space bar rather than tabs when indenting?
There are also dozens of comments explaining that no one uses the spacebar for indentation. So yes, the answers to this question do not give us a clear picture. First of all, the question is ambiguous, especially for less experienced and less skilled developers. Secondly, most IDEs are configured by default to insert 2-4 spaces when we press Tab (and the analysis of GitHub repositories mentioned above confirms that this default setting persists, at least in public repositories). Again, less experienced developers may not know about this default setting.
So my theory is that less experienced and less skilled, and therefore lower-earning, developers did not answer in the survey that they use spaces because... they thought the question was about physically pressing the spacebar, or did not know that they are inserting spaces when pressing Tab.
Could that be the case? Let’s see how the most junior programmers (less than 2 years of programming) answered compared with the general population:
They tend to strongly overreport using tabs. It seems we’ve found yet another case where relying on self-reported data got us into trouble. This is a common problem with surveys: data often gets muddied because different population groups understand the question differently.
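The comparison above can be sketched in a few lines of plain Python. This is a hedged sketch, not the exact code behind the plot: the column names `TabsSpaces` and `YearsProgram`, the two junior labels, the `"NA"` marker for missing answers, and the file path passed to `load_rows` are all assumptions about the survey CSV, so check them against the raw data before relying on the output.

```python
import csv
from collections import Counter

# Assumed labels for "less than 2 years of programming" in the survey export.
JUNIOR_LEVELS = {"Less than a year", "1 to 2 years"}
MISSING = (None, "", "NA")  # assumed markers for unanswered questions


def load_rows(path):
    """Read the survey CSV into a list of dicts (path is supplied by the caller)."""
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


def answer_shares(rows, answer_key="TabsSpaces"):
    """Return the share of each answer among rows that answered the question."""
    counts = Counter(
        row[answer_key] for row in rows if row.get(answer_key) not in MISSING
    )
    total = sum(counts.values())
    return {answer: n / total for answer, n in counts.items()} if total else {}


def junior_vs_all(rows, years_key="YearsProgram"):
    """Compare answer shares of junior respondents with all respondents."""
    juniors = [row for row in rows if row.get(years_key) in JUNIOR_LEVELS]
    return {"junior": answer_shares(juniors), "all": answer_shares(rows)}
```

Calling `junior_vs_all(load_rows(path))` on the downloaded survey file returns two dictionaries of proportions that can be plotted side by side. To match the blog post's population, you would first filter the rows down to professional developers.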
So shouldn’t a more precise title for Stack Overflow’s blog post be:
Developers who KNOW they are using spaces earn more than other developers? ;)
You can access raw data and investigate it yourself here.