The mini State of Vim 2019

by Duarte O.C - July 2020

Why?
Analysis
Closing thoughts

Why?

However, at least for me, the Vim user base is still a mystery. Don't get me wrong, resources such as the Vim subreddit and the VimConf YouTube videos are great.

But I think that with a "data-driven" approach, we might actually find some interesting things about the VimUniverse!

Using the 2019 Stack Overflow Developer Survey, I will try to answer some questions about the Vim community.

Notes:

This is not meant to be a highly scientific and extensive analysis. I'm doing this for fun.
I'm using the 2019 survey because the 2020 version does not have information on IDEs used by respondents.

Analysis

0. Data and Notebook preparation

Let's start by importing some libraries and the data.

import pandas
import pathlib
import matplotlib.pyplot as plt
import numpy
from bokeh.plotting import figure, show
from bokeh.io import output_notebook
from bokeh.models import ColumnDataSource
from bokeh.transform import jitter
output_notebook()

We load the data in, and print some example rows from the survey.

to_print = 15
interesting_cols = ["DevEnviron", "Country", "Age", "ConvertedComp"]

# fetch the data from the folder
data_folder = pathlib.Path("data")
survey_file = data_folder / "2019" / "survey_results_public.csv"

# load it into pandas
raw_data = pandas.read_csv(survey_file)

# print the dataframe
raw_data[interesting_cols].sample(to_print)

Looks alright. Let's get started.

1. Where is Vim most popular?

# define constraints
editor_in_focus = "Vim"
columns_of_interest = ["Country", "DevEnviron"]
min_respondents_per_country = 50
color_main = "CornFlowerBlue"
color_sec = "Crimson"
figure_thres = 30

# create a dataframe only with VIM users
country_respondents = raw_data.Country.value_counts()
data = raw_data[raw_data.Country.isin(country_respondents.index[country_respondents.gt(min_respondents_per_country)])]
data = data.dropna(subset=columns_of_interest)
data = data[data.DevEnviron.str.contains(editor_in_focus)]

# convert the dataframe to arrays for bokeh
data_to_plot = (data.Country.value_counts() / raw_data.Country.value_counts()).dropna()
countries = data_to_plot.index.to_list()
usage = [v * 100 for v in data_to_plot.values]
sorted_countries = sorted(countries, key=lambda x: usage[countries.index(x)])

# create bokeh figure
tooltips = [("Country", "@x"), ("Percentage", "@top{1.11}%")]
fill_color = [color_main if usage[i] < figure_thres  else color_sec for i in range(len(usage))]
p = figure(x_range=countries, tooltips=tooltips, plot_width=1000, title=f"Percentage of survey respondents that are regular {editor_in_focus} users.")
p.vbar(x=countries, top=usage, width=0.9, color=fill_color)
p.xgrid.grid_line_color = "white"
p.y_range.start = 0
p.xaxis.major_label_orientation = "vertical"
show(p)

# print informative percentages
perc = round(data.shape[0] / raw_data.shape[0] * 100, 2)
print(f"In {raw_data.shape[0]} respondents, {data.shape[0]} use {editor_in_focus} regularly. (or {perc}%)")

In 88883 respondents, 21989 use Vim regularly. (or 24.74%)
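
The per-country share plotted above is the ratio of two `value_counts()` results. The following is a minimal sketch of that step on a made-up toy table (the countries `A`, `B`, `C` and all rows are hypothetical, not survey data), to show how the division aligns on the country index and why the `dropna()` is needed.

```python
import pandas as pd

# Toy stand-in for the survey; every row here is made up for illustration.
toy = pd.DataFrame({
    "Country": ["A", "A", "A", "A", "B", "B", "C"],
    "DevEnviron": ["Vim", "Vim;Atom", "Atom", None, "Vim", "Emacs", "Atom"],
})

# Keep rows whose editor list mentions Vim (missing answers count as "no").
vim_users = toy[toy.DevEnviron.str.contains("Vim", na=False)]

# The division aligns both counts on the country index. Countries without
# any Vim user have no numerator, come out as NaN, and are dropped.
share = (vim_users.Country.value_counts() / toy.Country.value_counts()).dropna() * 100

print(share.round(1).to_dict())
```

Note that the denominator counts every respondent of a country, including those who left the editor question blank (the fourth row of country `A` in this sketch), which mirrors dividing by `raw_data.Country.value_counts()`.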

Interestingly, the country with the highest share of regular Vim users is Paraguay 🇵🇾, where 44% (!) of respondents use Vim regularly!

Keep in mind that Paraguay has only 52 respondents in the survey, so it is fair to wonder whether that figure is representative.

Here are the ones that follow:

China 🇨🇳
South Korea 🇰🇷
Switzerland 🇨🇭
and the USA 🇺🇸

I'm particularly interested in why this might be the case, especially for Paraguay, Switzerland and South Korea.
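
The caveat about Paraguay's 52 respondents can be made a bit more concrete with a confidence interval. Below is a rough sketch using the Wilson score interval, assuming 23 Vim users out of 52 respondents (my own back-calculation from the 44% figure, not a number taken from the data) and treating respondents as a simple random sample, which a voluntary survey is not.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Approximate 95% Wilson score interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


# Assumption: 23 of 52 respondents, back-calculated from the 44% above.
low, high = wilson_interval(23, 52)
print(f"{low:.1%} to {high:.1%}")
```

Under these assumptions the interval spans roughly 32% to 58%, so Paraguay's lead over the other countries could easily be a small-sample effect.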

2. Where are our VimLadies?

# constraints and variables
editor_in_focus = "Vim"
gender_of_interest = "Woman"
columns_of_interest = ["Country", "DevEnviron"]
min_respondents_per_country = 50
color_main = "LightPink"
color_sec = "Plum"
thres = 3

# data only with female respondents
country_respondents = raw_data.Country.value_counts()
data = raw_data[raw_data.Country.isin(country_respondents.index[country_respondents.gt(min_respondents_per_country)])]
data = data.dropna(subset=columns_of_interest)
data = data[data.Gender == gender_of_interest]
data = data[data.DevEnviron.str.contains(editor_in_focus)]

# convert data for bokeh
data_to_plot = (data.Country.value_counts() / raw_data.Country.value_counts()).dropna()
countries = data_to_plot.index.to_list()
usage = [v * 100 for v in data_to_plot.values]
sorted_countries = sorted(countries, key=lambda x: usage[countries.index(x)])

# create bokeh figure
tooltips = [("Country", "@x"), ("Percentage", "@top{1.11}%")]
fill_color = [color_main if usage[i] < thres  else color_sec for i in range(len(usage))]
p = figure(x_range=countries, tooltips=tooltips, plot_width=1000, title=f"Female regular {editor_in_focus} users as a percentage of all respondents per country.")
p.vbar(x=countries, top=usage, width=0.9, color=fill_color)
p.xgrid.grid_line_color = "white"
p.y_range.start = 0
p.xaxis.major_label_orientation = "vertical"
show(p)

# print informative percentage
perc = round(data.shape[0] / raw_data.shape[0] * 100, 2)
print(f"In {raw_data.shape[0]} respondents, {data.shape[0]} respondents of gender {gender_of_interest} use {editor_in_focus} regularly. (or {perc}%)")

In 88883 respondents, 1171 respondents of gender Woman use Vim regularly. (or 1.32%)

Oh! South Korea 🇰🇷 pops up here again, along with Paraguay. But female Vim users make up, on average, only about 1% of the respondents of each country.
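
The 1% figure is measured against all respondents of a country, so it mixes two things: how many respondents are women, and how many of those women use Vim. A small sketch on made-up data (the countries and rows below are hypothetical) shows how the choice of denominator changes the answer.

```python
import pandas as pd

# Toy stand-in for the survey; every row here is made up for illustration.
toy = pd.DataFrame({
    "Country": ["A"] * 6 + ["B"] * 4,
    "Gender": ["Woman", "Woman", "Man", "Man", "Man", "Man",
               "Woman", "Man", "Man", "Man"],
    "DevEnviron": ["Vim", "Atom", "Vim", "Atom", "Atom", "Atom",
                   "Vim;Emacs", "Vim", "Atom", "Atom"],
})

women = toy[toy.Gender == "Woman"]
women_vim = women[women.DevEnviron.str.contains("Vim", na=False)]

# Denominator 1: every respondent in the country (what the chart above uses).
share_of_all = women_vim.Country.value_counts() / toy.Country.value_counts() * 100

# Denominator 2: only the women who responded in that country.
share_of_women = women_vim.Country.value_counts() / women.Country.value_counts() * 100

print(share_of_all.round(1).to_dict())
print(share_of_women.round(1).to_dict())
```

The second ratio answers "how popular is Vim among women?", which is arguably the closer match to the question in the section title.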

3. Are Vim users old?

# constraints
ide = "Vim"
color = "green"
column_of_interest = "Age"
columns_of_interest = ["DevEnviron"]
data = raw_data.dropna(subset=columns_of_interest)
editors = list(set([item for sublist in raw_data.DevEnviron.str.split(";").dropna().tolist() for item in sublist]))
editor_dict = {}

# build data
for editor in editors:
    # regex=False matches the name literally, so "Notepad++" needs no special case
    users = data[data.DevEnviron.str.contains(editor, regex=False)]
    editor_dict[editor] = users.dropna()[column_of_interest].values.mean()
    
# prepare data for graph
right = list(editor_dict.values())
y = list(editor_dict.keys())
sorted_y = sorted(y, key=lambda x: right[y.index(x)])

# plot the graph
fill_color = [color if y[i] == ide  else "lightgrey" for i in range(len(right))]
tooltips = [("IDE", "@y"), (column_of_interest, "@right{1.1}")]
p = figure(plot_height=500, plot_width=900, y_range=sorted_y, title=f"Average age of IDE users",  tooltips=tooltips)
p.hbar(y=y, height=0.5, left=0, right=right, color=fill_color,)
p.x_range.start = 29
p.x_range.end = 34
show(p)

Some interesting observations in this graph:

Users of "established" IDEs such as Komodo, TextMate, or Zend have a very high average age.
Emacs has the 5th oldest user base, at 31.7 years old.
Vim, on the other hand, has a relatively young user base, at roughly 30 years old.
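
One caveat on how these averages are built: selecting users by substring means that a search for "Visual Studio" also picks up everyone who listed "Visual Studio Code". A possible alternative, sketched here on made-up rows (the ages are hypothetical), is to split the `DevEnviron` string and group on exact editor names. This is only an illustration of the idea, not the method used for the chart above.

```python
import pandas as pd

# Toy stand-in for the survey; the ages are made up for illustration.
toy = pd.DataFrame({
    "DevEnviron": ["Visual Studio", "Visual Studio Code",
                   "Vim;Visual Studio Code", "Notepad++;Vim"],
    "Age": [40.0, 20.0, 30.0, 50.0],
})

# Substring matching: "Visual Studio" also catches "Visual Studio Code".
substring_mean = toy[toy.DevEnviron.str.contains("Visual Studio")]["Age"].mean()

# Exact matching: one row per (respondent, editor) pair, then group by editor.
exploded = toy.assign(editor=toy.DevEnviron.str.split(";")).explode("editor")
exact_means = exploded.groupby("editor")["Age"].mean()

print(substring_mean)
print(exact_means.to_dict())
```

Because no pattern matching is involved after the split, names with special characters such as "Notepad++" need no extra handling either.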

4. Are Vim users rich?

# constraints
ide = "Vim"
color = "green"
column_of_interest = "ConvertedComp"
columns_of_interest = ["DevEnviron"]
data = raw_data.dropna(subset=columns_of_interest)
editors = list(set([item for sublist in raw_data.DevEnviron.str.split(";").dropna().tolist() for item in sublist]))
editor_dict = {}

# build data source
for editor in editors:
    # regex=False matches the name literally, so "Notepad++" needs no special case
    users = data[data.DevEnviron.str.contains(editor, regex=False)]
    editor_dict[editor] = users.dropna()[column_of_interest].values.mean()
    
# prepare data for graph
right = list(editor_dict.values())
y = list(editor_dict.keys())
sorted_y = sorted(y, key=lambda x: right[y.index(x)])

# plot
fill_color = [color if y[i] == ide  else "lightgrey" for i in range(len(right))]
tooltips = [("IDE", "@y"), (column_of_interest, "@right{1.1}")]
p = figure(plot_height=500, plot_width=900, y_range=sorted_y, title=f"Average yearly salary in USD of IDE users",  tooltips=tooltips)
p.hbar(y=y, height=0.5, left=0, right=right, color=fill_color,)
p.below[0].formatter.use_scientific = False
p.x_range.start = 0
show(p)

Vim users earn on average 155k USD per year. This makes them the 4th best-paid "IDE users".

Moreover, the highest-paid developers are more likely to use Emacs on a regular basis!
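
A word of caution on these averages: salary data is heavily skewed, and a mean can be pulled up by a few very large values. As a quick illustration, the sketch below compares mean and median on the 11 non-missing `ConvertedComp` values from the 15 sample rows printed in section 0. It says nothing about the full survey; it only shows how much a single 2,000,000 USD entry can move the mean.

```python
import statistics

# Non-missing ConvertedComp values from the 15 sample rows printed in section 0.
sample_comp = [58433, 56000, 28212, 41244, 66796, 93000,
               6024, 2000000, 19116, 117763, 72400]

mean_comp = statistics.mean(sample_comp)
median_comp = statistics.median(sample_comp)

print(f"mean: {mean_comp:,.0f} USD, median: {median_comp:,.0f} USD")
```

On this small sample the mean is about 233k USD while the median is about 58k USD, so a median-based ranking of editors might look quite different from the one above.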

Closing thoughts

Well, this was a fun exercise. Along the way, we discovered some interesting things about Vim users:

While Vim is very popular in major developed countries such as the USA and China, some outliers such as Paraguay have a very large relative user base.
While Vim is mostly popular among men, these outlier countries appear to have a slightly larger share of female users.
Vim users appear to be young, and are quite well paid compared to the overall landscape.

Questions? Ideas for improvements? Just contact me!

Output of the data-loading cell in section 0 (15 sampled rows):

	DevEnviron	Country	Age	ConvertedComp
55318	IPython / Jupyter	Spain	34.0	58433.0
37727	Atom;Eclipse;IntelliJ;NetBeans;Notepad++;PyCharm	India	19.0	NaN
73285	Android Studio;Eclipse;IntelliJ;NetBeans;Visua...	Italy	17.0	NaN
53712	Visual Studio	United States	44.0	56000.0
2843	Android Studio;Coda;Eclipse;Notepad++;Visual S...	Israel	25.0	NaN
37781	Atom;Visual Studio Code	Philippines	35.0	28212.0
78404	Notepad++;Visual Studio	Poland	37.0	NaN
85097	Sublime Text;Visual Studio;Visual Studio Code	Portugal	30.0	41244.0
61350	Android Studio;Visual Studio Code	Canada	24.0	66796.0
35249	IntelliJ;Vim;Visual Studio Code	United States	29.0	93000.0
78770	Sublime Text;Visual Studio Code	Indonesia	20.0	6024.0
56147	Emacs;Visual Studio Code;Xcode	United States	24.0	2000000.0
36753	Visual Studio;Visual Studio Code	Portugal	23.0	19116.0
24216	Visual Studio;Visual Studio Code	United Kingdom	52.0	117763.0
26211	Vim;Visual Studio Code	United States	22.0	72400.0