I'm a data scientist working at the intersection of technology and design. Reformed astrophysicist & former e-Research/data consultant.
Research Profiles of Insight Data Science Fellows

The Insight Data Science Fellows Program is a seven-week postdoctoral fellowship designed by Jake Klamka (a high-energy physicist by training) that helps scientists bridge the gap from academia to data science careers in the tech industry. The program features mentoring by data experts from local companies (such as Facebook, Twitter, Google, and LinkedIn) that are at the forefront of big data challenges in industry. Originally aiming for 10 fellows, Klamka wound up accepting 30 from an applicant pool numbering more than 200 [1]. To date, more than 300 fellows* have participated in the program, not including fellows from Insight's expanded Data Engineering and Health Data Science programs.

The application deadline for the next round of the Insight Data Science Fellows Program is fast approaching (March 21st 2016 for the Silicon Valley and New York City programs), so I thought it timely to look at some of the projects and research profiles of past fellows and see where the Insight path has taken them.

Visualising the data using the D3js javascript libraries

My idea was to build some sort of interactive matrix (or a node-link, Force Layout diagram) using the d3js javascript libraries that could be used to explore the research profiles (or backgrounds) of fellows and see whether that revealed any trends. I also wanted to know what universities fellows come from. Based on nothing other than a hunch, I expected that many fellows would already be based in the Bay Area. The dearth of permanent academic positions makes data science a very attractive career option, and when people leave academia they usually stay where they are.

There is a wealth of information on the Insight alumni webpage, and the visualisations below are based on the data from that page. Incidentally, this is the same data I started using to map the technology companies in the "Mapping the Rise of Data Science Institutes Around the World" blog post. About a third of fellows are female, which is always great to see (I'll talk about coding and STEM in a future blog post), and there is a wide variety of research backgrounds: from the physical sciences (astronomy, physics, chemistry etc.) to medical imaging, psychology and political economics.

Based on the profiles of the few Insight Fellows that I know personally, Background appears to refer to their most recent research position prior to applying for a fellowship, rather than their PhD (or other) research qualifications. Regardless of their actual career path, previous position is a good indicator of the types of researchers the Insight program accepts, and for a small data visualisation project it's a good place to start.

Very quickly it became apparent that my first choice of visualisation (based on Mike Bostock's Les Miserables co-occurrence matrix) wasn't really going to cut it. While the correlation and frequency aspect of this visualisation is exactly what I was after, the 100-odd research disciplines and 90-odd universities were not going to display well. At this point my naive expectation of primarily a dozen or so US-based universities was completely shattered (but in a good way).

The d3js co-occurrence matrix

In the Les Miserables co-occurrence matrix, the data is visualised as an adjacency matrix, a square matrix used to represent a finite graph. The elements of the matrix indicate whether pairs of vertices are adjacent or not in the graph. This type of visualisation is commonly used in sociology, literature, and linguistics research, for example to determine the frequency of word pairs in a written text. A really nice feature of the d3js co-occurrence matrix is the way you can shuffle the data. Rob Simpson (@orbitingfrog) and Brooke Simmons (@vrooje) previously used it to explore the astronomical literature.  
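
As a rough sketch of the idea (not the code behind the visualisations in this post), the adjacency matrix for discipline/university pairs could be built along these lines. The sample records, the field names `discipline` and `university`, and the helper `buildCooccurrence` are all assumptions made up for illustration.

```javascript
// Hypothetical sample records; the field names are assumptions for illustration.
const fellows = [
  { discipline: "Astrophysics", university: "Stanford University" },
  { discipline: "Astrophysics", university: "Stanford University" },
  { discipline: "Mathematics", university: "University of California (Berkeley)" },
];

// Build a symmetric co-occurrence (adjacency) matrix in which disciplines
// and universities share the same set of row and column labels.
function buildCooccurrence(records) {
  // Distinct labels, in order of first appearance.
  const labels = [...new Set(records.flatMap(r => [r.discipline, r.university]))];
  const index = new Map(labels.map((label, i) => [label, i]));
  const n = labels.length;
  const matrix = Array.from({ length: n }, () => new Array(n).fill(0));

  for (const r of records) {
    const i = index.get(r.discipline);
    const j = index.get(r.university);
    matrix[i][j] += 1; // discipline row, university column
    matrix[j][i] += 1; // mirror entry keeps the matrix symmetric
  }
  return { labels, matrix };
}

const { labels, matrix } = buildCooccurrence(fellows);
```

Each non-zero cell then counts how many fellows share that discipline/university pair, which is the quantity a d3js template would draw as a shaded square.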

Somewhat perversely, I persisted** with it anyway, but I did limit it to the 118 Fellows with backgrounds in the physical sciences and mathematics, e.g. astronomy, physics, earth and climate sciences, mathematics and statistics. This represents roughly 30% of Insight fellows. 

The opacity of each cell represents the number of fellows with a specific combination of research discipline and university, relative to the total number of times the research discipline or university on the vertical axis appears. Darker cells indicate a greater number of fellows with that background. Colour is used to separate disciplines into the broader research areas of Earth & Climate Science (pink), Astrophysics & Cosmology (green), Theoretical & Applied Physics (orange) and Mathematics & Statistics (blue). Purple is used to separate out universities. The data can be re-ordered to show the frequency by university and broad research area. This interactive aspect of the d3 co-occurrence matrix is why I persisted in the first place.
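
One way to read the opacity rule above is "cell count divided by the row total". The sketch below follows that reading and also shows a simple re-ordering by frequency. The exact scaling used in the d3js template may differ, and the function names and the small `counts` matrix are hypothetical.

```javascript
// Hypothetical counts matrix: rows are the labels on the vertical axis.
const counts = [
  [0, 1, 0],
  [1, 0, 3],
  [0, 3, 0],
];

// Opacity of each cell = count relative to the total for its row
// (assumed reading of the rule; empty rows get opacity 0).
function rowNormalisedOpacity(matrix) {
  return matrix.map(row => {
    const total = row.reduce((sum, v) => sum + v, 0);
    return row.map(v => (total === 0 ? 0 : v / total));
  });
}

// Row indices sorted from the most to the least frequent label,
// which is the kind of ordering a "sort by frequency" control would apply.
function orderByFrequency(matrix) {
  const totals = matrix.map(row => row.reduce((sum, v) => sum + v, 0));
  return totals.map((_, i) => i).sort((a, b) => totals[b] - totals[a]);
}

const opacity = rowNormalisedOpacity(counts);
const order = orderByFrequency(counts);
```

Normalising by row means a cell is dark when a pairing accounts for most of that row's fellows, even if the absolute count is small.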

The co-occurrence matrix works reasonably well here as a way of showing the frequency of research discipline/university pairs. However, since university/university and discipline/discipline pairs don't mean anything in this context, those parts of the matrix are only useful in that they tell you the most represented universities.

Perhaps not surprisingly, Astronomy & Astrophysics and Mathematics dominate, with the most represented universities being the University of California (Berkeley) and Stanford University. Note that this is only for a subset of fellows and doesn't include other research areas, such as neuroscience, that are also highly represented.

The d3js heat map

Another way to visualise the data (in matrix form) and show the frequency of elements is to use a heat map. The following visualisation uses the information from all 300 Insight fellows. It shows the research profiles of fellows, represented by their previous research area and university prior to applying for an Insight Fellowship. The different colours represent the frequency of fellows having the same research area/university combination. Typically, the darker the colour, the greater the frequency. In this plot colours represent discrete numbers: one fellow (green), two fellows (teal), three fellows (dark blue), and four fellows (red, the maximum so far).
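
The discrete colour scheme described above amounts to counting fellows per research area/university pair and looking up a colour for each count. Here is a minimal sketch of that logic in plain javascript. The record fields, the helper names, the white fallback for empty cells and the exact CSS colour names are assumptions, not taken from the actual template.

```javascript
// Hypothetical sample records; field names are assumptions for illustration.
const records = [
  { area: "Neuroscience", university: "Stanford University" },
  { area: "Neuroscience", university: "Stanford University" },
  { area: "Physics", university: "University of California (Berkeley)" },
];

// Colours for the discrete counts described in the text
// (CSS colour names chosen here as an approximation).
const colourByCount = { 1: "green", 2: "teal", 3: "darkblue", 4: "red" };

// Count how many fellows share each area/university combination.
function countPairs(rows) {
  const counts = new Map();
  for (const { area, university } of rows) {
    const key = `${area}|${university}`;
    counts.set(key, (counts.get(key) || 0) + 1);
  }
  return counts;
}

// Map a count to its cell colour. Empty cells are white (an assumption),
// and anything above four is clamped to the top colour.
function cellColour(count) {
  if (count <= 0) return "white";
  return colourByCount[Math.min(count, 4)];
}

const pairCounts = countPairs(records);
const stanfordNeuro = cellColour(pairCounts.get("Neuroscience|Stanford University"));
```

A lookup table suits this case because there are only four possible values; a continuous colour scale would only become worthwhile if the counts grew much larger.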

The banding pattern highlights the most common research areas for the full dataset – Astronomy & Astrophysics, Physics, Mathematics, Neuroscience – and universities – Stanford University & University of California (Berkeley).

I'm sure there is a better way to represent the data. When I figure that out I'll let you know. In the meantime the data, HTML, CSS and custom d3js templates are available on GitHub. For HTML and CSS editing I recommend using Light Table.


* I don't have the number of applicants for each round.

** Understanding how HTML, CSS and javascript work together is a great skill to have.


[1]  Data Scientist: The Sexiest Job of the 21st Century, Harvard Business Review, 2012
