Data-intensive research in the era of "big data"

A flurry of thoughts about data-intensive research in universities...

Across all disciplines, large, complex, rich datasets are playing a key role in transforming the culture and conduct of science and research in society. Social scientists, many of whom previously relied on relatively small samples and tools such as surveys and historical data, can now make use of digital, real-time and dynamic datasets and determine statistical trends for whole populations. Social media has also opened a door to new areas in digital humanities research, enabling researchers to answer complex questions and gain a global understanding of society and behaviour. Similarly, sports science, health sciences, economics and business now recognise the commercial value of large, complex and interlinked datasets, giving rise to "data analytics" as its own research discipline.

In the physical and medical sciences, researchers face a somewhat more complicated problem: the raw data itself. The high spatial and temporal resolution of functional magnetic resonance imaging of the brain, or functional MRI (fMRI), means that even small datasets (i.e. a few subjects/brains) can quickly exceed tens of gigabytes. The 3D (shape) and 4D (shape + time) nature of this data requires complex image analysis and advanced statistical methods.

Next-generation radio astronomy datasets are rapidly approaching petabyte scales (1 petabyte = 1000 terabytes, or roughly 2^50 bytes) and beyond, making storage and processing (particularly where bandwidth is limited) a massive headache. As the leading precursor to the Square Kilometre Array (SKA), the Murchison Widefield Array (MWA) is tackling the big-data challenges facing radio astronomy head on. The MWA's systems produce data streams of approximately 60 gigabits per second (the average Australian internet connection is 6.9 Mbps), which are processed in real time on site using GPU-based signal processing as the first stage in a hierarchical data-processing strategy. This setup means the MWA can produce data almost 8,700 times faster than the average Australian internet connection can download it [1].

While the so-called "data deluge" doesn't really affect individual optical astronomy datasets, at least not in terms of individual CCD images [2], the volume of data and catalogues that will be produced by the Large Synoptic Survey Telescope (LSST) verges on the ridiculous. The system will produce the deepest, widest image of the Universe. Its 27-ft (8.4-m) mirror is the width of a singles tennis court and feeds a 3200-megapixel camera. Each image on the sky will be the size of 40 full moons. Over 10 years, 37 billion stars and galaxies will be imaged, resulting in 10 million alerts, 1000 pairs of exposures, and 15 terabytes of data... every night! For those more familiar with the language of databases, this amounts to something like 37 billion "objects", 30 billion "detections", database tables of up to 50 trillion rows (the largest single table ~5 PB), and a total (compressed) data size of 83 petabytes [3]. Wow...
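As a back-of-envelope sanity check on those numbers, here is a minimal Python sketch that reproduces the ~8,700× MWA-to-broadband ratio and a cumulative LSST image volume. The observing duty cycle (nights per year) is my own rough assumption for illustration, not a figure from the cited sources.

```python
# Back-of-envelope arithmetic for the data rates quoted in the text.

mwa_rate_gbps = 60      # MWA on-site data stream (gigabits per second)
adsl_rate_mbps = 6.9    # average Australian connection (megabits per second)

# Put both rates in megabits per second, then compare.
ratio = (mwa_rate_gbps * 1000) / adsl_rate_mbps
print(f"MWA outpaces the average connection by ~{ratio:,.0f}x")  # ~8,696x

lsst_tb_per_night = 15  # LSST raw data per night (terabytes)
nights_per_year = 300   # ASSUMED duty cycle, for illustration only
survey_years = 10

raw_total_pb = lsst_tb_per_night * nights_per_year * survey_years / 1000
print(f"~{raw_total_pb:.0f} PB of raw images over the full survey")  # ~45 PB
```

The raw-image total lands well below the 83 PB quoted in [3] because that figure also counts the catalogues and database tables built on top of the images.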

In order to effectively harness large volumes of complex data, today's researchers need to acquire new skills, reliably access significantly better tools, or develop in-house applications supported by sustainable infrastructure. To remain competitive in academia, or to successfully transition into industry, data-intensive researchers need to expand their knowledge to include new and innovative methodologies that can be applied to research across all domains, adopting approaches that are fundamentally different from more traditional discipline-specific methods. This new interdisciplinary philosophy, or modus operandi, for how data-intensive research should be conducted has become synonymous with the more popular term "data science", and its practitioners – particularly in the business and technology industries – are typically referred to as "data scientists".

Data science offers a powerful new approach to making new discoveries and improving the way we do research. In essence, it entails a cultural shift in the way the academic community approaches issues around big data and scientific computing.

the value of data science institutes at universities

Data Science Institutes provide a platform for researchers to engage in interdisciplinary, innovative, technology-based research that is often difficult to pursue within individual research disciplines. They also offer researchers a supportive environment in which to become expert tool builders, with many researchers seeing this as an effective way to transition to the tech industry. Sadly, few universities recognise the value of dual researcher/software-developer roles, favouring instead the more traditional academic metrics (publications, grants, etc.). The downside to this will inevitably be a new kind of "academic brain drain", with increasing numbers of researchers leaving academia for more lucrative careers in industry.

data science as an alternative career path

Data scientists combine scientific expertise, computational knowledge and statistical skills to build critical tools and make new discoveries. The research community recognises the need for these skills, but the lack of academic incentives (and the publication- and funding-based metrics valued for career progression) has created a critical shortage of researchers in dual research/data-analysis and software-development roles. Science may be data-rich, but it will remain discovery-poor without the institutional commitment, people-power and technology needed to mine data and reveal hidden breakthroughs.

Fortunately, the technology industry has discovered this untapped resource. The rise of fellowship programs – for example, the Insight Data Science Fellows, Science to Data Science and Data Incubator programs – offers a variety of career options for trained astronomy professionals. Skilled data scientists are in great demand in the worldwide job market, and researchers with analytical backgrounds are being offered data science and data-lab management positions at leading tech companies. The interactive map above shows some of the technology companies hiring former academics as data scientists, based on the career paths of Insight Data Science Fellows. To date, more than 300 fellows have participated in the program, not including fellows from Insight's expanded Data Engineering and Health Data Science programs. I created the interactive visualisation for a recent blog post to explore the research backgrounds of these fellows; it was an interesting exercise, and I wasn't at all surprised to find many ex-astronomers.
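For anyone curious how a breakdown like this is put together, the sketch below tallies fellows' PhD fields from a flat file of profiles. The file name and column names ("fellows.csv", "phd_field", "company") are hypothetical stand-ins for whatever form the underlying career-path data takes.

```python
import csv
from collections import Counter

# Tally the research backgrounds of programme fellows from a CSV.
# NOTE: "fellows.csv" and its column names are hypothetical stand-ins.
field_counts = Counter()
company_counts = Counter()

with open("fellows.csv", newline="") as f:
    for row in csv.DictReader(f):
        field_counts[row["phd_field"].strip().title()] += 1
        company_counts[row["company"].strip()] += 1

# Ranked like this, the ex-astronomers are easy to spot.
for field, n in field_counts.most_common(10):
    print(f"{field:25} {n}")
```

A simple Counter is enough for the tallying; the interactive layer on top (the map itself) is just a plotting-library detail.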

According to DJ Patil, the former Chief Scientist at LinkedIn (who helped coin the term "data scientist") and now U.S. Chief Data Scientist with the White House Office of Science and Technology Policy (@WhiteHouseOSTP): “the best data scientists tend to be ‘hard scientists,’ particularly physicists, rather than computer science majors. Physicists have a strong mathematical background, computing skills, and come from a discipline in which survival depends on getting the most from the data. They have to think about the big picture, the big problem.”

The Rise of Data Science (TDWI, 2011)

Data Scientist: The Sexiest Job of the 21st Century (Harvard Business Review, 2012)

A Very Short History Of Data Science (Forbes Tech, 2013)


[1] "Radio astronomy backed by big data projects", Phys.org, April 21, 2015 (http://phys.org/news/2015-04-radio-astronomy-big.html)

[2] Also check out this amusing yet very informative 3-minute rant by Dr. Geert Barentsen (@GeertHub), presented at the .Astronomy 6 conference in Chicago, 2014.

[3] Jacek Becla (SLAC), "Enabling Scalable Data Analytics for LSST and Beyond through Qserv", conference talk (http://www.noao.edu/meetings/bigdata/schedule.php)