Research computing training for Australian astronomers

Spent the last few days putting together a short (well let's be honest, it's rather long) report on research computing training as part of Astronomy Australia Limited's (AAL) new Computing Infrastructure Planning Working Group. This is a closed working group that overlaps with the AeRAC Committee and to some extent replaces the High Performance Computing Working Group (HPCWG). The purpose of the working group is to assess the Australian astronomy community’s computing and software infrastructure requirements over the next 5 years, and provide recommendations to AAL on behalf of the community. Discussions include how to deal with next-generation datasets (i.e. petabyte and exabyte datasets from future telescopes), storing and accessing data, data processing, future data mining and analysis techniques, computing training for researchers, and consolidating resources and building capability within Australian astronomy.

Why is research computing training important?

The rise of “big data” and “data science” in the technology industry and its prevalence in academic research is creating a new generation of savvy astronomers who are adopting more general, industry-standard tools and practices. This has led many astronomy research groups and individuals to shift their efforts towards dealing with issues around data-intensive research, and to create new tools to address next-generation datasets. The majority of these tools are developed collaboratively by early- and mid-career researchers who combine software development/data science roles with research roles, and are widely used by the community.

Traditionally, astronomy researchers have no formal computing training. They tend to learn how to program in a somewhat ad hoc fashion, and their programming language of choice (e.g. Fortran, C/C++, IDL, Python, bash/shell scripting) is largely driven by their immediate peers, research group and supervisor. Observational astronomers tend to be well versed in IDL, Python and/or shell scripting, and have a pretty thorough understanding of astronomy data analysis software and tools, e.g. IRAF, PyRAF, IDL, SExtractor, AIPS, Miriad, Astropy, TOPCAT. Similarly, computational/theoretical astronomers typically have a good understanding of Python, C/C++ and GPU programming. Explaining what these software programs are to non-astronomers can be difficult, but it becomes really important when pursuing jobs outside of astronomy. Being able to effectively translate computing skills can make or break a job application.

The immensity of big data (in this context, terabyte and petabyte data rates) demands new approaches and techniques to store, analyse, interpret and explore data. In order to effectively harness large-volume, complex datasets, today’s astronomers need to be proficient in scientific computing and programming languages, be able to exploit the plethora of tools designed to better handle large datasets (e.g. HPC, databases, cloud computing), and have a good understanding of the data-mining methodologies appropriate for the next-generation astronomy surveys. Researchers involved in collaborative coding projects, or those who actively seek to learn new, more general (industry-standard) computing skills, tend to be the most successful at transitioning into non-astronomy "big data/data science" careers.

Programs that aim to teach software programming, increase uptake of more general tools, and/or promote best practice in scientific computing range from one-off to semi-regular workshops (e.g. SciCoder, Software Carpentry) to community-connected and regularly scheduled tutorials and meet-ups (e.g. Hacker Within). The result is that researchers are no longer limited to specific astronomy data analysis tools and traditional IT infrastructure. Satellite conferences and hackathons (e.g. SciPy Conference, DotAstronomy, Astro Hack Week) connect researchers, foster collaboration between research institutes, and provide a forum for engaging with industry.

In some cases universities have established successful Data Science Institutes that support dual astronomy research/technical roles and offer comprehensive training in data science and scientific computing. Notable institutes include the Berkeley Institute for Data Science, eScience Institute (UW), Centre for Data Science (NYU), and Centre for Data-Driven Discovery (Caltech). The majority of these were initiated by astronomers involved in large survey projects (e.g. LSST), or leaders in astroinformatics. Training in scientific computing and new methodologies is a core part of Institute activity. Promoting best practices in scientific computing, exploring new machine-learning techniques, initiating collaborative coding projects (e.g. Astropy), and developing community resources (e.g. IPython Notebook, Jupyter, AstroML, scikit-image, scikit-learn, GlueViz) has become increasingly important and reflects the broader cultural shift in the way data-intensive research is conducted. It also reflects a growing trend for astronomers to seek alternative careers outside of astronomy.

Computing training and workshops in Australia

The research/computing training needs of Australian astronomers can be divided into a number of categories: training critical for specific Australian-led survey projects, training specific to big data and high performance computing, and more general scientific computing/tools training that provides researchers at all levels with new skills, enabling them to be more effective researchers and/or preparing them for alternative career paths within astronomy or the tech industry.

The ANITA Chapter of the ASA is a well-established community of theoretical astronomers, and among other things provides scientific computing training specifically targeted to theoretical astronomers (particularly those at the early- and mid-career level). ANITA hosts an annual summer school and runs a series of somewhat ad hoc online workshops. Past workshops have addressed specific issues around big data and data mining, and include databases and SQL; applied, computational Bayesian theory; and R Statistics for astronomers. In recent years astro-informatics has also fallen under their domain. As a solid training platform, ANITA has the potential to expand its suite of workshops and lectures to include more rigorous scientific computing and data science methodologies.

Swinburne’s gSTAR supercomputing team offers community-wide HPC training for astronomers nationally and for Swinburne users from other disciplines. The gSTAR webinar series aims to support existing users and introduce prospective users to GPU programming, code testing and optimisation. The Pawsey Supercomputing Centre also offers an extensive on-site HPC training program and short courses.

In Australian astronomy, there appears to be a lack of formal, comprehensive training in astro-statistics, data mining, and machine learning, all of which are necessary for analysing large optical survey datasets (e.g. LSST, SkyMapper). Written by LSST and SDSS astronomers, Statistics, Data Mining, and Machine Learning in Astronomy (Ivezić, Ž., Connolly, A. J., VanderPlas, J. and Gray, A.) is now regarded as one of the foremost texts in astro-informatics, and in the US at least, there are a number of regular workshops/summer schools based around this text. Incidentally, another really great book is Effective Computation in Physics by Anthony Scopatz and Katy Huff.

Increased research training around ASVO projects, data access portals, data and software citation, version control, database languages (e.g. SQL, XML) and architectures, advanced visualisation (3D cubes, interactive plotting), linked-data analysis and representation (e.g. Python+GlueViz), statistical and graphical computing (e.g. R Statistics), and cloud computing (AWS, Digital Ocean, NeCTAR Cloud) would benefit the whole astronomy community. It would also provide early- and mid-career astronomers with the skills they need to transition into alternative careers, either in the tech industry (e.g. data science) or within astronomy (e.g. UI/web developer, software developer, or astronomy data science research and support roles).

The next challenge is to identify priorities and resourcing, and to come up with a strategy for effective and sustainable training programs.