Big Universe, Big Data: Machine Learning and Image Analysis for Astronomy


by Jan Kremer, Kristoffer Stensbo-Smidt, Fabian Gieseke, Kim Steenstrup Pedersen, and Christian Igel, from the University of Copenhagen. Article published in IEEE Intelligent Systems, 32 (2):16-22, 2017

I recently came across the above conference paper/journal article on Twitter.

It talks about data rates for future astronomical surveys and facilities, specifically the Large Synoptic Survey Telescope (LSST), which will generate roughly 30 TB per night, and the Thirty Meter Telescope (TMT), and about the rise of citizen science projects to support part of the data analysis. Interestingly, none of the five authors are astronomy researchers; they are data scientists or computer scientists specialising in machine learning, big data analytics, computer vision and image analysis. I'm not sure how often large-scale astronomy projects feature in other disciplines, but it certainly got my attention. They bring a unique and valuable perspective to the discussion of big data in astronomy and the challenges that need to be faced in future large surveys.

The authors talk about how astronomical big data can trigger advancements in machine learning and image analysis. In astronomy I think it's quite rare that this is discussed in detail, other than noting that astronomy "data analysis techniques are translatable" and that large projects tend to drive innovation in large-scale computing and data management, as well as advancing the development of detector technologies, lightweight engineering and infrastructure, among other things. Perhaps because machine learning is at such an early stage in astronomy, we are only really starting to get a good handle on how useful it can be for research, let alone on improving the algorithms themselves.

In the second half of their paper the authors talk about describing the shape of a galaxy using machine learning, and how the

"star formation rate (SFR) could be predicted from the shape index, which measures the local structure around a pixel going from dark blobs over valley-, saddle point-, and ridge-like structures to white blobs." 
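To make the quoted description concrete, here is a minimal sketch of the shape index at a single pixel. It assumes the standard Koenderink and van Doorn definition, computed from the second derivatives of the image, with the sign chosen so that dark blobs map to -1 and white blobs to +1 as in the quote. The function names and the tiny synthetic images are my own, made up for illustration. In practice the derivatives would normally be taken after Gaussian smoothing at a chosen scale; that step is left out here, so this is not necessarily how the authors compute it.

```python
import math

def shape_index(img, x, y):
    """Shape index in [-1, 1] at pixel (x, y) of a 2D list-of-lists image.

    Assumes (x, y) is not on the image border.
    """
    # Second derivatives from central finite differences (no smoothing).
    lxx = img[y][x + 1] - 2 * img[y][x] + img[y][x - 1]
    lyy = img[y + 1][x] - 2 * img[y][x] + img[y - 1][x]
    lxy = (img[y + 1][x + 1] - img[y + 1][x - 1]
           - img[y - 1][x + 1] + img[y - 1][x - 1]) / 4.0

    # Minus the sum, and the difference, of the Hessian eigenvalues.
    num = -(lxx + lyy)
    den = math.sqrt((lxx - lyy) ** 2 + 4 * lxy ** 2)
    if num == 0 and den == 0:
        return math.nan  # perfectly flat patch: shape index is undefined
    return math.atan2(num, den) / (math.pi / 2)

def make_image(f, size=5):
    """Hypothetical helper: sample f(dx, dy) on a size x size grid centred on 0."""
    c = size // 2
    return [[f(x - c, y - c) for x in range(size)] for y in range(size)]

# Synthetic 5x5 test patterns (illustrative only, not real galaxy data).
patterns = {
    "dark blob": make_image(lambda x, y: x * x + y * y),
    "valley": make_image(lambda x, y: x * x),
    "saddle": make_image(lambda x, y: x * x - y * y),
    "ridge": make_image(lambda x, y: -(x * x)),
    "white blob": make_image(lambda x, y: -(x * x + y * y)),
}

for name, img in patterns.items():
    print(name, shape_index(img, 2, 2))
```

Evaluated at the central pixel, the five patterns step from -1 (dark blob) through -0.5 (valley), 0 (saddle) and +0.5 (ridge) to +1 (white blob), which is the ordering the quote describes.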

This is a pretty big claim, since it requires some understanding of how the properties of light in a given filter translate to the physics of stellar emission. I need to look at the two papers they reference to see how they extract information on SFR, or their ideas about how you could do this. Assuming you had a good working knowledge of the HST detectors and the image filters used, and a robust method for measuring the photometry, then it's possible that you could draw some reasonably good results from the shape index. Of course, what you don't see in an astronomical image (the interconnected dust lanes that obscure light) is scientifically just as important as what you do see.

They also talk about sample selection bias in machine learning, and note that this is a real challenge in astronomy. Training sets are typically created from "old" surveys, the most comprehensive being the Sloan Digital Sky Survey (SDSS). Future astronomy surveys, by contrast, will be taken with far superior cameras, with ground-based telescopes that have much larger collecting areas and so produce deeper images, and with space-based telescopes that will see the Universe quite literally in a different light. So the challenge is being able to create reasonably good proxy training sets. The effects of selection bias can be mitigated, to some extent, using importance weighting, where more weight is given to training examples that lie in regions of the feature space that are underrepresented in the training sample relative to the test sample, and less weight to those in overrepresented regions. The difficulty lies in reliably estimating these weights.
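As a rough illustration of importance weighting, the sketch below weights each training example by the ratio of the test density to the training density at its location in feature space. The densities are estimated with a simple one-dimensional histogram, which is my own stand-in rather than the estimator used in the paper. The magnitudes are made-up numbers, and the function name and the bin width are likewise assumptions.

```python
import math
from collections import Counter

def importance_weights(train, test, bin_width):
    """Weight each training value by p_test(bin) / p_train(bin).

    Histogram-based density-ratio estimate in one feature dimension.
    """
    def bin_of(value):
        return math.floor(value / bin_width)

    train_counts = Counter(bin_of(v) for v in train)
    test_counts = Counter(bin_of(v) for v in test)

    weights = []
    for v in train:
        b = bin_of(v)
        p_train = train_counts[b] / len(train)  # never zero: v itself is in bin b
        p_test = test_counts[b] / len(test)     # zero if no test object falls in bin b
        weights.append(p_test / p_train)
    return weights

# Made-up magnitudes: the "old" training survey is dominated by bright
# objects, while the deeper test survey is dominated by faint ones.
train_mag = [18.5, 18.6, 18.7, 19.5, 19.6, 20.5]
test_mag = [18.5, 19.5, 19.6, 20.4, 20.5, 20.6]

weights = importance_weights(train_mag, test_mag, bin_width=1.0)

train_mean = sum(train_mag) / len(train_mag)
test_mean = sum(test_mag) / len(test_mag)
weighted_mean = sum(w * m for w, m in zip(weights, train_mag)) / sum(weights)

print("weights:", [round(w, 2) for w in weights])
print("unweighted training mean:", round(train_mean, 3))
print("weighted training mean:", round(weighted_mean, 3))
print("test mean:", round(test_mean, 3))
```

With these toy numbers the faint training object is weighted up and the bright ones are weighted down, so the weighted training mean sits much closer to the test mean than the unweighted one does. The sketch also hints at why estimating the weights is hard: the histogram depends on the bin width, it scales poorly to many features, and no weighting can recover regions of feature space where the training set has no examples at all.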

In the final section, the authors address the issue of interpretability of machine learning models. This is something that I think about quite a lot when discussing the merits of data-driven discovery using data mining and machine learning models. Ultimately you're aiming to answer scientific questions, and you need to be able to feed observations and measured parameters back into theoretical models. The problem is that by using purely data-driven techniques you may end up with accurate models and predictions, but the results may not be meaningful, especially if they violate the laws of physics. On the other hand, using machine learning to extract or predict potentially interesting classes of objects is really useful, particularly for projects like LSST that will detect millions of transients each night...