I'm a data scientist working at the intersection of technology and design. Reformed astrophysicist & former e-Research/data consultant.

Mozilla Science Lab's Australasia community call

Last week I was invited to speak at Mozilla Science Lab's Australasia Community Call. It was a fantastic experience, and a real privilege to be invited into the community.

what are mozilla science lab community calls?

Mozilla Science Lab's meetings and Community Calls are a chance to find out what Mozilla Science Lab is up to, as well as to hear about community work relevant to science and the web. They are a great way to engage with Mozilla Science Lab, hear about its current projects and ways you can get involved, and take part in discussions about open science. The calls are held on the second Thursday of every month at 11 am Eastern, and you can join from anywhere in the world.

Three topics were discussed at this meeting:

  • Ariel Gold (@arielsgold) talked about Open Data at Amazon Web Services
  • Tim Dettrick (@tjdett) talked about DIT4C
  • Arna Karick (@drarnakarick) talked about Notebooks in Astronomy: Data Management and Reproducible Science

open data at amazon web services / Ariel Gold / @arielsgold

  • Open data on AWS: Amazon Web Services (AWS) provides a comprehensive tool kit for sharing and analyzing data at any scale. When organizations make data open on AWS, the public can analyze it quickly and easily with our scalable computing and analytics services.
  • AWS Public Data Sets: To enable more innovation, AWS hosts a selection of data sets that anyone can access for free. Previously, large data sets – such as Landsat satellite imagery or the mapping of the Human Genome – required hours or days to locate, download, and analyze. Through Public Data Sets on AWS, anyone can access these data sets and analyze them immediately using Amazon Elastic Compute Cloud (EC2) instances or Amazon Elastic MapReduce (EMR). Learn more at: http://aws.amazon.com/public-data-sets/.
  • We have committed to making up to 1 petabyte of USGS Landsat images readily available as objects on Amazon S3. This allows the data to be accessed programmatically via a RESTful interface and quickly deployed to any of our products for analysis and processing. Now anyone can analyze Landsat data at web scale with no significant up-front investment of time or capital expense. Learn more at: http://aws.amazon.com/public-data-sets/landsat/. 
  •  We have entered into a research agreement with the US National Oceanic and Atmospheric Administration (NOAA) to explore sustainable models to increase the output of open NOAA data. Publicly available NOAA data drives multi-billion dollar industries and critical research efforts. Under this new agreement, AWS and our collaborators will look at ways to push more NOAA data to the cloud and build an ecosystem of innovation around it. Learn more at: http://aws.amazon.com/noaa-big-data/.  
  • We have teamed up with the Square Kilometre Array (SKA) to create the new AstroCompute in the Cloud grant program. It aims to address “to infinity and beyond” sorts of problems, so that mature, high-quality data management and processing solutions are in place by the time the SKA starts to pump out data in 2020 or so. Learn more at: https://aws.amazon.com/blogs/aws/new-astrocompute-in-the-cloud-grants-program/.
  • NASA Open Earth Exchange (OpenNEX) is a collaboration and analytical platform that combines state-of-the-art supercomputing, Earth system modeling, workflow management, and NASA remote-sensing data. Through OpenNEX, anyone can now explore and analyze large Earth science data sets, run and share modeling algorithms, collaborate on new or existing projects, and exchange workflows and results within and among other science communities. Learn more at: http://aws.amazon.com/nasa/nex/.
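
As a rough illustration of what "accessed programmatically via a RESTful interface" can look like for objects on Amazon S3, here is a minimal Python sketch. It assumes a publicly readable bucket addressed with a virtual-hosted-style URL; the bucket name, key, and helper function names are hypothetical and were not part of the talk. The request is only prepared, not sent.

```python
from urllib.parse import quote
from urllib.request import Request


def s3_object_url(bucket: str, key: str) -> str:
    """Build a virtual-hosted-style HTTPS URL for an object in an S3 bucket."""
    # Percent-encode the key but keep "/" so the path hierarchy stays readable.
    return f"https://{bucket}.s3.amazonaws.com/{quote(key, safe='/')}"


def build_get_request(bucket: str, key: str) -> Request:
    """Prepare (but do not send) a plain HTTP GET for a publicly readable object."""
    return Request(s3_object_url(bucket, key), method="GET")


# Hypothetical bucket and key, for illustration only.
EXAMPLE_BUCKET = "example-open-data"
EXAMPLE_KEY = "scenes/2015/example scene/band4.tif"

if __name__ == "__main__":
    req = build_get_request(EXAMPLE_BUCKET, EXAMPLE_KEY)
    print(req.get_method(), req.full_url)
    # To actually download the object, pass `req` to urllib.request.urlopen().
```

Because each object has its own URL, any tool that can issue an HTTP GET can fetch a single file without downloading the whole data set first.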


  • [Arna] For projects like Landsat (and future SKA)  - How easy is it (or will be) to modify tools and datasets - i.e. is the data portal 'static'? 
  •  [Andrew] Where do we get more info re the mining in WA example? I joined the meeting late so you might have said something but I can't see a link above


 DIT4C / Tim Dettrick / @tjdett

  •  A cloud-based hosting platform for research tools with the lowest barrier to entry possible for the user. 
  • Required software: a modern web browser. That’s it. Chrome, Firefox, Safari (best effort, please report bugs!)
  • Network connectivity: HTTPS on TCP/443
  •  If you can perform a Google search in a modern web browser, then you should be able to use DIT4C.
  •  DIT4C tries to minimize bandwidth requirements where possible.
  • gzip compression by default
  • SPDY now, HTTP/2 by the end of the year
  • Minimal API for adding new research tools
  • Docker image with a single HTTP port exposed. That’s pretty much it.
  •  There are some existing images optimized for teaching, but jupyter/demo and others work too.
  • Teaching now, long-term research work later
  • A standardized environment is valuable for teaching purposes.
  • Long-lived containers present a lot of challenges, but that’s where we’re headed.
  •  Further reading: https://dit4c.github.io/ & https://github.com/dit4c/dit4c/
  • Open Container Project: https://www.opencontainers.org/
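
To give a feel for what "gzip compression by default" means for bandwidth, here is a small Python sketch of HTTP content negotiation. It is not DIT4C's implementation; the function name and the sample payload are hypothetical, and the negotiation is deliberately simplified (q-values such as "gzip;q=0" are ignored).

```python
import gzip


def encode_body(body: bytes, accept_encoding: str) -> tuple[dict[str, str], bytes]:
    """Return (headers, payload), gzip-compressing when the client allows it."""
    # Simplified parsing of the Accept-Encoding request header: q-values are ignored.
    offered = {token.split(";")[0].strip().lower() for token in accept_encoding.split(",")}
    if "gzip" in offered:
        payload = gzip.compress(body)
        headers = {"Content-Encoding": "gzip", "Vary": "Accept-Encoding"}
    else:
        payload = body
        headers = {}
    headers["Content-Length"] = str(len(payload))
    return headers, payload


# Hypothetical, highly repetitive HTML, similar in spirit to a notebook page.
SAMPLE = b"<li class='cell'>In [1]: import numpy as np</li>\n" * 200

if __name__ == "__main__":
    headers, payload = encode_body(SAMPLE, "gzip, deflate")
    print(f"{len(SAMPLE)} bytes uncompressed, {len(payload)} bytes on the wire")
```

Text-heavy responses such as HTML, JavaScript and notebook JSON tend to compress well, which is why this matters on slow connections.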

Self-drive demo:

  • Go to https://resbaz.cloud.edu.au/, click the "Login" button and log in with your GitHub or Twitter account, whichever is more convenient.
  • Go to the "Compute Nodes" tab and click "Claim Compute Node Access".
  • From the drop-down menu, select the name of the compute node (mslacc-demo) and enter the access code (KB9-645-PDC-CDL).
  • Go to the "containers" tab and add a new container. The name is only visible to you, so write whatever you'd like.
  • Select an image from the drop-down menu, set the initial state to "on" and hit the create button.
  • When the container is "on", its name should turn blue and you can click on it to launch your environment in a new browser tab.
  • Once you're finished for the day, turn your container "off" to save resources.


  •  [Andrew] Safari? Thanks for answer - will give it a bash ;-)
  •  may I please have an NLTK compute node access code? cobi.smith@unimelb.edu.au if need be Tim :)

notebooks in astronomy / Arna Karick / @drarnakarick

  • Research data management @Swinburne: no real strategy around Open Data/Reproducible science, but academic integrity is really important. Our ITS Cloud Strategy is woefully behind what is available commercially. It's also very much focussed on serving the business side of the university.
  • Building a one-size-fits-all metadata store to make research data public is tricky. Swinburne data collections range from simple spreadsheet datasets to 100s of petabytes and growing (e.g. astro data).
  • We do have a number of open data projects: the Australian Policy Online Linked Data project: http://apo.org.au/content/linked-data-project and the Theoretical Astrophysical Observatory: https://tao.asvo.org.au/tao/ (w/ NeCTAR), which are cloud based.
  • Swinburne Pulsar Portal – Data Sharing Cluster on our gSTAR supercomputer: Currently hosts 2 radio astronomy datasets ~50-70GB each. 
  • On VicNode we have (had?) Smaug—Hydrodynamic Simulations of First Galaxy Formation ~70TB merit allocation - future funding unclear but ~$10,000/year.
  • Great idea in principle but expensive in practice for large datasets - payoff is unclear. Cloud repos like Figshare etc. don't cut it for these types of datasets. Long-term sustainability is an issue.
  • Still a gap for 'every day research'.
  •  Individual researchers are to some extent embracing the idea of open data (and code) and reproducible science.
  •  Swinburne Hacker Within (SHW) session on 'Getting the most out of iPython Notebooks' http://thehackerwithin.github.io/swinburne/posts/iPython_notebooks/ - turned into more of a philosophical discussion. Web-based Notebooks are appealing for a number of reasons: (1) describing how you analysed data or wrote your pipeline and sharing this with your collaborators, (2) creating tutorials/teaching materials, (3) supplementing public code (e.g. on GitHub) with a more comprehensive description than just comment statements, (4) creating webpages that include HTML, text and code, or writing how-to guides etc., (5) as an additional electronic Appendix for a paper.
  • Reproducible science is really tricky in astro/physical sciences. There are a lot of different datasets and a lot of different programming languages (so IPython Notebook is not really set up to be a "recipe book" that handles multiple coding languages, some of which may run on a supercomputer or require a license). More importantly, analysis and code are not necessarily linear. Still not sure of the best way to do this - perhaps in small chunks using multiple tools, i.e. Journal paper + personal website + GitHub + Notebooks + Cloud Repo for data + [Data Storage - perhaps...]
  • Documenting code/data is seen as important, and it makes you a better coder; whether the results are reproducible is another matter. Astro Code Review started a few weeks ago: http://kaythaney.com/2013/08/08/experiment-exploring-code-review-for-science/
  • Related note: Astronomers are starting to take advantage of the Microsoft Azure for Research project (Award program); "facilitates and accelerates scholarly and scientific research by enabling researchers to use the power of Microsoft Azure to perform big data computations in the cloud. Take full advantage of the power and scalability of cloud computing for collaboration, computation, and data-intensive processing."  http://research.microsoft.com/en-us/projects/azure/default.aspx
  •  For research/training/SHW: We've thought about using DIT4C cloud instances. Until we run sessions with 50+ people (or have time constraints) we'll continue doing things locally. Researchers benefit enormously from installing their own software, even if it means going through the inevitable pain and suffering. We want them to learn how their laptop operates, how they can install programs, and how they can launch a VM through Digital Ocean, NeCTAR, DIT4C or another provider. For large research projects the cloud solution will need to be sustainable on 3+ year timescales.


  • Would AARNet CloudStor+ (based on ownCloud) be a viable option for <100GB datasets? https://cloudstor.aarnet.edu.au/plus/
  • DIT4C is looking at integrating with AARNet CloudStor+ this year.
  • Different to original AARNet CloudStor. Much more DropBox-like.
  • Accessible via WebDAV after you set a sync password: https://cloudstor.aarnet.edu.au/plus/remote.php/webdav/
  • Moving the 'compute to the data': http://trillianverse.org/
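
For anyone wanting to script against the WebDAV endpoint mentioned above, a directory listing is typically requested with a PROPFIND call. The Python sketch below only prepares such a request, it does not send it. It assumes the service accepts HTTP Basic authentication with the sync password (common for ownCloud-based services, but not confirmed in the call); the function name and the credentials are hypothetical.

```python
import base64
from urllib.request import Request

WEBDAV_ROOT = "https://cloudstor.aarnet.edu.au/plus/remote.php/webdav/"


def build_propfind(url: str, username: str, sync_password: str, depth: str = "1") -> Request:
    """Prepare (but do not send) a WebDAV PROPFIND request that lists a collection."""
    # Assumption: the server accepts HTTP Basic auth with the sync password.
    token = base64.b64encode(f"{username}:{sync_password}".encode("utf-8")).decode("ascii")
    return Request(
        url,
        method="PROPFIND",
        headers={
            # "0" = the collection itself, "1" = the collection plus its direct children.
            "Depth": depth,
            "Authorization": f"Basic {token}",
        },
    )


if __name__ == "__main__":
    # Hypothetical credentials, for illustration only.
    req = build_propfind(WEBDAV_ROOT, "user@example.edu.au", "not-a-real-password")
    print(req.get_method(), req.full_url)
    # To run it for real, pass `req` to urllib.request.urlopen() and parse the XML reply.
```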
