I'm a data scientist working at the intersection of technology and design. Reformed astrophysicist & former e-Research/data consultant.

The Human Face of Big Data

As part of the lead-up to the Transitions Film Festival, I attended a screening at Federation Square of the documentary The Human Face of Big Data. This astonishing film explores how the visualisation of data streaming in from satellites, billions of cheap and ubiquitous sensors, and smartphones is enabling us, individually and collectively as a society, to sense, measure and understand aspects of our existence in ways never possible before. Every single thing that we do leaves a digital trace. In fact, more data has been generated since 2003 than in all of previous recorded history. Petabytes (1,000 terabytes) and exabytes (1,000 petabytes) of data are now routinely collected, visualised and interpreted (...for better or worse). The more information we get, the larger the problems that can be solved.

In The Human Face of Big Data, Rick Smolan (@ricksmolan), a former Time, Life, and National Geographic photographer famous for creating the Day in the Life book series, and author Jennifer Erwitt examine how today's digital onslaught and emerging technologies can help us better understand and improve the human condition--ourselves, interactions with each other, and the planet.

Visually, the film is quite stunning: a visual feast of data flows (incidentally, it won 'Best Cinematography' at the 2014 Boston International Film Festival). I was completely blown away by the scale of ambition of the projects.

The predictive power of data was demonstrated by the Google Flu Trends analysis (Letter to Nature: Detecting influenza epidemics using search engine query data, 2009). The paper showed that search data, if properly tuned to the flu tracking information from the Centers for Disease Control and Prevention (CDC), could produce accurate estimates of flu prevalence two weeks earlier than the CDC's own data. It demonstrated that Google search data ("digital refuse") could produce potentially life-saving, data-driven insights. But a few years later it failed, and it failed spectacularly (although this story isn't discussed in the film). It failed in part because of the sheer amount of data, because correlation does not always imply causation, and because it's notoriously difficult for people to differentiate between a serious flu virus, a common cold, and that general feeling of being run down. Aside from the data analysis issues, the research also raised a number of privacy concerns.

The film also showcased the Chicago: City of Big Data project, which I've blogged about before. One of the coolest data stories was the idea of cities as responsive organisms. Suppose your mobile phone or a simple sensor could track a car's vertical motion as it drove over a pothole. Suppose the slight change in altitude (and the geolocation) could be transmitted directly back to a data centre that produced a real-time, "living" map of all the potholes in a city. The frequency of cars triggering "pothole events" could then be used to determine priority areas and the order in which potholes are filled. Data-Smart City Solutions, an initiative of the Harvard Kennedy School's Ash Center for Democratic Governance and Innovation, is one example of "responsive cities" research working at the intersection of government and data. The project uses open data, predictive analytics and civic engagement technology to better discover and preemptively address civic problems.
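To make the pothole idea a little more concrete, here is a minimal Python sketch of how such a "living" map might be prioritised. Everything in it is an assumption for illustration only: the reading format (latitude, longitude, vertical acceleration in g), the 0.5 g threshold, the grid size and the function names are all hypothetical, and this is not how the Chicago project or any real system is implemented.

```python
from collections import Counter

# Hypothetical reading format: (latitude, longitude, vertical_acceleration_g).
# The threshold and grid precision below are illustrative guesses, not real values.

def detect_pothole_events(readings, threshold_g=0.5):
    """Return the (lat, lon) of every reading whose vertical jolt exceeds the threshold."""
    return [(lat, lon) for lat, lon, accel in readings if abs(accel) > threshold_g]

def grid_cell(lat, lon, precision=3):
    """Snap a coordinate to a coarse grid so nearby hits count as the same pothole."""
    return (round(lat, precision), round(lon, precision))

def rank_potholes(readings, threshold_g=0.5, precision=3):
    """Count pothole events per grid cell, most frequently hit cells first."""
    events = detect_pothole_events(readings, threshold_g)
    counts = Counter(grid_cell(lat, lon, precision) for lat, lon in events)
    return counts.most_common()

if __name__ == "__main__":
    # Made-up sample readings, purely for demonstration.
    sample = [
        (41.8781, -87.6298, 0.9),   # big jolt
        (41.8782, -87.6297, 0.7),   # another car, same spot
        (41.8781, -87.6298, 0.1),   # smooth ride, ignored
        (41.8819, -87.6231, -0.8),  # jolt elsewhere
    ]
    for cell, hits in rank_potholes(sample):
        print(cell, hits)
```

The cell hit most often comes out first, which is the "frequency determines priority" idea in its simplest form. A real system would also need to handle noisy sensors, speed bumps and GPS drift.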

Twitter alerts also featured in the film. I do love Twitter. It's often underrated in the research community, but if used well it can be a powerful tool for connecting with other researchers and projects. Admittedly it can be a time waster, but fortunately there are some novel tools, such as Buffer, that help manage and optimise accounts to maximise impact with minimum time commitment. As a data source, Twitter is a gold mine. Scraping and analysing the data can be done quite easily, depending on what you want. For a quick primer I recommend working through Mining the Social Web, 2nd edition (Chapter 9 deals with Twitter: Jupyter Notebook). Data portals such as TrISMA: Tracking Infrastructure for Social Media Analysis can provide comprehensive datasets for bona fide researchers. Anyway, Twitter alerts have been used many times (in fact it's becoming routine) for disaster relief assessment and for figuring out immediate aid needs. In 2012, Twitter served as a lifeline of information during Hurricane Sandy (another data story told in the film), and social media has since been used, in part, to coordinate relief efforts for numerous natural disasters.

Perhaps the most remarkable story, and definitely one of the most ambitious data-driven stories I've come across, was Deb Roy's Human Speechome Project. For three years, Deb Roy (an Associate Professor at MIT's Media Lab) and his wife Rupal Patel (an Associate Professor in speech language pathology and computer science at Northeastern University) tracked the motion, behaviour and language development of their infant child, using 11 fish-eye, ceiling-mounted video cameras and 14 microphones strategically placed around the family's house. 200 gigabytes of data was generated every day, including every "ga ga", "dada" and burp. Every 40 days the data was hand-delivered by Roy to MIT, where the information was categorised and analysed in a 200 terabyte storage facility. They were able to demonstrate that language is learned not just by repetition, but by hearing words in different contexts.
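As a rough sanity check on those figures, the arithmetic can be sketched in a few lines of Python. It assumes recording on every day of the three years and decimal units (1 TB = 1,000 GB); both are my assumptions, not figures from the film.

```python
# Back-of-envelope check of the Speechome data volumes quoted above.
# Assumes recording every day for three years and 1 TB = 1,000 GB (assumptions).
GB_PER_DAY = 200
DAYS_BETWEEN_DELIVERIES = 40
YEARS = 3
GB_PER_TB = 1000

per_delivery_tb = GB_PER_DAY * DAYS_BETWEEN_DELIVERIES / GB_PER_TB
total_tb = GB_PER_DAY * 365 * YEARS / GB_PER_TB

print(f"Each 40-day delivery: {per_delivery_tb} TB")
print(f"Three-year total: {total_tb} TB")
```

That works out to about 8 TB per hand delivery and roughly 219 TB over three years, which is in the same ballpark as the 200 terabyte storage facility.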

If you're interested in reading more about the film and its director, I recommend the following links. There is also a book version of the film and a most excellent (and free!) interactive iPad app available through iTunes.

Articles and Interviews
