Why are we, at Soroco, inspired by astronomers - the OG computer hackers? [Part I]

Rohan Murty

Overview

As a company, the more we build and scale Scout, the more parallels we see with how astronomers collect data, build pipelines, and find patterns in astronomical data. Like astronomers, we at Soroco worry about data capture, cleanliness, noise, aggregation, clustering, and filtering. We deal with a comparable scale of data capture on Scout, and we focus just as much on storage and indexing. And finally, Scout too has several pipelines for extracting faint patterns from data, with many parallels to how astronomers find patterns in theirs. In some cases, however, the challenges we see with Scout data exceed those astronomers face. For example, astronomers often have the luxury of searching for a pre-defined template or pattern, and the rate at which Scout gathers data far exceeds the rate at which astronomical data gathering is growing (doubling roughly every two years). Hence, we look to astronomy to learn from the parallels, while also drawing inspiration to solve the problems specific to Scout.
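To make the "pre-defined template" idea concrete, here is a minimal, purely illustrative sketch (not Soroco's or LIGO's actual pipeline, and all values are invented) of matched filtering, the classic technique for finding a known template, such as a gravitational-wave chirp, buried in noisy data:

```python
import numpy as np

# Illustrative sketch only: matched filtering, i.e. searching noisy data
# for a pre-defined template by cross-correlation.

rng = np.random.default_rng(7)

# A known template: a short "chirp" whose frequency sweeps upward over time.
t = np.linspace(0.0, 1.0, 200)
template = np.sin(2.0 * np.pi * (5.0 * t + 10.0 * t**2))

# Synthetic data stream: Gaussian noise with the template injected at a
# known offset. In a real search, the offset is what must be recovered.
n_samples = 5000
inject_at = 3200
data = rng.normal(0.0, 1.0, n_samples)
data[inject_at:inject_at + template.size] += template

# Matched filter: slide the template along the data and cross-correlate.
# The correlation peaks at the offset where data and template align best.
correlation = np.correlate(data, template, mode="valid")
best_offset = int(np.argmax(correlation))

print(best_offset)  # expected to land at, or very near, inject_at
```

The key point is that knowing the template's shape in advance turns detection into a simple peak search over correlations; much of Scout's difficulty comes precisely from not having such a template up front.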

Recently, we at Soroco began working with Shravan Hanasoge, a professor at the Tata Institute of Fundamental Research (TIFR), on computational questions tied to Scout, looking for inspiration in how Shravan and his team think about finding patterns to detect gravitational waves. This is not the first time we have collaborated with scientists and engineers working in astronomy; we even interview and recruit astronomers (apply here, if you’re interested!). So, when a couple of my younger colleagues asked why we interview astronomers, read their papers, and even collaborate with them, we figured this point of view may be of broader interest to the community as well. It is not a conventional view among software companies, and we believe it is fairly unique to how we, at Soroco, solve problems at scale, think about team composition, and value diverse talent. Our hope is that this article will do justice to astronomers, and that readers will come to share our enthusiasm for them.

Astronomy and computing at scale

Astronomy projects are, quite literally, about finding the needle in the haystack at scale. These projects generate tremendous volumes of data that must be stored, ingested, and analyzed to make new discoveries.
Companies that build and manage platforms to collect, ingest, and analyze large volumes of data are typically well funded and staffed by large teams of engineers, computer scientists performing R&D, product managers, QA teams, and SRE teams. Astronomy projects, by contrast, are built by small teams of astronomers and engineers, and operate on budgets several orders of magnitude lower (even compared to a startup that has raised a Series B). Yet these teams build systems of tremendous scale that collect, store, index, and process large volumes of data, ultimately finding a signal in the noise, and these systems survive and persist over extended periods. Long before the explosion of data at social networking companies, astronomy projects were often outpacing the computing industry in their ability to build systems that process data at scale. This article hopes to highlight the incredible engineering work done by these small teams of engineers and astronomers. The examples below illustrate the scale and complexity of astronomy projects: each produces large volumes of data and represents a significant software engineering and innovation effort.

LIGO: The Laser Interferometer Gravitational-Wave Observatory (LIGO), a much-vaunted project often in the news, is dedicated to observing gravitational waves. LIGO generates about 1 TB of data each night it operates, and future upgrades as well as new planned sites such as LIGO-India will increase these nightly data rates by several orders of magnitude.

SETI@Home: Now defunct, SETI@Home was one of the largest distributed applications ever built (and in many regards a precursor to the sharing economy). Started in the late 90s, it ran for 21 years as a distributed computing platform on end-users’ desktops, analyzing radio-telescope data for possible signs of extra-terrestrial life. Telescope data was aggregated on central servers, and client software running on end-users’ machines pulled data from the central repository and analyzed it locally. In 2008 alone, SETI@Home processed 100 TB of data (in 2008 terms, the size of the entire US Library of Congress). At its peak, the platform aggregated 668 teraflops of computing power across 5.2 million end-users. The underlying technology, built by the Space Sciences Laboratory at Berkeley, was eventually open-sourced as BOINC, a platform for distributed computation that remains relevant today in applications ranging from climate science to mathematics to biology and medicine.
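The pull model at the heart of SETI@Home is simple to sketch. The snippet below is a toy stand-in for BOINC's far more robust machinery (all names here are hypothetical): a central queue of work units, and a handful of volunteer clients that pull, analyze locally, and report back.

```python
import queue
import threading

# Toy stand-in for the SETI@Home / BOINC pull model (all names hypothetical):
# a central server splits telescope data into work units; volunteer clients
# pull units, analyze them locally, and report results back.

work_units = queue.Queue()
for chunk_id in range(12):  # pretend each id is a chunk of radio-signal data
    work_units.put(chunk_id)

results = []
results_lock = threading.Lock()

def volunteer_client():
    """Pull work units until the central queue is empty."""
    while True:
        try:
            chunk = work_units.get_nowait()
        except queue.Empty:
            return
        score = chunk * chunk  # stand-in for local signal analysis
        with results_lock:     # "report" the result back to the server
            results.append((chunk, score))

clients = [threading.Thread(target=volunteer_client) for _ in range(4)]
for c in clients:
    c.start()
for c in clients:
    c.join()

print(len(results))  # prints 12: every work unit processed exactly once
```

The elegance of the design is that the server never tracks which client does what; clients simply pull whenever they have spare cycles, which is what let SETI@Home scale to millions of unreliable volunteer machines.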

SDSS: The Sloan Digital Sky Survey is a project to construct a detailed 3D map of the universe.

LSST: The Large Synoptic Survey Telescope (since renamed the Vera C. Rubin Observatory) is a large telescope slated to operate at very high data rates, equipped with a 3.2-gigapixel camera capable of recording the entire visible sky twice each week.

TMT: The Thirty Meter Telescope, an extremely large telescope (ELT) being built in Hawaii, will be among the largest telescopes ever built. It is a multi-national project spanning research teams across the US, Japan, China, Canada, and India.

The figure below puts these projects, their scale, and complexity in perspective.

The ever-increasing data rates per night of various projects over the past two decades.

So, what does it take to store, index, and process data at such high data rates? What kinds of queries can one run on this sort of data, and what is the most efficient architecture for it? What pipelines are necessary to process the data, and how are they integrated? What is the data model for exporting this data to enable further analysis? These are but a few of the questions that begin to highlight the kinds of problems astronomers face. Let us take a case study and dig a bit deeper to better appreciate the diversity of skillsets and expertise needed to succeed in these projects.
