For many scientists today, the first step in doing science is exploring the data computationally. New approaches to data-driven science are needed because space science mission data is growing sharply in volume, heterogeneity, velocity and complexity. This applies to ESA space science missions, whose archives are hosted at the ESA Science Data Centre (ESDC). Examples include the Gaia archive, whose size is estimated to grow to 1 PB and 6,000 billion objects; the Solar Orbiter archive, which is expected to handle several time series with more than 500 million records; and the Euclid archive, which must be able to handle up to 10 PB of data. A major objective of the ESDC is to maximize the scientific exploitation of the archived data. The challenges are not limited to managing the large volume of data; they also include enabling collaboration between scientists, providing tools for exploring and mining the data, integrating data (the value of data grows enormously when it can be linked with other data), and managing data in context (tracking provenance, handling uncertainty and error). ESDC is exploring solutions to these challenges in several areas, specifically: storing big catalogues in distributed databases (e.g. Greenplum, Postgres-XL, …); storing long, high-resolution time series in time-series-oriented databases (TimescaleDB); meeting data analysis requirements with Elasticsearch or Spark/Hadoop; and enabling scientific collaboration and closer access to data through JupyterLab, Python client libraries and integration with pipelines using containers. This presentation takes a tour of these approaches.
Link to PDF (may not be available yet): O10-4.pdf