5 Comments

Keep up the good work

Did you try storing the data in specialized time-series databases such as VictoriaMetrics or ClickHouse? These databases may provide a higher compression ratio for the weather data stored on disk compared to custom schemes. They also provide query languages optimized for typical queries over time-series data, such as MetricsQL - https://docs.victoriametrics.com/metricsql/

See, for example, a benchmark for ingesting 500 billion samples into VictoriaMetrics - https://valyala.medium.com/billy-how-victoriametrics-deals-with-more-than-500-billion-rows-e82ff8f725da

Hi. Yes, I ran experiments with InfluxDB, TimescaleDB and ClickHouse. ClickHouse would be my pick for unstructured time-series data (e.g. weather measurement stations). I also maintain a [Swift ClickHouse client](https://github.com/patrick-zippenfenig/ClickHouseNIO). I have not tried VictoriaMetrics or TileDB.

For gridded weather models, those databases are less ideal because they cannot exploit the fact that the data is an ideal multi-dimensional array and does not require metadata like coordinates and timestamps for each data entry. Using scientific file formats and libraries like HDF5, NetCDF, Zarr or Xarray works quite well and can provide reasonable access performance for random reads using "chunks".
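
A minimal sketch of that idea, assuming a hypothetical hourly temperature grid and illustrative chunk sizes (this is not the actual Open-Meteo or ERA5 layout):

```python
# Gridded weather data as a chunked multi-dimensional array (time x lat x lon).
# Coordinates are stored once per axis, never per value, and chunking keeps a
# random point read down to a few small decompressed blocks.
import numpy as np
import pandas as pd
import xarray as xr

times = pd.date_range("2024-01-01", periods=168, freq="h")  # one week, hourly
lats = np.arange(-90, 90, 1.0)
lons = np.arange(-180, 180, 1.0)

temperature = xr.DataArray(
    np.random.rand(len(times), len(lats), len(lons)).astype("float32"),
    dims=("time", "latitude", "longitude"),
    coords={"time": times, "latitude": lats, "longitude": lons},
    name="temperature_2m",
)

# Write as Zarr with explicit chunking along all dimensions.
temperature.to_dataset().to_zarr(
    "weather_sketch.zarr",
    mode="w",
    encoding={"temperature_2m": {"chunks": (24, 45, 45)}},
)

# Reading a single-point time series only touches the chunks that cover it.
ds = xr.open_zarr("weather_sketch.zarr", chunks=None)
point = ds["temperature_2m"].sel(latitude=47.5, longitude=8.5, method="nearest")
print(point.values[:5])
```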

Another issue is time-critical updates for weather forecasts. A single update may overwrite large portions of the entire database (>100 GB of compressed GRIB data), and this process should not take more than 15 minutes. In my experience, overwriting large amounts of existing data does not work well with classical time-series databases. They prefer append-only workloads, which can be queued and later merged into storage.

The Open-Meteo custom file format solves most of these issues: compression reaches high ratios (5x smaller than GRIB), compression/decompression throughput is high (>200 MB/s), random-read I/O latency is low thanks to chunks, mmap and prefetching, and real-time weather model updates can be merged quickly. However, it is not a general-purpose file format and is tailored only to Open-Meteo. It provides an advantage for very few use cases, but one of them is serving a fast weather API while keeping resources low.
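
To make the "chunks + mmap" idea concrete, here is a hedged sketch in Python. It is not the actual Open-Meteo file format (whose layout and compressor differ); it only shows why a per-chunk offset table plus mmap gives cheap random reads:

```python
# Random access into a chunked flat binary file via mmap. A small header stores
# the offset of every compressed chunk, so a point read decompresses only the
# chunk that contains it; the OS page cache and readahead handle prefetching.
import mmap
import struct
import zlib
import numpy as np

CHUNK_LEN = 1024  # values per chunk; illustrative

def write_chunked(path: str, values: np.ndarray) -> None:
    """Write float32 values as independently compressed chunks plus an offset table."""
    chunks = [zlib.compress(values[i:i + CHUNK_LEN].tobytes())
              for i in range(0, len(values), CHUNK_LEN)]
    offsets = []
    pos = 4 + 8 * len(chunks)  # header: chunk count + one 8-byte offset per chunk
    for c in chunks:
        offsets.append(pos)
        pos += len(c)
    with open(path, "wb") as f:
        f.write(struct.pack("<I", len(chunks)))
        f.write(struct.pack(f"<{len(chunks)}Q", *offsets))
        for c in chunks:
            f.write(c)

def read_value(path: str, index: int) -> float:
    """Read one value by decompressing only the chunk that contains it."""
    with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        n_chunks = struct.unpack_from("<I", mm, 0)[0]
        offsets = struct.unpack_from(f"<{n_chunks}Q", mm, 4)
        chunk_id = index // CHUNK_LEN
        end = offsets[chunk_id + 1] if chunk_id + 1 < n_chunks else len(mm)
        chunk = np.frombuffer(zlib.decompress(mm[offsets[chunk_id]:end]), dtype=np.float32)
        return float(chunk[index % CHUNK_LEN])

data = np.arange(10_000, dtype=np.float32)
write_chunked("demo.bin", data)
print(read_value("demo.bin", 4242))  # -> 4242.0
```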

In general, I would like to integrate the concepts of cloud-native file formats that can be stored on S3/HTTP-like storage and consumed in chunks using HTTP range requests. This would enable higher scalability by not having to keep all data locally. The challenge is latency. A typical weather forecast reads data for various weather variables which are split across multiple files. A linear execution flow would quickly add up to multiple seconds. If the reads can be parallelised well, it could work, but it will take a lot of work to get to that point. Note: all data is already on AWS open data.
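
A rough sketch of how such reads could be parallelised so latencies overlap instead of adding up. The bucket URL, file names and byte ranges are hypothetical placeholders, not the real AWS open-data layout:

```python
# Fetch byte ranges for several weather variables concurrently over HTTP.
import asyncio
import aiohttp

BASE = "https://example-bucket.s3.amazonaws.com/open-meteo"  # hypothetical

async def fetch_range(session: aiohttp.ClientSession, path: str, start: int, end: int) -> bytes:
    """Fetch one chunk of a file with an HTTP Range request."""
    headers = {"Range": f"bytes={start}-{end}"}
    async with session.get(f"{BASE}/{path}", headers=headers) as resp:
        resp.raise_for_status()  # expect 206 Partial Content
        return await resp.read()

async def fetch_forecast_chunks() -> list[bytes]:
    # One range request per variable file, all issued concurrently.
    requests = [
        ("temperature_2m.om", 0, 65535),
        ("precipitation.om", 0, 65535),
        ("wind_speed_10m.om", 0, 65535),
    ]
    async with aiohttp.ClientSession() as session:
        tasks = [fetch_range(session, p, s, e) for p, s, e in requests]
        return await asyncio.gather(*tasks)

# chunks = asyncio.run(fetch_forecast_chunks())
```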

Hi,

Truly good work!

A question: is the data we obtain from 2017 onwards from ECMWF IFS, the ERA5 datasets, or a combination of both? Thank you

By default, the API returns data from ECMWF IFS at 9 km resolution. You can, however, select "ERA5-Seamless" in the "Reanalysis models" selection to always use ERA5 or ERA5-Land.
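
As a hedged example, assuming the historical weather API's `models` parameter accepts `era5_seamless` as described in the public docs (check the docs before relying on the exact parameter names):

```python
# Request ERA5 / ERA5-Land explicitly instead of the default model selection.
import requests

params = {
    "latitude": 47.37,
    "longitude": 8.55,
    "start_date": "2017-01-01",
    "end_date": "2017-01-07",
    "hourly": "temperature_2m",
    "models": "era5_seamless",  # force ERA5 / ERA5-Land
}
resp = requests.get("https://archive-api.open-meteo.com/v1/archive", params=params)
resp.raise_for_status()
print(resp.json()["hourly"]["temperature_2m"][:5])
```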
