How to store weather forecast data for fast time-series APIs

Using multi-dimensional gridded files and ring-based data-structures

In this post, I explain how Open-Meteo APIs store numerical weather-model data for fast access. I will briefly cover:

  • Basics of geographical grids

  • What weather-model runs are

  • Different storage approaches for gridded data

  • How efficient weather forecast APIs can be designed

Chapter 1: Grids

Weather-models store data in grids with columns for the east-west direction and rows for the north-south direction. With 1000 columns and 1000 rows you would get 1 million grid-cells. With an earth circumference of around 40,000 km, each grid-cell is around 40 x 40 km in size.

To get the weather forecast for your location, the closest grid-cell is located and its data returned. For users in the east of the US this might be grid-cell “C5“. It is very similar to a spreadsheet filled with temperature values.

Grids do not need to be global. Local higher resolution grids for smaller areas are common as well. In any case, grids are structured and the geographical coordinates for grid-cells can be calculated.
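To make this concrete, here is a tiny Python sketch that locates the closest grid-cell for a latitude/longitude pair on a regular global 1000 x 1000 grid. The grid layout is an illustrative assumption; real weather-model grids and projections differ.

  def nearest_cell(lat: float, lon: float, n_rows: int = 1000, n_cols: int = 1000) -> int:
      """Return the flat index of the grid-cell closest to (lat, lon)."""
      dy = 180.0 / n_rows   # degrees of latitude per row
      dx = 360.0 / n_cols   # degrees of longitude per column
      row = min(n_rows - 1, max(0, round((lat + 90.0) / dy)))
      col = round((lon + 180.0) / dx) % n_cols
      return row * n_cols + col   # row-major flat index

  print(nearest_cell(40.7, -74.0))   # a point on the US east coast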

Depending on the resolution and size of your grid, the number of grid-cells can be huge. Open-Meteo uses grids with more than 4 million grid-cells. If you stored the values of this grid on your hard disk, the gridded file would be around 15 MB using 32-bit floating-point numbers.

Unfortunately, many of those gridded files are required. A 7-day hourly weather forecast requires 168 files, one for each time-step. Just for one weather variable (e.g. temperature), this adds up to 2.5 GB.

At Open-Meteo, we use 35 weather variables like wind-speed, direction, relative humidity, pressure, weather-code, clouds, precipitation or snow height. Some weather variables are also stored for different altitudes, like wind speed at 10, 80, 120 and 180 meters above ground.

In total, every single weather-model-run has to store around 87.5 GB of data.

4 million grid-cells * 32-bit * 168 hours * 35 variables = 87.5 GB

With today’s storage options on large hard disks and faster solid state disks (SSD) this is not inherently an issue anymore. On a 16 TB hard disk you could store 180 weather model runs. With 4 runs a day, this would be 45 days of archived data.

For now, this ignores compression and smaller data types than 32-bit floating points, but more on that later.

Chapter 2: Model-Runs

I briefly mentioned the term “model-run”. Weather-models are constantly updated with the newest measurements, satellite & radar observations and airplane data. Nowadays, weather models are started every 3 hours.

At 0:00 UTC time, all measurements are collected, assembled onto a grid and then numerical weather prediction equations are run to predict how air masses change, clouds form and the sun creates convection.

After a couple of hours, the “run” finishes and can be downloaded. This run is referred to as the 0z run. The same procedure is repeated at 3:00 UTC with the 3z run, 6:00 UTC with 6z and so on.

Every model-run overlaps with previous runs, but offers more recent data, improved forecasts and reaches a bit more into the future.

To save space, only the latest 2 runs could be stored. This can be observed with many weather forecast APIs. It is noticeable because, after the latest run, older data is not accessible anymore. For example, API calls for today’s data return only data after 9:00 UTC and no information about the past night anymore.

Another common solution (and also how Open-Meteo does it) is to merge multiple model-runs. Newer runs overwrite portions of older runs, but leave earlier data in place.

A drawback of this technique is data inconsistencies like sudden jumps in temperature or sunny weather switching to pouring rain. With weather models updating rapidly every 3 hours, this is rare.

Chapter 3: Reading data from gridded files

There are two use-cases for reading data:

  1. Create a weather-map for one timestamp

  2. Get an hourly time-series for one location

The first use-case works quite well. With gridded data files, only one file has to be opened and the entire global data for one timestamp can be read straight away. In essence, gridded data files are a “picture” of the entire world.

The second use-case for time-series data is not optimal. Every timestamp is in a different gridded file. For 7 days (168 hours), 168 files have to be opened and a single value read. For 10 weather variables, 1680 values have to be read from 1680 files.

On fast SSDs every open-file and read operation takes a fraction of a millisecond, but it multiplies by 1680. With half a millisecond for each read operation, this results in almost 1 second of waiting time. For an API, this is an eternity.
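To make the cost visible, here is a minimal Python sketch of this naive read path: one binary file per time-step, one seek and one tiny read per file. File names and the binary layout are made up for illustration.

  import struct

  def read_time_series(cell_index: int, n_hours: int = 168) -> list[float]:
      values = []
      for hour in range(n_hours):
          # One separate file per time-step, e.g. temperature_000.bin ... temperature_167.bin
          with open(f"temperature_{hour:03d}.bin", "rb") as f:
              f.seek(cell_index * 4)                      # 4 bytes per 32-bit float
              values.append(struct.unpack("<f", f.read(4))[0])
      return values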

In practice, it can be optimised by keeping files open or using caches to make it faster. It is possible to run an API with less than 100 milliseconds response time, but it is far from ideal.

Chapter 4: Using a relational database

Many weather APIs store weather-model data in database management systems like MySQL, PostgreSQL or SQLite.

The solution comes naturally and looks straightforward. A simple table with columns:

  • Timestamp

  • Latitude

  • Longitude

  • Temperature

  • Wind speed, direction, relative humidity, …

Additionally, some indices need to be defined on latitude, longitude and timestamp to access all timestamps quickly.

The database management system keeps data compact on disk. For each row, all weather variables are stored together. E.g. the bytes for temperature and wind-speed are next to each other:

time,lat,lon,temperature,windspeed,direction,humidity,…

To read data, the database management system has to traverse the indices for a given geographic location (latitude and longitude) and retrieve data for all timestamps. All weather variables are local and reading temperature, wind speed, direction and other variables all at once is quick.
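As an illustration, here is a minimal sketch of such a table and point query using SQLite in Python. Table, column and index names are made up; a production setup would use MySQL or PostgreSQL as mentioned above.

  import sqlite3

  db = sqlite3.connect("weather.db")
  db.execute("""
      CREATE TABLE IF NOT EXISTS forecast (
          time INTEGER, lat REAL, lon REAL,
          temperature REAL, windspeed REAL, direction REAL, humidity REAL
      )
  """)
  # Index on location and time, so all timestamps of one grid-cell are found quickly
  db.execute("CREATE INDEX IF NOT EXISTS idx_loc_time ON forecast (lat, lon, time)")

  # Read a 7-day time-series for one grid-cell location
  rows = db.execute(
      "SELECT time, temperature, windspeed FROM forecast "
      "WHERE lat = ? AND lon = ? ORDER BY time",
      (52.52, 13.41),
  ).fetchall()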

Reading times are good, at 50 milliseconds or less.

Unfortunately, there are a couple of drawbacks:

  1. Storage overhead: For each row, timestamp and geographic coordinates have to be stored again. In gridded files, this is not required, because you know the dimensions of each grid-cell.

  2. Indices memory: To find the correct locations and timestamps quickly, large indices are required which consume multiple GB of memory.

  3. Updates: For every weather model-run large portions of the database have to be rewritten. In practice this is a huge issue. Updates take an hour or more and performance drops significantly.

  4. Scaling: To improve performance more nodes are required and data has to be clustered. It works for small data-sets, but scaling becomes expensive.

Relational databases are not built for gridded data. Properties like spatial context are lost and operations become unnecessarily complex.

Chapter 5: Time-series in gridded files

Gridded data is not limited to 2 dimensions for east-west and north-south. A third dimension like “time” is an option.

The 1000 by 1000 grid in chapter 1 could be expanded with the third dimension “time” and a cardinality of 168 hours to fit our 7 day weather forecast. Dimensions would then be: east-west (1000), north-south (1000) and time (168).

The order of dimensions is important. In one case 168 “global pictures” are stored one after another. In the other, 1 million time-series of 168 hours each are stored. Naturally, for a time-series API we need the latter case.

On disk every grid-cell stores 168 hours of data, immediately followed by the next 168 hours for the next grid-cell.

Reading a single 168-hour weather forecast for temperature is now a single, quick read operation.
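A minimal Python sketch of this read path, assuming a flat binary file in which every grid-cell stores its 168 hourly values contiguously. The file name and constants are illustrative.

  import struct

  N_TIME = 168          # hours stored per grid-cell
  BYTES_PER_VALUE = 4   # 32-bit float

  def read_forecast(cell_index: int, path: str = "temperature_3d.bin") -> list[float]:
      offset = cell_index * N_TIME * BYTES_PER_VALUE   # start of this cell's time-series
      with open(path, "rb") as f:
          f.seek(offset)
          data = f.read(N_TIME * BYTES_PER_VALUE)      # one contiguous read
      return list(struct.unpack(f"<{N_TIME}f", data))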

With each model-run these files can be generated easily. There is not much difference between generating separate gridded files for each time-step and generating one 3-dimensional file with all time-steps at once.

Multiple weather-model runs still need to be considered: to provide data for the entire current day, we have to read the latest 4 or 6 runs. How multiple runs are merged into just one time-series is explained in the next chapter.

Chapter 6: Ring data-structure

Commonly in software development, ring-based data-structures are used as round-robin buffers for network packets. We can adapt this principle to update time-series data from weather-model-runs.

If we look back at how multiple model-runs can be overlaid, it becomes apparent that older data is not relevant anymore and newer data runs out of space when appended to the end.

Subsequent runs can overwrite portions of older data as well. With more and more runs, all data will be overwritten at some point. If visualised as a ring, data would go around the clock and always update a section of the ring.

For reading, the right portion has to be merged with the left portion to restore a full 168 hour time-series.
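Here is a simplified Python sketch of that wrap-around read, assuming we know the ring position of the first requested hour. The exact bookkeeping in Open-Meteo differs, but the principle is the same.

  RING_SIZE = 240   # 10 days of hourly data per grid-cell
  N_HOURS = 168     # length of the requested forecast

  def read_from_ring(ring: list[float], start_pos: int) -> list[float]:
      """ring holds RING_SIZE hourly values of one grid-cell; start_pos is the
      ring position of the first requested hour."""
      return [ring[(start_pos + h) % RING_SIZE] for h in range(N_HOURS)]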

The ring-based structure for time-series has two handy properties:

  1. With a larger ring-size, data from past days remains available

  2. Updates of new data can be performed in place

At Open-Meteo the ring-size is set to 10 days of data. Therefore 2 days of past data can be retrieved safely. This means that the 3-dimensional gridded data file uses the dimensions east-west (1000), north-south (1000) and time (240 instead of 168).

Every 3 hours new weather-model-runs finish and are integrated into this ring-based structure. While doing updates, certain sections of the file might be inconsistent, because the start point of the data suddenly gets overwritten with new data. Therefore, 24 hours of data are never read and kept as a safety margin.
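A simplified sketch of such an in-place update: every hour of a new run is written at its position modulo the ring-size and overwrites the oldest data.

  RING_SIZE = 240   # 10 days of hourly data per grid-cell

  def apply_model_run(ring: list[float], run_start_hour: int, new_values: list[float]) -> None:
      """Write a new run of consecutive hourly values into the ring in place."""
      for i, value in enumerate(new_values):
          ring[(run_start_hour + i) % RING_SIZE] = value   # overwrites the oldest data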

Doing updates in-place (e.g. modifying the existing files) simplifies data handling a lot. Otherwise, you would have to keep many duplicate files for each model-run.

It also has more practical advantages, because file-systems have to manage files on actual hard-disks. If the files always maintain the same size and reside at exactly the same location on disk, there is not much work to do. Updates basically modify the existing file in place and nothing else.

Using ring-based gridded files also has disadvantages:

  1. The entire file has to be updated at once

  2. Compressing data is not practical, because in-place updates are not possible anymore

The first limitation complicates larger ring-sizes like 100 days of data. Users commonly request access to more past days or an entire weather archive. While I would love to just set the ring-size to 100 days, it has severe performance implications.

Data on disk is stored in blocks of 512 or 4096 bytes. Nowadays 4096 bytes are more common with fast SSD storage. A single 1-byte modification has to read 4096 bytes, modify one byte and then write 4096 bytes back to disk. Because most of the disk workload is written in larger sequential blocks, this is barely noticeable.

For larger ring-sizes like 100 days, 9600 bytes are required for one grid-cell using 32 bit floating point numbers. Although only 7 days (672 bytes) are updated with every model-run, the operating system has to read, modify and write the entire file.

4 million grid-cells * 32-bit * 100 days * 24 hours = 35.8 GB

One gridded ring-based file for one weather variable is 35.8 GB, or 1.25 TB for all 35 weather variables. Updating 1.25 TB of data every 3 hours is manageable, but has a larger impact on system load.

For smaller ring-sizes of 10 days, this approach works beautifully and updates are easy to handle in operation.

A single time-series can be read with one operation in less than 0.5 milliseconds. For 10 weather variables, APIs with latencies below 5 milliseconds are feasible.

Compared with regular gridded files:

  • Ring-based gridded files only have to open one file and do one single read, because all time-series data is sequential

Compared with relational databases:

  • No indices are required

  • No operational burden on clustering data is required

  • APIs can be scaled by copying files to more servers

Ring-based gridded files are more complicated to work with, because there are no open-source tools available, but they are worth it for rapid-updating weather-model data.

Chapter 7: Kernel page cache, memory mapped files and non-blocking IO

There are some tweaks to make it even faster and more efficient. I will just give a short overview of those techniques without a deep dive into Linux kernel fundamentals.

First, the Linux page cache keeps commonly read data in memory. Forecast data for densely populated areas is likely already in the page cache. Consecutive API calls that get the same data are significantly faster.

The page cache transparently caches files on local file systems. Ring-based gridded files do not have to reside on local file-systems only, but could use shared or remote file-systems like Amazon Elastic File System (EFS) or CephFS. In later posts I will explain how to use distributed file systems to store gridded data in conjunction with fast APIs.

To get the most out of the page cache, Open-Meteo also uses “memory mapped files” (mmap). A gridded file is transparently mapped into memory and behaves like one large floating-point array. In the previous example, it would be a floating-point array with almost one billion elements. Memory-mapped files are not read entirely into memory; data is read from disk on access. If data already resides in memory, no reads are required. In the worst case, every access performs a read from disk.
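A minimal Python sketch of this idea, mapping a hypothetical ring-based file into memory and treating it as one large float array:

  import mmap
  import numpy as np

  N_TIME = 240   # ring-size in hours per grid-cell

  with open("temperature_ring.bin", "rb") as f:
      mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
      data = np.frombuffer(mm, dtype=np.float32)   # behaves like one huge float array
      cell = 726_294                               # hypothetical grid-cell index
      # Only the pages touched below are read from disk (or served from the page cache)
      forecast = data[cell * N_TIME : cell * N_TIME + 168].copy()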

Finally, reading data from multiple files can be done in parallel. This requires multi-threaded code or asynchronous event-loops to achieve non-blocking IO. With memory-mapped files there is an alternative: to read 10 weather variables at once, we tell the Linux kernel which 10 data regions we need and the kernel starts reading data for all 10 weather variables in the background. By the time the data is accessed, it is hopefully present in the page cache. Technically, this is still blocking IO, but multiple parallel reads are performed in the background. Using this technique cuts API latency in half.
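A sketch of such a prefetch hint using madvise with MADV_WILLNEED (Linux-specific, Python 3.8+). Offsets and the file name are illustrative; in reality one hint would be issued per weather-variable file.

  import mmap

  PAGE = mmap.PAGESIZE
  N_TIME, BYTES_PER_VALUE = 240, 4

  with open("temperature_ring.bin", "rb") as f:
      mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
      for cell in (726_294, 1_003_500, 2_950_001):       # regions we are about to read
          offset = cell * N_TIME * BYTES_PER_VALUE
          start = (offset // PAGE) * PAGE                # madvise expects page-aligned offsets
          length = (offset - start) + 168 * BYTES_PER_VALUE
          mm.madvise(mmap.MADV_WILLNEED, start, length)  # kernel starts reading in the background
      # ... later accesses to these regions will likely hit the page cache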

Additionally, to save space, many weather variables use 16-bit floating points with less precision, requiring only half the space.
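As a tiny illustration with NumPy’s half-precision type (the exact 16-bit encoding used in practice may differ):

  import numpy as np

  temps_f32 = np.array([12.34, 13.71, 15.02], dtype=np.float32)
  temps_f16 = temps_f32.astype(np.float16)      # half precision
  print(temps_f32.nbytes, temps_f16.nbytes)     # 12 bytes vs. 6 bytes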

Chapter 8: Conclusion

The end result is API reading times of 2 milliseconds for 10 weather variables and 0.4 milliseconds for consecutive calls.

Additional features like combining high-resolution weather models, derived weather variables and down-scaling techniques also add latency. Areas covered by 3 weather models take 6 milliseconds to read and 0.5 milliseconds for consecutive API calls.

With low API latencies the operational cost for API services can be kept very low. Open-Meteo is (and will be) free for non-commercial use without API key limitations. This can only be achieved with efficient data storage and APIs.

Ring-based gridded files are a key ingredient to efficiently store and update weather-model data and be able to quickly retrieve forecasts for single geographical points.

For longer time-series and archives, different techniques have to be used that focus on compression and support large distributed storages to be cost effective.

In later posts, I will explain how to design systems for a time-series weather archive, how high-resolution weather-models can be combined with low-resolution models, and how weather variables can be computed on demand.

To get the latest updates via mail, you can subscribe to the Open-Meteo blog for free.

Otherwise, I encourage everyone to use the Open-Meteo APIs and build open-source tools to work with weather data. If you have questions, ideas, feedback or suggestions, please comment below.
