Satellite Imagery Processing
Who am I?
I am David Hoese and I work as a software developer at the Space Science and Engineering Center (SSEC) at the University of Wisconsin - Madison. My job is to create software that makes meteorological data more accessible to scientists and researchers.
I am also a member of the open source PyTroll community, where I act as a core developer on the SatPy library. I use SatPy in my SSEC projects, Polar2Grid and Geo2Grid, which provide a simple command line interface on top of the features provided by SatPy.
The Problem I’m trying to solve
Satellite imagery data is often hard to read and use because of the many different formats and structures that it can come in. To make satellite imagery more useful, the SatPy library wraps common operations performed on satellite data in simple interfaces. Typically, meteorological satellite data needs to go through some or all of the following steps:
- Read: Read the observed scientific data from one or more data files while keeping track of geolocation and other metadata to make the data as useful and descriptive as possible.
- Composite: Combine one or more different “channels” of data to bring out certain features in the data. This is typically shown as RGB images.
- Correct: Sometimes data has artifacts, from the instrument hardware or the atmosphere for example, that can be removed or adjusted.
- Resample: Visualization tools often support a small subset of Earth projections and satellite data is usually not in those projections. Resampling can also be useful for intercomparisons between different instruments (on the same satellite or not).
- Enhancement: Data can be normalized or scaled in certain ways that make certain atmospheric conditions more apparent. This can also be used to better fit data into other data types (8-bit integers, etc).
- Write: Visualization tools typically only support a few specific file formats. Some of these formats are difficult to write or have small differences depending on what application they are destined for.
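As a rough illustration of the Composite and Enhancement steps above, the sketch below stacks three channels into an RGB image and linearly stretches each one into 8-bit integers. It uses plain NumPy rather than SatPy's own interfaces; the function name `linear_stretch_to_uint8`, the 0 to 1 reflectance range, and the tiny arrays are assumptions made up for the example.

```python
import numpy as np


def linear_stretch_to_uint8(channel, vmin, vmax):
    """Scale values in [vmin, vmax] to 0..255, clipping anything outside."""
    scaled = (channel - vmin) / (vmax - vmin)
    return (np.clip(scaled, 0.0, 1.0) * 255).round().astype(np.uint8)


# Made-up 2x2 reflectance "channels" (values above 1.0 get clipped)
red = np.array([[0.0, 0.2], [1.0, 1.2]])
green = np.array([[0.2, 0.2], [0.0, 1.0]])
blue = np.array([[1.0, 0.0], [0.2, 0.2]])

# Composite: enhance each channel, then stack them along a trailing band axis
rgb = np.stack(
    [linear_stretch_to_uint8(band, 0.0, 1.0) for band in (red, green, blue)],
    axis=-1,
)
```

The resulting `rgb` array has shape (rows, columns, 3) and dtype `uint8`, which is the layout most image writers expect.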
As satellite instrument technology advances, scientists have to learn how to handle more channels for each instrument and at spatial and temporal resolutions that were unheard of when they were learning how to use satellite data. If they are lucky, scientists may have access to a high performance computing system, while the rest may have to settle for long execution times on their desktop or laptop machines. By optimizing the parts of the processing that take a lot of time and memory, it is our hope that scientists can worry about the science and leave the annoying parts to SatPy.
How Dask helps
SatPy’s use of Dask makes it possible to do calculations on laptops that used to require high performance server machines. SatPy was originally drawn to Xarray’s DataArray objects for the metadata storing functionality and the support for Dask arrays. We knew that our original usage of Numpy masked arrays was not scalable to the new satellite data being produced. SatPy has now switched to DataArray objects backed by Dask and leverages Dask’s ability to do the following:
- Lazy evaluation: Software development is so much easier when you don’t have to remove intermediate results from memory to process the next step.
- Task caching: Our processing involves a lot of intermediate results that can be shared between different processes. When things are optimized in the Dask graph, it saves us as developers from having to code the “reuse” logic ourselves. It also means that intermediate results that are no longer needed can be disposed of and their memory freed.
- Parallel workers and array chunking: Satellite data is usually compared by geographic location, so a pixel at one index is often compared with the pixel of another array at that same index. Splitting arrays into chunks and processing them separately gives us a great performance improvement, and not having to manage which worker gets what chunk of the array makes development effortless.
Benefiting from all of the above lets us create amazing high resolution RGB images in 6-8 minutes on 3 year old laptops. With SatPy’s old Numpy implementation, the same processing would have run for 35+ minutes before crashing from memory limitations.
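The three abilities listed above can be seen in a small sketch using dask.array directly. The two "channels", their constant values, and the array and chunk sizes are made up for the example; this is not SatPy code.

```python
import dask
import dask.array as da

# Two made-up channels, each split into a 4x4 grid of 250x250 chunks
red = da.full((1000, 1000), 0.1, chunks=(250, 250))
nir = da.full((1000, 1000), 0.3, chunks=(250, 250))

# Lazy evaluation: this only builds a task graph, nothing is computed yet
ndvi = (nir - red) / (nir + red)

# Two products that both depend on the same intermediate array
mean_ndvi = ndvi.mean()
vegetated_fraction = (ndvi > 0).mean()

# Computing both in one call lets Dask share the ndvi tasks between them
# and run the chunks in parallel on the default threaded scheduler
mean_value, fraction_value = dask.compute(mean_ndvi, vegetated_fraction)
```

Calling `.compute()` on each product separately would give the same numbers but would evaluate the shared `ndvi` chunks twice.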
Pain points when using Dask
- Dask arrays are not Numpy arrays. Almost everything is supported or is close enough that you get used to it, but not everything. With most things you can get away with it and get perfectly good performance; with others you may end up computing your arrays multiple times in just a couple of lines of code without knowing it. Sometimes I wish that there was a Dask feature to raise an exception if your array is computed without you specifically saying it was ok.
- Common satellite data formats, like GeoTIFF, can’t always be written to by multiple writers (multiple nodes on a cluster) and some aren’t even thread-safe. Opening a file object and using it with dask.array.store may work with some schedulers and not others.
- Dimension changes are a pain. Satellite data processing sometimes involves lookup tables to save on bandwidth limitations when sending data from the satellite to the ground or other similar situations. Having to use lookup tables, including something like a KDTree, can be really difficult and confusing to code with Dask and get it right. It typically involves using atop, map_blocks, or sometimes suffering the penalty of passing things to a Delayed function where the entire data array is passed as one complete memory-hungry array.
- A lot of satellite processing seems to perform better with the default threaded Dask scheduler over the distributed scheduler due to the nature of the problems being solved. A lot of processing, especially the creation of RGB images, requires comparing multiple arrays in different ways and can suffer from the amount of communication between distributed workers. There isn’t an easy way that I know of to control where things are processed and which scheduler to use without requiring users to know detailed information on the internals of Dask.
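Two of the pain points above can be sketched together: applying a lookup table block by block with map_blocks, and catching an unintended extra compute. The `ComputeGuard` class is a hypothetical helper, not a Dask feature; it relies on dask.config.set accepting a callable as the scheduler and delegates to Dask's single-threaded `dask.get`, so it is only suited to tests and debugging. The lookup table values and the counts array are made up.

```python
import dask
import dask.array as da
import numpy as np


class ComputeGuard:
    """Hypothetical scheduler wrapper that limits how often compute may run."""

    def __init__(self, max_computes=1):
        self.max_computes = max_computes
        self.total_computes = 0

    def __call__(self, dsk, keys, **kwargs):
        self.total_computes += 1
        if self.total_computes > self.max_computes:
            raise RuntimeError(
                f"Too many computes: {self.total_computes} > {self.max_computes}"
            )
        # Delegate to Dask's synchronous, single-threaded scheduler
        return dask.get(dsk, keys, **kwargs)


# Made-up lookup table mapping 8 raw counts to calibrated values
lut = np.linspace(0.0, 1.0, 8)
raw = np.array([[0, 7, 3], [1, 2, 6]])
counts = da.from_array(raw, chunks=(1, 3))

# Index the lookup table one block at a time; the output shape is unchanged
calibrated = da.map_blocks(lambda block: lut[block], counts, dtype=lut.dtype)

guard = ComputeGuard(max_computes=1)
with dask.config.set(scheduler=guard):
    result = calibrated.compute()  # the one compute that is allowed
    try:
        float(calibrated.max())  # float() silently triggers a second compute
        second_compute_blocked = False
    except RuntimeError:
        second_compute_blocked = True
```

A lookup that changes the array shape, or one that needs a KDTree query, takes more care than this; in later Dask releases the atop function is available under the name blockwise.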
Technology I use around Dask
As mentioned earlier, SatPy uses Xarray to wrap most of our Dask operations when possible. We have other useful tools that we’ve created in the PyTroll community to help support deploying satellite processing tools on servers, but they are not specific to Dask.
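A minimal sketch of what wrapping a Dask array in an Xarray DataArray looks like is shown below. The dimension names, the attribute keys and values, and the constant brightness temperature are assumptions chosen for illustration and are not taken from SatPy.

```python
import dask.array as da
import xarray as xr

# Made-up brightness temperatures in Kelvin, stored as a chunked Dask array
data = da.full((400, 600), 300.0, chunks=(200, 200))

# The DataArray carries named dimensions and metadata alongside the lazy data
bt = xr.DataArray(
    data,
    dims=("y", "x"),
    attrs={"units": "K", "platform_name": "example-sat"},  # hypothetical metadata
)

# Arithmetic on the DataArray stays lazy; the result is still Dask-backed
celsius = bt - 273.15
is_lazy = isinstance(celsius.data, da.Array)

# Nothing is computed until a value is explicitly requested
mean_celsius = float(celsius.mean().compute())
```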