Calculating NDVI directly from xarray DataArrays
Calculating NDVI directly from xarray DataArrays transforms traditional raster band math into a labeled, coordinate-aware operation. Instead of managing raw NumPy indices and manual axis alignment, you apply the standard formula (NIR - Red) / (NIR + Red) using declarative xarray operators. The library automatically handles spatial broadcasting, lazy execution via Dask, and metadata propagation. This eliminates silent misalignment bugs, preserves CF-compliant metadata, and scales seamlessly from single-scene processing to multi-temporal continental stacks.
Core Implementation
import xarray as xr
import numpy as np
def calculate_ndvi_xarray(data: xr.DataArray | xr.Dataset) -> xr.DataArray:
"""Compute NDVI from xarray objects with safe division and metadata preservation."""
# Extract bands regardless of input structure
if isinstance(data, xr.Dataset):
red = data["red"]
nir = data["nir"]
else:
# Assumes a 'band' coordinate exists with named values
red = data.sel(band="red")
nir = data.sel(band="nir")
# Prevent integer wrap-around (e.g., uint16 underflow) by casting to float32
red = red.astype("float32")
nir = nir.astype("float32")
# Core NDVI calculation with safe division
denominator = nir + red
ndvi = (nir - red) / denominator
# Mask pixels where denominator is zero (black borders, nodata regions)
ndvi = ndvi.where(denominator != 0, other=np.nan)
ndvi = ndvi.clip(-1, 1)
# Attach CF-compliant metadata
ndvi.attrs.update({
"long_name": "Normalized Difference Vegetation Index",
"standard_name": "normalized_difference_vegetation_index",
"units": "1",
"formula": "(NIR - RED) / (NIR + RED)",
"valid_range": [-1.0, 1.0]
})
return ndvi
Step-by-Step Breakdown
1. Structure-Agnostic Band Extraction
The function accepts either an xr.Dataset (where bands are separate variables) or an xr.DataArray (where bands are stacked along a coordinate dimension). Using .sel(band="red") leverages xarray’s label-based indexing, removing the need to hardcode array positions like data[0] or data[1]. This is critical when processing multi-sensor archives where band ordering varies.
2. Safe Type Casting
Raw satellite products (Sentinel-2 L2A, Landsat Collection 2) typically ship as uint16. Performing subtraction on unsigned integers causes wrap-around underflow: 50 - 100 becomes 65486 instead of -50. Casting to float32 before arithmetic prevents this while keeping memory overhead minimal compared to float64. For NDVI’s output range of -1 to 1, float32 provides more than adequate precision.
3. Division-by-Zero Handling
Satellite scenes contain padding, scan gaps, or fill pixels where both NIR and Red bands equal zero. Direct division produces NaN or Inf, which can break downstream aggregations. The .where(denominator != 0, other=np.nan) call explicitly masks these invalid pixels while preserving the original coordinate grid. The subsequent .clip(-1, 1) enforces physical bounds, catching any floating-point drift or calibration artifacts.
4. Metadata Preservation
Attaching CF-compliant attributes ensures interoperability with GIS software, NetCDF exporters, and visualization libraries. The standard_name and valid_range keys are recognized by tools like rioxarray, MetPy, and Panoply, enabling automatic colormap scaling and unit-aware plotting without manual overrides.
Performance & Production Considerations
Dask Chunking Strategy
NDVI computation is memory-light but benefits from spatial locality. When working with large archives, chunk along spatial dimensions to maximize cache locality and minimize inter-worker communication. Optimal chunk sizes for (time, y, x) layouts are typically (1, 1024, 1024). Avoid chunking along the band dimension if you plan to compute multiple indices (EVI, SAVI, NDWI) from the same stack, as it fragments the task graph and increases scheduling overhead. Refer to the official xarray Dask integration guide for chunking diagnostics and memory profiling workflows.
Coordinate Alignment & Lazy Execution
xarray aligns arrays by coordinate labels, not positional order. If your red and NIR bands originate from separate files or different processing pipelines, xarray automatically reindexes them to a common grid before arithmetic. This prevents subtle georeferencing offsets that plague NumPy-based workflows. Because operations are lazy, calling calculate_ndvi_xarray() only builds a computational graph. Actual calculation triggers when you call .compute(), .to_netcdf(), or .rio.to_raster().
Memory Footprint & Data Types
For continental-scale stacks, monitor peak memory during the .compute() phase. If OOM errors occur, reduce spatial chunk sizes or process in temporal batches. Always write outputs to float32 GeoTIFF or Zarr to maintain precision without inflating storage costs. Avoid xr.set_options(enable_cftimeindex=False) in modern xarray (>=2023.0) — CF time indexes are handled automatically.
Pipeline Integration
This pattern integrates cleanly into broader Band Math Operations with Xarray workflows. Because xarray tracks spatial coordinates (x, y, crs) and temporal dimensions natively, you can chain NDVI computation directly after STAC catalog ingestion without manual array slicing or coordinate reordering.
A typical production pipeline looks like this:
- Query a STAC API using
pystac-clientto fetch item URIs. - Open with
stackstacorodc-stacto build a lazyxr.DataArray. - Apply
calculate_ndvi_xarray()to the stacked cube. - Compute temporal aggregations (
.mean(dim="time"),.quantile()). - Export to cloud-optimized GeoTIFF or Zarr.
By keeping operations lazy until the final export step, you avoid materializing intermediate arrays in RAM. This approach scales to petabyte-scale archives when paired with distributed Dask clusters or cloud-native storage backends. For foundational concepts on coordinate systems, raster I/O, and STAC metadata mapping, review the Core Raster Fundamentals & STAC Mapping reference guide.