Grid map phase optimization report
Summary
The `aeic trajectories-to-grid --mode map` command was optimized in four
steps: three targeting the unfiltered hot loop (a 47x speedup) and one
targeting the filtered trajectory-loading path (a 41x speedup).
Unfiltered path
| Commit | Change | map_phase time | Speedup |
|---|---|---|---|
| (baseline) | Original code | 174.3 s | 1x |
| 9959fcb | Eliminate redundant NetCDF reads | 32.2 s | 5.4x |
| 3eeb48e | Batched slab reads via `iter_range` | 5.9 s | 29.3x |
| | Skip type-checking on trusted internal reads | 3.7 s | 47.1x |
Benchmark: ~10,000 trajectories (slice 0 of 3900 from the test-run store),
1x1 degree grid, best of 3 runs, map_phase wall-clock only.
Filtered path
| Commit | Change | Loading time (40 flights) | Speedup |
|---|---|---|---|
| (baseline) | | ~34 s | 1x |
| 858f648 | Bulk index lookup + batched slabs via `iter_flight_ids` | ~0.83 s | 41x |
Step 1: eliminate redundant scalar reads in _read_from_nc_var
Commit: 9959fcb
The problem
_read_from_nc_var reads a single field for a single trajectory from a
NetCDF variable. Several of its branches accessed var[index] more than
once for the same call:
# Before (simplified)
case (False, False, False):
if var[index] == var.get_fill_value(): # read 1
return None
return var[index] # read 2 (same data!)
case (True, False, False) | (True, False, True):
return SpeciesValues(
{sp: var[index, si] for si, sp in enumerate(species)}
# ^^^^^^^^^^^^^ one NetCDF read per species
)
Each var[index] or var[index, si] is a full round-trip into the
netCDF4 C library: dimension lookup, type conversion, and (for HDF5) a
chunk decompression. With ~40 fields per trajectory and ~10,000
trajectories, the redundant reads added up to minutes of pure overhead.
The fix
Read once into a local variable, then index into the in-memory result:
# After
case (False, False, False):
val = var[index] # one read
if val == var.get_fill_value():
return None
return val
case (True, False, False) | (True, False, True):
slab = var[index] # one read: shape (NSPECIES,)
return SpeciesValues({sp: slab[si] for si, sp in enumerate(species)})
Step 2: batched slab reads via iter_range
Commit: 3eeb48e
The problem
After Step 1, the remaining bottleneck was the per-trajectory call
overhead of __getitem__ -> _load_trajectory -> _read_from_nc_var.
Each trajectory still required one NetCDF read per field, and the Python
function-call stack was entered ~40 times per trajectory. For 10,000
trajectories that is 400,000 individual NetCDF reads.
The fix
TrajectoryStore.iter_range(start, stop) reads each field as a single
slab of batch_size trajectories (default 500), then unpacks the
in-memory array into individual Trajectory objects. This reduces
400,000 NetCDF reads to ~800 (40 fields x 20 batches).
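The shape of the batching pattern is sketched below. The names `nc_vars` (a dict of netCDF4 variables keyed by field name), `n_trajectories`, and the unpacking into plain dicts are illustrative stand-ins for the store internals, not the actual `iter_range` implementation:

```python
def iter_batches(nc_vars, n_trajectories, batch_size=500):
    """Yield one dict of field values per trajectory, reading in slabs."""
    for start in range(0, n_trajectories, batch_size):
        stop = min(start + batch_size, n_trajectories)
        # One NetCDF read per field per batch, instead of per trajectory.
        slabs = {name: var[start:stop] for name, var in nc_vars.items()}
        for i in range(stop - start):
            # Unpack the in-memory slabs; no further NetCDF calls here.
            yield {name: slab[i] for name, slab in slabs.items()}
```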
Key design decisions:
- **File-boundary awareness.** For merged stores, each underlying `NcFile` has a `size_index` marking where each constituent file ends. A single slab read cannot straddle a file boundary. `iter_range` collects boundary points from all `NcFile`s, splits the requested range at those boundaries, then chunks each sub-range by `batch_size` (see the sketch after this list).
- **Positive local indices via `bisect_right`.** The existing `_load_trajectory` uses `bisect_left(size_index, index + 1)` and computes a negative local index (`index - size_index[file_index]`). This works for scalar reads because netCDF4 interprets negative scalar indices as "from the end." But for slab reads, `var[-2:-1]` uses Python slice semantics and produces wrong results. `iter_range` uses `bisect_right` and computes positive local offsets instead.
- **No cache interaction.** Trajectories from `iter_range` are not placed in the LRU cache (`self._trajectories`), since strict forward iteration would only thrash it.
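A minimal sketch of the boundary-splitting step, assuming `size_index` is the flat list of cumulative file end offsets described above (the function name and return format are illustrative):

```python
from bisect import bisect_right

def split_range(start, stop, size_index, batch_size=500):
    """Split [start, stop) at file boundaries, then into batch_size chunks.

    size_index holds cumulative end offsets of the constituent files,
    e.g. [1000, 2500, 3900]. Yields (file_index, local_start, local_stop)
    with non-negative local offsets, which are safe to use in slices.
    """
    boundaries = [b for b in size_index if start < b < stop]
    edges = [start, *boundaries, stop]
    for lo, hi in zip(edges, edges[1:]):
        file_index = bisect_right(size_index, lo)           # file containing lo
        file_start = 0 if file_index == 0 else size_index[file_index - 1]
        for chunk_lo in range(lo, hi, batch_size):
            chunk_hi = min(chunk_lo + batch_size, hi)
            yield file_index, chunk_lo - file_start, chunk_hi - file_start
```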
Step 3: skip type-checking on trusted internal reads
The problem
After Step 2, profiling showed that the NetCDF slab reads themselves
(`_load_trajectories`' own time) accounted for only 1.8 s of the 5.9 s
map phase. The new dominant cost was trajectory construction: the
setattr -> Container.__setattr__ -> convert_in -> _cast chain
consumed 4.7 s validating and re-casting data that was already correctly
typed from the NetCDF read.
For each of the ~506,600 field assignments (50 fields x 10,131
trajectories), _cast calls np.asarray, np.can_cast, and
ndarray.astype – all redundant when the data comes straight from a
NetCDF variable whose schema already enforces the correct type.
The fix
In _load_trajectories, write directly to the private _data dict
instead of going through __setattr__:
# Before
for k, v in per_traj_data[i].items():
setattr(traj, k, v) # -> convert_in -> _cast per field
# After
for k, v in per_traj_data[i].items():
traj._data[k] = v # trusted: data is from NetCDF slab read
This bypass is confined to the private _load_trajectories method.
The public Container.__setattr__ type-checking is unchanged and
continues to protect all user-facing code paths.
Step 4: bulk index lookup and batched reads via iter_flight_ids
Commit: 858f648
The problem
When a mission filter is provided, _trajectory_iterator in
trajectories_to_grid.py called store.get_flight(flight.flight_id) once
per flight inside a for flight in result loop:
# Before
for flight in result:
traj = store.get_flight(flight.flight_id)
if traj is not None:
yield traj
get_flight performs a flight-ID lookup by reading the entire
flight_id and trajectory_index NetCDF index arrays from disk on every
call. For N filtered trajectories this means:
- 2N full array reads (the index arrays, re-read every time), and
- N individual `_load_trajectory` calls, each going through the `setattr` → `convert_in` → `_cast` type-checking chain.
For 40 flights, this totalled ~34 s — all of it redundant I/O.
The fix
Collect all flight IDs into a plain Python list before touching the
store, then hand the whole list to store.iter_flight_ids:
# After (trajectories_to_grid.py)
flight_ids = [flight.flight_id for flight in result]
yield from store.iter_flight_ids(flight_ids)
TrajectoryStore.iter_flight_ids (new method in store.py):
- Reads the `flight_id` and `trajectory_index` index arrays once.
- Uses `np.searchsorted` (vectorized) to map all requested IDs to trajectory indices in a single pass over the sorted index.
- Filters out misses (IDs not present in the store) with a boolean mask.
- Sorts the found trajectory indices for sequential NetCDF access.
- Groups consecutive indices into contiguous runs using `np.diff` and `np.split`.
- Splits each run at file boundaries (same logic as `iter_range`).
- Loads each sub-range via `_load_trajectories` (batched slab reads, bypassing `setattr` validation; the same trusted-internal-path policy as `iter_range`).
- Does not populate the LRU cache (same policy as `iter_range`).
Key design decisions:
- **Vectorized ID mapping.** `get_flight` uses a Python loop with early exit to locate a single flight ID. `iter_flight_ids` instead calls `np.searchsorted` once over the whole request, making the ID-to-index mapping O(N log M) rather than O(N × M), where M is the index size.
- **Run grouping.** After sorting trajectory indices, consecutive entries that differ by 1 form a contiguous run that can be loaded with a single `_load_trajectories` call. `np.diff` + `np.where` identifies the break points in O(N) time. This avoids one NetCDF slab call per flight when flights happen to be stored contiguously in the file (see the sketch below).
- **Yield order.** Trajectories are yielded in trajectory-index order (i.e. the order they appear in the store), not in the order of the input `flight_ids` list. This is intentional: sequential index order maximises HDF5 chunk locality and avoids seeking backwards through the file.
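The run-grouping step can be sketched with NumPy alone; the helper name and half-open output format are illustrative, not the store's actual API:

```python
import numpy as np

def group_runs(trajectory_indices):
    """Group sorted trajectory indices into contiguous [start, stop) runs."""
    idx = np.sort(np.asarray(trajectory_indices))
    if idx.size == 0:
        return
    breaks = np.where(np.diff(idx) > 1)[0] + 1    # positions where a run ends
    for run in np.split(idx, breaks):
        yield int(run[0]), int(run[-1]) + 1        # one slab read per run

# [7, 8, 9, 42, 100, 101] -> (7, 10), (42, 43), (100, 102)
```

Each run is then split at file boundaries and handed to `_load_trajectories`, exactly as in `iter_range`.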
Lessons: avoiding NetCDF performance bugs
1. Every var[index] is expensive – never read the same index twice
NetCDF4-Python’s Variable.__getitem__ is not a simple array lookup. It
enters the C library, locates the chunk, decompresses it, applies type
conversion, and returns a new Python/NumPy object. Accessing var[i]
twice does all of that work twice. Always capture the result in a local:
# Bad
if var[i] == fill:
return None
return var[i]
# Good
val = var[i]
if val == fill:
return None
return val
This applies equally to multi-dimensional access like var[i, j].
2. Prefer slab reads over scalar loops
Reading var[start:stop] is much cheaper than looping over
var[i] for i in range(start, stop). The per-call overhead of
entering the C library dominates for small reads, so amortizing it
across a batch is a large win. The optimal batch size depends on memory
vs. overhead tradeoffs; 500 worked well here.
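For example, with `var` a netCDF4 variable and `start`/`stop` a trajectory range (placeholder names, as in the snippets above):

```python
# Bad: one C-library round-trip per trajectory
values = [var[i] for i in range(start, stop)]

# Good: one round-trip for the whole batch, then in-memory indexing
slab = var[start:stop]
values = list(slab)
```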
3. Watch out for negative indexing in slices
netCDF4-Python interprets a negative scalar index as “from the end of
the dimension,” which happens to produce correct results when the local
offset within a merged file is negative by construction. But negative
indices in slices (var[-3:-1]) follow Python’s slice-from-the-end
convention, which silently returns wrong data. If you are computing
local offsets for slab reads, make sure they are always non-negative.
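The failure mode is easy to reproduce with a plain NumPy array standing in for one constituent file's variable (slices follow the same Python semantics):

```python
import numpy as np

file_var = np.arange(1000)      # stand-in for one file's variable
file_end = 1000                 # this file's entry in size_index

# Scalar read at global index 998: the negative local offset still lands right.
file_var[998 - file_end]        # -> 998   (file_var[-2], counted from the end)

# Slab read of the last two entries [998, 1000): the stop offset becomes 0.
file_var[998 - file_end : 1000 - file_end]   # file_var[-2:0] -> empty array, silently wrong
```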
4. Profile before optimizing
The original plan included parallelizing the Numba gridding kernel. A cProfile run showed it accounted for less than 1 second of 32 seconds – the I/O path was 30x more important. Always profile to find the actual bottleneck before writing optimization code.
uv run python -m cProfile -o /tmp/profile.prof \
.venv/bin/aeic trajectories-to-grid <args>
python -c "
import pstats
s = pstats.Stats('/tmp/profile.prof')
s.sort_stats('tottime').print_stats(40)
"
5. Validation has a cost – bypass it on trusted internal paths
Defensive type-checking in __setattr__ is valuable for user-facing
APIs, but when data is being loaded from a schema-enforced store (like
NetCDF), the types are already correct by construction. In tight loops,
the per-assignment overhead of np.can_cast + ndarray.astype adds up
quickly. If you can identify an internal code path where the data
provenance is guaranteed, writing directly to the underlying storage
(_data dict) avoids redundant validation. Keep such bypasses private
and well-commented so they don’t leak into user-facing code.
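As a generic illustration of the pattern (a toy class, not the project's `Container`), the public path validates every assignment while a private constructor trusts schema-enforced input:

```python
import numpy as np

class Record:
    """Toy example: validate on the public path, trust the private one."""
    _fields = {"altitude": np.float64, "fuel_flow": np.float64}

    def __init__(self):
        object.__setattr__(self, "_data", {})

    def __setattr__(self, name, value):
        # Public path: validate and cast on every assignment.
        dtype = self._fields[name]
        value = np.asarray(value)
        if not np.can_cast(value.dtype, dtype, casting="same_kind"):
            raise TypeError(f"{name}: cannot cast {value.dtype} to {dtype}")
        self._data[name] = value.astype(dtype, copy=False)

    @classmethod
    def _from_trusted(cls, data):
        # Private path: data comes from a schema-enforced store, so skip
        # the per-field casting and write straight into the backing dict.
        rec = cls()
        rec._data.update(data)
        return rec
```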
6. Understand the library stack beneath you
netCDF4-Python wraps netCDF-C, which wraps HDF5. Each layer adds overhead per call. Patterns that look fine in pure Python (dictionary comprehensions that index a variable N times) become expensive when each index operation traverses three library boundaries. Think of NetCDF variable access as I/O, not as array indexing.
7. Collect all IDs before entering a read loop
When a loop calls a lookup function for each element in a result set, any per-call cost that scales with the full dataset — such as reading an entire index array — is paid once per element instead of once total.
# Bad: reads the full flight_id index array on every iteration
for flight in result:
traj = store.get_flight(flight.flight_id) # O(M) I/O × N times
# Good: read the index once, then map all IDs in one vectorized pass
flight_ids = [flight.flight_id for flight in result]
yield from store.iter_flight_ids(flight_ids) # O(M) I/O × 1 time
The pattern generalises: if a lookup helper re-reads any shared resource
(a database index, a config file, a NetCDF variable) on every call,
restructure the caller to collect all keys up front and call a bulk
variant of the lookup instead. NumPy’s searchsorted is the natural
tool for sorted integer or float index lookups.
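A self-contained example of the bulk-lookup shape (generic names, not the store's API), including the boolean-mask trick for dropping keys that are absent from the index:

```python
import numpy as np

def lookup_many(sorted_keys, values, requested):
    """Map many keys to values in one vectorized pass over a sorted index."""
    requested = np.asarray(requested)
    pos = np.searchsorted(sorted_keys, requested)
    pos = np.minimum(pos, len(sorted_keys) - 1)    # keep the gather in bounds
    found = sorted_keys[pos] == requested           # mask out missing keys
    return values[pos[found]], requested[found]

ids  = np.array([3, 7, 10, 42])     # sorted key index (e.g. flight IDs)
rows = np.array([0, 1, 2, 3])       # corresponding trajectory indices
lookup_many(ids, rows, [7, 5, 42])  # -> (array([1, 3]), array([ 7, 42]))
```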