Trajectory stores
Trajectories generated by simulation using trajectory builders can be
persisted to NetCDF files using the TrajectoryStore class. The class covers
both the base case of saving plain trajectory data and the storage of
associated data, such as emissions values, alongside the trajectories.
This system works in conjunction with the Trajectory and FieldSet classes.
Each field set for a trajectory is stored in a separate NetCDF group within a
NetCDF file (or files: see below). Associated data can be stored in separate
associated NetCDF files to allow for cases where emissions data is created by
a separate process from the base trajectory simulation.
When working in conjunction with the mission database, a TrajectoryStore
can create an index based on the mission database flight ID. This makes
referencing simulated trajectories based on mission database queries
straightforward.
More detailed information can be found in the documentation string for the
TrajectoryStore class. Example usage of the TrajectoryStore class can be
found in the test/test_trajectory_stores.py test file, and a more extended
example, including parallel generation and merging of trajectory stores, can
be seen in the notebooks/end-to-end-simulation.ipynb notebook in the AEIC
repository.
Warning
The TrajectoryStore class is not thread-safe. Even opening a trajectory store read-only in multiple threads is unsafe. This is a limitation of the underlying HDF5 and netcdf-c libraries on which the Python netCDF4 package relies.
Parallelism for trajectory stores should always be done at the process level. Multiple processes can open a trajectory store for reading without causing problems. For creating trajectory stores in parallel, there is a facility to merge multiple stores into a single “merged store”: split the simulation work to be done, run simulations in separate processes and create multiple trajectory stores, one per process, then merge them. The merged store can then be opened in parallel by multiple processes for further processing. This approach avoids all problems with the lack of thread safety in the underlying libraries.
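The split, simulate and merge workflow described above can be sketched as follows. This is a minimal illustration using only the standard library: split_work, simulate_chunk, run_parallel and the per-process file names are hypothetical stand-ins, not part of AEIC, and the final merge step (performed by TrajectoryStore.merge in AEIC) is only indicated in a comment.

```python
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path


def split_work(flight_ids: list[int], n_chunks: int) -> list[list[int]]:
    """Split flight IDs into at most n_chunks contiguous, near-equal chunks."""
    size, extra = divmod(len(flight_ids), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        end = start + size + (1 if i < extra else 0)
        if end > start:
            chunks.append(flight_ids[start:end])
        start = end
    return chunks


def simulate_chunk(job: tuple[int, list[int], str]) -> str:
    """Hypothetical worker: each process writes one store file of its own."""
    worker, flight_ids, out_dir = job
    path = Path(out_dir) / f"store-{worker:03d}.nc"
    # A real worker would create a trajectory store here and add one
    # simulated trajectory per flight ID; this stand-in writes a placeholder.
    path.write_text("\n".join(str(f) for f in flight_ids))
    return str(path)


def run_parallel(flight_ids: list[int], out_dir: str, n_procs: int) -> list[str]:
    """Run one worker per chunk in separate processes; return the file paths."""
    jobs = [
        (i, chunk, out_dir)
        for i, chunk in enumerate(split_work(flight_ids, n_procs))
    ]
    with ProcessPoolExecutor(max_workers=n_procs) as pool:
        paths = list(pool.map(simulate_chunk, jobs))
    # The per-process files would then be combined into one merged store.
    return paths
```

No store object is shared between processes in this sketch: each worker owns its output file from creation to close, which is what keeps the approach clear of the thread-safety limitation.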
- class AEIC.trajectories.store.TrajectoryStore(*, base_file: str | Path | None = None, mode: FileMode = FileMode.READ, override: bool | None = None, force_fieldset_matches: bool | None = None, cache_size_mb: int = 2048, associated_files: list[tuple[str | Path, list[str]]] | list[str | Path] | None = None, title: str | None = None, comment: str | None = None, history: str | None = None, source: str | None = None)
Class representing a set of trajectories stored in NetCDF files.
Fundamentally, a TrajectoryStore is an append-only collection of Trajectory objects that can be stored in and retrieved from NetCDF files. In the simplest file-based use case, a TrajectoryStore stores plain trajectory data in a single NetCDF file. However, the class offers additional functionality that allows additional data fields to be combined with trajectory data, either in the same (“base”) NetCDF file, or in additional (“associated”) NetCDF files.
The intention here is to support a number of different use cases for storing trajectory and associated (normally emissions) data.
Trajectory stores are not thread-safe. A program may create and use TrajectoryStore instances from a single thread only. This single-threaded restriction applies both to read-only access and to mutation of stores.
Field sets
Data stored in a TrajectoryStore is divided into “field sets” (represented by the FieldSet class from the AEIC.trajectories.field_sets package). A field set is a collection of data and metadata fields that are part of a trajectory or data that lives alongside a trajectory (emissions data of one sort or another, for example). “Data fields” in field sets have values for each point along a trajectory: the length of the data values in each of these fields must match the length of the trajectory. “Metadata fields” in field sets are per-trajectory values: there is one value of each of these fields for each trajectory. Each field in a field set has a name, a data type and associated information used for serialization to and from NetCDF files.
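The distinction between data fields and metadata fields can be illustrated with a small sketch. DemoFieldSet and check_values are hypothetical simplifications written for this example; the real FieldSet class also carries data types and serialization information that are omitted here.

```python
from dataclasses import dataclass


@dataclass
class DemoFieldSet:
    """Hypothetical, simplified field set: field names only."""

    name: str
    data_fields: set[str]  # one value per trajectory point
    metadata_fields: set[str]  # one value per trajectory


def check_values(fs: DemoFieldSet, n_points: int, values: dict) -> None:
    """Check that values fit the field set for a trajectory of n_points points."""
    for field_name in fs.data_fields:
        if len(values[field_name]) != n_points:
            raise ValueError(
                f"{fs.name}.{field_name}: expected {n_points} values, "
                f"got {len(values[field_name])}"
            )
    for field_name in fs.metadata_fields:
        if field_name not in values:
            raise KeyError(f"{fs.name}.{field_name}: missing metadata value")
```

For a three-point trajectory, a data field such as "co2" needs exactly three values, while a metadata field such as "total_co2" needs a single value.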
A TrajectoryStore always contains the “base” field set, which holds the basic trajectory data (defined as BASE_FIELDS in the AEIC.trajectories.trajectory package). Additional field sets can be stored in a TrajectoryStore as needed.
The set of field sets contained in a TrajectoryStore is determined when the first trajectory is added to the store. All subsequent trajectories added must have the same field sets.
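The rule that the first trajectory fixes the field sets for the whole store can be sketched as below. FixedFieldSetStore is a hypothetical stand-in that tracks field set names only; it is not how TrajectoryStore is implemented.

```python
from __future__ import annotations


class FixedFieldSetStore:
    """Hypothetical append-only store whose field sets are fixed on first add."""

    def __init__(self) -> None:
        self._items: list[object] = []
        self._fieldsets: frozenset[str] | None = None

    def add(self, fieldsets: set[str], payload: object) -> int:
        """Append a payload and return its index in insertion order."""
        if self._fieldsets is None:
            # The first trajectory determines the field sets of the store.
            self._fieldsets = frozenset(fieldsets)
        elif frozenset(fieldsets) != self._fieldsets:
            raise ValueError(
                f"field sets {sorted(fieldsets)} do not match "
                f"store field sets {sorted(self._fieldsets)}"
            )
        self._items.append(payload)
        return len(self._items) - 1
```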
Access modes and NetCDF files
The TrajectoryStore class supports three access modes: CREATE, READ and APPEND, accessed using the create, open and append class methods respectively. In addition, further associated NetCDF files can be created for an existing TrajectoryStore using the create_associated method. In general, TrajectoryStore instances should be managed using a with context manager to ensure that files are closed properly when no longer needed. This is particularly important because of the tricky finalization semantics of the underlying NetCDF and HDF5 libraries.
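The reason for preferring a with block is that close() runs even when the body raises. The sketch below shows the pattern with a hypothetical stand-in class, ClosingDemoStore, rather than the real TrajectoryStore, whose context manager internals are not described here.

```python
class ClosingDemoStore:
    """Hypothetical stand-in for a file-backed store; not the AEIC class."""

    def __init__(self) -> None:
        self.closed = False

    def close(self) -> None:
        # A real store would finalize and close its NetCDF files here.
        self.closed = True

    def __enter__(self) -> "ClosingDemoStore":
        return self

    def __exit__(self, exc_type, exc, tb) -> bool:
        self.close()
        return False  # let any exception from the body propagate


def failing_use() -> ClosingDemoStore:
    """Raise inside the with block and return the (now closed) store."""
    store = ClosingDemoStore()
    try:
        with store:
            raise RuntimeError("simulated failure while the store is open")
    except RuntimeError:
        pass
    return store
```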
A TrajectoryStore may be:
- created purely in memory, or
- connected to a single NetCDF file (in which data from all field sets will be stored), or
- connected to a base NetCDF file (holding the “base” field set of trajectory data and zero or more additional field sets) and one or more associated NetCDF files (storing other non-base field sets).
Trajectories are managed by a TrajectoryStore using an LRU memory cache with a (user configurable) fixed size. When a TrajectoryStore is associated with NetCDF files, entries from the trajectory cache can be evicted, since they are stored in external files. An in-memory TrajectoryStore does not have this flexibility, and if the stored trajectories overflow the cache size, requiring an eviction, a TrajectoryCache.EvictionOccurred exception is raised.
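The difference in eviction behavior between file-backed and in-memory stores can be sketched with a small LRU cache. DemoTrajectoryCache is a hypothetical illustration that measures size in bytes of an opaque blob; the EvictionOccurred class defined here is a local stand-in for TrajectoryCache.EvictionOccurred, and the real cache may account for size differently.

```python
from __future__ import annotations

from collections import OrderedDict


class EvictionOccurred(Exception):
    """Local stand-in for the exception raised when an in-memory store overflows."""


class DemoTrajectoryCache:
    """Hypothetical fixed-size LRU cache keyed by trajectory index."""

    def __init__(self, max_bytes: int, file_backed: bool) -> None:
        self.max_bytes = max_bytes
        self.file_backed = file_backed
        self._entries: OrderedDict[int, bytes] = OrderedDict()
        self._used = 0

    def put(self, index: int, blob: bytes) -> None:
        if self._used + len(blob) > self.max_bytes and not self.file_backed:
            # Nothing on disk to fall back on, so eviction would lose data.
            raise EvictionOccurred("in-memory cache is full")
        while self._entries and self._used + len(blob) > self.max_bytes:
            _, old = self._entries.popitem(last=False)  # least recently used
            self._used -= len(old)
        self._entries[index] = blob
        self._used += len(blob)

    def get(self, index: int) -> bytes | None:
        blob = self._entries.get(index)
        if blob is not None:
            self._entries.move_to_end(index)  # mark as most recently used
        return blob
```

In the file-backed case a miss would be served by re-reading the trajectory from the NetCDF file, which is why eviction is safe there.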
Base and associated files
Every TrajectoryStore has a “base” NetCDF file (or files, for a merged store) that contains the base field set and possibly additional field sets. Additional field sets may be stored in separate “associated” NetCDF files. Each associated file may contain one or more field sets.
When opening a TrajectoryStore in READ or APPEND mode, a list of associated files may be provided along with the base file. Trajectories retrieved from the resulting store will have data from all field sets stored in the base and associated files.
When creating a new TrajectoryStore in CREATE mode, a list of associated files may be provided along with the base file. In this case, the list must specify which field sets are to be stored in each associated file. When the first trajectory is added to the store, the base and associated files are created and the field sets are distributed among them as specified.
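The two shapes accepted by the associated_files argument (plain paths when opening, path plus field set names when creating) can be illustrated with a small normalization helper. normalize_associated is a hypothetical function written for this example; it mirrors the type in the constructor signature but is not part of AEIC.

```python
from __future__ import annotations

from pathlib import Path


def normalize_associated(
    associated_files: list | None, creating: bool
) -> list[tuple[Path, list[str] | None]]:
    """Return (path, field set names or None) pairs for each associated file."""
    result: list[tuple[Path, list[str] | None]] = []
    for entry in associated_files or []:
        if isinstance(entry, tuple):
            # (path, [field set names]): the form needed when creating.
            path, fieldsets = entry
            result.append((Path(path), list(fieldsets)))
        elif creating:
            raise ValueError(
                f"{entry}: field sets must be specified for each "
                "associated file when creating a store"
            )
        else:
            # Plain path: field sets are discovered from the existing file.
            result.append((Path(entry), None))
    return result
```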
Associated files may be created from an existing TrajectoryStore using the create_associated method. This is essentially a mapping operation: it takes each trajectory in the store, applies a user-supplied function to generate new data values, and stores the resulting data in a new associated NetCDF file. The intended use case is the calculation of data, such as emissions, that are associated with trajectories but are not part of the trajectory data itself, and so may be calculated separately.
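The mapping operation can be sketched in plain Python as follows. Everything in this example is hypothetical: DemoTrajectory is a dict of per-point fields, the write callback stands in for writing to the associated NetCDF file, and the signature of the real mapping function (AssociatedFileCreateFn) is not reproduced here.

```python
from typing import Callable, Iterable

# Hypothetical stand-in: a trajectory is a dict of per-point data fields.
DemoTrajectory = dict[str, list[float]]


def map_to_associated(
    trajectories: Iterable[DemoTrajectory],
    mapping_function: Callable[[DemoTrajectory], DemoTrajectory],
    write: Callable[[int, DemoTrajectory], None],
) -> int:
    """Map a function over trajectories, writing each result as it is produced."""
    count = 0
    for index, traj in enumerate(trajectories):
        n_points = len(next(iter(traj.values())))
        new_fields = mapping_function(traj)
        for name, values in new_fields.items():
            # New data fields must have one value per trajectory point.
            if len(values) != n_points:
                raise ValueError(f"{name}: expected {n_points} values")
        write(index, new_fields)  # stands in for the associated file write
        count += 1
    return count


def demo_emissions(traj: DemoTrajectory) -> DemoTrajectory:
    # Illustrative factor only, not a real emission index.
    return {"demo_emission": [2.0 * f for f in traj["fuel_flow"]]}
```

Results are written one trajectory at a time rather than collected first, so the full set of new data never needs to be held in memory.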
When an associated file is created, metadata is stored within the file to link it to the base file from which it was created. This linkage is checked when opening a TrajectoryStore with associated files to ensure that the associated files correspond to the base file.
Merged stores
To simplify management of large trajectory data sets that may be split across multiple NetCDF files, the TrajectoryStore class supports “merged” stores. A merged store is stored as a directory containing multiple NetCDF files, each of which contains a subset of the trajectories in the store, along with a JSON metadata file and possibly a separate NetCDF file for the store’s flight ID index.
A merged store is created using the merge static method, which takes as input a list of existing NetCDF files and the name of a directory in which to create the merged store. The resulting merged store may then be opened in READ mode like any other TrajectoryStore. The merged store directory must have the extension “.aeic-store”.
Merged stores may be created from both base files and associated files.
Trajectory access and mission database indexing
Trajectories can be retrieved from a trajectory store simply by indexing the store with the integer index of the trajectory within the store. Indexes are assigned in order from zero in the order of insertion of trajectories. Indexes into merged stores run consecutively from zero across all the constituent NetCDF files composing the merged store.
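Finding the constituent file for a store-wide index in a merged store amounts to a search over cumulative trajectory counts. The sketch below assumes that entry i of the cumulative list holds the total number of trajectories in files 0 to i inclusive; the function name locate and this exact layout are assumptions made for illustration.

```python
from bisect import bisect_right


def locate(size_index: list[int], index: int) -> tuple[int, int]:
    """Map a store-wide index to (file number, index within that file).

    size_index[i] is assumed to be the total number of trajectories
    in files 0..i of the merged store.
    """
    if not size_index or not 0 <= index < size_index[-1]:
        raise IndexError(index)
    file_no = bisect_right(size_index, index)
    previous_total = size_index[file_no - 1] if file_no else 0
    return file_no, index - previous_total


# Three files holding 3, 2 and 4 trajectories respectively.
cumulative = [3, 5, 9]
first_in_second_file = locate(cumulative, 3)
last_overall = locate(cumulative, 8)
```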
In addition, if all trajectories in the store have a flight_id field (representing the mission database flight ID for the trajectory), the store can be indexed by flight ID as well. In this case, the store is said to be “indexable”. If a store is indexable, the flight_id field of each trajectory must be unique within the store. A trajectory can be retrieved by flight ID using the get_flight method.
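A flight ID index with the uniqueness requirement described above can be sketched as a dictionary from flight ID to store index. FlightIndex is a hypothetical class for illustration; how the real store builds and persists its index is not shown here.

```python
from __future__ import annotations


class FlightIndex:
    """Hypothetical mapping from mission database flight ID to store index."""

    def __init__(self) -> None:
        self._by_flight: dict[int, int] = {}

    def register(self, flight_id: int, store_index: int) -> None:
        """Record a trajectory; flight IDs must be unique within the store."""
        if flight_id in self._by_flight:
            raise ValueError(f"duplicate flight ID {flight_id}")
        self._by_flight[flight_id] = store_index

    def lookup(self, flight_id: int) -> int | None:
        """Return the store index for a flight ID, or None if it is unknown."""
        return self._by_flight.get(flight_id)
```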
- enum FileMode(value)
- Member Type:
str
Valid values are as follows:
- READ = <FileMode.READ: 'r'>
- CREATE = <FileMode.CREATE: 'w'>
- APPEND = <FileMode.APPEND: 'a'>
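For reference, an enumeration with the member type and values listed above could be declared as below. This is a reconstruction from the listing, assuming a str mix-in Enum; the actual declaration in AEIC may differ in detail.

```python
from enum import Enum


class FileMode(str, Enum):
    """Reconstructed sketch of the access mode enumeration."""

    READ = "r"
    CREATE = "w"
    APPEND = "a"


# Members can be looked up from their string values.
mode = FileMode("a")
```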
- class NcFiles(path: list[Path], fieldsets: set[str], dataset: list[Dataset], traj_dim: list[Dimension], groups: dict[str, list[Group]], size_index: list[int] | None, title: str | None = None, comment: str | None = None, history: str | None = None, source: str | None = None, created: datetime | None = None)
Internal class used to store information about NetCDF files associated with field sets in a TrajectoryStore.
- comment: str | None = None
Comment global attribute value.
- created: datetime | None = None
Creation time global attribute value.
- dataset: list[Dataset]
netCDF4 Dataset objects for the files: multiple to support merged stores.
- fieldsets: set[str]
Field sets stored in the NetCDF files.
- groups: dict[str, list[Group]]
Mapping from field set names to NetCDF groups in the files: there are multiple groups for each field set name to support merged stores.
- history: str | None = None
History global attribute value.
- path: list[Path]
Paths to NetCDF files: multiple to support merged stores.
- size_index: list[int] | None
Cumulative trajectory count through the NetCDF files represented here: not used for single NetCDF stores, but for merged stores, used to find the underlying NetCDF file containing a given trajectory index.
- source: str | None = None
Source global attribute value.
- title: str | None = None
Title global attribute value.
- traj_dim: list[Dimension]
Trajectory dimension objects for the files: multiple to support merged stores.
- active_in_thread: int | None = None
Thread ID of active TrajectoryStore instance, if any. Multi-threaded access is not allowed. This attribute is used to check for this.
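A check of this kind can be sketched with threading.get_ident(). ThreadGuard is a hypothetical illustration of recording an owning thread ID at class level and rejecting use from any other thread; the checks actually performed by TrajectoryStore may differ.

```python
from __future__ import annotations

import threading


class ThreadGuard:
    """Hypothetical single-thread ownership check."""

    # ID of the thread that currently owns the guard, if any.
    active_in_thread: int | None = None

    @classmethod
    def claim(cls) -> None:
        """Claim ownership for the calling thread or fail if another owns it."""
        me = threading.get_ident()
        if cls.active_in_thread is None:
            cls.active_in_thread = me
        elif cls.active_in_thread != me:
            raise RuntimeError(
                "store already in use from another thread; "
                "use separate processes for parallel work"
            )

    @classmethod
    def release(cls) -> None:
        """Give up ownership so that another thread may claim it."""
        cls.active_in_thread = None
```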
- add(trajectory: Trajectory) int
Add a trajectory to the store and return its index.
- classmethod append(*args, **kwargs) TrajectoryStore
Open an existing TrajectoryStore for appending.
This class method is the preferred way to open an existing trajectory store for appending. Call this method in preference to calling the class constructor directly.
- close()
Close any open NetCDF files associated with the trajectory store.
- classmethod create(*args, **kwargs) TrajectoryStore
Create a new TrajectoryStore.
This class method is the preferred way to create a new trajectory store. Call this method in preference to calling the class constructor directly.
- create_associated(associated_file: str | Path, fieldsets: list[str], mapping_function: AssociatedFileCreateFn, *args, **kwargs) None
Create an associated NetCDF file for additional field sets.
This maps a function over all trajectories in the TrajectoryStore to create new associated data values, which are immediately written to an associated NetCDF file.
- property files: list[NcFiles]
NetCDF file information associated with the trajectory store.
The base NetCDF file is the first entry in the list; any associated files follow.
- get_flight(flight_id: int) Trajectory | None
Look up a trajectory by flight ID.
- static merge(output_store: str | Path, input_stores: list[str | Path] | None = None, input_stores_pattern: str | Path | None = None, input_stores_index_range: tuple[int, int] | None = None, title: str | None = None, comment: str | None = None, history: str | None = None, source: str | None = None) None
Merge multiple TrajectoryStore files into a single merged store.
TODO: Finish this docstring.
- property nc_linked: bool
Is a NetCDF file (or files) associated with the trajectory store?
- classmethod open(*args, **kwargs) TrajectoryStore
Open an existing TrajectoryStore read-only.
This class method is the preferred way to open an existing trajectory store for reading. Call this method in preference to calling the class constructor directly.
- save(base_file: str | Path, associated_files: list[tuple[str | Path, list[str]]] | list[str | Path] | None = None)
Create NetCDF files for a TrajectoryStore currently not linked to one.
This method is used when an in-memory TrajectoryStore needs to be persisted to disk. This method can only be called on a TrajectoryStore created in CREATE mode with base_file=None.
- sync()
Synchronize any pending writes to the NetCDF file or files.
Note that this does not necessarily make the NetCDF files readable by another application because of NetCDF4’s finalization behavior. To ensure complete finalization, call close() instead.