Specification
An OMX matrix file is a structured collection of two-dimensional array objects and associated metadata. The collection of data implements the OMX Data Structure. This page describes the physical implementation of an OMX file, which is built on top of the well-established HDF5 scientific data storage standard. An OMX file has a specific layout, described below, that is intended to ensure that complete and consistent information about the matrix data is stored and that the data can be retrieved correctly and efficiently.
Direct manipulation of an OMX file through standard HDF5 tools is feasible, but using one of the APIs to construct and manipulate an OMX file is recommended because the OMX format requires that certain elements have consistent values and the API will enforce that consistency whereas a "generic" HDF5 interface will not. New APIs can be constructed using any HDF5-compliant toolset, provided the resulting file conforms to the layout and data-consistency requirements presented here.
- For Anyone managing transportation data with a matrix structure
- Who Wants a portable and open format for storing and exchanging that data
- That Is easy to understand, can be used in custom programs/scripts, and supports commercial products
- Unlike Existing solutions that are either proprietary, cumbersome, slow, and/or large
An OMX file is referred to as an "open modeling matrix" or simply "open matrix", and uses the standard file extension ".omx".
An OMX file has an HDF5 tree structure, with these elements:
- The "root node", which is the parent of all data inside the file. This root node is similar to a file system root directory on a hard drive, and is usually nameless and represented by the character "/".
- A "/data" folder under the root node, which contains all the matrix data in the OMX file. Each individual 2D matrix (or "table") is represented as a leaf under this data folder.
- A "/lookup" folder under the root node, which contains index information about each axis of the 2D matrices. This information can be used to retrieve matrix data using index labels such as Traffic Analysis Zone (TAZ) numbers.
The root node of an OMX matrix should include the following standard attributes:
- OMX_VERSION: A string representing the version number of the OMX matrix file standard that has been implemented. This document specifies OMX_VERSION "0.2".
- SHAPE: An array of two integers (rows, columns). Each table in the /data folder will have this shape. Other metadata specifying rows and columns will be ignored.
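The layout and root attributes described above can be sketched with h5py, a generic HDF5 library for Python. This is only an illustration of the structure, not one of the OMX APIs: the helper name `create_omx_skeleton`, its `in_memory` flag, and the choice of a 32-bit integer type for SHAPE are assumptions made for the example.

```python
import h5py
import numpy as np


def create_omx_skeleton(path, rows, cols, in_memory=False):
    """Create an empty HDF5 file with the OMX root attributes and folders.

    Hypothetical helper for illustration only.
    """
    # driver="core" with backing_store=False keeps the file in memory,
    # which is convenient for experiments; omit it to write to disk.
    kwargs = {"driver": "core", "backing_store": False} if in_memory else {}
    f = h5py.File(path, "w", **kwargs)

    # Root attributes named by the specification.
    f.attrs["OMX_VERSION"] = "0.2"
    f.attrs["SHAPE"] = np.array([rows, cols], dtype="int32")  # dtype is an assumption

    # The two folders (HDF5 groups) under the root node.
    f.create_group("data")
    f.create_group("lookup")
    return f


omx = create_omx_skeleton("skeleton.omx", 1500, 3500, in_memory=True)
print(sorted(omx.keys()))
print(omx.attrs["OMX_VERSION"], list(omx.attrs["SHAPE"]))
```

A file built this way has no matrix tables yet; it only shows where the required pieces live.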
Every two-dimensional matrix (table) in an OMX file resides as a leaf node in the /data folder. No subfolders are allowed. The tables are stored as two-dimensional HDF5 datasets in row-major order. The data needs to be written to the file in row chunks or else the I/O will be quite slow. The data types across matrix tables may be different as noted below in HDF5 Attributes.
Each matrix table:
- Is stored under a unique Name. The Name is required and is used to retrieve a specific matrix data table. With the root node and the name established, the matrix can be identified by its full path, e.g. "/data/m1". Most OMX APIs will allow you to leave out the "/data" portion of the path as it is the same in all OMX files.
- May have a descriptive title attribute (a string of any length). This attribute is optional but recommended. A conforming API for an OMX file will be able to parse and store the Title attribute.
- May have an NA attribute: the value that stands for missing or NULL data, such as -9999. It is up to the user to supply an NA value that is consistent with the data stored in the matrix. This attribute is optional, but highly recommended if any cells in any of the matrix data tables might contain "missing" information.
Matrices can have an unlimited number of additional attributes as key-value pairs, but support for entering and retrieving specific metadata elements will vary by API. The attribute keys are strings, and values can be any simple data type, including ints, floats, and strings. Some examples of possible user-specified attributes:
- pa-format: (true/false), to signify production/attraction format
- year: (integer), to identify the forecast year represented in a particular matrix
- source: (string) signifying the source folder or alternative represented.
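As a rough sketch of how a table and its attributes might be written with h5py directly: the helper `add_table` is hypothetical, the lowercase spelling of the `title` key is an assumption, and the sample values are invented for illustration. The helper also rejects arrays that do not match the root SHAPE attribute, which is the kind of consistency check an OMX API is expected to enforce.

```python
import h5py
import numpy as np


def add_table(f, name, values, title=None, na=None, extra=None):
    """Store a 2D array under /data/<name> and attach its attributes."""
    values = np.asarray(values)
    expected = tuple(int(n) for n in f.attrs["SHAPE"])
    if values.shape != expected:
        raise ValueError(f"table shape {values.shape} does not match SHAPE {expected}")

    dset = f["data"].create_dataset(name, data=values)
    if title is not None:
        dset.attrs["title"] = title  # key spelling is an assumption
    if na is not None:
        dset.attrs["NA"] = na
    # Any further key-value pairs are stored as plain HDF5 attributes.
    for key, value in (extra or {}).items():
        dset.attrs[key] = value
    return dset


# In-memory file so the example leaves nothing on disk.
f = h5py.File("attrs_demo.omx", "w", driver="core", backing_store=False)
f.attrs["SHAPE"] = np.array([2, 2], dtype="int32")
f.create_group("data")

m1 = add_table(
    f,
    "m1",
    [[1.0, -9999.0], [3.0, 4.0]],
    title="AM peak travel time",
    na=-9999.0,
    extra={"year": 2040, "source": "baseline"},
)
print(m1.name, dict(m1.attrs))
```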
It is frequently useful to map matrix indices to some other set of values; e.g. 1-based cell numbers (HDF5 natively uses 0-based indexing); TAZ lookups; district definitions; etc. The structure used to construct such maps can be referred to as a "mapping", a "lookup", or an "index map". The term "index map" is used here. Since all matrices in an OMX file are the same shape, they can all use the same index maps. Those index maps are stored in the /lookup folder.
An index map is simply a one-dimensional array (vector) of a simple data type, such as integer or string, with the same number of elements as the corresponding matrix dimension (one of the numbers in the OMX SHAPE attribute). The elements of the vector are populated with the corresponding index value. For example, a matrix of SHAPE(1500,3500) could have index maps of two different sizes, 1500 or 3500. An index map of dimension 1500 could only be used for the first index of such a matrix, whereas an index map of dimension 3500 could only be used with the second index. In the much more common case of a square matrix, the same index map could be used for either dimension.
This feature can be used to associate TAZ "numbers" (which may not be contiguous, and may not even be numbers) with a set of ordinal integers corresponding to row or column index positions in the data set. Each element of the lookup map will contain the TAZ number associated with the corresponding index position in the matrix data. Another common application would be to identify districts or jurisdictions such as cities or counties. In that case, each element of the index map will contain the district identifier for the corresponding index position. The same district identifier may be used for multiple index positions.
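The two uses described above, looking up a cell by TAZ label and grouping positions by district, can be illustrated in plain Python. The matrix values, labels, and function names below are invented for the example; a real application would read the index map from the /lookup folder.

```python
from collections import defaultdict


def positions_by_label(index_map):
    """Group 0-based matrix positions by the label stored at each position."""
    groups = defaultdict(list)
    for position, label in enumerate(index_map):
        groups[label].append(position)
    return dict(groups)


def district_total(matrix, district_map, origin, destination):
    """Sum all cells whose row is in `origin` and whose column is in `destination`."""
    groups = positions_by_label(district_map)
    return sum(matrix[r][c] for r in groups[origin] for c in groups[destination])


# A 3x3 table with two index maps of matching length.
trips = [
    [0, 5, 2],
    [4, 0, 1],
    [3, 6, 0],
]
taz = [101, 205, "A7"]                   # labels need not be contiguous or numeric
district = ["north", "north", "south"]   # one label may cover several positions

taz_positions = positions_by_label(taz)
row = taz_positions[205][0]
col = taz_positions["A7"][0]
print(trips[row][col])                                    # trips from TAZ 205 to TAZ A7
print(district_total(trips, district, "north", "south"))  # north-to-south total
```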
Each index map:
- Is stored under a unique Name that is used to identify the index map, and that should be meaningful for your data; e.g. "TAZ", "District", etc. With the root node and the name established, the index map can be retrieved by its full path, e.g. "/lookup/taz". Most OMX APIs will allow you to leave out the "/lookup" portion of the path as it is the same in all OMX files.
- Is a one-dimensional array with the same number of elements as one of the matrix SHAPE dimensions; e.g. the number of rows or the number of columns. This array can be any type, but the type should support unique matching (an integer or string type will work, but not a floating point type).
- May have an optional attribute called dim with a value of 0 or 1. A 0 means the lookup applies to the first dimension, a 1 means it applies to the second dimension, and no attribute means it applies to both.
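One way to read the length and dim rules is as a small consistency check. The function below is a hypothetical sketch, not part of any OMX API: it treats a missing dim attribute as "applies to both dimensions", so the map length must then match both entries of SHAPE.

```python
def index_map_problems(shape, length, dim=None):
    """Return a list of problems found; an empty list means the map fits."""
    if dim is None:
        axes = (0, 1)  # no dim attribute: the map applies to both dimensions
    elif dim in (0, 1):
        axes = (dim,)
    else:
        return [f"dim must be 0 or 1, got {dim!r}"]

    names = ("rows", "columns")
    return [
        f"map has {length} elements but SHAPE has {shape[axis]} {names[axis]}"
        for axis in axes
        if shape[axis] != length
    ]


# A 1500-element map fits the rows of a 1500 x 3500 matrix when dim is 0,
# but cannot apply to both dimensions.
print(index_map_problems((1500, 3500), 1500, dim=0))
print(index_map_problems((1500, 3500), 1500))
```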
The underlying HDF5 format requires that certain structural information be specified. The data type must be specified for each matrix table in /data, but may be different for each table:
- Data type, which is one of several pre-defined native data types. Data types are the usual suspects and include int, float, and specific bit-lengths such as "int32" and "float64".
The following items must be the same for all tables in the OMX file:
- Compression level or other "filters". It is strongly recommended, but not required, that OMX files be compressed using the "zlib" compression filter with compression level 1. Only zlib compression is supported for OMX files, because only zlib compression is available across all HDF5 implementations. Zlib compression level 1 is the default.
  Without compression, matrix data can be quite large on disk. Zlib is the fastest compressor available across all HDF5 implementations, and using it will greatly reduce file sizes without compromising file transferability. The compression filter slows down file writing; thus, depending on the application, you may decide that much larger files are worth the gain in writing speed. Note that other compressors are available in some HDF5 implementations, but using them will mean your file will not be readable by some HDF5 implementations, and therefore your file will not be an OMX file.
- Chunksize. Normal use of OMX files does not require setting a "chunksize" as this is handled by the OMX API implementations. Specifying a chunksize helps the HDF5 library write your data efficiently; it's basically a memory buffer size (in bytes) that HDF5 uses before actually writing to disk. Chunksize in no way affects the data itself: it only affects the I/O speed of your application. Again, the standard OMX API implementations will set this to a reasonable default that works for most use cases.
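For readers working with a generic HDF5 interface, the settings above might look as follows in h5py. This is a sketch under assumptions: `write_compressed_table` is a hypothetical helper, and the one-row chunk shape is simply one way to follow the "row chunks" advice given earlier. Note that h5py calls the zlib filter "gzip" and takes the chunk layout as a shape in elements rather than a size in bytes.

```python
import h5py
import numpy as np


def write_compressed_table(f, name, values):
    """Write a table under /data with row chunks and zlib level 1."""
    values = np.asarray(values)
    return f.create_dataset(
        "data/" + name,
        data=values,
        chunks=(1, values.shape[1]),  # one chunk per matrix row
        compression="gzip",           # h5py's name for the zlib (deflate) filter
        compression_opts=1,           # compression level 1
    )


values = np.zeros((200, 300), dtype="float64")
values[::10, ::10] = 1.5  # mostly zeros, invented sample data

# In-memory file so the example leaves nothing on disk.
with h5py.File("compressed_demo.omx", "w", driver="core", backing_store=False) as f:
    dset = write_compressed_table(f, "m1", values)
    settings = (dset.compression, dset.compression_opts, dset.chunks)
    round_trip_ok = bool(np.array_equal(dset[...], values))

print(settings, round_trip_ok)
```

Reading the data back returns the original values unchanged, which reflects the point that compression and chunking affect only storage and I/O speed, not the data itself.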