radis.api.cdsdapi module
Parser for CDSD-HITEMP, CDSD-4000 format.
Routine Listing
References
CDSD-4000 manual
- cdsd2df(fname, version='hitemp', cache=True, load_columns=None, verbose=True, drop_non_numeric=True, load_wavenum_min=None, load_wavenum_max=None, engine='pytables', output='pandas')
Convert a CDSD-HITEMP [1] or CDSD-4000 [2] file to a Pandas dataframe.
- Parameters:
fname (str) – CDSD file name
version (str (‘4000’, ‘hitemp’)) – CDSD version
cache (boolean, or ‘regen’) – if True, a pandas-readable HDF5 file is generated on first access and reused afterwards. This saves the datatype cast and conversion and improves performance a lot (but later changes in the database are not taken into account). If False, no cache file is used. If ‘regen’, the cache file is regenerated. Default True.
load_columns (list) – columns to load. If None, loads everything.
Note: this is only relevant if loading from a cache file. To generate the cache file, all columns are loaded anyway.
- Other Parameters:
drop_non_numeric (boolean) – if True, non-numeric columns are dropped. This improves performance, but make sure all the columns you need are converted to numeric formats beforehand. Default True. Note that if a cache file is loaded it will be left untouched.
load_wavenum_min, load_wavenum_max (float) – if not None, only load the cached file if it contains data for wavenumbers above/below the specified value. See radis.api.cache_files.load_h5_cache_file(). Default None.
engine (‘pytables’, ‘vaex’) – format for the HDF5 cache file. Default ‘pytables’.
- Returns:
df – dataframe containing all lines and parameters
- Return type:
pandas Dataframe or Vaex Dataframe
Notes
CDSD-4000 Database can be downloaded from [3]
Performance: I had huge performance trouble with this function, because the files are huge (500k lines) and the format is too special (no space between numbers…) to apply optimized methods such as pandas’s. Reading line by line, using struct to parse each line, isn’t so bad. However, we waste time determining what every line is. I ended up using numpy’s fromfile function, no longer treating \n (line return) as a special character, followed by a second call to numpy to cast to the correct format. That ended up being twice as fast.
initial: 20s / loop
with mmap: worse
w/o readline().rstrip('\n'): still 20s
numpy fromfile: 17s
no more readline, 2x fromfile: 9s
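The two-step approach described above (one fromfile call to read raw fixed-width records, one cast to numeric types) can be sketched as follows. This is a minimal illustration, not the actual cdsd2df implementation: the three-column layout, the field names and the sample values are assumptions made up for the example and do not match the real CDSD column widths.

```python
import os
import tempfile

import numpy as np

# Hypothetical fixed-width layout, for illustration only (not the CDSD one):
# 2-char id, 12-char wavenumber, 10-char intensity, 1-char line return.
rows = [(2, 2000.123456, 1.234e-25), (2, 2000.654321, 5.678e-26)]
text = "".join(f"{i:2d}{w:12.6f}{s:10.3E}\n" for i, w, s in rows)

# Step 1: read every record as raw bytes. The trailing "\n" is treated as
# one more 1-byte field, so no per-line readline()/rstrip() is needed.
raw_dtype = np.dtype(
    [("id", "S2"), ("wav", "S12"), ("int", "S10"), ("eol", "S1")]
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "lines.txt")
    with open(path, "wb") as f:  # binary mode keeps "\n" as a single byte
        f.write(text.encode("ascii"))
    raw = np.fromfile(path, dtype=raw_dtype)

# Step 2: cast each byte-string column to its numeric type in one call.
ids = raw["id"].astype(np.int64)
wav = raw["wav"].astype(np.float64)
intensity = raw["int"].astype(np.float64)
```

This only works when every record has exactly the same byte length, which is why the line return has to be counted as a field of its own.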
Think about using cache mode too:
no cache mode: 9s
cache mode, first time: 22s
cache mode, subsequent times: 2s
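The timings above follow from the cache semantics: the first call pays for parsing plus writing the cache, later calls only read it back. A minimal sketch of that True / False / 'regen' logic is given below. It is not the RADIS implementation: the helper load_with_cache, the ".cache.pkl" naming scheme and the use of pickle instead of HDF5 are all assumptions chosen to keep the example dependency-free.

```python
import os
import pickle
import tempfile


def load_with_cache(fname, parse, cache=True):
    """Return parse(fname), reusing a side-by-side cache file when allowed.

    cache=True    -> use the cache if present, else parse and write it
    cache=False   -> always parse, never read or write the cache
    cache='regen' -> delete any existing cache, then parse and rewrite it
    """
    cache_file = fname + ".cache.pkl"  # hypothetical naming scheme
    if cache == "regen" and os.path.exists(cache_file):
        os.remove(cache_file)
    if cache and os.path.exists(cache_file):
        with open(cache_file, "rb") as f:
            return pickle.load(f)
    data = parse(fname)
    if cache:  # True or 'regen': store the parsed result for next time
        with open(cache_file, "wb") as f:
            pickle.dump(data, f)
    return data


calls = []  # records every real parse, to show when the cache is hit


def slow_parse(path):
    calls.append(path)
    with open(path) as f:
        return f.read().split()


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "lines.txt")
    with open(path, "w") as f:
        f.write("a b c")
    first = load_with_cache(path, slow_parse)  # parses, writes the cache
    second = load_with_cache(path, slow_parse)  # served from the cache
    third = load_with_cache(path, slow_parse, cache="regen")  # re-parses
```

As in the parameter description, a cache hit never looks at the source file again, so edits to the source file go unnoticed until the cache is regenerated.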
Moving to HDF5:
On cdsd_02069_02070 (56 MB)
Reading:
cdsd2df(): 9.29 s
cdsd2df(cache=True [old .txt version]): 2.3 s
cdsd2df(cache=True [new h5 version, table]): 910 ms
cdsd2df(cache=True [new h5 version, fixed]): 125 ms
Storage:
%timeit df.to_hdf("cdsd_02069_02070.h5", "df", format="fixed"): 337 ms
%timeit df.to_hdf("cdsd_02069_02070.h5", "df", format="table"): 1.03 s
References
Note that CDSD-HITEMP is used as the line database for CO2 in HITEMP 2010
See also
hit2df()