radis.api.cdsdapi module

Parser for CDSD-HITEMP, CDSD-4000 format.

Routine Listing

cdsd2df()

References

CDSD-4000 manual

cdsd2df(fname, version='hitemp', cache=True, load_columns=None, verbose=True, drop_non_numeric=True, load_wavenum_min=None, load_wavenum_max=None, engine='pytables', output='pandas')

Convert a CDSD-HITEMP [1] or CDSD-4000 [2] file to a pandas DataFrame.

Parameters:

fname (str) – CDSD file name
version (str (‘4000’, ‘hitemp’)) – CDSD version
cache (boolean, or ‘regen’) – if True, a pandas-readable HDF5 file is generated on first access and reused afterwards. This skips the datatype cast and conversion and improves performance considerably (but later changes in the database are not taken into account). If False, no cache file is used. If ‘regen’, existing cache files are regenerated. Default True.
load_columns (list) – columns to load. If None, loads everything

Note

load_columns is only relevant when loading from a cache file. To generate the cache file, all columns are loaded anyway.

Other Parameters:

drop_non_numeric (boolean) – if True, non-numeric columns are dropped. This improves performance, but make sure all the columns you need are converted to numeric formats beforehand. Default True. Note that if a cache file is loaded, it is left untouched.
load_wavenum_min, load_wavenum_max (float) – if not None, only load the cached file if it contains data for wavenumbers above/below the specified value. See radis.api.cache_files.load_h5_cache_file(). Default None.
engine (‘pytables’, ‘vaex’) – format of the HDF5 cache file. Default ‘pytables’.

Returns:

df – dataframe containing all lines and parameters

Return type:

pandas DataFrame or Vaex DataFrame
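A minimal usage sketch, assuming RADIS is installed and a CDSD file is available locally. The wrapper name load_cdsd_lines is hypothetical and not part of RADIS; only the cdsd2df call and its keyword arguments come from the signature above.

```python
def load_cdsd_lines(fname, version="hitemp", load_columns=None):
    """Hypothetical convenience wrapper around cdsd2df (not part of RADIS)."""
    # Imported lazily so this sketch can be defined without RADIS installed.
    from radis.api.cdsdapi import cdsd2df

    return cdsd2df(
        fname,
        version=version,            # '4000' or 'hitemp'
        cache=True,                 # write the HDF5 cache on first access, reuse it later
        load_columns=load_columns,  # None loads every column
        verbose=False,
    )


# Example call; the file name is the one used in the timings on this page
# and must exist locally:
# df = load_cdsd_lines("cdsd_02069_02070", version="hitemp")
```

Because load_columns only matters when a cache file already exists, the first call on a new file parses all columns regardless of the list passed in.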

Notes

The CDSD-4000 database can be downloaded from [3].

Performance: I had huge performance trouble with this function, because the files are huge (500k lines) and the format is too special (no space between numbers…) to apply optimized methods such as pandas’s. Reading line by line, using struct to parse each line, isn’t so bad. However, we waste time determining what every line is. I ended up using numpy’s fromfile function, no longer treating \n (line return) as a special character, followed by a second call to numpy to cast to the correct format. That ended up being twice as fast.

initial: 20s / loop

with mmap: worse

w/o readline().rstrip('\n'): still 20s

numpy fromfile: 17s

no more readline, 2x fromfile: 9s
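The two-pass numpy approach described above can be sketched as follows. This is an illustration under stated assumptions, not the actual RADIS implementation: the three-column layout, the field widths, and the names RAW_DTYPE and parse_fixed_width are all hypothetical, and the real CDSD format has many more columns.

```python
import os
import tempfile

import numpy as np

# Hypothetical fixed-width layout (NOT the real CDSD column layout):
# 2-char id, 12-char wavenumber, 10-char intensity, then the line return.
RAW_DTYPE = np.dtype(
    [
        ("id", "S2"),
        ("wav", "S12"),
        ("int", "S10"),
        ("newline", "S1"),  # "\n" is read as an ordinary 1-byte field
    ]
)


def parse_fixed_width(path):
    """Parse a fixed-width file in two numpy passes, without readline()."""
    # Pass 1: read the file as equal-sized records of raw bytes.
    raw = np.fromfile(path, dtype=RAW_DTYPE)
    # Pass 2: cast each byte-string column to its numeric type.
    return {
        "id": raw["id"].astype(np.int64),
        "wav": raw["wav"].astype(np.float64),
        "int": raw["int"].astype(np.float64),
    }


# Build a tiny demo file: zero-padded fields with no separator between them.
rows = [(2, 667.123456, 1.5e-25), (2, 2349.654321, 3.25e-21)]
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as f:
    for mol_id, wav, intensity in rows:
        f.write(f"{mol_id:02d}{wav:012.6f}{intensity:010.3E}\n".encode("ascii"))
    demo_path = f.name

columns = parse_fixed_width(demo_path)
os.remove(demo_path)
```

The approach relies on every line having exactly the same byte length, so that the line return always falls in its own field; a file with ragged lines would be misaligned.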

Think about using cache mode too:

no cache mode: 9s

cache mode, first time: 22s

cache mode, then: 2s
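The cache behaviour (True, False, ‘regen’) can be illustrated with a small sketch. It swaps in a numpy .npy file for the HDF5 cache to stay dependency-free, so it shows the control flow only, not the RADIS cache format; load_with_cache and slow_parse are hypothetical names.

```python
import os
import shutil
import tempfile

import numpy as np


def load_with_cache(path, parse, cache=True):
    """Return parse(path), reusing a .npy cache file stored next to `path`.

    cache=True: load the cache if present, else parse and write it.
    cache=False: always parse and never touch the cache.
    cache='regen': delete any existing cache, then parse and rewrite it.
    """
    cache_path = path + ".cache.npy"
    if cache == "regen" and os.path.exists(cache_path):
        os.remove(cache_path)
    if cache and os.path.exists(cache_path):
        return np.load(cache_path)
    data = parse(path)
    if cache:
        np.save(cache_path, data)
    return data


calls = []  # records every real parse, to show when the cache is hit


def slow_parse(path):
    calls.append(path)
    return np.loadtxt(path)


workdir = tempfile.mkdtemp()
demo_path = os.path.join(workdir, "lines.txt")
with open(demo_path, "w") as f:
    f.write("1.0 2.0\n3.0 4.0\n")

first = load_with_cache(demo_path, slow_parse)   # parses and writes the cache
second = load_with_cache(demo_path, slow_parse)  # served from the cache
n_after_cached = len(calls)
third = load_with_cache(demo_path, slow_parse, cache="regen")  # parses again
n_after_regen = len(calls)
shutil.rmtree(workdir)
```

Like the cache described above, this sketch does not detect later changes to the source file; a stale cache is only refreshed by passing 'regen'.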

Moving to HDF5:

On cdsd_02069_02070 (56 MB)

Reading:

cdsd2df(): 9.29 s
cdsd2df(cache=True [old .txt version]): 2.3s
cdsd2df(cache=True [new h5 version, table]): 910ms
cdsd2df(cache=True [new h5 version, fixed]): 125ms

Storage:

%timeit df.to_hdf("cdsd_02069_02070.h5", "df", format="fixed")  337ms
%timeit df.to_hdf("cdsd_02069_02070.h5", "df", format="table")  1.03s

References

Note that CDSD-HITEMP is used as the line database for CO2 in HITEMP 2010
