radis.api.cdsdapi module

Parser for CDSD-HITEMP, CDSD-4000 format.

Routine Listing

References

CDSD-4000 manual


cdsd2df(fname, version='hitemp', cache=True, load_columns=None, verbose=True, drop_non_numeric=True, load_wavenum_min=None, load_wavenum_max=None, engine='pytables', output='pandas')

Convert a CDSD-HITEMP [1] or CDSD-4000 [2] file to a Pandas dataframe.

Parameters:
  • fname (str) – CDSD file name

  • version (str (‘4000’, ‘hitemp’)) – CDSD version

  • cache (boolean, or ‘regen’) – if True, a pandas-readable HDF5 file is generated on first access and reused afterwards. This avoids repeating the datatype casts and conversions and improves performance considerably, but later changes to the database file are not taken into account. If False, no cache file is used. If ‘regen’, the cache file is regenerated. Default True.

  • load_columns (list) – columns to load. If None, loads everything

    Note

    this is only relevant if loading from a cache file. To generate the cache file, all columns are loaded anyway.

Other Parameters:
  • drop_non_numeric (boolean) – if True, non-numeric columns are dropped. This improves performance, but make sure all the columns you need are converted to numeric formats beforehand. Default True. Note that a cache file, if loaded, is left untouched.

  • load_wavenum_min, load_wavenum_max (float) – if not None, only load the cached file if it contains data for wavenumbers above/below the specified value. See load_h5_cache_file(). Default None.

  • engine (‘pytables’, ‘vaex’) – format of the HDF5 cache file. Default ‘pytables’.
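The wavenumber window described for load_wavenum_min and load_wavenum_max can be pictured as a simple range-overlap test. The sketch below is only an illustration of that idea: the function name cache_overlaps and its file_wavenum_min / file_wavenum_max arguments are hypothetical, and it is not the logic of load_h5_cache_file() itself.

```python
def cache_overlaps(file_wavenum_min, file_wavenum_max,
                   load_wavenum_min=None, load_wavenum_max=None):
    """Return True if a cached range may hold lines in the requested window.

    Hypothetical helper for illustration; not part of RADIS.
    """
    # A bound left to None places no constraint on that side.
    if load_wavenum_min is not None and file_wavenum_max < load_wavenum_min:
        return False  # every cached line lies below the requested minimum
    if load_wavenum_max is not None and file_wavenum_min > load_wavenum_max:
        return False  # every cached line lies above the requested maximum
    return True


# A cache covering 2000-2100 cm-1 is skipped for a window starting at 2200 cm-1.
print(cache_overlaps(2000.0, 2100.0, load_wavenum_min=2200.0))
```

With both bounds left at None the check always passes, which matches the documented default of loading everything.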

Returns:

df – dataframe containing all lines and parameters

Return type:

pandas DataFrame or Vaex DataFrame

Notes

The CDSD-4000 database can be downloaded from [3].

Performance: I had huge performance trouble with this function, because the files are huge (500k lines) and the format is too special (no space between numbers…) to apply optimized methods such as pandas’s. Reading line by line, using struct to parse each line, isn’t so bad. However, we waste time determining what every line is. I ended up using numpy’s fromfile function, no longer treating \n (line return) as a special character, followed by a second call to numpy to cast the columns to the correct format. That ended up being about twice as fast.

  • initial: 20s / loop

  • with mmap: worse

  • w/o readline().rstrip('\n'): still 20s

  • numpy fromfile: 17s

  • no more readline, 2x fromfile: 9s
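The two-pass idea behind the last timing can be sketched as follows. This is a minimal illustration, assuming a made-up record of two 10-character fields plus a line return; the field names, widths and sample values are hypothetical and are not the real CDSD column layout, nor the code used by cdsd2df().

```python
import os
import tempfile

import numpy as np

# Hypothetical record layout for illustration (NOT the real CDSD columns):
# 10 chars wavenumber, 10 chars intensity, 1 char line return, no separators.
RAW_DTYPE = np.dtype([("wav", "S10"), ("int", "S10"), ("eol", "S1")])


def parse_fixed_width(path):
    """Read a fixed-width text file in two numpy passes."""
    # Pass 1: view the file as fixed-size byte records. The line return is
    # simply one more 1-byte field, so no per-line splitting is needed.
    raw = np.fromfile(path, dtype=RAW_DTYPE)
    # Pass 2: cast each byte-string column to float in one vectorized call.
    return {
        "wav": raw["wav"].astype(np.float64),
        "int": raw["int"].astype(np.float64),
    }


# Demo on two made-up lines written to a temporary file.
sample = b"2000.123451.2345E-25\n2001.500009.8765E-27\n"
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(sample)
cols = parse_fixed_width(tmp.name)
os.remove(tmp.name)
print(cols["wav"], cols["int"])
```

Because every record has the same byte length, numpy never has to search for line boundaries, which is where the line-by-line versions lose their time.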

Think about using cache mode too:

  • no cache mode: 9s

  • cache mode, first time: 22s

  • cache mode, then: 2s
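The pattern behind these numbers (a slower first call that writes the cache, then fast reloads) can be sketched with a toy loader. It is a simplified illustration only: load_lines, the .cache.pkl suffix and the use of pickle are assumptions made for the example, whereas cdsd2df() stores its cache as HDF5.

```python
import os
import pickle
import tempfile


def load_lines(fname, cache=True):
    """Toy loader: parse `fname`, optionally caching the result next to it."""
    cache_file = fname + ".cache.pkl"  # hypothetical cache naming
    if cache == "regen" and os.path.exists(cache_file):
        os.remove(cache_file)  # force the cache to be rebuilt
    if cache and os.path.exists(cache_file):
        with open(cache_file, "rb") as f:
            return pickle.load(f)  # fast path: no parsing at all
    with open(fname) as f:
        data = [float(line) for line in f if line.strip()]  # slow path
    if cache:
        with open(cache_file, "wb") as f:
            pickle.dump(data, f)  # extra cost paid once, on first access
    return data


with tempfile.TemporaryDirectory() as tmpdir:
    fname = os.path.join(tmpdir, "lines.txt")
    with open(fname, "w") as f:
        f.write("1.0\n2.0\n")
    first = load_lines(fname, cache=True)     # parses, then writes the cache
    with open(fname, "w") as f:
        f.write("3.0\n")                      # the source file changes
    stale = load_lines(fname, cache=True)     # cache hit: change is not seen
    fresh = load_lines(fname, cache="regen")  # cache rebuilt from the new file
```

The stale result shows the caveat noted for the cache parameter: once a cache exists, edits to the source file are ignored until the cache is regenerated.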

Moving to HDF5:

On cdsd_02069_02070 (56 MB)

Reading:

cdsd2df(): 9.29 s
cdsd2df(cache=True [old .txt version]): 2.3s
cdsd2df(cache=True [new h5 version, table]): 910ms
cdsd2df(cache=True [new h5 version, fixed]): 125ms

Storage:

%timeit df.to_hdf("cdsd_02069_02070.h5", "df", format="fixed")  337ms
%timeit df.to_hdf("cdsd_02069_02070.h5", "df", format="table")  1.03s

References

Note that CDSD-HITEMP is used as the line database for CO2 in HITEMP 2010

See also

hit2df()