radis.io package

Submodules

Module contents

Parsers for various databases


cdsd2df(fname, version='hitemp', cache=True, verbose=True, drop_non_numeric=True, load_wavenum_min=None, load_wavenum_max=None)[source]

Convert a CDSD-HITEMP [1]_ or CDSD-4000 [2]_ file to a Pandas dataframe.

Parameters
  • fname (str) – CDSD file name

  • version (str ('4000', 'hitemp')) – CDSD version

  • cache (boolean, or 'regen') – if True, a pandas-readable HDF5 file is generated on first access, and later used. This saves on the datatype cast and conversion and improves performance a lot (but changes in the database are not taken into account). If False, no database is used. If 'regen', temp files are reconstructed. Default True.

Other Parameters
  • drop_non_numeric (boolean) – if True, non-numeric columns are dropped. This improves performance, but make sure all the columns you need are converted to numeric formats beforehand. Default True. Note that if a cache file is loaded it will be left untouched.

  • load_wavenum_min, load_wavenum_max (float) – if not 'None', only load the cached file if it contains data for wavenumbers above/below the specified value. See :py:func:`~radis.io.cache_files.load_h5_cache_file`. Default 'None'.

Returns

df – dataframe containing all lines and parameters

Return type

pandas DataFrame
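
A minimal usage sketch (hedged: the package-level import is assumed from this page, and the file name is a placeholder for a local CDSD chunk):

from radis.io import cdsd2df  # import path assumed from this module page
# Parse a local CDSD-HITEMP chunk (placeholder file name). An HDF5 cache file
# is generated on first access and reused on later calls:
df = cdsd2df("cdsd_02069_02070", version="hitemp", cache=True)
print(len(df), "lines loaded")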

Notes

The CDSD-4000 database can be downloaded from [3]_.

Performance: I had huge performance trouble with this function, because the files are huge (500k lines) and the format is too specific (no space between numbers…) to apply optimized methods such as pandas's. A line-by-line reading isn't so bad, using struct to parse each line; however, time is wasted determining what every line is. I ended up using numpy's fromfile function, no longer treating '\n' (line return) as a special character, and a second call to numpy to cast to the correct format (a minimal sketch of this two-pass approach follows the timing list below). That ended up being twice as fast.

  • initial: 20s / loop

  • with mmap: worse

  • w/o readline().rstrip('\n'): still 20s

  • numpy fromfile: 17s

  • no more readline, 2x fromfile: 9s
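
To illustrate the two-pass numpy approach mentioned above, here is a minimal sketch (the column widths and names are hypothetical placeholders, not the actual CDSD record layout):

import numpy as np

# First pass: read the file as fixed-width byte records. The last field absorbs
# the remaining columns plus the '\n', so no line-by-line parsing is needed.
# Widths below are placeholders; a real CDSD line has its own fixed layout.
raw_dtype = np.dtype([("wav", "S12"), ("int", "S10"), ("rest", "S98")])
raw = np.fromfile("cdsd_example.txt", dtype=raw_dtype)

# Second pass: cast the ASCII byte fields to numeric types, column by column.
wav = raw["wav"].astype(np.float64)
intensity = raw["int"].astype(np.float64)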

Think about using cache mode too:

  • no cache mode 9s

  • cache mode, first time 22s

  • cache mode, then 2s

Moving to HDF5:

On cdsd_02069_02070 (56 Mb)

Reading:

cdsd2df(): 9.29 s
cdsd2df(cache=True [old .txt version]): 2.3s
cdsd2df(cache=True [new h5 version, table]): 910ms
cdsd2df(cache=True [new h5 version, fixed]): 125ms

Storage:

%timeit df.to_hdf("cdsd_02069_02070.h5", "df", format="fixed")  337ms
%timeit df.to_hdf("cdsd_02069_02070.h5", "df", format="table")  1.03s

References

Note that CDSD-HITEMP is used as the line database for CO2 in HITEMP 2010

[1] HITEMP 2010, Rothman et al., 2010

[2] CDSD-4000 article, Tashkun et al., 2011

[3] CDSD-4000 database

See also

hit2df()

fetch_astroquery(molecule, isotope, wmin, wmax, verbose=True, cache=True, expected_metadata={'isotope': 3, 'molecule': 'CO', 'wmax': 2300.0, 'wmin': 1900.0})[source]

Download a HITRAN line database to a Pandas DataFrame.

Wrapper to the fetch function of Astroquery [1]_ (itself based on [HAPI])

Note

If you use this function, cite [HAPI] and [HITRAN-2016].

Parameters
  • molecule (str, or int) – molecule name or identifier

  • isotope (int) – isotope number

  • wmin, wmax (float (cm-1)) – wavenumber min and max

Other Parameters
  • verbose (boolean) – Default True

  • cache (boolean or 'regen') – if True, tries to find a .h5 cache file in the Astroquery cache_location that matches the requirements. If not found, the data is downloaded and the line dataframe is saved as a .h5 file in the Astroquery cache_location. If 'regen', the existing cache file is deleted and regenerated.

  • expected_metadata (dict) – if cache=True, check that the metadata in the cache file correspond to these attributes. Arguments molecule, isotope, wmin, wmax are already added by default.
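
For example (a hedged sketch: the package-level import is assumed from this page, and the query values are arbitrary):

from radis.io import fetch_astroquery  # import path assumed from this module page
# Download CO (isotope 1) lines between 1900 and 2300 cm-1 from HITRAN:
df = fetch_astroquery("CO", isotope=1, wmin=1900.0, wmax=2300.0)
print(len(df), "lines downloaded")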

References

[1] Astroquery

See also

astroquery.hitran.core.Hitran.query_lines_async(), astroquery.query.BaseQuery.cache_location

fetch_hitemp(molecule, local_databases='~/.radisdb/hitemp/', databank_name='HITEMP-{molecule}', isotope=None, load_wavenum_min=None, load_wavenum_max=None, columns=None, cache=True, verbose=True, chunksize=100000, clean_cache_files=True, return_local_path=False, engine='pytables', parallel=True)[source]

Stream a HITEMP file from the HITRAN website. Unzip and build an HDF5 file directly.

Returns a Pandas DataFrame containing all lines.

Parameters
  • molecule ("H2O", "CO2", "N2O", "CO", "CH4", "NO", "NO2", "OH") – HITEMP molecule. See https://hitran.org/hitemp/

  • local_databases (str) – where to create the RADIS HDF5 files. Default "~/.radisdb/hitemp/"

  • databank_name (str) – name of the databank in the RADIS configuration file. Default "HITEMP-{molecule}".

  • isotope (str) – load only certain isotopes: '2', '1,2', etc. If None, loads everything. Default None.

  • load_wavenum_min, load_wavenum_max (float (cm-1)) – load only specific wavenumbers.

  • columns (list of str) – list of columns to load. If None, returns all columns in the file.

Other Parameters
  • cache (bool, or 'regen') – if True, use the existing HDF5 file. If False or 'regen', rebuild it. Default True.

  • verbose (bool)

  • chunksize (int) – number of lines to process at the same time. Higher is usually faster but can create memory problems and keep the user uninformed of the progress.

  • clean_cache_files (bool) – if True, clean downloaded cache files after the HDF5 files are created.

  • return_local_path (bool) – if True, also returns the path of the local database file.

  • engine (‘pytables’, ‘vaex’) – which HDF5 library to use.

  • parallel (bool) – if True, uses joblib.parallel to load the database with multiple processes

Returns

  • df (pd.DataFrame) – Line list. A HDF5 file is also created in local_databases and referenced in the RADIS config file under databank_name.

  • local_path (str) – path of the local database file, returned if return_local_path is True.

Examples

from radis import fetch_hitemp
df = fetch_hitemp("CO")
print(df.columns)
>>> Index(['id', 'iso', 'wav', 'int', 'A', 'airbrd', 'selbrd', 'El', 'Tdpair',
    'Pshft', 'ierr', 'iref', 'lmix', 'gp', 'gpp', 'Fu', 'branch', 'jl',
    'syml', 'Fl', 'vu', 'vl'],
    dtype='object')

Notes

If using load_wavenum_min/load_wavenum_max or isotope, the whole database is downloaded and uncompressed to fast-access .HDF5 files in local_databases anyway (which will take a long time on the first call). Only the expected wavenumber range & isotopes are returned. The .HDF5 parsing uses hdf2df().
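
As an illustration of this behaviour (a hedged sketch using only parameters documented above; the full database is still downloaded on the first call):

from radis import fetch_hitemp
# Return only isotopes 1 and 2 between 2200 and 2400 cm-1, and also get the
# path of the local HDF5 file:
df, path = fetch_hitemp(
    "CO2",
    isotope="1,2",
    load_wavenum_min=2200,
    load_wavenum_max=2400,
    return_local_path=True,
)
print(path)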

hit2df(fname, cache=True, verbose=True, drop_non_numeric=True, load_wavenum_min=None, load_wavenum_max=None)[source]

Convert a HITRAN/HITEMP [1]_ file to a Pandas dataframe

Parameters
  • fname (str) – HITRAN-HITEMP file name

  • cache (boolean, or 'regen' or 'force') – if True, a pandas-readable HDF5 file is generated on first access, and later used. This saves on the datatype cast and conversion and improves performance a lot (but changes in the database are not taken into account). If False, no database is used. If 'regen', temp files are reconstructed. Default True.

Other Parameters
  • drop_non_numeric (boolean) – if True, non-numeric columns are dropped. This improves performance, but make sure all the columns you need are converted to numeric formats beforehand. Default True. Note that if a cache file is loaded it will be left untouched.

  • load_wavenum_min, load_wavenum_max (float) – if not 'None', only load the cached file if it contains data for wavenumbers above/below the specified value. See :py:func:`~radis.io.cache_files.load_h5_cache_file`. Default 'None'.

Returns

df – dataframe containing all lines and parameters

Return type

pandas DataFrame
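
A minimal usage sketch (hedged: the package-level import is assumed from this page, and the file name is a placeholder for a local HITRAN .par file):

from radis.io import hit2df  # import path assumed from this module page
# Parse a local HITRAN file (placeholder name); an HDF5 cache file is generated
# on first access. load_wavenum_min/max make the cached file be reused only if
# it contains data for these wavenumbers:
df = hit2df("CO.par", cache=True, load_wavenum_min=2000, load_wavenum_max=2300)
print(df.head())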

References

[1] HITRAN 1996, Rothman et al., 1998

Notes

Performance: see the CDSD-HITEMP parser (cdsd2df()).

See also

cdsd2df()