Types in Tables
In ETL, we work with the Table object, which is derived from pandas.DataFrame and adapted to our needs. Setting the types of your columns is crucial for performance and memory optimization. In this guide, we cover the types we use in our ETL pipeline and how to set them.
In short, we use nullable data types, and therefore recommend the `Float64`, `Int64` and `string[pyarrow]` dtypes. We also avoid `np.nan` in favour of `pd.NA`.
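For instance, here is a minimal sketch (with made-up values) of what these nullable dtypes look like on a small table:

```python
import pandas as pd

# Made-up example; in ETL this would be a Table, which behaves like a DataFrame.
df = pd.DataFrame(
    {
        "country": ["France", "Chad", "Peru"],
        "population": [68_000_000, 18_000_000, pd.NA],
        "share": [0.84, pd.NA, 0.42],
    }
).astype(
    {
        "country": "string[pyarrow]",
        "population": "Int64",
        "share": "Float64",
    }
)

print(df.dtypes)  # -> nullable string, Int64 and Float64 dtypes
```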
Loading tables
The preferred way to load a table from a dataset is

tb = ds_meadow.read("my_table")

rather than indexing the dataset directly with `tb = ds_meadow["my_table"]`.
This process automatically converts all columns to the recommended types:
- `float64` or `float32` -> `Float64`
- `int32`, `uint32`, or `int64` -> `Int64`
- `category` -> `string[pyarrow]`
To disable this conversion, use `.read("my_table", safe_types=False)`. This is especially useful when working with large tables, where conversion to string types would significantly increase memory usage.
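As a rough sketch of how you might check the impact (assuming `ds_meadow` has already been loaded and contains `my_table`):

```python
# Default: columns are converted to the nullable / pyarrow-backed dtypes listed above.
tb = ds_meadow.read("my_table")

# Keep the original dtypes, e.g. for very large tables.
tb_raw = ds_meadow.read("my_table", safe_types=False)

# Table derives from pandas.DataFrame, so the usual memory inspection works.
print(tb.memory_usage(deep=True).sum())
print(tb_raw.memory_usage(deep=True).sum())
```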
Repacking datasets
We use a "repacking process" to reduce the size of the dataset before saving it to disk (`dataset.save()`). This process also converts the data to the recommended types, even if you have cast columns to old NumPy types (e.g. `.astype(float)`).
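In practice this means an old NumPy dtype will not survive a save; a sketch of the equivalent conversion in plain pandas:

```python
import pandas as pd

# A column cast to an old NumPy dtype somewhere in the step...
s = pd.Series([1.0, 2.5, None]).astype(float)
print(s.dtype)  # float64

# ...is effectively converted by repack to its nullable counterpart on save.
print(s.astype("Float64").dtype)  # Float64
```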
String dtypes
To convert a column to a string type use
# Option 1
tb["my_column"] = tb["my_column"].astype("string")
# Option 2
tb["my_column"] = tb["my_column"].astype("string[pyarrow]")
rather than the old `tb["my_column"] = tb["my_column"].astype(str)`. However, if you don't use one of the new methods, repack will handle this conversion when saving.
Difference between `string` and `string[pyarrow]`

Both types are very similar, but `string[pyarrow]` is more efficient and will be the default in Pandas 3.0. In practice, you won't notice a difference.
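A quick sketch (illustrative values only) showing that both behave the same for everyday string operations, with the pyarrow-backed variant typically using less memory:

```python
import pandas as pd

s_py = pd.Series(["apple", "banana", pd.NA], dtype="string")           # python-backed
s_pa = pd.Series(["apple", "banana", pd.NA], dtype="string[pyarrow]")  # pyarrow-backed

# Same string methods, same missing-value handling.
print(s_py.str.upper().tolist())
print(s_pa.str.upper().tolist())

# The pyarrow-backed series is usually smaller in memory.
print(s_py.memory_usage(deep=True), s_pa.memory_usage(deep=True))
```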
NaN values

Avoid using `np.nan`! Always use `pd.NA`.
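A minimal sketch of why this matters for the nullable dtypes above:

```python
import numpy as np
import pandas as pd

# np.nan silently forces a plain float64 column, losing the integer dtype.
print(pd.Series([1, 2, np.nan]).dtype)  # float64

# pd.NA works with the nullable types used throughout ETL.
print(pd.Series([1, 2, pd.NA], dtype="Int64").dtype)  # Int64

# Missing values are detected the usual way.
print(pd.isna(pd.NA))  # True
```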