owid.datautils.dataframes

Objects related to pandas dataframes.

exception datautils.dataframes.DataFramesHaveDifferentLengths(exception_message: Optional[str] = None, *args: Any)

Bases: owid.datautils.common.ExceptionFromDocstring

Dataframes cannot be compared because they have different numbers of rows.

exception datautils.dataframes.ObjectsAreNotDataframes(exception_message: Optional[str] = None, *args: Any)

Bases: owid.datautils.common.ExceptionFromDocstring

Given objects are not dataframes.

datautils.dataframes.apply_on_categoricals(cat_series: List[pandas.core.series.Series], func: Callable[[...], str]) → pandas.core.series.Series

Apply a function on a list of categorical series.

This is much faster than first converting the series to strings and then applying the function, and it avoids a blow-up in memory use. It works on category codes rather than on the values directly, and it builds the output categorical mapping from codes to strings on the fly.

Parameters

cat_series – List of series with category type.
func – Function taking as many arguments as there are categorical series and returning str.
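
The code-based approach described above can be sketched in plain pandas as follows. This is a minimal illustration, not the library's implementation: the name apply_on_codes_sketch and the sample series are hypothetical, and the explicit loop is kept for readability rather than speed. The point it shows is that func is called once per distinct combination of codes, not once per row.

```python
import numpy as np
import pandas as pd


def apply_on_codes_sketch(cat_series, func):
    """Hypothetical sketch: call func once per distinct combination of codes."""
    categories = [s.cat.categories for s in cat_series]
    combo_to_code = {}  # tuple of input codes -> output code
    label_to_code = {}  # output string -> output code, built on the fly
    out_codes = []
    for combo in zip(*(s.cat.codes for s in cat_series)):
        if combo not in combo_to_code:
            # A code of -1 marks a missing value in a pandas categorical.
            values = [cats[c] if c >= 0 else np.nan for cats, c in zip(categories, combo)]
            label = func(*values)
            combo_to_code[combo] = label_to_code.setdefault(label, len(label_to_code))
        out_codes.append(combo_to_code[combo])
    result = pd.Categorical.from_codes(out_codes, categories=list(label_to_code))
    return pd.Series(result, index=cat_series[0].index)


# Hypothetical sample data.
country = pd.Series(["FR", "DE", "FR", "FR"], dtype="category")
year = pd.Series(["2020", "2020", "2021", "2020"], dtype="category")
combined = apply_on_codes_sketch([country, year], lambda c, y: f"{c}-{y}")
```

Here the lambda runs three times for four rows, because the combination ("FR", "2020") appears twice.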

datautils.dataframes.are_equal(df1: pandas.core.frame.DataFrame, df2: pandas.core.frame.DataFrame, absolute_tolerance: float = 1e-08, relative_tolerance: float = 1e-08, verbose: bool = True) → Tuple[bool, pandas.core.frame.DataFrame]

Check whether two dataframes are equal.

It assumes that all nans are identical, and compares floats by means of certain absolute and relative tolerances.

Parameters

df1 (pd.DataFrame) – First dataframe.
df2 (pd.DataFrame) – Second dataframe.
absolute_tolerance (float) – Absolute tolerance to assume in the comparison of each cell in the dataframes. A value a of an element in df1 is considered equal to the corresponding element b at the same position in df2, if: abs(a - b) <= absolute_tolerance
relative_tolerance (float) – Relative tolerance to assume in the comparison of each cell in the dataframes. A value a of an element in df1 is considered equal to the corresponding element b at the same position in df2, if: abs(a - b) / abs(b) <= relative_tolerance
verbose (bool) – True to print a summary of the comparison of the two dataframes.

Returns

are_equal (bool) – True if the two dataframes are equal (given the conditions explained above).
compared (pd.DataFrame) – Dataframe with the same shape as df1 and df2 (if they have the same shape) that is True on each element where both dataframes have equal values. If dataframes have different shapes, compared will be empty.
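
To make the two tolerance conditions concrete, the sketch below evaluates each of them for a single pair of cells. The helper tolerance_checks is hypothetical and is not part of the library. The documentation above does not say how the two conditions are combined, so the sketch reports them separately instead of guessing.

```python
import math


def tolerance_checks(a, b, absolute_tolerance=1e-8, relative_tolerance=1e-8):
    """Hypothetical helper: evaluate the two documented conditions for one pair of cells."""
    if math.isnan(a) and math.isnan(b):
        # The documentation states that nans are treated as identical.
        return True, True
    within_absolute = abs(a - b) <= absolute_tolerance
    # Assumes b != 0; the relative condition is undefined otherwise.
    within_relative = abs(a - b) / abs(b) <= relative_tolerance
    return within_absolute, within_relative


# The values differ by about 1e-7: too much for the absolute tolerance,
# but only about 1e-9 relative to b.
checks = tolerance_checks(100.0, 100.0000001)
nan_checks = tolerance_checks(float("nan"), float("nan"))
```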

datautils.dataframes.compare(df1: pandas.core.frame.DataFrame, df2: pandas.core.frame.DataFrame, columns: Optional[List[str]] = None, absolute_tolerance: float = 1e-08, relative_tolerance: float = 1e-08) → pandas.core.frame.DataFrame

Compare two dataframes element by element to see if they are equal.

It assumes that nans are all identical, and allows for certain absolute and relative tolerances for the comparison of floats.

NOTE: Dataframes must have the same number of rows to be able to compare them.

Parameters

df1 (pd.DataFrame) – First dataframe.
df2 (pd.DataFrame) – Second dataframe.
columns (list or None) – List of columns to compare (they must exist in both dataframes). If None, common columns will be compared.
absolute_tolerance (float) – Absolute tolerance to assume in the comparison of each cell in the dataframes. A value a of an element in df1 is considered equal to the corresponding element b at the same position in df2, if: abs(a - b) <= absolute_tolerance
relative_tolerance (float) – Relative tolerance to assume in the comparison of each cell in the dataframes. A value a of an element in df1 is considered equal to the corresponding element b at the same position in df2, if: abs(a - b) / abs(b) <= relative_tolerance

Returns

compared – Dataframe of booleans, with as many rows as df1 and df2, and as many columns as specified by the columns argument (or as many columns as df1 and df2 have in common, if columns is None). The (i, j) element is True if df1 and df2 have the same value (for the given tolerances) at that position.

Return type

pd.DataFrame
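
An element-wise comparison of this kind can be sketched with numpy and pandas as shown below. The function compare_sketch and the sample dataframes are hypothetical. Combining the two tolerances through numpy.isclose (which tests abs(a - b) <= atol + rtol * abs(b)) is an assumption made for the sketch; the library may combine them differently.

```python
import numpy as np
import pandas as pd


def compare_sketch(df1, df2, columns=None, absolute_tolerance=1e-8, relative_tolerance=1e-8):
    """Hypothetical sketch of an element-wise comparison where nan equals nan."""
    if len(df1) != len(df2):
        raise ValueError("Dataframes must have the same number of rows.")
    if columns is None:
        columns = [col for col in df1.columns if col in df2.columns]
    compared = {}
    for col in columns:
        a, b = df1[col].to_numpy(), df2[col].to_numpy()
        if pd.api.types.is_numeric_dtype(df1[col]) and pd.api.types.is_numeric_dtype(df2[col]):
            # Assumption: tolerances are combined as in numpy.isclose.
            compared[col] = np.isclose(
                a, b, rtol=relative_tolerance, atol=absolute_tolerance, equal_nan=True
            )
        else:
            # Non-numeric columns: exact equality, with missing values matching each other.
            compared[col] = (a == b) | (pd.isnull(a) & pd.isnull(b))
    return pd.DataFrame(compared)


# Hypothetical sample data.
df1 = pd.DataFrame({"x": [1.0, np.nan, 3.0], "label": ["a", "b", "c"]})
df2 = pd.DataFrame({"x": [1.0, np.nan, 3.5], "label": ["a", "B", "c"]})
compared = compare_sketch(df1, df2)
```

In this example the two nans in column x compare as equal, while 3.0 versus 3.5 and "b" versus "B" do not.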

datautils.dataframes.concatenate(dfs: List[pandas.core.frame.DataFrame], **kwargs: Any) → pandas.core.frame.DataFrame

Concatenate while preserving categorical columns.

Original source code from https://stackoverflow.com/a/57809778/1275818.
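
For context, pandas.concat drops the categorical dtype when the inputs have different sets of categories. One way to avoid that, sketched below, is to give every input the union of all categories before concatenating. The function concat_keep_categoricals and the sample data are hypothetical; this is an illustration of the idea using pandas.api.types.union_categoricals, not a copy of the library's code.

```python
import pandas as pd
from pandas.api.types import union_categoricals


def concat_keep_categoricals(dfs):
    """Hypothetical sketch: align categories across inputs, then concatenate."""
    dfs = [df.copy() for df in dfs]
    common_columns = set.intersection(*(set(df.columns) for df in dfs))
    for col in common_columns:
        if all(isinstance(df[col].dtype, pd.CategoricalDtype) for df in dfs):
            # Build the union of categories and apply it to every input.
            unified = union_categoricals([df[col] for df in dfs]).categories
            for df in dfs:
                df[col] = pd.Categorical(df[col], categories=unified)
    return pd.concat(dfs, ignore_index=True)


# Hypothetical sample data with different categories in each frame.
a = pd.DataFrame({"c": pd.Categorical(["x", "y"])})
b = pd.DataFrame({"c": pd.Categorical(["y", "z"])})
plain = pd.concat([a, b], ignore_index=True)
kept = concat_keep_categoricals([a, b])
```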

datautils.dataframes.count_missing_in_groups(df: pandas.core.frame.DataFrame, groupby_columns: List[str], **kwargs: Any) → pandas.core.frame.DataFrame

Count the number of missing values in each group.

Faster version of:

>>> num_nans_detected = df.groupby(groupby_columns, **kwargs).agg(
...     lambda x: pd.isnull(x).sum()
... )
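
A common way to speed up this kind of count is to compute the missing-value mask once for the whole dataframe and then use the built-in groupby sum, which avoids calling a Python lambda per group. The sketch below shows that idea; count_missing_sketch and the sample data are hypothetical, and whether the library uses exactly this approach is an assumption.

```python
import numpy as np
import pandas as pd


def count_missing_sketch(df, groupby_columns, **kwargs):
    """Hypothetical sketch: vectorised count of missing values per group."""
    value_columns = [col for col in df.columns if col not in groupby_columns]
    # Boolean mask computed once for all value columns.
    is_missing = df[value_columns].isnull()
    keys = [df[col] for col in groupby_columns]
    # Summing booleans counts the True entries in each group.
    return is_missing.groupby(keys, **kwargs).sum()


# Hypothetical sample data.
df = pd.DataFrame(
    {
        "country": ["A", "A", "B", "B"],
        "x": [1.0, np.nan, np.nan, np.nan],
        "y": [np.nan, 2.0, 3.0, 4.0],
    }
)
missing = count_missing_sketch(df, ["country"])
```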

datautils.dataframes.groupby_agg(df: pandas.core.frame.DataFrame, groupby_columns: Union[List[str], str], aggregations: Optional[Dict[str, Any]] = None, num_allowed_nans: Optional[int] = 0, frac_allowed_nans: Optional[float] = None) → pandas.core.frame.DataFrame

Group a dataframe by certain columns, aggregate using a certain method, and decide how to handle nans.

This function is similar to the usual df.groupby(groupby_columns).agg(aggregations). However, pandas ignores nans in aggregations by default. This implies, for example, that df.groupby(groupby_columns).sum() will treat nans as zeros, which can be misleading.

When both num_allowed_nans and frac_allowed_nans are None, this function behaves like the default pandas behaviour (and nans will be treated as zeros).

On the other hand, if num_allowed_nans is not None, then a group will be nan if the number of nans in that group is larger than num_allowed_nans, otherwise nans will be treated as zeros.

Similarly, if frac_allowed_nans is not None, then a group will be nan if the fraction of nans in that group is larger than frac_allowed_nans, otherwise nans will be treated as zeros.

If both num_allowed_nans and frac_allowed_nans are not None, both conditions are applied. This means that each group must have a number of nans <= num_allowed_nans and a fraction of nans <= frac_allowed_nans; otherwise that group will be nan.

Note: This function won’t work when using multiple aggregations for the same column (e.g. {'a': ('sum', 'mean')}).

Parameters

df (pd.DataFrame) – Original dataframe.
groupby_columns (list or str) – List of columns to group by. It can be given as a string, if it is only one column.
aggregations (dict or None) – Aggregations to apply to each column in df. If None, 'sum' will be applied to all columns.
num_allowed_nans (int or None) – Maximum number of nans that are allowed in a group.
frac_allowed_nans (float or None) – Maximum fraction of nans that are allowed in a group.

Returns

grouped – Grouped dataframe after applying aggregations.

Return type

pd.DataFrame
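
The num_allowed_nans rule can be illustrated for the 'sum' case with a short pandas sketch. The function groupby_sum_with_nan_limit and the sample data are hypothetical; the sketch covers only the sum aggregation and the count-based rule, not frac_allowed_nans or custom aggregations.

```python
import numpy as np
import pandas as pd


def groupby_sum_with_nan_limit(df, groupby_columns, num_allowed_nans=0):
    """Hypothetical sketch: sum per group, blanking groups with too many nans."""
    # Default pandas behaviour: nans are skipped, which acts like adding zero.
    summed = df.groupby(groupby_columns).sum()
    # Number of nans per group and per column.
    keys = [df[col] for col in groupby_columns]
    num_nans = df.drop(columns=groupby_columns).isnull().groupby(keys).sum()
    # Keep the sum only where the group has at most num_allowed_nans nans.
    return summed.where(num_nans <= num_allowed_nans)


# Hypothetical sample data: group A has one nan, group B has none.
df = pd.DataFrame({"country": ["A", "A", "B", "B"], "x": [1.0, np.nan, 2.0, 3.0]})
strict = groupby_sum_with_nan_limit(df, ["country"], num_allowed_nans=0)
lenient = groupby_sum_with_nan_limit(df, ["country"], num_allowed_nans=1)
```

With num_allowed_nans=0 the sum for group A becomes nan; with num_allowed_nans=1 the single nan is tolerated and the sum for A is 1.0.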

datautils.dataframes.map_series(series: pandas.core.series.Series, mapping: Dict[Any, Any], make_unmapped_values_nan: bool = False, warn_on_missing_mappings: bool = False, warn_on_unused_mappings: bool = False, show_full_warning: bool = False) → pandas.core.series.Series

Map values of a series given a certain mapping.

This function does almost the same as series.map(mapping). However, map() translates values that are not in the mapping into nan, whereas this function can optionally keep the original values.

This function should do the same as series.replace(mapping), but .replace() becomes very slow on big dataframes.

Parameters

series (pd.Series) – Original series to be mapped.
mapping (dict) – Mapping.
make_unmapped_values_nan (bool) – If true, values in the series that are not in the mapping will be translated into nan; otherwise, they will keep their original values.
warn_on_missing_mappings (bool) – True to warn if elements in series are missing in mapping.
warn_on_unused_mappings (bool) – True to warn if the mapping contains values that are not present in the series. False to ignore.
show_full_warning (bool) – True to print the entire list of unused mappings (only relevant if warn_on_unused_mappings is True).

Returns

series_mapped – Mapped series.

Return type

pd.Series
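
The difference from a plain series.map(mapping) can be seen in a small pandas sketch. The function map_series_sketch and the sample data are hypothetical, and the sketch leaves out the warning options; it only illustrates the effect of make_unmapped_values_nan.

```python
import pandas as pd


def map_series_sketch(series, mapping, make_unmapped_values_nan=False):
    """Hypothetical sketch: map values, optionally keeping unmapped originals."""
    mapped = series.map(mapping)
    if not make_unmapped_values_nan:
        # Restore the original value wherever the key is absent from the mapping.
        has_mapping = series.isin(list(mapping))
        mapped = mapped.where(has_mapping, series)
    return mapped


# Hypothetical sample data: "France" has no entry in the mapping.
series = pd.Series(["USA", "UK", "France"])
mapping = {"USA": "United States", "UK": "United Kingdom"}
kept = map_series_sketch(series, mapping)
strict = map_series_sketch(series, mapping, make_unmapped_values_nan=True)
```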

datautils.dataframes.multi_merge(dfs: List[pandas.core.frame.DataFrame], on: Union[List[str], str], how: str = 'inner') → pandas.core.frame.DataFrame

Merge multiple dataframes.

This is a helper function for merging more than two dataframes on common columns.

Parameters

dfs (list) – Dataframes to be merged.
on (list or str) – Column or list of columns on which to merge. These columns must have the same name in all dataframes.
how (str) – Method to use for merging (with the same options available in pd.merge).

Returns

merged – Input dataframes merged.

Return type

pd.DataFrame
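
Merging a list of dataframes on shared columns amounts to folding pd.merge over the list. The sketch below shows that pattern with functools.reduce; multi_merge_sketch and the sample dataframes are hypothetical, and the library's own implementation may differ in detail.

```python
from functools import reduce

import pandas as pd


def multi_merge_sketch(dfs, on, how="inner"):
    """Hypothetical sketch: merge the dataframes pairwise, left to right."""
    return reduce(lambda left, right: pd.merge(left, right, on=on, how=how), dfs)


# Hypothetical sample data sharing the "country" column.
a = pd.DataFrame({"country": ["A", "B", "C"], "x": [1, 2, 3]})
b = pd.DataFrame({"country": ["A", "B"], "y": [10, 20]})
c = pd.DataFrame({"country": ["B", "C"], "z": [100, 200]})
inner = multi_merge_sketch([a, b, c], on="country")
outer = multi_merge_sketch([a, b, c], on="country", how="outer")
```

With how="inner" only country B survives, because it is the only key present in all three inputs; with how="outer" all three countries are kept and the gaps are filled with nan.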