owid.datautils.geo

Utils related to geographical entities.

exception datautils.geo.RegionNotFound(exception_message: Optional[str] = None, *args: Any)

Bases: owid.datautils.common.ExceptionFromDocstring

Region was not found in countries-regions dataset.

datautils.geo._load_countries_regions() → pandas.core.frame.DataFrame
datautils.geo._load_income_groups() → pandas.core.frame.DataFrame
datautils.geo._load_population() → pandas.core.frame.DataFrame
datautils.geo.add_population_to_dataframe(df: pandas.core.frame.DataFrame, country_col: str = 'country', year_col: str = 'year', population_col: str = 'population', warn_on_missing_countries: bool = True, show_full_warning: bool = True) → pandas.core.frame.DataFrame

Add column of population to a dataframe.

Parameters
  • df (pd.DataFrame) – Original dataframe that contains a column of country names and years.

  • country_col (str) – Name of column in original dataframe with country names.

  • year_col (str) – Name of column in original dataframe with years.

  • population_col (str) – Name of new column to be created with population values.

  • warn_on_missing_countries (bool) – True to warn about countries that appear in original dataframe but not in population dataset.

  • show_full_warning (bool) – True to display list of countries in warning messages.

Returns

df_with_population – Original dataframe after adding a column with population values.

Return type

pd.DataFrame
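The behaviour described above resembles a left join on country and year. The following is a minimal plain-Python sketch of that idea, not the library's implementation: the helper name `add_population_sketch`, the list-of-dicts input, and the population figures are all assumptions made for illustration, and it is assumed that rows without a match receive nan.

```python
import math
import warnings
from typing import Any, Dict, List, Tuple


def add_population_sketch(
    rows: List[Dict[str, Any]],
    population: Dict[Tuple[str, int], float],
    country_col: str = "country",
    year_col: str = "year",
    population_col: str = "population",
    warn_on_missing_countries: bool = True,
) -> List[Dict[str, Any]]:
    """Return copies of rows with a population value looked up by (country, year)."""
    # Countries that appear in the rows but nowhere in the population lookup.
    known_countries = {country for country, _ in population}
    missing = sorted({row[country_col] for row in rows} - known_countries)
    if warn_on_missing_countries and missing:
        warnings.warn(f"{len(missing)} countries not found in population data: {missing}")
    # Left join: every input row is kept; rows without a match get nan (assumption).
    return [
        {**row, population_col: population.get((row[country_col], row[year_col]), math.nan)}
        for row in rows
    ]


# Made-up numbers, for illustration only.
population = {("Spain", 2020): 100.0, ("France", 2020): 200.0}
rows = [
    {"country": "Spain", "year": 2020, "value": 1.5},
    {"country": "Atlantis", "year": 2020, "value": 2.5},
]
result = add_population_sketch(rows, population, warn_on_missing_countries=False)
```

In this sketch the input rows are left unmodified and the unmatched country keeps its row, with nan in the new column.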

datautils.geo.add_region_aggregates(df: pandas.core.frame.DataFrame, region: str, countries_in_region: Optional[List[str]] = None, countries_that_must_have_data: Optional[List[str]] = None, num_allowed_nans_per_year: Optional[int] = None, frac_allowed_nans_per_year: Optional[float] = 0.2, country_col: str = 'country', year_col: str = 'year', aggregations: Optional[Dict[str, Any]] = None, keep_original_region_with_suffix: Optional[str] = None) → pandas.core.frame.DataFrame

Add data for regions (e.g. income groups or continents) to a dataset.

If data for a region already exists in the dataset, it will be replaced.

When adding up the contribution from different countries (e.g. Spain, France, etc.) of a region (e.g. Europe), we want to avoid two problems:

  • Generating a series of nan, because one small country (with a negligible contribution) has nans.

  • Generating a series that underestimates the real one, because of treating missing values as zeros.

To avoid these problems, we first define a list of “big countries” that must be present in the data, in order to safely do the aggregation. If any of these countries is not present for a particular variable and year, the aggregation will be nan for that variable and year. Otherwise, if all big countries are present, any other missing country will be assumed to have zero contribution to the variable. For example, when aggregating the electricity demand of North America, United States and Mexico cannot be missing, because otherwise the aggregation would significantly underestimate the true electricity demand of North America.

Additionally, the aggregation of a particular variable for a particular year cannot have too many nans. If the number of nans exceeds num_allowed_nans_per_year, or if the fraction of nans exceeds frac_allowed_nans_per_year, the aggregation for that variable and year will be nan.
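The rules above can be sketched for a single variable and year as follows. This is a plain-Python illustration, not the library's code: the function name `aggregate_region_value` and the example numbers are hypothetical, and it assumes that the fraction of nans is measured against the full list of countries in the region, which the text does not state.

```python
import math
from typing import Dict, List, Optional


def aggregate_region_value(
    values_by_country: Dict[str, float],
    countries_in_region: List[str],
    countries_that_must_have_data: List[str],
    num_allowed_nans: Optional[int] = None,
    frac_allowed_nans: Optional[float] = 0.2,
) -> float:
    """Sum one variable for one year over a region, or return nan if data is too sparse."""

    def is_missing(country: str) -> bool:
        # A country is missing if it has no entry or its value is nan.
        value = values_by_country.get(country)
        return value is None or math.isnan(value)

    if not countries_in_region:
        return math.nan

    # Rule 1: every "big country" must have data.
    if any(is_missing(country) for country in countries_that_must_have_data):
        return math.nan

    # Rule 2: the number and the fraction of missing members must stay within the limits.
    num_nans = sum(is_missing(country) for country in countries_in_region)
    if num_allowed_nans is not None and num_nans > num_allowed_nans:
        return math.nan
    if frac_allowed_nans is not None and num_nans / len(countries_in_region) > frac_allowed_nans:
        return math.nan

    # Any remaining missing member is assumed to contribute zero.
    return sum(values_by_country[c] for c in countries_in_region if not is_missing(c))


# Made-up numbers, for illustration only.
region = ["United States", "Mexico", "Canada", "Belize", "Bahamas"]
big = ["United States", "Mexico"]
demand = {"United States": 4000.0, "Mexico": 300.0, "Canada": 550.0, "Belize": math.nan}

strict = aggregate_region_value(demand, region, big)  # 2 of 5 members missing, above 0.2
lenient = aggregate_region_value(demand, region, big, frac_allowed_nans=0.5)
no_mexico = aggregate_region_value({"United States": 4000.0}, region, big, frac_allowed_nans=1.0)
```

Here `strict` and `no_mexico` are nan, while `lenient` is the sum over the three countries that have data.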

Parameters
  • df (pd.DataFrame) – Original dataset, which may contain data for that region (in which case, it will be replaced by the aggregate data).

  • region (str) – Region to add.

  • countries_in_region (list or None) – List of countries that are members of this region. None to load them from countries-regions dataset.

  • countries_that_must_have_data (list or None) – List of countries that must have data for a particular variable and year, otherwise the region will have nan for that particular variable and year. See function list_countries_in_region_that_must_have_data for more details.

  • num_allowed_nans_per_year (int or None) – Maximum number of nans that can be present in a particular variable and year. If exceeded, the aggregation will be nan.

  • frac_allowed_nans_per_year (float or None) – Maximum fraction of nans that can be present in a particular variable and year. If exceeded, the aggregation will be nan.

  • country_col (str) – Name of country column.

  • year_col (str) – Name of year column.

  • aggregations (dict or None) – Aggregations to execute for each variable. If None, the contribution to each variable from each country in the region will be summed. Otherwise, only the variables indicated in the dictionary will be affected. All remaining variables will be nan.

  • keep_original_region_with_suffix (str or None) – If None, original data for region will be replaced by aggregate data constructed by this function. If not None, original data for region will be kept, with the suffix keep_original_region_with_suffix appended to the region's name.

Returns

df_updated – Original dataset after adding (or replacing) data for selected region.

Return type

pd.DataFrame

datautils.geo.harmonize_countries(df: pandas.core.frame.DataFrame, countries_file: str, country_col: str = 'country', warn_on_missing_countries: bool = True, make_missing_countries_nan: bool = False, warn_on_unused_countries: bool = True, show_full_warning: bool = True) → pandas.core.frame.DataFrame

Harmonize country names in dataframe, following the mapping given in a file.

Parameters
  • df (pd.DataFrame) – Original dataframe that contains a column of non-harmonized country names.

  • countries_file (str) – Path to json file containing a mapping from non-harmonized to harmonized country names.

  • country_col (str) – Name of column in df containing non-harmonized country names.

  • warn_on_missing_countries (bool) – True to warn about countries that appear in original table but not in countries file.

  • make_missing_countries_nan (bool) – True to make nan any country that appears in original dataframe but not in countries file. False to keep their original (possibly non-harmonized) names.

  • warn_on_unused_countries (bool) – True to warn about countries that appear in countries file but are unused (since they do not appear in original dataframe).

  • show_full_warning (bool) – True to display list of countries in warning messages.

Returns

df_harmonized – Original dataframe after standardizing the column of country names.

Return type

pd.DataFrame
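To illustrate how the mapping and the two warning options interact, here is a small plain-Python sketch. It is not the library's implementation: `harmonize_names` is a hypothetical helper that works on a list of names rather than a dataframe, the mapping is inlined as a JSON string instead of being read from countries_file, and None stands in for nan.

```python
import json
import warnings
from typing import Dict, List, Optional


def harmonize_names(
    names: List[str],
    mapping: Dict[str, str],
    warn_on_missing_countries: bool = True,
    make_missing_countries_nan: bool = False,
    warn_on_unused_countries: bool = True,
) -> List[Optional[str]]:
    """Map each name through the mapping; handle unmapped names as configured."""
    # Names in the data with no entry in the mapping.
    missing = sorted(set(names) - set(mapping))
    # Mapping entries that never occur in the data.
    unused = sorted(set(mapping) - set(names))
    if warn_on_missing_countries and missing:
        warnings.warn(f"{len(missing)} countries missing in countries file: {missing}")
    if warn_on_unused_countries and unused:
        warnings.warn(f"{len(unused)} unused countries in countries file: {unused}")

    harmonized: List[Optional[str]] = []
    for name in names:
        if name in mapping:
            harmonized.append(mapping[name])
        elif make_missing_countries_nan:
            harmonized.append(None)  # Stand-in for nan in this sketch.
        else:
            harmonized.append(name)  # Keep the original, possibly non-harmonized name.
    return harmonized


# Hypothetical mapping, in the same shape as the json file described above.
mapping = json.loads('{"USA": "United States", "Mexico": "Mexico", "Holy See": "Vatican"}')
names = ["USA", "Mexico", "Atlantis"]

kept = harmonize_names(names, mapping, warn_on_missing_countries=False, warn_on_unused_countries=False)
blanked = harmonize_names(
    names,
    mapping,
    warn_on_missing_countries=False,
    make_missing_countries_nan=True,
    warn_on_unused_countries=False,
)
```

In this example "Atlantis" is a missing country (in the data but not in the mapping) and "Holy See" is an unused one (in the mapping but not in the data).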

datautils.geo.list_countries_in_region(region: str, countries_regions: Optional[pandas.core.frame.DataFrame] = None, income_groups: Optional[pandas.core.frame.DataFrame] = None) → List[str]

List countries that are members of a region.

Parameters
  • region (str) – Name of the region (e.g. Europe).

  • countries_regions (pd.DataFrame or None) – Countries-regions dataset, or None to load it from the catalog.

  • income_groups (pd.DataFrame or None) – Income-groups dataset, or None to load it from the catalog.

Returns

members – Names of countries that are members of the region.

Return type

list

datautils.geo.list_countries_in_region_that_must_have_data(region: str, reference_year: int = 2018, min_frac_individual_population: float = 0.0, min_frac_cumulative_population: float = 0.7, countries_regions: Optional[pandas.core.frame.DataFrame] = None, income_groups: Optional[pandas.core.frame.DataFrame] = None, population: Optional[pandas.core.frame.DataFrame] = None, verbose: bool = False) → List[str]

List countries of a region that are expected to have the largest contribution to any variable.

The contribution of each country is based on their population relative to the region’s total.

Method to select countries:

  1. Select countries whose population is, on a certain reference year (reference_year), larger than a fraction of min_frac_individual_population with respect to the total population of the region.

  2. Among those, sort countries by descending population, and cut as soon as the cumulative population exceeds min_frac_cumulative_population.

Note: It may not be possible to fulfil both conditions. In that case, a warning is raised.
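The selection method can be sketched in plain Python as below. This is an illustration under stated assumptions rather than the library's code: `select_countries_that_must_have_data` is a hypothetical helper, it takes the populations for the reference year directly as a dictionary (assumed non-empty with a positive total), and the country labels and numbers are made up.

```python
import warnings
from typing import Dict, List


def select_countries_that_must_have_data(
    population_by_country: Dict[str, float],
    min_frac_individual_population: float = 0.0,
    min_frac_cumulative_population: float = 0.7,
) -> List[str]:
    """Pick the most populous countries of a region for one reference year."""
    total = sum(population_by_country.values())

    # Step 1: keep countries whose share of the region's population exceeds the individual threshold.
    candidates = {
        country: population
        for country, population in population_by_country.items()
        if population / total > min_frac_individual_population
    }

    # Step 2: walk through candidates by descending population and stop as soon as
    # their cumulative share exceeds the cumulative threshold.
    selected: List[str] = []
    cumulative = 0.0
    for country, population in sorted(candidates.items(), key=lambda item: item[1], reverse=True):
        selected.append(country)
        cumulative += population
        if cumulative / total > min_frac_cumulative_population:
            break
    else:
        # The loop ended without reaching the cumulative threshold.
        warnings.warn("Both conditions cannot be fulfilled with the selected thresholds.")
    return selected


# Made-up populations for a hypothetical region with four countries.
populations = {"A": 60, "B": 25, "C": 10, "D": 5}
selected = select_countries_that_must_have_data(populations)
```

With the default thresholds, country A alone covers 60% of the region, which does not exceed 70%, so B is added and the selection stops at A and B. Raising min_frac_individual_population to 0.3 leaves only A as a candidate, so the cumulative condition cannot be met and the sketch emits a warning.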

Parameters
  • region (str) – Name of the region.

  • reference_year (int) – Reference year to consider when selecting countries.

  • min_frac_individual_population (float) – Minimum fraction of the total population of the region that each of the listed countries must exceed.

  • min_frac_cumulative_population (float) – Minimum fraction of the total population of the region that the sum of the listed countries must exceed.

  • countries_regions (pd.DataFrame or None) – Countries-regions dataset, or None to load it from the catalog.

  • income_groups (pd.DataFrame or None) – Income-groups dataset, or None to load it from the catalog.

  • population (pd.DataFrame or None) – Population dataset, or None to load it from the catalog.

  • verbose (bool) – True to print the number of countries (and percentage of cumulative population) that must have data.

Returns

countries – Countries that are expected to have the largest contribution.

Return type

list