Control variables for gravity equations

Gravity controls

The gravity model is used in international economics to explain bilateral flows (such as trade) between two units (generally countries) based on the economic size of each unit (GDP or number of inhabitants) and the distance between them (geographic distance in kilometers, for instance). Over time, researchers have extended this model with a number of control variables, which are available in this dataset.
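To make the idea concrete, here is a minimal sketch of one common textbook form of the model, flow = G * size_o^a * size_d^b / dist^c. The function names, the constant G and the exponents a, b, c are hypothetical placeholders chosen for illustration; they are not estimates from this dataset.

```python
import math

def gravity_flow(size_o, size_d, dist, G=1.0, a=1.0, b=1.0, c=1.0):
    """Predicted bilateral flow under a basic gravity form (illustrative only)."""
    return G * (size_o ** a) * (size_d ** b) / (dist ** c)

def log_gravity_flow(size_o, size_d, dist, G=1.0, a=1.0, b=1.0, c=1.0):
    """Same relation in logs, the form usually taken to a regression."""
    return (math.log(G) + a * math.log(size_o)
            + b * math.log(size_d) - c * math.log(dist))

# Hypothetical inputs: with c = 1, doubling the distance halves the flow.
near = gravity_flow(size_o=100.0, size_d=50.0, dist=500.0)
far = gravity_flow(size_o=100.0, size_d=50.0, dist=1000.0)
print(near / far)  # 2.0
```

In the log form, the control variables described below would enter as additional terms on the right-hand side.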

The datasets you can download from this page contain fewer variables than the original dataset from the CEPII. The original dataset contains more than four million observations across 86 variables, which is more than Excel can handle (a worksheet is limited to 1,048,576 rows). You can download the original dataset here. You only need to provide an email address; no email will be sent to you and you will not be subscribed to anything. Some variables have been renamed to improve readability and consistency with the other datasets available on this blog.

I have broken down the 86 variables from the CEPII database into 10 topics (geographic distance, size, time zones, languages, colonial, legal, religions, membership, ease of doing business, and trade) and made one dataframe (i.e. one .csv document) per topic. This makes a total of 10 datasets, which can be downloaded here:

gravity_geodist: 47 101 observations, based on 219 home countries and 219 host countries in the year 2019 (chosen since it contains the smallest number of NA values); it contains 8 variables:

  • country_pair: A character string uniting the home country (origin) and the host country (destination).

  • origin: A character string indicating the home country (ISO3 code).

  • destination: A character string indicating the host country (ISO3 code).

  • common_border: Dummy equal to 1 if countries are contiguous [variable name in the CEPII database: contig].

  • dist: Distance between most populated cities of each country (km).

  • distw: Population-weighted distance between most populated cities (km).

  • distcap: Distance between capitals (km).

  • distwces: Population-weighted distance between most populated cities (km) using CES formulation with θ=−1.

R

CSV

Stata

SOURCE OF THE DATA
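To show how the two population-weighted distances differ, here is a minimal sketch that assumes the general-mean formula d = (sum over city pairs of (pop_k/pop_o) * (pop_l/pop_d) * d_kl^θ)^(1/θ). As far as I understand, θ = 1 corresponds to distw (an arithmetic weighted mean) and θ = −1 to distwces (a harmonic one); please check the source documentation before relying on this. The function name, city names, populations and distances are made up.

```python
def weighted_distance(cities_o, cities_d, dist_km, theta):
    """General-mean distance between two countries (illustrative sketch).

    cities_o, cities_d: dicts mapping city name -> population
    dist_km: dict mapping (city_o, city_d) -> distance in km
    theta: 1 gives the arithmetic mean, -1 the harmonic (CES) mean
    """
    pop_o = sum(cities_o.values())
    pop_d = sum(cities_d.values())
    total = 0.0
    for k, pk in cities_o.items():
        for l, pl in cities_d.items():
            weight = (pk / pop_o) * (pl / pop_d)
            total += weight * dist_km[(k, l)] ** theta
    return total ** (1.0 / theta)

# Made-up example: two origin cities, one destination city.
cities_o = {"A1": 3.0, "A2": 1.0}
cities_d = {"B1": 2.0}
dist_km = {("A1", "B1"): 1000.0, ("A2", "B1"): 2000.0}

d_arith = weighted_distance(cities_o, cities_d, dist_km, theta=1)
d_ces = weighted_distance(cities_o, cities_d, dist_km, theta=-1)
```

With θ = −1, short city-to-city distances weigh more heavily, so the result is never larger than the arithmetic version.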

gravity_size: 4 428 288 observations, based on 248 home countries and 248 host countries over 72 years (1948-2019); it contains 21 variables:

  • ID_o: A character string indicating the home country (ISO3 code) and the year.

  • ID_d: A character string indicating the host country (ISO3 code) and the year.

  • ID: A character string indicating the country pair (home country and host country, both in ISO3 code) and the year.

  • year: Integer indicating the year.

  • country_pair: A character string uniting the home country (origin) and the host country (destination).

  • origin: A character string indicating the home country (ISO3 code).

  • destination: A character string indicating the host country (ISO3 code).

  • pop_o: Population of the home (origin) country (in thousands).

  • pop_d: Population of the host (destination) country (in thousands).

  • GDP_o: GDP of the home (origin) country (current thousands US$) [variable name in the CEPII database: gdp_o].

  • GDP_d: GDP of the host (destination) country (current thousands US$) [variable name in the CEPII database: gdp_d].

  • GDP_per_capita_o: GDP per capita of the home (origin) country (current thousands US$) [variable name in the CEPII database: gdpcap_o].

  • GDP_per_capita_d: GDP per capita of the host (destination) country (current thousands US$) [variable name in the CEPII database: gdpcap_d].

  • gdp_ppp_o: GDP expressed in purchasing power parity of the home (origin) country (current thousands international $).

  • gdp_ppp_d: GDP expressed in purchasing power parity of the host (destination) country (current thousands international $).

  • GDP_per_capita_ppp_o: GDP per capita expressed in purchasing power parity of the home (origin) country (current thousands international $).

  • GDP_per_capita_ppp_d: GDP per capita expressed in purchasing power parity of the host (destination) country (current thousands international $).

  • pop_pwt_o: Population of the home (origin) country (in thousands) (source: Penn World Tables).

  • pop_pwt_d: Population of the host (destination) country (in thousands) (source: Penn World Tables).

  • gdp_ppp_pwt_o: Deflated GDP at current purchasing power parity of the home (origin) country (2011 thousands US$) (source: PWT).

  • gdp_ppp_pwt_d: Deflated GDP at current purchasing power parity of the host (destination) country (2011 thousands US$) (source: PWT).

R

CSV

Stata

SOURCE OF THE DATA
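Since GDP and population are both expressed in thousands, the units are easy to get wrong when recomputing a per-capita figure. Here is a minimal sketch of the arithmetic, assuming the units listed above are exact. The function name and the example values are hypothetical.

```python
def gdp_per_capita_thousands(gdp_thousand_usd, pop_thousands):
    """GDP per capita in thousands of US$, or None when an input is missing."""
    if gdp_thousand_usd is None or pop_thousands is None or pop_thousands == 0:
        return None
    # Both inputs are in thousands, so the ratio is in US$ per inhabitant.
    usd_per_person = gdp_thousand_usd / pop_thousands
    return usd_per_person / 1000.0

# Hypothetical country: GDP of 2 trillion US$, 50 million inhabitants.
value = gdp_per_capita_thousands(gdp_thousand_usd=2_000_000_000,
                                 pop_thousands=50_000)
print(value)  # 40.0
```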

gravity_timezones: 57 600 observations, based on 240 home countries and 240 host countries in the year 2019 (the most recent year available and the one with the smallest number of NA values); it contains 5 variables:

  • country_pair: A character string uniting the home country (origin) and the host country (destination).

  • origin: A character string indicating the home country (ISO3 code).

  • destination: A character string indicating the host country (ISO3 code).

  • time_zone_o: GMT offset in 2020 of the home (origin) country (hours) [variable name in the CEPII database: gmt_offset_2020_o].

  • time_zone_d: GMT offset in 2020 of the host (destination) country (hours) [variable name in the CEPII database: gmt_offset_2020_d].

R

CSV

Stata

SOURCE OF THE DATA
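The two offsets are usually combined into a single time-zone difference before being used as a control. Here is a minimal sketch of one way to do it; the function name is hypothetical, and wrapping the gap around the 24-hour clock is a modelling choice rather than something prescribed by the source.

```python
def time_zone_gap(offset_o, offset_d):
    """Absolute gap in hours between two GMT offsets, wrapped to at most 12."""
    if offset_o is None or offset_d is None:
        return None
    gap = abs(offset_o - offset_d) % 24
    return min(gap, 24 - gap)

# Illustrative offsets (hours relative to GMT).
print(time_zone_gap(1, -5))    # 6
print(time_zone_gap(12, -11))  # 1
```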

gravity_lang: 3 143 839 observations, based on 224 home countries and 224 host countries over 72 years (1948-2019); it contains 9 variables:

  • ID_o: A character string indicating the home country (ISO3 code) and the year.

  • ID_d: A character string indicating the host country (ISO3 code) and the year.

  • ID: A character string indicating the country pair (home country and host country, both in ISO3 code) and the year.

  • country_pair: A character string uniting the home country (origin) and the host country (destination) using “_”.

  • origin: A character string indicating the home country (ISO3 code).

  • destination: A character string indicating the host country (ISO3 code).

  • common_language_off: 1 if countries share a common official or primary language.

  • common_language_ethno: 1 if countries share a common language spoken by at least 9% of the population.

R

CSV

Stata

SOURCE OF THE DATA

gravity_colonial: 61 504 observations, based on 248 home countries and 248 host countries in the year 2010 (chosen since it contains the smallest number of NA values); it contains 16 variables:

  • country_pair: A character string uniting the home country (origin) and the host country (destination).

  • origin: A character string indicating the home country (ISO3 code).

  • destination: A character string indicating the host country (ISO3 code).

  • comcol: 1 if origin and destination countries share a common colonizer post 1945.

  • colonial_ties: 1 if countries are or were in a colonial relationship post 1945 [variable name in the CEPII database: col45].

  • heg_o: 1 if origin is current or former hegemon of destination.

  • heg_d: 1 if destination is current or former hegemon of origin.

  • col_dep_ever: 1 if pair was ever in a colonial or dependency relationship (including before 1948).

  • col_dep: 1 if pair is currently in a colonial or dependency relationship.

  • col_dep_end_year: Independence year from concerned hegemon (includes colonial ties before 1948).

  • col_dep_end_conflict: 1 if independence from the concerned hegemon involved a conflict.

  • common_colonizer: A character string indicating the country which was a common colonizer of both the home (origin) country and the host (destination) country [variable name in the CEPII database: empire]. Common colonizer countries are the Netherlands (NLD), Great Britain (GBR), the United States (USA), France (FRA), Australia (AUS), New Zealand (NZL), and Denmark (DNK).

  • sibling_ever: 1 if pair ever had the same colonizer (including before 1948).

  • sibling: 1 if pair currently has the same colonizer.

  • sever_year: Severance year for pairs that ever had the same colonizer (includes colonial ties before 1948); corresponds to the independence year of the first independent sibling.

  • sib_conflict: 1 if pair ever had the same colonizer and independence involved a conflict with the hegemon (includes colonial ties before 1948).

R

CSV

Stata

SOURCE OF THE DATA

gravity_legal: 61 504 observations, based on 248 home countries and 248 host countries in the year 2010 (chosen since it contains the smallest number of NA values); it contains 10 variables:

  • country_pair: A character string uniting the home country (origin) and the host country (destination).

  • origin: A character string indicating the home country (ISO3 code).

  • destination: A character string indicating the host country (ISO3 code).

  • law_before_1991_o: Historical origin of the home (origin) country’s laws before 1991 [variable name in the CEPII database: legal_old_o].

  • law_before_1991_d: Historical origin of the host (destination) country’s laws before 1991 [variable name in the CEPII database: legal_old_d].

  • law_after_1991_o: Historical origin of the home (origin) country’s laws after 1991 [variable name in the CEPII database: legal_new_o].

  • law_after_1991_d: Historical origin of the host (destination) country’s laws after 1991 [variable name in the CEPII database: legal_new_d].

  • comleg_pretrans: 1 if countries share common legal origins before 1991.

  • comleg_posttrans: 1 if countries share common legal origins after 1991.

  • transition_legalchange: 1 if common legal origin changed in 1991.

R

CSV

Stata

SOURCE OF THE DATA

gravity_religions: 61 504 observations, based on 248 home countries and 248 host countries in the year 2010 (chosen since it contains the smallest number of NA values); it contains 11 variables:

  • country_pair: A character string uniting the home country (origin) and the host country (destination).

  • origin: A character string indicating the home country (ISO3 code).

  • destination: A character string indicating the host country (ISO3 code).

  • pct_catholics_o: Share of Catholics in the population of the home (origin) country in 1980 (%) [variable name in the CEPII database: cat_o].

  • pct_catholics_d: Share of Catholics in the population of the host (destination) country in 1980 (%) [variable name in the CEPII database: cat_d].

  • pct_muslims_o: Share of Muslims in the population of the home (origin) country in 1980 (%) [variable name in the CEPII database: mus_o].

  • pct_muslims_d: Share of Muslims in the population of the host (destination) country in 1980 (%) [variable name in the CEPII database: mus_d].

  • pct_protestants_o: Share of Protestants in the population of the home (origin) country in 1980 (%) [variable name in the CEPII database: pro_o].

  • pct_protestants_d: Share of Protestants in the population of the host (destination) country in 1980 (%) [variable name in the CEPII database: pro_d].

  • pct_other_religion_o: Share of other religions in the population of the home (origin) country in 1980 (%) [variable name in the CEPII database: oth_o].

  • pct_other_religion_d: Share of other religions in the population of the host (destination) country in 1980 (%) [variable name in the CEPII database: oth_d].

R

CSV

Stata

SOURCE OF THE DATA
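The shares can be combined into a single pair-level measure of religious proximity. Here is a minimal sketch of one construction often seen in the gravity literature: the sum, over religious groups, of the product of the two countries' shares. This index is not a variable of the dataset; the function name and the example values are hypothetical, I assume the shares are stored as percentages between 0 and 100, and which groups to include is a modelling choice.

```python
GROUPS = ("catholics", "muslims", "protestants")

def religious_proximity(row):
    """Sum over groups of share_o * share_d, with % converted to fractions."""
    total = 0.0
    for group in GROUPS:
        share_o = row[f"pct_{group}_o"]
        share_d = row[f"pct_{group}_d"]
        if share_o is None or share_d is None:
            return None  # propagate missing values instead of guessing
        total += (share_o / 100.0) * (share_d / 100.0)
    return total

# Made-up country pair.
row = {
    "pct_catholics_o": 80.0, "pct_catholics_d": 50.0,
    "pct_muslims_o": 10.0, "pct_muslims_d": 0.0,
    "pct_protestants_o": 5.0, "pct_protestants_d": 40.0,
}
index = religious_proximity(row)
```

The index lies between 0 and 1 and is higher when the same group is large in both countries.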

gravity_membership: 4 428 288 observations, based on 248 home countries and 248 host countries over 72 years (1948-2019); it contains 16 variables:

  • ID_o: A character string indicating the home country (ISO3 code) and the year.

  • ID_d: A character string indicating the host country (ISO3 code) and the year.

  • ID: A character string indicating the country pair (home country and host country, both in ISO3 code) and the year.

  • year: Integer indicating the year.

  • country_pair: A character string uniting the home country (origin) and the host country (destination).

  • origin: A character string indicating the home country (ISO3 code).

  • destination: A character string indicating the host country (ISO3 code).

  • GATT_member_o: 1 if the home (origin) country currently is a GATT member [variable name in the CEPII database: gatt_o].

  • GATT_member_d: 1 if the host (destination) country currently is a GATT member [variable name in the CEPII database: gatt_d].

  • WTO_member_o: 1 if the home (origin) country currently is a WTO member [variable name in the CEPII database: wto_o].

  • WTO_member_d: 1 if the host (destination) country currently is a WTO member [variable name in the CEPII database: wto_d].

  • EU_member_o: 1 if the home (origin) country currently is an EU member [variable name in the CEPII database: eu_o].

  • EU_member_d: 1 if the host (destination) country currently is an EU member [variable name in the CEPII database: eu_d].

  • regional_trading_agreement: 1 if the pair currently has a regional trading agreement (source: WTO) [variable name in the CEPII database: rta].

  • rta_coverage: Indicates whether the RTA covers goods only or goods and services (source: WTO).

  • rta_type: Indicates the type of RTA (customs union for instance).

R

CSV

Stata

SOURCE OF THE DATA
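Membership dummies are reported per country, whereas a gravity regression often needs a pair-level indicator. Here is a minimal sketch of how such an indicator could be derived; the function name and the example rows are hypothetical.

```python
def both_members(flag_o, flag_d):
    """1 if both countries are members, 0 otherwise, None if a flag is missing."""
    if flag_o is None or flag_d is None:
        return None
    return int(flag_o == 1 and flag_d == 1)

# Made-up rows using the column names listed above.
rows = [
    {"country_pair": "AAA_BBB", "WTO_member_o": 1, "WTO_member_d": 1},
    {"country_pair": "AAA_CCC", "WTO_member_o": 1, "WTO_member_d": 0},
]
for r in rows:
    r["both_WTO"] = both_members(r["WTO_member_o"], r["WTO_member_d"])
```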

gravity_ease_business: 4 428 288 observations, based on 248 home countries and 248 host countries over 72 years (1948-2019); it contains 15 variables:

  • ID_o: A character string indicating the home country (ISO3 code) and the year.

  • ID_d: A character string indicating the host country (ISO3 code) and the year.

  • ID: A character string indicating the country pair (home country and host country, both in ISO3 code) and the year.

  • year: Integer indicating the year.

  • country_pair: A character string uniting the home country (origin) and the host country (destination).

  • origin: A character string indicating the home country (ISO3 code).

  • destination: A character string indicating the host country (ISO3 code).

  • entry_cost_o: Cost of business start-up procedures (% of GNI per capita) in the home (origin) country.

  • entry_cost_d: Cost of business start-up procedures (% of GNI per capita) in the host (destination) country.

  • entry_proc_o: Number of start-up procedures to register a business in the home (origin) country.

  • entry_proc_d: Number of start-up procedures to register a business in the host (destination) country.

  • entry_time_o: Days required to start a business in the home (origin) country.

  • entry_time_d: Days required to start a business in the host (destination) country.

  • entry_tp_o: Days required to start a business + number of procedures to start a business in the home (origin) country.

  • entry_tp_d: Days required to start a business + number of procedures to start a business in the host (destination) country.

R

CSV

Stata

SOURCE OF THE DATA

gravity_trade: 4 428 288 observations, based on 248 home countries and 248 host countries over 72 years (1948-2019); it contains 13 variables:

  • ID_o: A character string indicating the home country (ISO3 code) and the year.

  • ID_d: A character string indicating the host country (ISO3 code) and the year.

  • ID: A character string indicating the country pair (home country and host country, both in ISO3 code) and the year.

  • year: Integer indicating the year.

  • country_pair: A character string uniting the home country (origin) and the host country (destination).

  • origin: A character string indicating the home country (ISO3 code).

  • destination: A character string indicating the host country (ISO3 code).

  • trade_comtrade_o: Trade flow as reported by the exporter (in thousands current US$) (source: Comtrade).

  • trade_comtrade_d: Trade flow as reported by the importer (in thousands current US$) (source: Comtrade).

  • tradeflow_baci: Trade flow (in thousands current US$) (source: BACI).

  • manuf_tradeflow_baci: Trade flow of manufactured goods (in thousands current US$) (source: BACI).

  • trade_imf_o: Trade flow as reported by the exporter (in thousands current US$) (source: IMF).

  • trade_imf_d: Trade flow as reported by the importer (in thousands current US$) (source: IMF).

R

CSV

Stata

SOURCE OF THE DATA
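Because the same flow is reported twice (once by the exporter, once by the importer), the two figures usually have to be reconciled before estimation. Here is a minimal sketch of one simple option, averaging the two reports when both exist, followed by the log transformation used in log-linear gravity regressions. The function names are hypothetical, averaging is only one of several possible choices, and returning None for zero flows means those observations would be dropped in a log specification.

```python
import math

def reconciled_flow(reported_by_exporter, reported_by_importer):
    """Average the two mirror reports when both exist, else use the one available."""
    values = [v for v in (reported_by_exporter, reported_by_importer)
              if v is not None]
    if not values:
        return None
    return sum(values) / len(values)

def log_flow(flow):
    """Natural log of a strictly positive flow; zero or missing flows give None."""
    if flow is None or flow <= 0:
        return None
    return math.log(flow)

# Made-up values standing in for trade_comtrade_o and trade_comtrade_d.
flow = reconciled_flow(100.0, 120.0)
print(flow)  # 110.0
```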

DOCUMENTATION OF THE DATA

DATA ON WHETHER EACH COUNTRY IS LANDLOCKED, AN ISLAND, ETC.
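Since the datasets share key columns (country_pair for the single-year files, ID for the panel files), they can be combined again after download. Here is a minimal sketch using only Python's standard csv module. The helper names are hypothetical, the inline text stands in for the downloaded .csv files, the values are made up, and I assume country_pair is built as origin, "_", destination.

```python
import csv
import io

def index_by(rows, key):
    """Map each key value to its row (the last row wins on duplicates)."""
    return {row[key]: row for row in rows}

def left_join(left_rows, right_rows, key):
    """Attach right-hand columns to each left-hand row that shares the key."""
    lookup = index_by(right_rows, key)
    merged = []
    for row in left_rows:
        extra = lookup.get(row[key], {})
        merged.append({**extra, **row})  # left-hand values win on shared columns
    return merged

# Tiny stand-ins for two of the CSV files (values are made up).
geodist_csv = (
    "country_pair,origin,destination,dist\n"
    "FRA_DEU,FRA,DEU,880\n"
    "FRA_USA,FRA,USA,5840\n"
)
timezones_csv = (
    "country_pair,origin,destination,time_zone_o,time_zone_d\n"
    "FRA_DEU,FRA,DEU,1,1\n"
)

geodist = list(csv.DictReader(io.StringIO(geodist_csv)))
timezones = list(csv.DictReader(io.StringIO(timezones_csv)))
merged = left_join(geodist, timezones, key="country_pair")
```

Note that csv.DictReader returns every value as a string, so numeric columns need to be converted before any computation. For the panel files, the same join would use ID as the key.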