Merging Data

There are two ways to combine datasets in geopandas – attribute joins and spatial joins.

In an attribute join, a GeoSeries or GeoDataFrame is combined with a regular pandas Series or DataFrame based on a common variable. This is analogous to normal merging or joining in pandas.

In a spatial join, observations from two GeoSeries or GeoDataFrames are combined based on their spatial relationship to one another.

In the following examples, we use these datasets:

In [1]: world = geopandas.read_file(geopandas.datasets.get_path('naturalearth_lowres'))

In [2]: cities = geopandas.read_file(geopandas.datasets.get_path('naturalearth_cities'))

# For attribute join
In [3]: country_shapes = world[['geometry', 'iso_a3']]

In [4]: country_names = world[['name', 'iso_a3']]

# For spatial join
In [5]: countries = world[['geometry', 'name']]

In [6]: countries = countries.rename(columns={'name':'country'})

Attribute Joins

Attribute joins are accomplished using the merge method. In general, it is recommended to call merge as a method on the spatial dataset. That said, the stand-alone merge function will also work if the GeoDataFrame is the left argument; if a DataFrame is the left argument and a GeoDataFrame is the right argument, the result will no longer be a GeoDataFrame.

For example, consider the following merge that adds full names to a GeoDataFrame that initially has only ISO codes for each country by merging it with a pandas DataFrame.

# `country_shapes` is a GeoDataFrame with country shapes and iso codes
In [7]: country_shapes.head()
Out[7]: 
                                            geometry iso_a3
0  MULTIPOLYGON (((180.000000000 -16.067132664, 1...    FJI
1  POLYGON ((33.903711197 -0.950000000, 34.072620...    TZA
2  POLYGON ((-8.665589565 27.656425890, -8.665124...    ESH
3  MULTIPOLYGON (((-122.840000000 49.000000000, -...    CAN
4  MULTIPOLYGON (((-122.840000000 49.000000000, -...    USA

# `country_names` is a DataFrame with country names and iso codes
In [8]: country_names.head()
Out[8]: 
                       name iso_a3
0                      Fiji    FJI
1                  Tanzania    TZA
2                 W. Sahara    ESH
3                    Canada    CAN
4  United States of America    USA

# Merge with `merge` method on shared variable (iso codes):
In [9]: country_shapes = country_shapes.merge(country_names, on='iso_a3')

In [10]: country_shapes.head()
Out[10]: 
                                            geometry iso_a3                      name
0  MULTIPOLYGON (((180.000000000 -16.067132664, 1...    FJI                      Fiji
1  POLYGON ((33.903711197 -0.950000000, 34.072620...    TZA                  Tanzania
2  POLYGON ((-8.665589565 27.656425890, -8.665124...    ESH                 W. Sahara
3  MULTIPOLYGON (((-122.840000000 49.000000000, -...    CAN                    Canada
4  MULTIPOLYGON (((-122.840000000 49.000000000, -...    USA  United States of America
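
The earlier point about argument order can be checked with a small sketch. The two toy tables below are hypothetical (they are not the Natural Earth data used above), and pd.merge stands for the stand-alone pandas function. Treat this as an illustration under those assumptions rather than a guarantee for every pandas/geopandas version.

```python
import geopandas
import pandas as pd
from shapely.geometry import Point

# Hypothetical toy data: two "countries" represented by single points.
shapes = geopandas.GeoDataFrame(
    {"iso_a3": ["AAA", "BBB"]},
    geometry=[Point(0, 0), Point(1, 1)],
)
names = pd.DataFrame({"iso_a3": ["AAA", "BBB"], "name": ["Country A", "Country B"]})

# Method called from the spatial dataset: the result stays a GeoDataFrame.
method_result = shapes.merge(names, on="iso_a3")

# Stand-alone function, GeoDataFrame as the left argument: still a GeoDataFrame.
function_geo_left = pd.merge(shapes, names, on="iso_a3")

# Stand-alone function, plain DataFrame as the left argument: the geometry
# column is carried along, but the result is an ordinary DataFrame.
function_geo_right = pd.merge(names, shapes, on="iso_a3")

for result in (method_result, function_geo_left, function_geo_right):
    print(type(result).__name__)
```

If a plain DataFrame does come back, it can be wrapped again with geopandas.GeoDataFrame(function_geo_right, geometry="geometry"), although calling merge from the GeoDataFrame avoids the extra step.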

Spatial Joins

In a spatial join, two geometry objects are merged based on their spatial relationship to one another.

# One GeoDataFrame of countries, one of cities.
# We want to merge them so we can get each city's country.
In [11]: countries.head()
Out[11]: 
                                            geometry                   country
0  MULTIPOLYGON (((180.000000000 -16.067132664, 1...                      Fiji
1  POLYGON ((33.903711197 -0.950000000, 34.072620...                  Tanzania
2  POLYGON ((-8.665589565 27.656425890, -8.665124...                 W. Sahara
3  MULTIPOLYGON (((-122.840000000 49.000000000, -...                    Canada
4  MULTIPOLYGON (((-122.840000000 49.000000000, -...  United States of America

In [12]: cities.head()
Out[12]: 
           name                           geometry
0  Vatican City  POINT (12.453386545 41.903282180)
1    San Marino  POINT (12.441770158 43.936095835)
2         Vaduz   POINT (9.516669473 47.133723774)
3    Luxembourg   POINT (6.130002806 49.611660379)
4       Palikir  POINT (158.149974324 6.916643696)

# Execute spatial join
In [13]: cities_with_country = geopandas.sjoin(cities, countries, how="inner", op='intersects')

In [14]: cities_with_country.head()
Out[14]: 
             name                           geometry  index_right  country
0    Vatican City  POINT (12.453386545 41.903282180)          141    Italy
1      San Marino  POINT (12.441770158 43.936095835)          141    Italy
192          Rome  POINT (12.481312563 41.897901485)          141    Italy
2           Vaduz   POINT (9.516669473 47.133723774)          114  Austria
184        Vienna  POINT (16.364693097 48.201961137)          114  Austria

Sjoin Arguments

sjoin() has two core arguments: how and op.

op

The op argument specifies how geopandas decides whether or not to join the attributes of one object to another. There are three different join options:

  • intersects: The attributes will be joined if the boundary and interior of the object intersect in any way with the boundary and/or interior of the other object.

  • within: The attributes will be joined if the object’s boundary and interior intersect only with the interior of the other object (not its boundary or exterior).

  • contains: The attributes will be joined if the object’s interior contains the boundary and interior of the other object and their boundaries do not touch at all.

You can read more about each join type in the Shapely documentation.
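
To get a feel for how the three options differ, the corresponding Shapely methods can be called directly on individual geometries. The square and points below are hypothetical shapes chosen for illustration; this sketches the predicate semantics only, not how sjoin is implemented internally.

```python
from shapely.geometry import Point, box

square = box(0, 0, 2, 2)   # square with corners (0, 0) and (2, 2)
inside = Point(1, 1)       # strictly inside the square
on_edge = Point(0, 1)      # lies on the square's boundary
outside = Point(5, 5)      # well away from the square

# Print, for each point: intersects / within / (square) contains
for label, pt in [("inside", inside), ("on_edge", on_edge), ("outside", outside)]:
    print(label, pt.intersects(square), pt.within(square), square.contains(pt))
```

The boundary point is the interesting case: it intersects the square, but it is neither within the square nor contained by it.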

how

The how argument specifies the type of join that will occur and which geometry is retained in the resultant geodataframe. It accepts the following options:

  • left: use the index from the first (or left_df) geodataframe that you provide to sjoin; retain only the left_df geometry column

  • right: use index from second (or right_df); retain only the right_df geometry column

  • inner: use intersection of index values from both geodataframes; retain only the left_df geometry column
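
The effect of how can be seen on a small example. The GeoDataFrames below are hypothetical toy data (three points and three boxes, with one point and one box left unmatched), and the join relies on sjoin's default intersects relationship, so the op argument is not passed.

```python
import geopandas
from shapely.geometry import Point, box

# Hypothetical data: c3 falls in no country, and country C contains no city.
cities = geopandas.GeoDataFrame(
    {"city": ["c1", "c2", "c3"]},
    geometry=[Point(1, 1), Point(4, 1), Point(8, 8)],
)
countries = geopandas.GeoDataFrame(
    {"country": ["A", "B", "C"]},
    geometry=[box(0, 0, 2, 2), box(3, 0, 5, 2), box(10, 10, 11, 11)],
)

# inner: only matched pairs are kept; geometry comes from the left (points).
inner = geopandas.sjoin(cities, countries, how="inner")

# left: every city is kept; the unmatched city gets a missing country.
left = geopandas.sjoin(cities, countries, how="left")

# right: every country is kept; geometry comes from the right (polygons).
right = geopandas.sjoin(cities, countries, how="right")

print(len(inner), len(left), len(right))
print(set(left.geometry.geom_type), set(right.geometry.geom_type))
```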

Note that more complicated spatial relationships can be studied by combining geometric operations with a spatial join. To find all polygons within a given distance of a point, for example, one can first use the buffer method to expand each point into a circle of the appropriate radius, then intersect those buffered circles with the polygons in question.
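
A minimal sketch of the buffer-then-join approach follows. The points, polygons and the 1.5 search radius are hypothetical, and the coordinates are treated as planar units with no CRS set; with real data the layers would normally be in a projected CRS so that the buffer distance is meaningful. The join again relies on sjoin's default intersects relationship.

```python
import geopandas
from shapely.geometry import Point, box

# Hypothetical data in arbitrary planar units.
points = geopandas.GeoDataFrame(
    {"site": ["a", "b"]},
    geometry=[Point(0, 0), Point(10, 10)],
)
polygons = geopandas.GeoDataFrame(
    {"zone": ["near", "far"]},
    geometry=[box(1, 0, 2, 1), box(20, 20, 21, 21)],
)

# Expand each point into an (approximate) circle of radius 1.5.
buffered = points.copy()
buffered["geometry"] = points.geometry.buffer(1.5)

# Join the circles to the polygons they intersect, i.e. the polygons
# lying within roughly 1.5 units of each original point.
nearby = geopandas.sjoin(buffered, polygons, how="inner")
print(nearby[["site", "zone"]])
```

The result carries the buffered circles as its geometry; the original point geometries can be recovered through the shared index if they are needed afterwards.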

Sjoin Performance

Existing spatial indexes on either left_df or right_df will be reused when performing an sjoin. If neither GeoDataFrame has a spatial index, one will be generated for the longer of the two. If both have a spatial index, the right_df’s index will be used preferentially. When several sjoins in a row involve a common GeoDataFrame, performance may be improved by pre-generating that GeoDataFrame's spatial index (by accessing df1.sindex) before performing the sjoins.

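import geopandas
from geopandas import sjoin
from shapely.geometry import Point, box

# Small placeholder GeoDataFrames; substitute your own data
df1 = geopandas.GeoDataFrame(geometry=[Point(0, 0), Point(2, 2)])
df2 = geopandas.GeoDataFrame(geometry=[box(-1, -1, 1, 1)])
df3 = geopandas.GeoDataFrame(geometry=[box(1, 1, 3, 3)])

# pre-generate sindex on df1 if it doesn't already exist
df1.sindex

sjoin(df1, df2)
# sindex for df1 is reused
sjoin(df1, df3)
# sindex for df1 is reused again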