Merging data#

There are two ways to combine datasets in GeoPandas – attribute joins and spatial joins.

In an attribute join, a GeoSeries or GeoDataFrame is combined with a regular pandas.Series or pandas.DataFrame based on a common variable. This is analogous to normal merging or joining in pandas.

In a spatial join, observations from two GeoSeries or GeoDataFrame objects are combined based on their spatial relationship to one another.

In the following examples, these datasets are used:

In [1]: import geodatasets
   ...: import geopandas
   ...: import pandas as pd

In [2]: chicago = geopandas.read_file(geodatasets.get_path("geoda.chicago_commpop"))

In [3]: groceries = geopandas.read_file(geodatasets.get_path("geoda.groceries"))

# For attribute join
In [4]: chicago_shapes = chicago[['geometry', 'NID']]

In [5]: chicago_names = chicago[['community', 'NID']]

# For spatial join
In [6]: chicago = chicago[['geometry', 'community']].to_crs(groceries.crs)

Appending#

Appending GeoDataFrame and GeoSeries objects uses the pandas concat() function. Keep in mind that the appended geometry columns need to have the same CRS.

# Appending GeoSeries
In [7]: joined = pd.concat([chicago.geometry, groceries.geometry])

# Appending GeoDataFrames
In [8]: douglas = chicago[chicago.community == 'DOUGLAS']

In [9]: oakland = chicago[chicago.community == 'OAKLAND']

In [10]: douglas_oakland = pd.concat([douglas, oakland])
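
The CRS requirement can be illustrated with a minimal sketch on made-up data. The `parks` and `shops` frames, their coordinates, and the two EPSG codes are assumptions chosen for illustration only; they are not part of the Chicago datasets used above.

```python
import geopandas as gpd
import pandas as pd
from shapely.geometry import Point

# Two tiny, hypothetical layers stored in different coordinate reference systems.
parks = gpd.GeoDataFrame(
    {"name": ["park_a"]}, geometry=[Point(-87.62, 41.88)], crs="EPSG:4326"
)
shops = gpd.GeoDataFrame(
    {"name": ["shop_a"]}, geometry=[Point(-87.65, 41.85)], crs="EPSG:4326"
).to_crs("EPSG:3857")

# Reproject one layer so that both share a CRS, then concatenate.
shops_aligned = shops.to_crs(parks.crs)
combined = pd.concat([parks, shops_aligned], ignore_index=True)

print(type(combined).__name__, combined.crs)
```

Reprojecting before concatenating keeps all coordinates in the result comparable; which layer to reproject is a design choice that depends on the CRS needed downstream.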

Attribute joins#

Attribute joins are accomplished using the merge() method. In general, it is recommended to call merge() on the spatial dataset. That said, the stand-alone pandas.merge() function also works if the GeoDataFrame is the left argument; if a DataFrame is the left argument and a GeoDataFrame is the right argument, the result will no longer be a GeoDataFrame.

For example, consider the following merge, which adds full community names to a GeoDataFrame that initially has only an area ID for each geometry, by merging it with a DataFrame.

# `chicago_shapes` is a GeoDataFrame with community shapes and area IDs
In [11]: chicago_shapes.head()
Out[11]: 
                                            geometry  NID
0  MULTIPOLYGON (((-87.60914 41.84469, -87.60915 ...   35
1  MULTIPOLYGON (((-87.59215 41.81693, -87.59231 ...   36
2  MULTIPOLYGON (((-87.62880 41.80189, -87.62879 ...   37
3  MULTIPOLYGON (((-87.60671 41.81681, -87.60670 ...   38
4  MULTIPOLYGON (((-87.59215 41.81693, -87.59215 ...   39

# `chicago_names` is a DataFrame with community names and area IDs
In [12]: chicago_names.head()
Out[12]: 
         community  NID
0          DOUGLAS   35
1          OAKLAND   36
2      FULLER PARK   37
3  GRAND BOULEVARD   38
4          KENWOOD   39

# Merge with `merge` method on shared variable (area ID):
In [13]: chicago_shapes = chicago_shapes.merge(chicago_names, on='NID')

In [14]: chicago_shapes.head()
Out[14]: 
                                            geometry  NID        community
0  MULTIPOLYGON (((-87.60914 41.84469, -87.60915 ...   35          DOUGLAS
1  MULTIPOLYGON (((-87.59215 41.81693, -87.59231 ...   36          OAKLAND
2  MULTIPOLYGON (((-87.62880 41.80189, -87.62879 ...   37      FULLER PARK
3  MULTIPOLYGON (((-87.60671 41.81681, -87.60670 ...   38  GRAND BOULEVARD
4  MULTIPOLYGON (((-87.59215 41.81693, -87.59215 ...   39          KENWOOD
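
The effect of argument order on the result type can be checked with a small sketch. The `shapes` and `names` frames below are hypothetical stand-ins for `chicago_shapes` and `chicago_names`; wrapping the plain result back into a GeoDataFrame is one possible recovery step, shown here as an assumption rather than a recommended workflow.

```python
import geopandas as gpd
import pandas as pd
from shapely.geometry import Point

# Hypothetical stand-ins for the shapes and names tables.
shapes = gpd.GeoDataFrame(
    {"NID": [1, 2]}, geometry=[Point(0, 0), Point(1, 1)], crs="EPSG:4326"
)
names = pd.DataFrame({"NID": [1, 2], "community": ["A", "B"]})

# Calling merge() on the GeoDataFrame keeps the result spatial.
geo_result = shapes.merge(names, on="NID")

# With the plain DataFrame on the left, pandas returns a plain DataFrame.
plain_result = pd.merge(names, shapes, on="NID")

# One way to get a GeoDataFrame back is to wrap the result again.
restored = gpd.GeoDataFrame(plain_result, geometry="geometry")

print(type(geo_result).__name__, type(plain_result).__name__, type(restored).__name__)
```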

Spatial joins#

In a spatial join, two geometry objects are merged based on their spatial relationship to one another.

# One GeoDataFrame of communities, one of grocery stores.
# Want to merge to get each grocery's community.
In [15]: chicago.head()
Out[15]: 
                                            geometry        community
0  MULTIPOLYGON (((1181573.250 1886828.039, 11815...          DOUGLAS
1  MULTIPOLYGON (((1186289.356 1876750.733, 11862...          OAKLAND
2  MULTIPOLYGON (((1176344.998 1871187.546, 11763...      FULLER PARK
3  MULTIPOLYGON (((1182322.043 1876674.730, 11823...  GRAND BOULEVARD
4  MULTIPOLYGON (((1186289.356 1876750.733, 11862...          KENWOOD

In [16]: groceries.head()
Out[16]: 
   OBJECTID     Ycoord  ...  Category                              geometry
0        16  41.973266  ...       NaN  MULTIPOINT (1168268.672 1933554.350)
1        18  41.696367  ...       NaN  MULTIPOINT (1162302.618 1832900.224)
2        22  41.868634  ...       NaN  MULTIPOINT (1173317.042 1895425.426)
3        23  41.877590  ...       new  MULTIPOINT (1168996.475 1898801.406)
4        27  41.737696  ...       NaN  MULTIPOINT (1176991.989 1847262.423)

[5 rows x 8 columns]

# Execute spatial join
In [17]: groceries_with_community = groceries.sjoin(chicago, how="inner", predicate='intersects')

In [18]: groceries_with_community.head()
Out[18]: 
     OBJECTID     Ycoord  ...  index_right    community
0          16  41.973266  ...           30       UPTOWN
87        365  41.961707  ...           30       UPTOWN
90        373  41.963131  ...           30       UPTOWN
140       582  41.969131  ...           30       UPTOWN
1          18  41.696367  ...           73  MORGAN PARK

[5 rows x 10 columns]

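GeoPandas provides two spatial-join functions:

  • GeoDataFrame.sjoin(): joins based on binary predicates (intersects, contains, etc.)

  • GeoDataFrame.sjoin_nearest(): joins based on proximity, with the ability to set a maximum search radius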

Note

For historical reasons, both methods are also available as top-level functions sjoin() and sjoin_nearest(). It is recommended to use methods as the functions may be deprecated in the future.

Binary predicate joins#

Binary predicate joins are available via GeoDataFrame.sjoin().

GeoDataFrame.sjoin() has two core arguments: how and predicate.

predicate

The predicate argument specifies how GeoPandas decides whether or not to join the attributes of one object to another, based on their geometric relationship.

The values for predicate correspond to the names of geometric binary predicates and depend on the spatial index implementation.

The default spatial index in GeoPandas currently supports the following values for predicate, which are defined in the Shapely documentation:

  • intersects

  • contains

  • within

  • touches

  • crosses

  • overlaps
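
To make the difference between predicates concrete, the following sketch joins three made-up points against a single square. The frames `square` and `points` and their labels are hypothetical; only `intersects`, `within`, and `touches` are compared here.

```python
import geopandas as gpd
from shapely.geometry import Point, box

# One square and three points: inside it, exactly on its edge, and outside it.
square = gpd.GeoDataFrame({"zone": ["square"]}, geometry=[box(0, 0, 2, 2)])
points = gpd.GeoDataFrame(
    {"label": ["inside", "on_edge", "outside"]},
    geometry=[Point(1, 1), Point(2, 1), Point(3, 3)],
)

# Collect the labels of the points that match under each predicate.
matches = {
    predicate: sorted(points.sjoin(square, predicate=predicate)["label"])
    for predicate in ("intersects", "within", "touches")
}

for predicate, labels in matches.items():
    print(predicate, labels)
```

The point on the edge matches `intersects` and `touches` but not `within`, because `within` requires the point to lie in the interior of the polygon.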

how

The how argument specifies the type of join that will occur and which geometry is retained in the resultant GeoDataFrame. It accepts the following options:

  • left: use the index from the first (or left_df) GeoDataFrame that you provide to GeoDataFrame.sjoin(); retain only the left_df geometry column

  • right: use index from second (or right_df); retain only the right_df geometry column

  • inner: use intersection of index values from both GeoDataFrames; retain only the left_df geometry column
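
The three options listed above can be compared side by side in a short sketch. The `points` and `areas` frames are invented for illustration; one point falls outside every area and one area contains no point, so that the join types give visibly different results.

```python
import geopandas as gpd
from shapely.geometry import Point, box

# Hypothetical data: two points inside area "A", one point outside every area,
# and an area "B" that contains no points.
points = gpd.GeoDataFrame(
    {"store": ["s0", "s1", "s2"]},
    geometry=[Point(0.5, 0.5), Point(0.6, 0.4), Point(10, 10)],
)
areas = gpd.GeoDataFrame(
    {"area": ["A", "B"]}, geometry=[box(0, 0, 1, 1), box(5, 5, 6, 6)]
)

left = points.sjoin(areas, how="left")    # every point kept; unmatched rows get NaN
inner = points.sjoin(areas, how="inner")  # only points that matched an area
right = points.sjoin(areas, how="right")  # every area kept; geometry comes from `areas`

print(len(left), len(inner), len(right))
print(set(left.geom_type), set(right.geom_type))
```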

Note that more complicated spatial relationships can be studied by combining geometric operations with a spatial join. To find all polygons within a given distance of a point, for example, one can first use the buffer() method to expand each point into a circle of appropriate radius, then intersect those buffered circles with the polygons in question.
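
The buffer-then-join approach described here might look like the sketch below. The `sites` and `parcels` frames and the `search_radius` value are assumptions for illustration; no CRS is set, so the radius is in plain coordinate units.

```python
import geopandas as gpd
from shapely.geometry import Point, box

# Hypothetical data: one point of interest and two polygons,
# one about 1 unit away and one far away.
sites = gpd.GeoDataFrame({"site": ["origin"]}, geometry=[Point(0, 0)])
parcels = gpd.GeoDataFrame(
    {"parcel": ["near", "far"]},
    geometry=[box(1, 0, 2, 1), box(10, 10, 11, 11)],
)

# Expand each point into an (approximate) circle of the search radius.
search_radius = 1.5
buffered = sites.copy()
buffered["geometry"] = sites.geometry.buffer(search_radius)

# Polygons that intersect a buffer lie within the search radius of the point.
nearby = buffered.sjoin(parcels, how="inner", predicate="intersects")

print(nearby[["site", "parcel"]])
```

Note that the result carries the buffered geometry from the left frame, not the original points; keep the original geometry in a separate column if it is needed afterwards.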

Nearest joins#

Proximity-based joins can be done via GeoDataFrame.sjoin_nearest().

GeoDataFrame.sjoin_nearest() shares the how argument with GeoDataFrame.sjoin(), and includes two additional arguments: max_distance and distance_col.

max_distance

The max_distance argument specifies a maximum search radius for matching geometries. Limiting the search radius can improve performance considerably in some cases, so it is highly recommended that you set this parameter whenever you can.

distance_col

If set, the resultant GeoDataFrame will include a column with this name containing the computed distances between an input geometry and the nearest geometry.
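
A minimal sketch of a nearest join that uses both arguments follows. The `stores` and `stations` frames, the radius of 10, and the column name `dist` are hypothetical; no CRS is set, so distances are in plain coordinate units.

```python
import geopandas as gpd
from shapely.geometry import Point

# Hypothetical data: two stores and a single station near the first store.
stores = gpd.GeoDataFrame(
    {"store": ["s0", "s1"]}, geometry=[Point(0, 0), Point(100, 0)]
)
stations = gpd.GeoDataFrame({"station": ["t0"]}, geometry=[Point(3, 4)])

# Match each store to its nearest station, searching no further than 10 units,
# and record the distance to the match in a column named "dist".
nearest = stores.sjoin_nearest(
    stations, how="left", max_distance=10, distance_col="dist"
)

print(nearest[["store", "station", "dist"]])
```

With how="left", the store that has no station within the search radius is kept, with missing values in the joined columns; how="inner" would drop it instead.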