{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Spatial indexing\n", "\n", "When you want to know a spatial relationship (known as a spatial\n", "predicate) between a set of geometries A and a geometry B (or a set of\n", "them), you can compare geometry B against any geometry in a\n", "set A. However, that is not the most performant approach\n", "in most cases. A spatial index is a more efficient method for pre-filtering comparisons of geometries before using more computationally expensive spatial predicates.\n", "GeoPandas exposes the\n", "Sort-Tile-Recursive R-tree from shapely on any GeoDataFrame and\n", "GeoSeries using the [GeoSeries.sindex](../reference/api/geopandas.GeoSeries.sindex.html) property. This page outlines its options\n", "and common usage patterns.\n", "\n", "Note that for many operations where a spatial index provides significant\n", "performance benefits, GeoPandas already uses it automatically (like [sjoin()](../reference/api/geopandas.GeoDataFrame.sjoin.html),\n", "[overlay()](../reference/api/geopandas.GeoDataFrame.overlay.html), or [clip()](../reference/api/geopandas.GeoDataFrame.clip.html)). However, more advanced use cases may require\n", "a direct interaction with the index." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import geopandas\n", "import matplotlib.pyplot as plt\n", "import shapely\n", "\n", "from geodatasets import get_path" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Load data on New York City subboroughs to illustrate the spatial\n", "indexing." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "nyc = geopandas.read_file(get_path(\"geoda nyc\"))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## R-tree principle\n", "\n", "In principle, any R-tree index builds a hierarchical collection of\n", "bounding boxes (envelopes) representing first individual geometries and then their\n", "most efficient combinations (from a spatial query perspective). When\n", "creating one, you can imagine that your geometries are represented by\n", "their envelopes, as illustrated below." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig, axs = plt.subplots(1, 2, sharey=True, figsize=(8, 4))\n", "\n", "nyc.plot(ax=axs[0], edgecolor=\"black\", linewidth=1)\n", "nyc.envelope.boundary.plot(ax=axs[1], color='black');" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The left side of the figure shows the original geometries, while the\n", "right side their bounding boxes, extracted using the [envelope](../reference/api/geopandas.GeoSeries.envelope.html)\n", "property. Typically, the index works on top of those.\n", "\n", "Let’s generate two points now, both intersecting at least one bounding\n", "box but only one intersecting the actual geometry." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "point_inside = shapely.Point(950000, 155000)\n", "point_outside = shapely.Point(1050000, 150000)\n", "points = geopandas.GeoSeries([point_inside, point_outside], crs=nyc.crs)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can verify that visually." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig, axs = plt.subplots(1, 2, sharey=True, figsize=(8, 4))\n", "\n", "nyc.plot(ax=axs[0], edgecolor=\"black\", linewidth=1)\n", "nyc.envelope.boundary.plot(ax=axs[1], color='black')\n", "points.plot(ax=axs[0], color=\"limegreen\")\n", "points.plot(ax=axs[1], color=\"limegreen\");" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Querying the index\n", "\n", "### Scalar query\n", "\n", "\n", "You can now use the [sindex](../reference/api/geopandas.GeoSeries.sindex.html) property to query the index. The\n", "[query()](../reference/api/geopandas.sindex.SpatialIndex.query.html) method, by default, returns positions of all geometries\n", "whose bounding boxes intersect the bounding box of the input geometry.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "bbox_query_inside = nyc.sindex.query(point_inside)\n", "bbox_query_outside = nyc.sindex.query(point_outside)\n", "bbox_query_inside, bbox_query_outside" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Both the point we know is inside a geometry and the one that is outside\n", "a geometry return one hit as each intersects one bounding box in the\n", "tree." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig, axs = plt.subplots(1, 2, sharey=True, figsize=(8, 4))\n", "\n", "nyc.plot(ax=axs[0], edgecolor=\"black\", linewidth=1)\n", "nyc.envelope.boundary.plot(ax=axs[1], color='black')\n", "points.plot(ax=axs[0], color=\"limegreen\", zorder=3, edgecolor=\"black\", linewidth=.5)\n", "points.plot(ax=axs[1], color=\"limegreen\", zorder=3, edgecolor=\"black\", linewidth=.5)\n", "nyc.iloc[bbox_query_inside].plot(ax=axs[0], color='orange')\n", "nyc.iloc[bbox_query_outside].plot(ax=axs[0], color='orange')\n", "nyc.envelope.iloc[bbox_query_inside].plot(ax=axs[1], color='orange')\n", "nyc.envelope.iloc[bbox_query_outside].plot(ax=axs[1], color='orange');" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The image above provides a clear illustration of what happens. While you\n", "can see on the left image that only one intersects an orange geometry\n", "marked as a *hit*, the hits are quite clear when looking at the bounding\n", "box.\n", "\n", "Thankfully, the spatial index allows for further filtering based on the\n", "actual geometry. In this case, the tree is first queried as above but\n", "afterwards, each of the possible hits is checked using a spatial predicate." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pred_inside = nyc.sindex.query(point_inside, predicate=\"intersects\")\n", "pred_outside = nyc.sindex.query(point_outside, predicate=\"intersects\")\n", "pred_inside, pred_outside" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "When you specify ``predicate=\"intersects\"``, the result is indeed\n", "different and the output of the query using the point that lies outside\n", "of any geometry is empty." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig, axs = plt.subplots(1, 2, sharey=True, figsize=(8, 4))\n", "\n", "nyc.plot(ax=axs[0], edgecolor=\"black\", linewidth=1)\n", "nyc.envelope.boundary.plot(ax=axs[1], color='black')\n", "points.plot(ax=axs[0], color=\"limegreen\", zorder=3, edgecolor=\"black\", linewidth=.5)\n", "points.plot(ax=axs[1], color=\"limegreen\", zorder=3, edgecolor=\"black\", linewidth=.5)\n", "nyc.iloc[pred_inside].plot(ax=axs[0], color='orange')\n", "nyc.envelope.iloc[pred_inside].plot(ax=axs[1], color='orange');" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can use any of the predicates available in [valid_query_predicates](../reference/api/geopandas.sindex.SpatialIndex.valid_query_predicates.html):" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "nyc.sindex.valid_query_predicates" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Array query\n", "\n", "Checking a single geometry against the tree is nice but not that\n", "efficient if you are interested in many-to-many relationships. The\n", "[query()](../reference/api/geopandas.sindex.SpatialIndex.query.html) method allows passing any 1-D array of geometries to be\n", "checked against the tree. If you do so, the output structure is slightly\n", "different:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "bbox_array_query = nyc.sindex.query(points)\n", "bbox_array_query" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "By default, the method returns a 2-D array of indices where the query\n", "found a hit where the subarrays correspond to the indices of the input\n", "geometries and indices of the tree geometries associated with each. In\n", "the example above, the 0-th geometry in the ``points`` GeoSeries\n", "intersects the bounding box of the geometry at the position 1 from the\n", "``nyc`` GeoDataFrame, while the geometry 1 in the ``points`` matches\n", "geometry 16 in the ``nyc``. You may notice that these are the same\n", "indices as you’ve seen above.\n", "\n", "The other option is to return a boolean array with shape\n", "``(len(tree), n)`` with boolean values marking whether the bounding box\n", "of a geometry in the tree intersects a bounding box of a given geometry.\n", "This can be either a dense numpy array, or a sparse scipy array. Keep in\n", "mind that the output will be, in most cases, mostly filled with\n", "``False`` and the array can become really large, so it is recommended to\n", "use the sparse format, if possible.\n", "\n", "You can specify each using the ``output_format`` keyword:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "bbox_array_query_dense = nyc.sindex.query(points, output_format=\"dense\")\n", "bbox_array_query_dense" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The dense array above has rows aligned with the rows of ``nyc`` and\n", "columns aligned with the rows of ``points`` and indicates all pairs\n", "where a *hit* was found.\n", "\n", "The same array can be represented as a `scipy.sparse.coo_array`:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "bbox_array_query_sparse = nyc.sindex.query(points, output_format=\"sparse\")\n", "bbox_array_query_sparse" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For example, to find the number of neighboring geometries for each subborough, you can\n", "use the spatial index to compare all geometries against each other. Since you are using\n", "``nyc`` on both sides of the query here, the resulting array is square-shaped with\n", "diagonal filled with ``True``." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "neighbors = nyc.sindex.query(nyc.geometry, predicate=\"intersects\", output_format=\"dense\")\n", "neighbors" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Getting the sum along one axis can then give you the answer. Note that\n", "since a geometry always intersects itself, you need to subtract one." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "n_neighbors = neighbors.sum(axis=1) - 1\n", "n_neighbors" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The result is a numpy array you can directly plot on a map." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "nyc.plot(n_neighbors, legend=True);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Nearest geometry query\n", "\n", "While checking the spatial predicate using the spatial index is indeed extremely useful, GeoPandas\n", "also allows you to use the spatial index to find\n", "the nearest geometry. The API is similar as above:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "nearest_indices = nyc.sindex.nearest(points)\n", "nearest_indices" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can see that the nearest query returns the indices representation.\n", "If you are interested in how “near” the geometries actually are, the method\n", "can also return distances. In this case, the return format is a tuple of\n", "arrays." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "nearest_indices, distance = nyc.sindex.nearest(points, return_distance=True)\n", "distance" ] } ], "metadata": { "kernelspec": { "display_name": "default", "language": "python", "name": "python3" }, "language_info": { "name": "python", "version": "3.12.8" } }, "nbformat": 4, "nbformat_minor": 2 }