diff --git a/07-read-write-plot.ipynb b/07-read-write-plot.ipynb
deleted file mode 100644
index dc3909e8..00000000
--- a/07-read-write-plot.ipynb
+++ /dev/null
@@ -1,1722 +0,0 @@
-{
- "cells": [
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "# Geographic data I/O {#read-write}\n",
- "\n",
- "## Prerequisites\n",
- "\n",
- "Let's import the required packages:\n"
- ],
- "id": "11a0f3ab"
- },
- {
- "cell_type": "code",
- "metadata": {},
- "source": [
- "import urllib.request\n",
- "import zipfile\n",
- "import numpy as np\n",
- "import fiona\n",
- "import geopandas as gpd\n",
- "import shapely\n",
- "import rasterio\n",
- "import rasterio.plot\n",
- "import cartopy\n",
- "import osmnx as ox"
- ],
- "id": "0ee0fb74",
- "execution_count": null,
- "outputs": []
- },
- {
- "cell_type": "code",
- "metadata": {},
- "source": [
- "#| echo: false\n",
- "import pandas as pd\n",
- "import matplotlib.pyplot as plt\n",
- "pd.options.display.max_rows = 6\n",
- "pd.options.display.max_columns = 6\n",
- "pd.options.display.max_colwidth = 35\n",
- "plt.rcParams['figure.figsize'] = (5, 5)"
- ],
- "id": "eafb80b2",
- "execution_count": null,
- "outputs": []
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "and load the sample data for this chapter:\n"
- ],
- "id": "96dd9dda"
- },
- {
- "cell_type": "code",
- "metadata": {},
- "source": [
- "nz = gpd.read_file('data/nz.gpkg')\n",
- "nz_elev = rasterio.open('data/nz_elev.tif')"
- ],
- "id": "a19f47c8",
- "execution_count": null,
- "outputs": []
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## Introduction\n",
- "\n",
- "This chapter is about reading and writing geographic data.\n",
- "Geographic data import is essential for geocomputation: real-world applications are impossible without data. \n",
- "Data output is also vital, enabling others to use valuable new or improved datasets resulting from your work. \n",
- "Taken together, these processes of input/output can be referred to as data I/O.\n",
- "\n",
- "Geographic data I/O is often done with a few lines of code at the beginning and end of projects. \n",
- "It is often treated as a simple one-step process. \n",
- "However, mistakes made at the outset of projects (e.g., using an out-of-date or in some way faulty dataset) can lead to large problems later down the line, so it is worth putting considerable time into identifying which datasets are available, where they can be found, and how to retrieve them. \n",
- "These topics are covered in @sec-retrieving-open-data, which describes various geoportals, which collectively contain many terabytes of data, and how to use them. \n",
- "To further ease data access, a number of packages for downloading geographic data have been developed, as described in @sec-geographic-data-packages.\n",
- "\n",
- "There are many geographic file formats, each of which has pros and cons, described in @sec-file-formats. \n",
- "The process of reading and writing files in these formats efficiently is covered in @sec-data-input and @sec-data-output, respectively. \n",
- "\n",
- "## Retrieving open data {#sec-retrieving-open-data}\n",
- "\n",
- "A vast and ever-increasing amount of geographic data is available on the internet, much of which is free to access and use (with appropriate credit given to its providers).[^gis-open-datasets]\n",
- "In some ways there is now too much data, in the sense that there are often multiple places to access the same dataset. \n",
- "Some datasets are of poor quality.
\n",
- "In this context, it is vital to know where to look, so the first section covers some of the most important sources. \n",
- "Various 'geoportals' (web services providing geospatial datasets such as [Data.gov](https://catalog.data.gov/dataset?metadata_type=geospatial)) are a good place to start, providing a wide range of data but often only for specific locations (as illustrated in the updated [Wikipedia page](https://en.wikipedia.org/wiki/Geoportal) on the topic).\n",
- "\n",
- "[^gis-open-datasets]: For example, visit for a list of websites with freely available geographic datasets.\n",
- "\n",
- "Some global geoportals overcome this issue. \n",
- "The [GEOSS portal](http://www.geoportal.org/) and the [Copernicus Open Access Hub](https://scihub.copernicus.eu/), for example, contain many raster datasets with global coverage. \n",
- "A wealth of vector datasets can be accessed from the [SEDAC](http://sedac.ciesin.columbia.edu/) portal run by the National Aeronautics and Space Administration (NASA) and the European Union's [INSPIRE geoportal](http://inspire-geoportal.ec.europa.eu/), with global and regional coverage.\n",
- "\n",
- "Most geoportals provide a graphical interface allowing datasets to be queried based on characteristics such as spatial and temporal extent, the United States Geological Survey's [EarthExplorer](https://earthexplorer.usgs.gov/) being a prime example. \n",
- "Exploring datasets interactively in a browser is an effective way of understanding the available layers. \n",
- "From reproducibility and efficiency perspectives, however, downloading data is best done with code. \n",
- "Downloads can be initiated from the command line using a variety of techniques, primarily via URLs and APIs (see the [Sentinel API](https://scihub.copernicus.eu/twiki/do/view/SciHubWebPortal/APIHubDescription) for example). \n",
- "Files hosted on static URLs can be downloaded as illustrated in the code chunk below, which accesses the [Natural Earth Data](https://www.naturalearthdata.com/) website to download the world airports layer zip file and extract the contained Shapefile. \n",
- "Note that the download code is complicated by the fact that the server checks the `User-agent` header of the request, essentially to make sure that the download takes place through a browser.
\n", - "To overcome this, we add a header corresponding to a request coming from a browser (such as Firefox) in our code:\n" - ], - "id": "db6b07ab" - }, - { - "cell_type": "code", - "metadata": {}, - "source": [ - "#| eval: false\n", - "\n", - "# Set URL+filename\n", - "url = 'https://www.naturalearthdata.com/http//www.naturalearthdata.com/download/10m/cultural/ne_10m_airports.zip'\n", - "filename = 'output/ne_10m_airports.zip'\n", - "\n", - "# Download\n", - "opener = urllib.request.build_opener()\n", - "opener.addheaders = [('User-agent', 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/116.0')]\n", - "urllib.request.install_opener(opener)\n", - "urllib.request.urlretrieve(url, filename)\n", - "\n", - "# Extract\n", - "f = zipfile.ZipFile(filename, 'r')\n", - "f.extractall('output')\n", - "f.close()" - ], - "id": "7dc6c212", - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "code", - "metadata": {}, - "source": [ - "#| echo: false\n", - "\n", - "filename = 'output/ne_10m_airports.zip'" - ], - "id": "ab9a2f9b", - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "The Shapefile that has been created in the `output` directory can then be imported and plotted (@fig-ne-airports) as follows:\n" - ], - "id": "08fd444b" - }, - { - "cell_type": "code", - "metadata": {}, - "source": [ - "#| label: fig-ne-airports\n", - "#| fig-cap: 'World airports layer, downloaded using Python from the Natural Earth Data website'\n", - "\n", - "ne = gpd.read_file(filename.replace('.zip', '.shp'))\n", - "ne.plot();" - ], - "id": "fig-ne-airports", - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Geographic data packages {#sec-geographic-data-packages}\n", - "\n", - "Many Python packages have been developed for accessing geographic data, two of which are presented in @tbl-data-packages and demonstrated below. \n", - "These provide interfaces to one or more spatial libraries or geoportals and aim to make data access even quicker from the command line.\n", - "\n", - "| Package | Description |\n", - "|---|-----------|\n", - "| `cartopy` | Download layers from [Natural Earth Data](https://www.naturalearthdata.com/downloads/) | \n", - "| `osmnx` | Access to [OpenStreetMap](https://www.openstreetmap.org/) data and conversion to spatial networks | \n", - ": Selected Python packages for geographic data retrieval {#tbl-data-packages}\n", - "\n", - "Each data package has its own syntax for accessing data. \n", - "This diversity is demonstrated in the subsequent code chunks, which show how to get data using the packages from @tbl-data-packages. \n", - "\n", - "Administrative borders are often useful in spatial analysis. \n", - "These can be accessed with the `cartopy.io.shapereader.natural_earth` function from the **cartopy** package. 
\n",
- "For example, the following code loads the `'admin_2_counties'` dataset of US counties:\n"
- ],
- "id": "795ff052"
- },
- {
- "cell_type": "code",
- "metadata": {},
- "source": [
- "filename = cartopy.io.shapereader.natural_earth(\n",
- "    resolution='10m',\n",
- "    category='cultural',\n",
- "    name='admin_2_counties'\n",
- ")\n",
- "counties = gpd.read_file(filename)\n",
- "counties"
- ],
- "id": "b2160a5e",
- "execution_count": null,
- "outputs": []
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "The layer of counties is plotted in @fig-ne-counties:\n"
- ],
- "id": "b2675b3c"
- },
- {
- "cell_type": "code",
- "metadata": {},
- "source": [
- "#| label: fig-ne-counties\n",
- "#| fig-cap: 'US counties, downloaded from the Natural Earth Data website using package **cartopy**'\n",
- "\n",
- "counties.plot();"
- ],
- "id": "fig-ne-counties",
- "execution_count": null,
- "outputs": []
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Other layers can be accessed the same way: \n",
- "\n",
- "* you need to locate the `resolution`, `category`, and `name` of the requested dataset in [Natural Earth Data](https://www.naturalearthdata.com/downloads/), then \n",
- "* run the `cartopy.io.shapereader.natural_earth` function, which downloads the file(s) and returns the path, and\n",
- "* read the file into the Python environment, e.g., using `gpd.read_file`.\n",
- "\n",
- "This is an alternative approach to \"directly\" downloading files as shown earlier (@sec-retrieving-open-data). \n",
- "\n",
- "The second example uses the **osmnx** package to find parks from the OpenStreetMap (OSM) database. \n",
- "As illustrated in the code chunk below, OpenStreetMap data can be obtained using the `ox.features.features_from_place` function. \n",
- "The first argument is a string, which is geocoded to a polygon (the `ox.features.features_from_bbox` and `ox.features.features_from_polygon` functions can also be used to query a custom area of interest). \n",
- "The second argument specifies the OSM tag(s), selecting which OSM elements we're interested in (parks, in this case), represented by key-value pairs:\n"
- ],
- "id": "bb5c53a1"
- },
- {
- "cell_type": "code",
- "metadata": {},
- "source": [
- "parks = ox.features.features_from_place(\n",
- "    query='leeds uk', \n",
- "    tags={'leisure': 'park'}\n",
- ")\n",
- "parks"
- ],
- "id": "9c3e7c04",
- "execution_count": null,
- "outputs": []
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "The result is a `GeoDataFrame` with the parks in Leeds.\n",
- "The following expression plots the geometries with the `name` property in the tooltips (@fig-ox-features):\n"
- ],
- "id": "95c1141a"
- },
- {
- "cell_type": "code",
- "metadata": {},
- "source": [
- "#| label: fig-ox-features\n",
- "#| fig-cap: 'Parks in Leeds, based on OpenStreetMap data, downloaded using package `osmnx`'\n",
- "\n",
- "parks[['name', 'geometry']].explore()"
- ],
- "id": "fig-ox-features",
- "execution_count": null,
- "outputs": []
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "It should be noted that the `osmnx` package downloads OSM data from the [Overpass API](https://wiki.openstreetmap.org/wiki/Overpass_API), which is rate-limited and therefore unsuitable for queries covering very large areas.
\n",
- "To overcome this limitation, you can download OSM data extracts, such as in Shapefile format from [Geofabrik](https://download.geofabrik.de/), and then load them from the file into the Python environment.\n",
- "\n",
- "OpenStreetMap is a vast global database of crowd-sourced data that is growing daily, with a wider ecosystem of tools enabling easy access to the data, from the [Overpass turbo](https://overpass-turbo.eu/) web service for rapid development and testing of OSM queries to [osm2pgsql](https://osm2pgsql.org/) for importing the data into a PostGIS database. \n",
- "Although the quality of datasets derived from OSM varies, the data source and wider OSM ecosystems have many advantages: they provide datasets that are available globally, free of charge, and constantly improving thanks to an army of volunteers. \n",
- "Using OSM encourages 'citizen science' and contributions back to the digital commons (you can start editing data representing a part of the world you know well at [www.openstreetmap.org](https://www.openstreetmap.org/)). \n",
- "\n",
- "Sometimes, packages come with built-in datasets. \n",
- "These can be accessed just like any other object (e.g., function) that is imported as part of the package, or in other ways as specified in the package documentation. \n",
- "For example, package **geopandas** comes with a few built-in datasets (see `gpd.datasets.available` for a list of names). \n",
- "Using the `gpd.datasets.get_path` function and the dataset name, we can obtain the path to the location of the dataset file on our computer. \n",
- "For example, `'naturalearth_lowres'` is a vector layer of world countries (from Natural Earth Data, which we've already met):\n"
- ],
- "id": "a6736f9e"
- },
- {
- "cell_type": "code",
- "metadata": {},
- "source": [
- "filename = gpd.datasets.get_path('naturalearth_lowres')\n",
- "filename"
- ],
- "id": "8a438438",
- "execution_count": null,
- "outputs": []
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Then, we can import the dataset, just like from any other file:\n"
- ],
- "id": "2f59e8fb"
- },
- {
- "cell_type": "code",
- "metadata": {},
- "source": [
- "gpd.read_file(filename)"
- ],
- "id": "280f51b6",
- "execution_count": null,
- "outputs": []
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Another way to obtain spatial information is to perform geocoding---transforming a description of a location, usually an address, into its coordinates. \n",
- "This is usually done by sending a query to an online service and getting the location as a result. \n",
- "Many such services exist, differing in their geocoding method, usage limitations, costs, or API key requirements. \n",
- "[Nominatim](https://nominatim.openstreetmap.org/ui/about.html) is a popular free service, based on OpenStreetMap data. \n",
- "It can be accessed in Python using the `osmnx.geocoder.geocode` function. The function returns a `tuple` of the form `(lat, lon)`.
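\n",
- "Note the coordinate order: the returned `tuple` is `(lat, lon)`, while **shapely** geometry constructors expect `(x, y)`, i.e., `(lon, lat)`. A minimal sketch of turning a geocoding result into a `shapely` point (the address here is purely illustrative):\n",
- "\n",
- "```python\n",
- "lat, lon = ox.geocoder.geocode('10 Downing St, London, UK')\n",
- "point = shapely.Point(lon, lat)  # reversed order: (x, y) = (lon, lat)\n",
- "```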
\n",
- "The example below searches for the coordinates of the John Snow blue plaque, located on a building in the Soho district of London:\n"
- ],
- "id": "56cbe445"
- },
- {
- "cell_type": "code",
- "metadata": {},
- "source": [
- "ox.geocoder.geocode('54 Frith St, London W1D 4SJ, UK')"
- ],
- "id": "03ba2a44",
- "execution_count": null,
- "outputs": []
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "If the query returns no results, an `InsufficientResponseError` is raised, a scenario that the user can deal with using [`try`/`except`](https://docs.python.org/3/tutorial/errors.html#handling-exceptions). \n",
- "\n",
- "The alternative function `osmnx.geocoder.geocode_to_gdf` can be used to automatically geocode multiple addresses (accepting a `list` of `string`s) and transform them into a `GeoDataFrame`. \n",
- "This function also returns `'Polygon'` geometries. \n",
- "For example:\n"
- ],
- "id": "f27d94a5"
- },
- {
- "cell_type": "code",
- "metadata": {},
- "source": [
- "result = ox.geocoder.geocode_to_gdf(['54 Frith St, London W1D 4SJ, UK'])\n",
- "result"
- ],
- "id": "b9f67dd9",
- "execution_count": null,
- "outputs": []
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "The result is visualized below using `.explore` (@fig-ox-geocode):\n"
- ],
- "id": "7027083a"
- },
- {
- "cell_type": "code",
- "metadata": {},
- "source": [
- "#| label: fig-ox-geocode\n",
- "#| fig-cap: 'Specific address in London, geocoded into a `GeoDataFrame` using package `osmnx`'\n",
- "\n",
- "result[['display_name', 'geometry']].explore(color='red')"
- ],
- "id": "fig-ox-geocode",
- "execution_count": null,
- "outputs": []
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "\n",
- "\n",
- "## File formats {#sec-file-formats}\n",
- "\n",
- "Geographic datasets are usually stored as files or in spatial databases. \n",
- "File formats can either store vector or raster data, while spatial databases such as [PostGIS](https://postgis.net/) can store both. \n",
- "The large variety of file formats may seem bewildering, but there has been much consolidation and standardization since the beginnings of GIS software in the 1960s, when the first widely distributed program ([SYMAP](https://news.harvard.edu/gazette/story/2011/10/the-invention-of-gis/)) for spatial analysis was created at Harvard University [@coppock_history_1991].\n",
- "\n",
- "GDAL (which should be pronounced \"goo-dal\", with the double \"o\" making a reference to object-orientation), the Geospatial Data Abstraction Library, has resolved many issues associated with incompatibility between geographic file formats since its release in 2000. \n",
- "GDAL provides a unified and high-performance interface for reading and writing many raster and vector data formats. \n",
- "Many open and proprietary GIS programs, including GRASS, ArcGIS and QGIS, use GDAL behind their GUIs for doing the legwork of ingesting and spitting out geographic data in appropriate formats.\n",
- "Most Python packages for working with spatial data, including **geopandas** and **rasterio** used in this book, also rely on GDAL for importing and exporting spatial data files. \n",
- "\n",
- "GDAL provides access to more than 200 vector and raster data formats.
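\n",
- "If you want to check which drivers your own GDAL installation includes, the list can be queried programmatically. A short sketch, assuming the `osgeo` Python bindings that ship with GDAL are installed:\n",
- "\n",
- "```python\n",
- "from osgeo import gdal, ogr\n",
- "print(gdal.GetDriverCount())        # number of raster drivers\n",
- "print(ogr.GetDriverCount())         # number of vector drivers\n",
- "print(gdal.GetDriver(0).ShortName)  # short name of the first raster driver\n",
- "```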
\n",
- "@tbl-file-formats presents some basic information about selected, commonly used spatial file formats.\n",
- "\n",
- "| Name | Extension | Info | Type | Model |\n",
- "|-----|----|----------|-----|-----|\n",
- "| ESRI Shapefile | `.shp` (the main file) | Popular format consisting of at least three files. No support for: files > 2GB; mixed types; names > 10 chars; cols > 255. | Vector | Partially open |\n",
- "| GeoJSON | `.geojson` | Extends the JSON exchange format by including a subset of the simple feature representation; mostly used for storing coordinates in longitude and latitude; it is extended by the TopoJSON format. | Vector | Open |\n",
- "| KML | `.kml` | XML-based format for spatial visualization, developed for use with Google Earth. Zipped KML file forms the KMZ format. | Vector | Open |\n",
- "| GPX | `.gpx` | XML schema created for exchange of GPS data. | Vector | Open |\n",
- "| FlatGeobuf | `.fgb` | Single file format allowing for quick reading and writing of vector data. Has streaming capabilities. | Vector | Open |\n",
- "| GeoTIFF | `.tif/.tiff` | Popular raster format. A TIFF file containing additional spatial metadata. | Raster | Open |\n",
- "| Arc ASCII | `.asc` | Text format where the first six lines represent the raster header, followed by the raster cell values arranged in rows and columns. | Raster | Open |\n",
- "| SQLite/SpatiaLite | `.sqlite` | Standalone relational database; SpatiaLite is the spatial extension of SQLite. | Vector and raster | Open |\n",
- "| ESRI FileGDB | `.gdb` | Spatial and nonspatial objects created by ArcGIS. Allows: multiple feature classes; topology. Limited support from GDAL. | Vector and raster | Proprietary |\n",
- "| GeoPackage | `.gpkg` | Lightweight database container based on SQLite allowing an easy and platform-independent exchange of geodata. | Vector and (very limited) raster | Open |\n",
- "\n",
- ": Commonly used spatial data file formats {#tbl-file-formats}\n",
- "\n",
- "An important development ensuring the standardization and open-sourcing of file formats was the founding of the Open Geospatial Consortium ([OGC](http://www.opengeospatial.org/)) in 1994. \n",
- "Beyond defining the simple features data model (see @sec-simple-features), the OGC also coordinates the development of open standards, for example as used in file formats such as KML and GeoPackage. \n",
- "Open file formats of the kind endorsed by the OGC have several advantages over proprietary formats: the standards are published, ensuring transparency, and they open up the possibility for users to further develop and adjust the file formats to their specific needs.\n",
- "\n",
- "ESRI Shapefile is the most popular vector data exchange format; however, it is not an open format (though its specification is open). \n",
- "It was developed in the early 1990s and has a number of limitations. \n",
- "First of all, it is a multi-file format, which consists of at least three files. \n",
- "It supports only 255 columns, column names are restricted to ten characters, and the file size limit is 2 GB. \n",
- "Furthermore, ESRI Shapefile does not support all possible geometry types: for example, it is unable to distinguish between a polygon and a multipolygon. \n",
- "Despite these limitations, a viable alternative had been missing for a long time. \n",
- "In the meantime, [GeoPackage](https://www.geopackage.org/) emerged, and seems to be a more than suitable replacement candidate for ESRI Shapefile. \n",
- "GeoPackage is a format for exchanging geospatial information and an OGC standard.
\n",
- "The GeoPackage standard describes the rules on how to store geospatial information in a tiny SQLite container. \n",
- "Hence, GeoPackage is a lightweight spatial database container, which allows the storage of vector and raster data but also of non-spatial data and extensions. \n",
- "Aside from GeoPackage, there are other geospatial data exchange formats worth checking out (@tbl-file-formats).\n",
- "\n",
- "The GeoTIFF format seems to be the most prominent raster data format. \n",
- "It allows spatial information, such as the CRS definition and the transformation matrix (see @sec-using-rasterio), to be embedded within a TIFF file. \n",
- "Similar to ESRI Shapefile, this format was first developed in the 1990s, but as an open format. \n",
- "Additionally, GeoTIFF is still being expanded and improved. \n",
- "One of the most significant recent additions to the GeoTIFF format is its variant called COG (Cloud Optimized GeoTIFF). \n",
- "Raster objects saved as COGs can be hosted on HTTP servers, so other people can read only parts of the file without downloading the whole file (@sec-input-raster).\n",
- "\n",
- "There is also a plethora of other spatial data formats that we do not explain in detail or mention in @tbl-file-formats due to space limitations. \n",
- "If you need to use other formats, we encourage you to read the GDAL documentation about [vector](https://gdal.org/drivers/vector/index.html) and [raster](https://gdal.org/drivers/raster/index.html) drivers. \n",
- "Additionally, some spatial data formats can store other data models (types) than vector or raster. \n",
- "These include the LAS and LAZ formats for storing lidar point clouds, and NetCDF and HDF for storing multidimensional arrays. \n",
- "\n",
- "Finally, spatial data is also often stored using tabular (non-spatial) text formats, including CSV files or Excel spreadsheets. \n",
- "This can be convenient for sharing spatial (point) datasets with people who, or software that, struggle with spatial data formats. \n",
- "If necessary, the table can be converted to a point layer (see examples in @sec-vector-layer-from-scratch and @sec-spatial-joining).\n",
- "\n",
- "## Data input (I) {#sec-data-input}\n",
- "\n",
- "Executing commands such as `geopandas.read_file` (the main function we use for loading vector data) or `rasterio.open`+`.read` (the main functions used for loading raster data) silently sets off a chain of events that reads data from files. \n",
- "Moreover, there are many Python packages containing a wide range of geographic data or providing simple access to different data sources. \n",
- "All of them load the data into the Python environment or, more precisely, assign objects to your workspace, stored in RAM and accessible within the Python session.\n",
- "\n",
- "### Vector data\n",
- "\n",
- "Spatial vector data comes in a wide variety of file formats. \n",
- "Most popular representations such as `.shp`, `.geojson`, and `.gpkg` files can be imported and exported with **geopandas** functions `read_file` and `to_file` (covered in @sec-data-output), respectively.\n",
- "\n",
- "**geopandas** uses GDAL to read and write data, via the `fiona` package (the [default](https://github.com/geopandas/geopandas/issues/2217)) or the `pyogrio` package (a recently developed alternative to `fiona`).
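\n",
- "The engine can be chosen per call; for example, assuming a **geopandas** version that exposes the `engine` parameter, the following sketch reads a file through `pyogrio` instead of `fiona`:\n",
- "\n",
- "```python\n",
- "gpd.read_file('data/world.gpkg', engine='pyogrio')  # engine='fiona' is the default\n",
- "```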
\n",
- "After `fiona` is imported, the command `fiona.supported_drivers` can be used to list the drivers available to GDAL, including whether they can read (`'r'`), append (`'a'`), or write (`'w'`) data, or do all three:\n"
- ],
- "id": "96b5478a"
- },
- {
- "cell_type": "code",
- "metadata": {},
- "source": [
- "fiona.supported_drivers"
- ],
- "id": "3373fe2e",
- "execution_count": null,
- "outputs": []
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Other, less common, drivers can be [\"activated\"](https://geopandas.org/en/stable/docs/user_guide/io.html) by manually supplementing `fiona.supported_drivers`.\n",
- "\n",
- "The first argument of the **geopandas** versatile data import function `gpd.read_file` is `filename`, which is typically a string, but can also be a file connection.\n",
- "The expected content of the string varies between drivers.\n",
- "In most cases, as with the ESRI Shapefile (`.shp`) or the GeoPackage format (`.gpkg`), the `filename` argument would be a path or a URL to an actual file, such as `geodata.gpkg`.\n",
- "The driver is automatically selected based on the file extension, as demonstrated for a `.gpkg` file below:\n"
- ],
- "id": "7224820b"
- },
- {
- "cell_type": "code",
- "metadata": {},
- "source": [
- "world = gpd.read_file('data/world.gpkg')\n",
- "world"
- ],
- "id": "3135667a",
- "execution_count": null,
- "outputs": []
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "For some drivers, such as a File Geodatabase (`OpenFileGDB`), `filename` could be provided as a folder name.\n",
- "\n",
- "GeoJSON, a plain text format, can be read from a `.geojson` file, but also from a string:\n"
- ],
- "id": "436613e3"
- },
- {
- "cell_type": "code",
- "metadata": {},
- "source": [
- "gpd.read_file('{\"type\":\"Point\",\"coordinates\":[34.838848,31.296301]}')"
- ],
- "id": "e46dd2be",
- "execution_count": null,
- "outputs": []
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "The `gpd.read_postgis` function can be used to read a vector layer from a PostGIS database.\n",
- "\n",
- "Some vector formats, such as GeoPackage, can store multiple data layers. \n",
- "By default, `gpd.read_file` automatically reads the first layer of the file specified in `filename`. \n",
- "However, using the `layer` argument you can specify any other layer.\n",
- "\n",
- "The `gpd.read_file` function also allows for reading just parts of the file into RAM, with two possible mechanisms. \n",
- "The first one is related to the `where` argument, which allows specifying what part of the data to read using an SQL `WHERE` expression. \n",
- "The example below extracts data for Tanzania only (@fig-read-shp-query (a)).
\n",
- "This is done by specifying that we want to get all rows for which `name_long` equals `'Tanzania'`:\n"
- ],
- "id": "00cae2cf"
- },
- {
- "cell_type": "code",
- "metadata": {},
- "source": [
- "tanzania = gpd.read_file('data/world.gpkg', where='name_long=\"Tanzania\"')\n",
- "tanzania"
- ],
- "id": "62ee57db",
- "execution_count": null,
- "outputs": []
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "If you do not know the names of the available columns, a good approach is to read just one row of the data using the `rows` argument, which can be used to read the first N rows, then use the `.columns` property to examine the column names:\n"
- ],
- "id": "dcc2f980"
- },
- {
- "cell_type": "code",
- "metadata": {},
- "source": [
- "gpd.read_file('data/world.gpkg', rows=1).columns"
- ],
- "id": "fc275759",
- "execution_count": null,
- "outputs": []
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "The second mechanism uses the `mask` argument to filter data based on intersection with an existing geometry. \n",
- "This argument expects a geometry (`GeoDataFrame`, `GeoSeries`, or `shapely` geometry) representing the area where we want to extract the data. \n",
- "Let's try it using a small example---we want to read polygons from our file that intersect with a 50,000 $m$ buffer around Tanzania's borders. \n",
- "To do it, we need to: \n",
- "\n",
- "* transform the geometry to a projected CRS (such as `EPSG:32736`), \n",
- "* prepare our \"filter\" by creating the buffer (@sec-buffers), and \n",
- "* transform back to the original CRS to be used as a mask:\n"
- ],
- "id": "2107c9ac"
- },
- {
- "cell_type": "code",
- "metadata": {},
- "source": [
- "tanzania_buf = tanzania.to_crs(32736).buffer(50000).to_crs(4326)\n",
- "tanzania_buf"
- ],
- "id": "091a29a9",
- "execution_count": null,
- "outputs": []
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "The resulting buffered geometry is shown in @fig-tanzania-mask:\n"
- ],
- "id": "c23b1fe7"
- },
- {
- "cell_type": "code",
- "metadata": {},
- "source": [
- "#| label: fig-tanzania-mask\n",
- "#| fig-cap: 'Tanzania polygon buffered by 50 $km$, to be used for reading a subset of `world.gpkg` (@fig-read-shp-query (b))'\n",
- "\n",
- "tanzania_buf.plot();"
- ],
- "id": "fig-tanzania-mask",
- "execution_count": null,
- "outputs": []
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Now, we can pass the \"filter\" geometry `tanzania_buf` to the `mask` argument of `gpd.read_file`:\n"
- ],
- "id": "7baa6bdc"
- },
- {
- "cell_type": "code",
- "metadata": {},
- "source": [
- "tanzania_neigh = gpd.read_file('data/world.gpkg', mask=tanzania_buf)\n",
- "tanzania_neigh"
- ],
- "id": "d11fd0c7",
- "execution_count": null,
- "outputs": []
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Our result, shown in @fig-read-shp-query (b), contains Tanzania and every country intersecting with its 50,000 $m$ buffer.
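\n",
- "Incidentally, when the area of interest is a simple rectangle, the `bbox` argument of `gpd.read_file` offers a third pre-filtering mechanism. A sketch, with illustrative bounds roughly covering Tanzania and its surroundings:\n",
- "\n",
- "```python\n",
- "gpd.read_file('data/world.gpkg', bbox=(28, -13, 42, 1))  # (minx, miny, maxx, maxy)\n",
- "```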
\n",
- "Note that the last two expressions are used to add text labels with the `name_long` of each country, placed at the country centroid:\n"
- ],
- "id": "5aea3509"
- },
- {
- "cell_type": "code",
- "metadata": {},
- "source": [
- "#| label: fig-read-shp-query\n",
- "#| fig-cap: Reading a subset of the vector layer file `world.gpkg`\n",
- "#| layout-ncol: 2\n",
- "#| fig-subcap:\n",
- "#| - Using a `where` query (matching `'Tanzania'`)\n",
- "#| - Using a `mask` (a geometry shown in red)\n",
- "\n",
- "# Using 'where'\n",
- "fig, ax = plt.subplots()\n",
- "tanzania.plot(ax=ax, color='lightgrey', edgecolor='grey')\n",
- "tanzania.apply(lambda x: ax.annotate(text=x['name_long'], xy=x.geometry.centroid.coords[0], ha='center'), axis=1);\n",
- "\n",
- "# Using 'mask'\n",
- "fig, ax = plt.subplots()\n",
- "tanzania_neigh.plot(ax=ax, color='lightgrey', edgecolor='grey')\n",
- "tanzania_buf.plot(ax=ax, color='none', edgecolor='red')\n",
- "tanzania_neigh.apply(lambda x: ax.annotate(text=x['name_long'], xy=x.geometry.centroid.coords[0], ha='center'), axis=1);"
- ],
- "id": "fig-read-shp-query",
- "execution_count": null,
- "outputs": []
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Often we need to read CSV files (or other tabular formats) which have x and y coordinate columns, and turn them into a `GeoDataFrame` with point geometries. \n",
- "To do that, we can import the file using **pandas** (e.g., using `pd.read_csv` or `pd.read_excel`), then go from `DataFrame` to `GeoDataFrame` using the `gpd.points_from_xy` function, as shown earlier in the book (see @sec-vector-layer-from-scratch and @sec-spatial-joining). For example, the table `cycle_hire_xy.csv`, where the coordinates are stored in the `X` and `Y` columns in `EPSG:4326`, can be imported, converted to a `GeoDataFrame`, and plotted, as follows (@fig-cycle_hire_xy-layer):\n"
- ],
- "id": "4ffd00f6"
- },
- {
- "cell_type": "code",
- "metadata": {},
- "source": [
- "#| label: fig-cycle_hire_xy-layer\n",
- "#| fig-cap: The `cycle_hire_xy.csv` table transformed to a point layer\n",
- "\n",
- "cycle_hire = pd.read_csv('data/cycle_hire_xy.csv')\n",
- "geom = gpd.points_from_xy(cycle_hire['X'], cycle_hire['Y'], crs=4326)\n",
- "geom = gpd.GeoSeries(geom)\n",
- "cycle_hire_xy = gpd.GeoDataFrame(data=cycle_hire, geometry=geom)\n",
- "cycle_hire_xy.plot();"
- ],
- "id": "fig-cycle_hire_xy-layer",
- "execution_count": null,
- "outputs": []
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Instead of columns describing 'XY' coordinates, a single column can also contain the geometry information---not necessarily points, but possibly any other geometry type. \n",
- "Well-known text (WKT), well-known binary (WKB), and GeoJSON are examples of formats used to encode geometry in such a column. \n",
- "For instance, the `world_wkt.csv` file has a column named `'WKT'`, representing polygons of the world's countries (in WKT format).
\n", - "To import and convert it to a `GeoDataFrame`, we can apply the `shapely.from_wkt` function (@sec-geometries) on the WKT strings, to convert them into `shapely` geometries (@fig-world_wkt-layer):\n" - ], - "id": "60b66c74" - }, - { - "cell_type": "code", - "metadata": {}, - "source": [ - "#| label: fig-world_wkt-layer\n", - "#| fig-cap: The `world_wkt.csv` table transformed to a polygon layer\n", - "\n", - "world_wkt = pd.read_csv('data/world_wkt.csv')\n", - "world_wkt['geometry'] = world_wkt['WKT'].apply(shapely.from_wkt)\n", - "world_wkt = gpd.GeoDataFrame(world_wkt)\n", - "world_wkt.plot();" - ], - "id": "fig-world_wkt-layer", - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "::: {.callout-note}\n", - "Not all of the supported vector file formats store information about their coordinate reference system. \n", - "In these situations, it is possible to add the missing information using the `.set_crs` function. \n", - "Please refer also to @sec-querying-and-setting-coordinate-systems for more information. \n", - ":::\n", - "\n", - "As a final example, we will show how **geopandas** also reads KML files. \n", - "A KML file stores geographic information in XML format---a data format for the creation of web pages and the transfer of data in an application-independent way [@nolan_xml_2014]. \n", - "Here, we access a KML file from the web. \n", - "First, if necessary, we may need to \"activate\" the `KML` driver, which isn't always available by default:\n" - ], - "id": "c5fe05f1" - }, - { - "cell_type": "code", - "metadata": {}, - "source": [ - "fiona.supported_drivers['KML'] = 'r'" - ], - "id": "2965e917", - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "The sample KML file `KML_Samples.kml` contains more than one layer. \n", - "To list the available layers, we can use the `fiona.listlayers` function: \n" - ], - "id": "6b76a2c9" - }, - { - "cell_type": "code", - "metadata": {}, - "source": [ - "u = 'https://developers.google.com/kml/documentation/KML_Samples.kml'\n", - "fiona.listlayers(u)" - ], - "id": "0fe7e5d5", - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "We can choose, for instance, the first layer `'Placemarks'` and read it, using `gpd.read_file` with an additional `layer` argument:\n" - ], - "id": "c91ab69a" - }, - { - "cell_type": "code", - "metadata": {}, - "source": [ - "placemarks = gpd.read_file(u, layer='Placemarks')\n", - "placemarks" - ], - "id": "a8600948", - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Raster data {#sec-input-raster}\n", - "\n", - "Similar to vector data, raster data comes in many file formats with some of them supporting multilayer files. \n", - "`rasterio.open` is used to create a file connection to a raster file, which can be subsequently used to read the metadata and/or the values, as shown previously (@sec-using-rasterio). \n", - "For example: \n" - ], - "id": "8da7f10c" - }, - { - "cell_type": "code", - "metadata": {}, - "source": [ - "src = rasterio.open('data/srtm.tif')\n", - "src" - ], - "id": "2273770c", - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "All of the previous examples, like the one above, read spatial information from files stored on your hard drive. 
\n",
- "However, GDAL also allows reading data directly from online resources, such as HTTP/HTTPS/FTP web resources. \n",
- "The only thing we need to do is to add a `/vsicurl/` prefix before the path to the file. \n",
- "Let's try it by connecting to the global monthly snow probability at 500 $m$ resolution for the period 2000-2012 [@hengl_t_2021_5774954]. \n",
- "Snow probability for December is stored as a Cloud Optimized GeoTIFF (COG) file (see @sec-file-formats). \n",
- "To read an online file, we need to provide its URL together with the `/vsicurl/` prefix:\n"
- ],
- "id": "9b12efad"
- },
- {
- "cell_type": "code",
- "metadata": {},
- "source": [
- "src = rasterio.open('/vsicurl/https://zenodo.org/record/5774954/files/clm_snow.prob_esacci.dec_p.90_500m_s0..0cm_2000..2012_v2.0.tif')\n",
- "src"
- ],
- "id": "427103e1",
- "execution_count": null,
- "outputs": []
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "In the example above, `rasterio.open` creates a connection to the file without obtaining any values, as we did for the local `srtm.tif` file.\n",
- "The values can be read, into an `ndarray`, using the `.read` method of the file connection (@sec-using-rasterio). \n",
- "The parameters of `.read` allow us to read just a small portion of the data, without downloading the entire file. \n",
- "This is very useful when working with large datasets hosted online from resource-constrained computing environments such as laptops.\n",
- "\n",
- "For example, we can read a specified rectangular extent of the raster.\n",
- "With **rasterio**, this is done using the so-called [windowed reading](https://rasterio.readthedocs.io/en/latest/topics/windowed-rw.html) capabilities. \n",
- "Note that, with windowed reading, we import just a subset of the raster extent into an `ndarray` covering any partial extent. \n",
- "Windowed reading is therefore memory- (and, in this case, bandwidth-) efficient, since it avoids reading the entire raster into memory. \n",
- "Windowed reading can also be considered an alternative pathway to *cropping* (@sec-raster-cropping).\n",
- "\n",
- "To read a raster *window*, let's first define the bounding box coordinates. \n",
- "For example, here we use a $10 \\times 10$ degree extent containing Reykjavik:\n"
- ],
- "id": "2c8f7717"
- },
- {
- "cell_type": "code",
- "metadata": {},
- "source": [
- "xmin = -30\n",
- "xmax = -20\n",
- "ymin = 60\n",
- "ymax = 70"
- ],
- "id": "942c2fcf",
- "execution_count": null,
- "outputs": []
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Using the extent coordinates along with the raster transformation matrix, we create a window object, using the [`rasterio.windows.from_bounds`](https://rasterio.readthedocs.io/en/stable/api/rasterio.windows.html#rasterio.windows.from_bounds) function.
\n",
- "The `rasterio.windows.from_bounds` function essentially \"translates\" the extent from coordinates to row/column ranges:\n"
- ],
- "id": "a6b0fc41"
- },
- {
- "cell_type": "code",
- "metadata": {},
- "source": [
- "w = rasterio.windows.from_bounds(\n",
- "    left=xmin, \n",
- "    bottom=ymin,\n",
- "    right=xmax,\n",
- "    top=ymax, \n",
- "    transform=src.transform\n",
- ")\n",
- "w"
- ],
- "id": "e2b7513a",
- "execution_count": null,
- "outputs": []
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Now we can read the partial array, according to the specified window `w`, by passing it to the `.read` method:\n"
- ],
- "id": "bb460846"
- },
- {
- "cell_type": "code",
- "metadata": {},
- "source": [
- "r = src.read(1, window=w)\n",
- "r"
- ],
- "id": "8f50fea2",
- "execution_count": null,
- "outputs": []
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Note that the transformation matrix of the window is not the same as that of the original raster (unless the window incidentally starts at the top-left corner)! \n",
- "Therefore, we must re-create the transformation matrix, with the modified origin (`xmin`, `ymax`), yet the same resolution, as follows:\n"
- ],
- "id": "048ded60"
- },
- {
- "cell_type": "code",
- "metadata": {},
- "source": [
- "w_transform = rasterio.transform.from_origin(\n",
- "    west=xmin, \n",
- "    north=ymax, \n",
- "    xsize=src.transform[0],\n",
- "    ysize=abs(src.transform[4])\n",
- ")\n",
- "w_transform"
- ],
- "id": "856cc4f9",
- "execution_count": null,
- "outputs": []
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "The array `r`, along with the updated transformation matrix `w_transform`, comprises the raster window, which we can keep working with just like any other raster, as shown in previous chapters.\n",
- "@fig-raster-window shows the result, along with the location of Reykjavik (see below):\n"
- ],
- "id": "f4d08379"
- },
- {
- "cell_type": "code",
- "metadata": {},
- "source": [
- "#| label: fig-raster-window\n",
- "#| fig-cap: Raster window read from a remote Cloud Optimized GeoTIFF (COG) file source\n",
- "\n",
- "fig, ax = plt.subplots()\n",
- "rasterio.plot.show(r, transform=w_transform, ax=ax)\n",
- "gpd.GeoSeries(shapely.Point(-21.94, 64.15)).plot(ax=ax, color='red');"
- ],
- "id": "fig-raster-window",
- "execution_count": null,
- "outputs": []
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Another option is to extract raster values at particular points, directly from the file connection, using the `.sample` method (see @sec-spatial-subsetting-raster).
\n",
- "For example, we can get the snow probability for December in Reykjavik (70%) by specifying its coordinates and applying `.sample`:\n"
- ],
- "id": "14043f7b"
- },
- {
- "cell_type": "code",
- "metadata": {},
- "source": [
- "coords = (-21.94, 64.15)\n",
- "values = src.sample([coords])\n",
- "list(values)"
- ],
- "id": "8669a5e8",
- "execution_count": null,
- "outputs": []
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "The example above efficiently extracts and downloads a single value instead of the entire GeoTIFF file, saving valuable resources.\n",
- "\n",
- "The `/vsicurl/` prefix also works for *vector* file formats, enabling you to import datasets from online storage with **geopandas** just by adding it before the vector file URL.\n",
- "Importantly, `/vsicurl/` is not the only prefix provided by GDAL---many more exist, such as `/vsizip/` to read spatial files from ZIP archives without decompressing them beforehand, or `/vsis3/` for on-the-fly reading of files available in AWS S3 buckets. \n",
- "You can learn more about these prefixes in the GDAL documentation on virtual file systems.\n",
- "\n",
- "## Data output (O) {#sec-data-output}\n",
- "\n",
- "Writing geographic data allows you to convert from one format to another and to save newly created objects for permanent storage. \n",
- "Depending on the data type (vector or raster), object class (e.g., `GeoDataFrame`), and type and amount of stored information (e.g., object size, range of values), it is important to know how to store spatial files in the most efficient way. \n",
- "The next two sections will demonstrate how to do this.\n",
- "\n",
- "### Vector data\n",
- "\n",
- "The counterpart of `gpd.read_file` is the `.to_file` method of `GeoDataFrame` objects. \n",
- "It allows you to write `GeoDataFrame` objects to a wide range of geographic vector file formats, including the most common ones, such as `.geojson`, `.shp` and `.gpkg`. \n",
- "Based on the file name, `.to_file` decides automatically which driver to use. The speed of the writing process also depends on the driver.\n",
- "\n",
- "For example, here is how we export the `world` layer to a GeoPackage file:\n"
- ],
- "id": "2a6472e2"
- },
- {
- "cell_type": "code",
- "metadata": {},
- "source": [
- "world.to_file('output/world.gpkg')"
- ],
- "id": "c4f2e4a1",
- "execution_count": null,
- "outputs": []
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Note that if you try to write to the same data source again, the function will overwrite the file:\n"
- ],
- "id": "d58c4846"
- },
- {
- "cell_type": "code",
- "metadata": {},
- "source": [
- "world.to_file('output/world.gpkg')"
- ],
- "id": "88129c5e",
- "execution_count": null,
- "outputs": []
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Instead of overwriting the file, we could add a new layer to the file with `mode='a'` (\"append\" mode, as opposed to the default `mode='w'` for \"write\" mode). \n",
- "Appending is supported by several spatial formats, including GeoPackage.
\n",
- "For example:\n"
- ],
- "id": "d3810131"
- },
- {
- "cell_type": "code",
- "metadata": {},
- "source": [
- "world.to_file('output/world_many_features.gpkg')\n",
- "world.to_file('output/world_many_features.gpkg', mode='a')"
- ],
- "id": "1d777444",
- "execution_count": null,
- "outputs": []
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Here, `world_many_features.gpkg` will contain a polygonal layer named `world` with two \"copies\" of each country (that is 177×2=354 features, whereas the `world` layer has 177 features):\n",
- "\n",
- "\n",
- "```{python}\n",
- "gpd.read_file('output/world_many_features.gpkg').shape\n",
- "```\n",
- "\n",
- "\n",
- "Alternatively, you can create another, separate layer within the same file. \n",
- "The GeoPackage format also supports multiple layers within one file. \n",
- "For example:\n"
- ],
- "id": "4668bd82"
- },
- {
- "cell_type": "code",
- "metadata": {},
- "source": [
- "world.to_file('output/world_many_layers.gpkg')\n",
- "world.to_file('output/world_many_layers.gpkg', layer='world2')"
- ],
- "id": "e0ff0a3b",
- "execution_count": null,
- "outputs": []
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "In this case, `world_many_layers.gpkg` has two \"layers\": `world_many_layers` (same as the file name, used when `layer` is unspecified) and `world2`. \n",
- "Incidentally, the contents of the two layers are identical, but this doesn't have to be so. \n",
- "Each layer from such a file can be imported separately, as in: \n"
- ],
- "id": "3eea5048"
- },
- {
- "cell_type": "code",
- "metadata": {},
- "source": [
- "gpd.read_file('output/world_many_layers.gpkg', layer='world_many_layers').head(1)"
- ],
- "id": "aca77c86",
- "execution_count": null,
- "outputs": []
- },
- {
- "cell_type": "code",
- "metadata": {},
- "source": [
- "gpd.read_file('output/world_many_layers.gpkg', layer='world2').head(1)"
- ],
- "id": "4666f9e8",
- "execution_count": null,
- "outputs": []
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### Raster data {#sec-data-output-raster}\n",
- "\n",
- "To write a raster file using **rasterio**, we need to pass a raster file path to `rasterio.open`, in writing (`'w'`) mode. This implies creating a new empty file (or overwriting an existing one). As opposed to read (`'r'`, the default) mode, the `rasterio.open` function needs quite a lot of information, in addition to the file path and mode:\n",
- "\n",
- "* An array with the raster values\n",
- "* Metadata describing the raster format and spatial properties\n",
- "\n",
- "The metadata needs to specify the following properties:\n",
- "\n",
- "* `driver`---The file format (the recommendation is `'GTiff'` for GeoTIFF)\n",
- "* `height`---Number of rows\n",
- "* `width`---Number of columns\n",
- "* `count`---Number of bands\n",
- "* `nodata`---The value which represents \"No Data\", if any\n",
- "* `dtype`---The raster data type, one of **numpy** types (e.g., `np.int64`)\n",
- "* `crs`---The CRS, using an EPSG code (e.g., `4326`)\n",
- "* `transform`---The transform matrix\n",
- "* `compress`---A compression method to apply, such as `'lzw'`. This is optional and most useful for large rasters. Note that, at the time of writing, this [doesn't work well](https://gis.stackexchange.com/questions/404738/why-does-rasterio-compression-reduces-image-size-with-single-band-but-not-with-m) for writing multiband rasters.\n",
- "\n",
- "Once the file connection with the right metadata is ready, we do the actual writing using the `.write` method of the file connection. If there are several bands, we may execute the `.write` method several times, as in `.write(a,n)`, where `a` is the array with the band values and `n` is the band index (starting from `1`, see below). When done, we close the file connection using the `.close` method. Some functions, such as `rasterio.warp.reproject` used for resampling and reprojecting, directly accept a file connection in `'w'` mode, thus handling the writing (of a resampled or reprojected raster) for us.\n",
- "\n",
- "Most of the properties are either straightforward to choose based on our aims (e.g., `driver`, `crs`, `compress`, `nodata`), or directly derived from the array with the raster values itself (e.g., `height`, `width`, `count`, `dtype`). The most complicated property is the `transform`, which specifies the raster origin and resolution. The `transform` is typically either obtained from an existing raster (serving as a \"template\"), created from scratch based on manually specified origin and resolution values (e.g., using `rasterio.transform.from_origin`), or calculated automatically (e.g., using `rasterio.warp.calculate_default_transform`).\n",
- "\n",
- "Earlier in the book, we have already demonstrated the four most common scenarios of writing rasters:\n",
- "\n",
- "* Creating from scratch (@sec-raster-from-scratch)---We created and wrote two rasters from scratch by associating the `elev` and `grain` arrays with an arbitrary spatial extent. The custom arbitrary `transform` was created using `rasterio.transform.from_origin`.\n",
- "* Aggregating (@sec-raster-agg-disagg)---We wrote an aggregated raster, by reading a resampled array from an existing raster, then updating the `transform` using `.transform.scale`.\n",
- "* Resampling (@sec-raster-resampling)---We resampled a raster into a custom grid, manually creating the `transform` using `rasterio.transform.from_origin`, then resampling and writing the output using `rasterio.warp.reproject`.\n",
- "* Reprojecting (@sec-reprojecting-raster-geometries)---We reprojected a raster into another CRS, by automatically calculating an optimal `transform` using `rasterio.warp.calculate_default_transform`, then resampling and writing the output using `rasterio.warp.reproject`.\n",
- "\n",
- "A minimal example of writing a raster file named `r.tif` from scratch (i.e., the 1^st^ scenario), to recap some of these concepts, is given below:\n"
- ],
- "id": "651bbed8"
- },
- {
- "cell_type": "code",
- "metadata": {},
- "source": [
- "# An array with raster values\n",
- "r = np.array([1,2,3,4]).reshape(2,2).astype(np.int8)\n",
- "r"
- ],
- "id": "047ffb24",
- "execution_count": null,
- "outputs": []
- },
- {
- "cell_type": "code",
- "metadata": {},
- "source": [
- "# Calculating the transform\n",
- "new_transform = rasterio.transform.from_origin(\n",
- "    west=-0.5, \n",
- "    north=51.5, \n",
- "    xsize=2, \n",
- "    ysize=2\n",
- ")\n",
- "new_transform"
- ],
- "id": "ebb976f3",
- "execution_count": null,
- "outputs": []
- },
- {
- "cell_type": "code",
- "metadata": {},
- "source": [
- "# Creating the file connection with the metadata\n",
- "dst = rasterio.open(\n",
- "    'output/r.tif', 'w', \n",
- "    driver = 'GTiff',\n",
- "    height = r.shape[0],\n",
- "    width = r.shape[1],\n",
- "    count = 1,\n",
- "    dtype = r.dtype,\n",
- "    crs = 4326,\n",
- "    transform = new_transform\n",
- ")\n",
- "dst"
- ],
- "id": "48743dc2",
- "execution_count": null,
- "outputs": []
- },
- {
- "cell_type": "code",
- "metadata": {},
- "source": [
- "# Writing the array values into the file\n",
- "dst.write(r, 1)"
- ],
- "id": "c8b6b2ec",
- "execution_count": null,
- "outputs": []
- },
- {
- "cell_type": "code",
- "metadata": {},
- "source": [
- "# Closing the file\n",
- "dst.close()"
- ],
- "id": "d804ecfd",
- "execution_count": null,
- "outputs": []
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "This code section creates a new file `output/r.tif`, which is a $2 \\times 2$ raster with a resolution of 2 decimal degrees, whose top-left corner is placed over London.\n",
- "\n",
- "To summarize, the various scenarios differ in two aspects:\n",
- "\n",
- "* The way that the `transform` for the output raster is obtained: \n",
- "    * Imported from an existing raster (see below)\n",
- "    * Created from scratch, using `rasterio.transform.from_origin` (@sec-raster-from-scratch)\n",
- "    * Calculated automatically, using `rasterio.warp.calculate_default_transform` (@sec-reprojecting-raster-geometries)\n",
- "* The way that the raster is written:\n",
- "    * Using the `.write` method, given an existing array (@sec-raster-from-scratch, @sec-raster-agg-disagg)\n",
- "    * Using `rasterio.warp.reproject` to calculate and write a resampled or reprojected array (@sec-raster-resampling, @sec-reprojecting-raster-geometries)\n",
- "\n",
- "To make the picture of raster export complete, there are three important concepts we haven't covered yet: array and raster data types, writing multiband rasters, and handling \"No Data\" values.\n",
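- "\n",
- "As a quick illustration of why data types deserve attention, compare the memory footprint of the same values stored at different precisions (a toy sketch):\n",
- "\n",
- "```python\n",
- "x = np.arange(10**6)\n",
- "x.astype(np.float64).nbytes  # 8000000 bytes\n",
- "x.astype(np.int16).nbytes    # 2000000 bytes\n",
- "```\n",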
- "\n",
- "Arrays (i.e., `ndarray` objects defined in package **numpy**) are used to store raster values when reading them from file, using `.read` (@sec-using-rasterio). All values in an array are of the same type, and the **numpy** package supports numerous numeric data types of various precision (and, accordingly, memory footprint). Raster formats, such as GeoTIFF, support exactly the same data types, which means that reading a raster file uses as little RAM as possible. The most relevant types are summarized in @tbl-numpy-data-types.\n",
- "\n",
- "| Data type | Description |\n",
- "|---|-------------|\n",
- "| `int8` | Integer in a single byte (`-128` to `127`) |\n",
- "| `int16` | Integer in 16 bits (`-32768` to `32767`) |\n",
- "| `int32` | Integer in 32 bits (`-2147483648` to `2147483647`) |\n",
- "| `uint8` | Unsigned integer (`0` to `255`) |\n",
- "| `uint16` | Unsigned integer (`0` to `65535`) |\n",
- "| `uint32` | Unsigned integer (`0` to `4294967295`) |\n",
- "| `float16` | Half-precision (16 bit) float (`-65504` to `65504`) |\n",
- "| `float32` | Single-precision (32 bit) float (`1e-38` to `1e38`) |\n",
- "| `float64` | Double-precision (64 bit) float (`1e-308` to `1e308`) |\n",
- "\n",
- ": Numeric `numpy` data types which are commonly used for rasters {#tbl-numpy-data-types}\n",
- "\n",
- "The raster data type can be specified when writing a raster (see above). For an existing raster file, the data type is accessible through the `'dtype'` entry of the metadata: \n"
- ],
- "id": "51966bd3"
- },
- {
- "cell_type": "code",
- "metadata": {},
- "source": [
- "rasterio.open('output/r.tif').meta['dtype']"
- ],
- "id": "9d019cec",
- "execution_count": null,
- "outputs": []
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "The file `r.tif` has the data type `np.int8`, which we specified when creating it according to the data type of the original array:\n"
- ],
- "id": "098b0ac8"
- },
- {
- "cell_type": "code",
- "metadata": {},
- "source": [
- "r.dtype"
- ],
- "id": "88b77356",
- "execution_count": null,
- "outputs": []
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "When reading the data back into the Python session, an array with the same data type is recreated:\n"
- ],
- "id": "d520d0f6"
- },
- {
- "cell_type": "code",
- "metadata": {},
- "source": [
- "rasterio.open('output/r.tif').read().dtype"
- ],
- "id": "bf1652af",
- "execution_count": null,
- "outputs": []
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Writing multiband rasters is similar to writing single-band rasters, except that we need to: \n",
- "\n",
- "* Define the number of layers (the `count` property in the metadata) that are going to be in the file we are creating\n",
- "* Execute the `.write` method multiple times, once for each layer\n",
- "\n",
- "For completeness, let's demonstrate writing a multi-band raster named `r3.tif`, which is similar to `r.tif`, but having three bands with values `r`, `r*2`, and `r*3` (i.e., the array `r` multiplied by 1, 2, or 3). Since most of the metadata is going to be the same, this is also a good opportunity to (re-)demonstrate updating an existing metadata object rather than creating one from scratch.
\n", - "\n", - "First, let's make a copy of the metadata we already have in `r.tif`:\n" - ], - "id": "e4ae1c81" - }, - { - "cell_type": "code", - "metadata": {}, - "source": [ - "dst_kwds = rasterio.open('output/r.tif').meta.copy()\n", - "dst_kwds" - ], - "id": "e118d0f0", - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Second, we update the `count` entry, replacing `1` (single-band) with `3` (three-band):\n" - ], - "id": "32bea4cc" - }, - { - "cell_type": "code", - "metadata": {}, - "source": [ - "dst_kwds.update(count=3)\n", - "dst_kwds" - ], - "id": "d599d812", - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Finally, we can create a file connection using the updated metadata and then write the values of the three bands:\n" - ], - "id": "b7a3403a" - }, - { - "cell_type": "code", - "metadata": {}, - "source": [ - "dst = rasterio.open('output/r3.tif', 'w', **dst_kwds)\n", - "dst.write(r, 1)\n", - "dst.write(r*2, 2)\n", - "dst.write(r*3, 3)\n", - "dst.close()" - ], - "id": "f6c49e9d", - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "As a result, a three-band raster named `r3.tif` is created. \n", - "\n", - "Rasters often contain \"No Data\" values, representing missing data, e.g., an unreliable measurement due to clouds, or pixels outside of the photographed extent. \n", - "In a **numpy** `ndarray` object, \"No Data\" values may be represented by the special `np.nan` value. \n", - "However, since `np.nan` is itself a floating-point value, only arrays of type `float` can contain it, while arrays of type `int` cannot. \n", - "For `int` rasters containing \"No Data\", we typically mark missing data with a specific value beyond the valid range (e.g., `-9999`). \n", - "The missing data \"flag\" is stored in the file (set through the `nodata` property of the file connection, see above). \n", - "When reading an `int` raster with \"No Data\" back into Python, we need to be aware of these flags. Let's demonstrate through examples.\n", - "\n", - "We will start with the simpler case, rasters of type `float`. Since `float` arrays may contain the \"native\" value `np.nan`, representing \"No Data\" is straightforward.
\n", - "For example, suppose that we have a `float` array with `np.nan`:\n" - ], - "id": "d69382bc" - }, - { - "cell_type": "code", - "metadata": {}, - "source": [ - "r = np.array([1.1,2.1,np.nan,4.1]).reshape(2,2)\n", - "r" - ], - "id": "985b2e7f", - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "code", - "metadata": {}, - "source": [ - "r.dtype" - ], - "id": "e4e61d8c", - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "When writing the array to file, we do not need to specify any particular `nodata` value:\n" - ], - "id": "006ac97e" - }, - { - "cell_type": "code", - "metadata": {}, - "source": [ - "dst = rasterio.open(\n", - " 'output/r_nodata_float.tif', 'w', \n", - " driver = 'GTiff',\n", - " height = r.shape[0],\n", - " width = r.shape[1],\n", - " count = 1,\n", - " dtype = r.dtype,\n", - " crs = 4326,\n", - " transform = new_transform\n", - ")\n", - "dst.write(r, 1)\n", - "dst.close()" - ], - "id": "5485418f", - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "This is equivalent to `nodata=None`:\n" - ], - "id": "26512af6" - }, - { - "cell_type": "code", - "metadata": {}, - "source": [ - "rasterio.open('output/r_nodata_float.tif').meta" - ], - "id": "2c64266f", - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Reading from the raster back into the Python session reproduces the same exact array, with `np.nan`:\n" - ], - "id": "9fcb665c" - }, - { - "cell_type": "code", - "metadata": {}, - "source": [ - "rasterio.open('output/r_nodata_float.tif').read()" - ], - "id": "8fe52399", - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Now, suppose that we have an `np.int32` array with missing data, which is inevitably flagged using a specific `int` value such as `-9999` (remember that we can't store `np.nan` in an `int` array!):\n" - ], - "id": "127d1cce" - }, - { - "cell_type": "code", - "metadata": {}, - "source": [ - "r = np.array([1,2,-9999,4]).reshape(2,2).astype(np.int32)\n", - "r" - ], - "id": "66f5312d", - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "code", - "metadata": {}, - "source": [ - "r.dtype" - ], - "id": "cfcbf6a5", - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "When writing the array to file, we must specify `nodata=-9999` to keep track of our \"No Data\" flag:\n" - ], - "id": "a59eb435" - }, - { - "cell_type": "code", - "metadata": {}, - "source": [ - "dst = rasterio.open(\n", - " 'output/r_nodata_int.tif', 'w', \n", - " driver = 'GTiff',\n", - " height = r.shape[0],\n", - " width = r.shape[1],\n", - " count = 1,\n", - " dtype = r.dtype,\n", - " nodata = -9999,\n", - " crs = 4326,\n", - " transform = new_transform\n", - ")\n", - "dst.write(r, 1)\n", - "dst.close()" - ], - "id": "819b72e4", - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Examining the metadata confirms that the `nodata=-9999` setting was stored in the file `r_nodata_int.tif`. 
\n" - ], - "id": "4e4eaf58" - }, - { - "cell_type": "code", - "metadata": {}, - "source": [ - "rasterio.open('output/r_nodata_int.tif').meta" - ], - "id": "ea054400", - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "If you try to open the file in GIS software, such as QGIS, you will see the missing data interpreted (e.g., the pixel shown as blank), meaning that the software is aware of the flag. \n", - "However, reading the data back into Python reproduces an `int` array with `-9999`, for the same reason stated before:\n" - ], - "id": "bde446ca" - }, - { - "cell_type": "code", - "metadata": {}, - "source": [ - "src = rasterio.open('output/r_nodata_int.tif')\n", - "r = src.read()\n", - "r" - ], - "id": "433a826d", - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "The Python user must therefore be mindful of \"No Data\" `int` rasters, for example to avoid interpreting the value `-9999` literally. \n", - "For example, if we \"forget\" about the `nodata` flag, the literal calculation of the `.mean` would incorrectly include the value `-9999`:\n" - ], - "id": "2cbadb5f" - }, - { - "cell_type": "code", - "metadata": {}, - "source": [ - "r.mean()" - ], - "id": "a2be0def", - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "There are two basic ways to deal with the situation:\n", - "\n", - "* Converting the raster to `float`\n", - "* Using \"No Data\" masks\n", - "\n", - "First, particularly with small rasters where memory constraints are irrelevant, it may be more convenient to go from `int` to `float`, to gain the ability to use the natural `np.nan` representation. \n", - "Here is how we can do this with `r_nodata_int.tif`. We detect the missing data flag, convert the raster to `float`, and assign `np.nan` into the cells that are supposed to be missing:\n" - ], - "id": "b2869e23" - }, - { - "cell_type": "code", - "metadata": {}, - "source": [ - "mask = r == src.nodata\n", - "r = r.astype(np.float64)\n", - "r[mask] = np.nan\n", - "r" - ], - "id": "cc790021", - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "From there on, we deal with `np.nan` the usual way, such as using `np.nanmean` to calculate the mean excluding \"No Data\":\n" - ], - "id": "4b691213" - }, - { - "cell_type": "code", - "metadata": {}, - "source": [ - "np.nanmean(r)" - ], - "id": "2b983c26", - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "The second approach is to read the values into a so-called [\"masked\" array](https://numpy.org/doc/stable/reference/maskedarray.generic.html#what-is-a-masked-array), using the argument `masked=True`. \n", - "A masked array can be thought of as an extended `ndarray`, with two components: `.data` (the values) and `.mask` (a corresponding boolean array marking \"No Data\" values): \n" - ], - "id": "ea85e9f2" - }, - { - "cell_type": "code", - "metadata": {}, - "source": [ - "r = src.read(masked=True)\n", - "r" - ], - "id": "7893e07d", - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Using masked arrays is beyond the scope of this book.
However, the basic idea is that many **numpy** operations \"honor\" the mask, so that the user does not have to keep track of the way that \"No Data\" values are marked, similarly to the natural `np.nan` representation. For example, the `.mean` of a masked array ignores the value `-9999`, because it is masked, taking into account just the valid values `1`, `2`, and `4`:\n" - ], - "id": "934b7fbd" - }, - { - "cell_type": "code", - "metadata": {}, - "source": [ - "r.mean()" - ], - "id": "fafe0ff0", - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Keep in mind that, somewhat confusingly, `float` rasters may represent \"No Data\" using a specific value (such as `-9999.0`), instead of, or in addition to (!), the native `np.nan` representation. \n", - "In such cases, the same considerations shown for `int` apply to `float` rasters as well. \n", - "\n", - "## Exercises" - ], - "id": "f9e746c1" - } - ], - "metadata": { - "kernelspec": { - "name": "python3", - "language": "python", - "display_name": "Python 3 (ipykernel)" - } - }, - "nbformat": 4, - "nbformat_minor": 5 -} \ No newline at end of file diff --git a/07-read-write-plot.qmd b/07-read-write-plot.qmd index 9fe310a0..9e4d4240 100644 --- a/07-read-write-plot.qmd +++ b/07-read-write-plot.qmd @@ -606,7 +606,7 @@ world.to_file('output/world_many_features.gpkg', mode='a') Here, `world_many_features.gpkg` will contain a polygonal layer named `world` with two "copies" of each country (that is 177×2=354 features, whereas the `world` layer has 177 features): -```{pyhton} +```{python} gpd.read_file('output/world_many_features.gpkg').shape ``` @@ -633,26 +633,23 @@ gpd.read_file('output/world_many_layers.gpkg', layer='world2').head(1) ### Raster data {#sec-data-output-raster} -To write a raster file using **rasterio**, we need to pass a raster file path to `rasterio.open`, in writing (`'w'`) mode. This implies creating a new empty file (or overwriting an existing one). As opposed to read (`'r'`, the default) mode, the `rasterio.open` function needs quite a lot of information, in addition to the file path and mode: - -* An array with the raster values -* Metadata describing the raster format and spatial properties - -The metadata needs to specify the following properties: +To write a raster file using **rasterio**, we need to pass a raster file path to `rasterio.open`, in writing (`'w'`) mode. +This implies creating a new empty file (or overwriting an existing one). +As opposed to reading mode (`'r'`, the default), when in writing mode the `rasterio.open` function needs quite a lot of information, in addition to the file path and mode: * `driver`---The file format (The recommendation is `'GTiff'` for GeoTIFF) * `height`---Number of rows * `width`---Number of columns * `count`---Number of bands * `nodata`---The value which represents "No Data", if any -* `dtype`---The raster data type, one of **numpy** types (e.g., `np.int64`) -* `crs`---The CRS, using an EPSG code (e.g., `4326`) +* `dtype`---The raster data type, one of **numpy** types (e.g., `np.int64`) (see @tbl-numpy-data-types) +* `crs`---The CRS, e.g., using an EPSG code (such as `4326`) * `transform`---The transform matrix * `compress`---A compression method to apply, such as `'lzw'`. This is optional and most useful for large rasters.
Note that, at the time of writing, this [doesn't work well](https://gis.stackexchange.com/questions/404738/why-does-rasterio-compression-reduces-image-size-with-single-band-but-not-with-m) for writing multiband rasters. Once the file connection with the right metadata is ready, we do the actual writing using the `.write` method of the file connection. If there are several bands we may execute the `.write` method several times, as in `.write(a,n)`, where `a` is the array with band values and `n` is the band index (starting from `1`, see below). When done, we close the file connection using the `.close` method. Some functions, such as `rasterio.warp.reproject` used for resampling and reprojecting, directly accept a file connection in `'w'` mode, thus handling the writing (of a resampled or reprojected raster) for us. -Most of the properties are either straightforward to choose, based on our aims, (e.g., `driver`, `crs`, `compress`, `nodata`), or directly derived from the array with the raster values itself (e.g., `height`, `width`, `count`, `dtype`). The most complicated property is the `transform`, which specifies the raster origin and resolution. The `transform` is typically either obtained from an existing raster (serving as a "template"), or created from scratch based on manually specified origin and resolution values (e.g., using `rasterio.transform.from_origin`), or calculate automatically (e.g., using `rasterio.warp.calculate_default_transform`). +Most of the properties are either straightforward to choose, based on our aims, (e.g., `driver`, `crs`, `compress`, `nodata`), or directly derived from the array with the raster values itself (e.g., `height`, `width`, `count`, `dtype`). The most complicated property is the `transform`, which specifies the raster origin and resolution. The `transform` is typically either obtained from an existing raster (serving as a "template"), or created from scratch based on manually specified origin and resolution values (e.g., using `rasterio.transform.from_origin`), or calculated automatically (e.g., using `rasterio.warp.calculate_default_transform`), as shown in previous chapters. Earlier in the book, we have already demonstrated the four most common scenarios of writing rasters: diff --git a/07-read-write-plot_files/execute-results/html.json b/07-read-write-plot_files/execute-results/html.json deleted file mode 100644 index cfdaeaa2..00000000 --- a/07-read-write-plot_files/execute-results/html.json +++ /dev/null @@ -1,15 +0,0 @@ -{ - "hash": "41c86c1ac9de1ad5caf5f9ec095e6545", - "result": { - "markdown": "# Geographic data I/O {#read-write}\n\n## Prerequisites\n\nLet's import the required packages:\n\n::: {.cell execution_count=1}\n``` {.python .cell-code}\nimport urllib.request\nimport zipfile\nimport numpy as np\nimport fiona\nimport geopandas as gpd\nimport shapely\nimport rasterio\nimport rasterio.plot\nimport cartopy\nimport osmnx as ox\n```\n:::\n\n\n\n\nand load the sample data for this chapter:\n\n::: {.cell execution_count=3}\n``` {.python .cell-code}\nnz = gpd.read_file('data/nz.gpkg')\nnz_elev = rasterio.open('data/nz_elev.tif')\n```\n:::\n\n\n## Introduction\n\nThis chapter is about reading and writing geographic data.\nGeographic data import is essential for geocomputation: real-world applications are impossible without data. \nData output is also vital, enabling others to use valuable new or improved datasets resulting from your work. 
\nTaken together, these processes of import/output can be referred to as data I/O.\n\nGeographic data I/O is often done with few lines of code at the beginning and end of projects. \nIt is often overlooked as a simple one step process. \nHowever, mistakes made at the outset of projects (e.g. using an out-of-date or in some way faulty dataset) can lead to large problems later down the line, so it is worth putting considerable time into identifying which datasets are available, where they can be found and how to retrieve them. \nThese topics are covered in @sec-retrieving-open-data, which describes various geoportals, which collectively contain many terabytes of data, and how to use them. \nTo further ease data access, a number of packages for downloading geographic data have been developed, as described in @sec-geographic-data-packages.\n\nThere are many geographic file formats, each of which has pros and cons, described in @sec-file-formats. \nThe process of reading and writing files in formats efficiently is covered in Sections @sec-data-input and @sec-data-output, respectively. \n\n## Retrieving open data {#sec-retrieving-open-data}\n\nA vast and ever-increasing amount of geographic data is available on the internet, much of which is free to access and use (with appropriate credit given to its providers).[^gis-open-datasets]\nIn some ways there is now too much data, in the sense that there are often multiple places to access the same dataset. \nSome datasets are of poor quality. \nIn this context, it is vital to know where to look, so the first section covers some of the most important sources. \nVarious 'geoportals' (web services providing geospatial datasets such as [Data.gov](https://catalog.data.gov/dataset?metadata_type=geospatial)) are a good place to start, providing a wide range of data but often only for specific locations (as illustrated in the updated [Wikipedia page](https://en.wikipedia.org/wiki/Geoportal) on the topic).\n\n[^gis-open-datasets]: For example, visit for a list of websites with freely available geographic datasets.\n\nSome global geoportals overcome this issue. \nThe [GEOSS portal](http://www.geoportal.org/) and the [Copernicus Open Access Hub](https://scihub.copernicus.eu/), for example, contain many raster datasets with global coverage. \nA wealth of vector datasets can be accessed from the [SEDAC](http://sedac.ciesin.columbia.edu/) portal run by the National Aeronautics and Space Administration (NASA) and the European Union's [INSPIRE geoportal](http://inspire-geoportal.ec.europa.eu/), with global and regional coverage.\n\nMost geoportals provide a graphical interface allowing datasets to be queried based on characteristics such as spatial and temporal extent, the United States Geological Survey's [EarthExplorer](https://earthexplorer.usgs.gov/) being a prime example. \nExploring datasets interactively on a browser is an effective way of understanding available layers. \nDownloading data is best done with code, however, from reproducibility and efficiency perspectives. \nDownloads can be initiated from the command line using a variety of techniques, primarily via URLs and APIs (see the [Sentinel API](https://scihub.copernicus.eu/twiki/do/view/SciHubWebPortal/APIHubDescription) for example). 
\nFiles hosted on static URLs can be downloaded with the following method, as illustrated in the code chunk below which accesses the [Natural Earth Data](https://www.naturalearthdata.com/) website to download the world airports layer zip file and to extract the contained Shapefile. \nNote that the download code is complicated by the fact that the server checks the `User-agent` header of the request, basically to make sure that the download takes place through a browser. \nTo overcome this, we add a header corresponding to a request coming from a browser (such as Firefox) in our code:\n\n::: {.cell execution_count=4}\n``` {.python .cell-code}\n# Set URL+filename\nurl = 'https://www.naturalearthdata.com/http//www.naturalearthdata.com/download/10m/cultural/ne_10m_airports.zip'\nfilename = 'output/ne_10m_airports.zip'\n\n# Download\nopener = urllib.request.build_opener()\nopener.addheaders = [('User-agent', 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/116.0')]\nurllib.request.install_opener(opener)\nurllib.request.urlretrieve(url, filename)\n\n# Extract\nf = zipfile.ZipFile(filename, 'r')\nf.extractall('output')\nf.close()\n```\n:::\n\n\n\n\nThe Shapefile that has been created in the `output` directory can then be imported and plotted (@fig-ne-airports) as follows:\n\n::: {.cell execution_count=6}\n``` {.python .cell-code}\nne = gpd.read_file(filename.replace('.zip', '.shp'))\nne.plot();\n```\n\n::: {.cell-output .cell-output-display}\n![World airports layer, downloaded using Python from the Natural Earth Data website](07-read-write-plot_files/figure-html/fig-ne-airports-output-1.png){#fig-ne-airports width=429 height=183}\n:::\n:::\n\n\n## Geographic data packages {#sec-geographic-data-packages}\n\nMany Python packages have been developed for accessing geographic data, two of which are presented in @tbl-data-packages and demonstrated below. \nThese provide interfaces to one or more spatial libraries or geoportals and aim to make data access even quicker from the command line.\n\n| Package | Description |\n|---|-----------|\n| `cartopy` | Download layers from [Natural Earth Data](https://www.naturalearthdata.com/downloads/) | \n| `osmnx` | Access to [OpenStreetMap](https://www.openstreetmap.org/) data and conversion to spatial networks | \n: Selected Python packages for geographic data retrieval {#tbl-data-packages}\n\nEach data package has its own syntax for accessing data. \nThis diversity is demonstrated in the subsequent code chunks, which show how to get data using the packages from @tbl-data-packages. \n\nAdministrative borders are often useful in spatial analysis. \nThese can be accessed with the `cartopy.io.shapereader.natural_earth` function from the **cartopy** package. \nFor example, the following code loads the `'admin_2_counties'` dataset of US counties:\n\n::: {.cell execution_count=7}\n``` {.python .cell-code}\nfilename = cartopy.io.shapereader.natural_earth(\n resolution='10m',\n category='cultural',\n name='admin_2_counties'\n)\ncounties = gpd.read_file(filename)\ncounties\n```\n\n::: {.cell-output .cell-output-display execution_count=74}\n```{=html}\n
(table output: 3224 rows × 62 columns; columns include FEATURECLA, SCALERANK, ADM2_CODE, ..., NAME_ZH, NAME_ZHT, geometry)
\n```\n:::\n:::\n\n\nThe layer of counties is plotted in @fig-ne-counties:\n\n::: {.cell execution_count=8}\n``` {.python .cell-code}\ncounties.plot();\n```\n\n::: {.cell-output .cell-output-display}\n![US counties, downloaded from the Natural Earth Data website using package **cartopy**](07-read-write-plot_files/figure-html/fig-ne-counties-output-1.png){#fig-ne-counties width=418 height=120}\n:::\n:::\n\n\nOther layers can be accessed the same way: \n\n* you need to locate the `resolution`, `category`, and `name` of the requested dataset in [Natural Earth Data](https://www.naturalearthdata.com/downloads/), then \n* run the `cartopy.io.shapereader.natural_earth` function, which downloads the file(s) and returns the path, and\n* read the file into the Python environment, e.g., using `gpd.read_file`.\n\nThis is an alternative approach to \"directly\" downloading files as shown earlier (@sec-retrieving-open-data). \n\nThe second example uses the **osmnx** package to find parks from the OpenStreetMap (OSM) database. \nAs illustrated in the code chunk below, OpenStreetMap data can be obtained using the `ox.features.features_from_place` function. \nThe first argument is a string which is geocoded to a polygon (the `ox.features.features_from_bbox` and `ox.features.features_from_polygon` functions can also be used to query a custom area of interest). \nThe second argument specifies the OSM tag(s), selecting which OSM elements we're interested in (parks, in this case), represented by key-value pairs:\n\n::: {.cell execution_count=9}\n``` {.python .cell-code}\nparks = ox.features.features_from_place(\n query='leeds uk', \n tags={'leisure': 'park'}\n)\nparks\n```\n\n::: {.cell-output .cell-output-display execution_count=76}\n```{=html}\n
(table output: 528 rows × 58 columns, indexed by element_type/osmid; columns include created_by, geometry, barrier, ..., designation, ways, type)
\n```\n:::\n:::\n\n\nThe result is a `GeoDataFrame` with the parks in Leeds.\nThe following expression plots the geometries with the `name` property in the tooltips (@fig-ox-features):\n\n::: {.cell execution_count=10}\n``` {.python .cell-code}\nparks[['name', 'geometry']].explore()\n```\n\n::: {#fig-ox-features .cell-output .cell-output-display execution_count=77}\n```{=html}\n
\n```\n\nParks in Leeds, based on OpenStreetMap data, downloaded using package `osmnx`\n:::\n:::\n\n\nIt should be noted that the `osmnx` package downloads OSM data from the [Overpass API](https://wiki.openstreetmap.org/wiki/Overpass_API), which is rate limited and therefore unsuitable for queries covering very large areas. \nTo overcome this limitation, you can download OSM data extracts, such as in Shapefile format from [Geofabrik](https://download.geofabrik.de/), and then load them from the file into the Python environment.\n\nOpenStreetMap is a vast global database of crowd-sourced data, is growing daily, and has a wider ecosystem of tools enabling easy access to the data, from the [Overpass turbo](https://overpass-turbo.eu/) web service for rapid development and testing of OSM queries to [osm2pgsql](https://osm2pgsql.org/) for importing the data into a PostGIS database. \nAlthough the quality of datasets derived from OSM varies, the data source and wider OSM ecosystems have many advantages: they provide datasets that are available globally, free of charge, and constantly improving thanks to an army of volunteers. \nUsing OSM encourages 'citizen science' and contributions back to the digital commons (you can start editing data representing a part of the world you know well at [www.openstreetmap.org](https://www.openstreetmap.org/)). \n\nSometimes, packages come with built-in datasets. \nThese can be accessed just like any other object (e.g., function) that is imported as part of the package, or in other ways as specified in the package documentation. \nFor example, package **geopandas** comes with few built-in datasets (see `gpd.datasets.available` for a list of names). \nUsing the `gpd.datasets.get_path` function and the dataset name, we can obtain the path to the location of the dataset file on our computer. \nFor example, `'naturalearth_lowres'` is a vector layer of world countries (from Natural Earth Data, which we've alredy met before):\n\n::: {.cell execution_count=11}\n``` {.python .cell-code}\nfilename = gpd.datasets.get_path('naturalearth_lowres')\nfilename\n```\n\n::: {.cell-output .cell-output-display execution_count=78}\n```\n'/home/michael/.local/lib/python3.10/site-packages/geopandas/datasets/naturalearth_lowres/naturalearth_lowres.shp'\n```\n:::\n:::\n\n\nThen, we can import the dataset, just like from any other file:\n\n::: {.cell execution_count=12}\n``` {.python .cell-code}\ngpd.read_file(filename)\n```\n\n::: {.cell-output .cell-output-display execution_count=79}\n```{=html}\n
(table output: 177 rows × 6 columns; columns: pop_est, continent, name, iso_a3, gdp_md_est, geometry)
\n```\n:::\n:::\n\n\nAnother way to obtain spatial information is to perform geocoding---transform a description of a location, usually an address, into its coordinates. \nThis is usually done by sending a query to an online service and getting the location as a result. \nMany such services exist, differing in the geocoding method used, usage limitations, costs, or API key requirements. \n[Nominatim](https://nominatim.openstreetmap.org/ui/about.html) is a popular free service, based on OpenStreetMap data. \nIt can be accessed in Python using the `osmnx.geocoder.geocode` function. The function returns a `tuple` of the form `(lat,lon)`. \nThe example below searches for the coordinates of the John Snow blue plaque, located on a building in the Soho district of London:\n\n::: {.cell execution_count=13}\n``` {.python .cell-code}\nox.geocoder.geocode('54 Frith St, London W1D 4SJ, UK')\n```\n\n::: {.cell-output .cell-output-display execution_count=80}\n```\n(51.513770550000004, -0.13178011626807157)\n```\n:::\n:::\n\n\nIf the query returns no results, an `InsufficientResponseError` is raised, a scenario that the user can deal with using [`try`/`except`](https://docs.python.org/3/tutorial/errors.html#handling-exceptions). \n\nThe alternative function `osmnx.geocoder.geocode_to_gdf` can be used to automatically geocode multiple addresses (accepting a `list` of `string`s) and transform them into a `GeoDataFrame`. \nThis function also returns `'Polygon'` geometries. \nFor example:\n\n::: {.cell execution_count=14}\n``` {.python .cell-code}\nresult = ox.geocoder.geocode_to_gdf(['54 Frith St, London W1D 4SJ, UK'])\nresult\n```\n\n::: {.cell-output .cell-output-display execution_count=81}\n```{=html}\n
(table output: 1 row × 17 columns; columns include geometry, bbox_north, bbox_south, ..., addresstype, name, display_name)
\n```\n:::\n:::\n\n\nThe result is visualized below using `.explore` (@fig-ox-geocode):\n\n::: {.cell execution_count=15}\n``` {.python .cell-code}\nresult[['display_name', 'geometry']].explore(color='red')\n```\n\n::: {#fig-ox-geocode .cell-output .cell-output-display execution_count=82}\n```{=html}\n
\n```\n\nSpecific address in London, geocoded into a `GeoDataFrame` using package `osmnx`\n:::\n:::\n\n\n\n\n## File formats {#sec-file-formats}\n\nGeographic datasets are usually stored as files or in spatial databases. \nFile formats can either store vector or raster data, while spatial databases such as [PostGIS](https://postgis.net/) can store both. \nThe large variety of file formats may seem bewildering, but there has been much consolidation and standardization since the beginnings of GIS software in the 1960s when the first widely distributed program ([SYMAP](https://news.harvard.edu/gazette/story/2011/10/the-invention-of-gis/)) for spatial analysis was created at Harvard University [@coppock_history_1991].\n\nGDAL (which should be pronounced \"goo-dal\", with the double \"o\" making a reference to object-orientation), the Geospatial Data Abstraction Library, has resolved many issues associated with incompatibility between geographic file formats since its release in 2000. \nGDAL provides a unified and high-performance interface for reading and writing of many raster and vector data formats. \nMany open and proprietary GIS programs, including GRASS, ArcGIS and QGIS, use GDAL behind their GUIs for doing the legwork of ingesting and spitting out geographic data in appropriate formats.\nMost Pyhton packages for working with spatial data, including **geopandas** and **rasterio** used in this book, also rely on GDAL for importing and exporting spatial data files. \n\nGDAL provides access to more than 200 vector and raster data formats. \n@tbl-file-formats presents some basic information about selected and often used spatial file formats.\n\nName | Extension | Info | Type | Model |\n|-----|----|----------|-----|-----|\nESRI Shapefile | `.shp` (the main file) | Popular format consisting of at least three files. No support for: files > 2GB;mixed types; names > 10 chars; cols > 255. | Vector | Partially open |\nGeoJSON | `.geojson` | Extends the JSON exchange format by including a subset of the simple feature representation; mostly used for storing coordinates in longitude and latitude; it is extended by the TopoJSON format. | Vector | Open |\nKML | `.kml` | XML-based format for spatial visualization, developed for use with Google Earth. Zipped KML file forms the KMZ format. | Vector | Open |\nGPX | `.gpx` | XML schema created for exchange of GPS data. | Vector | Open |\nFlatGeobuf | `.fgb` | Single file format allowing for quick reading and writing of vector data. Has streaming capabilities. | Vector | Open |\nGeoTIFF | `.tif/.tiff` | Popular raster format. A TIFF file containing additional spatial metadata. | Raster | Open |\nArc ASCII | `.asc` | Text format where the first six lines represent the raster header, followed by the raster cell values arranged in rows and columns. | Raster | Open |\nSQLite/SpatiaLite | `.sqlite` | Standalone relational database, SpatiaLite is the spatial extension of SQLite. | Vector and raster | Open |\nESRI FileGDB | `.gdb` | Spatial and nonspatial objects created by ArcGIS. Allows: multiple feature classes; topology. Limited support from GDAL. 
| Vector and raster | Proprietary |\nGeoPackage | `.gpkg` | Lightweight database container based on SQLite allowing an easy and platform-independent exchange of geodata | Vector and (very limited) raster | Open |\n: Commonly used spatial data file formats {#tbl-file-formats}\n\nAn important development ensuring the standardization and open-sourcing of file formats was the founding of the Open Geospatial Consortium ([OGC](http://www.opengeospatial.org/)) in 1994. \nBeyond defining the simple features data model (see @sec-simple-features), the OGC also coordinates the development of open standards, for example as used in file formats such as KML and GeoPackage. \nOpen file formats of the kind endorsed by the OGC have several advantages over proprietary formats: the standards are published, ensure transparency and open up the possibility for users to further develop and adjust the file formats to their specific needs.\n\nESRI Shapefile is the most popular vector data exchange format; however, it is not an open format (though its specification is open). \nIt was developed in the early 1990s and has a number of limitations. \nFirst of all, it is a multi-file format, which consists of at least three files. \nIt only supports 255 columns, column names are restricted to ten characters and the file size limit is 2 GB. \nFurthermore, ESRI Shapefile does not support all possible geometry types, for example, it is unable to distinguish between a polygon and a multipolygon. \nDespite these limitations, a viable alternative had been missing for a long time. \nIn the meantime, [GeoPackage](https://www.geopackage.org/) emerged, and seems to be a more than suitable replacement candidate for ESRI Shapefile. \nGeoPackage is a format for exchanging geospatial information and an OGC standard. \nThe GeoPackage standard describes the rules on how to store geospatial information in a tiny SQLite container. \nHence, GeoPackage is a lightweight spatial database container, which allows the storage of vector and raster data but also of non-spatial data and extensions. \nAside from GeoPackage, there are other geospatial data exchange formats worth checking out (@tbl-file-formats).\n\nThe GeoTIFF format seems to be the most prominent raster data format. \nIt allows spatial information, such as the CRS definition and the transformation matrix (see @sec-using-rasterio), to be embedded within a TIFF file. \nSimilar to ESRI Shapefile, this format was first developed in the 1990s, but as an open format. \nAdditionally, GeoTIFF is still being expanded and improved. \nOne of the most significant recent additions to the GeoTIFF format is its variant called COG (Cloud Optimized GeoTIFF). \nRaster objects saved as COGs can be hosted on HTTP servers, so other people can read only parts of the file without downloading the whole file (@sec-input-raster).\n\nThere is also a plethora of other spatial data formats that we do not explain in detail or mention in @tbl-file-formats due to space limits. \nIf you need to use other formats, we encourage you to read the GDAL documentation about [vector](https://gdal.org/drivers/vector/index.html) and [raster](https://gdal.org/drivers/raster/index.html) drivers. \nAdditionally, some spatial data formats can store other data models (types) than vector or raster. \nThese include the LAS and LAZ formats for storing lidar point clouds, and NetCDF and HDF for storing multidimensional arrays.
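In practice, converting between the vector formats above is typically a read-then-write round trip. A minimal sketch, assuming the `data/world.gpkg` file from this chapter's sample data:

```python
import geopandas as gpd

# Convert a GeoPackage layer to GeoJSON and FlatGeobuf; the GeoJSON driver is
# inferred from the '.geojson' extension, while FlatGeobuf is named explicitly
world = gpd.read_file('data/world.gpkg')
world.to_file('output/world.geojson')
world.to_file('output/world.fgb', driver='FlatGeobuf')
```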
\n\nFinally, spatial data is also often stored using tabular (non-spatial) text formats, including CSV files or Excel spreadsheets. \nThis can be convenient to share spatial (point) datasets with people who, or software that, struggle with spatial data formats. \nIf necessary, the table can be converted to a point layer (see examples in @sec-vector-layer-from-scratch and @sec-spatial-joining).\n\n## Data input (I) {#sec-data-input}\n\nExecuting commands such as `geopandas.read_file` (the main function we use for loading vector data) or `rasterio.open`+`.read` (the main functions used for loading raster data) silently sets off a chain of events that reads data from files. \nMoreover, there are many Python packages containing a wide range of geographic data or providing simple access to different data sources. \nAll of them load the data into the Python environment or, more precisely, assign objects to your workspace, stored in RAM and accessible within the Python session.\n\n### Vector data\n\nSpatial vector data comes in a wide variety of file formats. \nMost popular representations such as `.shp`, `.geojson`, and `.gpkg` files can be imported and exported with **geopandas** functions `read_file` and `to_file` (covered in @sec-data-output), respectively.\n\n**geopandas** uses GDAL to read and write data, via the `fiona` package (the [default](https://github.com/geopandas/geopandas/issues/2217)) or the `pyogrio` package (a recently developed alternative to `fiona`). \nAfter `fiona` is imported, the command `fiona.supported_drivers` can be used to list drivers available to GDAL, including whether they can read (`'r'`), append (`'a'`), or write (`'w'`) data, or all three:\n\n::: {.cell execution_count=16}\n``` {.python .cell-code}\nfiona.supported_drivers\n```\n\n::: {.cell-output .cell-output-display execution_count=83}\n```\n{'DXF': 'rw',\n 'CSV': 'raw',\n 'OpenFileGDB': 'raw',\n 'ESRIJSON': 'r',\n 'ESRI Shapefile': 'raw',\n 'FlatGeobuf': 'raw',\n 'GeoJSON': 'raw',\n 'GeoJSONSeq': 'raw',\n 'GPKG': 'raw',\n 'GML': 'rw',\n 'OGR_GMT': 'rw',\n 'GPX': 'rw',\n 'MapInfo File': 'raw',\n 'DGN': 'raw',\n 'S57': 'r',\n 'SQLite': 'raw',\n 'TopoJSON': 'r',\n 'KML': 'r'}\n```\n:::\n:::\n\n\nOther, less common, drivers can be [\"activated\"](https://geopandas.org/en/stable/docs/user_guide/io.html) by manually supplementing `fiona.supported_drivers`.\n\nThe first argument of the versatile **geopandas** data import function `gpd.read_file` is `filename`, which is typically a string, but can also be a file connection.\nThe expected content of the string varies between drivers.\nIn most cases, as with the ESRI Shapefile (`.shp`) or the GeoPackage format (`.gpkg`), the `filename` argument would be a path or a URL to an actual file, such as `geodata.gpkg`.\nThe driver is automatically selected based on the file extension, as demonstrated for a `.gpkg` file below:\n\n::: {.cell execution_count=17}\n``` {.python .cell-code}\nworld = gpd.read_file('data/world.gpkg')\nworld\n```\n\n::: {.cell-output .cell-output-display execution_count=84}\n```{=html}\n
(table output: 177 rows × 11 columns; columns include iso_a2, name_long, continent, ..., lifeExp, gdpPercap, geometry)
\n```\n:::\n:::\n\n\nFor some drivers, such as a File Geodatabase (`OpenFileGDB`), `filename` could be provided as a folder name.\n\nGeoJSON, a plain text format, can be read from a `.geojson` file, but also from a string:\n\n::: {.cell execution_count=18}\n``` {.python .cell-code}\ngpd.read_file('{\"type\":\"Point\",\"coordinates\":[34.838848,31.296301]}')\n```\n\n::: {.cell-output .cell-output-display execution_count=85}\n```{=html}\n
(table output: 1 row; geometry: POINT (34.83885 31.29630))
\n```\n:::\n:::\n\n\nThe `gpd.read_postgis` function can be used to read a vector layer from a PostGIS database.\n\nSome vector formats, such as GeoPackage, can store multiple data layers. \nBy default, `gpd.read_file` automatically reads the first layer of the file specified in `filename`. \nHowever, using the `layer` argument you can specify any other layer.\n\nThe `gpd.read_file` function also allows for reading just parts of the file into RAM with two possible mechanisms. \nThe first one is related to the `where` argument, which allows specifying what part of the data to read using an SQL `WHERE` expression. \nThe example below extracts data for Tanzania only (@fig-read-shp-query (a)). \nIt is done by specifying that we want to get all rows for which `name_long` equals `'Tanzania'`:\n\n::: {.cell execution_count=19}\n``` {.python .cell-code}\ntanzania = gpd.read_file('data/world.gpkg', where='name_long=\"Tanzania\"')\ntanzania\n```\n\n::: {.cell-output .cell-output-display execution_count=86}\n```{=html}\n
(table output: 1 row × 11 columns; iso_a2: TZ, name_long: Tanzania, continent: Africa, ..., geometry: MULTIPOLYGON (((33.90371 -0.950...)
\n```\n:::\n:::\n\n\nIf you do not know the names of the available columns, a good approach is to just read one row of the data using the `rows` argument, which can be used to read the first N rows, then use the `.columns` property to examine the column names:\n\n::: {.cell execution_count=20}\n``` {.python .cell-code}\ngpd.read_file('data/world.gpkg', rows=1).columns\n```\n\n::: {.cell-output .cell-output-display execution_count=87}\n```\nIndex(['iso_a2', 'name_long', 'continent', 'region_un', 'subregion', 'type',\n 'area_km2', 'pop', 'lifeExp', 'gdpPercap', 'geometry'],\n dtype='object')\n```\n:::\n:::\n\n\nThe second mechanism uses the `mask` argument to filter data based on intersection with an existing geometry. \nThis argument expects a geometry (a `GeoDataFrame`, `GeoSeries`, or `shapely` geometry) representing the area where we want to extract the data. \nLet's try it using a small example---we want to read polygons from our file that intersect with a 50,000 $m$ buffer around Tanzania's borders. \nTo do it, we need to: \n\n* transform the geometry to a projected CRS (such as `EPSG:32736`), \n* prepare our \"filter\" by creating the buffer (@sec-buffers), and \n* transform back to the original CRS to be used as a mask:\n\n::: {.cell execution_count=21}\n``` {.python .cell-code}\ntanzania_buf = tanzania.to_crs(32736).buffer(50000).to_crs(4326)\ntanzania_buf\n```\n\n::: {.cell-output .cell-output-display execution_count=88}\n```\n0 POLYGON ((30.26283 -3.18314, 30...\ndtype: geometry\n```\n:::\n:::\n\n\nThe resulting buffered geometry is shown in @fig-tanzania-mask:\n\n::: {#fig-tanzania-mask .cell execution_count=22}\n``` {.python .cell-code}\ntanzania_buf.plot()\n```\n\n::: {#fig-tanzania-mask-1 .cell-output .cell-output-display execution_count=89}\n```\n\n```\n\nTanzania polygon buffered by 50 $km$, to be used for reading a subset of `world.gpkg` (@fig-read-shp-query (b))\n:::\n\n::: {.cell-output .cell-output-display}\n![](07-read-write-plot_files/figure-html/fig-tanzania-mask-output-2.png){#fig-tanzania-mask-2 width=429 height=412}\n:::\n:::\n\n\nNow, we can pass the \"filter\" geometry `tanzania_buf` to the `mask` argument of `gpd.read_file`:\n\n::: {.cell execution_count=23}\n``` {.python .cell-code}\ntanzania_neigh = gpd.read_file('data/world.gpkg', mask=tanzania_buf)\ntanzania_neigh\n```\n\n::: {.cell-output .cell-output-display execution_count=90}\n```{=html}\n
(table output: 9 rows × 11 columns; name_long values: Mozambique, Zambia, Malawi, ..., Burundi, Uganda, Rwanda)
\n```\n:::\n:::\n\n\nOur result, shown in @fig-read-shp-query (b), contains Tanzania and every country intersecting with its 50,000 $m$ buffer. \nNote that the last two expressions are used to add text labels with the `name_long` of each country, placed at the country centroid:\n\n::: {#fig-read-shp-query .cell layout-ncol='2' execution_count=24}\n``` {.python .cell-code}\n# Using 'where'\nfig, ax = plt.subplots()\ntanzania.plot(ax=ax, color='lightgrey', edgecolor='grey')\ntanzania.apply(lambda x: ax.annotate(text=x['name_long'], xy=x.geometry.centroid.coords[0], ha='center'), axis=1);\n\n# Using 'mask'\nfig, ax = plt.subplots()\ntanzania_neigh.plot(ax=ax, color='lightgrey', edgecolor='grey')\ntanzania_buf.plot(ax=ax, color='none', edgecolor='red')\ntanzania_neigh.apply(lambda x: ax.annotate(text=x['name_long'], xy=x.geometry.centroid.coords[0], ha='center'), axis=1);\n```\n\n::: {.cell-output .cell-output-display}\n![Using a `where` query (matching `'Tanzania'`)](07-read-write-plot_files/figure-html/fig-read-shp-query-output-1.png){#fig-read-shp-query-1 width=429 height=409}\n:::\n\n::: {.cell-output .cell-output-display}\n![Using a `mask` (a geometry shown in red)](07-read-write-plot_files/figure-html/fig-read-shp-query-output-2.png){#fig-read-shp-query-2 width=395 height=411}\n:::\n\nReading a subset of the vector layer file `world.gpkg`\n:::\n\n\nOften we need to read CSV files (or other tabular formats) which have x and y coordinate columns, and turn them into a `GeoDataFrame` with point geometries. \nTo do that, we can import the file using **pandas** (e.g., using `pd.read_csv` or `pd.read_excel`), then go from `DataFrame` to `GeoDataFrame` using the `gpd.points_from_xy` function, as shown earlier in the book (see @sec-vector-layer-from-scratch and @sec-spatial-joining). For example, the table `cycle_hire_xy.csv`, where the coordinates are stored in the `X` and `Y` columns in `EPSG:4326`, can be imported, converted to a `GeoDataFrame`, and plotted, as follows (@fig-cycle_hire_xy-layer):\n\n::: {.cell execution_count=25}\n``` {.python .cell-code}\ncycle_hire = pd.read_csv('data/cycle_hire_xy.csv')\ngeom = gpd.points_from_xy(cycle_hire['X'], cycle_hire['Y'], crs=4326)\ngeom = gpd.GeoSeries(geom)\ncycle_hire_xy = gpd.GeoDataFrame(data=cycle_hire, geometry=geom)\ncycle_hire_xy.plot();\n```\n\n::: {.cell-output .cell-output-display}\n![The `cycle_hire_xy.csv` table transformed to a point layer](07-read-write-plot_files/figure-html/fig-cycle_hire_xy-layer-output-1.png){#fig-cycle_hire_xy-layer width=440 height=264}\n:::\n:::\n\n\nInstead of columns describing 'XY' coordinates, a single column can also contain the geometry information, not necessarily points but possibly any other geometry type. \nWell-known text (WKT), well-known binary (WKB), and GeoJSON are examples of formats used to encode geometry in such a column. \nFor instance, the `world_wkt.csv` file has a column named `'WKT'`, representing polygons of the world's countries (in WKT format).
\nTo import and convert it to a `GeoDataFrame`, we can apply the `shapely.from_wkt` function (@sec-geometries) on the WKT strings, to convert them into `shapely` geometries (@fig-world_wkt-layer):\n\n::: {.cell execution_count=26}\n``` {.python .cell-code}\nworld_wkt = pd.read_csv('data/world_wkt.csv')\nworld_wkt['geometry'] = world_wkt['WKT'].apply(shapely.from_wkt)\nworld_wkt = gpd.GeoDataFrame(world_wkt)\nworld_wkt.plot();\n```\n\n::: {.cell-output .cell-output-display}\n![The `world_wkt.csv` table transformed to a polygon layer](07-read-write-plot_files/figure-html/fig-world_wkt-layer-output-1.png){#fig-world_wkt-layer width=429 height=221}\n:::\n:::\n\n\n::: {.callout-note}\nNot all of the supported vector file formats store information about their coordinate reference system. \nIn these situations, it is possible to add the missing information using the `.set_crs` function. \nPlease refer also to @sec-querying-and-setting-coordinate-systems for more information. \n:::\n\nAs a final example, we will show how **geopandas** also reads KML files. \nA KML file stores geographic information in XML format---a data format for the creation of web pages and the transfer of data in an application-independent way [@nolan_xml_2014]. \nHere, we access a KML file from the web. \nFirst, if necessary, we may need to \"activate\" the `KML` driver, which isn't always available by default:\n\n::: {.cell execution_count=27}\n``` {.python .cell-code}\nfiona.supported_drivers['KML'] = 'r'\n```\n:::\n\n\nThe sample KML file `KML_Samples.kml` contains more than one layer. \nTo list the available layers, we can use the `fiona.listlayers` function: \n\n::: {.cell execution_count=28}\n``` {.python .cell-code}\nu = 'https://developers.google.com/kml/documentation/KML_Samples.kml'\nfiona.listlayers(u)\n```\n\n::: {.cell-output .cell-output-display execution_count=95}\n```\n['Placemarks',\n 'Highlighted Icon',\n 'Paths',\n 'Google Campus',\n 'Extruded Polygon',\n 'Absolute and Relative']\n```\n:::\n:::\n\n\nWe can choose, for instance, the first layer `'Placemarks'` and read it, using `gpd.read_file` with an additional `layer` argument:\n\n::: {.cell execution_count=29}\n``` {.python .cell-code}\nplacemarks = gpd.read_file(u, layer='Placemarks')\nplacemarks\n```\n\n::: {.cell-output .cell-output-display execution_count=96}\n```{=html}\n
(table output: 3 rows; Name: Simple placemark, Floating placemark, Extruded placemark; columns: Name, Description, geometry (POINT Z))
\n```\n:::\n:::\n\n\n### Raster data {#sec-input-raster}\n\nSimilar to vector data, raster data comes in many file formats, with some of them supporting multilayer files. \n`rasterio.open` is used to create a file connection to a raster file, which can be subsequently used to read the metadata and/or the values, as shown previously (@sec-using-rasterio). \nFor example: \n\n::: {.cell execution_count=30}\n``` {.python .cell-code}\nsrc = rasterio.open('data/srtm.tif')\nsrc\n```\n\n::: {.cell-output .cell-output-display execution_count=97}\n```\n\n```\n:::\n:::\n\n\nAll of the previous examples, like the one above, read spatial information from files stored on your hard drive. \nHowever, GDAL also allows reading data directly from online resources, such as HTTP/HTTPS/FTP web resources. \nThe only thing we need to do is to add a `/vsicurl/` prefix before the path to the file. \nLet's try it by connecting to the global monthly snow probability at 500 $m$ resolution for the period 2000-2012 [@hengl_t_2021_5774954]. \nSnow probability for December is stored as a Cloud Optimized GeoTIFF (COG) file (see @sec-file-formats). \nTo read an online file, we need to provide its URL together with the `/vsicurl/` prefix:\n\n::: {.cell execution_count=31}\n``` {.python .cell-code}\nsrc = rasterio.open('/vsicurl/https://zenodo.org/record/5774954/files/clm_snow.prob_esacci.dec_p.90_500m_s0..0cm_2000..2012_v2.0.tif')\nsrc\n```\n\n::: {.cell-output .cell-output-display execution_count=98}\n```\n\n```\n:::\n:::\n\n\nIn the example above, `rasterio.open` creates a connection to the file without obtaining any values, as we did for the local `srtm.tif` file.\nThe values can be read, into an `ndarray`, using the `.read` method of the file connection (@sec-using-rasterio). \nUsing parameters of `.read` allows us to read just a small portion of the data, without downloading the entire file. \nThis is very useful when working with large datasets hosted online from resource-constrained computing environments such as laptops.\n\nFor example, we can read a specified rectangular extent of the raster.\nWith **rasterio**, this is done using the so-called [windowed reading](https://rasterio.readthedocs.io/en/latest/topics/windowed-rw.html) capabilities. \nNote that, with windowed reading, we import just a subset of the raster extent into an `ndarray`. \nWindowed reading is therefore memory- (and, in this case, bandwidth-) efficient, since it avoids reading the entire raster into memory. \nWindowed reading can also be considered an alternative pathway to *cropping* (@sec-raster-cropping).\n\nTo read a raster *window*, let's first define the bounding box coordinates. \nFor example, here we use a $10 \\times 10$ degree extent covering Reykjavik:\n\n::: {.cell execution_count=32}\n``` {.python .cell-code}\nxmin=-30\nxmax=-20\nymin=60\nymax=70\n```\n:::\n\n\nUsing the extent coordinates along with the raster transformation matrix, we create a window object, using the [`rasterio.windows.from_bounds`](https://rasterio.readthedocs.io/en/stable/api/rasterio.windows.html#rasterio.windows.from_bounds) function.
\nThe `rasterio.windows.from_bounds` function basically \"translates\" the extent from coordinates to row/column ranges:\n\n::: {.cell execution_count=33}\n``` {.python .cell-code}\nw = rasterio.windows.from_bounds(\n left=xmin, \n bottom=ymin,\n right=xmax,\n top=ymax, \n transform=src.transform\n)\nw\n```\n\n::: {.cell-output .cell-output-display execution_count=100}\n```\nWindow(col_off=35999.99999999998, row_off=4168.799999999996, width=2399.9999999999927, height=2400.0)\n```\n:::\n:::\n\n\nNow we can read the partial array, according to the specified window `w`, by passing it to the `.read` method:\n\n::: {.cell execution_count=34}\n``` {.python .cell-code}\nr = src.read(1, window=w)\nr\n```\n\n::: {.cell-output .cell-output-display execution_count=101}\n```\narray([[100, 100, 100, ..., 255, 255, 255],\n [100, 100, 100, ..., 255, 255, 255],\n [100, 100, 100, ..., 255, 255, 255],\n ...,\n [255, 255, 255, ..., 255, 255, 255],\n [255, 255, 255, ..., 255, 255, 255],\n [255, 255, 255, ..., 255, 255, 255]], dtype=uint8)\n```\n:::\n:::\n\n\nNote that the transformation matrix of the window is not the same as that of the original raster (unless it incidentally starts from the top-left corner)! \nTherefore, we must re-create the transformation matrix, with the modified origin (`xmin`,`ymax`), yet the same resolution, as follows:\n\n::: {.cell execution_count=35}\n``` {.python .cell-code}\nw_transform = rasterio.transform.from_origin(\n west=xmin, \n north=ymax, \n xsize=src.transform[0],\n ysize=abs(src.transform[4])\n)\nw_transform\n```\n\n::: {.cell-output .cell-output-display execution_count=102}\n```\nAffine(0.00416666666666667, 0.0, -30.0,\n 0.0, -0.00416666666666667, 70.0)\n```\n:::\n:::\n\n\nThe array `r`, along with the updated transformation matrix `w_transform`, comprises the partial window, which we can keep working with just like any other raster, as shown in previous chapters.\n@fig-raster-window shows the result, along with the location of Reykjavik (see below):\n\n::: {.cell execution_count=36}\n``` {.python .cell-code}\nfig, ax = plt.subplots()\nrasterio.plot.show(r, transform=w_transform, ax=ax)\ngpd.GeoSeries(shapely.Point(-21.94, 64.15)).plot(ax=ax, color='red');\n```\n\n::: {.cell-output .cell-output-display}\n![Raster window read from a remote Cloud Optimized GeoTIFF (COG) file source](07-read-write-plot_files/figure-html/fig-raster-window-output-1.png){#fig-raster-window width=429 height=416}\n:::\n:::\n\n\nAnother option is to extract raster values at particular points, directly from the file connection, using the `.sample` method (see @sec-spatial-subsetting-raster).
\n\nAnother option is to extract raster values at particular points, directly from the file connection, using the `.sample` method (see @sec-spatial-subsetting-raster). \nFor example, we can get the snow probability for December in Reykjavik (70%) by specifying its coordinates and applying `.sample`:\n\n::: {.cell execution_count=37}\n``` {.python .cell-code}\ncoords = (-21.94, 64.15)\nvalues = src.sample([coords])\nlist(values)\n```\n\n::: {.cell-output .cell-output-display execution_count=104}\n```\n[array([70], dtype=uint8)]\n```\n:::\n:::\n\n\nThe example above efficiently extracts and downloads a single value instead of the entire GeoTIFF file, saving valuable resources.\n\nThe `/vsicurl/` prefix also works for *vector* file formats, enabling you to import datasets from online storage with **geopandas** just by adding it before the vector file URL.\nImportantly, `/vsicurl/` is not the only prefix provided by GDAL---many more exist, such as `/vsizip/` to read spatial files from ZIP archives without decompressing them beforehand, or `/vsis3/` for on-the-fly reading of files available in AWS S3 buckets. \nYou can learn more about these in the [GDAL Virtual File Systems documentation](https://gdal.org/user/virtual_file_systems.html).\n\n## Data output (O) {#sec-data-output}\n\nWriting geographic data allows you to convert from one format to another and to save newly created objects for permanent storage. \nDepending on the data type (vector or raster), object class (e.g., `GeoDataFrame`), and type and amount of stored information (e.g., object size, range of values), it is important to know how to store spatial files in the most efficient way. \nThe next two sections will demonstrate how to do this.\n\n### Vector data\n\nThe counterpart of `gpd.read_file` is the `.to_file` method of `GeoDataFrame` objects. \nIt allows you to write `GeoDataFrame` objects to a wide range of geographic vector file formats, including the most common ones, such as `.geojson`, `.shp`, and `.gpkg`. \nBased on the file extension, `.to_file` automatically decides which driver to use. The speed of the writing process also depends on the driver.\n\nFor example, here is how we export the `world` layer to a GeoPackage file:\n\n::: {.cell execution_count=38}\n``` {.python .cell-code}\nworld.to_file('output/world.gpkg')\n```\n:::\n\n\nNote: if you try to write to the same data source again, the function will overwrite the file:\n\n::: {.cell execution_count=39}\n``` {.python .cell-code}\nworld.to_file('output/world.gpkg')\n```\n:::\n\n\nInstead of overwriting the file, we could add a new layer to the file with `mode='a'` (\"append\" mode, as opposed to the default `mode='w'` for \"write\" mode). Appending is supported by several spatial formats, including GeoPackage. For example:\n\n::: {.cell execution_count=40}\n``` {.python .cell-code}\nworld.to_file('output/world_many_features.gpkg')\nworld.to_file('output/world_many_features.gpkg', mode='a')\n```\n:::\n\n\nHere, `world_many_features.gpkg` will contain a polygonal layer named `world` with two \"copies\" of each country (that is, 177×2=354 features, whereas the `world` layer has 177 features).\n\nAlternatively, you can create another, separate layer within the same file, since the GeoPackage format supports multiple layers within one file. For example:\n\n::: {.cell execution_count=41}\n``` {.python .cell-code}\nworld.to_file('output/world_many_layers.gpkg')\nworld.to_file('output/world_many_layers.gpkg', layer='world2')\n```\n:::\n\n\nIn this case, `world_many_layers.gpkg` has two \"layers\": `world_many_layers` (same as the file name, when `layer` is unspecified) and `world2`. Incidentally, the contents of the two layers are identical, but this doesn't have to be so.
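\n\nTo check which layers a file contains, without reading them, we can use the `fiona.listlayers` function, which returns the layer names as a list (a quick sketch):\n\n::: {.cell}\n``` {.python .cell-code}\n# Listing the layer names in the GeoPackage file we have just created;\n# expected result: ['world_many_layers', 'world2']\nfiona.listlayers('output/world_many_layers.gpkg')\n```\n:::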
Each layer from such a file can be imported separately, as in: \n\n::: {.cell execution_count=42}\n``` {.python .cell-code}\ngpd.read_file('output/world_many_layers.gpkg', layer='world_many_layers').head(1)\n```\n\n::: {.cell-output .cell-output-display execution_count=109}\n```\n  iso_a2 name_long continent  ... lifeExp    gdpPercap                          geometry\n0     FJ      Fiji   Oceania  ...   69.96  8222.253784  MULTIPOLYGON (((-180.00000 -16....\n\n[1 rows x 11 columns]\n```\n:::\n:::
\n\n::: {.cell execution_count=43}\n``` {.python .cell-code}\ngpd.read_file('output/world_many_layers.gpkg', layer='world2').head(1)\n```\n\n::: {.cell-output .cell-output-display execution_count=110}\n```\n  iso_a2 name_long continent  ... lifeExp    gdpPercap                          geometry\n0     FJ      Fiji   Oceania  ...   69.96  8222.253784  MULTIPOLYGON (((-180.00000 -16....\n\n[1 rows x 11 columns]\n```\n:::\n:::
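\n\nAs an aside, note that the set of vector formats that can be read and/or written depends on the installed GDAL version and build. Assuming that **fiona** is used as the I/O engine, a quick way to inspect the available formats is the `fiona.supported_drivers` dictionary, which maps driver names to the supported access modes:\n\n::: {.cell}\n``` {.python .cell-code}\n# Vector format drivers available through fiona, mapped to the supported\n# access modes: 'r' (read), 'a' (append), and/or 'w' (write)\nfiona.supported_drivers\n```\n:::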
\n\n### Raster data {#sec-data-output-raster}\n\nTo write a raster file using **rasterio**, we need to pass a raster file path to `rasterio.open`, in writing (`'w'`) mode. This implies creating a new empty file (or overwriting an existing one). As opposed to read (`'r'`, the default) mode, the `rasterio.open` function needs quite a lot of information, in addition to the file path and mode:\n\n* An array with the raster values\n* Metadata describing the raster format and spatial properties\n\nThe metadata needs to specify the following properties:\n\n* `driver`---The file format (the recommendation is `'GTiff'`, for GeoTIFF)\n* `height`---Number of rows\n* `width`---Number of columns\n* `count`---Number of bands\n* `nodata`---The value which represents \"No Data\", if any\n* `dtype`---The raster data type, one of the **numpy** types (e.g., `np.int64`)\n* `crs`---The CRS, e.g., using an EPSG code (such as `4326`)\n* `transform`---The transform matrix\n* `compress`---A compression method to apply, such as `'lzw'`. This is optional and most useful for large rasters. Note that, at the time of writing, this [doesn't work well](https://gis.stackexchange.com/questions/404738/why-does-rasterio-compression-reduces-image-size-with-single-band-but-not-with-m) for writing multiband rasters.\n\nOnce the file connection with the right metadata is ready, we do the actual writing using the `.write` method of the file connection. If there are several bands, we may execute the `.write` method several times, as in `.write(a, n)`, where `a` is the array with the band values and `n` is the band index (starting from `1`, see below). When done, we close the file connection using the `.close` method. Some functions, such as `rasterio.warp.reproject` used for resampling and reprojecting, directly accept a file connection in `'w'` mode, thus handling the writing (of a resampled or reprojected raster) for us.\n\nMost of the properties are either straightforward to choose, based on our aims (e.g., `driver`, `crs`, `compress`, `nodata`), or directly derived from the array with the raster values itself (e.g., `height`, `width`, `count`, `dtype`). The most complicated property is the `transform`, which specifies the raster origin and resolution. The `transform` is typically either obtained from an existing raster (serving as a \"template\"), created from scratch based on manually specified origin and resolution values (e.g., using `rasterio.transform.from_origin`), or calculated automatically (e.g., using `rasterio.warp.calculate_default_transform`).\n\nEarlier in the book, we have already demonstrated the four most common scenarios of writing rasters:\n\n* Creating from scratch (@sec-raster-from-scratch)---We created and wrote two rasters from scratch by associating the `elev` and `grain` arrays with an arbitrary spatial extent. 
The custom arbitrary `transform` was created using `rasterio.transform.from_origin`.\n* Aggregating (@sec-raster-agg-disagg)---We wrote an aggregated raster, by reading a resampled array from an existing raster, then updating the `transform` using `.transform.scale`.\n* Resampling (@sec-raster-resampling)---We resampled a raster into a custom grid, manually creating the `transform` using `rasterio.transform.from_origin`, then resampling and writing the output using `rasterio.warp.reproject`.\n* Reprojecting (@sec-reprojecting-raster-geometries)---We reprojected a raster into another CRS, by automatically calculating an optimal `transform` using `rasterio.warp.calculate_default_transform`, then resampling and writing the output using `rasterio.warp.reproject`.\n\nA minimal example of writing a raster file named `r.tif` from scratch (i.e., the 1^st^ scenario), to recap some of these concepts, is given below:\n\n::: {.cell execution_count=44}\n``` {.python .cell-code}\n# An array with raster values\nr = np.array([1, 2, 3, 4]).reshape(2, 2).astype(np.int8)\nr\n```\n\n::: {.cell-output .cell-output-display execution_count=111}\n```\narray([[1, 2],\n       [3, 4]], dtype=int8)\n```\n:::\n:::\n\n\n::: {.cell execution_count=45}\n``` {.python .cell-code}\n# Calculating the transform\nnew_transform = rasterio.transform.from_origin(\n    west=-0.5, \n    north=51.5, \n    xsize=2, \n    ysize=2\n)\nnew_transform\n```\n\n::: {.cell-output .cell-output-display execution_count=112}\n```\nAffine(2.0, 0.0, -0.5,\n       0.0, -2.0, 51.5)\n```\n:::\n:::\n\n\n::: {.cell execution_count=46}\n``` {.python .cell-code}\n# Creating the file connection with the metadata\ndst = rasterio.open(\n    'output/r.tif', 'w', \n    driver='GTiff',\n    height=r.shape[0],\n    width=r.shape[1],\n    count=1,\n    dtype=r.dtype,\n    crs=4326,\n    transform=new_transform\n)\ndst\n```\n\n::: {.cell-output .cell-output-display execution_count=113}\n```\n<open DatasetWriter name='output/r.tif' mode='w'>\n```\n:::\n:::\n\n\n::: {.cell execution_count=47}\n``` {.python .cell-code}\n# Writing the array values into the file\ndst.write(r, 1)\n```\n:::\n\n\n::: {.cell execution_count=48}\n``` {.python .cell-code}\n# Closing the file\ndst.close()\n```\n:::\n\n\nThis code section creates a new file, `output/r.tif`, which is a $2 \\times 2$ raster with a resolution of 2 decimal degrees, and with its top-left corner placed over London.\n\nTo summarize, the various scenarios differ in two aspects:\n\n* The way that the `transform` for the output raster is obtained: \n    * Imported from an existing raster (see below)\n    * Created from scratch, using `rasterio.transform.from_origin` (@sec-raster-from-scratch)\n    * Calculated automatically, using `rasterio.warp.calculate_default_transform` (@sec-reprojecting-raster-geometries)\n* The way that the raster is written:\n    * Using the `.write` method, given an existing array (@sec-raster-from-scratch, @sec-raster-agg-disagg)\n    * Using `rasterio.warp.reproject` to calculate and write a resampled or reprojected array (@sec-raster-resampling, @sec-reprojecting-raster-geometries)\n\nTo make the picture of raster export complete, there are three important concepts we haven't covered yet: array and raster data types, writing multiband rasters, and handling \"No Data\" values.\n\nArrays (i.e., `ndarray` objects defined in package **numpy**) are used to store raster values when reading them from file, using `.read` (@sec-using-rasterio). 
All values in an array are of the same type, and the **numpy** package supports numerous numeric data types of various precision (and, accordingly, memory footprints). Raster formats, such as GeoTIFF, support the very same data types, which means that raster values can be read into memory using exactly as much RAM as necessary, with no type conversion. The most relevant types are summarized in @tbl-numpy-data-types.\n\n| Data type | Description |\n|---|-------------|\n| `int8` | Integer in 8 bits (`-128` to `127`) |\n| `int16` | Integer in 16 bits (`-32768` to `32767`) |\n| `int32` | Integer in 32 bits (`-2147483648` to `2147483647`) |\n| `uint8` | Unsigned integer in 8 bits (`0` to `255`) |\n| `uint16` | Unsigned integer in 16 bits (`0` to `65535`) |\n| `uint32` | Unsigned integer in 32 bits (`0` to `4294967295`) |\n| `float16` | Half-precision (16 bit) float (`-65504` to `65504`) |\n| `float32` | Single-precision (32 bit) float (approximately `-3.4e38` to `3.4e38`) |\n| `float64` | Double-precision (64 bit) float (approximately `-1.8e308` to `1.8e308`) |\n\n: Numeric `numpy` data types commonly used for rasters {#tbl-numpy-data-types}\n\nThe raster data type can be specified when writing a raster (see above). For an existing raster file, the data type is accessible through the `.dtype` property of the metadata: \n\n::: {.cell execution_count=49}\n``` {.python .cell-code}\nrasterio.open('output/r.tif').meta['dtype']\n```\n\n::: {.cell-output .cell-output-display execution_count=116}\n```\n'int8'\n```\n:::\n:::\n\n\nThe file `r.tif` has the data type `np.int8`, which we specified when creating it, according to the data type of the original array:\n\n::: {.cell execution_count=50}\n``` {.python .cell-code}\nr.dtype\n```\n\n::: {.cell-output .cell-output-display execution_count=117}\n```\ndtype('int8')\n```\n:::\n:::\n\n\nWhen reading the data back into the Python session, an array with the same data type is recreated:\n\n::: {.cell execution_count=51}\n``` {.python .cell-code}\nrasterio.open('output/r.tif').read().dtype\n```\n\n::: {.cell-output .cell-output-display execution_count=118}\n```\ndtype('int8')\n```\n:::\n:::
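\n\nWhen choosing a data type for a raster we are about to write, it may be helpful to check the valid value range of the candidate types programmatically, rather than memorizing @tbl-numpy-data-types. A small sketch, using the `np.iinfo` (integer types) and `np.finfo` (float types) functions:\n\n::: {.cell}\n``` {.python .cell-code}\n# Checking the value ranges of candidate data types\nnp.iinfo(np.int8)     # expected: iinfo(min=-128, max=127, dtype=int8)\nnp.finfo(np.float32)  # includes min=-3.4028235e+38, max=3.4028235e+38\n```\n:::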
\n\nWriting multiband rasters is similar to writing single-band rasters, except that we need to: \n\n* Define the number of layers (the `count` property in the metadata) that are going to be in the file we are creating\n* Execute the `.write` method multiple times, once for each layer\n\nFor completeness, let's demonstrate writing a multi-band raster named `r3.tif`, which is similar to `r.tif`, but has three bands with the values `r`, `r*2`, and `r*3` (i.e., the array `r` multiplied by 1, 2, or 3). Since most of the metadata is going to be the same, this is also a good opportunity to (re-)demonstrate updating an existing metadata object rather than creating one from scratch.\n\nFirst, let's make a copy of the metadata we already have in `r.tif`:\n\n::: {.cell execution_count=52}\n``` {.python .cell-code}\ndst_kwds = rasterio.open('output/r.tif').meta.copy()\ndst_kwds\n```\n\n::: {.cell-output .cell-output-display execution_count=119}\n```\n{'driver': 'GTiff',\n 'dtype': 'int8',\n 'nodata': None,\n 'width': 2,\n 'height': 2,\n 'count': 1,\n 'crs': CRS.from_epsg(4326),\n 'transform': Affine(2.0, 0.0, -0.5,\n       0.0, -2.0, 51.5)}\n```\n:::\n:::\n\n\nSecond, we update the `count` entry, replacing `1` (single-band) with `3` (three-band):\n\n::: {.cell execution_count=53}\n``` {.python .cell-code}\ndst_kwds.update(count=3)\ndst_kwds\n```\n\n::: {.cell-output .cell-output-display execution_count=120}\n```\n{'driver': 'GTiff',\n 'dtype': 'int8',\n 'nodata': None,\n 'width': 2,\n 'height': 2,\n 'count': 3,\n 'crs': CRS.from_epsg(4326),\n 'transform': Affine(2.0, 0.0, -0.5,\n       0.0, -2.0, 51.5)}\n```\n:::\n:::\n\n\nFinally, we can create a file connection using the updated metadata, and then write the values of the three bands:\n\n::: {.cell execution_count=54}\n``` {.python .cell-code}\ndst = rasterio.open('output/r3.tif', 'w', **dst_kwds)\ndst.write(r, 1)\ndst.write(r*2, 2)\ndst.write(r*3, 3)\ndst.close()\n```\n:::\n\n\nAs a result, a three-band raster named `r3.tif` is created.
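\n\nWe can make sure the result is as expected by reading the new file back into an array (a quick sketch; note the three bands in the resulting shape):\n\n::: {.cell}\n``` {.python .cell-code}\n# Reading 'r3.tif' back: the resulting array holds the three bands\n# r, r*2, and r*3; expected shape: (3, 2, 2)\nrasterio.open('output/r3.tif').read().shape\n```\n:::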
\n\nRasters often contain \"No Data\" values, representing missing data, e.g., unreliable measurements due to clouds, or pixels outside of the photographed extent. \nIn a **numpy** `ndarray` object, \"No Data\" values may be represented by the special `np.nan` value. \nHowever, due to the way numbers are represented in memory, only arrays of type `float` can contain `np.nan`, while arrays of type `int` cannot. \nFor `int` rasters containing \"No Data\", we typically mark missing data with a specific value beyond the valid range (e.g., `-9999`). \nThe missing data \"flag\" is stored in the file (set through the `nodata` property of the file connection, see above). \nWhen reading an `int` raster with \"No Data\" back into Python, we need to be aware of these flags. Let's demonstrate through examples.\n\nWe will start with the simpler case, rasters of type `float`. Since `float` arrays may contain the \"native\" value `np.nan`, representing \"No Data\" is straightforward. \nFor example, suppose that we have a `float` array with `np.nan`:\n\n::: {.cell execution_count=55}\n``` {.python .cell-code}\nr = np.array([1.1, 2.1, np.nan, 4.1]).reshape(2, 2)\nr\n```\n\n::: {.cell-output .cell-output-display execution_count=122}\n```\narray([[1.1, 2.1],\n       [nan, 4.1]])\n```\n:::\n:::\n\n\n::: {.cell execution_count=56}\n``` {.python .cell-code}\nr.dtype\n```\n\n::: {.cell-output .cell-output-display execution_count=123}\n```\ndtype('float64')\n```\n:::\n:::\n\n\nWhen writing the array to file, we do not need to specify any particular `nodata` value:\n\n::: {.cell execution_count=57}\n``` {.python .cell-code}\ndst = rasterio.open(\n    'output/r_nodata_float.tif', 'w', \n    driver='GTiff',\n    height=r.shape[0],\n    width=r.shape[1],\n    count=1,\n    dtype=r.dtype,\n    crs=4326,\n    transform=new_transform\n)\ndst.write(r, 1)\ndst.close()\n```\n:::\n\n\nThis is equivalent to `nodata=None`:\n\n::: {.cell execution_count=58}\n``` {.python .cell-code}\nrasterio.open('output/r_nodata_float.tif').meta\n```\n\n::: {.cell-output .cell-output-display execution_count=125}\n```\n{'driver': 'GTiff',\n 'dtype': 'float64',\n 'nodata': None,\n 'width': 2,\n 'height': 2,\n 'count': 1,\n 'crs': CRS.from_epsg(4326),\n 'transform': Affine(2.0, 0.0, -0.5,\n       0.0, -2.0, 51.5)}\n```\n:::\n:::\n\n\nReading the raster back into the Python session reproduces the exact same array, including `np.nan`:\n\n::: {.cell execution_count=59}\n``` {.python .cell-code}\nrasterio.open('output/r_nodata_float.tif').read()\n```\n\n::: {.cell-output .cell-output-display execution_count=126}\n```\narray([[[1.1, 2.1],\n        [nan, 4.1]]])\n```\n:::\n:::\n\n\nNow, suppose that we have an `np.int32` array with missing data, which is inevitably flagged using a specific `int` value, such as `-9999` (remember that we can't store `np.nan` in an `int` array!):\n\n::: {.cell execution_count=60}\n``` {.python .cell-code}\nr = np.array([1, 2, -9999, 4]).reshape(2, 2).astype(np.int32)\nr\n```\n\n::: {.cell-output .cell-output-display execution_count=127}\n```\narray([[    1,     2],\n       [-9999,     4]], dtype=int32)\n```\n:::\n:::\n\n\n::: {.cell execution_count=61}\n``` {.python .cell-code}\nr.dtype\n```\n\n::: {.cell-output .cell-output-display execution_count=128}\n```\ndtype('int32')\n```\n:::\n:::\n\n\nWhen writing the array to file, we must specify `nodata=-9999` to keep track of our \"No Data\" flag:\n\n::: {.cell execution_count=62}\n``` {.python .cell-code}\ndst = rasterio.open(\n    'output/r_nodata_int.tif', 'w', \n    driver='GTiff',\n    height=r.shape[0],\n    width=r.shape[1],\n    count=1,\n    dtype=r.dtype,\n    nodata=-9999,\n    crs=4326,\n    transform=new_transform\n)\ndst.write(r, 1)\ndst.close()\n```\n:::\n\n\nExamining the metadata confirms that the `nodata=-9999` setting was stored in the file `r_nodata_int.tif`:\n\n::: {.cell execution_count=63}\n``` {.python .cell-code}\nrasterio.open('output/r_nodata_int.tif').meta\n```\n\n::: {.cell-output .cell-output-display execution_count=130}\n```\n{'driver': 'GTiff',\n 'dtype': 'int32',\n 'nodata': -9999.0,\n 'width': 2,\n 'height': 2,\n 'count': 1,\n 'crs': CRS.from_epsg(4326),\n 'transform': Affine(2.0, 0.0, -0.5,\n       0.0, -2.0, 51.5)}\n```\n:::\n:::\n\n\nIf you open the file in GIS software, such as QGIS, you will see the missing data interpreted as such (e.g., the pixel shown as blank), meaning that the software is aware of the flag.
\nHowever, reading the data back into Python reproduces an `int` array with `-9999`, for the same reason stated before:\n\n::: {.cell execution_count=64}\n``` {.python .cell-code}\nsrc = rasterio.open('output/r_nodata_int.tif')\nr = src.read()\nr\n```\n\n::: {.cell-output .cell-output-display execution_count=131}\n```\narray([[[    1,     2],\n        [-9999,     4]]], dtype=int32)\n```\n:::\n:::\n\n\nThe Python user must therefore be mindful of \"No Data\" `int` rasters, for example to avoid interpreting the value `-9999` literally. \nFor example, if we \"forget\" about the `nodata` flag, the literal calculation of the `.mean` would incorrectly include the value `-9999`:\n\n::: {.cell execution_count=65}\n``` {.python .cell-code}\nr.mean()\n```\n\n::: {.cell-output .cell-output-display execution_count=132}\n```\n-2498.0\n```\n:::\n:::\n\n\nThere are two basic ways to deal with the situation:\n\n* Converting the raster to `float`\n* Using \"No Data\" masks\n\nFirst, particularly with small rasters where memory constraints are irrelevant, it may be more convenient to go from `int` to `float`, to gain the ability to use the natural `np.nan` representation. \nHere is how we can do this with `r_nodata_int.tif`: we detect the missing data flag, convert the raster to `float`, and assign `np.nan` to the cells that are supposed to be missing:\n\n::: {.cell execution_count=66}\n``` {.python .cell-code}\nmask = r == src.nodata\nr = r.astype(np.float64)\nr[mask] = np.nan\nr\n```\n\n::: {.cell-output .cell-output-display execution_count=133}\n```\narray([[[ 1.,  2.],\n        [nan,  4.]]])\n```\n:::\n:::\n\n\nFrom there on, we deal with `np.nan` the usual way, such as using `np.nanmean` to calculate the mean excluding \"No Data\":\n\n::: {.cell execution_count=67}\n``` {.python .cell-code}\nnp.nanmean(r)\n```\n\n::: {.cell-output .cell-output-display execution_count=134}\n```\n2.3333333333333335\n```\n:::\n:::\n\n\nThe second approach is to read the values into a so-called [\"masked\" array](https://numpy.org/doc/stable/reference/maskedarray.generic.html#what-is-a-masked-array), using the argument `masked=True`. \nA masked array can be thought of as an extended `ndarray`, with two components: `.data` (the values) and `.mask` (a corresponding boolean array marking \"No Data\" values): \n\n::: {.cell execution_count=68}\n``` {.python .cell-code}\nr = src.read(masked=True)\nr\n```\n\n::: {.cell-output .cell-output-display execution_count=135}\n```\nmasked_array(\n  data=[[[1, 2],\n         [--, 4]]],\n  mask=[[[False, False],\n         [ True, False]]],\n  fill_value=-9999,\n  dtype=int32)\n```\n:::\n:::\n\n\nA detailed treatment of masked arrays is beyond the scope of this book. However, the basic idea is that many **numpy** operations \"honor\" the mask, so that the user does not have to keep track of the way that \"No Data\" values are marked, similarly to the natural `np.nan` representation. For example, the `.mean` of a masked array ignores the value `-9999`, because it is masked, taking into account just the valid values `1`, `2`, and `4`:\n\n::: {.cell execution_count=69}\n``` {.python .cell-code}\nr.mean()\n```\n\n::: {.cell-output .cell-output-display execution_count=136}\n```\n2.3333333333333335\n```\n:::\n:::\n\n\nKeep in mind that, somewhat confusingly, `float` rasters may represent \"No Data\" using a specific value (such as `-9999.0`), instead of, or in addition to (!), the native `np.nan` representation. \nIn such cases, the same considerations shown for `int` apply to `float` rasters as well.
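\n\nOne generic way to switch to the native `np.nan` representation, whatever the flag happens to be, is to read the values into a masked array and then \"fill\" the masked entries with `np.nan`. A minimal sketch, reusing the `int` raster from above (note the cast to `float`, required for the array to be able to hold `np.nan`):\n\n::: {.cell}\n``` {.python .cell-code}\n# Reading as a masked array, casting to float, then replacing the masked\n# (\"No Data\") entries with np.nan;\n# expected: array([[[ 1.,  2.], [nan,  4.]]])\nr = rasterio.open('output/r_nodata_int.tif').read(masked=True)\nr.astype(np.float64).filled(np.nan)\n```\n:::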
\n\n## Exercises\n\n", - "supporting": [ - "07-read-write-plot_files" - ], - "filters": [], - "includes": { - "include-in-header": [ - "\n\n\n" - ] - } - } -} \ No newline at end of file diff --git a/07-read-write-plot_files/figure-html/cell-22-output-1.svg b/07-read-write-plot_files/figure-html/cell-22-output-1.svg deleted file mode 100644 index 5d91f281..00000000 --- a/07-read-write-plot_files/figure-html/cell-22-output-1.svg +++ /dev/null @@ -1 +0,0 @@ - \ No newline at end of file diff --git a/07-read-write-plot_files/figure-html/cell-25-output-1.png b/07-read-write-plot_files/figure-html/cell-25-output-1.png deleted file mode 100644 index 63d20b27..00000000 Binary files a/07-read-write-plot_files/figure-html/cell-25-output-1.png and /dev/null differ diff --git a/07-read-write-plot_files/figure-html/cell-26-output-1.png b/07-read-write-plot_files/figure-html/cell-26-output-1.png deleted file mode 100644 index 63d20b27..00000000 Binary files a/07-read-write-plot_files/figure-html/cell-26-output-1.png and /dev/null differ diff --git a/07-read-write-plot_files/figure-html/cell-27-output-1.png b/07-read-write-plot_files/figure-html/cell-27-output-1.png deleted file mode 100644 index 1037ab9e..00000000 Binary files a/07-read-write-plot_files/figure-html/cell-27-output-1.png and /dev/null differ diff --git a/07-read-write-plot_files/figure-html/cell-35-output-1.png b/07-read-write-plot_files/figure-html/cell-35-output-1.png deleted file mode 100644 index 26345135..00000000 Binary files a/07-read-write-plot_files/figure-html/cell-35-output-1.png and /dev/null differ diff --git a/07-read-write-plot_files/figure-html/cell-35-output-2.png b/07-read-write-plot_files/figure-html/cell-35-output-2.png deleted file mode 100644 index 4d39db85..00000000 Binary files a/07-read-write-plot_files/figure-html/cell-35-output-2.png and /dev/null differ diff --git a/07-read-write-plot_files/figure-html/cell-35-output-3.png b/07-read-write-plot_files/figure-html/cell-35-output-3.png deleted file mode 100644 index 23a0ee93..00000000 Binary files a/07-read-write-plot_files/figure-html/cell-35-output-3.png and /dev/null differ diff --git a/07-read-write-plot_files/figure-html/cell-36-output-1.png b/07-read-write-plot_files/figure-html/cell-36-output-1.png deleted file mode 100644 index 1829efd0..00000000 Binary files a/07-read-write-plot_files/figure-html/cell-36-output-1.png and /dev/null differ diff --git a/07-read-write-plot_files/figure-html/cell-36-output-2.png b/07-read-write-plot_files/figure-html/cell-36-output-2.png deleted file mode 100644 index 87235074..00000000 Binary files a/07-read-write-plot_files/figure-html/cell-36-output-2.png and /dev/null differ diff --git a/07-read-write-plot_files/figure-html/cell-37-output-1.png b/07-read-write-plot_files/figure-html/cell-37-output-1.png deleted file mode 100644 index 1829efd0..00000000 Binary files a/07-read-write-plot_files/figure-html/cell-37-output-1.png and /dev/null differ diff --git a/07-read-write-plot_files/figure-html/fig-cycle_hire_xy-layer-output-1.png b/07-read-write-plot_files/figure-html/fig-cycle_hire_xy-layer-output-1.png deleted file mode 100644 index 63d20b27..00000000 Binary files a/07-read-write-plot_files/figure-html/fig-cycle_hire_xy-layer-output-1.png and /dev/null differ diff --git a/07-read-write-plot_files/figure-html/fig-ne-airports-output-1.png b/07-read-write-plot_files/figure-html/fig-ne-airports-output-1.png deleted file mode 100644 index bff71a3e..00000000 Binary files 
a/07-read-write-plot_files/figure-html/fig-ne-airports-output-1.png and /dev/null differ diff --git a/07-read-write-plot_files/figure-html/fig-ne-counties-output-1.png b/07-read-write-plot_files/figure-html/fig-ne-counties-output-1.png deleted file mode 100644 index 887e4398..00000000 Binary files a/07-read-write-plot_files/figure-html/fig-ne-counties-output-1.png and /dev/null differ diff --git a/07-read-write-plot_files/figure-html/fig-raster-window-output-1.png b/07-read-write-plot_files/figure-html/fig-raster-window-output-1.png deleted file mode 100644 index f59e8650..00000000 Binary files a/07-read-write-plot_files/figure-html/fig-raster-window-output-1.png and /dev/null differ diff --git a/07-read-write-plot_files/figure-html/fig-read-shp-query-output-1.png b/07-read-write-plot_files/figure-html/fig-read-shp-query-output-1.png deleted file mode 100644 index 92c363f6..00000000 Binary files a/07-read-write-plot_files/figure-html/fig-read-shp-query-output-1.png and /dev/null differ diff --git a/07-read-write-plot_files/figure-html/fig-read-shp-query-output-2.png b/07-read-write-plot_files/figure-html/fig-read-shp-query-output-2.png deleted file mode 100644 index fa52723e..00000000 Binary files a/07-read-write-plot_files/figure-html/fig-read-shp-query-output-2.png and /dev/null differ diff --git a/07-read-write-plot_files/figure-html/fig-tanzania-mask-output-1.svg b/07-read-write-plot_files/figure-html/fig-tanzania-mask-output-1.svg deleted file mode 100644 index 5d91f281..00000000 --- a/07-read-write-plot_files/figure-html/fig-tanzania-mask-output-1.svg +++ /dev/null @@ -1 +0,0 @@ - \ No newline at end of file diff --git a/07-read-write-plot_files/figure-html/fig-tanzania-mask-output-2.png b/07-read-write-plot_files/figure-html/fig-tanzania-mask-output-2.png deleted file mode 100644 index 68fa22d5..00000000 Binary files a/07-read-write-plot_files/figure-html/fig-tanzania-mask-output-2.png and /dev/null differ diff --git a/07-read-write-plot_files/figure-html/fig-world_wkt-layer-output-1.png b/07-read-write-plot_files/figure-html/fig-world_wkt-layer-output-1.png deleted file mode 100644 index 1037ab9e..00000000 Binary files a/07-read-write-plot_files/figure-html/fig-world_wkt-layer-output-1.png and /dev/null differ diff --git a/output/world.gpkg b/output/world.gpkg index 81a8c24b..5ab4eacd 100644 Binary files a/output/world.gpkg and b/output/world.gpkg differ diff --git a/output/world_many_features.gpkg b/output/world_many_features.gpkg index 32254a67..280c1e38 100644 Binary files a/output/world_many_features.gpkg and b/output/world_many_features.gpkg differ diff --git a/output/world_many_layers.gpkg b/output/world_many_layers.gpkg index e9fc1540..3ebe90b8 100644 Binary files a/output/world_many_layers.gpkg and b/output/world_many_layers.gpkg differ