Learning Geospatial Analysis with Python

By: Joel Lawhead

Overview of this book

Geospatial analysis is used in almost every field you can think of, from medicine to defense to farming. This book will guide you gently into this exciting and complex field. It walks you through the building blocks of geospatial analysis and how to apply them to influence decision making using the latest Python software. Learning Geospatial Analysis with Python, 2nd Edition uses the expressive and powerful Python 3 programming language to guide you through geographic information systems, remote sensing, topography, and more, while providing a framework for you to approach geospatial analysis effectively, but on your own terms. We start by giving you a little background on the field, and a survey of the techniques and technology used. We then split the field into its component specialty areas: GIS, remote sensing, elevation data, advanced modeling, and real-time data. This book will teach you everything you need to know about geospatial analysis, from using a particular software package or API to using generic algorithms that can be applied. This book focuses on pure Python whenever possible to minimize compiling platform-dependent binaries, so that you don’t become bogged down in just getting ready to do analysis. This book will round out your technical library through handy recipes that will give you a good understanding of a field that supplements many modern-day human endeavors.

Geographic information system concepts


In order to begin geospatial analysis, it is important to understand some key underlying concepts unique to the field. The list isn't long, but nearly every aspect of analysis traces back to one of these ideas.

Thematic maps

As its name suggests, a thematic map portrays a specific theme. A general reference map visually represents features as they relate geographically for navigation or planning. A thematic map goes beyond location to provide the geographic context for information around a central idea. Usually, a thematic map is designed for a targeted audience to answer specific questions. The value of thematic maps lies in what they do not show. A thematic map will use minimal geographic features to avoid distracting the reader from the theme. Most thematic maps include political boundaries, such as country or state borders, but omit navigational features, such as street names or points of interest, beyond the major landmarks that orient the reader. The cholera map by Dr. John Snow earlier in this chapter is a perfect example of a thematic map. Common uses for thematic maps include visualizing health issues such as disease, election results, and environmental phenomena such as rainfall. These maps are also the most common output of geospatial analysis. The following map from the United States Census Bureau shows cancer mortality rates by state:

Thematic maps tell a story and are very useful. However, it is important to remember that while thematic maps are models of reality like any other map, they are also generalizations of information. Two different analysts using the same source of information will often come up with very different thematic maps depending on how they analyze and summarize the data. They may also choose to focus on different aspects of the dataset. The technical nature of thematic maps often leads people to treat them as if they were scientific evidence. However, geospatial analysis is often inconclusive. While the analysis may be based on scientific data, the analyst does not always follow the rigor of the scientific method. In his classic book, How to Lie with Maps (University of Chicago Press), Mark Monmonier demonstrates in great detail how maps are easily manipulated models of reality that are commonly abused. This fact doesn't degrade the value of these tools. The legendary statistician, George Box, wrote in his 1987 book, Empirical Model-Building and Response Surfaces, that "Essentially, all models are wrong, but some are useful." Thematic maps have been used as guides to start (and end) wars, stop deadly diseases in their tracks, win elections, feed nations, fight poverty, protect endangered species, and rescue those impacted by disaster. Thematic maps may be the most useful models ever created.

Spatial databases

In its purest form, a database is simply an organized collection of information. A database management system (DBMS) is a suite of software used to interact with a database. People often use the word database as a catch-all term referring to both the DBMS and the underlying data structure. Databases typically contain alphanumeric data and, in some cases, binary large objects or blobs, which can store binary data such as images. Most databases also allow a relational database structure in which entries in normalized tables can reference each other in order to create many-to-one and one-to-many relationships among data.

Spatial databases, also known as geodatabases, use specialized software to extend a traditional relational database management system (RDBMS) to store and query data defined in two-dimensional or three-dimensional space. Some systems also account for a series of data over time. In a spatial database, attributes about geographic features are stored and queried as traditional relational database structures. The spatial extensions allow you to query geometries using Structured Query Language (SQL) in much the same way as traditional database queries. Spatial queries and attribute queries can also be combined to select results based on both location and attributes.
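As a rough sketch of combining a location filter with an attribute filter, the following example uses Python's built-in sqlite3 module. Plain SQLite has no spatial extension, so the bounding-box test here is written as ordinary comparisons on x and y columns; a real geodatabase would expose geometry types and functions instead. The table, column names, and city records are hypothetical and exist only for illustration.

```python
import sqlite3

# In-memory database holding a hypothetical table of point features.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cities (name TEXT, population INTEGER, x REAL, y REAL)")
conn.executemany(
    "INSERT INTO cities VALUES (?, ?, ?, ?)",
    [
        ("Alpha", 500000, -90.0, 30.0),
        ("Beta", 20000, -89.5, 30.5),
        ("Gamma", 750000, -100.0, 40.0),
    ],
)


def cities_in_box(conn, min_x, min_y, max_x, max_y, min_population):
    """Return names of cities inside a bounding box that meet a population threshold."""
    rows = conn.execute(
        "SELECT name FROM cities "
        "WHERE x BETWEEN ? AND ? AND y BETWEEN ? AND ? "  # location filter
        "AND population >= ? "                            # attribute filter
        "ORDER BY name",
        (min_x, max_x, min_y, max_y, min_population),
    ).fetchall()
    return [name for (name,) in rows]


result = cities_in_box(conn, -91.0, 29.0, -89.0, 31.0, 100000)
print(result)
```

Both hypothetical cities Alpha and Beta fall inside the box, but only Alpha passes the population filter, which is the kind of combined selection described above.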

Spatial indexing

Spatial indexing is a process that organizes geospatial vector data for faster retrieval. It is a way of prefiltering the data for common queries or rendering. Indexing is commonly used in large databases to speed up query results, and spatial data is no different. Even a moderately sized geodatabase can contain millions of points or objects. Without an index, a spatial query must consider every point in the database in order to include it in or eliminate it from the results. Spatial indexing groups data in ways that allow large portions of the dataset to be eliminated from consideration by doing computationally simpler checks before going into a detailed and slower analysis of the remaining items.
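The prefiltering idea can be illustrated with a very simple grid index written in pure Python. This is only a sketch under stated assumptions: production systems typically use more sophisticated structures such as R-trees or quadtrees, and the GridIndex class and its method names are invented for this example. Points are bucketed into square cells, so a bounding-box query only inspects the cells that overlap the box and then runs the exact test on the few remaining candidates.

```python
from collections import defaultdict
from math import floor


class GridIndex:
    """A toy spatial index that buckets points into fixed-size square cells."""

    def __init__(self, cell_size):
        self.cell_size = cell_size
        self.cells = defaultdict(list)

    def _cell(self, x, y):
        # The cell lookup is cheap: one division and one floor per axis.
        return (floor(x / self.cell_size), floor(y / self.cell_size))

    def insert(self, x, y, item):
        self.cells[self._cell(x, y)].append((x, y, item))

    def query(self, min_x, min_y, max_x, max_y):
        """Return items whose points fall inside the bounding box."""
        min_cx, min_cy = self._cell(min_x, min_y)
        max_cx, max_cy = self._cell(max_x, max_y)
        hits = []
        # Visit only the cells that overlap the box; everything else is skipped.
        for cx in range(min_cx, max_cx + 1):
            for cy in range(min_cy, max_cy + 1):
                for x, y, item in self.cells.get((cx, cy), ()):
                    # Exact check on the remaining candidates.
                    if min_x <= x <= max_x and min_y <= y <= max_y:
                        hits.append(item)
        return hits


index = GridIndex(cell_size=10.0)
index.insert(1.0, 1.0, "a")
index.insert(12.0, 3.0, "b")
index.insert(55.0, 55.0, "c")
found = index.query(0.0, 0.0, 15.0, 5.0)
print(found)
```

The cell size is a tuning choice: cells that are too large leave many candidates for the exact check, while cells that are too small make a query visit many empty cells.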

Metadata

Metadata is defined as data about data. Accordingly, geospatial metadata is data about geospatial datasets that provides traceability for the source and history of a dataset as well as a summary of its technical details. Metadata also supports the long-term preservation of information holdings. Geospatial metadata can be represented by several possible standards. One of the most prominent is the international standard ISO 19115-1, which includes hundreds of potential fields to describe a single geospatial dataset. Additionally, ISO 19115-2 includes extensions for geospatial imagery and gridded data. Some example fields include spatial representation, temporal extent, and lineage. The primary use of metadata is cataloging datasets. Modern metadata can be ingested by geographic search engines, making it potentially discoverable by other systems automatically. It also lists points of contact for a dataset if you have questions. Metadata is an important support tool for geospatial analysts and adds credibility and accessibility to your work. The Open Geospatial Consortium (OGC) created the Catalogue Service for the Web (CSW) to manage metadata. The pycsw Python library implements the CSW standard. You can learn more about it at http://pycsw.org.

Map projections

Map projections have entire books devoted to them and can be a challenge for new analysts. If you take any three-dimensional object and flatten it on a plane, such as your screen or a sheet of paper, the object is distorted. Many grade school geography classes demonstrate this concept by having students peel an orange and then attempt to lay the peel flat on their desk in order to understand the resulting distortion. The same effect occurs when you take the round shape of the Earth and project it on a computer screen.

In geospatial analysis, you can manipulate this distortion to preserve common properties, such as area, scale, bearing, distance, or shape. There is no one-size-fits-all solution to map projections. The choice of projection is always a compromise: you gain accuracy in one property in exchange for error in another. A projection is typically represented as a set of over 40 parameters that define the transformation algorithm, stored either as XML or in a text format called Well-Known Text (WKT).

The International Association of Oil & Gas Producers (IOGP) maintains a registry of the best-known projections. The registry was originally created by the European Petroleum Survey Group (EPSG), which has since been absorbed into IOGP, and the entries in the registry are still known as EPSG codes. The EPSG maintained the registry as a common benefit for the oil and gas industry, which is a prolific user of geospatial analysis for energy exploration. At the last count, this registry contained over 5,000 entries.

As recently as 10 years ago, map projections were a primary concern for a geospatial analyst. Data storage was expensive, high-speed Internet was rare, and cloud computing didn't really exist. Geospatial data was typically exchanged among small groups working in separate areas of interest. The technology constraints at the time meant that geospatial analysis was highly localized. Analysts would use the best projection for their area of interest. Data in different projections cannot be displayed on the same map because they represent two different models of the Earth. Any time an analyst received data from a third party, it had to be reprojected before it could be used with the existing data. This process was tedious and time-consuming. Most geospatial data formats do not provide a way to store the projection information. This information is usually stored in an ancillary file as text or XML. As analysts didn't exchange data often, many people wouldn't bother defining projection information. Every analyst's nightmare was to come across an extremely valuable dataset that was missing its projection information, which rendered the dataset useless. The coordinates in the file are just numbers and offer no clue to the projection. With over 5,000 choices, it was nearly impossible to guess.

Now, thanks to modern software and the Internet making data exchange easier and more common, nearly every data format has added a metadata format that defines the projection or places it in the file header if supported. Advances in technology have also allowed for global basemaps, which encourage the use of shared projections such as the Google Mercator projection used for Google Maps. This projection is also known as Web Mercator and uses code EPSG:3857 (or the deprecated EPSG:900913). Geospatial portal projects such as OpenStreetMap.org and NationalAtlas.gov have consolidated datasets for much of the world in common projections. Modern geospatial software can also reproject data on the fly, saving the analyst the trouble of preprocessing the data before using it. Closely related to map projections are geodetic datums. A datum is a model of the Earth's surface used to match the location of features on the Earth to a coordinate system. One common datum is WGS 84, which is used by GPS devices.
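To make the idea of a projection as a transformation algorithm concrete, here is a minimal pure Python sketch of the spherical formulas commonly given for Web Mercator (EPSG:3857). It assumes longitude and latitude in degrees on WGS 84 and is only meaningful for latitudes within roughly 85 degrees of the equator. The function names are invented for this example; in practice, a dedicated projection library would normally handle this along with the thousands of other registry entries.

```python
import math

# WGS 84 semi-major axis in meters, used as the sphere radius by Web Mercator.
EARTH_RADIUS = 6378137.0


def lonlat_to_web_mercator(lon, lat):
    """Project longitude/latitude in degrees to Web Mercator meters."""
    x = EARTH_RADIUS * math.radians(lon)
    y = EARTH_RADIUS * math.log(math.tan(math.pi / 4 + math.radians(lat) / 2))
    return x, y


def web_mercator_to_lonlat(x, y):
    """Inverse of the projection above: meters back to degrees."""
    lon = math.degrees(x / EARTH_RADIUS)
    lat = math.degrees(2 * math.atan(math.exp(y / EARTH_RADIUS)) - math.pi / 2)
    return lon, lat


x, y = lonlat_to_web_mercator(-90.0, 30.0)
lon, lat = web_mercator_to_lonlat(x, y)
print(x, y, lon, lat)
```

Note how the y value grows without bound as the latitude approaches the poles, which is the distortion trade-off this projection accepts in exchange for preserving shape locally.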

Rendering

The exciting part of geospatial analysis is visualization. As geospatial analysis is a computer-based process, it is good to be aware of how geographic data appears on a computer screen.

Geographic data, including points, lines, and polygons, is stored numerically as one or more points, which come in (x,y) pairs or (x,y,z) tuples. The x represents the horizontal axis on a graph. The y represents the vertical axis. The z represents terrain elevation. In computer graphics, a computer screen is represented by x and y axes. A z axis is not used because the computer screen is treated as a two-dimensional plane by most graphics software APIs. However, as desktop computing power continues to improve, three-dimensional maps are starting to become more common.

Another important factor is screen coordinates versus world coordinates. Geographic data is stored in a coordinate system representing a grid overlaid on the Earth, which is three-dimensional and round. Screen coordinates, also known as pixel coordinates, represent a grid of pixels on a flat, two-dimensional computer screen. Mapping x and y world coordinates to pixel coordinates is fairly straightforward and involves a simple scaling algorithm. However, if a z coordinate exists, then a more complicated transform must be performed to map coordinates from three-dimensional space to a two-dimensional plane. These transformations can be computationally costly and therefore slow if not handled correctly.
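The simple scaling from world coordinates to pixel coordinates can be sketched in a few lines of Python. This is a minimal illustration, assuming the data's bounding box is known and that the image origin is the top-left corner, as in most graphics APIs. The function name and parameters are invented for this example.

```python
def world_to_pixel(x, y, bbox, width, height):
    """Scale a world coordinate into pixel coordinates.

    bbox is (min_x, min_y, max_x, max_y) in world units;
    width and height are the image size in pixels.
    """
    min_x, min_y, max_x, max_y = bbox
    # Position of the point as a fraction of the bounding box extent.
    x_ratio = (x - min_x) / (max_x - min_x)
    y_ratio = (y - min_y) / (max_y - min_y)
    px = int(x_ratio * (width - 1))
    # Screen y grows downward, so the vertical axis is flipped.
    py = int((1 - y_ratio) * (height - 1))
    return px, py


world_bbox = (-180.0, -90.0, 180.0, 90.0)
center = world_to_pixel(0.0, 0.0, world_bbox, 361, 181)
print(center)
```

The only subtlety is the flipped y axis: world y values increase northward, while pixel rows increase downward from the top of the screen.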

In the case of remote sensing data, the challenge is typically the file size. Even a moderately sized satellite image that is compressed can be tens, if not hundreds, of megabytes. Images can be compressed using lossless or lossy methods. Lossless methods use tricks to reduce the file size without discarding any data. Lossy compression algorithms reduce the file size by reducing the amount of data in the image while avoiding a significant change in the appearance of the image. Rendering an image on the screen can be computationally intensive. Most remote sensing file formats allow for the storing of multiple lower-resolution versions of the image (called overviews or pyramids) for the sole purpose of faster rendering at different scales. When you zoom out to a scale where you couldn't see the detail of the full-resolution image anyway, a preprocessed, lower-resolution version of the image is displayed quickly and seamlessly.
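The idea behind overviews can be sketched by repeatedly halving an image. The example below is an illustration only, assuming a single-band image stored as a list of rows and using simple 2x2 averaging. Real file formats and tools use their own resampling methods and storage layouts, and the function name is invented for this example.

```python
def build_overviews(image, levels):
    """Return up to `levels` progressively halved copies of a 2D list of pixel values."""
    overviews = []
    current = image
    for _ in range(levels):
        rows, cols = len(current) // 2, len(current[0]) // 2
        if rows == 0 or cols == 0:
            break  # The image cannot be halved any further.
        # Average each 2x2 block of the current level into one pixel.
        current = [
            [
                (current[2 * r][2 * c] + current[2 * r][2 * c + 1]
                 + current[2 * r + 1][2 * c] + current[2 * r + 1][2 * c + 1]) / 4
                for c in range(cols)
            ]
            for r in range(rows)
        ]
        overviews.append(current)
    return overviews


image = [
    [0, 0, 4, 4],
    [0, 0, 4, 4],
    [8, 8, 12, 12],
    [8, 8, 12, 12],
]
pyramid = build_overviews(image, levels=3)
print(pyramid)
```

Each level holds a quarter of the pixels of the one before it, so a renderer can pick the level closest to the current zoom instead of reading the full-resolution data.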

Remote sensing concepts

Most of the GIS concepts described so far also apply to raster data. However, raster data has some unique properties as well. Earlier in this chapter, in the history of remote sensing, the focus was on Earth imaging from aerial platforms. It is important to note that raster data can come in many forms, including ground-based radar, laser range finders, and other specialized devices to detect gases, radiation, and other forms of energy in a geographic context. For the purpose of this book, we will focus on remote sensing platforms that capture large amounts of Earth data. These sources include Earth imaging systems, certain types of elevation data, and some weather systems where applicable.

Images as data

Raster data is captured digitally as square tiles. This means that the data is stored on a computer as a numerical array of rows and columns. If the data is multispectral, the dataset will usually contain multiple arrays of the same size, which are geospatially referenced together to represent a single area on the Earth. These different arrays are called bands. Any numerical array can be represented on a computer as an image. In fact, all computer data is ultimately numbers. It is important in geospatial analysis to think of images as a numeric array because mathematical formulas are used to process them.

In remotely sensed images, each pixel represents both space (a location on the Earth of a certain size) and reflectance (the light reflected from the Earth at this location into space). So, each pixel has a ground size and contains a number representing the intensity. As each pixel is a number, we can apply mathematical formulas to this data to combine data from different bands and highlight specific classes of objects in the image. If the wavelength value is beyond the visible spectrum, we can highlight features not visible to the human eye. Substances such as chlorophyll in plants can be greatly contrasted using a specific formula called the Normalized Difference Vegetation Index (NDVI).

By processing remotely sensed images, we can turn this data into visual information. Using the NDVI formula, we can answer the question, what is the relative health of the plants in this image? You can also create new types of digital information, which can be used as input for computer programs to output other types of information.
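As a small illustration of treating bands as numeric arrays, the sketch below computes NDVI for each pixel using the standard formula, (NIR - Red) / (NIR + Red), where NIR is the near-infrared band. It assumes two equally sized bands stored as lists of rows. The sample values are made up, and the function name is invented for this example. Real workflows would typically use an array library rather than nested lists.

```python
def ndvi(red_band, nir_band):
    """Compute NDVI per pixel for two equally sized 2D lists of band values."""
    result = []
    for red_row, nir_row in zip(red_band, nir_band):
        row = []
        for red, nir in zip(red_row, nir_row):
            total = nir + red
            # Guard against division by zero where both bands are zero.
            row.append((nir - red) / total if total else 0.0)
        result.append(row)
    return result


# Made-up sample values for a 2x2 image.
red = [[50, 100], [0, 200]]
nir = [[150, 100], [0, 50]]
values = ndvi(red, nir)
print(values)
```

Values range from -1 to 1. Healthy vegetation reflects strongly in the near-infrared and absorbs red light, so it produces values closer to 1.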

Remote sensing and color

Computer screens display images as combinations of Red, Green, and Blue (RGB) to match the capability of the human eye. Satellites and other remote sensing imaging devices can capture light beyond this visible spectrum. On a computer, wavelengths beyond the visible spectrum are represented in the visible spectrum so that we can see them. These images are known as false color images. In remote sensing, for instance, infrared light makes moisture highly visible. This phenomenon has a variety of uses such as monitoring ground saturation during a flood or finding hidden leaks in a roof or levee.
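A false color image can be sketched as a simple reassignment of bands to display channels. The example below assumes three equally sized bands stored as lists of rows and follows a common color-infrared convention: near-infrared is shown as red, red as green, and green as blue. The function name and sample values are invented for this example.

```python
def false_color_composite(nir_band, red_band, green_band):
    """Build per-pixel (R, G, B) tuples with NIR, red, and green mapped to R, G, and B."""
    return [
        [(nir, red, green) for nir, red, green in zip(nir_row, red_row, green_row)]
        for nir_row, red_row, green_row in zip(nir_band, red_band, green_band)
    ]


# Made-up single-row bands with two pixels each.
composite = false_color_composite([[200, 10]], [[50, 20]], [[80, 30]])
print(composite)
```

With this mapping, anything that reflects strongly in the near-infrared, such as vegetation, appears bright red on screen even though that light is invisible to the eye.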