Geospatial Data Science Quick Start Guide

By: Abdishakur Hassan, Jayakrishnan Vijayaraghavan

Overview of this book

Data scientists, who have access to vast data streams, are a bit myopic when it comes to intrinsic and extrinsic location-based data and are missing out on the intelligence it can provide to their models. This book demonstrates techniques for combining data science and geospatial intelligence to build data models that use location-based data to give useful predictions and analyses. The book begins with a quick overview of the fundamentals of location-based data and how techniques such as Exploratory Data Analysis can be applied to it. We then delve into spatial operations such as computing distances, areas, extents, centroids, and buffer polygons, intersecting geometries, and geocoding, all of which add context to location data. Moving ahead, you will learn how to quickly build and deploy a geo-fencing system using Python. Lastly, you will learn how to leverage geospatial analysis techniques in popular recommendation systems, such as collaborative filtering and location-based recommendations. By the end of the book, you will be able to perform geospatial analysis with ease.
Table of Contents (9 chapters)

Location data

What is location data, and why is it different from other data formats? It is quite common to see phrases such as "spatial data is special" or the more popular adage "80% of data is geographic". While these claims are not easily provable, the amount of location data is clearly increasing. From geotagged images and text to sensor data, location data is ubiquitous and the world is datafied. In this connected, data-driven era, we generate, keep track of, and store huge amounts of data every day. Think of the number of tweets, Instagram images, bank transactions, web searches, and routing requests sent to APIs. We collect more data than at any other period in the past, hence the big data revolution. Many of the collected datasets have an inherent location dimension, but it is often hidden within the data and not fully utilized.

Understanding location data from various perspectives

We can examine location data from three perspectives: business, technical, and data.

From a business perspective

From a business perspective, maps and location data are crucial in many applications. A quick look at big companies such as Google, Apple, Microsoft, and Nokia shows that each has its own location and mapping services and products.

Think about how often you use Google Maps location services on your phone. These companies would not go to such lengths to produce location data in-house if it were not necessary, which highlights its importance. Business applications of location data go beyond individual use and include innovative applications ranging from individualized marketing, autonomous vehicles, logistics, and transportation to healthcare.

From a technical perspective

From a technical perspective, location data entails both opportunities and challenges. In contrast to other data, location data has a topology, which holds the relationships between geometries (points, lines, and polygons) and the geographic features that they represent. Conventional data is stored in tables, typically in a Relational Database Management System (RDBMS). Spatial relations and topology, however, require us to store the geometry of objects as well.

The nature of location data follows from Tobler's first law of geography: "Everything is related to everything else, but near things are more related than distant things." This law implies strong autocorrelation and interdependency between nearby locations, which is not necessarily present in conventional data (non-spatial attributes).
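
As a rough illustration of what spatial autocorrelation means in practice, the following minimal sketch computes global Moran's I, a common autocorrelation statistic. The text above does not name a specific measure, so the choice of Moran's I is an assumption, as are the helper names `morans_i` and `chain_weights` and the toy layout of eight locations along a line.

```python
def morans_i(values, weights):
    """Global Moran's I for a list of values and an n x n weights matrix."""
    n = len(values)
    mean = sum(values) / n
    z = [v - mean for v in values]  # deviations from the mean
    w_sum = sum(sum(row) for row in weights)
    num = sum(weights[i][j] * z[i] * z[j] for i in range(n) for j in range(n))
    den = sum(d * d for d in z)
    return (n / w_sum) * (num / den)


def chain_weights(n):
    """Binary weights for n locations on a line (hypothetical layout):
    each location is a neighbor of the one before and after it."""
    w = [[0.0] * n for _ in range(n)]
    for i in range(n - 1):
        w[i][i + 1] = w[i + 1][i] = 1.0
    return w


w = chain_weights(8)
smooth = [1, 2, 3, 4, 5, 6, 7, 8]        # neighbors have similar values
alternating = [1, 8, 1, 8, 1, 8, 1, 8]   # neighbors have opposite values

i_smooth = morans_i(smooth, w)       # positive: near things are alike
i_alt = morans_i(alternating, w)     # negative: near things are unlike
print(i_smooth, i_alt)
```

A clearly positive value corresponds to the pattern Tobler's law describes. For values that are spatially independent, the statistic is expected to be close to zero.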

From a data perspective

Having looked into the nature of location data from a technical perspective, let's also examine it from a data perspective. How is location data different from other data? In location data, we use two-dimensional geographic coordinates to represent a three-dimensional world.
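
One consequence of describing a curved surface with two coordinates is that a degree of longitude does not cover the same ground distance everywhere. The sketch below uses the haversine formula on a spherical Earth to show this. The spherical model, the mean radius constant, and the function name `haversine_km` are assumptions made for illustration and are not taken from the text.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean radius; assumes a spherical Earth


def haversine_km(lon1, lat1, lon2, lat2):
    """Great-circle distance in kilometers between two lon/lat points (degrees)."""
    lon1, lat1, lon2, lat2 = map(math.radians, (lon1, lat1, lon2, lat2))
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = (math.sin(dlat / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# The same one-degree step in longitude, measured at two latitudes.
at_equator = haversine_km(0.0, 0.0, 1.0, 0.0)
at_60_north = haversine_km(0.0, 60.0, 1.0, 60.0)
print(round(at_equator, 1), round(at_60_north, 1))
```

Under these assumptions, the step at 60° north covers roughly half the ground distance of the same step at the equator, which is why raw coordinate differences should not be treated as planar distances.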

For example, Digital Elevation Models (DEMs) are used to represent heights and terrain surfaces. The first law of geography applies here as well. At a given point in time, a particular piece of terrain is very likely to have a similar height to its close surroundings, whereas two areas far apart can differ substantially in elevation. As mentioned earlier, spatial autocorrelation is assumed to be present in location data, while the statistical analysis of conventional data usually assumes that data points are independent. That means methods built on the independence assumption cannot be applied to location data without accounting for spatial dependence.

Another complication in location data is the Modifiable Areal Unit Problem (MAUP): aggregating the same data into different areal units produces different results. Poverty and crime estimates are typical examples. Areas of high poverty could be overestimated or underestimated depending on the boundaries of the measured areas. Moving between aggregation levels (for example, zip code, neighborhood, or district) can create different impressions and patterns purely because of the different scales and boundaries.
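
The effect can be reproduced with a toy dataset. In the sketch below, twelve households sit at positions 0 to 11 along a street, each flagged as poor (1) or not (0). The data, the zone boundaries, and the helper `rate_by_zone` are all made up for illustration. The helper assumes every household falls inside the given boundaries and that no zone is empty.

```python
from bisect import bisect_right


def rate_by_zone(records, breaks):
    """Share of flagged records per zone; zone k covers [breaks[k], breaks[k+1])."""
    counts = [0] * (len(breaks) - 1)
    flagged = [0] * (len(breaks) - 1)
    for x, is_poor in records:
        zone = bisect_right(breaks, x) - 1  # index of the zone containing x
        counts[zone] += 1
        flagged[zone] += is_poor
    return [f / c for f, c in zip(flagged, counts)]


# (position, poor flag) for twelve hypothetical households
records = list(zip(range(12), [1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0]))

fine = rate_by_zone(records, [0, 3, 6, 9, 12])  # four small zones
coarse = rate_by_zone(records, [0, 6, 12])      # two large zones
print(fine)    # [1.0, 0.0, 1.0, 0.0]
print(coarse)  # [0.5, 0.5]
```

The households never change, yet the fine zoning shows sharply segregated pockets of poverty while the coarse zoning suggests an even 50% rate everywhere.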

Types of location data

Geographic data types can be divided into two broad categories:

  • Vector data: This is represented as points, lines, or polygons. It is typically created by digitizing features and storing their coordinates, for example as longitude and latitude. This type of data is useful for storing features that have discrete and distinct boundaries, such as borders, land parcels, streets, and points of interest.
  • Raster data: This stores information in cells and is therefore suitable for storing data that is continuous, such as satellite images, elevation models, and aerial photographs.
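
The two categories can be sketched with plain Python structures. The vector features below are GeoJSON-like dictionaries and the raster is a small grid with an origin and a cell size. All coordinates, elevation values, and the helper names `polygon_area` and `sample` are hypothetical, and `sample` assumes the queried point lies inside the raster extent.

```python
# Vector: discrete features described by coordinate geometry.
point_of_interest = {"type": "Point", "coordinates": (24.94, 60.17)}
street = {"type": "LineString", "coordinates": [(24.93, 60.16), (24.95, 60.18)]}
parcel = {"type": "Polygon",
          "coordinates": [[(0, 0), (4, 0), (4, 3), (0, 3), (0, 0)]]}


def polygon_area(polygon):
    """Planar area of the exterior ring, using the shoelace formula."""
    ring = polygon["coordinates"][0]
    twice_area = sum(x1 * y2 - x2 * y1
                     for (x1, y1), (x2, y2) in zip(ring, ring[1:]))
    return abs(twice_area) / 2


# Raster: a continuous surface sampled on a regular grid of cells.
elevation = {
    "origin": (0.0, 3.0),  # x, y of the upper-left corner
    "cell_size": 1.0,
    "cells": [
        [12, 13, 15, 18],
        [11, 12, 14, 16],
        [10, 11, 12, 13],
    ],
}


def sample(raster, x, y):
    """Value of the cell containing (x, y); the point must be inside the grid."""
    ox, oy = raster["origin"]
    col = int((x - ox) // raster["cell_size"])
    row = int((oy - y) // raster["cell_size"])  # rows count downward from the top
    return raster["cells"][row][col]


print(polygon_area(parcel))         # 12.0
print(sample(elevation, 2.5, 2.5))  # 15
```

The vector parcel has an exact boundary that can be measured, while the raster only answers what value a location has, at the resolution of its cells.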