Loading flat data may seem a bit dull at first, but it is important to understand that many popular and interesting datasets often store their spatial information in very different formats, such as:
- Coordinates expressed in Lon/Lat or projected coordinates
- Encoded geometry, for example WKT, TopoJSON, GeoJSON
- Location in the form of an address
- Location in non-cartesian coordinates, for example start point, angle and direction
While the examples above indicate that such data would require further processing in order to extract the spatial content into a usable form, the ability to import flat datasets should clearly not be underestimated.
Note
Flat data in our scenario is data with no explicitly expressed geometry: non-spatial, text-based files.
psql is PostgreSQL's command-line tool. While one can achieve quite a lot with GUI-based database management utilities, psql is very useful when one needs to handle database backups, management, and the like via scripting. When there is no GUI installed on the server, psql becomes pretty much the only option, so it is worth being familiar with it even if you're not a fan.
In order to import the data in psql we will use the \copy command. This requires us to define the data model for the incoming data first.
Defining the table data model from a text file may be prone to errors that will prevent data from being imported. If for some reason you are not sure what data types are stored in the particular columns of your source file, you can import all the data as text and then re-cast it as required at a later time.
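As a sketch of that re-casting step, assuming a hypothetical staging table earthquakes_raw in which every column was imported as text (the table and column names here are illustrative, not part of the dataset's official schema):

```sql
-- Hypothetical example: everything was imported as text, and we now
-- re-cast the "mag" column to numeric. nullif() turns empty strings
-- into NULLs so the cast does not fail on blank values.
alter table data_import.earthquakes_raw
    alter column mag type numeric
    using nullif(mag, '')::numeric;
```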
In this example we will import the earthquakes data from USGS. So let's fire up psql and connect to the database server:
F:\mastering_postgis\chapter02>psql -h localhost -p 5434 -U postgres
You should see a similar output:
psql (9.5.0)
Type "help" for help.

postgres=#
Then we need to connect to the mastering_postgis database:
postgres=# \c mastering_postgis
The following output should be displayed:
You are now connected to database "mastering_postgis" as user "postgres".
mastering_postgis=#
Note
In the psql examples I am using the postgres user. As you may guess, it is a superuser account. This is not something you would normally do, but it will keep the examples simple. In a production environment, using a db user with credentials allowing access only to specific resources is a sensible approach.
If you have not had a chance to create our data_import schema yet, let's take care of it now by typing the following command:
mastering_postgis=# create schema if not exists data_import;
You should see a similar output:
NOTICE: schema "data_import" already exists, skipping
CREATE SCHEMA
Once the schema is there, we create the table that will store the data. In order to do so just type or paste the following into psql:
create table data_import.earthquakes_csv (
    "time" timestamp with time zone,
    latitude numeric,
    longitude numeric,
    depth numeric,
    mag numeric,
    magType varchar,
    nst numeric,
    gap numeric,
    dmin numeric,
    rms numeric,
    net varchar,
    id varchar,
    updated timestamp with time zone,
    place varchar,
    type varchar,
    horizontalError numeric,
    depthError numeric,
    magError numeric,
    magNst numeric,
    status varchar,
    locationSource varchar,
    magSource varchar
);
You should see the following output:
mastering_postgis=# create table data_import.earthquakes_csv (
mastering_postgis(# "time" timestamp with time zone,
mastering_postgis(# latitude numeric,
mastering_postgis(# longitude numeric,
mastering_postgis(# depth numeric,
mastering_postgis(# mag numeric,
mastering_postgis(# magType varchar,
mastering_postgis(# nst numeric,
mastering_postgis(# gap numeric,
mastering_postgis(# dmin numeric,
mastering_postgis(# rms numeric,
mastering_postgis(# net varchar,
mastering_postgis(# id varchar,
mastering_postgis(# updated timestamp with time zone,
mastering_postgis(# place varchar,
mastering_postgis(# type varchar,
mastering_postgis(# horizontalError numeric,
mastering_postgis(# depthError numeric,
mastering_postgis(# magError numeric,
mastering_postgis(# magNst numeric,
mastering_postgis(# status varchar,
mastering_postgis(# locationSource varchar,
mastering_postgis(# magSource varchar
mastering_postgis(# );
CREATE TABLE
Now, as we have our data table ready, we can finally get to the import part. The following command should handle importing the data into our newly created table:
\copy data_import.earthquakes_csv from data\2.5_day.csv with DELIMITER ',' CSV HEADER
You should see a similar output:
mastering_postgis=# \copy data_import.earthquakes_csv from data\2.5_day.csv with DELIMITER ',' CSV HEADER
COPY 25
While you can customize your data after importing, you may wish to only import a subset of columns in the first place. Unfortunately, the \copy command imports all the columns (although you may specify where to put them), so the solution to this would be:
- Create a table that models the input CSV
- Import all the data
- Create a table with a subset of columns
- Copy data over
- Delete the input table
Even though everything said earlier is possible in psql, it requires quite a lot of typing. Because of that, we will take care of this scenario in the next stage, which demonstrates the non-interactive psql mode.
For the non-interactive psql data import example we'll do a bit more than in the interactive mode. We'll:
- Import the full earthquakes dataset
- Select a subset of earthquakes data mentioned in the previous example and insert it into its own table
- Import another dataset - in this case the Ordnance Survey's POIs
Basically, the non-interactive usage of psql means we simply provide it with SQL to execute. This way we can put together many statements without having to execute them one by one.
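For a quick taste of the non-interactive mode before we get to full scripts, a single statement can be passed with the -c switch. This is a sketch reusing the connection parameters from earlier in the chapter; adjust them to your setup:

```shell
# Execute one statement non-interactively and print the result.
psql -h localhost -p 5434 -U postgres -d mastering_postgis -c "select count(*) from data_import.earthquakes_csv;"
```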
Once again we will need the data model prior to loading the data, and then a \copy command will be used.
If you're still in psql, you can execute a script by simply typing:
\i path\to\the\script.sql
For example:
\i F:/mastering_postgis/chapter02/code/data_import_earthquakes.sql
You should see a similar output:
mastering_postgis-# \i F:/mastering_postgis/chapter02/code/data_import_earthquakes.sql
CREATE SCHEMA
psql:F:/mastering_postgis/chapter02/code/data_import_earthquakes.sql:5: NOTICE: table "earthquakes_csv" does not exist, skipping
DROP TABLE
CREATE TABLE
COPY 25
psql:F:/mastering_postgis/chapter02/code/data_import_earthquakes.sql:58: NOTICE: table "earthquakes_csv_subset" does not exist, skipping
DROP TABLE
SELECT 25
mastering_postgis-#
If you quit psql already, type the following command into cmd:
psql -h host -p port -U user -d database -f path\to\the\script.sql
For example:
psql -h localhost -p 5434 -U postgres -d mastering_postgis -f F:\mastering_postgis\chapter02\code\data_import_earthquakes.sql
You should see a similar output:
F:\mastering_postgis\chapter02>psql -h localhost -p 5434 -U postgres -d mastering_postgis -f F:\mastering_postgis\chapter02\code\data_import_earthquakes.sql
psql:F:/mastering_postgis/chapter02/code/data_import_earthquakes.sql:2: NOTICE: schema "data_import" already exists, skipping
CREATE SCHEMA
DROP TABLE
CREATE TABLE
COPY 25
DROP TABLE
SELECT 25
The script executed earlier is in the book's code repository under Chapter02/code/data_import_earthquakes.sql.
Loading the OS POI data is now a piece of cake. This dataset is in a bit of a different format though, so it requires slight adjustments. You can review the code in Chapter02/code/data_import_gb_poi.sql.
In this section we'll import some new data we have not interacted with before - this time we'll have a look at the Ordnance Survey's address data we obtained in the CSV format.
Note
Depending on the pgAdmin version, the UI may differ a bit. The described functionality should always be present though. For the examples involving pgAdmin, screenshots were taken using pgAdmin III (1.22.2).
pgAdmin's import functionality is basically a wrapper around \copy, so it does require a data model in order to work. Because of that, let's quickly create a table that will be populated with the imported data. You can do it with the GUI by simply right-clicking the schema node you want to create the table in, choosing New Object | New Table, and then providing all the necessary model definitions in the displayed window:
You can also type some SQL which in many cases is a bit quicker:
drop table if exists data_import.osgb_addresses;
create table data_import.osgb_addresses (
    uprn bigint,
    os_address_toid varchar,
    udprn integer,
    organisation_name varchar,
    department_name varchar,
    po_box varchar,
    sub_building_name varchar,
    building_name varchar,
    building_number varchar,
    dependent_thoroughfare varchar,
    thoroughfare varchar,
    post_town varchar,
    dbl_dependent_locality varchar,
    dependent_locality varchar,
    postcode varchar,
    postcode_type varchar,
    x numeric,
    y numeric,
    lat numeric,
    lon numeric,
    rpc numeric,
    country varchar,
    change_type varchar,
    la_start_date date,
    rm_start_date date,
    last_update_date date,
    class varchar
);
Once our table is ready, importing data is just a matter of right-clicking the table node in pgAdmin and choosing Import. An import wizard that assists with the import process will be displayed:
As we have some flat data already in our database, it's time to extract the spatial information. So far all the datasets used Cartesian coordinate systems, so our job is really straightforward:
drop table if exists data_import.earthquakes_subset_with_geom;

select
    id,
    "time",
    depth,
    mag,
    magtype,
    place,
    ST_SetSRID(ST_Point(longitude, latitude), 4326) as geom
into data_import.earthquakes_subset_with_geom
from data_import.earthquakes_csv;
This example extracts a subset of data and puts it into a new table, with the coordinates expressed as a geometry type rather than as two numeric columns for Lon and Lat.
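To quickly sanity-check the new geometry column from within psql, a simple query such as the following can be used (a sketch; the column choice and LIMIT are arbitrary):

```sql
-- Preview the first few geometries as WKT to confirm the conversion worked.
select id, mag, ST_AsText(geom) as wkt
from data_import.earthquakes_subset_with_geom
limit 5;
```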
In order to quickly preview the data, we dump the table's content to KML using ogr2ogr (this is a little spoiler on the next chapter, which covers exporting data from PostGIS):
ogr2ogr -f "KML" earthquakes_from_postgis.kml PG:"host=localhost port=5434 user=postgres dbname=mastering_postgis" data_import.earthquakes_subset_with_geom -t_srs EPSG:4326
Such a KML can be viewed, for example, in Google Earth (you can use the original KML downloaded from USGS as a cross-check for the output data):