George MacKerron: code blog

GIS, software development, and other snippets

Using OS Code-Point Polygons in PostGIS

Ordnance Survey’s Code-Point with Polygons “provides a precise geographical location for each postcode unit in Great Britain”. It’s available in various formats, including ESRI .shp files.

Many UK academics can access the data via institutional subscription to EDINA Digimap. I’m using it in my research into subjective wellbeing and environmental quality.

This post shows how to:

  1. import the data files into a PostGIS database; and
  2. de-normalise the data into a single table with a one-to-one mapping of postcodes to rows, where each row contains either all the geographical locations covered by a postcode (as a single geometry column, of type multipolygon) or an indication of why no such location is available

Why de-normalise?

Step 2 above is required for my purposes because Code-Point data is supplied in a number of separate files per postcode area:

  • a .shp file (and associated .shx and .dbf) mapping postcodes and ‘vertical streets’ to the locations they cover;
  • an accompanying text file mapping postcodes to ‘vertical streets’; and
  • another text file listing postcodes with no associated locations, either because they represent PO boxes or because the data just isn’t available.

Vertical streets generally represent high-rise buildings, where one location in 2D space may be associated with multiple postcodes. Vertical streets are also a serious pain: not only may one vertical street be associated with many postcodes, but one postcode may be associated with many vertical streets and with non-vertical-street locations too.
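To make this concrete: once the vertical street lookup table has been imported (we create it in the next section), a pair of queries like these will show both directions of the many-to-many relationship.

-- postcodes linked to more than one vertical street
select postcode, count(*)
from cpp_vertical_streets
group by postcode
having count(*) > 1;

-- vertical streets linked to more than one postcode
select vstreet, count(*)
from cpp_vertical_streets
group by vstreet
having count(*) > 1;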

Import the data files

Download and unzip the Code-Point with Polygons data, in ESRI .shp format, for the regions you want. (Where indicative query timings are given later, they're for whole-GB data — 120 postcode areas — on a 2 GHz Intel iMac.)

Dump everything in the same directory. For each postcode area XX you should have the following five files:

  • XX.shp, XX.shx and XX.dbf
  • XX_vstreet_lookup.txt
  • XX_discard.txt

If you don’t already have one you want to use for this purpose, create a PostGIS-enabled PostgreSQL database (in this post, the database is named geo). If you’re running a version older than PostgreSQL 8.4, you may need to increase the free space map (max_fsm_*) settings in postgresql.conf in order to cope with a data set this large.
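If you're creating the database from scratch, a minimal sketch follows. (The exact PostGIS setup depends on your version: the extension mechanism requires PostGIS 2.0+ and PostgreSQL 9.1+, while the 1.x releases current when this was written load SQL scripts instead.)

create database geo;

-- then, connected to the new database, on PostGIS 2.0+:
create extension postgis;

-- on PostGIS 1.x, instead load the bundled scripts into the database,
-- typically postgis.sql (lwpostgis.sql on older releases) and
-- spatial_ref_sys.sql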

Execute the following SQL to create the tables for discards and vertical streets (e.g. from within pgAdmin):

create table cpp_vertical_streets (
  postcode character varying(8),
  vstreet character varying(8)
);
 
create table cpp_discards (
  postcode character varying(8),
  reason character varying(7)
);

Next we need to create the table for the .shp-file polygons. We switch to Ruby for this — you can just enter the commands in an IRB terminal session:

path = '/path/to/CodePoint polygons'
`/usr/local/pgsql/bin/shp2pgsql -p -D -s 27700 "#{path}/ab.shp" cpp_polygons | /usr/local/pgsql/bin/psql -d geo -U postgres`

The -p flag tells shp2pgsql to create the table structure only, without loading any data, and -s 27700 sets the SRID to EPSG:27700 (the OSGB36 British National Grid). You can use any of the .shp files you’ve downloaded — it need not be the AB area one.
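For reference, the table this produces should look roughly like the following (the column names come from the .dbf attributes; exact types depend on the shapefile and your shp2pgsql version):

create table cpp_polygons (
  gid serial primary key,
  postcode character varying(8),
  upp character varying(20),
  pc_area character varying(2),
  the_geom geometry  -- registered via AddGeometryColumn, SRID 27700
);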

Now to do the import into the three tables. Still in Ruby, execute the following loop:

Dir["#{path}/*.shp"].each do |f|
  puts f
  `/usr/local/pgsql/bin/shp2pgsql -a -D -s 27700 "#{f}" cpp_polygons | /usr/local/pgsql/bin/psql -d geo -U postgres`
  `echo "copy cpp_vertical_streets from '#{f.sub(/\.shp$/, '_vstreet_lookup.txt')}' csv;" | /usr/local/pgsql/bin/psql geo postgres`
  `echo "copy cpp_discards from '#{f.sub(/\.shp$/, '_discard.txt')}' csv;" | /usr/local/pgsql/bin/psql geo postgres`
end
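One caveat: copy ... from '<file>' executes on the database server, so the text files must be readable by the server process; if they aren't, psql's client-side \copy command does the same job. Once the loop finishes, a quick sanity check:

-- all three tables should now be populated
select count(*) from cpp_polygons;
select count(*) from cpp_vertical_streets;
select count(*) from cpp_discards;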

Next, add some boolean flags we’ll use later on, then create an index on the postcode column, which will save a lot of time later:

alter table cpp_polygons add column discard boolean;
alter table cpp_polygons add column vstreet boolean;
alter table cpp_polygons add column vstreet_and_std boolean;
 
create index pc_index on cpp_polygons (postcode);
 
vacuum analyze cpp_polygons;

If you like, you can confirm here that there’s only one row per postcode — this query should return nothing:

select * 
from cpp_polygons a 
inner join cpp_polygons b 
on a.postcode = b.postcode 
where a.gid <> b.gid;
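An equivalent formulation, which may be quicker than the self-join, groups by postcode instead:

select postcode, count(*)
from cpp_polygons
group by postcode
having count(*) > 1;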

So, we now have all the OS data held in three separate tables. In the rest of the post, we’ll be merging the data in the discards and vertical streets tables into the main table, cpp_polygons.

Discards

We’re going to create a new table, with the same structure as cpp_polygons, to hold the discarded postcode data. These postcodes will have a NULL geometry column, and a TRUE discard column (one of the booleans we added earlier). Later, we’ll insert the contents of this table into cpp_polygons.

Run the following SQL:

create sequence cpp_discard_seq start with 2000000;
 
create table cpp_discard_polys as (
select 
  nextval('cpp_discard_seq') as gid, 
  postcode,
  cast(null as character varying(20)) as upp, 
  substring(postcode from '^[A-Z][A-Z]?') as pc_area, 
  cast(null as geometry) as the_geom, 
  true as discard, 
  false as vstreet,
  false as vstreet_and_std
from cpp_discards
);
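At this point you can also break the discards down by the reason code the OS supplies:

select reason, count(*)
from cpp_discards
group by reason
order by count(*) desc;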

Vertical streets

Similarly, we’re now going to create a new table with the same structure as cpp_polygons to map vertical street postcodes to the right polygons. This requires a left join of the vertical streets data with some of the geometry data in cpp_polygons.

set enable_seqscan = false; 
-- (otherwise pg sometimes fails to use the index)
 
create sequence cpp_vstreet_seq start with 3000000;
 
create table cpp_vstreet_polys as (
  select 
    nextval('cpp_vstreet_seq') as gid, 
    max(v.postcode) as postcode, 
    cast(null as varchar(20)) as upp, 
    max(pc_area) as pc_area, 
    st_union(the_geom) as the_geom, 
    false as discard,
    true as vstreet,
    false as vstreet_and_std
  from cpp_vertical_streets v 
  left join cpp_polygons p 
  on v.vstreet = p.postcode
  group by v.postcode
);
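Because this is a left join, a postcode whose vertical street codes all lack matching polygons should end up with a null geometry. It may be worth checking whether you have any such rows:

select count(*)
from cpp_vstreet_polys
where the_geom is null;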

Our last new table is for postcodes that are associated with both standard polygons and vertical street polygons.

create sequence cpp_vstreet_and_std_seq start with 4000000;
 
create table cpp_vstreet_and_std_polys as (
select 
  nextval('cpp_vstreet_and_std_seq') as gid, 
  p.postcode as postcode,
  cast(null as varchar(20)) as upp, 
  p.pc_area as pc_area,
  st_multi(st_union(p.the_geom, v.the_geom)) as the_geom, 
  false as discard, 
  false as vstreet,
  true as vstreet_and_std
from cpp_polygons p
inner join cpp_vstreet_polys v
on p.postcode = v.postcode
);

Cleaning and merging

Now we need to remove, from their respective source tables, the polygons and vertical streets we’ve just merged into cpp_vstreet_and_std_polys.

delete from cpp_polygons where postcode in (select postcode from cpp_vstreet_and_std_polys);
-- the above could take around 25 mins
delete from cpp_vstreet_polys where postcode in (select postcode from cpp_vstreet_and_std_polys);
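If the in-subquery form proves too slow, PostgreSQL's delete ... using syntax does the same job and may be quicker:

delete from cpp_polygons p
using cpp_vstreet_and_std_polys s
where p.postcode = s.postcode;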

And finally, we merge our three new tables into the main table:

insert into cpp_polygons select * from cpp_discard_polys;
insert into cpp_polygons select * from cpp_vstreet_polys;
insert into cpp_polygons select * from cpp_vstreet_and_std_polys;
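We can now revert the planner setting we changed earlier, and confirm that cpp_polygons really does contain one row per postcode:

set enable_seqscan = true;

select count(*) as total_rows, count(distinct postcode) as distinct_postcodes
from cpp_polygons;
-- the two counts should be equal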

To join my own data with this data, I find it easiest to match on postcodes formatted with no spaces. So I add and index the following extra column:

alter table cpp_polygons add column postcode_no_sp character varying(8);
update cpp_polygons set postcode_no_sp = replace(postcode, ' ', ''); 
-- the above could take 10 - 15 mins
create index pcns_index on cpp_polygons (postcode_no_sp);
 
vacuum analyze cpp_polygons;
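As an example, assuming a hypothetical table my_data with a postcode column pc (which may contain spaces and lower-case letters), a join might look like this:

select d.*, p.the_geom
from my_data d
left join cpp_polygons p
on p.postcode_no_sp = replace(upper(d.pc), ' ', '');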

You might also want to remove the original vertical street rows, where the postcode column begins with a ‘V’, from the cpp_polygons table — I didn’t have any need for this.
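Should you want to do so, the following should suffice:

-- remove the original vertical street rows (postcode begins with 'V')
delete from cpp_polygons where postcode like 'V%';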


Written by George

November 14th, 2009 at 1:13 pm

Posted in GIS, PostGIS, Ruby, SQL

  • dear George,

    I’m curious to see what use you make of this data combined with other social statistics – will you put the analysis online?

    As you have Digimap access you might find some use in your work for another EDINA service, Unlock Places. This provides a gazetteer search across many of the Ordnance Survey data sources in Digimap, including BoundaryLine, with results in KML, GeoJSON etc – API key needed for UK academic access. It’s all backed with PostGIS too.

    I found this blog via your other post on the NSPD – there’s been some talk here about making a RESTful version of NSPD data in RDF – would you have applications for that?

    be well,