GHCN-v2 Data

Recent work done by Steve McIntyre with contributions by Anthony Watts (who created surfacestations.org) motivated me to take a close look at the GHCN-v2 global temperature data set.

My motivation was twofold: First, I wanted to see the range of temperatures by location. Second, I wanted to see data availability over time at each location. The usual average global temperature anomaly graphs do not provide this information.

To this end, I wrote some quick and dirty Perl scripts which produce the 13,736 files consisting of HTML pages and PNG time series graphs you can find on climate.unur.com. You can find links to the data I used on that page as well.

To plot the data, I first sliced and diced the large monolithic data sets using the script below:

#!/usr/bin/perl

# parse.pl

use strict;
use warnings;

use File::Basename;
use File::Slurp;
use File::Spec::Functions qw( catfile );

use Readonly;
Readonly::Scalar my $RECORD_LAYOUT
    => q{ A3 A9 A4 A5 A5 A5 A5 A5 A5 A5 A5 A5 A5 A5 A5 };

my ( $temps_file ) = @ARGV;
die usage() unless defined $temps_file;

Readonly::Scalar my $OUTPUT_DIR => ( split /\./, $temps_file )[-1];

mkdir $OUTPUT_DIR;

open my $temps, '<', $temps_file
    or die "Cannot open '$temps_file': $!";

while ( <$temps> ) {
    my ($country, $station, $year, @data) = unpack $RECORD_LAYOUT => $_;

    my $dir = catfile $OUTPUT_DIR, $country;

    mkdir $dir or die $! unless -d $dir;

    my $out;
    for my $month ( 1 .. 12 ) {
        $out .= sprintf(
            "%4.4d,%2.2d,%d\n",
            $year, $month, $data[ $month - 1 ],
        );
    }

    write_file catfile( $dir, $station ), { append => 1 }, \$out ;
}

sub usage { sprintf "Usage: %s <temperature_data_file>\n", basename $0; }
__END__

This is a fairly simple script which creates a file for each unique station (consisting of WMO station identifier, modifier and duplicate number). These files have three columns: Year, month and temperature. For purposes of illustration, running this on v2.mean results in 13,471 files

By the way, I decided to use the filesystem as the database as this cut down the time taken by each script to the order of minutes rather than hours compared with an earlier version which used SQLite.

Next, I generated a single series for each WMO location. When there were multiple series available for a given WMO location, I averaged the non-missing observations by month. Here is the source code for that script:

#!/usr/bin/perl

# combine.pl

use strict;
use warnings;

use File::Slurp;
use File::Spec::Functions qw( catfile );
use Readonly;

Readonly::Scalar my $INVENTORY_FILE => 'v2.temperature.inv';

open my $inv, '<', $INVENTORY_FILE
    or die "Cannot open '$INVENTORY_FILE': $!";

while ( my $station = <$inv> ) {
    my ($country, $wmo) = unpack 'A3 A5', substr( $station, 0, 8 );
    process( $country, $wmo );
}

sub process {
    my ( $country, $wmo ) = @_;

    for my $series ( qw( mean mean_adj min min_adj max max_adj ) ) {

        my $dir = catfile $series, $country;
        my @files;

        eval {
            @files = grep { /^$wmo/ } read_dir $dir;
        };

        next if $@;

        my (%data, %n);

        for my $file ( @files ) {
            my $path = catfile $dir, $file;

            open my $data, '<', $path
                or die "Cannot open '$path': $!";

            while ( <$data> ) {
                chomp;
                my ($year, $month, $temp) = split /,/, $_;
                next if $temp eq '-9999';

                my $key = "$year:$month";

                if ( exists $data{ $key } and ( defined $data{ $key } ) ) {
                    my $obs = $data{ $key };
                    my $n = $n{ $key };
                    $data{ $key } = ( $n * $obs + $temp )/($n + 1);
                    ++ $n{ $key };
                }
                else {
                    $data{ $key } = $temp;
                    $n{ $key } = 1;
                }
            }

            close $data or die $!;
        }

        $_ = sprintf( '%.0f', $_ ) for values %data;

        my $output_dir = catfile 'combined', $series, $country;
        unless ( -d $output_dir ) {
            mkdir 'combined';
            mkdir catfile( combined => $series );
            mkdir $output_dir or die $!;
        }

        write_file(
            catfile( $output_dir, $wmo ),
            [ map { "$_,$data{$_}\n" } sort keys %data ],
        );
    }
}
__END__

After running this script, I had 4,498 files for monthly mean files by WMO location and 4,407 adjusted monthly mean files by WMO location. Each file contains two columns: A year-month label and a combined temperature observation. This is not the correct way to do it for statistical analysis because each observation ends up being an average of potentially different numbers of observations. However, it was sufficient for my visualization-only purpose.

Once I had those files, graphing was easy:

#!/usr/bin/perl

use strict;
use warnings;

use GD;
use GD::Graph::lines;

use FindBin qw( $Bin );
use File::Basename;
use File::Find;
use File::Slurp;
use File::Spec::Functions qw( catfile canonpath );

use lib catfile( $Bin, 'lib' );

use GHCNv2::Constants qw(
    @AXIS_FONT
    @TITLE_FONT

    @FIXED_DIM
    @ZOOMED_DIM
    @FIXED_X_LABELS
    @SOURCES

    $INPUT_DIR
    $OUTPUT_DIR

    %COMMON_GRAPH_ATTR
    %FIXED_GRAPH_ATTR
    %ZOOMED_GRAPH_ATTR
);

use GHCNv2::Countries qw( %COUNTRIES );
use GHCNv2::Stations qw( %STATIONS );
use GHCNv2::Util qw( make_x_labels );

mkdir $OUTPUT_DIR or die $! unless -d $OUTPUT_DIR;

find( \&amp;plot_wmo, catfile( $INPUT_DIR, 'mean' ) );

sub plot_wmo {
    return unless /^\d{5}$/;

    my $wmo = $_;
    my $country = basename $File::Find::dir;

    my @sources = map { canonpath $_ } (
        $wmo, map {
            catfile( $INPUT_DIR, $_, $country, $wmo )
        } @SOURCES[ 1 .. $#SOURCES ]
    );

    my %data;

    for my $s ( 0 .. $#sources ) {
        eval {
            open my $input, '<', $sources[ $s ]
                or die "Cannot open '$sources[ $s ]': $!";
            while ( my $obs = <$input> ) {
                chomp $obs;
                my ( $label, $temp ) = split /,/, $obs;
                undef $temp if $temp eq '-9999';
                $data{ $SOURCES[ $s ] }->{ $label } = $temp;
            }
            close $input
                or die "Cannot close '$sources[ $s ]': $!";
        };
        warn "$@\n" and next if $@;
    }

    my ($start, $end) = map { /^(\d{4})/ }
        ( sort keys %{ $data{ mean } } )[0 , -1];

    my @x_labels = make_x_labels( $start, $end );

    my $fixed_graph = GD::Graph::lines->new( @FIXED_DIM );
    $fixed_graph->set( %COMMON_GRAPH_ATTR, %FIXED_GRAPH_ATTR, );

    my $zoomed_graph = GD::Graph::lines->new( @ZOOMED_DIM );
    $zoomed_graph->set( %COMMON_GRAPH_ATTR, %ZOOMED_GRAPH_ATTR,
        x_label_skip => @x_labels / 10,
    );

    for my $g ( $fixed_graph, $zoomed_graph ) {
        $g->set_x_axis_font ( @AXIS_FONT  );
        $g->set_y_axis_font ( @AXIS_FONT  );
        $g->set_x_label_font( @AXIS_FONT  );
        $g->set_y_label_font( @AXIS_FONT  );
        $g->set_title_font  ( @TITLE_FONT );
        $g->set(
            title => sprintf(
                "%s ( %s ) [ lat = %s, lon = %s ]",
                $STATIONS{ $wmo }->[0]->{ name },
                $COUNTRIES{ $country },
                $STATIONS{ $wmo }->[0]{ latitude },
                $STATIONS{ $wmo }->[0]{ longitude },
            )
        );
    }

    my %mean_adj = %{ $data{mean_adj} || {} };
    my %mean     = %{ $data{mean}     || {} };

    my $fixed_series = [
        \@FIXED_X_LABELS,
        [ @mean_adj{ @FIXED_X_LABELS } ],
        [ @mean    { @FIXED_X_LABELS } ],
    ];

    my $zoomed_series = [
        \@x_labels,
        [ @mean_adj{ @x_labels } ],
        [ @mean    { @x_labels } ],
    ];

    my $output_dir = catfile $OUTPUT_DIR, $country ;
    unless( -d $output_dir ) {
        mkdir $output_dir or die $!;
    }

    write_file(
        catfile( $output_dir, "$wmo.png" ),
        { binmode => ':raw' },
        \$fixed_graph->plot( $fixed_series )->png( 9 ),
    );

    write_file(
        catfile( $output_dir, "$wmo-zoomed.png" ),
        { binmode => ':raw' },
        \$zoomed_graph->plot( $zoomed_series )->png( 9 ),
    );
}

__END__

You can download the source code and custom modules.