Posted by: genezeien | January 16, 2010

Finding raw temperature data

ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily is a bit of a mess, but so is “raw” food 😉

If you prefer your data in graphs, try:

http://climexp.knmi.nl/ Go prepared with latitude & longitude coordinates to find the temperature station of interest.  Raw and adjusted data are available here.

http://data.giss.nasa.gov/gistemp/station_data/ These are adjusted temperatures.

http://www7.ncdc.noaa.gov/IPS/coop/coop.html PDFs of original paper records.

Posted by: genezeien | January 15, 2010

Monthly gridded data in an easy-to-use format

Monthly_1x1.txt, a 34 MB file. The first column is the date, i.e. year+(month-1)/12.  So, January of 1932 comes out as 1932.00, and December of 1932 as 1932.92.  The remaining columns have latitude,longitude in the first row, and temperatures in Celsius thereafter. Columns are separated by a space.
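
As a quick example of reading the file back, here is a minimal awk sketch; it assumes the layout described above, and the sector 40,-74 is just an illustrative choice.

awk -v want="40,-74" 'NR==1{for(i=2;i<=NF;i++)if($i==want)col=i; next}
     col && $col!="nan"{print $1, $col}' Monthly_1x1.txt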

Quality control:

  1. If a 1×1 degree sector has only one day’s worth of data, that becomes the temperature for that month: (TMAX+TMIN)/2.  The intent is to avoid discarding data from sites that are subject to extreme weather events or observer hardships.  GISS’s quality control method is to discard a whole month for a site when a single day is missing, then “fill in” from surrounding stations up to 1200 km away.  I believe the GISS method favors sites which are easy to observe.
  2. Daily minimum or maximum temperatures above 100C or below -100C were discarded.  GHCN uses -9999 as a key for missing values, and occasionally a -999 would sneak in.  In reality, the allowed range could be tighter to weed out other unrealistic temperatures.
#!/bin/bash

# Monthly data
#wget -Nr ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/v2 -o monthly.log
#find ftp.ncdc.noaa.gov/pub/data/ghcn -name \*.Z -exec uncompress '{}' \;

# Daily data
#wget -Nr --exclude-directories=grid ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily -o daily.log
H=`pwd`
cd ftp.ncdc.noaa.gov/pub/data/ghcn/daily

# Start with list of stations' coordinates & codes
# ghcnd-stations.txt:   List of stations and their metadata (e.g., coordinates)
# Contents look like:
# ID           Lat       Long      Elev   State     Name    GSNflag  HCNflag     WMOid
# AJ000037749  40.7000   47.7000   95.0   GEOKCAY                                37749

# Look up min & max temps from ${StationID}.dly files
# ex: "AJ000037749.dly"  line 1 looks like:
#            YearMo      Day1    Day2    Day3    ...
# AJ000037749193602PRCP    0  I   17  I    0  I    0  I    0  I    0  I    0  I   78  I    0  I    0  I    0  I    0  I    0  I   20  I    0  I    0  I    0  I    0  I    0  I    0  I    0  I   12  I    0  I    0  I    0  I    0  I    0  I    0  I    0  I-9999   -9999
# AJ000037749200910TMAX  240  S  224  S  252  S  256  S  250  S  250  S  239  S  212  S  220  S  221  S  225  S  239  S-9999     253  S  256  S  231  S  239  S  240  S  230  S  231  S  229  S  230  S  231  S-9999     243  S-9999     181  S  187  S  212  S  170  S  173  S
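# Fixed-width layout (per the GHCN-Daily readme): cols 1-11 station ID,
# cols 12-15 year, 16-17 month, 18-21 element (TMAX/TMIN/PRCP/...),
# then 31 day slots of 8 characters each: a 5-char value in tenths of a
# unit followed by 3 flag characters.  That is why the parsing below
# starts at column 22, steps by 8, and takes substr($0,i,5) as the value.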

# List of stations with temperature data.
# This takes some time.  Delete files to freshen the lists.
if [ ! -e ${H}/TempStations.txt ] ; then
  awk '(F!=FILENAME)&&($1~/TMAX/){F=FILENAME; print F > "'${H}/TempStations.txt'"}' all/*.dly
fi

# Which stations are in each sector?  Format:  Lat Lon StationID
# where Lat is rounded to the nearest integer.
awk 'FILENAME~/ghcn/{\
      Lat[$1]=int($2+90.5); Lon[$1]=int($3+180.5);\
     }\
     FILENAME~/TempSta/{\
      ID=substr($1,5,11);\
      Station[ID]=1;\
     }\
     END{\
      for(ID in Station)print Lat[ID]-90,Lon[ID]-180,ID;\
     }' ${H}/TempStations.txt ghcnd-stations.txt > ${H}/LatLonIDs.txt

#################################################################################
# Calculate daily average for each sector.
# Calculate monthly from dailies for each sector.
echo -n "" > ${H}/Daily_1x1.txt
for Coord in `awk '{print $1","$2}' ${H}/LatLonIDs.txt | sort -u`; do
   # remove the comma between Lat & Lon
   C="${Coord/,/ }"
   echo $C
   # list of stations with these coordinates
   Tstations=(`grep "^$C " ${H}/LatLonIDs.txt | awk '{print "all/"$3".dly"}'`)
         awk -v Coord="$Coord" '{St=substr($0,1,11);Yr=substr($0,12,4);Mo=substr($0,16,2);}\
           $1~/TMAX/{l=length($0); d=1; \
               for(i=22;i<l;i+=8){\
                  mx=substr($0,i,5)/10.0;\
                  if((mx<100)&&(mx>(-100))){\
                     tmx[Yr,Mo,d]+=mx; Nx[Yr,Mo,d]++;\
                     if(D[Yr,Mo,d]==""){\
                        D[Yr]++; Day[Yr,Mo,d]=1;\
                     }\
                  }\
                  d++}}\
           $1~/TMIN/{l=length($0); d=1;\
               for(i=22;i<l;i+=8){\
                  mn=substr($0,i,5)/10.0;\
                  if((mn<100)&&(mn>(-100))){tmn[Yr,Mo,d]+=mn; Nn[Yr,Mo,d]++;}\
                  d++}}\
           END{split(Coord,C,",");\
               for(day in D){\
                 if(tmn[day]!=""){d=day; gsub(SUBSEP,"/",d); \
                    print C[1],C[2],d,((tmx[day]/Nx[day])+(tmn[day]/Nn[day]))/2.0}\
               }}' ${Tstations[@]} | sort -n >> ${H}/Daily_1x1.txt

done
echo "Finished daily, starting monthly"
#################################################################################
# Monthly by sector
awk '{split($3,day,"/");\
      if(day[1]<1900){next}\
      date=day[1]+(day[2]-1)/12.0;\
      T[$1,$2,date]+=$4; Tn[$1,$2,date]++;\
      days[date]=1; LatLon[$1,$2]=1;\
     }\
     END{\
      printf("Year ");\
      for(LL in LatLon){split(LL,info,SUBSEP); printf("%d,%d ",info[1],info[2]);}\
      printf("\n");\
      for(d in days){\
         printf("%1.3f ",d);\
         for(LL in LatLon){\
            if(Tn[LL,d]!="") {\
               printf("%1.4f ",T[LL,d]/Tn[LL,d]);\
            }else{\
               printf("nan ");\
            }\
         }\
         printf("\n");\
      }\
     }' ${H}/Daily_1x1.txt | sort -n -k1,1 > ${H}/Monthly_1x1.txt

exit

Posted by: genezeien | December 29, 2009

Style vs Content

Feel free to leave an analysis suggestion, or point out coding problems, in the comments area on “Step by Step”. Contrary to typical blogging, the “Step by Step” entry will continue to grow as additional analysis steps are completed, rather than having new posts appear above the information that gives context to the later analysis. If you have visited before and are looking for the latest progress, just scroll down to the end. Most graphs are self-explanatory, though many analyses build upon previous steps. Code is primarily bash & awk. If you are familiar with C, awk code is fairly easy to read. The associative arrays make awk a natural fit for parsing files with content of unknown extent.
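
For readers new to awk, the per-key accumulation pattern used throughout these scripts boils down to something like this (a generic sketch; the two-column “key value” input file is purely illustrative):

awk '{sum[$1]+=$2; n[$1]++} END{for(k in sum) print k, sum[k]/n[k]}' key_value.txt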

See also:  UHI,  Arctic.  Next up: TOBs, a.k.a. time of observation bias, as discussed in NCDC documentation.

Posted by: genezeien | December 28, 2009

Step by step: Is the climate changing?

Author

Eugene Zeien, BS Applied Physics 1991; 18 years of experience in data analysis & IT support at the University of Iowa.

Abstract

Having heard so frequently that the data underlying the current consensus was robustly supportive, I decided to take the time to find raw, unadjusted data and undertake some simple analyses.  I was quite surprised by the results.  I am posting them here for comments and suggestions, along with source code and links to the raw data.

The majority of climate researchers use the adjusted data in their work, because CRU, GISS, and NCDC make it easily accessible and easy to use.  Since evidence has surfaced that suggests those three entities are not independent, all three adjustment methods may be suspect.  Let’s take a look.

Methods

Starting with a home PC, I installed Sun’s VirtualBox.  Next, since my experience is primarily Linux/Unix based, I installed Ubuntu 9.10 on a virtual 40 GB disk.  GHCN maintains a nice, though hard-to-find, FTP repository of raw climate station data, which was downloaded and decompressed.  The documentation in readme.txt was fairly easy to follow.

As a first pass, all the stations available in GHCN were included.  Temperature data from all the stations within a geographical 1×1 degree sector was combined into a single bin.  This methodology allows for station relocation and overlap with minimal impact upon the results.  An annual temperature was computed for a sector only when it had more than 240 daily readings within that year.  The threshold of 240 was an arbitrary choice, based upon the reasoning that the stations with high dropout rates are in the most inhospitable regions.

#!/bin/bash

# Monthly data
#wget -Nr ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/v2 -o monthly.log
#find ftp.ncdc.noaa.gov/pub/data/ghcn -name \*.Z -exec uncompress '{}' \;

# Daily data
wget -Nr --exclude-directories=grid ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily -o daily.log
H=`pwd`

# Start with list of stations' coordinates & codes
# ghcnd-stations.txt:   List of stations and their metadata (e.g., coordinates)
# Contents look like:
# ID           Lat       Long      Elev   State     Name    GSNflag  HCNflag     WMOid
# AJ000037749  40.7000   47.7000   95.0   GEOKCAY                                37749

# Look up min & max temps from ${StationID}.dly files
# ex: "AJ000037749.dly"  line 1 looks like:
#            YearMo      Day1    Day2    Day3    ...
# AJ000037749193602PRCP    0  I   17  I    0  I    0  I    0  I    0  I    0  I   78  I    0  I    0  I    0  I    0  I    0  I   20  I    0  I    0  I    0  I    0  I    0  I    0  I    0  I   12  I    0  I    0  I    0  I    0  I    0  I    0  I    0  I-9999   -9999
# AJ000037749200910TMAX  240  S  224  S  252  S  256  S  250  S  250  S  239  S  212  S  220  S  221  S  225  S  239  S-9999     253  S  256  S  231  S  239  S  240  S  230  S  231  S  229  S  230  S  231  S-9999     243  S-9999     181  S  187  S  212  S  170  S  173  S

# Build list of Lat Lon pairs to check for data
cd ftp.ncdc.noaa.gov/pub/data/ghcn/daily/
awk '{printf("%d %d\n",$2,$3)}' ghcnd-stations.txt | sort -u | grep . > ${H}/Coords.txt

# Going to populate a 180x360 degree grid
# anything from X.0 - X.99 goes into X's box
echo -n "" > ${H}/TAvg_1x1.txt

let Lines=`cat ${H}/Coords.txt | wc -l`
let C=0
while [ $C -lt $Lines ] ; do
      let C++
      Coords=(`awk 'NR=='$C'{print $0; exit}' ${H}/Coords.txt`)
      Lat=${Coords[0]} ; Lon=${Coords[1]}
      ## select stations with this Lat & Lon
      stations=(`awk -v lat=$Lat -v lon=$Lon '\
                (lat+0.0<=$2)&&($2<lat+1.0)&&\
                (lon+0.0<=$3)&&($3<lon+1.0){print $1}' ghcnd-stations.txt`)
      if [ ${#stations[@]} -gt 0 ] ; then
         printf '+ %s %s %s\n' $Lat $Lon ${#stations[@]}
      else
         printf '  %s %s\n' $Lat $Lon
      fi
      ## Poll stations for years with temperature data.  -9999 is missing value.
      Tstations=()
      for S in ${stations[@]} ; do
         Tstations=(${Tstations[@]} `awk '$1~/TMAX/{print FILENAME; exit}' all/${S}.dly`)
      done
      if [ ${#Tstations[@]} -eq 0 ] ; then
         let Lon++
         continue
      fi
      printf 'T %s %s %s\n' $Lat $Lon ${#Tstations[@]}
      ## Would be best to pair up TMIN & TMAX
      ## Parsing is FUN!
      ## Output file should be Lat Lon Year TAvg (divided by 10.0 to remove builtin T*10)
      ## Added a minimum N>240 to avoid regions with one temperature obs/year  😉
      awk '{St=substr($0,1,11);Yr=substr($0,12,4);Mo=substr($0,16,2);}\
           $1~/TMAX/{l=length($0); d=1; \
               for(i=22;i<l;i+=8){\
                  mx=substr($0,i,5)/10.0;\
                  if((mx<100)&&(mx>(-100))){\
                     tmx[Yr]+=mx; Nx[Yr]++; \
                     if(D[Yr,Mo,d]==""){\
                        D[Yr]++; Day[Yr,Mo,d]=1;\
                     }\
                  }\
                  d++}}\
           $1~/TMIN/{l=length($0); d=1; \
               for(i=22;i<l;i+=8){\
                  mn=substr($0,i,5)/10.0;\
                  if((mn<100)&&(mn>(-100))){tmn[Yr]+=mn; Nn[Yr]++;}\
                  d++}}\
           END{for(i in tmx){\
                 if((tmn[i]!="")&&(D[i]>240)){ \
                    print "'$Lat'","'$Lon'",i,((tmx[i]/Nx[i])+(tmn[i]/Nn[i]))/2.0}\
               }}' ${Tstations[@]} | sort -n >> ${H}/TAvg_1x1.txt

done
awk '{Temp[$3]+=$4;n[$3]++}\
     END{for(i in Temp){print i,Temp[i]/n[i]}}' ${H}/TAvg_1x1.txt > ${H}/TAvg_Globe.txt
exit

In order to minimize the effect of temperature stations appearing and dropping out, sectors were selected which had continuous annual temperature data from 1900 to 2009.  The data from the 613 sectors has a fairly strong sinusoidal pattern.  The poorly correlated linear trend is probably an artifact of where in the wave’s peaks and troughs the data begins and ends.  Do not copy this graph; I have an idea…

Correlation 0.43 using a 30-year sine wave combined with a 0.3 C/century drop in temperature.
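
For reference, here is a sketch of how that comparison could be scripted against the Continuous_TAvg_Globe.txt file written by the script below (year in column 1, mean temperature in column 2).  The amplitude, phase, and offset are placeholders rather than the fitted values behind the graph; only the 30-year period and the 0.3/century slope come from the caption above.

## Pearson correlation between the observed annual series and a fixed model:
##   model(y) = A*sin(2*pi*(y-1900)/30) + slope*(y-1900) + offset
## Amplitude (0.5), phase (0), and offset (10.0) below are placeholders.
awk 'NR>1{y=$1; T=$2; m=0.5*sin(2*3.14159265*(y-1900)/30.0)-0.003*(y-1900)+10.0;
          n++; sT+=T; sM+=m; sTT+=T*T; sMM+=m*m; sTM+=T*m}
     END{print (n*sTM-sT*sM)/sqrt((n*sTT-sT*sT)*(n*sMM-sM*sM))}' Continuous_TAvg_Globe.txt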

## This is a follow-up from the previous processing script.
## Sift out sectors that have measurements for all of 1900-1999
awk '{Lat=$1; Lon=$2; Year=$3;\
         Coords[Lat,Lon]=1; T[Lat,Lon,Year]=$4;\
      }\
      END{\
         for(C in Coords){\
            Bad=0;\
            for(Y=1900; Y<=1999; Y++){\
               if(T[C,Y]==""){Bad=1}\
            }\
            if(Bad==0){\
               for(Y=1900; Y<=2009; Y++){Keep[C,Y]+=T[C,Y]; N[C,Y]++}\
            }\
         }\
         print "Year Celsius Year Nsectors";\
         for(Y=1900; Y<=2009; Y++){\
            for(C in Coords){\
               if(N[C,Y]!=0){AnnualT[Y]+=(Keep[C,Y]/N[C,Y]); Nsec[Y]++}\
            }\
            print Y,AnnualT[Y]/Nsec[Y],Y,Nsec[Y];\
         }\
      }' ${H}/TAvg_1x1.txt > ${H}/Continuous_TAvg_Globe.txt

1900 was chosen because of the increase in the number of stations just prior to the turn of the century. The next two graphs do not represent the 613 sectors used above. I will post source code… tomorrow?

Intuitively, the monthly average temperature oscillates with the Northern seasons.

Having established a basic overview of what the raw data looks like, it is now time to look at the temperature anomalies.  Starting with the 1×1 degree sector data, the sectors with continuous temperatures from 1900 to 1999 were used to demonstrate the difference between averaging raw temperatures across the globe and averaging anomalies (base 1961-1990). This graph illustrates the wild fluctuations of land temperature anomalies from one year to the next, along with the 11 year moving average.

## Get sectors with complete 1900-1999 data, compute anomalies (base 1961-1990)
## 11 yr average is easily done in spreadsheet, so that's not here
awk '{Lat=$1; Lon=$2; Year=$3;\
         Coords[Lat,Lon]=1; T[Lat,Lon,Year]=$4;\
      }\
      END{\
         for(C in Coords){\
            Bad=0;\
            for(Y=1900; Y<=1999; Y++){\
               if(T[C,Y]==""){Bad=1}\
            }\
            if(Bad==0){\
               for(Y=1900; Y<=2009; Y++){Keep[C,Y]+=T[C,Y]; N[C,Y]++}\
               for(Y=1961; Y<=1990; Y++){BaseSum[C]+=T[C,Y]; BaseN[C]++}\
               Base[C]=BaseSum[C]/BaseN[C];\
            }\
         }\
         print "Year Celsius";\
         for(Y=1900; Y<=2009; Y++){\
            for(C in Coords){\
               if(N[C,Y]!=0){Anom[Y]+=(Keep[C,Y]/N[C,Y])-Base[C]; Nsec[Y]++}\
            }\
            print Y,Anom[Y]/Nsec[Y];\
         }\
      }' ${H}/TAvg_1x1.txt > ${H}/Continuous_Anomaly_Globe.txt
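
The 11-year smoothing in the graph was done in a spreadsheet, as noted in the comment above.  A minimal awk equivalent, assuming a two-column Year/anomaly file like the one written by the script, is a centered 11-year window:

## Centered 11-year moving average of an annual series.
awk 'NR>1{yr[n]=$1; a[n]=$2; n++}
     END{for(i=5; i<n-5; i++){s=0; for(j=i-5; j<=i+5; j++)s+=a[j]; print yr[i], s/11.0}}' Continuous_Anomaly_Globe.txt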

How does an anomaly differ from the raw temperature minus a constant? Surprise! A constant was chosen that left the two plots offset by just 0.1 C.
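
One way to check that, sketched below: subtract the anomaly series from the raw continuous-sector series year by year and see how constant the difference is (this assumes both files have a header row and the year in column 1):

awk 'FNR==NR{if(FNR>1)raw[$1]=$2; next}
     FNR>1 && ($1 in raw){print $1, raw[$1]-$2}' Continuous_TAvg_Globe.txt Continuous_Anomaly_Globe.txt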

How does the temperature data go from the chaotic, variable state seen above to the relatively quiet plot posted by GISS:

Note the subtitle (Meteorological Stations).  Clearly, this does not include ocean data.  Other GISS graphs are available here.

How do the GISS-computed anomalies compare with unadjusted anomalies?  The 1961-1990 mean of the GISS anomalies was 0.09, whereas the 1961-1990 mean of my anomalies was 0.00.  Clearly, the GISS data is adjusted further after the conversion to anomalies.
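
The 1961-1990 means quoted here are plain 30-year averages; as a sketch, against the anomaly series written above:

awk '($1>=1961)&&($1<=1990){s+=$2; n++} END{print s/n}' Continuous_Anomaly_Globe.txt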

Clearly, something needs to be done to reduce the variance in the raw temperature data.  Let’s take a look at temperature within 10 degree latitude bands, i.e. the latitude 80 band covers 75 to 84.99.  Using all sectors with 240 days of data within the year:

## First pass, average temp by latitude x10
awk '{if($1 < 0){Lat=int(($1-5)/10)}else{Lat=int(($1+5)/10)} \
      Temp[Lat,$3]+=$4;n[Lat,$3]++; Years[$3]=1; Lats[Lat]=1;}\
     END{printf("Year ");\
         for(L=90; L>=-90; L-=10){printf("Lat=%d ",L);}\
         printf("\n");\
         for(Y=1900; Y<=2009; Y++){\
           printf("%d ",Y);\
           for(L=9; L>=-9; L-=1){\
              if(n[L,Y]>0){printf("%f ",Temp[L,Y]/n[L,Y])}else{printf("nan ")}\
           }\
           printf("\n");\
         }\
     }' ${H}/TAvg_1x1.txt > ${H}/TAvg_byLat_Globe.txt

Two surprises:  there was no data in the range -45 to -84.99 latitude, and none above 80 latitude.  The entire Lat=80 band is data from +75 to +79.99.  At last, a warming trend has been found.  Perhaps this graph is easier to read.  Latitudes with no data have been eliminated.  The legend has been rearranged with equatorial latitudes at the top, near their temperature plots.

The Arctic trend will be considered in depth here.  https://justdata.wordpress.com/arctic-trends/

UHI effects will be considered as well.  This one is too good to hide: the greater NYC area versus 40N latitude stations and 35N to 45N stations.

## Gather up all the sectors in the 40.0 to 40.99 Latitude range
## Odd side-effect of my gridding method,
## NY NEW YORK CNTRL PRK  lat=40.7800  long=-73.9700
## ends up in the lat=40 long=-74 sector.
awk 'BEGIN{print "Year 40N 35-45N NYC"}\
     (($1==40)&&($2!=(-74))){Avg[$3]+=$4; N[$3]++}\
     (($1>=35)&&($1<=45)){Zavg[$3]+=$4; Zn[$3]++}\
     (($1==40)&&($2==(-74))){NYt[$3]+=$4; NYn[$3]++}\
     END{\
         for(Y in Avg){print Y,Avg[Y]/N[Y],Zavg[Y]/Zn[Y],NYt[Y]/NYn[Y]}\
     }' TAvg_1x1.txt | sort -n > 40N_latitude3.txt

Posted by: genezeien | December 13, 2009

Nothing to see here, move along

I am going to

  1. Track down raw, unadjusted climate data.
  2. Apply straightforward analysis methods.
  3. Present results, code, and data sources.
