Saturday, June 22, 2013

Flood frequency curves in ggplot2

This post should be considered obsolete as the code has been included in the hydro_prob_trans and hydro_flow_trans functions provided in hydroutils.

What's a flood frequency curve?

In hydrology, a common plot for assessing displaying the likelyhood of a flood event is the frequency curve, a plot of the cumulative distribution function for the observed annual maximums. Typically, in the United States, the Log Pearson type III distribution is fitted to the observed data, to extrapolate the 1% annual chance of exceedence event, sometimes called the 100-year flood. By using a logarithmic y-axis and showing the normal deviate on the x-axis, a distribution without skew will show up as a straight line on the plot. This special x-axis is always the complicated part in creating these plots in many languages, but in R, there is the scales package provides a transform that ggplot2 can use to create the probability axis quite easily.

Below is the output from the code example outlined below.


Bottom line, up front.

This is the code you're looking for if you came to this page with a deadline. The full example, explained in the next section, is available on my github.

library(ggplot2)
library(scales) ## required for probability axis

## Setup breaks for plot
xbreaks = c(0.1, 0.25, 0.5, 0.75, 0.9, 0.95, 0.975, 0.99, 0.995)
log.range = log10(range(bwpeaks$FLOW))
lower = 10^floor(log.range[1])
upper = 10^ceiling(log.range[2])
cap = lower
ybreaks = NULL
while(cap < upper){
  ybreaks = c(ybreaks, seq(cap, cap*9, by=cap))
  cap = cap*10
}

plt = ggplot(bwpeaks, aes(x=PROB, y=FLOW)) + geom_point() + theme_bw()
plt = plt + scale_y_continuous(trans="log10", breaks=ybreaks, name="Flow (cfs)")
plt = plt + scale_x_continuous(trans=probability_trans(distribution="norm"),
                               breaks=xbreaks, labels=paste0((1-xbreaks)*100,"%"),
                               name="Annual Chance of Exceedence")
plt = plt + ggtitle("Brandywine Creek nr. Wilmington, Delaware")
plt
 
The first block of code sets up the breaks, the grid lines used in ggplot2, the majority of which is there to create breaks at the multiples of each power of 10, starting at the lowest value of the plot through the maximum. This makes the y-axis a little bit easier to read, than just showing each power of 10.

The dataframe bwpeaks contains my dataset to plot, the annual peak flows, sorted, with the Weibull plotting position used for the frequencies. The probability_trans function from the scales package provides the fancy x-axis. Two options exist to put the annual chance of exceedence, in the typical order. The first, shown above, is to store the data as the probability of non-exceedence, and set the labels on the x-axis to 1-xbreaks. The second option is to store the annual chance of exceedence, and reverse the x-axis using limits=c(1,0).

But wait! How do I get data from a DSS file into a format for that code?

Part 1: Getting the data out of a DSS file.


If you've got a DSS file, it's quite easy to open, read, and write data from it into DataContainer objects using the get and commands. If it's a time series object, it has two arrays: times which contains integers representing minutes since January 1st, 1970 and values, containing the observations. There's also a paired data container, which contains a single set of x ordinates and one or more y ordinates, stored as xOrdinates and yOrdinates. The rJava package makes it easy to access these objects as if they were S3 objects.

require(dssrip)

## Open file and get TSC
bwfile = opendss("C:/dssrt/test.dss")
bwcreek = getFullTSC(bwfile, "A=BRANDYWINE CREEK C=FLOW E=1DAY")
colnames(bwcreek) = "FLOW" ## TODO - add naming the columns by C-part

The above is an example from some helper functions I've written to read DSS files. getFullTSC allows a search of the DSS file, which might be overly broad if the user isn't careful, and returns a xts object that many R time series objects support.
Note: Be careful with using the times produced by this object - I'm pretty sure they can be 24 hours off, if the timestamps in the DSS file are at 2400 hours, which can be interpreted as the start of the day or the end of the day. I need to check how this is being handled in my code.

Part 2: Turning a daily time series into annual peaks


The alternative to this would be to just download the annual peaks from the USGS, which is easy enough to do, but I want to show off one of the cool parts of using R, the aggregate function. I also realize I could have used stat_ecdf from the ggplot2 package, but as far as I can tell it does not offer an option to put points at Weibull plotting positions, which besides not being the standard for flood frequency plots, conflicts with the infinite distance between 0 and 1 on the probability axis.
bwcreek$WY = 1900 + ifelse(.indexmon(bwcreek)+1 < 9, .indexyear(bwcreek), 1+.indexyear(bwcreek))

## Get peaks for each water year, assign probabilities
bwpeaks = aggregate(FLOW ~ WY, bwcreek, max)
bwpeaks = subset(bwpeaks, WY!=2013)  ## An artifact of my timestamp issue.
bwpeaks = bwpeaks[order(bwpeaks$FLOW),]  ## Sort the flows
n = length(bwpeaks$FLOW)
bwpeaks$PROB = (1:n)/(1+n) ## Weibull plotting postions.

Saturday, June 1, 2013

An update on reading DSS files in R.

I've decided to release some R functions I've put together to read from a DSS file.  A simple file to source is on my GitHub, provided without warranty.  Eventually, this'll become an R module.  It saved my butt a few times last week.  R is much better at re-arranging data than a stock Jython install, even version 2.5.