Saturday, June 22, 2013

Flood frequency curves in ggplot2

This post should be considered obsolete as the code has been included in the hydro_prob_trans and hydro_flow_trans functions provided in hydroutils.

What's a flood frequency curve?

In hydrology, a common plot for assessing displaying the likelyhood of a flood event is the frequency curve, a plot of the cumulative distribution function for the observed annual maximums. Typically, in the United States, the Log Pearson type III distribution is fitted to the observed data, to extrapolate the 1% annual chance of exceedence event, sometimes called the 100-year flood. By using a logarithmic y-axis and showing the normal deviate on the x-axis, a distribution without skew will show up as a straight line on the plot. This special x-axis is always the complicated part in creating these plots in many languages, but in R, there is the scales package provides a transform that ggplot2 can use to create the probability axis quite easily.

Below is the output from the code example outlined below.


Bottom line, up front.

This is the code you're looking for if you came to this page with a deadline. The full example, explained in the next section, is available on my github.

library(ggplot2)
library(scales) ## required for probability axis

## Setup breaks for plot
xbreaks = c(0.1, 0.25, 0.5, 0.75, 0.9, 0.95, 0.975, 0.99, 0.995)
log.range = log10(range(bwpeaks$FLOW))
lower = 10^floor(log.range[1])
upper = 10^ceiling(log.range[2])
cap = lower
ybreaks = NULL
while(cap < upper){
  ybreaks = c(ybreaks, seq(cap, cap*9, by=cap))
  cap = cap*10
}

plt = ggplot(bwpeaks, aes(x=PROB, y=FLOW)) + geom_point() + theme_bw()
plt = plt + scale_y_continuous(trans="log10", breaks=ybreaks, name="Flow (cfs)")
plt = plt + scale_x_continuous(trans=probability_trans(distribution="norm"),
                               breaks=xbreaks, labels=paste0((1-xbreaks)*100,"%"),
                               name="Annual Chance of Exceedence")
plt = plt + ggtitle("Brandywine Creek nr. Wilmington, Delaware")
plt
 
The first block of code sets up the breaks, the grid lines used in ggplot2, the majority of which is there to create breaks at the multiples of each power of 10, starting at the lowest value of the plot through the maximum. This makes the y-axis a little bit easier to read, than just showing each power of 10.

The dataframe bwpeaks contains my dataset to plot, the annual peak flows, sorted, with the Weibull plotting position used for the frequencies. The probability_trans function from the scales package provides the fancy x-axis. Two options exist to put the annual chance of exceedence, in the typical order. The first, shown above, is to store the data as the probability of non-exceedence, and set the labels on the x-axis to 1-xbreaks. The second option is to store the annual chance of exceedence, and reverse the x-axis using limits=c(1,0).

But wait! How do I get data from a DSS file into a format for that code?

Part 1: Getting the data out of a DSS file.


If you've got a DSS file, it's quite easy to open, read, and write data from it into DataContainer objects using the get and commands. If it's a time series object, it has two arrays: times which contains integers representing minutes since January 1st, 1970 and values, containing the observations. There's also a paired data container, which contains a single set of x ordinates and one or more y ordinates, stored as xOrdinates and yOrdinates. The rJava package makes it easy to access these objects as if they were S3 objects.

require(dssrip)

## Open file and get TSC
bwfile = opendss("C:/dssrt/test.dss")
bwcreek = getFullTSC(bwfile, "A=BRANDYWINE CREEK C=FLOW E=1DAY")
colnames(bwcreek) = "FLOW" ## TODO - add naming the columns by C-part

The above is an example from some helper functions I've written to read DSS files. getFullTSC allows a search of the DSS file, which might be overly broad if the user isn't careful, and returns a xts object that many R time series objects support.
Note: Be careful with using the times produced by this object - I'm pretty sure they can be 24 hours off, if the timestamps in the DSS file are at 2400 hours, which can be interpreted as the start of the day or the end of the day. I need to check how this is being handled in my code.

Part 2: Turning a daily time series into annual peaks


The alternative to this would be to just download the annual peaks from the USGS, which is easy enough to do, but I want to show off one of the cool parts of using R, the aggregate function. I also realize I could have used stat_ecdf from the ggplot2 package, but as far as I can tell it does not offer an option to put points at Weibull plotting positions, which besides not being the standard for flood frequency plots, conflicts with the infinite distance between 0 and 1 on the probability axis.
bwcreek$WY = 1900 + ifelse(.indexmon(bwcreek)+1 < 9, .indexyear(bwcreek), 1+.indexyear(bwcreek))

## Get peaks for each water year, assign probabilities
bwpeaks = aggregate(FLOW ~ WY, bwcreek, max)
bwpeaks = subset(bwpeaks, WY!=2013)  ## An artifact of my timestamp issue.
bwpeaks = bwpeaks[order(bwpeaks$FLOW),]  ## Sort the flows
n = length(bwpeaks$FLOW)
bwpeaks$PROB = (1:n)/(1+n) ## Weibull plotting postions.

Saturday, June 1, 2013

An update on reading DSS files in R.

I've decided to release some R functions I've put together to read from a DSS file.  A simple file to source is on my GitHub, provided without warranty.  Eventually, this'll become an R module.  It saved my butt a few times last week.  R is much better at re-arranging data than a stock Jython install, even version 2.5.

Saturday, April 13, 2013

HEC-DSS files and R

 
Update: If you've come here looking for a way to read DSS files in R, please check out my DSS-Rip library that wraps up the code required to link R to DSS, and simplifies converting DSS time series to R's xts format.

Motivation:


The following is the beginnings of what I hope will become a library to read and write data from HEC-DSS format files.  This is partially inspired by a desire to be able to plot the data in an environment such as ggplot2.

The process:


First, the rJava library for R allows the calling of Java code from within the R environment.  It also provides a nice R-esque/S3-style syntax for calling functions within an object, by using the $ delimiter, as I'll show later.
> library(rJava)
Next, I need to configure the location of my HEC-DSSVue install, as to call the Java functions contained within.  The next few lines may need to be varied for Windows 7.
> dss_location = "C:\\Program Files\\HEC\\HEC-DSSVue\\" 
> jars = c("hec", "heclib", "rma", "hecData") 
> jars = paste0(dss_location, "jar\\", jars, ".jar")
> libs = "-Djava.library.path=C:\\Program Files\\HEC\\HEC-DSSVue\\lib\\"
Now that I have the required JAR files and locations of required DLLs in some variables, I can start the JVM, passing it their locations.
> .jinit(classpath=jars, parameters=libs)
Here's where I create a new DSS file object by calling the static open function that creates a HecDss object:

> dssFile = .jcall("hec/heclib/dss/HecDss", "Lhec/heclib/dss/HecDss;",   method="open", "C:\\test.dss")
Finally, reading a known pathname, and plotting the time series data.  The get function returns a TimeSeriesContainer object, two properties of which are the sequence of timestamps and values at each timestamp. This should not be confused with the read function, which returns a HecMath representation of the data, useful for calling their built in time series math code, but not very helpful if we want the raw numbers.

> data = dssFile$get("/RACCOON CREEK/SWEDESBORO NJ/FLOW/12APR2013/IR-DAY/USGS/")
> plot(data$times, data$values, main="Raccoon Creek - Swedesboro, NJ", xlab="Time", ylab="Flow (cfs)")

Conclusions:


So, it's possible to read, and potentially write DSS data from within R.  I hope that by using the interface to the DSSVue program, I can avoid trying to deal with all sorts of specific cases.  Some future work may require making the DSS files more navigable from code.  This will probably require writing wrappers for the HecDss.get, HecDss.put, and HecDss.getCatalogedPathnames functions so that file can be searched and more R friendly versions of the data can be produced.  I focused this on a Windows environment, because that is what is available to me at work, but a Linux version of DSSVue exists, and a cross-platform solution would be useful.

* I added the "and Python" in the title, because with some luck, this future library and Rpy2 may be an easy way to get data from DSS files into Python.