planetwater

ground- water, geo- statistics, environmental- engineering, earth- science

Archive for 2012

New Books on Numpy, Pandas, Data Analysis

without comments

I became aware recently of three books that are related to data-analysis, modelling, and statistics in a fairly broad sense.

They are pictured below, from left to right:

NumpyBooks

Python for Data Analysis

This is the most in-depth book. It covered the most important python modules: iPython, NumPy, Pandas, Matplotlib. Additionally, it has chapters containing examples on practical issues with data (aggregation, data with a time-stamp, sorting). I really just started diving into it. However, it already led me to upgrade to iPython 13.1. It seems like it is well suited for my level of understanding of programming: Having some experience, trying to learn more existing tools

NumPy Cookbook

The title already gives it away: The book is organised in sections with “recipes”. Mostly, these recipes are self-containing. The focus is clearly on NumPy, even though Matplotlib, iPython, and also Pandas are covered to some extent. I enjoyed browsing through it, most of the examples are interesting (resizing images, playing with PythonAnywhere (like I’ve done before), f.ex). Generally, I think this is a great resource to have.

NumPy 1.5

Despite being by the same author (Ivan Idris), there is positively little overlap between his two books. “NumPy 1.5” covered NumPy in great detail, and is as such mostly useful for beginners who try to use python for some numerical analysis. When I read this book, I also was reminded, that the webpage that lists NumPy functions is a very valuable resource (which I tend to spend too little time with).

It is interesting to see that people realise that there is a market for books explaining open source tools. And I do think those books complement available documentation nicely.

Written by Claus

November 20th, 2012 at 12:01 pm

Posted in

Pandas in Python: Use with Hydrological Time Series

without comments

I recently had some time series analysis to do, and I decided to do this work in Pandas. Particualarly, I had to deal with many timeseries, stretching from different startpoints in time to different endpoints in time. Each timeseries was available as an ASCII file. The data were daily data. The time information was given in three columns, each representing year, month, and day, respectively.

Looking Out

Image: http://flic.kr/p/9gK3ZH

I found these two posts by Randal S. Olson very useful resources:

  • Using pandas DataFrames to process data from multiple replicate runs in Python (link)
  • Statistical analysis made easy in Python with SciPy and pandas DataFrames (link)

Here is a cookbook style layout of what I did:

The following steps show how easy it was to deal with the data 1. Read the input data

    cur_sim_ts = pa.io.parsers.read_table(os.path.join(data_path, 'test', result)
        , header=None
        , sep='\s*'
        , parse_dates=[[0,1, 2070]]
        , names=['year','month', 'day', result[:-4]]
        , index_col=[0]
        )

where `’\s*’ means any whitespace. The dates were given in three colums, one for year, month, and day. Can it get simpler and nicer than that?

  1. It is possible to repeat 1 multiple time, each time extending the pandas data_frame. Unfortunately, this looks a little ugly still, but works

    if counter_cur_file > 0:
        final_df = final_df.combine_first(cur_sim_ts)
    else:
        final_df = cur_sim_ts.copy() / 10.0 
    

    In the else part, the Pandas data_frame is initialised. It so happens that this and only this series of the loop has to be divided by ten. In all other cases the time_series that was read in step 1 is “appended” (or rather combined with) the the previously initialized data_frame. The wicked thing is that each time_series is put at the propper “place” in time within the data_frame. Dates are real dates. This is beautiful, but I had to be a little carfule with the data I had at hand, in which every month has 30 days.

  2. As soon as this data_frame is constructed, things are easy, for example

  • plotting, particularly plotting only a certain time-interval of the data.

     final_df['2070-01-01':'2100-12-30'].plot(ylim=[-10,45])
    
  • saving the data_frame

     final_df.save(os.path.join(out_path, pickle_filename))
    
  1. For me it was of particular interest to find out, how many consecutive dry and wet days there are in each time series. I introduced a threshold of precipitation. If the daily amount of precipitation is above that threshold, this day is considered to be “wet”, else it’s considered to be “dry”. I wanted to count the number of consecutive dry and wet days, and remember them for a time series. This is the purpose of the function below. It is coded a little bute force. Still I was surprised, that it performed reasonably well. If anybody has a better idea, please let me know. Maybe it can be of use for other Pandas users. Note: a time_series in the Pandas world is obtained by looping over a data_frame

    def dry_wet_spells(ts, threshold):
     """
     returns the duration of spells below and above threshold
    
     input
     -----
     ts          a pandas timeseries
     threshold   threshold below and above which dates are counted
    
     output
     ------
     ntot_ts               total number of measurements in ts
     n_lt_threshold        number of measurements below threshold
     storage_n_cons_days   array that stores the lengths of sequences
                           storage_n_cons_days[0] for dry days
                           storage_n_cons_days[1] for wet days
     """
     # total number in ts
     ntot_ts = ts[~ ts.isnull()].count()
     # number lt threshold
     n_lt_threshold = ts[ts <= threshold].count()
    
     # type_day = 0   # dry
     # type_day = 1   # wet
    
     # initialisierung: was ist der erste Tag
     type_prev_day = 0
     storage_n_cons_days = [[],[]]
     n_cons_days = 0
    
     for cur_day in ts[~ ts.isnull()]:
         # current day is dry
         if cur_day <= threshold:
             type_cur_day = 0
             if type_cur_day == type_prev_day:
                 n_cons_days += 1
             else:
                 storage_n_cons_days[1].append(n_cons_days)
                 n_cons_days = 1
             type_prev_day = type_cur_day
         else:
             type_cur_day = 1
             if type_cur_day == type_prev_day:
                 n_cons_days += 1
             else:
                 storage_n_cons_days[0].append(n_cons_days)
                 n_cons_days = 1
             type_prev_day = type_cur_day
    
     return ntot_ts, n_lt_threshold, storage_n_cons_days    
    
  2. With all of this, I can produce histograms like this:

P 9224 MEAS 1950 01 1992 04 HistDryWetDays

Written by Claus

November 20th, 2012 at 9:57 am

Posted in

Water Levels in New York City

with 4 comments

update Monday; October 29, 2012 3:18pm (CDT) Eric Holthaus at the Wall Street Journal has a more in-depth analysis for the coming hours


I’ve been following the unfolding Sandy in the last couple of days. I’ve been relying of many online sources. Since I don’t have anything original to say, I decided to wait with a post here until things have settled down a bit.

However, I wanted to share this one chart of the water levels at “The Battery NY” (original available here. It shows observed (red dots) vs. modelled (pink and green) water levels. Note that the current water level is (slightly) higher than predicted. This has been true for the last couple of high tides, but those had smaller peaks.

Water Levels in New York City

The coming hours will be critical! There is a high tide coming up, coinciding with the landfall of Sandy. Additionally there’s a mid-latitude trough just east of the Great Lakes that pulls Sandy onto the North American continent. Additionally, as if that wasn’t enough, the North Atlantic Oscillation is in a negative phase, pulling Sandy towards the North-East. Decent summaries can be found here and here.

Written by Claus

October 29th, 2012 at 9:09 pm

Posted in

Update on planetwater Calendar

with 2 comments

update Thursday; October 25, 2012 thanks to @wpworx I do have now a solution that works:

I resorted to a google calendar, which I will use to maintain planetwater.org relevant dates. Here are the advantages:

  • you can subscribe to the ics calendar using this url
  • It syncs to a dedicated Calendar page on planetwater.org, so readers on the blog should see what events are happening in the future (with some sync delay)
  • I have it setup in my Calendar so I can directly enter events there

Recently, the calendar plugin (“Event Calendar“) that I used to use within WordPress (WP), and that has worked reasonably well over the past few years has decided to not work anymore. This might be the reason why some posts were not visible in the last couple of days.

My requirements seem simple:

  • add/modify/delete posts
  • subscribe to the “posts” / “events” in iCal (or any other calendar app)
  • maybe, have access to a WP widget

The following list shows some of the solutions I played with:

Neither of these satisfied me. Some of are bloated, some don’t allow for the deletion of “posts”/”events”.

To me, of course, it seems like I don’t need or want anything fancy. But I just could not find online an existing possibility for WP. Can anyone help?

Right now, I have a version of “The Events Calendar” running, but there is not much there. For now, the old posts are stored, but not visible.

Written by Claus

October 24th, 2012 at 8:36 pm

Posted in

Python: How to start and how to make python(x,y) work with ArcGIS 10

without comments

In the end, it comes down to finding a little project and you have to figure stuff for yourself on the way. My hope is that this write-up saves you some time with technicalities.

My python setups are different between my mac and my windows machine — on windows I am largely limited by the python version that ESRI imposes with Arc10. On my mac, there is no such thing. The link between ArcGIS and me is still existent from my M.Sc. work. A side project back then was to build a calibration tool that is not only based on point measurements. Instead, the goal was to get the general hydraulic head surface in the two main aquifers reasonably right. Back then I delved into VBA, and it worked. Back then we were using Arc 8.0, and the inclusion of python in Arc was just starting (and numpy either didn’t exist or I didn’t know about it). I am still not very happy with the implementation of python in ArcGIS, but it is one of the reason, why I keep 2.x versions around. A current ESRI view is expressed here and in these pdf slides

Python

There are several python versions out there. The big distinction is between versions 2.x and 3.x. Obviously, 3.x is newer, but a lot of packages (“modules”) are compatible to 3.x yet. Hence, I would suggest, for now, to stick with 2.x — currently, I use 2.7.3. on the mac and 2.6.x on windows

Where do I get the installer from?

Python is open source, and basically, you can get an installer right from the python homepage. Probably, this is a smart way to do. Even smarter it would probably be to download the source and compile it directly on your machine. But, we are no computer scientists. Plus you will need a few more packages (“modules”), despite the fact that python comes with “batteries included“, i.e. a lot of powerful possibilities within the standard library (the “pure” version that you can install from python.org).

To facilitate, there are a few people/organisations who provide packaged “python distributions”, which are updated in certain intervals. However, if you always want the bleeding edge version of a module, you can’t go that path.

I use the “Enthought Python Distribution” (EPD) on the mac, which is free for educational use. On windows, a good way to go is probably pythonxy. Other options on the mac include the Scipy Superpack or a combination of homebrew and pypi installs.

Installation with ESRI Arc10

I found that this hack works for me. It is based on information that I collected on various webpages (sorry for lack of reference here). I am sure there are other/better solutions. Please let me know! 😉

  • install ArcGIS 10 WITHOUT python (unfortunately, this could mean to de-install it first completely)
  • install python(xy) 2.6.6.1 in C:\python26 (download)
  • create a file that is called Desktop10.pth, and put the following three lines (ASCII) into that file:
    C:\Program Files (x86)\ArcGIS\Desktop10.0\bin
    C:\Program Files (x86)\ArcGIS\Desktop10.0\arcpy
    C:\Program Files (x86)\ArcGIS\Desktop10.0\ArcToolbox\Script

  • copy that file into the directory C:\python26

Where to write code (IDEs)

For windows, this is a no-brainer, because it is so awesome: pyscripter. On other OSes, I use eclipse, with the pydev plugin.

Packages (“Modules”)

  • Numpy: numerical library (“numerical python”): read from files directly into arrays, matrix algebra, basic stats — this is so awesome and so much fun! This page will be your best friend!
  • Scipy: the sister of numpy — “scientific python”: all kinds of goodness: optimizers, fft, all kinds of statistics, image processing
  • matplotlib: make awesome plots straight out of python (similar to matlab); somewhat limited still to 2D, even if 3D is possible
  • pandas: if you ever have to deal with data that has time stamps – use pandas!
  • basemap: In the last little while I have dealt mostly with rasters, and it is quite easily possible to display them in matplotlib using the command “imshow”. For more advanced mapping related stuff, basemap might be an option (check out their gallery)

There are many, many more. And you can write your own packages (such as copyla)

Where to start

I think the best starting point is the python tutorial. After this I would go to the scipy/numpy/matplotlib “lecture notes“. Andy Terrel has put together a good resource page, but I think it is a bit advanced.

Books

There are many books. Generally, I read through parts of I think all the o’reilly ones, and I can recommend them, even if I don’t own any. I do own the one by John Zelle, which I enjoy, but it covers general computer science concepts, using python, but it is not really ideal to learn python. Geoff has one of the Langtangen books, which are all very good. Generally, there is so much information available online, that I find books become more and more obsolete.

Written by Claus

October 24th, 2012 at 5:21 am

Posted in

Identi.ca Weekly Updates for 2012-09-21

without comments

Written by Claus

September 21st, 2012 at 1:00 pm

Posted in

Identi.ca Updates for 2012-09-18

without comments

Written by Claus

September 18th, 2012 at 8:00 am

Posted in

Identi.ca Updates for 2012-09-16

without comments

Written by Claus

September 16th, 2012 at 8:00 am

Posted in

Identi.ca Weekly Updates for 2012-08-17

without comments

Written by Claus

August 17th, 2012 at 1:00 pm

Posted in

Charts of Extremes

without comments

Two charts caught my eye during the last two days. From a statistical viewpoint, both show extreme events.

The first chart shows monthly temperature anomalies (deviance from long term mean) for 117 years worth of temperature data for the U.S. Since march, this year has been the hottest on record, and not by a slim margin. Read the full post at newscientist.com

year-to-date temperature anomalies for contiguous U.S Photo by newscientist.org / NOAA

The second chart relates to arctic sea ice extent. The spring did not start out in an extreme way. There was a comparatively large area covered by sea ice in April / May. The area shrunk, however, and since a couple of days, the areal extent of sea ice compared to this time in previous years, is the smallest. The full post with more charts is available at realclimate.org

areal extent of arctic sea ice Photo by realclimate.org / IARC-JAXA

Written by Claus

August 14th, 2012 at 8:52 am

Posted in