We had a great all-day R workshop on campus today, and I learned quite a bit. I’d struggled in the past when trying to learn R on my own, particularly with parsing data sets and doing things in R that I’m used to doing with SQL (such as filtering results on conditions, as with a WHERE clause).
I was surprised and happy to learn in our workshop about the R package sqldf, which lets you use SQL syntax against data frames. Very cool stuff. Unfortunately, I noticed it was dog-slow, at least on the medium-sized (200k-row) data set I used. We also learned a bit about the dplyr package, which seems to offer very similar functionality, but with a somewhat steeper learning curve (for me, at least) than sqldf.
> ptm <- proc.time()
> sqldf_results <- sqldf("SELECT * FROM hflights where Origin = 'IAH'")
> proc.time() - ptm
   user  system elapsed
  1.091   0.053   1.142
> ptm <- proc.time()
> tbldf_results <- tbl_df(hflights) %.% filter(Origin=="IAH")
> proc.time() - ptm
   user  system elapsed
  0.027   0.007   0.034
dplyr was ~33x faster on this data set. As much as I’d love to reuse my rather extensive SQL knowledge, I think it’s time to learn the ins and outs of dplyr.
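For other SQL folks making the same transition, here is a rough sketch of how some common SQL clauses map onto dplyr verbs, using the hflights data set from the timings above. This assumes the dplyr and hflights packages are installed; it uses the %.% chaining operator from the dplyr version shown above, though later releases favor %>% instead:

```r
library(dplyr)
library(hflights)

flights <- tbl_df(hflights)

# SQL: SELECT Origin, Dest, ArrDelay FROM hflights
#      WHERE Origin = 'IAH' ORDER BY ArrDelay DESC
iah_delays <- flights %.%
  filter(Origin == "IAH") %.%        # WHERE clause
  select(Origin, Dest, ArrDelay) %.% # SELECT column list
  arrange(desc(ArrDelay))            # ORDER BY ... DESC

# SQL: SELECT Dest, AVG(ArrDelay) AS mean_delay
#      FROM hflights GROUP BY Dest
by_dest <- flights %.%
  group_by(Dest) %.%                 # GROUP BY
  summarise(mean_delay = mean(ArrDelay, na.rm = TRUE))
```

Each dplyr verb takes a data frame and returns a data frame, so the chain reads top to bottom much like a SQL statement reads clause by clause.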