
We had a great all-day R workshop on campus today, and I learned quite a bit. I'd struggled in the past when trying to learn R on my own, particularly with parsing data sets and doing the things I'm used to doing in SQL (such as filtering results on a condition, like with a WHERE clause).
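
(For anyone else coming from SQL: the base-R answer to a WHERE clause is bracket subsetting. A minimal sketch, using a made-up data frame:)

# hypothetical data frame standing in for a real data set
flights <- data.frame(Origin = c("IAH", "HOU", "IAH"),
                      Dest   = c("DFW", "ORD", "LAX"))

# roughly equivalent to: SELECT * FROM flights WHERE Origin = 'IAH'
iah_only <- flights[flights$Origin == "IAH", ]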

I was surprised, and happy, to learn in our workshop about the R package sqldf, which lets you use SQL syntax against data frames. Very cool stuff. Unfortunately, I noticed it was dog-slow, at least on the medium-sized (roughly 200k-row) data set I used. We also learned a bit about the dplyr package, which seems to offer very similar functionality but with a somewhat steeper learning curve (for me, at least) than sqldf.
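
The timings below assume the packages and the hflights sample data are already installed and loaded, roughly like this (the %.% in the transcript is dplyr's old chaining operator; newer versions use %>%):

# one-time install of the packages used below
install.packages(c("sqldf", "dplyr", "hflights"))

library(sqldf)     # run SQL queries against data frames
library(dplyr)     # data-manipulation verbs: filter(), select(), etc.
library(hflights)  # Houston flights sample data (~227k rows)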

> ptm <- proc.time()
> sqldf_results <- sqldf("SELECT * FROM hflights where Origin = 'IAH'")
> proc.time() - ptm
user system elapsed
1.091 0.053 1.142
> ptm <- proc.time()
> tbldf_results <- tbl_df(hflights) %.% filter(Origin=="IAH")
> proc.time() - ptm
user system elapsed
0.027 0.007 0.034

Yikes.

dplyr was ~33x faster on this data set. As much as I'd love to reuse my rather extensive SQL knowledge, I think it's time to learn a bit more about the ins and outs of dplyr.
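
To that end, here's a quick sketch of the kind of dplyr chaining I want to get comfortable with, written with the newer %>% pipe (column names are from the hflights data; the particular summary is just an example):

library(dplyr)
library(hflights)

# average arrival delay by destination for flights departing IAH
hflights %>%
  filter(Origin == "IAH") %>%
  group_by(Dest) %>%
  summarise(mean_arr_delay = mean(ArrDelay, na.rm = TRUE)) %>%
  arrange(desc(mean_arr_delay))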