R, Statistics and Visualization

UseR! 2017 Recap

I recently returned from a week at the UseR! 2017 conference in Brussels, which was a great opportunity to catch up on the latest trends in the R world. This conference was noticeably different from the 2015 Aalborg conference in the demographics of the audience: in prior conferences, the attendees were overwhelmingly PhD faculty or PhD candidates, but at this conference many, if not the majority, were consultants and practitioners from industry. There is a lot to cover, so I’ll split things into a few categories:

The opening reception was well attended...and the speeches were delightfully short.
UseR! 2017 was the first to have on-site childcare, with a large check-in and supervised play area.
The beer pairings for the UseR! 2017 conference banquet were great.

At each of the UseR! conferences, you can look back and see a trend in the talks submitted; here are the trends from the conferences that I have attended:

  • Ames, 2007–this was really the year of critical mass. Everyone talked about this being the T-shirt to save to prove that you had attended. Sweave and ODFweave were big topics, but did not dominate discussion as they would in 2011. The difficulty of managing package dependencies and long-term support was a major discussion point; most packages were developed in academia, but faculty had (and have) no incentive for ongoing support. There were big pharma and finance contingents at this conference.
  • Dortmund, 2008–I think infrastructure, computational speed and parallelism were the major topics. This was the year that I remember people first talking about alternative interpreters and the possible routes (and potential funding) for converting the interpreter from 32 to 64 bits.
  • Rennes, 2009–this was the year the conversion from plot to ggplot2 began. There was a lot of discussion about lattice as well, but much more about ggplot2.
  • Warwick, 2011–reproducible research and integrated development environments (IDEs) were the major topics here. RStudio, other IDEs, sweave and knitr were big topics. The pharma crowd generally did not make it to this conference, as most were going to the BioC conference by this point.
  • Albacete, 2013–dplyr was introduced. The conversion to 64-bit and release of 3.0 were the major topics, along with parallelism in graphics cards (cuda). Improvements to CRAN and package dependency management were big discussion points. This was the first year that the finance crowd largely did not show up, as most were going to the RFinance conference at this point.
  • Aalborg, 2015–the Microsoft acquisition of Revolution Analytics was huge news, as was all manner of interactive visualizations. Alternative interpreters were a major point of discussion at this conference. Docker was discussed in several settings.

At Brussels, 2017, there was surprisingly little discussion of alternative interpreters, static and interactive graphics, or IDEs. The three themes that I could pick out were:

  • Natural language processing (NLP)
  • Mapping
  • Extensions to Shiny

The next sections will talk about these major themes and some individual lectures that I think should get outsized attention.

Natural Language Processing

In this year’s tutorial lists, there was a half-day tutorial on natural language processing of text, and a four-presentation session on NLP-related topics. There were two major packages discussed:

  • cleanNLP and coreNLP by Taylor Arnold and Lauren Tilton. These were the primary packages discussed in the Tuesday afternoon tutorial led by the package authors, both of the University of Richmond. They are also the co-authors of a book, Humanities Data in R. Tilton is a historian who uses text analysis to examine authorship and other questions of interest in historical documents. Arnold is a statistician whose work is primarily algorithmic. These packages are quite robust, but take some work to learn to use effectively.
  • tidytext by Julia Silge, co-author of Text Mining with R: A Tidy Approach. Silge is a data scientist at Stack Overflow. The tidytext package appears to be easier to use but somewhat less capable for complex analysis.
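To give a flavor of the tidytext approach, here is a minimal sketch (my own illustration, not from the talks): tokenize a tiny made-up corpus into a one-token-per-row data frame, drop stop words, and count word frequencies with ordinary dplyr verbs.

```r
# Illustrative tidytext sketch: tokenize, remove stop words, count.
library(dplyr)
library(tidytext)

docs <- tibble(
  doc  = c("a", "b"),
  text = c("It is a truth universally acknowledged",
           "that a single man must be in want of a wife")
)

docs %>%
  unnest_tokens(word, text) %>%            # one row per word
  anti_join(stop_words, by = "word") %>%   # drop common stop words
  count(word, sort = TRUE)                 # word frequencies
```

Because the result is a plain data frame, everything downstream (sentiment joins, tf-idf, plotting) is just more tidyverse pipeline.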

It is clear that text mining is now a mainstream application.

Taylor Arnold describes cleanNLP CRAN package using an Arthur Conan Doyle (Sherlock Holmes) document corpus to illustrate character appearance frequency.
Lauren Tilton describes cleanNLP CRAN package using a US State of the Union address document corpus to illustrate changes in document length over time.
Julia Silge describes tidytext CRAN package using a Jane Austen document corpus to illustrate sentiment analysis.

A Tidal Wave of Mapping

Mapping is another application that has improved to the point that it is mainstream, as there were two mapping-related tutorials and several presentations. Much of the discussion was on the migration of mapping tools from the older sp to newer sf (tidy) spatial object types. If writing new code, you definitely want to use sf object types wherever possible.
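A minimal sketch of the sf style, using the North Carolina demo shapefile that ships with the package: `st_read()` returns an sf object, which is an ordinary data frame plus a geometry list-column, so tidyverse verbs work on it directly.

```r
# Minimal sf sketch: read a shapefile, inspect, and plot the geometry.
library(sf)

nc <- st_read(system.file("shape/nc.shp", package = "sf"), quiet = TRUE)
class(nc)              # an "sf" object that is also a "data.frame"
plot(st_geometry(nc))  # outline map of the NC counties

# Legacy sp objects can be converted in one step:
# nc_sf <- st_as_sf(old_sp_object)
```

The `st_as_sf()` converter is what makes the sp-to-sf migration discussed at the conference relatively painless for existing code.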

Shiny Stuff

Shiny has been around for several years now, but some people have stayed away from it due to the high cost of the Enterprise version with encryption and authentication. It is much easier to solve those problems with some of the new tools presented at the conference, especially ShinyProxy and pool.

Secure the Open Source Version of Shiny–ShinyProxy

ShinyProxy is a new open-source tool that replaces the proxy server that is internal to Shiny. It allows you to run a cluster of Shiny servers and to implement SSL encryption without buying the enterprise version of Shiny. This will make Shiny deployments much, much easier.

Speed up Database Connections in Shiny–pool

In large Shiny implementations, the cost of opening and closing database connections can become a big performance problem. The pool package implements connection pooling–something used in transaction-processing systems for years–in Shiny. If you are doing a large Shiny implementation, this is an important new tool.
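The pattern looks roughly like this (a sketch using an in-memory SQLite database as a stand-in for a real server): create one shared pool at app startup, query it as if it were a single connection, and close it on shutdown.

```r
# Connection-pooling sketch for a Shiny app with the pool package.
library(shiny)
library(pool)
library(RSQLite)

# One pool for the whole app; connections are checked out per query
# and returned automatically, avoiding per-request connect overhead.
db <- dbPool(RSQLite::SQLite(), dbname = ":memory:")
onStop(function() poolClose(db))  # release all connections on shutdown

# Inside a server function, the pool is used like a plain connection:
# output$tbl <- renderTable(dbGetQuery(db, "SELECT * FROM mytable"))
```

The key design point is that the pool object is created once at the top level of the app, not inside the server function, so all sessions share it.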

Docker Was Common

In Albacete (2013) there was a lot of discussion about the use of R in finance applications and the problems of reproducibility of calculations for regulatory compliance. No one had a clean solution, and the best answer appeared to be virtual machines. In Aalborg (2015), several presentations discussed Docker as a way to help with the configuration-management problems common to regulatory compliance. In Brussels, Docker was a theme underlying numerous presentations. This is clearly part of the mainstream skill set at this point.

Tidyverse

The tidy/tidyverse approach to data structures has become the de facto standard, as was clear in numerous presentations.

Mixed Integer Programming

There was a single presentation on a pair of new packages called ompr and ompr.roi by Dirk Schumacher that implement easy-to-use model-definition functions for mixed integer linear programs. This session was sparsely attended, but the people who were there–and who stayed to talk to the author–were key players in the R world. These packages are a huge deal for people (like me) who come from an operations research background, and they will make a number of statistical and analysis methods much easier to implement.
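To show why the algebraic style matters, here is a sketch of a tiny knapsack model in ompr's pipe-friendly DSL, solved through ompr.roi with the GLPK solver (this assumes ROI.plugin.glpk is installed; the weights and values are made-up illustration data).

```r
# Knapsack sketch with ompr: pick items maximizing value under a
# weight limit of 10, using binary decision variables x[i].
library(dplyr)
library(ompr)
library(ompr.roi)
library(ROI.plugin.glpk)

value  <- c(10, 13, 7)
weight <- c(5, 8, 4)

result <- MIPModel() %>%
  add_variable(x[i], i = 1:3, type = "binary") %>%
  set_objective(sum_expr(value[i] * x[i], i = 1:3), "max") %>%
  add_constraint(sum_expr(weight[i] * x[i], i = 1:3) <= 10) %>%
  solve_model(with_ROI(solver = "glpk"))

get_solution(result, x[i])  # which items were selected
```

The model reads almost like the textbook formulation, which is exactly what makes these packages attractive to the operations research crowd.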

Dirk Schumacher describes his ompr CRAN package for specifying mixed integer programming optimization models.

Parallel and Cloud

Although not a development that will change the content of next year’s UseR! conference, the doAzureParallel CRAN package will make parallel processing at cloud-scale much easier.
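The appeal is that doAzureParallel registers an Azure Batch cluster as an ordinary foreach backend, so existing `%dopar%` loops run in the cloud unchanged. A sketch of the workflow (this requires real Azure credentials; the JSON file names here are placeholders):

```r
# doAzureParallel sketch: cloud-scale foreach with an Azure Batch pool.
# "credentials.json" and "cluster.json" are placeholder config files.
library(doAzureParallel)

setCredentials("credentials.json")      # Azure account details
cluster <- makeCluster("cluster.json")  # pool size, VM type, etc.
registerDoAzureParallel(cluster)        # register as foreach backend

results <- foreach(i = 1:100) %dopar% {
  mean(rnorm(1e6))  # each iteration runs on a cluster node
}

stopCluster(cluster)  # tear down the Azure pool when finished
```

Aside from the setup and teardown lines, this is the same code you would write for a local doParallel backend.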

The doAzureParallel CRAN package will be very useful for some parallel processing applications.

Making Web Sites Accessible to the Blind

One of the most enlightening presentations that I attended was Jonathan Godfrey’s “Interactive Graphs for Blind and Print Disabled People.” The short story is that the screen readers used by blind data scientists cannot do much with .png and other bitmap formats, but can read and translate a great deal from .svg files. The .svg files created with ggsave are not useful for screen readers, but files created with gridSVG can be interpreted by them.
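The change in workflow is small. Because ggplot2 draws through the grid graphics system, you render the plot and then export the live grid scene with gridSVG instead of calling ggsave; a sketch:

```r
# Export a ggplot2 figure as structured SVG via gridSVG rather than
# ggsave(), so screen readers can traverse the plot's elements.
library(ggplot2)
library(gridSVG)

p <- ggplot(mtcars, aes(wt, mpg)) + geom_point()
print(p)                        # render the plot onto the grid device
grid.export("mtcars-plot.svg")  # write structured SVG, not a bitmap
```

The resulting SVG retains named elements for the plot components instead of a flat raster image, which is what the screen readers can work with.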

As a result of this lecture, I am changing my workflow.

Summary

UseR! 2017 did not disappoint.
