Creating interactive graphs: an example with twitter data

I recently had to make an interactive visualisation of some twitter-data. Here I’ll explain how I went about it.

rtweet

Getting twitter data is reasonably easy once you have the rtweet package going, although there are certainly some steps that you have to go through to set-up this package (with respect to the API authorization). You can read about these steps here. We’ll also use the tidyverse-package (as always, I will rely heavily on ggplot that is within the tidyverse). We’ll also use the widgetframe-package, so that we can display the different interactive graphs.

# install.packages("rtweet")
library(rtweet)
library(tidyverse)
library(widgetframe)

Let’s get the tweets from Andrew Gelman (@StatModeling) and store it in a dataframe; we can maximally get ~3200 posts. Here, we’ll onlu use the variables: created_at (date at which tweet was made), text (text of the tweet), and retweet_count (number of times tweet was retweeted).

gelman <- get_timelines(c("StatModeling"), n = 5000)

Plotting the number of retweets over time:

gelman_plot <- ggplot(gelman, aes(x = created_at, y = retweet_count)) +
  geom_point(colour = "lightblue", alpha = 0.5, size = 2) +
  geom_smooth(se = FALSE, colour = "orange", alpha = 0.5) +
  theme_minimal()  +
  labs(x = "Date", y = "# retweets", 
       title = "Number of retweets of tweets '@StatModeling' over time") 
gelman_plot 

Because of one extreme outlier, we can’t really get enough detail on most of the tweets. So let’s zoom in first:

gelman_plot + coord_cartesian(ylim = c(0, 130))

Better, but still the bulk is non-visible, let’s try one more solution, by transforming the y-scale into a logarithmic scale. There is one problem: many tweets are not retweeted, so there are zeros in our raw data. A log of zero does not exist. Zeros should not be on the log-scale. But, breaking social scientific norms, I shall add them anyway, because I do not want to remove those data in the visuals. In order to do so, I’ll add one to every retweet-count. To compensate for having transformed the data, I will adjust the y-axis. So a newly transformed datapoint with the value 17 needs to have a label of 16 in the graph.

gelman_plot + 
  aes(y = retweet_count + 1) +
  scale_y_continuous(trans = "log2",
                     breaks = c(1, 2, 3, 5, 9, 17, 33, 65, 129, 257, 513),
                     labels = c(0, 1, 2, 4, 8, 16, 32, 64, 128, 256, 512))

Looks a bit funky, but we get to see some more datapoints. Finally, let’s add all the above information in one function call and tweak some features of the graph:

gelman_plot <- ggplot(gelman, aes(x = created_at, y = retweet_count + 1)) +
  geom_point(colour = "lightblue", alpha = 0.5, size = 2) +
  geom_smooth(se = FALSE, colour = "orange", alpha = 0.5) +
  theme_minimal()  +
  labs(x = "Date", y = "# retweets", 
       title = "Number of retweets of tweets \n'@StatModeling' over time") +
  scale_y_continuous(trans = "log2",
                     breaks = c(1, 2, 3, 5, 9, 17, 33, 65, 129, 257, 513),
                     labels = c(0, 1, 2, 4, 8, 16, 32, 64, 128, 256, 512)) +
  theme(panel.grid.minor = element_blank(),
        panel.grid.major.x = element_blank(),
        axis.title = element_text(size = 20),
        axis.text = element_text(size = 15),
        title = element_text(size = 22)) 
gelman_plot

Let’s make it interactive with ggiraph

ggiraph is an amazing package for making ggplot-graphs interactive. The beauty is that they work in html-pages, meaning that it’s self-containted and that you don’t need have a server with R running in the background (as you would need for, for instance, Shiny-apps). The code looks almost the same, except that instead of geom_point we use geom_point_interactive. Within this function, we can specify a third variable that will show the value for that variable for the datapoint above which you hover your mouse. Let’s try with the variable text which stores the text of that tweet:

library(ggiraph)
gelman_int <- ggplot(gelman, aes(x = created_at, y = retweet_count + 1)) +
  geom_point_interactive(aes(tooltip = text), 
                         colour = "lightblue", alpha = 0.5, size = 2) +
  geom_smooth(se = FALSE, colour = "orange", alpha = 0.5) +
  theme_minimal()  +
  labs(x = "Date", y = "# retweets", 
       title = "Number of retweets of tweets \n'@StatModeling' over time") +
  scale_y_continuous(trans = "log2",
                     breaks = c(1, 2, 3, 5, 9, 17, 33, 65, 129, 257, 513),
                     labels = c(0, 1, 2, 4, 8, 16, 32, 64, 128, 256, 512)) +
  theme(panel.grid.minor = element_blank(),
        panel.grid.major.x = element_blank(),
        axis.title = element_text(size = 20),
        axis.text = element_text(size = 15),
        title = element_text(size = 22)) 
  
ggiraph(code = print(gelman_int), width = 1 )
## Error in set_attr(ids = as.integer(ids), str = encode_cr(x$tooltip), attribute = "title"): str cannot contain single quote "'".

We get some error-message because apparently there are in the tweets that we can’t use. Let’s remove them all with the help of stringr.

library(stringr)
gelman$text_woq <- str_replace_all(gelman$text, "'", "")

gelman_int <- ggplot(gelman, aes(x = created_at, y = retweet_count + 1)) +
  geom_point_interactive(aes(tooltip = text_woq), 
                         colour = "lightblue", alpha = 0.5, size = 2) +
  geom_smooth(se = FALSE, colour = "orange", alpha = 0.5) +
  theme_minimal()  +
  labs(x = "Date", y = "# retweets", 
       title = "Number of retweets of tweets \n'@StatModeling' over time") +
  scale_y_continuous(trans = "log2",
                     breaks = c(1, 2, 3, 5, 9, 17, 33, 65, 129, 257, 513),
                     labels = c(0, 1, 2, 4, 8, 16, 32, 64, 128, 256, 512)) +
  theme(panel.grid.minor = element_blank(),
        panel.grid.major.x = element_blank(),
        axis.title = element_text(size = 20),
        axis.text = element_text(size = 15),
        title = element_text(size = 22)) 

ggiraph(code = print(gelman_int), width = 1 )

The above code will work on your computer, but to have multiple interactive graphs on a website based on blogdown, one needs to make the interactive graps stand-alone and include them as a an hmtlwidget through frameWidget. [Basically, when you have multiple interactive ggiraph graphs, the output of those graphs will be written to the same folders (that have to do with JavaScript libraries), meaning that there will be annoying overlap because some files have the same name, and the interactive graphs–all of them– will use the most recent version. Tres annoying.]

frameWidget(
  ggiraph(code = print(gelman_int), width = 0.8 )
  )
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

Boom an interactive graph! When you hover above the points you’ll see the contents of the tweet!

plot.ly

What happens if we use plotly:

library(plotly)
ggplotly(gelman_plot)

The plot.ly one also looks good! It doesn’t seem to do the dynamic sizing as the ggiraph one but it is a bit easier in use. Let’s see if we can also get the tweet-text in the plotly graph

gelman_plus_text <- ggplot(gelman, aes(x = created_at, y = retweet_count + 1,
                                       text = paste("tweet:", text))) +
  geom_point(colour = "lightblue", alpha = 0.5, size = 2) +
  geom_smooth(aes(text = ""),
              se = FALSE, colour = "orange", alpha = 0.5) +
  theme_minimal()  +
  labs(x = "Date", y = "# retweets", 
       title = "Number of retweets of tweets \n'@StatModeling' over time") +
  scale_y_continuous(trans = "log2",
                     breaks = c(1, 2, 3, 5, 9, 17, 33, 65, 129, 257, 513),
                     labels = c(0, 1, 2, 4, 8, 16, 32, 64, 128, 256, 512)) +
  theme(panel.grid.minor = element_blank(),
        panel.grid.major.x = element_blank(),
        axis.title = element_text(size = 20),
        axis.text = element_text(size = 15),
        title = element_text(size = 22)) 
ggplotly(gelman_plus_text)

Hovering over the highest datapoint, reveals that the tweet was about power and interaction. Let’s see the full text of the tweet:

gelman %>% 
  filter(retweet_count == max(retweet_count, na.rm = TRUE)) %>%
  pull(text)
## [1] "You need 16 times the sample size to estimate an interaction than to estimate a main effect https://t.co/JzGyOIIDOY"