lexRankr & Twitter: find a user's most representative tweets

Packages Used

library(lexRankr)
library(tidyverse)
library(stringr)
library(httr)
library(jsonlite)

In this post we’ll pull tweets from the Twitter API and then analyze them with lexRankr to find a user’s most representative tweets. lexRankr is an R implementation of the LexRank algorithm, which can be used for extractive text summarization.

If you don’t care about interacting with the Twitter API, you can jump to the Lexrank Analysis section.

Get user tweets

Before we can analyze tweets we’ll need some tweets to analyze. We’ll be using Twitter’s API, so you’ll need to set up an account to get the keys the API requires. The credentials needed are: consumer key, consumer secret, token, and token secret. Once we have the keys, we’ll set them as environment variables for use in the rest of the code in this post. Below is how to set up your credentials and sign requests to the Twitter API.

## set api tokens/keys/secrets as environment vars
# Sys.setenv(cons_key     = 'my_cons_key')
# Sys.setenv(cons_secret  = 'my_cons_sec')
# Sys.setenv(token        = 'my_token')
# Sys.setenv(token_secret = 'my_token_sec')

#sign oauth
auth <- httr::oauth_app("twitter",
                        key=Sys.getenv("cons_key"),
                        secret=Sys.getenv("cons_secret"))
sig  <- httr::sign_oauth1.0(auth, 
                            token=Sys.getenv("token"), 
                            token_secret=Sys.getenv("token_secret"))
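
Empty environment variables tend to surface later as confusing authentication errors, so it can help to check them up front. The helper below is a minimal sketch, not part of httr or lexRankr; `missing_credentials` is a hypothetical name, and the variable names are simply the ones used in the code above.

```r
# Hypothetical helper: report which of the expected environment
# variables are unset or empty. Names match the Sys.setenv() calls above.
required_vars <- c("cons_key", "cons_secret", "token", "token_secret")

missing_credentials <- function(vars = required_vars) {
  # Sys.getenv() returns "" for unset variables by default
  vars[!nzchar(Sys.getenv(vars))]
}

missing <- missing_credentials()
if (length(missing) > 0) {
  message("Missing Twitter credentials: ", paste(missing, collapse = ", "))
}
```

If the message lists any names, set those variables before calling the signing code.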

Now that we have our credentials set up, we’ll use a custom function to get a user’s tweets. The function definition is a little lengthy to be in the middle of a blog post, but the code for the function get_timeline_df can be seen here. The function takes a user’s Twitter handle, the number of tweets to get from the API, and the credentials we just set up; it returns a dataframe with the columns created_at, favorite_count, retweet_count, text.

We can now use our function to gather a user’s tweets. For an example, let’s use one of the most famous Twitter accounts as of late: @realDonaldTrump.
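
For readers who want a rough idea of what such a function involves, here is a simplified sketch. It is not the author’s get_timeline_df: the names `timeline_json_to_df` and `sketch_timeline_df` are hypothetical, the endpoint is the v1.1 `statuses/user_timeline.json` endpoint as I understand it (access rules may differ today), and it fetches a single page of at most 200 tweets rather than paging back to 600. An `id` column is included on the assumption that the later `bind_lexrank` call needs one.

```r
# Requires the httr and jsonlite packages (called via `::`).

# Hypothetical helper: convert the JSON body of a timeline response
# into a data frame with the columns used later in this post.
timeline_json_to_df <- function(json_text) {
  tweets <- jsonlite::fromJSON(json_text)
  data.frame(
    id             = tweets$id_str,   # string ids avoid 64-bit precision loss
    created_at     = tweets$created_at,
    favorite_count = tweets$favorite_count,
    retweet_count  = tweets$retweet_count,
    text           = tweets$text,
    stringsAsFactors = FALSE
  )
}

# Hypothetical single-page fetch; `oauth_sig` is the object returned by
# httr::sign_oauth1.0(). No paging via max_id is attempted here.
sketch_timeline_df <- function(user, n_tweets = 200, oauth_sig) {
  resp <- httr::GET(
    "https://api.twitter.com/1.1/statuses/user_timeline.json",
    oauth_sig,
    query = list(screen_name = user, count = min(n_tweets, 200))
  )
  httr::stop_for_status(resp)
  timeline_json_to_df(httr::content(resp, as = "text", encoding = "UTF-8"))
}
```

Splitting the parsing step from the HTTP call keeps the parsing testable on a saved JSON string without touching the network.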

tweets_df <- get_timeline_df("realDonaldTrump", 600, sig) %>% 
  #clean out newlines for display
  mutate(text = str_replace_all(text, "\n", " "))

tweets_df %>% 
  mutate(Date=strptime(created_at, 
                       format="%a %b %d %H:%M:%S +0000 %Y") %>% 
           str_sub(end=10)) %>% 
  arrange(desc(Date)) %>% 
  head(n=3) %>% 
  select(`Tweet Text`=text, Date) %>% 
  knitr::kable()

| Tweet Text | Date |
|:---|:---|
| Honored to meet this years @SenateYouth delegates w/ @VP Pence in the East Room of the @WhiteHouse. Congratulations https://t.co/oQIx7LybCV | 2017-03-09 |
| ‘U.S. Consumer Comfort Just Reached Its Highest Level in a Decade’ https://t.co/S8nZgmeMMV https://t.co/xC0piRa6eP | 2017-03-09 |
| Despite what you hear in the press, healthcare is coming along great. We are talking to many groups and it will end in a beautiful picture! | 2017-03-09 |

Lexrank Analysis

We now have a dataframe that contains a column of tweets. This column will be the subject of the rest of the analysis. With the data in this format, we only need to call the bind_lexrank function to apply the LexRank algorithm to the tweets. The function adds a column of lexrank scores; the higher the score, the more representative the tweet is of the tweets we downloaded.

Note: typically one would parse documents into sentences before applying LexRank (see ?unnest_sentences); however, we will treat each tweet as a single sentence for this analysis.
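
For comparison, the more typical sentence-level workflow mentioned in the note looks roughly like the sketch below. The toy documents and the `docs` and `sentence_scores` names are made up for illustration; the calls follow the pattern shown in the lexRankr documentation as I recall it, so treat the details as an assumption and check ?unnest_sentences.

```r
library(lexRankr)
library(dplyr)

# Toy documents (illustrative only), one row per document
docs <- tibble(
  doc_id = 1:3,
  text = c("Testing the system. Second sentence for you.",
           "System testing the tidy documents df.",
           "Documents will be parsed and lexranked.")
)

sentence_scores <- docs %>%
  # split each document into one row per sentence, stored in `sents`
  unnest_sentences(sents, text) %>%
  # score each sentence against all the others
  bind_lexrank(sents, doc_id, level = "sentences") %>%
  arrange(desc(lexrank))
```

In the tweet analysis this splitting step is skipped, so a tweet containing several sentences is scored as one unit.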

tweets_df %>% 
  bind_lexrank(text, id, level="sentences") %>% 
  arrange(desc(lexrank)) %>% 
  head(n=5) %>% 
  select(text, lexrank) %>% 
  knitr::kable(caption = "Most Representative @realDonaldTrump Tweets")

Table 1: Most Representative @realDonaldTrump Tweets

| text | lexrank |
|:---|---:|
| MAKE AMERICA GREAT AGAIN! | 0.0078031 |
| Well, the New Year begins. We will, together, MAKE AMERICA GREAT AGAIN! | 0.0077104 |
| HAPPY PRESIDENTS DAY - MAKE AMERICA GREAT AGAIN! | 0.0068083 |
| Because the ban was lifted by a judge, many very bad and dangerous people may be pouring into our country. A terrible decision | 0.0058402 |
| SEE YOU IN COURT, THE SECURITY OF OUR NATION IS AT STAKE! | 0.0056818 |
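
For intuition about why a short, often-echoed tweet ends up on top, here is a base R sketch of the general LexRank idea: build a similarity graph between texts, then run PageRank over it. It is a simplification under stated assumptions and not lexRankr’s implementation: the tokenizer is naive, the idf smoothing, the 0.2 threshold, the 0.85 damping factor, and the uniform handling of unconnected tweets are choices made for this example, and lexRankr applies preprocessing that is skipped here. The four lowercase strings are toy inputs.

```r
# Toy "tweets", each treated as one sentence
tweets <- c(
  "make america great again",
  "well the new year begins we will together make america great again",
  "happy presidents day make america great again",
  "see you in court"
)

# Term-frequency matrix: rows are tweets, columns are vocabulary terms
tokens <- strsplit(tolower(tweets), "[^a-z]+")
vocab  <- sort(unique(unlist(tokens)))
tf <- t(vapply(tokens,
               function(tok) as.numeric(table(factor(tok, levels = vocab))),
               numeric(length(vocab))))

# idf-weighted cosine similarity between every pair of tweets
idf   <- 1 + log(length(tweets) / colSums(tf > 0))  # smoothed idf (assumption)
w     <- sweep(tf, 2, idf, `*`)
norms <- sqrt(rowSums(w^2))
sim   <- (w %*% t(w)) / outer(norms, norms)

# Keep only edges above the threshold and drop self-links
adj <- (sim > 0.2) * 1
diag(adj) <- 0
# Tweets with no neighbours jump uniformly to any tweet
adj[rowSums(adj) == 0, ] <- 1
P <- adj / rowSums(adj)  # row-stochastic transition matrix

# PageRank by power iteration with damping
n <- nrow(P)
damping <- 0.85
score <- rep(1 / n, n)
for (i in seq_len(500)) {
  new_score <- (1 - damping) / n + damping * as.vector(t(P) %*% score)
  converged <- max(abs(new_score - score)) < 1e-12
  score <- new_score
  if (converged) break
}

ranked <- data.frame(text = tweets, score = score)
ranked <- ranked[order(-ranked$score), ]
```

In this toy graph the first tweet is linked to both of the longer tweets that repeat its words, so it collects the most score, while the unrelated fourth tweet only receives the baseline share.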

Repeating tweetRank analysis for other users

With our get_timeline_df function we can easily repeat this analysis for other users. Below we repeat the whole analysis in a single magrittr pipeline.

get_timeline_df("dog_rates", 600, sig) %>% 
  mutate(text = str_replace_all(text, "\n", " ")) %>% 
  bind_lexrank(text, id, level="sentences") %>% 
  arrange(desc(lexrank)) %>% 
  head(n=5) %>% 
  select(text, lexrank) %>% 
  knitr::kable(caption = "Most Representative @dog_rates Tweets")

Table 2: Most Representative @dog_rates Tweets

| text | lexrank |
|:---|---:|
| Please keep loving | 0.0106845 |
| Here we h*ckin go | 0.0093844 |
| There’s still time | 0.0093457 |
| @GoodDogsGame …h*ck | 0.0076048 |
| @darth THAT’S WHAT I SAID WHEN THEY LET ME MAKE A BOOK | 0.0073849 |
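
If you plan to run this for several accounts, the scoring steps can be wrapped in a small function. This is a sketch: `top_lexrank_tweets` and the `toy_tweets` data frame are hypothetical, and the function assumes its input has `id` and `text` columns like the dataframe used in the pipelines above.

```r
library(lexRankr)
library(dplyr)
library(stringr)

# Hypothetical wrapper around the scoring steps used in this post
top_lexrank_tweets <- function(tweets_df, n = 5) {
  tweets_df %>%
    mutate(text = str_replace_all(text, "\n", " ")) %>%
    bind_lexrank(text, id, level = "sentences") %>%
    arrange(desc(lexrank)) %>%
    head(n = n) %>%
    select(text, lexrank)
}

# Toy stand-in for a downloaded timeline (illustrative only)
toy_tweets <- data.frame(
  id   = 1:4,
  text = c("Testing the system.",
           "Second sentence for you.",
           "System testing the tidy documents df.",
           "Documents will be parsed and lexranked."),
  stringsAsFactors = FALSE
)
toy_top <- top_lexrank_tweets(toy_tweets, n = 2)

# With real data, assuming get_timeline_df and sig from earlier:
# handles <- c("realDonaldTrump", "dog_rates")
# results <- lapply(handles, function(h) {
#   top_lexrank_tweets(get_timeline_df(h, 600, sig))
# })
```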


