Adam Spannbauer

Programmer/Data Scientist/Instructor・Mostly write Python & R・Big fan of OpenCV & p5js


lexRankr & Twitter: find a user's most representative tweets

Published Mar 10, 2017

Packages Used


In this post we’ll get tweets from Twitter using the Twitter API and then analyze them with lexRankr to find a user’s most representative tweets. lexRankr is an R implementation of the LexRank algorithm, which can be used for extractive text summarization.

If you don’t care about interacting with the Twitter API, you can jump to the LexRank analysis.

Get user tweets

Before we can analyze tweets, we’ll need some tweets to analyze. We’ll be using Twitter’s API, and you’ll need to set up an account to get the keys it requires. The credentials needed for the API are: consumer key, consumer secret, token, and token secret. Once we have the keys, we’ll set them as environment variables to use for the rest of the code in this post. Below is how to set up your credentials to use the Twitter API in this vignette.

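## set api tokens/keys/secrets as environment vars
# Sys.setenv(cons_key     = 'my_cons_key')
# Sys.setenv(cons_secret  = 'my_cons_sec')
# Sys.setenv(token        = 'my_token')
# Sys.setenv(token_secret = 'my_token_sec')

#sign oauth (credentials are read from the environment vars set above)
auth <- httr::oauth_app("twitter",
                        key    = Sys.getenv("cons_key"),
                        secret = Sys.getenv("cons_secret"))
sig  <- httr::sign_oauth1.0(auth,
                            token        = Sys.getenv("token"),
                            token_secret = Sys.getenv("token_secret"))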
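Since everything downstream depends on those four environment variables, it can help to check that they are all set before signing requests. The snippet below is a small sketch of such a check; the helper name missing_env_vars is made up for illustration and is not part of httr or lexRankr.

```r
# Names of the environment variables used in this post
required_vars <- c("cons_key", "cons_secret", "token", "token_secret")

# Hypothetical helper: return the names of any variables that are unset or empty
missing_env_vars <- function(var_names) {
  values <- Sys.getenv(var_names, unset = "")
  var_names[values == ""]
}

missing <- missing_env_vars(required_vars)
if (length(missing) > 0) {
  message("Missing credentials: ", paste(missing, collapse = ", "))
}
```

If the message lists any names, set them with Sys.setenv() before building auth and sig.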

Now that we have our credentials set up, we’ll use a custom function to get a user’s tweets. The function definition is a little lengthy to be in the middle of a blog post, but the code for the function get_timeline_df can be seen here. The function takes a user’s Twitter handle, the number of tweets to get from the API, and the credentials we just set up; it returns a data frame with the columns created_at, favorite_count, retweet_count, text.

We can now use our function to gather a user’s tweets. As an example, let’s use one of the most famous Twitter accounts as of late: ‘@realDonaldTrump’.
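To give a rough idea of the last step such a function has to perform, here is a sketch that turns a list of already-parsed tweet records into a data frame with those four columns. It is not the get_timeline_df code; the helper tweets_to_df and the example records are hypothetical, and the API request itself is left out.

```r
# Hypothetical helper: turn a list of parsed tweet records into a data frame.
tweets_to_df <- function(tweets) {
  # Pull one field out of every record, using the NA template when it is absent
  pluck_field <- function(field, template) {
    vapply(tweets, function(tw) {
      value <- tw[[field]]
      if (is.null(value)) template else value
    }, FUN.VALUE = template)
  }

  data.frame(
    created_at     = pluck_field("created_at", NA_character_),
    favorite_count = pluck_field("favorite_count", NA_real_),
    retweet_count  = pluck_field("retweet_count", NA_real_),
    text           = pluck_field("text", NA_character_),
    stringsAsFactors = FALSE
  )
}

# Made-up records standing in for a parsed API response
example_tweets <- list(
  list(created_at = "Tue Dec 20 20:27:57 +0000 2016",
       favorite_count = 10, retweet_count = 2, text = "first tweet"),
  list(created_at = "Tue Dec 20 13:09:18 +0000 2016",
       favorite_count = 5, retweet_count = 1, text = "second tweet")
)

tweets_to_df(example_tweets)
```

Missing fields become NA rather than raising an error, so one odd record does not break the whole timeline.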

tweets_df <- get_timeline_df("realDonaldTrump", 600, sig) %>% 
  #clean out newlines for display
  mutate(text = str_replace_all(text, "\n", " "))

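tweets_df %>% 
  #parse created_at and keep only the date part (as.POSIXct is an assumed reconstruction of the dropped line)
  mutate(Date = as.POSIXct(created_at,
                       format="%a %b %d %H:%M:%S +0000 %Y") %>% 
           str_sub(end=10)) %>% 
  arrange(desc(Date)) %>% 
  head(n=3) %>% 
  select(`Tweet Text`=text, Date) %>% 
  knitr::kable()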
| text | created_at |
|:-----|:-----------|
| Yes, it is true - Carlos Slim, the great businessman from Mexico, called me about getting together for a meeting. We met, HE IS A GREAT GUY! | Tue Dec 20 20:27:57 +0000 2016 |
| especially how to get people, even with an unlimited budget, out to vote in the vital swing states ( and more). They focused on wrong states | Tue Dec 20 13:09:18 +0000 2016 |
| Bill Clinton stated that I called him after the election. Wrong, he called me (with a very nice congratulations). He “doesn’t know much” … | Tue Dec 20 13:03:59 +0000 2016 |
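The created_at values come back as strings in Twitter’s timestamp format. As a small base R sketch, one of them can be parsed into a date-time and reduced to a plain date like this. The variable names are illustrative, and the %a and %b codes assume an English locale for the day and month abbreviations.

```r
# One created_at value from the table above
raw_stamp <- "Tue Dec 20 20:27:57 +0000 2016"

# %z reads the "+0000" UTC offset; tz = "UTC" keeps the result in UTC
parsed <- as.POSIXct(raw_stamp,
                     format = "%a %b %d %H:%M:%S %z %Y",
                     tz = "UTC")

# Keep only the calendar date as a string
tweet_date <- format(parsed, "%Y-%m-%d")
tweet_date
```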

Lexrank Analysis

We now have a data frame that contains a column of tweets. This column will be the subject of the rest of the analysis. With the data in this format, we only need to call the bind_lexrank function to apply the LexRank algorithm to the tweets. The function adds a column of LexRank scores. The higher the score, the more representative the tweet is of the tweets that we downloaded.

Note: typically one would parse documents into sentences before applying LexRank (see ?unnest_sentences); however, we will treat each tweet as a sentence for this analysis.
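For comparison, the more typical workflow of splitting documents into sentences first looks roughly like the sketch below. It follows the usage pattern shown in lexRankr’s documentation; the small docs data frame is made up for illustration, and magrittr is loaded only to supply the pipe.

```r
library(lexRankr)
library(magrittr)

# Made-up documents; the first one contains two sentences
docs <- data.frame(
  doc_id = 1:3,
  text = c("Testing the system. Second sentence for you.",
           "System testing the tidy documents df.",
           "Documents will be parsed and lexranked."),
  stringsAsFactors = FALSE
)

scored <- docs %>%
  # one row per sentence, stored in a new column called sents
  unnest_sentences(sents, text) %>%
  # add a lexrank column scoring each sentence
  bind_lexrank(sents, doc_id, level = "sentences")

scored
```

With tweets we skip the unnest_sentences step and pass the text column straight to bind_lexrank.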

tweets_df %>% 
  bind_lexrank(text, id, level="sentences") %>% 
  arrange(desc(lexrank)) %>% 
  head(n=5) %>% 
  select(text, lexrank) %>% 
  knitr::kable(caption = "Most Representative @realDonaldTrump Tweets")
| text | lexrank |
|:-----|--------:|
| Well, the New Year begins. We will, together, MAKE AMERICA GREAT AGAIN! | 0.0085258 |
| Happy Thanksgiving to everyone. We will, together, MAKE AMERICA GREAT AGAIN! | 0.0060486 |
| Hopefully, all supporters, and those who want to MAKE AMERICA GREAT AGAIN, will go to D.C. on January 20th. It will be a GREAT SHOW! | 0.0059713 |
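The top tweets share a lot of wording, which is what LexRank rewards: a text scores highly when it is similar to many other texts. The base R sketch below illustrates the idea in a simplified form. It is not the lexRankr implementation: it uses plain word-count cosine similarity instead of the idf-weighted version, drops self-links, and the function name simple_lexrank, its defaults, and the example sentences are all made up for illustration.

```r
# Simplified LexRank-style scoring (illustrative only)
simple_lexrank <- function(sentences, threshold = 0.2, damping = 0.85, n_iter = 100) {
  # Lowercase, strip everything but letters and spaces, split into words
  tokens <- strsplit(gsub("[^a-z ]", "", tolower(sentences)), " +")
  tokens <- lapply(tokens, function(tk) tk[tk != ""])
  vocab  <- unique(unlist(tokens))

  # Word-count matrix: one row per sentence, one column per word
  tf <- matrix(0, nrow = length(sentences), ncol = length(vocab),
               dimnames = list(NULL, vocab))
  for (i in seq_along(tokens)) {
    if (length(tokens[[i]]) > 0) {
      counts <- table(tokens[[i]])
      tf[i, names(counts)] <- as.numeric(counts)
    }
  }

  # Cosine similarity between every pair of sentences
  norms <- sqrt(rowSums(tf^2))
  sim <- (tf %*% t(tf)) / outer(norms, norms)
  sim[is.nan(sim)] <- 0
  diag(sim) <- 0

  # Keep only links at or above the threshold, then row-normalise
  adj <- (sim >= threshold) * 1
  n <- nrow(adj)
  row_tot <- rowSums(adj)
  trans <- adj / ifelse(row_tot == 0, 1, row_tot)
  trans[row_tot == 0, ] <- 1 / n  # unlinked sentences jump uniformly

  # PageRank-style power iteration
  score <- rep(1 / n, n)
  for (k in seq_len(n_iter)) {
    score <- (1 - damping) / n + damping * as.vector(t(trans) %*% score)
  }
  score
}

example <- c("We will make the city great again",
             "Together we will make the city great",
             "Cold weather is expected today",
             "Make the city great again")
scores <- simple_lexrank(example)
round(scores, 3)
```

The third sentence shares no words with the others, so it ends up with the lowest score, while the three overlapping sentences reinforce one another.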

Repeating tweetRank analysis for other users

With our get_timeline_df function we can easily repeat this analysis for other users. Below we repeat the whole analysis in a single magrittr pipeline.

get_timeline_df("dog_rates", 600, sig) %>% 
  mutate(text = str_replace_all(text, "\n", " ")) %>% 
  bind_lexrank(text, id, level="sentences") %>% 
  arrange(desc(lexrank)) %>% 
  head(n=5) %>% 
  select(text, lexrank) %>% 
  knitr::kable(caption = "Most Representative @dog_rates Tweets")
| text | lexrank |
|:-----|--------:|
| @Lin_Manuel good day good dog | 0.0167123 |
| Please keep loving | 0.0099864 |
| Here we h*ckin go | 0.0085708 |
| Last day to get anything from our Valentine’s Collection by Valentine’s Day! Shop: | 0.0077583 |
| Even if I tried (which I would never), I’d last like 17 seconds | 0.0073899 |