Amelia McNamara
August 16, 2016
We’ll go through some of Scott, Karthik, and Garrett’s useR tutorial. I’ll flip through the API stuff, and we’ll focus on scraping.
We're switching over to their tutorial now. See it here: useR tutorial.
This is the data I mentioned from Kaylin Walker's analysis. (I got the URL right this time; notice it starts with raw.)
library(RCurl)
library(readr)
webData <- getURL("https://raw.githubusercontent.com/walkerkq/musiclyrics/master/billboard_lyrics_1964-2015.csv")
lyrics <- read_csv(webData)
dim(lyrics)
## [1] 5100 6
library(dplyr)
library(stringr)
beatles <- lyrics %>%
  filter(str_detect(Artist, "beatles"))
dim(beatles)
## [1] 18 6
(How much smaller do you think love is than lyrics? How much smaller is it really?)
love <- lyrics %>%
  filter(str_detect(Lyrics, "lov"))
dim(love)
## [1] 3032 6
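A quick follow-up (my own one-liner, not from the tutorial): the two row counts answer the question directly.

nrow(love) / nrow(lyrics)
# 3032 / 5100 = about 0.59, so "love" keeps nearly 60% of the songs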
David Robinson wrote this great blog post about Trump’s tweets. It’s also a great walkthrough of some text analysis! We’re going to try it on our own data.
library(tidytext)
lyricwords <- lyrics %>%
  unnest_tokens(word, Lyrics, token = "words") %>%
  filter(!word %in% stop_words$word, str_detect(word, "[a-z]"))
lyricwords %>%
  select(Song, Artist, word)
## # A tibble: 579,910 x 3
## Song Artist word
## <chr> <chr> <chr>
## 1 wooly bully sam the sham and the pharaohs sam
## 2 wooly bully sam the sham and the pharaohs sham
## 3 wooly bully sam the sham and the pharaohs miscellaneous
## 4 wooly bully sam the sham and the pharaohs wooly
## 5 wooly bully sam the sham and the pharaohs bully
## 6 wooly bully sam the sham and the pharaohs wooly
## 7 wooly bully sam the sham and the pharaohs bully
## 8 wooly bully sam the sham and the pharaohs sam
## 9 wooly bully sam the sham and the pharaohs sham
## 10 wooly bully sam the sham and the pharaohs pharaohs
## # ... with 579,900 more rows
library(ggplot2)
wordcounts <- lyricwords %>%
  group_by(word) %>%
  summarize(uses = n())
wordcounts <- wordcounts %>%
  arrange(desc(uses)) %>%
  slice(1:20)
wordcounts %>%
  ggplot() + geom_bar(aes(x=reorder(word, uses), y=uses), stat = "identity")
# ggplot() + geom_bar(aes(x=word, y=uses), stat = "identity")
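With twenty words along the x axis the labels get crowded; one option (just a sketch on top of the plot above, not something we ran) is to flip the coordinates so the words run down the side:

wordcounts %>%
  ggplot() +
  geom_bar(aes(x=reorder(word, uses), y=uses), stat = "identity") +
  coord_flip()   # horizontal bars make the word labels readable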
data(stop_words)
head(stop_words)
## # A tibble: 6 x 2
## word lexicon
## <chr> <chr>
## 1 a SMART
## 2 a's SMART
## 3 able SMART
## 4 about SMART
## 5 above SMART
## 6 according SMART
morewords <- data.frame(word = c("im", "aint", "dont"), lexicon = "MM")
lyricwords <- lyricwords %>%
  filter(!word %in% morewords$word)
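Another way to handle this (a sketch; all_stops is just a name I made up) is to append the extra words to the stop word list itself, so a single filter covers both:

all_stops <- bind_rows(stop_words, morewords)   # stack the built-in and custom lists
lyricwords %>%
  filter(!word %in% all_stops$word)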
What about the most popular words by decade?
decadelyrics <- lyricwords %>%
  mutate(decade = (Year %/% 10) * 10)
wordcounts <- decadelyrics %>%
  group_by(word, decade) %>%
  summarize(uses = n())
popular <- wordcounts %>%
  group_by(decade) %>%
  slice(which.max(uses))
wordcounts %>%
  arrange(decade, desc(uses))
## Source: local data frame [69,679 x 3]
## Groups: word [41,335]
##
## word decade uses
## <chr> <dbl> <int>
## 1 love 1960 1176
## 2 baby 1960 783
## 3 yeah 1960 387
## 4 youre 1960 357
## 5 girl 1960 322
## 6 time 1960 305
## 7 ill 1960 274
## 8 gonna 1960 256
## 9 hey 1960 215
## 10 day 1960 208
## # ... with 69,669 more rows
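The popular object we built above (the single most-used word in each decade) never got printed. Here is a quick look, plus a small labeled bar chart; this plot is just an illustrative sketch, not part of the workshop code.

popular   # one row per decade: the word with the most uses

ggplot(popular) +
  geom_bar(aes(x=factor(decade), y=uses), stat = "identity") +
  geom_text(aes(x=factor(decade), y=uses, label=word), vjust = -0.5)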
nrc <- sentiments %>%
  filter(lexicon == "nrc") %>%
  dplyr::select(word, sentiment)
years <- lyricwords %>%
  group_by(Year, Song, word) %>%
  mutate(total_words = n()) %>%
  ungroup() %>%
  distinct(Year, word, total_words)
by_source_sentiment <- years %>%
  inner_join(nrc, by = "word") %>%
  group_by(Year, sentiment) %>%
  summarize(total = sum(total_words)) %>%
  group_by(Year) %>%
  slice(which.max(total))
by_source_sentiment %>%
  arrange(desc(total))
## Source: local data frame [51 x 3]
## Groups: Year [51]
##
## Year sentiment total
## <int> <chr> <int>
## 1 2007 positive 1506
## 2 2003 positive 1483
## 3 2002 positive 1429
## 4 2006 positive 1418
## 5 2009 positive 1408
## 6 2010 positive 1354
## 7 2001 positive 1349
## 8 1991 positive 1280
## 9 2004 negative 1236
## 10 2015 positive 1226
## # ... with 41 more rows
# Doesn't make sense anymore
# by_source_sentiment <- by_source_sentiment %>%
# mutate(binom = if_else(sentiment =="positive",1,0))
p1 <- ggplot(by_source_sentiment, aes(x=Year, y=sentiment)) + geom_point()
p1
# p2 <- ggplot(by_source_sentiment, aes(x=Year, y=binom)) + geom_point() + geom_smooth(aes(y=binom), method="glm", method.args = list(family = "binomial"), se=FALSE, fullrange=TRUE)+xlim(1960, 2040)
# p2
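Since the logistic-regression version no longer applies, one possible replacement (a sketch using the years and nrc objects from above; positive_share is my own name) is to track what share of the sentiment-matched words each year are tagged "positive":

positive_share <- years %>%
  inner_join(nrc, by = "word") %>%
  group_by(Year) %>%
  summarize(prop_positive = sum(total_words[sentiment == "positive"]) / sum(total_words))
ggplot(positive_share) + geom_line(aes(x=Year, y=prop_positive))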
lyrics <- lyrics %>%
  mutate(lyrchar = str_length(Lyrics))
lettery <- lyrics %>%
  group_by(Year) %>%
  summarize(songlength = mean(lyrchar, na.rm=TRUE))
ggplot(lettery) + geom_line(aes(x=Year, y=songlength)) + ylab("Number of letters in lyrics")
wordy <- lyrics %>%
  unnest_tokens(word, Lyrics, token = "words") %>%
  group_by(Song, Year) %>%
  summarize(length = n()) %>%
  group_by(Year) %>%
  summarize(songlength = mean(length, na.rm=TRUE))
ggplot(wordy) + geom_line(aes(x=Year, y=songlength)) + ylab("Number of words in lyrics")
repetitive <- lyricwords %>%
  group_by(Artist, Song, word) %>%
  summarize(n = n()) %>%
  arrange(desc(n))
repetitive %>% select(word, n, Song, Artist)
## Source: local data frame [256,320 x 4]
## Groups: Artist, Song [4,671]
##
## word n Song
## <chr> <int> <chr>
## 1 dit 180 december 1963 oh what a night
## 2 thoia 156 thoia thoing
## 3 da 150 be my lover
## 4 bum 148 disturbia
## 5 la 141 la la la
## 6 la 140 nothin
## 7 shake 138 shake it off
## 8 bay 136 a bay bay
## 9 na 132 la la la
## 10 na 132 gettin jiggy wit it
## # ... with 256,310 more rows, and 1 more variables: Artist <chr>
uniques <- lyricwords %>%
  filter(!str_detect(Artist, "featuring")) %>%
  filter(word != "instrumental") %>%
  group_by(Song, Artist) %>%
  summarize(n = length(unique(word))) %>%
  arrange(desc(n))
uniques
## Source: local data frame [4,161 x 3]
## Groups: Song [3,929]
##
## Song Artist n
## <chr> <chr> <int>
## 1 one more chance the notorious big 237
## 2 i got 5 on it luniz 233
## 3 they want efx das efx 232
## 4 deja vu uptown baby lord tariq and peter gunz 229
## 5 oochie wally nas and bravehearts 227
## 6 american pie don mclean 219
## 7 ghetto cowboy mo thugs 219
## 8 sing for the moment eminem 219
## 9 hypnotize the notorious big 218
## 10 my band d12 216
## # ... with 4,151 more rows
Jordan gave us already-counted words from Project Gutenberg books!
Either: go to a URL like http://www.science.smith.edu/~jcrouser/data/burton-arabian-363.txt and change the .txt to .csv to download the data.
arabian <- read_csv("burton-arabian-363.csv")
Or
webData <- getURL("http://www.science.smith.edu/~jcrouser/data/alice.csv")
alice <- read_csv(webData)
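Either way, a quick peek (just a sanity check; I'm not assuming anything about which columns the file has) confirms the data came through:

dim(alice)
head(alice)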