If you work in analytics or data science, like we do, you are familiar with the fact that data is being generated all the time at ever faster rates. (You may even be a little weary of people pontificating about this fact.) Analysts are often trained to handle tabular or rectangular data that is mostly numeric, but much of the data proliferating today is unstructured and typically text-heavy. Many of us who work in analytic fields are not trained in even simple interpretation of natural language.
We developed a new R package, tidytext (Silge and Robinson 2016), because we were familiar with many methods for data wrangling and visualization, but couldn’t easily apply these same methods to text. We found that using tidy data principles can make many text mining tasks easier, more effective, and consistent with tools already in wide use. By treating text as data frames of words, we can manipulate, summarize, and visualize the characteristics of text easily and integrate natural language processing into effective workflows we were already using.
The tools provided by the tidytext package are relatively simple; what is important is the possible applications. Thus, this book provides compelling examples of real text mining problems.
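As a rough illustration of what "treating text as data frames of words" can look like in practice, here is a minimal sketch. It assumes the dplyr and tidytext packages are installed; the three lines of toy text and the variable names are made up for this example and do not come from the book.

```r
library(dplyr)
library(tidytext)

# Toy input (invented for illustration): one row per line of text
text_df <- tibble(
  line = 1:3,
  text = c("The quick brown fox",
           "jumps over the lazy dog",
           "The dog barks")
)

# Reshape to one word per row; by default unnest_tokens()
# lowercases the words and strips punctuation
tidy_words <- text_df %>%
  unnest_tokens(word, text)

# Once the text is a data frame of words, ordinary dplyr verbs apply
word_counts <- tidy_words %>%
  count(word, sort = TRUE)

print(word_counts)
```

Because the result is an ordinary data frame, it can be passed on to other data-frame tools, such as ggplot2 for plotting, without any text-specific data structures.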
Really comprehensive book about text mining with R and tidy. While it is understood that some R and tidy knowledge is required to work through the examples in the book, at around the TF/IDF chapter I started to feel that I was spending more time searching Google to see what a specific R function was doing than grasping the theoretical concepts applied to the cases. That made me lose interest and want to find other references. But I finally found the time to finish it, and I have to say that, all in all, this is a good book for seeing how to handle basic-to-medium use cases of text mining with R. I believe it may become a reference book for me when trying to work with my own datasets.
Great code examples! Easy to emulate, shows the necessary data cleaning and preprocessing and gives good tips for what to do in other contexts. You'll need to be already familiar with R and the dplyr package to get anything out of this book, though.
If you don't know R or dplyr and want to jump straight into natural language processing, I'd instead recommend starting with the vignettes for the tm or quanteda packages.
It covers the basics (sentiment analysis, tf-idf, n-grams, topic modelling, and visualization) well, and the case study chapters are pretty helpful. The use of literature (Jane Austen's novels and more) as data also makes it more engaging to a literary-minded reader.
It's just that when the author says "slightly familiar with dplyr and ggplot2" in the preface, she means she is not going to explain any code relating to these two packages. Compared to the line-by-line annotated code in other online tutorials, this book may not be that accessible to a beginner.
On topic modelling, you may want to google how to determine the number of topics as more systematic approaches to such determination are not covered.
Disclaimer: I am not an expert on text mining, but I do have ~8 years of data science experience.
This was a very nice introduction to doing it in R, and the examples were very interesting too. In general, I recommend books by these authors.
My only complaint is that they did not go into details about how they scraped Twitter posts. The API is quite annoying and limited, so one might have to do some regular webscraping. Guess I should read a textbook on that next.
I enjoyed working through the book, but it is a bit dated at this point and some parts no longer work because of outdated packages. At times I had to go to the website and review what they had updated there. There are also times when the code is not set up for a user to actually execute it. For example, the code related to the Twitter files was a bit confusing. I had to go to the GitHub repository to actually download the data, and this should have been explained since the book really is a mixture of coding and commentary.
Although this book is no longer the most up-to-date book on text mining, I really like a lot of the plots (ggplots) in it, in particular for exploratory analysis. You can inspect the outputs at each stage, and visuals are a great way to make sense of the text data and communicate your findings. I will definitely reuse the plots in this book for further work.
The examples are interesting and very easy to follow. If you have any problem applying the techniques to your data set, just a quick search would lead you to the solutions!
Excellent coverage of taking a tidy approach to text analysis, with a generous number of worked examples. The one drawback is that much of the code used requires at least an intermediate-level working knowledge of R.
Good overview of the tidytext library in R. Not the end-all of text analyses, but a good place to begin. I now need to get something to do some analyses on...
Awesome book, with great step-by-step code to follow. The authors clearly explain analytical questions and walk through their analysis. It's so good I read it twice.