Clean Data - Data Science Strategies for Tackling Dirty Data by Megan Squire

Is much of your time spent doing tedious tasks such as cleaning dirty data, accounting for lost data, and preparing data to be used by others? If so, then having the right tools makes a critical difference, and will be a great investment as you grow your data science expertise. The book starts by highlighting the importance of data cleaning in data science, and will show you how to reap rewards from reforming your cleaning process. Next, you will cement your knowledge of the basic concepts that the rest of the book relies on: file formats, data types, and character encodings. You will also learn how to extract and clean data stored in RDBMS, web files, and PDF documents, through practical examples. At the end of the book, you will be given a chance to tackle a couple of real-world projects. Megan Squire is a professor of computing sciences at Elon University. She has been collecting and cleaning dirty data for two decades. She is also the leader of FLOSSmole.org, a research project to collect data and analyze it in order to learn how free, libre, and open source software is made.

Paperback

First published May 1, 2015


About the author

Megan Squire

7 books

Ratings & Reviews


Community Reviews

5 stars: 1 (9%)
4 stars: 2 (18%)
3 stars: 6 (54%)
2 stars: 1 (9%)
1 star: 1 (9%)
Displaying 1 - 5 of 5 reviews
Daniel Noventa
322 reviews, 1 follower
May 4, 2016
Decent enough as an overview of cleaning data. It lacks a lot of practical information and doesn't focus on a single language, which would have been nice for consistency. I would have preferred a quicker pass over topics like using Excel and text search, but for a complete beginner that material may be useful.
Claudio Moreno
2 reviews
December 22, 2018
This book is okay. I was looking for a book that came with a framework to help me think about cleaning and preparing data in a consistent and repeatable way. Instead I got the beginnings of a framework and some observations and tips that are bread and butter for anyone who has worked with data for even a short time. Having said that, this book did give me the basics for making my own framework and some insights into how to make the initial treatment of data more generic.
600 reviews, 11 followers
July 12, 2015
The introduction in Clean Data is well done and explains the need to work according to scientific rules. Only when one rigorously documents every action performed on the data can one reproduce the outcome later on. The process described by Megan Squire sounds well thought through and should help with the more complex work described in the book.

The next chapters are all about extracting data and transforming the input into the required output. That part is on a low technical level and talks about character encodings and file formats. These are all things you need to work with data, but the high-level introduction in chapter one is now abandoned until you reach the big projects in chapters 9 and 10.

The big projects were unfortunately not the strong parts of the book. Where all the parts should have come together, there is instead another mix of the high-level data science process with the low-level inner workings of Twitter and StackOverflow. It’s hard to keep the levels apart when there is so much going on. If Twitter and StackOverflow had been introduced in a chapter of their own, the last two chapters could have focused on the data cleaning part. As it is now, you may need to do that work yourself, which harms the idea of the book.

However, this book is the best I have read from Packt Publishing in a long time. If you are interested in working with data from different sources, you should still buy the book. The technical part is done very well and will help you a lot.

Jose Manuel
241 reviews, 4 followers
June 12, 2015
In the first chapter, the author establishes data cleaning, done before any analytics step, as the most important process to keep in mind when presenting results of adequate quality.

The rest of the chapters are devoted to dismantling what she said in the first.
Insubstantial, with few contributions of value, and without going into depth on anything. A couple of data scraping projects, a handful of date formatting exercises, and off you go.

This is certainly not a book you should read if you have been working with databases for a while, because you have surely had to do far more complex things.
Richard Pavlovsky
84 reviews, 2 followers
February 4, 2019
A decent book, some good strategies. Learned a few tools and approaches. Not revolutionary for me.