Web Scraping with Python: Collecting Data from the Modern Web

Learn web scraping and crawling techniques to access unlimited data from any web source in any format. With this practical guide, you’ll learn how to use Python scripts and web APIs to gather and process data from thousands—or even millions—of web pages at once.

Ideal for programmers, security professionals, and web administrators familiar with Python, this book not only teaches basic web scraping mechanics, but also delves into more advanced topics, such as analyzing raw data or using scrapers for frontend website testing. Code samples are available to help you understand the concepts in practice.

Learn how to parse complicated HTML pages
Traverse multiple pages and sites
Get a general overview of APIs and how they work
Learn several methods for storing the data you scrape
Download, read, and extract data from documents
Use tools and techniques to clean badly formatted data
Read and write natural languages
Crawl through forms and logins
Understand how to scrape JavaScript
Learn image processing and text recognition

378 pages, Kindle Edition

First published April 25, 2015

236 people are currently reading
854 people want to read

About the author

Ryan Mitchell

71 books, 11 followers

Ratings & Reviews

Community Reviews

5 stars: 155 (38%)
4 stars: 174 (43%)
3 stars: 68 (16%)
2 stars: 3 (<1%)
1 star: 4 (<1%)
Displaying 1 - 30 of 42 reviews
37 reviews
November 16, 2016
Since the semester started, I have been reading internet scraping and network security books.
All the books use the example of two arbitrary people, Alice and Bob, exchanging information. And these examples have been getting better and funnier and weirder.
Somehow, I don't know why, but maybe because I love reading books or I love fiction, my mind has been looking for patterns in these books between Bob and Alice.
My conclusion is that these two are government spies who are deep undercover and are trying to get important information out without cluing in their marks.
Also, while typing that, I lmao'd like a hundred times because I'm saying such BS.

But this book was brilliant. The information was spot-on and wasn't repetitive. It was very helpful, one of the most helpful books around.
Cliff Chew
121 reviews, 10 followers
May 27, 2016
If you ever want to collect large amounts of data off the Internet through web scraping, please read this book. If you have done some web scraping, this book provides extremely useful nuggets of information to further enhance your web scraping capabilities. Faced some anti-scraping blocking practices? This book has a great section on how to make your scraper look more "human"!

To balance things out, the author even included a section on the ethics of web scraping, which is something that every web scraper should understand!

I rarely give 5 stars, but this book really took it all the way there. Truly a beautiful soup book!
Sebastian
191 reviews, 9 followers
March 15, 2020
This is a great text spanning most of the tools, methods and philosophies underpinning web scraping.

Its main problem is a lack of identity: is it teaching web scraping to those with one or two simple tasks, looking to just dip their toe in, or those looking to build production-quality web scrapers for large-scale tasks? As such it jumps to and fro in the tools it suggests. The start of the book seems lightweight and much of it is replaced by recommendations later in the text. This could be made much clearer from the start.

Having said that, Mitchell's textbook is fairly thorough on the topic, and rewards those who persevere through the start with the more nuanced sides of web scraping (multithreading/processing, solving captchas, finding APIs).
Mikhail
66 reviews, 12 followers
November 14, 2020
So, I was getting heavily armadildoed by the Text Mining course; luckily, Mitchell's pangolin saved me.

The book is written very nicely and covers all the imaginable subfields of scraping.
Joshua Hruzik
17 reviews, 6 followers
March 14, 2017
The book gives a good general introduction to BeautifulSoup (which is used for web scraping). However, the focus is too heavily skewed towards less important topics. I would have loved to get more details on BeautifulSoup functions and not about data import to CSV etc., since most readers would already have some experience with these sorts of tasks.
Nickolai
889 reviews, 8 followers
February 10, 2022
I decided to read this book after watching R. Mitchell's mini-course on scraping on LinkedIn Learning. The video course is simply excellent but rather short, so I wanted to deepen my knowledge. Overall, the book was rather disappointing. The first four chapters were good, and then things went downhill. There are two main problems, but they affect almost all of the subsequent chapters. First, the author dwells on topics only tangentially related to scraping, such as nltk and processing PDF and DOC files. Second, many genuinely interesting and necessary topics are covered only in broad strokes, leaving the reader to study them further on their own. In the end, I found no more than a hundred lines of code in the whole book useful to me. The rest I will have to pick up somewhere else.
Giacomo Debidda
29 reviews
October 19, 2020
Good introductory book on web scraping, but needs an update.

This book does a really good job describing the main techniques and strategies for web crawling and web scraping. Unfortunately, most of the technologies and libraries used in this book are quite outdated today, so if you want to follow the exercises you will need to use different libraries (which might not necessarily be a bad thing).
4 reviews, 1 follower
May 17, 2016
A solid overview of web scraping with Python. Python is currently the most widely used language for web scraping, and this book gives an overview of how to do it. There are minor errors throughout the text, but the author stated she will fix them in the next edition. If you want a book to read through on scraping rather than exercising your Google search skills, this is the book to get.
Hadiana Sliwa
67 reviews, 8 followers
March 14, 2021
Great book for those who want to learn about scraping data from the internet and the ethics behind it.

The book is not entirely for beginners, but a bit of background in Python or programming in general is enough.
Leonardo
Author of 1 book, 79 followers
November 12, 2018
Excellent book, thorough and well explained. I think it can be a good introduction to scraping for anyone with a little knowledge of Python. I was surprised that the topics it covers were almost exactly the ones I had run into on my own while trying to solve the problems that came up when searching for information on the internet. It would have been a great help to start here, although maybe I wouldn't have understood anything if I had.

The book's own website and the GitHub repository with the code are a great help.

It is a bit of self-promotion for other O'Reilly books, but they really do seem worthwhile. I am left wanting to read more about scraping and big data, and to get into some machine learning (maybe even deep learning someday). It also encourages me to read a book on VBA straight through at some point. I also still need more NLP and a better understanding of MySQL. In any case, it was a good overview.

I think I was missing concrete ideas about where to apply what I learned. But I think the next time I face a real problem I will be on firmer footing, with more ideas to start from.
15 reviews, 7 followers
May 31, 2019
This book contains wisdom and methods that have been refined by the author after having to webscrape for what might be years. The starting few chapters of the book, while introducing new things, can often feel like a cookbook, which the author finds is a concise way to write code to minimise the work. While those snippets of code can be a boon for some, for me, they took away the creativity of coding. But I will go back to see them once I have had years of experience in scraping to realise what value they hold.

The second half of the book deals with topics I had never imagined could be in a web scraping book. They are amazing and open your mind to how far you can go to obtain the data you want. I think this book would have been perfect if there were code exercises to solve after all relevant chapters.
Lord Farquaard
8 reviews
September 16, 2022
Genuinely useful book that can still teach basic HTML webscraping, the underlying healthy practices and serve as an introduction to more advanced topics. So it's still worth picking up. However, since its release it's become annoyingly outdated.

PhantomJS was discontinued in 2017, so Selenium (covered and used with PhantomJS in this book) no longer supports it, and to download it one must jump through a few more hoops. Personally I just caved and switched to the headless Chrome driver, which seems to have emerged since the release of this book.

The syntax for Selenium has also changed, so the examples involving it won't work without modification - which defeats the purpose of learning it from this book - because by the time you've learnt the correct syntax for Selenium you wouldn't need the text anyway.
227 reviews, 1 follower
February 12, 2020
A useful book that describes how to extract data from websites efficiently.
For data extraction, the author focuses mainly on the Python libraries Beautiful Soup and Selenium.
Along the way we learn about regular expressions and ways of connecting to, writing to, and otherwise working with a MySQL database.
At the end, the legality of data extraction is described in a very interesting way.
I will be using the book as an aid in my own projects.

Recommended.
Mikhail
332 reviews, 5 followers
July 29, 2021
Format: Book. Language: English.
I read this book as part of expanding my Python skills. Seen through that lens, the book is probably not very relevant as a tutorial, but for broadening one's horizons on the tools and technologies of website scraping it is quite interesting. I don't regret the time spent on it, though I am not yet sure whether the new knowledge will be useful to me in practice.
Worth rereading: possibly, if I need to refresh my knowledge or recall details of some of the libraries mentioned in the book.
Ethan Swan
65 reviews
March 17, 2021
For someone with Python skills but a limited understanding of and skill in web scraping, this is a fantastic book. It covers the basics of a huge range of techniques (HTML wrangling, web APIs, headless browsers, testing) and also comes with some thoughtful discussions, such as the ethics of web scraping. Highly recommend.
aaron
35 reviews, 6 followers
January 10, 2025
Good introduction, or rather good along-the-way reading. It seems like the author would share much more, but maybe in a YouTube video or Substack post where freedom of speech is more applicable. Here they have to remind us too often what good boys and girls we should be while scraping.
I was reading it alongside Web Scraping by Chapagain.
73 reviews, 43 followers
March 22, 2017
A nice introduction to the basics of scraping. Reading this before your first scraping project will probably save you a lot of time and frustration--it's basically a compendium of the basics plus everything you wouldn't know how to search Stack Overflow for. It covers the basics (just grabbing simple HTML and parsing with BeautifulSoup) and touches on more advanced topics (using a headless browser like PhantomJS to parse modern, AJAX-y pages).

If you're more experienced, I'd recommend flipping through it quickly to see if you spot anything you didn't already know. It filled in a few gaps for me.
600 reviews, 11 followers
March 5, 2023
Probably the best book on web scraping currently available. It not only covers how to handle HTML, but also binary formats like PDF and Word. There are many cautions on how not to shoot yourself in the foot with an automated script, and they will help you a lot.
Mohamed Nijadi
7 reviews, 1 follower
September 27, 2024
I read it while doing a project and it really gave me new perspectives and insights that helped me tweak my scrapers as I read on.
I recommend this book, especially if you have a bit of knowledge about the tools you are using but have never done any medium to large projects.
Akash Nidhi P S
41 reviews, 3 followers
January 30, 2018
A decent intro to web scraping; it gives a high-level overview of the web scraping world.
Ferhat Culfaz
268 reviews, 18 followers
February 5, 2018
Good introduction to web scraping giving you all the tools and relevant libraries you need depending on your application.
Ed Terrell
498 reviews, 26 followers
April 26, 2018
Well written, hands-on analysis of how the web works and how to extract information from it, even when it appears in multiple sites and multiple forms. Very insightful!
Hasan Basri AKIRMAK
27 reviews, 9 followers
June 8, 2018
Practical guide

Practical guide on scraping tools and libraries for text and image data processing, as well as dos and don'ts for a project.
Marcus Österberg
Author of 9 books, 15 followers
October 5, 2018
Good book, but it is a little annoying that the final n-grams code example doesn't work (I also checked the book's code on GitHub, without success).
ana silva
6 reviews, 1 follower
March 20, 2019
A really good introduction to web scraping with Python; this book saved me a lot of time writing my first scraping project. (Also, loved the War and Peace references.)
2 reviews
May 9, 2021
Clear explanation, not wordy, suitable for a beginner.