Michael Driscoll's Blog
August 15, 2016
PyDev of the Week: Harry Percival
This week we welcome Harry Percival (@hjwp) as our PyDev of the Week! Harry is the author of Test-Driven Development with Python. You can visit his website to learn more about the book and even read it for free! Harry is also a programmer at PythonAnywhere, which allows you to host, run and code Python in the cloud. Let’s take a few moments to learn more about our fellow Pythoneer!
Can you tell us a little about yourself (hobbies, education, etc):
Although my childhood started out on a promisingly nerdy path — programming in Basic on French Thomson TO-7s whose keys go “boop” when you press them — things went rather wrong and I ended up as a management consultant, which left me with an enduring fondness for spreadsheets and dislike for expressions like “leverage”, “utilize” (just say use!) and “going forwards”.
I did my CS degree by distance learning while living in Italy — because my wife is an opera singer, and she wanted to learn Italian. We lived by the beach. It was rough. Actually that’s the reason for my second job: I moonlight as captions operator for popupopera.co.uk. Give me half a chance and I’ll bend your ear about using reveal.js and websockets to stream opera captions to mobile devices. Also I just had a baby.
Why did you start using Python?
My university course, as well as teaching me to watch out for incompatibility issues between Netscape Navigator and Internet Explorer (oh yes, bang up to date, this was 5 years ago), also gave me a valuable grounding in PHP and Java, so you can imagine I was glad when I got my first contracting gig, and the hosting provider told me they recommended Python and Django. So I taught myself the basics of those one Christmas, and never looked back!
What other programming languages do you know and which is your favorite?
As a onetime spreadsheet hacker I’ve done my share of VBA, but I’m also a certified Lotus Notes professional, yes indeed! There’ll always be a place in my heart for LotusScript. You know I actually (in retrospect, exceedingly foolishly) created and released a Lotus Notes email that would automatically forward itself on opening, to people from the user’s address book — prior art to the Melissa virus, if I do say so myself. This was during an internship at IBM back in 1998, and I am *very* glad I never got in more trouble for that!
From my degree I remember Prolog with fondness and enduring curiosity (did you know there’s still a Prolog community out there? There’s a Prolog web framework!).
More recently I’ve dabbled in Clojure, which feels like it encourages a whole different way of thinking about coding. I definitely recommend it.
What projects are you working on now?
PythonAnywhere is the main thing, there’s rewriting my wife’s website with Wagtail CMS, and preparing the second edition of my book, with hints of passwordless auth and rest-api-ajax testing stuff. That’s got to be enough, surely!
Which Python libraries are your favorite (core or 3rd party)?
I am reminded of the time I spent days writing a text reflowing function to wrap text at 80 characters, only to discover textwrap.wrap. And that’s despite using textwrap.dedent on an almost daily basis at work. It somehow didn’t occur to me that a module called textwrap might contain, oh, I don’t know, a tool to wrap text. So, always look to the standard library first folks!
Other than that, I’ve got to give a shout out to @tartley’s rerun, for running your unit tests automatically on file change. Could do with an inotify integration, if anyone’s looking for a little project…
Where do you see Python going as a programming language?
It’s been great to see the momentum building to a tipping point for Python 3 over the last few years. It didn’t seem it would ever happen but now it really feels like we’re there.
One ominous thing on the horizon is the whole type hinting thing — I am sometimes scared that it’s going to make the language harder to understand for beginners, but when I take a step back, I think I can trust the Python community to do the right thing here. I imagine conventions and customs will evolve in such a way that type hints will stay fairly well hidden from beginners, and Python’s learning curve will stay smooth.
The other big thing I think will have interesting effects is Python’s increasing adoption as the default teaching language. Look at US and UK efforts to teach coding in high schools and elementary schools! We’re going to move from a world where Python was a bit of a wacky, niche, choice, to it being the standard language that everyone learned at school, and I wonder how that’ll trickle down through the culture of Python…
Is there anything else you’d like to say?
Being British, it’s hard for me to say words like “community” with a straight face, but, honestly, Python really taught me the meaning of the term. If you’re a Python programmer, don’t miss out on the chance to get more involved in the community — find a local meetup, join the mailing lists, and come to the conferences, you’ll love it.
I also have to express my love and admiration for DjangoGirls and all the other diversity efforts, pyladies, transcode, et al. One of the main downsides of the community, as it is, is that it’s so overwhelmingly male. Diversity efforts are great for all the usual reasons about increasing the talent pool and widening our perspective and ability to deal with problems, but, honestly, on a personal level, it’s just so nice to be able to get away from having to be so blokey all the time — we’re all so invested in being smart and being right and it makes us argumentative, and it’s just such a relief to sometimes be away from that and maybe have environments which encourage a different kind of interaction… emoji and all! And anyone that knows me knows I need all the help I can get in toning down my argumentative side.
August 8, 2016
PyDev of the Week: Ben Nuttall
This week we welcome Ben Nuttall (@ben_nuttall) as our PyDev of the Week. Ben is a Raspberry Pi Community Manager and the creator of GPIO Zero, which is a simple interface to the GPIO components on the Raspberry Pi. You should also check out his website (http://bennuttall.com/) to see what Ben is up to. Let’s take a few moments to get to know Ben better!
Can you tell us a little about yourself (hobbies, education, etc):
As well as programming I like whitewater kayaking, and a mix of other outdoor pursuits. Also I recently got into photography (and have lots to learn).
I studied Mathematics and Computing at university, and then worked in the software industry for a couple of years. Around this time I was getting more involved in my local tech community, attending user groups and conferences. I set up a community event for Raspberry Pi (called Raspberry Jam), and did some events with schools to help teachers use Raspberry Pi in their classrooms, and to engage kids in digital making.
Through community work like this I ended up getting hired by the Raspberry Pi Foundation to do development and outreach. We formed an Education team and started running teacher training programmes and other workshops. The Foundation’s getting bigger now as we try to reach more people around the world through our education programmes, and make more of an impact. I now work as Raspberry Pi Community Manager, supporting Jams and other community efforts including some great open source projects.
Why did you start using Python?
A friend showed me Python in 2011. He demonstrated some simple examples and I was impressed by the straightforward nature of the syntax. At the time I’d used a lot of PHP, Java and Matlab, so I was used to more verbose syntax. Python just seemed like a brilliant general purpose language. It soon became my language of choice for simple tasks, and through doing those, and working through pythonchallenge.com, I picked up a good all-round skill set.
When the Raspberry Pi came along it became much more relevant to me as I learned to do physical computing projects with the Pi’s GPIO pins. I learned a lot by thinking up project ideas for the learning resources we put together, and by building projects from specification for other members of the team.
What other programming languages do you know and which is your favorite?
I’ve a decent grasp of PHP, Ruby, JavaScript, Java and C#. I guess Ruby would be my language of choice if Python didn’t exist – it’s just as nice to use in many ways, just a different style. I quite like the look of Julia, but I have limited experience using it.
What projects are you working on now?
At the end of last year I started a new library for controlling GPIO components on Raspberry Pi. The previous de-facto library for doing that was very low-level, just allowing you to turn individual pins high and low, and to read the state of pins, and required a lot of boilerplate code to get started, which made it hard to teach with. I created GPIO Zero, named in reference to PyGame Zero (a zero boilerplate wrapper for PyGame). The project developed really quickly and we released v1.0 within a couple of months. It’s going really well and I’ve had some great feedback. I’ve had a lot of help from Dave Jones, who maintains the picamera library for Raspberry Pi.
We currently have two Raspberry Pis on the International Space Station, running Python programs written by school kids in the UK. It’s part of an outreach programme called Astro Pi, between Raspberry Pi and the UK Space Agency with British ESA Astronaut Tim Peake.
Which Python libraries are your favorite (core or 3rd party)?
I love the power of itertools, there are some very handy functions in there I use regularly. I really like Daniel Pope’s PyGame Zero – it’s achieved a lot by removing boilerplate and making it straightforward to make progress in creating a game. It also inspired me and others to create similar libraries, particularly for use in education.
Where do you see Python going as a programming language?
Python is pretty huge in education. It’s a popular teaching language as it’s easy to read and write, and extends into pretty much everywhere – games, desktop applications, websites, physical computing, robotics, science, maths and more. It’s interesting to think that if every kid has the opportunity to learn Python, this will have a big impact on the adoption of the language in future.
I wish we had a better editor for beginners. IDLE is not fit for purpose. There’s a great community project called Mu, which has been designed for use in education. It still needs a bit of work (it currently only works with micropython on the micro:bit device) but I’d like to see it continue. There are similar problems in the classroom that most developers aren’t aware of, like the difficulty of installing libraries. I recommend library maintainers read the article ‘Scratch is the new PowerPoint’ by UK teacher Laura Dixon.
Is there anything else you’d like to say?
I maintain a library called pyjokes – one line programmer jokes as a service. You can pip install it to access jokes on the command line, or to use the jokes in your own Python project. We also have a Twitter bot you can follow: @pyjokes_bot.
Thanks for doing the interview!
August 4, 2016
A Simple Intro to Web Scraping with Python
Web scraping is where a programmer will write an application to download web pages and parse out specific information from them. Usually when you are scraping data you will need to make your application navigate the website programmatically. In this article, we will learn how to download files from the internet and parse them if need be. We will also learn how to create a simple spider that we can use to crawl a website.
Tips for Scraping
There are a few tips that we need to go over before we start scraping.
Always check the website’s terms and conditions before you scrape it. The terms usually limit how often you can scrape or what you can scrape. Many sites also publish their rules for automated clients in a robots.txt file, which you can check programmatically (see the sketch after this list).
Because your script will run much faster than a human can browse, make sure you don’t hammer their website with lots of requests. This may even be covered in the terms and conditions of the website.
You can get into legal trouble if you overload a website with your requests or you attempt to use it in a way that violates the terms and conditions you agreed to.
Websites change all the time, so your scraper will break some day. Know this: You will have to maintain your scraper if you want it to keep working.
Unfortunately the data you get from websites can be a mess. As with any data parsing activity, you will need to clean it up to make it useful to you.
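As a minimal sketch of the robots.txt check mentioned above (the URL here is just this blog's standard robots.txt location; adjust it for whatever site you are scraping), Python's standard library can do the check for you:

import urllib.robotparser

# Point the parser at the site's robots.txt file
rp = urllib.robotparser.RobotFileParser()
rp.set_url('http://www.blog.pythonlibrary.org/robots.txt')
rp.read()

# can_fetch tells us whether a given user agent may fetch a given path
if rp.can_fetch('*', 'http://www.blog.pythonlibrary.org/'):
    print('Scraping the front page is allowed')
else:
    print('robots.txt disallows scraping that page')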
With that out of the way, let’s start scraping!
Preparing to Scrape
Before we can start scraping, we need to figure out what we want to do. We will be using my blog for this example. Our task will be to scrape the titles and links to the articles on the front page of this blog. You can use Python’s urllib.request module to download the HTML that we need to parse, or you can use the requests library. For this example, I’ll be using requests.
Most websites nowadays have pretty complex HTML. Fortunately most browsers provide tools that make figuring out where website elements are quite trivial. For example, if you open my blog in Chrome, you can right click on any of the article titles and click the Inspect menu option.
Once you’ve clicked that, you will see a sidebar appear that highlights the tag that contains the title.
The Mozilla Firefox browser has Developer Tools that you can enable on a per-page basis, including an Inspector you can use in much the same way as in Chrome. Regardless of which web browser you end up using, you will quickly see that the h1 tag is the one we need to look for. Now that we know what we want to parse, we can learn how to do so!
BeautifulSoup
One of the most popular HTML parsers for Python is called BeautifulSoup. It’s been around for quite some time and is known for being able to handle malformed HTML well. To install it for Python 3, all you need to do is the following:
pip install beautifulsoup4
If everything worked correctly, you should now have BeautifulSoup installed. When passing BeautifulSoup some HTML to parse, you can specify a tree builder. For this example we will use html.parser, because it is included with Python. If you’d like something faster, you can install lxml.
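For instance, switching tree builders is just a matter of changing the second argument to the constructor. Here is a quick sketch using a tiny sample document:

from bs4 import BeautifulSoup

html = '<h1><a href="/post">My Post</a></h1>'  # a tiny sample document

soup = BeautifulSoup(html, 'html.parser')  # the builder bundled with Python
# soup = BeautifulSoup(html, 'lxml')       # faster, but requires "pip install lxml"
print(soup.h1.a['href'])  # prints: /post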
Let’s actually take a look at some code to see how this all works:
import requests
from bs4 import BeautifulSoup

url = 'http://www.blog.pythonlibrary.org/'

def get_articles():
    """
    Get the articles from the front page of the blog
    """
    req = requests.get(url)
    html = req.text
    soup = BeautifulSoup(html, 'html.parser')
    pages = soup.findAll('h1')

    articles = {i.a['href']: i.text.strip()
                for i in pages if i.a}
    for article in articles:
        s = '{title}: {url}'.format(
            title=articles[article],
            url=article)
        print(s)

    return articles

if __name__ == '__main__':
    articles = get_articles()
Here we do our imports and set up the URL we are going to use. Then we create a function where the magic happens. We use the requests library to get the URL and then pull the HTML out as a string using the request object’s text property. Then we pass the HTML to BeautifulSoup which turns it into a nice object. After that, we ask BeautifulSoup to find all the instances of h1 and then use a dictionary comprehension to extract the title and URL. We then print that information to stdout and return the dictionary.
Let’s try to scrape another website. This time we will look at Twitter and use my blog’s account: mousevspython. We will try to scrape what I have tweeted recently. You will need to follow the same steps as before by right-clicking on a tweet and inspecting it to figure out what we need to do. In this case, we need to look for the ‘li’ tag and the js-stream-item class. Let’s take a look:
import requests
from bs4 import BeautifulSoup

url = 'https://twitter.com/mousevspython'
req = requests.get(url)
html = req.text
soup = BeautifulSoup(html, 'html.parser')
tweets = soup.findAll('li', 'js-stream-item')
for item in range(len(soup.find_all('p', 'TweetTextSize'))):
    tweet_text = tweets[item].get_text()
    print(tweet_text)
    dt = tweets[item].find('a', 'tweet-timestamp')
    # The human-readable date lives in the anchor's title attribute
    print('This was tweeted on ' + dt['title'])
As before, we use BeautifulSoup’s findAll command to grab all the instances that match our search criteria. Then we also look for the paragraph tag (i.e. ‘p’) and the TweetTextSize class and loop over the results. You will note that we used find_all here. Just so we’re clear, findAll is an alias of find_all, so they do the exact same thing. Anyway, we loop over those results and grab the tweet text and the tweet timestamp and print them out.
You would think that there might be an easier way to do this sort of thing and there is. Some websites provide a developer API that you can use to access their website’s data. Twitter has a nice one that requires a consumer key and a secret. We will actually be looking at how to use that API and a couple of others in a future article.
Let’s move on and learn how to write a spider!
Scrapy
Scrapy is a framework that you can use for crawling websites and extracting (i.e. scraping) data. It can also be used to extract data via a website’s API or as a general purpose web crawler. To install Scrapy, all you need is pip:
pip install scrapy
According to Scrapy’s documentation, you will also need lxml and OpenSSL installed. We are going to use Scrapy to do the same thing that we used BeautifulSoup for: scraping the title and link of the articles on my blog’s front page. To get started, all you need to do is open up a terminal and change directories to the one that you want to store your project in. Then run the following command:
scrapy startproject blog_scraper
This will create a directory named blog_scraper in the current directory which will contain the following items:
Another nested blog_scraper folder
scrapy.cfg
Inside of the second blog_scraper folder is where the good stuff is:
A spiders folder
__init__.py
items.py
pipelines.py
settings.py
We can go with the defaults for everything except items.py. So open up items.py in your favorite Python editor and add the following code:
import scrapy

class BlogScraperItem(scrapy.Item):
    title = scrapy.Field()
    link = scrapy.Field()
What we are doing here is creating a class that models what it is that we want to capture, which in this case is a series of titles and links. This is kind of like SQLAlchemy’s model system in which we would create a model of a database. In Scrapy, we create a model of the data we want to scrape.
Next we need to create a spider, so change directories into the spiders directory and create a Python file there. Let’s just call it blog.py. Put the following code inside of your newly created file:
from scrapy.spider import BaseSpider
from scrapy.selector import Selector

from ..items import BlogScraperItem

class MyBlogSpider(BaseSpider):
    name = 'mouse'
    start_urls = ['http://blog.pythonlibrary.org']

    def parse(self, response):
        selector = Selector(response)
        blog_titles = selector.xpath("//h1[@class='entry-title']")
        selections = []
        for data in blog_titles:
            selection = BlogScraperItem()
            selection['title'] = data.xpath("a/text()").extract()
            selection['link'] = data.xpath("a/@href").extract()
            selections.append(selection)

        return selections
Here we just import the BaseSpider class and a Selector class. We also import our BlogScraperItem class that we created earlier. Then we subclass BaseSpider and name our spider mouse since the name of my blog is The Mouse Vs the Python. We also give it a start URL. Note that this is a list which means that you could give this spider multiple start URLs. The most important piece is our parse function. It will take the responses it gets from the website and parse them.
Scrapy supports using CSS expressions or XPath for selecting certain parts of an HTML document. This basically tells Scrapy what it is that we want to scrape. XPath is a bit harder to read, but it’s also the more powerful of the two, so we’ll be using it for this example. To grab the titles, we can use Google Chrome’s Inspector tool to figure out that the titles are located inside an h1 tag with a class name of entry-title.
The selector returns a SelectorList instance that we can iterate over. This allows us to continue making XPath queries on each item in this special list, so we can extract the title text and the link. We also create a new instance of our BlogScraperItem and insert the title and link that we extracted into that new object. Finally we append our newly scraped data into a list which we return when we’re done.
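For comparison, the parse method could also be written with Scrapy's CSS selector syntax instead of XPath. A quick sketch (the imports and class definition stay the same as above):

def parse(self, response):
    selector = Selector(response)
    # CSS equivalents of the XPath queries above
    blog_titles = selector.css("h1.entry-title")
    selections = []
    for data in blog_titles:
        selection = BlogScraperItem()
        selection['title'] = data.css("a::text").extract()
        selection['link'] = data.css("a::attr(href)").extract()
        selections.append(selection)

    return selections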
To run this code, go back up to the top level folder which contained the nested blog_scraper folder and config file and run the following command:
scrapy crawl mouse
You will notice that we are telling Scrapy to crawl using the mouse spider that we created. This command will cause a lot of output to be printed to your screen. Fortunately, Scrapy supports exporting the data into various formats such as CSV, JSON and XML. Let’s export the data we scraped using the CSV format:
scrapy crawl mouse -o articles.csv -t csv
You will still see a lot of output generated to stdout, but the title and link will be saved to disk in a file called articles.csv.
Most crawlers are set up to follow links and crawl the entire website or a series of websites. The crawler in this example wasn’t created that way, but that would be a fun enhancement that you can add on your own.
Wrapping Up
Scraping data from the internet is challenging and fun. Python has many libraries that can make this chore quite easy. We learned about how we can use BeautifulSoup to scrape data from a blog and from Twitter. Then we learned about one of the most popular libraries for creating a web crawler / scraper in Python: Scrapy. We barely scratched the surface of what these libraries can do, so you are encouraged to spend some time reading their respective documentation for further details.
Related Reading
Idiot Inside – Get android app downloads count and rating from Google Play Store
The Scraping Hub – Data Extraction with Scrapy and Python 3
Dan Nguyen – Python 3 web-scraping examples with public data
First Web Scraper tutorial
Beginner’s guide to Web Scraping in Python (using BeautifulSoup)
Greg Reda – Web Scraping 101 with Python
Miguel Grinberg – Easy Web Scraping with Python
The Hitchhiker’s Guide to Python – HTML Scraping
Real Python – Web Scraping and Crawling With Scrapy and MongoDB
August 3, 2016
Python 3 Concurrency – The concurrent.futures Module
The concurrent.futures module was added in Python 3.2. According to the Python documentation it provides the developer with a high-level interface for asynchronously executing callables. Basically concurrent.futures is an abstraction layer on top of Python’s threading and multiprocessing modules that simplifies using them. However it should be noted that while the abstraction layer simplifies the usage of these modules, it also removes a lot of their flexibility, so if you need to do something custom, then this might not be the best module for you.
Concurrent.futures includes an abstract class called Executor. It cannot be used directly though, so you will need to use one of its two subclasses: ThreadPoolExecutor or ProcessPoolExecutor. As you’ve probably guessed, these two subclasses are mapped to Python’s threading and multiprocessing APIs respectively. Both of these subclasses will provide a pool that you can put threads or processes into.
The term future has a special meaning in computer science. It refers to a construct that can be used for synchronization when using concurrent programming techniques. The future is actually a way to describe the result of a process or thread before it has finished processing. I like to think of them as a pending result.
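To make the "pending result" idea concrete, here is a tiny sketch (the slow_square function is invented for illustration):

import time

from concurrent.futures import ThreadPoolExecutor

def slow_square(x):
    """Stand in for a long running computation"""
    time.sleep(1)
    return x * x

with ThreadPoolExecutor(max_workers=1) as executor:
    future = executor.submit(slow_square, 4)
    print(future.done())    # almost certainly False; the work isn't finished yet
    print(future.result())  # blocks until the pending result is ready: 16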
Creating a Pool
Creating a pool of workers is extremely easy when you’re using the concurrent.futures module. Let’s start out by rewriting our downloading code from my asyncio article so that it now uses the concurrent.futures module. Here’s my version:
import os
import urllib.request

from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import as_completed

def downloader(url):
    """
    Downloads the specified URL and saves it to disk
    """
    req = urllib.request.urlopen(url)
    filename = os.path.basename(url)
    ext = os.path.splitext(url)[1]
    if not ext:
        raise RuntimeError('URL does not contain an extension')

    with open(filename, 'wb') as file_handle:
        while True:
            chunk = req.read(1024)
            if not chunk:
                break
            file_handle.write(chunk)
    msg = 'Finished downloading {filename}'.format(filename=filename)
    return msg

def main(urls):
    """
    Create a thread pool and download specified urls
    """
    with ThreadPoolExecutor(max_workers=5) as executor:
        futures = [executor.submit(downloader, url) for url in urls]
        for future in as_completed(futures):
            print(future.result())

if __name__ == '__main__':
    urls = ["http://www.irs.gov/pub/irs-pdf/f1040.pdf",
            "http://www.irs.gov/pub/irs-pdf/f1040a...",
            "http://www.irs.gov/pub/irs-pdf/f1040e...",
            "http://www.irs.gov/pub/irs-pdf/f1040e...",
            "http://www.irs.gov/pub/irs-pdf/f1040s..."]
    main(urls)
First off we do the imports that we need. Then we create our downloader function. I went ahead and updated it slightly so it checks to see if the URL has an extension on the end of it. If it doesn’t, then we’ll raise a RuntimeError. Next we create a main function, which is where the thread pool gets instantiated. You can actually use Python’s with statement with the ThreadPoolExecutor and the ProcessPoolExecutor, which is pretty handy.
Anyway, we set our pool so that it has five workers. Then we use a list comprehension to create a group of futures (or jobs) and finally we call the as_completed function. This handy function is an iterator that yields the futures as they complete. When they complete, we print out the result, which is a string that was returned from our downloader function.
If the function we were using was very computationally intensive, then we could easily swap out ThreadPoolExecutor for ProcessPoolExecutor with only a one-line code change.
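As a sketch of that swap (reusing the downloader function defined above), only the constructor line changes:

from concurrent.futures import ProcessPoolExecutor
from concurrent.futures import as_completed

def main(urls):
    """
    Create a process pool and download the specified urls
    """
    # The only change from the thread pool version is this constructor
    with ProcessPoolExecutor(max_workers=5) as executor:
        futures = [executor.submit(downloader, url) for url in urls]
        for future in as_completed(futures):
            print(future.result())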
We can clean this code up a bit by using the executor’s map method. Let’s rewrite our pool code slightly to take advantage of this:
import os
import urllib.request

from concurrent.futures import ThreadPoolExecutor

def downloader(url):
    """
    Downloads the specified URL and saves it to disk
    """
    req = urllib.request.urlopen(url)
    filename = os.path.basename(url)
    ext = os.path.splitext(url)[1]
    if not ext:
        raise RuntimeError('URL does not contain an extension')

    with open(filename, 'wb') as file_handle:
        while True:
            chunk = req.read(1024)
            if not chunk:
                break
            file_handle.write(chunk)
    msg = 'Finished downloading {filename}'.format(filename=filename)
    return msg

def main(urls):
    """
    Create a thread pool and download specified urls
    """
    with ThreadPoolExecutor(max_workers=5) as executor:
        return executor.map(downloader, urls, timeout=60)

if __name__ == '__main__':
    urls = ["http://www.irs.gov/pub/irs-pdf/f1040.pdf",
            "http://www.irs.gov/pub/irs-pdf/f1040a...",
            "http://www.irs.gov/pub/irs-pdf/f1040e...",
            "http://www.irs.gov/pub/irs-pdf/f1040e...",
            "http://www.irs.gov/pub/irs-pdf/f1040s..."]

    results = main(urls)
    for result in results:
        print(result)
The primary difference here is in the main function, which has been reduced by two lines of code. The map method is just like Python’s built-in map in that it takes a function and an iterable and then calls the function for each item in the iterable. You can also add a timeout for each of your threads so that if one of them hangs, it will get stopped. Lastly, starting in Python 3.5, a chunksize argument was added, which can improve performance when you are using a Process pool with a very large iterable by submitting the work in batches. If you happen to be using the Thread pool, chunksize has no effect.
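Here is a quick hedged sketch of chunksize with a process pool (the doubling function and the numbers are just for illustration):

from concurrent.futures import ProcessPoolExecutor

def double(n):
    return n * 2

if __name__ == '__main__':
    numbers = list(range(100000))
    with ProcessPoolExecutor(max_workers=4) as executor:
        # Batching the work into chunks reduces inter-process overhead
        results = executor.map(double, numbers, chunksize=1000)
        print(sum(results))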
Deadlocks
One of the pitfalls of the concurrent.futures module is that you can accidentally create deadlocks when the callable associated with a Future is also waiting on the results of another Future. This sounds kind of confusing, so let’s look at an example:
from concurrent.futures import ThreadPoolExecutor

def wait_forever():
    """
    This function will wait forever if there's only one
    thread assigned to the pool
    """
    # Note: this relies on the global executor created below
    my_future = executor.submit(zip, [1, 2, 3], [4, 5, 6])
    result = my_future.result()
    print(result)

if __name__ == '__main__':
    executor = ThreadPoolExecutor(max_workers=1)
    executor.submit(wait_forever)
Here we import the ThreadPoolExecutor class and create an instance of it. Take note that we set its maximum number of workers to one thread. Then we submit our function, wait_forever. Inside of our function, we submit another job to the thread pool that is supposed to zip two lists together, get the result of that operation and print it out. However we’ve just created a deadlock! The pool’s only worker is busy running wait_forever, so the inner zip job can never be scheduled, and the call to result() blocks forever waiting for it.
Let’s rewrite the code a bit to make it work:
from concurrent.futures import ThreadPoolExecutor

def wait_forever():
    """
    This function will wait forever if there's only one
    thread assigned to the pool
    """
    my_future = executor.submit(zip, [1, 2, 3], [4, 5, 6])
    return my_future

if __name__ == '__main__':
    executor = ThreadPoolExecutor(max_workers=3)
    fut = executor.submit(wait_forever)
    result = fut.result()
    print(list(result.result()))
In this case, we just return the inner future from the function and then ask for the outer future’s result. That result is actually another future. If we call the result method on this nested future, we get a zip object back, so to find out what the actual result is, we wrap the zip object with Python’s list function and print it out.
Wrapping Up
Now you have another neat concurrency tool to use. You can easily create thread or process pools depending on your needs. Should you need to run a process that is network or I/O bound, you can use the thread pool class. If you have a computationally heavy task, then you’ll want to use the process pool class instead. Just be careful of calling futures incorrectly or you might get a deadlock.
Related Reading
Python 3 documentation on the concurrent.futures library
Python Adventures: concurrent.futures
Python: A quick introduction to the concurrent.futures module
Eli Bendersky: Python – parallelizing CPU-bound tasks with concurrent.futures
August 2, 2016
Python 201: A multiprocessing tutorial
The multiprocessing module was added to Python in version 2.6. It was originally defined in PEP 371 by Jesse Noller and Richard Oudkerk. The multiprocessing module allows you to spawn processes in much the same manner that you can spawn threads with the threading module. The idea here is that because you are now spawning processes, you can avoid the Global Interpreter Lock (GIL) and take full advantage of multiple processors on a machine.
The multiprocessing package also includes some APIs that are not in the threading module at all. For example, there is a neat Pool class that you can use to parallelize executing a function across multiple inputs. We will be looking at Pool in a later section. We will start with the multiprocessing module’s Process class.
Getting started with multiprocessing
The Process class is very similar to the threading module’s Thread class. Let’s try creating a series of processes that call the same function and see how that works:
import os

from multiprocessing import Process

def doubler(number):
    """
    A doubling function that can be used by a process
    """
    result = number * 2
    proc = os.getpid()
    print('{0} doubled to {1} by process id: {2}'.format(
        number, result, proc))

if __name__ == '__main__':
    numbers = [5, 10, 15, 20, 25]
    procs = []

    for index, number in enumerate(numbers):
        proc = Process(target=doubler, args=(number,))
        procs.append(proc)
        proc.start()

    for proc in procs:
        proc.join()
For this example, we import Process and create a doubler function. Inside the function, we double the number that was passed in. We also use Python’s os module to get the current process’s ID (or pid). This will tell us which process is calling the function. Then in the block of code at the bottom, we create a series of Processes and start them. The very last loop just calls the join() method on each process, which tells Python to wait for the process to terminate. If you need to stop a process, you can call its terminate() method.
When you run this code, you should see output that is similar to the following:
5 doubled to 10 by process id: 10468
10 doubled to 20 by process id: 10469
15 doubled to 30 by process id: 10470
20 doubled to 40 by process id: 10471
25 doubled to 50 by process id: 10472
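Speaking of the terminate() method mentioned above, here is a short sketch of stopping a process early (the endless worker function is invented for the example):

import time

from multiprocessing import Process

def worker():
    """Spin forever so we have something to stop"""
    while True:
        time.sleep(0.1)

if __name__ == '__main__':
    proc = Process(target=worker)
    proc.start()
    time.sleep(1)
    proc.terminate()        # forcibly stop the child process
    proc.join()             # wait for it to actually exit
    print(proc.is_alive())  # False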
Sometimes it’s nicer to have a more human readable name for your process though. Fortunately, the Process class does allow you to set the name of your process. Let’s take a look:
import os

from multiprocessing import Process, current_process

def doubler(number):
    """
    A doubling function that can be used by a process
    """
    result = number * 2
    proc_name = current_process().name
    print('{0} doubled to {1} by: {2}'.format(
        number, result, proc_name))

if __name__ == '__main__':
    numbers = [5, 10, 15, 20, 25]
    procs = []
    # This Process is created but never started; it still consumes the
    # "Process-1" name, which is why the output below starts at Process-2
    proc = Process(target=doubler, args=(5,))

    for index, number in enumerate(numbers):
        proc = Process(target=doubler, args=(number,))
        procs.append(proc)
        proc.start()

    proc = Process(target=doubler, name='Test', args=(2,))
    proc.start()
    procs.append(proc)

    for proc in procs:
        proc.join()
This time around, we import something extra: current_process. The current_process function is basically the same thing as the threading module’s current_thread. We use it to grab the name of the process that is calling our function. You will note that for the first five started processes, we don’t set a name. Then for the sixth, we set the process name to “Test”. Let’s see what we get for output:
5 doubled to 10 by: Process-2
10 doubled to 20 by: Process-3
15 doubled to 30 by: Process-4
20 doubled to 40 by: Process-5
25 doubled to 50 by: Process-6
2 doubled to 4 by: Test
The output demonstrates that the multiprocessing module assigns a number to each process as a part of its name by default. Of course, when we specify a name, a number isn’t going to get added to it.
Locks
The multiprocessing module supports locks in much the same way as the threading module does. All you need to do is import Lock, acquire it, do something and release it. Let’s take a look:
from multiprocessing import Process, Lock

def printer(item, lock):
    """
    Prints out the item that was passed in
    """
    lock.acquire()
    try:
        print(item)
    finally:
        lock.release()

if __name__ == '__main__':
    lock = Lock()
    items = ['tango', 'foxtrot', 10]
    for item in items:
        p = Process(target=printer, args=(item, lock))
        p.start()
Here we create a simple printing function that prints whatever you pass to it. To prevent the processes from interfering with each other, we use a Lock object. This code will loop over our list of three items and create a process for each of them. Each process will call our function and pass it one of the items from the iterable. Because we’re using locks, the next process in line will wait for the lock to be released before it can continue.
Logging
Logging processes is a little different than logging threads. The reason for this is that Python’s logging package doesn’t use process-shared locks, so it’s possible for you to end up with messages from different processes getting mixed up. Let’s try adding basic logging to the previous example. Here’s the code:
import logging
import multiprocessing

from multiprocessing import Process, Lock

def printer(item, lock):
    """
    Prints out the item that was passed in
    """
    lock.acquire()
    try:
        print(item)
    finally:
        lock.release()

if __name__ == '__main__':
    lock = Lock()
    items = ['tango', 'foxtrot', 10]
    multiprocessing.log_to_stderr()
    logger = multiprocessing.get_logger()
    logger.setLevel(logging.INFO)
    for item in items:
        p = Process(target=printer, args=(item, lock))
        p.start()
The simplest way to log is to send it all to stderr. We can do this by calling the log_to_stderr() function. Then we call the get_logger function to get access to a logger and set its logging level to INFO. The rest of the code is the same. I will note that I’m not calling the join() method here. Instead, the parent process (i.e. your script) will call join() implicitly when it exits.
When you do this, you should get output like the following:
[INFO/Process-1] child process calling self.run()
tango
[INFO/Process-1] process shutting down
[INFO/Process-1] process exiting with exitcode 0
[INFO/Process-2] child process calling self.run()
[INFO/MainProcess] process shutting down
foxtrot
[INFO/Process-2] process shutting down
[INFO/Process-3] child process calling self.run()
[INFO/Process-2] process exiting with exitcode 0
10
[INFO/MainProcess] calling join() for process Process-3
[INFO/Process-3] process shutting down
[INFO/Process-3] process exiting with exitcode 0
[INFO/MainProcess] calling join() for process Process-2
Now if you want to save the log to disk, then it gets a little trickier. You can read about that topic in Python’s logging Cookbook.
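One pattern from the Cookbook is to route every record through a queue to a single listener that owns the file handle. Here is a condensed sketch of that idea (the names and the log file name are mine, not the Cookbook's):

import logging
import logging.handlers
import multiprocessing

def worker(queue, item):
    """Log through a QueueHandler instead of touching the file directly"""
    logger = logging.getLogger('worker')
    logger.setLevel(logging.INFO)
    logger.addHandler(logging.handlers.QueueHandler(queue))
    logger.info('processing %s', item)

if __name__ == '__main__':
    queue = multiprocessing.Queue()
    # The listener is the only thing that ever writes to the log file
    listener = logging.handlers.QueueListener(
        queue, logging.FileHandler('multi.log'))
    listener.start()

    procs = [multiprocessing.Process(target=worker, args=(queue, i))
             for i in range(3)]
    for proc in procs:
        proc.start()
    for proc in procs:
        proc.join()

    listener.stop()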
The Pool Class
The Pool class is used to represent a pool of worker processes. It has methods which can allow you to offload tasks to the worker processes. Let’s look at a really simple example:
from multiprocessing import Pool

def doubler(number):
    return number * 2

if __name__ == '__main__':
    numbers = [5, 10, 20]
    pool = Pool(processes=3)
    print(pool.map(doubler, numbers))
Basically what’s happening here is that we create an instance of Pool and tell it to create three worker processes. Then we use the map method to map a function and an iterable to each process. Finally we print the result, which in this case is actually a list: [10, 20, 40].
You can also get the result of your process in a pool by using the apply_async method:
from multiprocessing import Pool

def doubler(number):
    return number * 2

if __name__ == '__main__':
    pool = Pool(processes=3)
    result = pool.apply_async(doubler, (25,))
    print(result.get(timeout=1))
What this allows us to do is actually ask for the result of the process. That is what the get function is all about. It tries to get our result. You will note that we also have a timeout set just in case something happened to the function we were calling. We don’t want it to block indefinitely after all.
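If the worker doesn't finish before the timeout, get raises multiprocessing.TimeoutError, which you can catch. A quick sketch with an artificially slow function:

import multiprocessing
import time

from multiprocessing import Pool

def slow_doubler(number):
    time.sleep(5)  # simulate work that takes too long
    return number * 2

if __name__ == '__main__':
    pool = Pool(processes=1)
    result = pool.apply_async(slow_doubler, (25,))
    try:
        print(result.get(timeout=1))
    except multiprocessing.TimeoutError:
        print('The worker took too long!')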
Process Communication
When it comes to communicating between processes, the multiprocessing module has two primary mechanisms: Queues and Pipes. The Queue implementation is actually both thread and process safe. Let’s take a look at a fairly simple example that’s based on the Queue code from one of my threading articles:
from multiprocessing import Process, Queue

sentinel = -1

def creator(data, q):
    """
    Creates data to be consumed and waits for the consumer
    to finish processing
    """
    print('Creating data and putting it on the queue')
    for item in data:
        q.put(item)

def my_consumer(q):
    """
    Consumes some data and works on it

    In this case, all it does is double the input
    """
    while True:
        data = q.get()
        # Check for the sentinel before processing so we don't double it
        if data == sentinel:
            break
        print('data found to be processed: {}'.format(data))
        processed = data * 2
        print(processed)

if __name__ == '__main__':
    q = Queue()
    data = [5, 10, 13, -1]
    process_one = Process(target=creator, args=(data, q))
    process_two = Process(target=my_consumer, args=(q,))
    process_one.start()
    process_two.start()

    q.close()
    q.join_thread()

    process_one.join()
    process_two.join()
Here we just need to import Queue and Process. Then we create two functions: one to create data and add it to the queue, and a second to consume the data and process it. Adding data to the Queue is done by using the Queue’s put() method, whereas getting data from the Queue is done via the get() method. The last chunk of code just creates the Queue object and a couple of Processes and then runs them. You will note that we call join() on our process objects rather than the Queue itself.
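Since Pipes were mentioned above but not demonstrated, here is a minimal sketch of one (a Pipe returns two connection objects, one for each end):

from multiprocessing import Process, Pipe

def sender(connection):
    """Send a few items down our end of the pipe"""
    for item in ['tango', 'foxtrot', 10]:
        connection.send(item)
    connection.close()

if __name__ == '__main__':
    parent_conn, child_conn = Pipe()
    proc = Process(target=sender, args=(child_conn,))
    proc.start()
    for _ in range(3):
        print(parent_conn.recv())  # prints each item as it arrives
    proc.join()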
Wrapping Up
We have covered a lot of material here. You have learned how to use the multiprocessing module to target regular functions, communicate between processes using Queues, name your processes and much more. There is also a lot more in the Python documentation that isn’t even touched on in this article, so be sure to dive into that as well. In the meantime, you now know how to utilize all of your computer’s processing power with Python!
Related Reading
The Python documentation on the multiprocessing module
Python Module of the Week: multiprocessing
Python Concurrency – Porting a Queue to multiprocessing
August 1, 2016
PyDev of the Week: Cory Benfield
This week we welcome Cory Benfield (@lukasaoz) as our PyDev of the Week! Cory is a core developer of the urllib3 and requests packages. He is also the lead maintainer of the Hyper Project, which is a set of related projects that provide HTTP/2 functionality to Python projects. Let’s spend some time getting to know Cory better!
Can you tell us a little about yourself (hobbies, education, etc):
I studied Physics at university in Scotland, originally with a bit of an eye towards ending up somewhere on the more practical end of professional physics: maybe doing something like medical physics. Before my degree, I’d never written a program in my life. However, I’d been a computer “power user” for a long time: I’d owned PCs and Macs and was comfortable in Linux as well as Windows and OS X, and I’ve been building PCs since I was 15. So when I found myself in a computational physics module staring at a Mathematica screen, something clicked in my brain and I realised that this was something I could be really happy doing. So I graduated with my masters degree in Physics and with a job offer to write software for the telecommunications industry.
When I’m not writing software, I like doing lots of different things to relax. I play video games (like most of the world now does), but I love watching great movies and TV shows. I also greatly enjoy cooking, reading, and spending time with people. Sometimes I even write: I’m not good at the discipline required to write regularly, but I think my writing is ok!
Why did you start using Python?
I’d just started using Mathematica for my Physics work in college, and while Mathematica is great it’s not really a useful tool for building utilities. So I glanced around for other languages, and it didn’t take long playing with C and Perl before Python seemed like the most obvious thing in the world to me. So I wrote some web scraping tools in Python, and that dragged me into the Python open source community. I’ve never looked back!
What other programming languages do you know and which is your favourite?
I’m the kind of person who likes to pick up and play with new languages for fun, so “know” gets a bit tricky here. I’d say I am most comfortable with Python, and then in roughly descending order of how idiomatic I am with the language: C, Go, Rust, Javascript, Swift, C#, Java, Objective C. I have dabbled a bit in the strongly functional languages, like Haskell, but wouldn’t say I “know” Haskell yet because I haven’t used it in anger yet. I also like Lisps, though again I haven’t written much Lisp code in anger: with that said, Hy is obviously my favourite lisp because it lets me use all the Python libraries I’ve already gotten used to!
What projects are you working on now?
Right now I’m spread pretty thin, but my major focus is HTTP and HTTP/2 in Python. That means that my time is spread roughly amongst maintaining several popular Python libraries: Requests and urllib3 being the most notable and the biggest drains on my time. Alongside those I maintain the only HTTP/2 libraries in Python, including hyper-h2, which is used as the basis for a number of HTTP/2 implementations in Python. Finally, I’m tangentially involved in a lot of other OSS Python projects, like the Python Cryptographic Authority and Twisted.
Which Python libraries are your favorite (core or 3rd party)?
Excluding my own, of course!
In the Python standard library I really like itertools: generators have a real appeal to me and itertools makes it possible to work with them in a really flexible way. Outside of the standard library, I think CFFI is awesome: being able to call into C regardless of how your VM is implemented is really powerful, and it removes one of the largest barriers to using PyPy, which I’d argue is one of the greatest assets of the Python community.
Where do you see Python going as a programming language?
It’s common in the Python community to worry about whether languages like Go are eating our lunch. I don’t buy into that rhetoric. Python is a great language: it’s clean, it’s easy to read, and when combined with tools like PyPy it’s more than fast enough for the majority of use-cases.
For now, I see Python consolidating, and building out tools that allow Python to integrate more tightly with other languages. Moving towards CFFI and away from writing explicit C extensions is a big part of this, as it allows Python programmers to combine the best of Python (e.g. PyPy) with the best of other languages (by calling into compiled DLLs without worrying about the specific Python implementation being used). Applications in the future will increasingly involve smaller components calling between each other, either via RPC mechanisms like HTTP APIs, or via direct function calls using things like CFFI. Python is well suited to be a part of that environment.
What is your take on the current market for Python programmers?
As far as I can tell, Python programmers are extremely employable. Python is a well established language that many companies would be willing to bet their businesses on, and that gives Python programmers a great deal of power and flexibility in the industry. Python skill is a great notch in the belt for anyone job hunting.
Is there anything else you’d like to say?
Only that if people are interested in getting involved in Open Source software, Python is almost certainly the language to do it in: there are lots of great projects floating around that would love more contribution from the community, and getting involved is a great way to give back.
Thanks for doing the interview!
July 28, 2016
Python 201: A Tutorial on Threads
The threading module was first introduced in Python 1.5.2 as an enhancement of the low-level thread module. The threading module makes working with threads much easier and allows the program to run multiple operations at once.
Note that the threads in Python work best with I/O operations, such as downloading resources from the Internet or reading files and directories on your computer. If you need to do something that will be CPU intensive, then you will want to look at Python’s multiprocessing module instead. The reason for this is that Python has the Global Interpreter Lock (GIL), which ensures that only one thread can execute Python bytecode at a time. Because of this, when you go to run multiple CPU intensive operations with threads, you may find that it actually runs slower. So we will be focusing on what threads do best: I/O operations!
Intro to Threads
A thread lets you run a piece of long running code as if it were a separate program. It’s kind of like calling subprocess, except that you are calling a function or class instead of a separate program. I always find it helpful to look at a concrete example. Let’s take a look at something that’s really simple:
import threading

def doubler(number):
    """
    A function that can be used by a thread
    """
    print(threading.currentThread().getName() + '\n')
    print(number * 2)
    print()

if __name__ == '__main__':
    for i in range(5):
        my_thread = threading.Thread(target=doubler, args=(i,))
        my_thread.start()
Here we import the threading module and create a regular function called doubler. Our function takes a value and doubles it. It also prints out the name of the thread that is calling the function and prints a blank line at the end. Then in the last block of code, we create five threads and start each one in turn. You will note that when we instantiate a thread, we set its target to our doubler function and we also pass an argument to the function. The reason the args parameter looks a bit odd is that args must be a sequence; our function only takes one argument, so we need the trailing comma to create a one-element tuple.
Note that if you’d like to wait for a thread to terminate, you would need to call its join() method; there is a sketch of that after the output below.
When you run this code, you should get the following output:
Thread-1
0
Thread-2
2
Thread-3
4
Thread-4
6
Thread-5
8
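As mentioned above, if you want the main thread to wait for all of the doubler threads to finish, you can collect them and join each one. A small sketch building on the example above:

import threading

# assumes the doubler function from the example above is defined
threads = []
for i in range(5):
    my_thread = threading.Thread(target=doubler, args=(i,))
    threads.append(my_thread)
    my_thread.start()

for my_thread in threads:
    my_thread.join()  # block until this thread has finished

print('All threads are done')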
Of course, you normally wouldn’t want to print your output to stdout. This can end up being a really jumbled mess when you do. Instead, you should use Python’s logging module. It’s thread-safe and does an excellent job. Let’s modify the example above to use the logging module and name our threads while we’re at it:
import logging
import threading

def get_logger():
    logger = logging.getLogger("threading_example")
    logger.setLevel(logging.DEBUG)

    fh = logging.FileHandler("threading.log")
    fmt = '%(asctime)s - %(threadName)s - %(levelname)s - %(message)s'
    formatter = logging.Formatter(fmt)
    fh.setFormatter(formatter)

    logger.addHandler(fh)
    return logger

def doubler(number, logger):
    """
    A function that can be used by a thread
    """
    logger.debug('doubler function executing')
    result = number * 2
    logger.debug('doubler function ended with: {}'.format(
        result))

if __name__ == '__main__':
    logger = get_logger()
    thread_names = ['Mike', 'George', 'Wanda', 'Dingbat', 'Nina']
    for i in range(5):
        my_thread = threading.Thread(
            target=doubler, name=thread_names[i], args=(i, logger))
        my_thread.start()
The big change in this code is the addition of the get_logger function. This piece of code will create a logger that’s set to the debug level. It will save the log to the current working directory (i.e. where the script is run from) and then we set up the format for each line logged. The format includes the time stamp, the thread name, the logging level and the message logged.
In the doubler function, we change our print statements to logging statements. You will note that we are passing the logger into the doubler function when we create the thread. The reason we do this is that if you instantiated the logging object in each thread, you would end up with multiple logging singletons and your log would have a lot of duplicate lines in it.
Lastly, we name our threads by creating a list of names and then setting each thread to a specific name using the name parameter. When you run this code, you should get a log file with the following contents:
2016-07-24 20:39:50,055 - Mike - DEBUG - doubler function executing
2016-07-24 20:39:50,055 - Mike - DEBUG - doubler function ended with: 0
2016-07-24 20:39:50,055 - George - DEBUG - doubler function executing
2016-07-24 20:39:50,056 - George - DEBUG - doubler function ended with: 2
2016-07-24 20:39:50,056 - Wanda - DEBUG - doubler function executing
2016-07-24 20:39:50,056 - Wanda - DEBUG - doubler function ended with: 4
2016-07-24 20:39:50,056 - Dingbat - DEBUG - doubler function executing
2016-07-24 20:39:50,057 - Dingbat - DEBUG - doubler function ended with: 6
2016-07-24 20:39:50,057 - Nina - DEBUG - doubler function executing
2016-07-24 20:39:50,057 - Nina - DEBUG - doubler function ended with: 8
That output is pretty self-explanatory, so let’s move on. I want to cover one more topic in this section. Namely, subclassing threading.Thread. Let’s take this last example and instead of calling Thread directly, we’ll create our own custom subclass. Here is the updated code:
import logging
import threading

class MyThread(threading.Thread):

    def __init__(self, number, logger):
        threading.Thread.__init__(self)
        self.number = number
        self.logger = logger

    def run(self):
        """
        Run the thread
        """
        self.logger.debug('Calling doubler')
        doubler(self.number, self.logger)

def get_logger():
    logger = logging.getLogger("threading_example")
    logger.setLevel(logging.DEBUG)

    fh = logging.FileHandler("threading_class.log")
    fmt = '%(asctime)s - %(threadName)s - %(levelname)s - %(message)s'
    formatter = logging.Formatter(fmt)
    fh.setFormatter(formatter)

    logger.addHandler(fh)
    return logger

def doubler(number, logger):
    """
    A function that can be used by a thread
    """
    logger.debug('doubler function executing')
    result = number * 2
    logger.debug('doubler function ended with: {}'.format(
        result))

if __name__ == '__main__':
    logger = get_logger()
    thread_names = ['Mike', 'George', 'Wanda', 'Dingbat', 'Nina']
    for i in range(5):
        thread = MyThread(i, logger)
        thread.setName(thread_names[i])
        thread.start()
In this example, we just subclassed threading.Thread. We pass in the number that we want to double and the logging object as before. But this time, we set the name of the thread differently by calling setName on the thread object. We still need to call start on each thread, but you will notice that we didn’t need to define that in our subclass. When you call start, it will run your thread by calling the run method. In our class, we call the doubler function to do our processing. The output is pretty much the same except that I added an extra line of output. Go ahead and run it to see what you get.
Locks and Synchronization
When you have more than one thread, then you may find yourself needing to consider how to avoid conflicts. What I mean by this is that you may have a use case where more than one thread will need to access the same resource at the same time. If you don’t think about these issues and plan accordingly, then you will end up with some issues that always happen at the worst of times and usually in production.
The solution is to use locks. A lock is provided by Python’s threading module and can be held by either a single thread or no thread at all. Should a thread try to acquire a lock on a resource that is already locked, that thread will basically pause until the lock is released. Let’s look at a fairly typical example of some code that doesn’t have any locking functionality but that should have it added:
import threading

total = 0

def update_total(amount):
    """
    Updates the total by the given amount
    """
    global total
    total += amount
    print(total)

if __name__ == '__main__':
    for i in range(10):
        my_thread = threading.Thread(
            target=update_total, args=(5,))
        my_thread.start()
What would make this an even more interesting example would be to add a time.sleep call that is of varying length. Regardless, the issue here is that one thread might call update_total and before it’s done updating it, another thread might call it and attempt to update it too. Depending on the order of operations, the value might only get added to once.
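Here is a hedged sketch of that idea: widening the window between the read and the write with a short sleep makes the race easy to see, and the final total will usually come out less than the expected 50:

import threading
import time

total = 0

def update_total(amount):
    """Read, pause, then write back: a classic race condition"""
    global total
    current = total   # read the shared value
    time.sleep(0.01)  # give other threads time to interleave
    total = current + amount  # write back a possibly stale value
    print(total)

if __name__ == '__main__':
    threads = []
    for i in range(10):
        my_thread = threading.Thread(target=update_total, args=(5,))
        threads.append(my_thread)
        my_thread.start()
    for my_thread in threads:
        my_thread.join()
    print('Final total: {}'.format(total))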
Let’s add a lock to the function. There are two ways to do this. The first way would be to use a try/finally as we want to ensure that the lock is always released. Here’s an example:
import threading

total = 0
lock = threading.Lock()

def update_total(amount):
    """
    Updates the total by the given amount
    """
    global total
    lock.acquire()
    try:
        total += amount
    finally:
        lock.release()
    print(total)

if __name__ == '__main__':
    for i in range(10):
        my_thread = threading.Thread(
            target=update_total, args=(5,))
        my_thread.start()
Here we just acquire the lock before we do anything else. Then we attempt to update the total and finally, we release the lock and print the current total. We can actually eliminate a lot of this boilerplate using Python’s with statement:
import threading

total = 0
lock = threading.Lock()

def update_total(amount):
    """
    Updates the total by the given amount
    """
    global total
    with lock:
        total += amount
    print(total)

if __name__ == '__main__':
    for i in range(10):
        my_thread = threading.Thread(
            target=update_total, args=(5,))
        my_thread.start()
As you can see, we no longer need the try/finally as the context manager that is provided by the with statement does all of that for us.
Of course you will also find yourself writing code where you need multiple threads accessing multiple functions. When you first start writing concurrent code, you might do something like this:
import threading

total = 0
lock = threading.Lock()

def do_something():
    lock.acquire()
    try:
        print('Lock acquired in the do_something function')
    finally:
        lock.release()
        print('Lock released in the do_something function')
    return "Done doing something"

def do_something_else():
    lock.acquire()
    try:
        print('Lock acquired in the do_something_else function')
    finally:
        lock.release()
        print('Lock released in the do_something_else function')
    return "Finished something else"

if __name__ == '__main__':
    result_one = do_something()
    result_two = do_something_else()
This works all right in this circumstance, but suppose you have multiple threads calling both of these functions. While one thread is working its way through the functions, another could be modifying the data too, and you’ll end up with incorrect results. The problem is that you might not even notice the results are wrong right away. What’s the solution? Let’s try to figure that out.
A common first thought would be to add a lock around the two function calls. Let’s try modifying the example above to look like the following:
import threading

total = 0
lock = threading.Lock()

def do_something():
    with lock:
        print('Lock acquired in the do_something function')
    print('Lock released in the do_something function')
    return "Done doing something"

def do_something_else():
    with lock:
        print('Lock acquired in the do_something_else function')
    print('Lock released in the do_something_else function')
    return "Finished something else"

def main():
    with lock:
        result_one = do_something()
        result_two = do_something_else()

    print(result_one)
    print(result_two)

if __name__ == '__main__':
    main()
When you actually go to run this code, you will find that it just hangs. The reason is that main acquires the lock and then calls do_something, which tries to acquire the very same lock. A plain Lock doesn’t track which thread holds it, so the second acquire blocks until the lock is released, and that will never happen because the thread that would release it is the one stuck waiting.
The real solution here is to use a re-entrant lock, which Python’s threading module provides via RLock. Just change the line lock = threading.Lock() to lock = threading.RLock() and try re-running the code. Your code should work now!
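If you’d like to see the difference in isolation, here’s a tiny sketch (my own, not from the example above) showing that the same thread may acquire an RLock more than once, whereas doing the same thing with a plain Lock would deadlock:

import threading

rlock = threading.RLock()

with rlock:
    # re-acquiring from the same thread is fine with an RLock;
    # a plain threading.Lock() would block forever right here
    with rlock:
        print('Acquired the RLock twice from the same thread')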
If you want to try the code above with actual threads, then we can replace the call to main with the following:
if __name__ == '__main__':
    for i in range(10):
        my_thread = threading.Thread(
            target=main)
        my_thread.start()
This will run the main function in each thread, which will in turn call the other two functions. You’ll end up with 10 sets of output too.
Timers
The threading module has a neat class called Timer that you can use to represent an action that should take place after a specified amount of time. A Timer actually spins up its own thread and is started with the same start() method that a regular thread uses. You can also stop a timer using its cancel method, and you can even cancel it before it has started.
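Here’s a quick sketch of both of those points (my own example, not from a real project):

from threading import Timer

t = Timer(10, print, ['You should never see this message'])
t.start()
t.cancel()  # stops the timer before its ten seconds elapse

t2 = Timer(10, print, ['Nor this one'])
t2.cancel()  # canceling before start() is allowed too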
The other day I ran into a use case where I needed to communicate with a subprocess I had started, but I needed it to time out. While there are lots of different approaches to this particular problem, my favorite solution uses the threading module’s Timer class.
For this example, we will look at using the ping command. In Linux, the ping command will run until you kill it. So the Timer class becomes especially handy in Linux-land. Here’s an example:
import subprocess
from threading import Timer

kill = lambda process: process.kill()

cmd = ['ping', 'www.google.com']
ping = subprocess.Popen(
    cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)

my_timer = Timer(5, kill, [ping])

try:
    my_timer.start()
    stdout, stderr = ping.communicate()
finally:
    my_timer.cancel()

print(str(stdout))
Here we just set up a lambda that we can use to kill the process. Then we start our ping job and create a Timer object. You will note that the first argument is the time in seconds to wait, then the function to call and the argument to pass to the function. In this case, our function is a lambda and we pass it a list of arguments where the list happens to only have one element. If you run this code, it should run for about 5 seconds and then print out the results of the ping.
Other Thread Components
The threading module includes support for other items too. For example, you can create a Semaphore which is one of the oldest synchronization primitives in computer science. Basically, a Semaphore manages an internal counter which will be decremented whenever you call acquire on it and incremented when you call release. The counter is designed in such a way that it cannot go below zero. So if you happen to call acquire when it’s zero, then it will block.
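Here’s a minimal sketch (names invented for the example) of a Semaphore initialized to two, which lets at most two threads into the critical section at a time:

import threading
import time

semaphore = threading.Semaphore(2)  # allow at most two holders at once

def worker(number):
    with semaphore:  # acquire() on entry, release() on exit
        print('Worker {} is in the critical section'.format(number))
        time.sleep(1)

if __name__ == '__main__':
    for i in range(5):
        threading.Thread(target=worker, args=(i,)).start()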
Another useful tool that’s included is the Event. It will allow you to communicate between threads using signals. We will be looking at an example that uses an Event in the next section.
Finally, in Python 3.2, the Barrier object was added. A Barrier is a primitive that manages a fixed number of threads that need to wait for each other. To pass the barrier, each thread calls the wait() method, which blocks until all of the threads have made the call; then the barrier releases all of them simultaneously.
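And here’s a small sketch of a Barrier in action (again, the names are just for illustration):

import threading

barrier = threading.Barrier(3)  # release only when three threads are waiting

def racer(name):
    print('{} is at the starting line'.format(name))
    barrier.wait()  # blocks until all three threads have arrived
    print('{} is off!'.format(name))

if __name__ == '__main__':
    for name in ['thread-1', 'thread-2', 'thread-3']:
        threading.Thread(target=racer, args=(name,)).start()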
Thread Communication
There are some use cases where you will want to have your threads communicate with each other. As we mentioned earlier, you can create an Event for this purpose, but a more common method is to use a Queue. For our example, we’ll actually use both! Let’s see what that looks like:
import threading

from queue import Queue

def creator(data, q):
    """
    Creates data to be consumed and waits for the consumer
    to finish processing
    """
    print('Creating data and putting it on the queue')
    for item in data:
        evt = threading.Event()
        q.put((item, evt))
        print('Waiting for data to be doubled')
        evt.wait()

def my_consumer(q):
    """
    Consumes some data and works on it

    In this case, all it does is double the input
    """
    while True:
        data, evt = q.get()
        print('data found to be processed: {}'.format(data))
        processed = data * 2
        print(processed)
        evt.set()
        q.task_done()

if __name__ == '__main__':
    q = Queue()
    data = [5, 10, 13, -1]
    thread_one = threading.Thread(target=creator, args=(data, q))
    thread_two = threading.Thread(target=my_consumer, args=(q,))
    # the consumer loops forever, so mark it as a daemon thread
    # to let the program exit once the queue has been drained
    thread_two.daemon = True
    thread_one.start()
    thread_two.start()

    q.join()
Let’s break this down a bit. First off, we have a creator (AKA a producer) function that we use to create data that we want to work on (or consume). Then we have another function that we use for processing the data that we are calling my_consumer. The creator function will use the Queue’s put method to put the data into the Queue and the consumer will continually check for more data and process it when it becomes available. The Queue handles all the acquires and releases of the locks so you don’t have to.
In this example, we create a list of values that we want to double. Then we create two threads, one for the creator / producer and one for the consumer. You will note that we pass a Queue object to each thread, which is the magic behind how the locks get handled. The queue will have the first thread feed data to the second. When the first puts some data into the queue, it also passes in an Event and then waits for the event to be set. Then in the consumer, the data is processed and, when it’s done, the consumer calls the set method of the Event, which tells the first thread that the second is done processing and that it can continue.
The very last line of code calls the Queue object’s join method, which tells the Queue to wait until every item it handed out has been marked done. The first thread ends when it runs out of items to put into the Queue; the consumer loops forever, which is why we made it a daemon thread so the program can exit once join returns.
Wrapping Up
We covered a lot of material here. You have learned the following:
The basics of threading
How locking works
What Events are and how they can be used
How to use a Timer
Inter-Thread Communication using Queues / Events
Now that you know how threads are used and what they are good for, I hope you will find many good uses for them in your own code.
Related Reading
Python documentation on the threading module
Eli Bendersky – Python threads: communication and stopping
July 27, 2016
Python: Visualization with Bokeh
The Bokeh package is an interactive visualization library that uses web browsers for its presentation. Its goal is to provide graphics in the vein of D3.js that look elegant and are easy to construct. Bokeh supports large and streaming datasets. You will probably be using this library for creating plots / graphs. One of its primary competitors seems to be Plotly.
Note: This will not be an in-depth tutorial on the Bokeh library as the number of different graphs and visualizations it is capable of is quite large. Instead, the aim of the article is to give you a taste of what this interesting library can do.
Let’s take a moment and get it installed. The easiest way to do so is to use pip or conda. Here’s how you can use pip:
pip install bokeh
This will install Bokeh and all its dependencies. You may want to install Bokeh into a virtualenv because of this, but that’s up to you. Now let’s check out a simple example. Save the following code into a file with whatever name you deem appropriate.
from bokeh.plotting import figure, output_file, show
output_file("/path/to/test.html")
x = range(1, 6)
y = [10, 5, 7, 1, 6]
plot = figure(title='Line example', x_axis_label='x', y_axis_label='y')
plot.line(x, y, legend='Test', line_width=4)
show(plot)
Here we just import a few items from the Bokeh library and tell it where to save the output. You will note that the output is HTML. Then we create some values for the x and y axes so we can create the plot. Next we create the figure object and give it a title and labels for the two axes. Finally we plot the line, give it a legend and a line width, and show the plot. The show command will open your plot in your default browser; you should end up with a simple interactive line chart.
Bokeh also supports the Jupyter Notebook with the only change being that you will need to use output_notebook instead of output_file.
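In a notebook cell, the first example would look something like this sketch:

from bokeh.plotting import figure, output_notebook, show

output_notebook()  # render plots inline in the notebook instead of writing HTML

plot = figure(title='Line example', x_axis_label='x', y_axis_label='y')
plot.line([1, 2, 3, 4, 5], [10, 5, 7, 1, 6], legend='Test', line_width=4)
show(plot)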
The Bokeh quick start guide has a neat example of a series of sine waves on a grid plot. I reduced the example down a bit to just one sine wave. Note that you will need NumPy installed for the following example to work correctly:
import numpy as np
from bokeh.layouts import gridplot
from bokeh.plotting import figure, output_file, show
N = 100
x = np.linspace(0, 4*np.pi, N)
y0 = np.sin(x)
output_file('sinewave.html')
sine = figure(plot_width=500, plot_height=500, title='Sine')
sine.circle(x, y0, size=10, color="navy", alpha=0.5)
p = gridplot([[sine]], toolbar_location=None)
show(p)
The main difference between this example and the previous one is that we are using NumPy to generate the data points and we’re putting our figure inside of a gridplot instead of just drawing the figure itself. When you run this code, you should end up with a grid plot showing a single sine wave drawn with circle markers.
If you don’t like circles, then you’ll be happy to know that Bokeh supports other shapes, such as square, triangle and several others.
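For example, here’s a quick sketch that reuses the sine data from above but draws squares and triangles instead (the output filename is made up):

import numpy as np
from bokeh.plotting import figure, output_file, show

N = 100
x = np.linspace(0, 4*np.pi, N)
y0 = np.sin(x)

output_file('shapes.html')

shapes = figure(plot_width=500, plot_height=500, title='Shapes')
shapes.square(x, y0, size=10, color='firebrick', alpha=0.5)
shapes.triangle(x, y0 - 0.5, size=10, color='olive', alpha=0.5)

show(shapes)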
Wrapping Up
The Bokeh project is really interesting and provides a simple, easy-to-use API for creating graphs, plots and other visualizations of your data. The documentation is quite well put together and includes lots of examples that showcase what this package can do for you. It is well worth skimming the documentation to see what some of the other graphs look like and how short the code examples are that generate such nice results. My only gripe is that Bokeh doesn’t have a way to save an image file programmatically. This appears to be a long-standing issue that the developers have been trying to fix for a couple of years now. Hopefully they find a way to support that feature soon. Otherwise, I thought it was really cool!