Python worker queues

A couple of the jobs I’ve done at work recently have involved a reasonable amount of data collection. This post gives a rough outline of how you might go about collecting a lot of data with Python without waiting forever.

guterman thread 2 by lisa s | dressform.

The Digital Podge site that I worked on at Line required photos from Flickr, and some other bits and pieces from LinkedIn.

The problem with collecting a lot of data is that code spends a lot of time waiting for input and output. The clever programmer, however, uses threads (or something similar). A thread can be thought of as a small sub-program, and a set of threads appears to all run at the same time, meaning that 20 or so threads can all be waiting for IO at once.
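To make the waiting point concrete, here is a minimal sketch, assuming Python 3. The names `fake_io` and `run_threads` are made up for this illustration, and `time.sleep` stands in for a real network request. Twenty waits of 0.2 seconds each would take about four seconds one after another, but the threads overlap their waiting.

```python
import threading
import time


def fake_io(delay):
    # time.sleep stands in for a blocking network call (an assumption
    # for this sketch; a real job would fetch a URL here)
    time.sleep(delay)


def run_threads(count, delay):
    # Start `count` threads that each wait `delay` seconds, then
    # return how long the whole batch took
    threads = [
        threading.Thread(target=fake_io, args=(delay,))
        for _ in range(count)
    ]
    start = time.time()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.time() - start


elapsed = run_threads(20, 0.2)
print("20 waits of 0.2s took %.2fs" % elapsed)
```

The total should come out far closer to a single wait than to the sum of all twenty.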

There is a problem with threading though, at least in Python: there is a practical limit on how many threads can be started at once, so one thread per job does not scale. Instead, the very clever programmer creates a work queue and a fixed set of worker threads. Each thread grabs the next item from the front of the queue, and once every item on the queue has been processed the program is finished.

The following Python code sets up the worker threads and the work queue, and is what I used as a basis for the data collection on the Digital Podge project.

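try:
    import queue  # Python 3
except ImportError:
    import Queue as queue  # Python 2
import threading
import traceback


def do_work(*args):
    # Do something with args, e.g. fetch a URL and store the response
    pass


def work_source():
    # Placeholder: yield one tuple of arguments per job, e.g. (url,)
    return []


number_of_workers = 20
work_queue = queue.Queue()


def worker():
    # Each worker loops forever, pulling the next job off the shared queue
    while True:
        item = work_queue.get()
        try:
            do_work(*item)
        except Exception:
            # Report the failure but keep this worker alive
            traceback.print_exc()
        finally:
            # Always mark the job done, otherwise join() below never returns
            work_queue.task_done()


# Daemon threads do not keep the program alive once the main thread exits
for __ in range(number_of_workers):
    t = threading.Thread(target=worker)
    t.daemon = True
    t.start()

# Queue up the work; idle workers start on it straight away
for item in work_source():
    work_queue.put(item)

# Block until every queued item has been marked done
work_queue.join()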
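For data collection you usually want the results back as well. One way to do that, sketched here under the assumption of Python 3 (where the module is named `queue`), is to give the workers a second queue to write to. The `fetch` and `collect` functions are hypothetical names for this example, and `fetch` just doubles its input in place of a real request.

```python
import queue
import threading


def fetch(item):
    # Hypothetical stand-in for a real request to Flickr or similar
    return item * 2


def collect(items, number_of_workers=4):
    work_queue = queue.Queue()
    results = queue.Queue()

    def worker():
        while True:
            item = work_queue.get()
            try:
                results.put((item, fetch(item)))
            except Exception as error:
                # Record the failure against the item instead of losing it
                results.put((item, error))
            finally:
                work_queue.task_done()

    for _ in range(number_of_workers):
        threading.Thread(target=worker, daemon=True).start()

    for item in items:
        work_queue.put(item)

    # Every result is on the results queue once join() returns, because
    # task_done() is only called after the result has been put
    work_queue.join()

    collected = {}
    while not results.empty():
        item, value = results.get()
        collected[item] = value
    return collected


print(collect(range(5)))
```

Results arrive in whatever order the workers finish, which is why each one is stored alongside the item that produced it.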

Posted on Friday 5th February, 2010.

The short URL for this post is: http://sneeu.com/s/pBD