Python Generators and Why You Should Use Them

Written by Dan Sackett on October 3, 2014

In memory- and compute-intensive programs, iterating through a large dataset can be quite slow and expensive.

Before I get started, let's get it straight that I'm talking about iteration in general. In Python, iterables include lists, dictionaries, and even strings: anything that can be looped through in a for loop. What makes these so slow, though?
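To make that concrete, here's a quick sketch of looping over each of those types (nothing here beyond the built-ins):

# Lists, dictionaries, and strings can all be traversed with a for loop.
for n in [1, 2, 3]:
    print n              # 1, 2, 3

for key in {'a': 1, 'b': 2}:
    print key            # iterating a dictionary yields its keys

for ch in "abc":
    print ch             # strings yield one character at a time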

Well, let's take a list for instance:

range(10000)

This will give us a list of integers from 0 to 9999. When we create it, all of those values are stored in memory at once. With a list of that size, and especially with much larger ones, this is not ideal, and it's the reason we run into performance problems when iterating.

If you didn't know, we can save ourselves some trouble by using a generator. A generator is actually an iterator, but unlike a list or string, its values are not stored in memory. Only one value exists at a time, and the program forgets it once it moves on to the next one. For this reason, generators can only be iterated through once. It's also what makes them so cheap to create and so light on memory.
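We'll get to the syntax in a moment, but to quantify the memory claim up front, here's a rough sketch using sys.getsizeof (exact byte counts vary by platform and Python build):

import sys

big_list = range(1000000)                 # the full list is built immediately
lazy_gen = (x for x in xrange(1000000))   # nothing is computed yet

print sys.getsizeof(big_list)   # millions of bytes: one slot per element
print sys.getsizeof(lazy_gen)   # a small constant: just the generator frame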

What does a generator look like?

We can write a simple one like a list comprehension:

>>> gen = (x for x in range(10))
>>> gen
<generator object <genexpr> at 0x2dbbbe0>

Things to note here: the syntax is identical to a list comprehension except for the parentheses, and the interpreter echoes a generator object rather than a list of values. Nothing has been computed yet.

Cool, but what do we do with a generator object?

In simplest forms, we can loop through it like a list and get all of our values.

>>> [x for x in gen]
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

See, it works exactly like we would expect. But what happens when we try to iterate through this again?

>>> [x for x in gen]
[]

An empty list! Where are the numbers we had before? As I mentioned above, generators can only be iterated through once, because a generator saves its current state as you progress through it. If you pull the first number from the generator, it records that that value has already been produced; when you ask for the next one, it remembers where it left off and computes the following value.
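If you do need the values more than once, the usual options are to rebuild the generator or to materialize it into a list first. A minimal sketch:

gen = (x * x for x in range(5))
print list(gen)   # first pass consumes everything: [0, 1, 4, 9, 16]
print list(gen)   # nothing left: []

# Either recreate the generator for another pass...
gen = (x * x for x in range(5))
# ...or keep a real list around if you truly need multiple passes.
squares = list(x * x for x in range(5))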

Once we've iterated through everything, the generator remembers that it is exhausted, which is why the second comprehension above produced an empty list rather than the numbers. To see this in effect, check this out:

>>> gen = (x for x in range(5))
>>> gen.next()
0
>>> gen.next()
1
>>> gen.next()
2
>>> gen.next()
3
>>> gen.next()
4
>>> gen.next()
Traceback (most recent call last):
  File "<input>", line 1, in <module>
StopIteration

As you can tell, generator objects have a next() method attached to them (Python 2 spelling; in Python 3 it's __next__(), and the built-in next(gen) works in both). The first call produces 0, the first number in the sequence. The second call remembers that one step has already been taken and produces the next number. When the generator is exhausted, it raises a StopIteration exception.
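In fact, this is exactly what a for loop does behind the scenes: call next() repeatedly and stop when StopIteration is raised. A hand-rolled sketch:

gen = (x for x in range(3))
while True:
    try:
        value = gen.next()   # the built-in next(gen) also works here
    except StopIteration:
        break                # a for loop catches this for us
    print value              # 0, then 1, then 2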

This concept is very cool and lets us do some intensive computing at a low memory cost. What allows us to save the state? It's the yield keyword. If I were to expand the generator expression from before into a real function, we would get something like this:

>>> def gen():
...     l = range(5)
...     for x in l:
...         yield x
>>> y = gen()
>>> y
<generator object gen at 0x2dbb0a0>
>>> y.next()
0
>>> y.next()
1
>>> y.next()
2
>>> y.next()
3
>>> y.next()
4
>>> y.next()
Traceback (most recent call last):
  File "<input>", line 1, in <module>
StopIteration

As you can see from the result, the function behaves exactly the same way. What's new is the yield statement, which is similar to a return statement with a few important differences. In fact, yield is the one thing that makes this function a generator. When we go through our loop and yield the number, we are actually doing two things.

First, we return the number itself; we can see this because it's echoed at the prompt. Second, yield saves the state of the function. That saved state is what lets us pick up where we left off when looping or calling next().
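To watch the pause-and-resume behavior directly, we can sprinkle some prints around the yields (a small illustrative sketch):

>>> def chatty():
...     print "running up to the first yield"
...     yield 1
...     print "resumed; running up to the second yield"
...     yield 2
>>> c = chatty()
>>> c.next()
running up to the first yield
1
>>> c.next()
resumed; running up to the second yield
2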

How about some proof that generators are fast? Below I'm using the timeit module built into Python to measure how long each approach takes, running each statement 100 times.

from timeit import timeit

# Running a normal list comprehension using range
timeit("[x for x in range(1000000)]", number=100)
# 5.336906909942627

# Running a normal list comprehension using xrange
timeit("[x for x in xrange(1000000)]", number=100)
# 3.492034912109375

# Running a generator using range
timeit("(x for x in range(1000000))", number=100)
# 1.6750469207763672

# Running a generator using xrange
timeit("(x for x in xrange(1000000))", number=100)
# 0.00013709068298339844

As you can see, we save ourselves an incredible amount of time by using a generator, and even more when we use xrange over range (but we already knew that). One caveat: the generator expressions above are never consumed, so those timings measure only the cost of creating the generator; the per-item work still happens lazily, as each value is pulled.
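For a fairer comparison that actually performs the iteration work, force the generator to be consumed. A sketch (absolute numbers will vary by machine and Python version):

from timeit import timeit

# Consuming the generator does the same per-item work as the list
# comprehension, but never holds all million values in memory at once.
timeit("for x in (x for x in xrange(1000000)): pass", number=100)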

In the end, why should we use generators? Because in the real world we typically deal with large amounts of data. As our datasets grow, regular iteration isn't going to cut it. We need to save memory where we can, and generators are our best friends.
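To close with the kind of real-world pattern this enables, here's a sketch of streaming a large file line by line (the path and the process() handler are hypothetical):

def read_large_file(path):
    # Yield one line at a time instead of loading the whole file into memory.
    with open(path) as f:
        for line in f:
            yield line.rstrip('\n')

for line in read_large_file('huge.log'):
    process(line)   # only one line is in memory at any moment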

