Python List Comprehensions, Generators, and Built-Ins: What to use?
Recently I've been writing about some of Python's standard library as well as comprehensions and I think it can get confusing on what you should use.
Today I wanted to clear the air and give some benchmarks on the different methods for generating iterable-type objects.
So what do I mean about an iterable-type object?
In Python, an iterable is an object that has an
__iter__ method on it.
When this exists, it provides a blueprint for what will be returned when you
loop over it. In my last post I talked about the
iter() function and
it turns out that it can be a good way to check if an item is in fact an
iterable. We can do this through duck typing like so:
try: i = iter(MY_ELEMENT) except TypeError: return 'This is not an iterator' else: return 'I am an iterator!'
From this result we know if we have an iterable. As a basis, lists, tuples, and dictionaries are all iterable types. Knowing that, let's get into benchmarking some tasks.
For my first set of tests, I wanted to work a simple filtering exercise. The idea is to create a new list of the even numbers in a range of 0-50.
>>> from timeit import timeit >>> timeit('[x for x in xrange(50) if x % 2 == 0]', number=1000) # 0.004503965377807617 >>> timeit('filter(lambda x: x % 2 == 0, xrange(50))', number=1000) # 0.0074999332427978516 >>> timeit('from itertools import ifilter; ifilter(lambda x: x % 2 == 0, xrange(50))', number=1000) # 0.0016210079193115234 >>> timeit('(x for x in xrange(50) if x % 2 == 0)', number=1000) # 0.0006279945373535156
Looking at the result, it's clear that there really isn't an obvious winner. Each filter is very fast and gives us an iterable type when we're done. This is what we expected though. On small data sets, these kinds of operations are routine and will typically not matter when tuning your application.
What happens when we have a lot more data in our list though?
I tried the same tests again but this time used 50,000 items.
>>> from timeit import timeit >>> timeit('[x for x in xrange(50000) if x % 2 == 0]', number=1000) # 3.7509877681732178 >>> timeit('filter(lambda x: x % 2 == 0, xrange(50000))', number=1000) # 6.625590085983276 >>> timeit('from itertools import ifilter; ifilter(lambda x: x % 2 == 0, xrange(50000))', number=1000) # 0.002830982208251953 >>> timeit('(x for x in xrange(50000) if x % 2 == 0)', number=1000) # 0.0006492137908935547
Looks like some methods are clearly more efficient than others. Before I go on with the explanations, I wanted to also show off the mapping functions. In my example, I wanted to create a new iterable with cubed numbers. The first test is with number 0-50 as we did before.
>>> from timeit import timeit >>> timeit('[x**3 for x in xrange(50)]', number=1000) # 0.006129026412963867 >>> timeit('map(lambda x: x**3, xrange(50))', number=1000) # 0.009211063385009766 >>> timeit('from itertools import imap; imap(lambda x: x**3, xrange(50))', number=1000) # 0.0015909671783447266 >>> timeit('(x**3 for x in xrange(50))', number=1000) # 0.0006258487701416016
Again, we aren't seeing a huge benefit for any method in comparison to the others. Let's try this again on 50,000 items though.
>>> from timeit import timeit >>> timeit('[x**3 for x in xrange(50000)]', number=1000) # 5.076784133911133 >>> timeit('map(lambda x: x**3, xrange(50000))', number=1000) # 8.603441953659058 >>> timeit('from itertools import imap; imap(lambda x: x**3, xrange(50000))', number=1000) # 0.0032291412353515625 >>> timeit('(x**3 for x in xrange(50000))', number=1000) # 0.0012519359588623047
Again, looks like we have some much more efficient routes.
So what's going on?
As soon as we move into larger sets of data, we can see that there are two clear losers in both test cases. The first of which is the list comprehension. This is the third slowest option because a list comprehension will build us the actual list. What I mean is that when we are finished with the mapping or filtering action, the list is made and available. We don't need to iterate through it, we can directly index it and play with values. For a small data set, this can be good because we may want to interact with the data. When scaling up the size though, we have better options.
Before I get to the clear winners though, let's look at the clear loser of all four: the built-in functions.
In both tests, the built-ins of
map() performed the absolute slowest for us. There are a couple reasons for this:
- We are creating a full list in memory like the list comprehension. This means processing it all in one shot.
- We are using lambda. Running items though a lambda function is just another hoop we have to jump through to get an end result.
In general, these two functions are useful but just not as efficient as a list comprehension. What's funny is Python's BDFL Guido Van Rossum actually pushed to have them remove in Python 3 as they were more or less something for Lisp developers to hold on to. The functions remained, although in Python 3 they act like their Itertools counterparts.
Speaking of itertools, using the
i functions gives us one of the fastest and most efficient choices. How are they so fast?
You need to remember that while the built-ins and list comprehension generate a full list for us, the Itertools functions will return an iterator object. In that instance, we must iterate over our objects to get any values from it. We don't have those available to us to reference unless in a loop. This is why the benchmarks favor this route. If for instance we were to use the
list() function with our Itertools cases, you'd see the same performance as the built-ins.
The best choice though?
It's a generator.
I used a generator expression for this instance and it looks exactly like a list comprehension except for the
( ). The generator expression has the advantage because it uses the
yield statement to give us values. It works like the Itertools cases in that to get the actual values you must iterate through it. The way it returns values is incredibly efficient though as it computes the numbers on the fly.
So given all of this, which option should you choose?
I like to stick to three rules.
- If you're working with a small data set and want the resulting iterable, list comprehensions are a good choice.
- If you're working with a large data set, use a generator.
- If you don't care about the resulting iterable, use a generator.
Generators are one of the more efficient bits of Python and will give you the benefit of having a strong base if the data grows. As well, the syntax is much more Pythonic than any of the built-in and functional types so it's good practice to use them.
Hopefully this sheds some light on the "which method is best?" debate.
comments powered by Disqus