The other day I discovered an interesting Python behavior that I somehow had managed not to hit before — in fairness, I use Python mostly for scripting and automation, not ‘real’ software development, but I still thought I understood the basics reasonably well.
Can you spot the problem? The following is designed to remove words from a list if they are below a certain number of characters, specified by args.minlength:
for w in words:
    if len(w) < int(args.minlength):
        words.remove(w)
The impending misbehavior, if you didn’t catch it by this point, is not necessarily obvious. It won’t barf an error at you, and you can actually get it to pass a trivial test, depending on how the test data is configured. But on a real dataset, you’ll end up with lots of words shorter than args.minlength left in the words list after you (thought you had) iterated through and cleaned them!
(If you want to play with this on your own, try running the above loop against the contents of your personal iSpell dictionary, typically ~/.ispell_english on Unix/Linux, or some other word list. The defect will quickly become apparent.)
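If you don't have a word list handy, here is a minimal sketch of the same loop on made-up data. The function name, the sample lists, and the min_length parameter (standing in for int(args.minlength)) are all assumptions for illustration. The first list leaves short words behind; the second happens to come out clean, which is how a trivial test can pass.

```python
def remove_short_buggy(words, min_length):
    """Same loop as above: removes items from the list it is iterating over."""
    for w in words:
        if len(w) < min_length:
            words.remove(w)
    return words

# Short words sit next to each other, so some of them are skipped.
print(remove_short_buggy(["a", "to", "cat", "an", "of", "horse"], 3))
# -> ['to', 'cat', 'of', 'horse']

# No two short words are adjacent, so the result happens to be correct.
print(remove_short_buggy(["a", "cat", "to", "horse"], 3))
# -> ['cat', 'horse']
```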
A good description of the problem, along with several solutions, is of course found on Stack Overflow. But to save you the click: the problem is iterating over a mutable object, such as a list, and then modifying the list (e.g. by removing items) inside the loop. Per the Python docs, you shouldn’t do that:
If you need to modify the sequence you are iterating over while inside the loop (for example to duplicate selected items), it is recommended that you first make a copy. Iterating over a sequence does not implicitly make a copy.
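To see why items get skipped, it helps to record what the loop actually visits. The list iterator tracks its position by index, so when an item is removed, everything after it shifts down one slot and the next item slides past the iterator unseen. The sketch below uses a hypothetical helper and sample data, not anything from the original script:

```python
def trace_visits(words, min_length):
    """Run the problematic loop and return the items the loop actually saw."""
    visited = []
    for w in words:
        visited.append(w)
        if len(w) < min_length:
            words.remove(w)  # shifts later items left by one position
    return visited

print(trace_visits(["a", "to", "cat", "an", "of", "horse"], 3))
# -> ['a', 'cat', 'an', 'horse']   ("to" and "of" were never examined)
```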
The solution is easy:
for w in words[:]:
    if len(w) < int(args.minlength):
        words.remove(w)
Adding the slice notation causes Python to iterate over a copy of the list (pre-modification), which is what you actually want most of the time, and then you’re free to modify the actual list all you want from inside the loop. There are lots of other possible solutions if you don’t like the slice notation, but that one seems pretty elegant (and it’s what’s recommended in the Python docs so it’s presumably what someone else reading your code ought to expect).
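As one example of those other solutions, here is a sketch that builds the filtered list with a list comprehension and assigns it back through a slice, so the original list object is updated in place. The function name and min_length parameter are assumptions for illustration; min_length again stands in for int(args.minlength).

```python
def remove_short(words, min_length):
    """Keep only words of at least min_length characters, updating the list in place."""
    # The comprehension builds a new list first; the slice assignment then
    # replaces the contents of the existing list object.
    words[:] = [w for w in words if len(w) >= min_length]
    return words

sample = ["a", "to", "cat", "an", "of", "horse"]
remove_short(sample, 3)
print(sample)  # -> ['cat', 'horse']
```

If nothing else holds a reference to the list, a plain reassignment (words = [...]) would do the same job; the slice assignment only matters when other code needs to see the change.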
I’d seen the for item in list[:]: construct in sample code before, but the exact nature of the bugs it prevents hadn’t been clear to me until now. Perhaps this will be enlightening to someone else as well.