{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Iterators and Generators" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What's going on when you run a `for` loop in Python?" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1\n", "16\n", "25\n", "81\n" ] } ], "source": [ "a = [1, 4, 5, 9]\n", "for x in a :\n", "    print (x ** 2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Under the hood, Python creates an *iterator*: an object that lets you step through the list one element at a time, returning each element as it goes. We can get the underlying iterator of a list by calling its `__iter__` method:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n" ] } ], "source": [ "i = a.__iter__()\n", "print(i)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "An iterator object is one that implements the iterator *protocol*.\n", "\n", "> A protocol is like an interface in Java or an abstract class in C++: a contract that an object must satisfy to be used in certain ways.\n", "\n", "The iterator protocol says that an iterator needs to support two operations: `__iter__`, which returns the iterator itself, and `__next__`, which, as you might expect, returns the next element of whatever is being iterated over.\n", "\n", "> We need to define `__iter__` because `for` loops invoke `__iter__` on whatever you are iterating over, and Python wants you to be able to use both collections like lists and the iterators themselves in `for` loops." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1\n", "4\n", "5\n", "9\n" ] } ], "source": [ "print(i.__next__())\n", "print(i.__next__())\n", "print(i.__next__())\n", 
"print(i.__next__())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "When an iterator runs out of items (i.e., when `__next__` gets to the end fo the collection), it *raises* a `StopIteration` exception.\n", "\n", "> We have not really talked about exceptions in this class. Exceptions are a language construct that lets you break out of even very deep control flow when something \"bad\" happens (it doesn't have to be truly bad, like in the `StopIteration` case). You can then *catch* exceptions to do something (like end a `for` loop) when they are raised." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "ename": "StopIteration", "evalue": "", "output_type": "error", "traceback": [ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[0;31mStopIteration\u001b[0m Traceback (most recent call last)", "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mprint\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mi\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m__next__\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;31m#this will raise an exception\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", "\u001b[0;31mStopIteration\u001b[0m: " ] } ], "source": [ "print(i.__next__()) #this will raise an exception" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Because iterators are different objects, you can create *multiple* iterators from the same collection that each step over the data:" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1\n", "4\n", "1\n", "4\n", "5\n", "9\n", "5\n", "9\n" ] } ], "source": [ "i1 = a.__iter__()\n", "i2 = a.__iter__()\n", "print(i1.__next__())\n", "print(i1.__next__())\n", "print(i2.__next__())\n", "print(i2.__next__())\n", "print(i1.__next__())\n", 
"print(i1.__next__())\n", "print(i2.__next__())\n", "print(i2.__next__())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's write an iterator that prints out every *other* element of a list. Note that we're going to use a trick: we'll keep a \"normal\" iterator for the list as part of our skip iterator:" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "class skipIterator :\n", " def __init__(self, inList) :\n", " self._inner = inList.__iter__()\n", " \n", " def __iter__(self) :\n", " return self #remember we just return ourself\n", " \n", " def __next__(self) :\n", " #There's a trick here: we don't need to\n", " #raise StopException ourselves. The inner\n", " #iterator will raise the exception, and since\n", " #we don't do anything special, that exception\n", " #will propagate out of __next__ as if we\n", " #raised it ourself\n", " self._inner.__next__() #skip one element\n", " return self._inner.__next__() #return the next" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "2\n", "4\n", "6\n", "8\n", "10\n" ] } ], "source": [ "s = skipIterator([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])\n", "for x in s :\n", " print (x)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Generators\n", "\n", "Normally, an iterator needs a way to keep track of its \"current\" position in the data its iterating over, which can make for complicated code. Instead, we can use *generators* to do this kind of tracking automatically. A generator is a function that *yields* elements as it executes. A yield statement essentially returns a value from the function, but \"pauses\" the function where the yield was invoked.\n", "\n", "Any function that has a `yield` statement in it automatically returns a generator object. It implements the iterator protocol (so you can use it in for loops). 
Calling `__next__` on a generator object executes the function until it reaches a `yield` statement, then pauses and returns whatever is yielded. Calling `__next__` again picks up execution right after that `yield` and runs until the next `yield`.\n", "\n", "Let's start by writing an iterator that counts out the gaps between occurrences of a particular letter in a string. Note how we have to keep track of both how far along we are in the string and how long the current gap is:" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [], "source": [ "class findDist :\n", "    def __init__(self, tstr, char) :\n", "        self.string = tstr\n", "        self.char = char\n", "        self.pos = 0 #index of the next character to examine\n", "\n", "    def __iter__(self) :\n", "        return self\n", "\n", "    def __next__(self) :\n", "        delta = 0\n", "        #scan forward to the next occurrence of char,\n", "        #taking care not to run off the end of the string\n", "        while (self.pos < len(self.string) and self.string[self.pos] != self.char) :\n", "            delta += 1\n", "            self.pos += 1\n", "        if (self.pos >= len(self.string)) :\n", "            raise StopIteration() #no more occurrences of char\n", "        self.pos += 1 #important to skip over the char\n", "        return delta" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0\n", "2\n", "1\n", "1\n", "2\n" ] } ], "source": [ "for i in findDist('abracadabra', 'a') :\n", "    print (i)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let's write the same thing using a generator. 
By writing a function with `yield`, we automatically get an iterator without having to write a class that implements the protocol:" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [], "source": [ "def findDistYield(string, char) :\n", "    delta = 0\n", "    for c in string :\n", "        if c == char :\n", "            yield delta\n", "            delta = 0\n", "        else :\n", "            delta += 1" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0\n", "2\n", "1\n", "1\n", "2\n" ] } ], "source": [ "for i in findDistYield('abracadabra', 'a') :\n", "    print(i)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can use either approach to make new classes that we define iterable. Consider a linked list class with a hand-written iterator:" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [], "source": [ "class LinkedList :\n", "\n", "    def __init__(self, init_val = None) :\n", "        #a node whose data is None marks the end of the list\n", "        self.data = init_val\n", "        self.next = None\n", "\n", "    def insert(self, val) :\n", "        #insert at the front: push the current head into a new node\n", "        newNode = LinkedList(self.data)\n", "        newNode.next = self.next\n", "        self.next = newNode\n", "        self.data = val\n", "\n", "    def insertList(self, vals) :\n", "        for i in vals[::-1] :\n", "            self.insert(i)\n", "\n", "    class LinkedListIterator :\n", "        def __init__(self, cur) :\n", "            self.cur = cur\n", "\n", "        def __iter__(self) :\n", "            return self\n", "\n", "        def __next__(self) :\n", "            if (self.cur.data is None) :\n", "                raise StopIteration\n", "            else :\n", "                ret = self.cur.data\n", "                self.cur = self.cur.next\n", "                return ret\n", "\n", "    def __iter__(self) :\n", "        return LinkedList.LinkedListIterator(self)" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1\n", "2\n", "3\n", "4\n", "5\n" ] } ], "source": [ "l = LinkedList()\n", "l.insertList([1, 2, 3, 4, 5])\n", "for x in l :\n", "    print(x)" ] }, { 
"cell_type": "markdown", "metadata": {}, "source": [ "Now let's do the same thing with a generator:" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [], "source": [ "class LinkedList2 :\n", "\n", " def __init__(self, init_val = None) :\n", " self.data = init_val\n", " self.next = None\n", " \n", " def insert(self, val) :\n", " newNode = LinkedList(self.data)\n", " newNode.next = self.next\n", " self.next = newNode\n", " self.data = val\n", " \n", " def insertList(self, vals) :\n", " for i in vals[::-1] :\n", " self.insert(i) \n", " \n", " def __iter__(self) :\n", " cur = self\n", " while (cur.data != None) :\n", " yield cur.data\n", " cur = cur.next" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "5\n", "4\n", "3\n", "2\n", "1\n" ] } ], "source": [ "l2 = LinkedList2()\n", "l2.insertList([5, 4, 3, 2, 1])\n", "for x in l2 :\n", " print(x)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Much shorter!\n", "\n", "But note that LinkedList is a recursive type: its next pointer is another linked list. Could we do something even more clever with generators? Yes! 
We can yield the current element, then iterate over the rest of the list by invoking its generator:" ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [], "source": [ "class LinkedList3 :\n", "\n", "    def __init__(self, init_val = None) :\n", "        self.data = init_val\n", "        self.next = None\n", "\n", "    def insert(self, val) :\n", "        #new nodes must be LinkedList3 objects so that\n", "        #the recursive __iter__ below is used for every node\n", "        newNode = LinkedList3(self.data)\n", "        newNode.next = self.next\n", "        self.next = newNode\n", "        self.data = val\n", "\n", "    def insertList(self, vals) :\n", "        for i in vals[::-1] :\n", "            self.insert(i)\n", "\n", "    def __iter__(self) :\n", "        if self.data is not None :\n", "            yield self.data\n", "            yield from self.next #delegate to the rest of the list" ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "2\n", "4\n", "6\n", "8\n", "10\n" ] } ], "source": [ "l3 = LinkedList3()\n", "l3.insertList([2, 4, 6, 8, 10])\n", "for x in l3 :\n", "    print(x)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Chaining Generators\n", "\n", "We can chain generators together: by passing one generator to another and iterating over each, we can build a pipeline that passes data from one to the next. 
What's great about this is that the processing happens one element at a time (thanks to the `yield` statement) rather than fully building a list each time.\n", "\n", "Let's first do this the normal way:" ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[-0.5, -2.0, -4.5, -8.0, -12.5]" ] }, "execution_count": 42, "metadata": {}, "output_type": "execute_result" } ], "source": [ "def square(vals) :\n", "    return [v ** 2 for v in vals]\n", "\n", "def negate(vals) :\n", "    return [-1 * v for v in vals]\n", "\n", "def div(vals) :\n", "    return [v / 2 for v in vals]\n", "\n", "negate(div(square([1, 2, 3, 4, 5])))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The problem is that each function in the chain builds a brand new list. This can take a lot of memory, and a lot of time, if the lists are big. Let's now do the same thing with generators:" ] }, { "cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 45, "metadata": {}, "output_type": "execute_result" } ], "source": [ "def ysquare(vals) :\n", "    for v in vals :\n", "        yield v ** 2\n", "\n", "def ynegate(vals) :\n", "    for v in vals :\n", "        yield -1 * v\n", "\n", "def ydiv(vals) :\n", "    for v in vals :\n", "        yield v / 2\n", "\n", "ynegate(ydiv(ysquare([1, 2, 3, 4, 5])))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that this does not produce a list: since `ynegate` is a generator function, calling it returns a generator object. You have to iterate over it in order to get the values out of it. 
Luckily, `list`s can be constructed by passing them an iterator:" ] }, { "cell_type": "code", "execution_count": 47, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[-0.5, -2.0, -4.5, -8.0, -12.5]" ] }, "execution_count": 47, "metadata": {}, "output_type": "execute_result" } ], "source": [ "g = ynegate(ydiv(ysquare([1, 2, 3, 4, 5])))\n", "list(g)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We get the same result, but each time the `list` constructor asks for the next element, each generator in the chain operates on just one additional item in the input list. The item is squared, then yielded to `ydiv`, then yielded to `ynegate`.\n", "\n", "One final thing: just like we have list comprehensions as a fast way of building new lists, we have *generator expressions* as a fast way of building simple generators:" ] }, { "cell_type": "code", "execution_count": 55, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "generator" ] }, "execution_count": 55, "metadata": {}, "output_type": "execute_result" } ], "source": [ "esquare = (v ** 2 for v in [1, 2, 3, 4, 5])\n", "type(esquare)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It's probably useful to compare that to what the list comprehension would have looked like:\n", "`[v ** 2 for v in [1, 2, 3, 4, 5]]`\n", "But since we used a generator expression, we created a generator that needs to be iterated over, rather than a new list. 
We can then keep the chain going:" ] }, { "cell_type": "code", "execution_count": 56, "metadata": {}, "outputs": [], "source": [ "ediv = (v / 2 for v in esquare)\n", "enegate = (-1 * v for v in ediv)" ] }, { "cell_type": "code", "execution_count": 57, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[-0.5, -2.0, -4.5, -8.0, -12.5]" ] }, "execution_count": 57, "metadata": {}, "output_type": "execute_result" } ], "source": [ "list(enegate)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.2" } }, "nbformat": 4, "nbformat_minor": 2 }