{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Numpy\n", "\n", "Numpy is a popular Python library for data analysis because it provides data structures for n-dimensional arrays and matrices. These structures support many of the common operations you might want to perform on a matrix.\n", "\n", "Let's start by creating an array of 9 random integers in the range [-10, 10):" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[-9 -3 -8 -7 4 -4 2 2 -2]\n" ] } ], "source": [ "import numpy as np\n", "\n", "data = np.random.randint(-10, 10, size = 9)\n", "print(data)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note two things here. First, while `data` looks like a list, it isn't: it is an `ndarray`, which stands for *n-dimensional array*. This is the main datatype that numpy provides.\n", "\n", "Second, we can ask any ndarray for its *shape*, which tells us how many dimensions it has and how big each dimension is. In this case, there is 1 dimension with 9 elements." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "<class 'numpy.ndarray'>\n", "(9,)\n" ] } ], "source": [ "print(type(data))\n", "print(data.shape)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Reshaping\n", "\n", "One of the most useful features of Numpy is the ability to *reshape* one ndarray into another. Think of this as \"pouring\" the data from one array, row by row, into the next array, row by row. 
So, for example, we can reshape the 9-element, 1-dimensional array into a 9x1 2-dimensional array:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[[-9]\n", " [-3]\n", " [-8]\n", " [-7]\n", " [ 4]\n", " [-4]\n", " [ 2]\n", " [ 2]\n", " [-2]]\n", "(9, 1)\n", "<class 'numpy.ndarray'>\n" ] } ], "source": [ "data_arr = np.reshape(data, (9, 1))\n", "print(data_arr)\n", "print(data_arr.shape)\n", "print(type(data_arr))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Or into a 1x9 2-dimensional array:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[[-9 -3 -8 -7 4 -4 2 2 -2]]\n", "(1, 9)\n" ] } ], "source": [ "data_arr = np.reshape(data_arr, (1, 9))\n", "print(data_arr)\n", "print(data_arr.shape)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Or into a 3x3 array. Note that in this case, the data is filled in row-by-row:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[[-9 -3 -8]\n", " [-7 4 -4]\n", " [ 2 2 -2]]\n", "(3, 3)\n" ] } ], "source": [ "data_arr = np.reshape(data_arr, (3, 3))\n", "print(data_arr)\n", "print(data_arr.shape)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that you can only reshape ndarrays into \"compatible\" ones: they must hold exactly the same number of elements. 
Reshaping the array into one that is too small or too large won't work:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "ename": "ValueError", "evalue": "cannot reshape array of size 9 into shape (4,4)", "output_type": "error", "traceback": [ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[0;31mValueError\u001b[0m Traceback (most recent call last)", "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mtoo_large\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mnp\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mreshape\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mdata_arr\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m(\u001b[0m\u001b[0;36m4\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;36m4\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 2\u001b[0m \u001b[0mtoo_small\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mnp\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mreshape\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mdata_arr\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m(\u001b[0m\u001b[0;36m3\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;36m2\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;32m/usr/local/lib/python3.7/site-packages/numpy/core/fromnumeric.py\u001b[0m in \u001b[0;36mreshape\u001b[0;34m(a, newshape, order)\u001b[0m\n\u001b[1;32m 290\u001b[0m [5, 6]])\n\u001b[1;32m 291\u001b[0m \"\"\"\n\u001b[0;32m--> 292\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0m_wrapfunc\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0ma\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m'reshape'\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mnewshape\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0morder\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0morder\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 293\u001b[0m 
\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 294\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;32m/usr/local/lib/python3.7/site-packages/numpy/core/fromnumeric.py\u001b[0m in \u001b[0;36m_wrapfunc\u001b[0;34m(obj, method, *args, **kwds)\u001b[0m\n\u001b[1;32m 54\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0m_wrapfunc\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mobj\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mmethod\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m*\u001b[0m\u001b[0margs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mkwds\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 55\u001b[0m \u001b[0;32mtry\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 56\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0mgetattr\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mobj\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mmethod\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m*\u001b[0m\u001b[0margs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mkwds\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 57\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 58\u001b[0m \u001b[0;31m# An AttributeError occurs if the object does not have\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;31mValueError\u001b[0m: cannot reshape array of size 9 into shape (4,4)" ] } ], "source": [ "too_large = np.reshape(data_arr, (4, 4))\n", "too_small = np.reshape(data_arr, (3, 2))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Collective operations:\n", "\n", "One useful thing that Numpy supports is the ability to do \"collective\" operations on the rows/columns/etc of an n-dimensional array. 
For example, you can compute the mean of every column in `data_arr` by asking for the mean along axis 0:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[-4.66666667 1. -4.66666667]\n" ] } ], "source": [ "column_mean = np.mean(data_arr, axis=0)\n", "print(column_mean)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that the result is a 1-dimensional ndarray with 3 elements (shape `(3,)`): each column has its own mean, so there are as many entries in the result as there are columns in the original.\n", "\n", "You can also compute means along other axes. For example, axis 1 gives the mean of each row:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[-6.66666667 -2.33333333 0.66666667]\n" ] } ], "source": [ "row_mean = np.mean(data_arr, axis=1)\n", "print(row_mean)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can also compute things like standard deviation, variance, and sum:" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([4.78423336, 2.94392029, 2.49443826])" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.std(data_arr, axis=0)" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([22.88888889, 8.66666667, 6.22222222])" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.var(data_arr, axis = 0)" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([-14, 3, -14])" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.sum(data_arr, axis = 0)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## nd-array math\n", "\n", "One of the trickier things to get used to in numpy is how math on 
ndarrays is performed.\n", "\n", "The first thing to keep in mind is that Numpy *always* tries to do element-by-element operations if it can. If you add two arrays together, you will add together the individual elements if the shapes are compatible. But, surprisingly, the same thing will happen if you *multiply* two arrays together:" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[[-5 0 0]\n", " [ 2 -1 -4]\n", " [-5 -5 1]]\n" ] } ], "source": [ "data_tmp = np.random.randint(-5, 5, (3, 3))\n", "print(data_tmp)" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[[-9 -3 -8]\n", " [-7 4 -4]\n", " [ 2 2 -2]]\n", "[[-14 -3 -8]\n", " [ -5 3 -8]\n", " [ -3 -3 -1]]\n", "[[ 45 0 0]\n", " [-14 -4 16]\n", " [-10 -10 -2]]\n" ] } ], "source": [ "print(data_arr)\n", "print(data_arr + data_tmp)\n", "print(data_arr * data_tmp)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Broadcasting\n", "\n", "So what if you try to do operations on ndarrays that aren't the same size? 
Sometimes, the operation will just fail:" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[[2 3]\n", " [5 6]]\n" ] } ], "source": [ "data_bad = np.array([[2, 3], [5, 6]])\n", "print(data_bad)" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "ename": "ValueError", "evalue": "operands could not be broadcast together with shapes (3,3) (2,2) ", "output_type": "error", "traceback": [ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[0;31mValueError\u001b[0m Traceback (most recent call last)", "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mdata_arr\u001b[0m \u001b[0;34m+\u001b[0m \u001b[0mdata_bad\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", "\u001b[0;31mValueError\u001b[0m: operands could not be broadcast together with shapes (3,3) (2,2) " ] } ], "source": [ "data_arr + data_bad" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "But other times, numpy will try to \"broadcast\" the operands so that the dimensions line up. Conceptually, it does this by copying the data along missing dimensions (or along dimensions of size 1) to create compatible ndarrays. 
The [rules of broadcasting](https://docs.scipy.org/doc/numpy/user/basics.broadcasting.html) are complicated, so here we'll just talk about a few of them.\n", "\n", "If you try to perform math with a scalar value, numpy will copy that value out into an array of matching shape before performing the operation:" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[[-9 -3 -8]\n", " [-7 4 -4]\n", " [ 2 2 -2]]\n", "[[-10 -4 -9]\n", " [ -8 3 -5]\n", " [ 1 1 -3]]\n" ] } ], "source": [ "print(data_arr)\n", "print(data_arr - 1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If instead you have a column vector with the same number of rows as the other operand, numpy will copy that column enough times to match the other operand's number of columns:" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[[ 4]\n", " [-4]\n", " [ 2]]\n" ] } ], "source": [ "col_data = np.random.randint(-5, 5, size=(3, 1))\n", "print(col_data)" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[[-9 -3 -8]\n", " [-7 4 -4]\n", " [ 2 2 -2]]\n", "[[-13 -7 -12]\n", " [ -3 8 0]\n", " [ 0 0 -4]]\n" ] } ], "source": [ "print(data_arr)\n", "print(data_arr - col_data)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The same happens for row vectors, which are copied enough times to match the number of rows:" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[[3 3 4]]\n" ] } ], "source": [ "row_data = np.random.randint(-5, 5, size=(1, 3))\n", "print(row_data)" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[[-9 -3 -8]\n", " [-7 4 -4]\n", " [ 2 2 -2]]\n", "[[-12 -6 -12]\n", " [-10 1 -8]\n", " [ -1 -1 -6]]\n" ] } ], "source": [ 
"print(data_arr)\n", "print(data_arr - row_data)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Remember that numpy does element-by-element math, so multiplication does not do matrix multiplication:" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[[ 45 0 0]\n", " [-14 -4 16]\n", " [-10 -10 -2]]\n" ] } ], "source": [ "print(data_arr * data_tmp)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Instead, you need to use `numpy.dot`:" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[[ 79 43 4]\n", " [ 63 16 -20]\n", " [ 4 8 -10]]\n" ] } ], "source": [ "print(np.dot(data_arr, data_tmp))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Numpy Matrices\n", "\n", "While you can use ndarrays to do a lot of matrix math (e.g., `numpy.dot` for matrix multiplication, `numpy.linalg.inv` for matrix inversion, etc. 
-- see [Numpy Linear Algebra](https://docs.scipy.org/doc/numpy/reference/routines.linalg.html)), numpy also provides a special class for doing matrix math called, unsurprisingly, `matrix`. (The NumPy documentation no longer recommends this class for new code and suggests regular ndarrays instead, but you will still see it in existing code.)\n", "\n", "You can create a matrix by passing a two-dimensional ndarray to `numpy.matrix`:" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[[-9 -3 -8]\n", " [-7 4 -4]\n", " [ 2 2 -2]]\n", "[[-9 -3 -8]\n", " [-7 4 -4]\n", " [ 2 2 -2]]\n" ] } ], "source": [ "data_mtx = np.matrix(data_arr)\n", "print(data_arr)\n", "print(data_mtx)" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [], "source": [ "new_mtx = np.matrix(data_tmp)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And now the `*` operator performs matrix multiplication, giving the same result as `numpy.dot` above:" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[[ 79 43 4]\n", " [ 63 16 -20]\n", " [ 4 8 -10]]\n" ] } ], "source": [ "print(data_mtx * new_mtx)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Matrices also give easy access to transpose and inverse operations:" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[[-9 -7 2]\n", " [-3 4 2]\n", " [-8 -4 -2]]\n", "[[-0. 
-0.09090909 0.18181818]\n", " [-0.09090909 0.14049587 0.08264463]\n", " [-0.09090909 0.04958678 -0.23553719]]\n" ] } ], "source": [ "print(data_mtx.T) #Transpose\n", "print(data_mtx.I) #Inverse" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.2" } }, "nbformat": 4, "nbformat_minor": 2 }