In [1]:
# Standard imports
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set_context("talk")

Numpy Review

This is a brief overview of numpy for DS100. The Jupyter Notebook can be obtained here: Numpy_Review.ipynb.

Importing Numpy

It is customary to import Numpy as np:

In [2]:
import numpy as np

As you learned in homework one, np.array is the key data structure in NumPy for dense arrays of data.

Creating Arrays

You can build arrays from Python lists. Data 8 compatibility: in Data 8 you used the datascience package function make_array, which wraps the more standard np.array function we will use in this class.
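For readers coming from Data 8, here is a quick sketch of the correspondence (this assumes the datascience package is installed; both calls return an ordinary NumPy array):

from datascience import make_array   # Data 8 helper (requires the datascience package)
import numpy as np

a = make_array(1, 2, 3)   # Data 8 style
b = np.array([1, 2, 3])   # standard NumPy style used in this class
print(type(a), type(b))   # both are <class 'numpy.ndarray'>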

In [3]:
np.array([[1.,2.], [3.,4.]])
Out[3]:
array([[ 1.,  2.],
       [ 3.,  4.]])
In [4]:
np.array([x for x in range(5)])
Out[4]:
array([0, 1, 2, 3, 4])

Arrays don't have to contain numbers:

In [5]:
np.array([["A", "matrix"], ["of", "words."]])
Out[5]:
array([['A', 'matrix'],
       ['of', 'words.']], 
      dtype='<U6')

Making Arrays of Zeros

In [6]:
np.zeros(5)
Out[6]:
array([ 0.,  0.,  0.,  0.,  0.])

Making Arrays of Ones

In [7]:
np.ones([3,2])
Out[7]:
array([[ 1.,  1.],
       [ 1.,  1.],
       [ 1.,  1.]])
The related np.eye function builds an identity matrix:

In [8]:
np.eye(4)
Out[8]:
array([[ 1.,  0.,  0.,  0.],
       [ 0.,  1.,  0.,  0.],
       [ 0.,  0.,  1.,  0.],
       [ 0.,  0.,  0.,  1.]])

Making Arrays from Ranges

The np.arange(start, stop, step) function is like the built-in Python range function, but it returns an array.

In [9]:
np.arange(0, 10, 2)
Out[9]:
array([0, 2, 4, 6, 8])

You can make a range of other types as well:

In [10]:
np.arange(np.datetime64('2016-12-31'), np.datetime64('2017-02-01'))
Out[10]:
array(['2016-12-31', '2017-01-01', '2017-01-02', '2017-01-03',
       '2017-01-04', '2017-01-05', '2017-01-06', '2017-01-07',
       '2017-01-08', '2017-01-09', '2017-01-10', '2017-01-11',
       '2017-01-12', '2017-01-13', '2017-01-14', '2017-01-15',
       '2017-01-16', '2017-01-17', '2017-01-18', '2017-01-19',
       '2017-01-20', '2017-01-21', '2017-01-22', '2017-01-23',
       '2017-01-24', '2017-01-25', '2017-01-26', '2017-01-27',
       '2017-01-28', '2017-01-29', '2017-01-30', '2017-01-31'], dtype='datetime64[D]')

Interpolating numbers

The np.linspace(start, stop, num) function generates num evenly spaced numbers from start to stop (inclusive of both endpoints by default).

In [11]:
np.linspace(0, 5, 10)
Out[11]:
array([ 0.        ,  0.55555556,  1.11111111,  1.66666667,  2.22222222,
        2.77777778,  3.33333333,  3.88888889,  4.44444444,  5.        ])

Learn more about working with NumPy datetime64 objects (used in the np.arange example above).

A random array

You can also generate arrays of random numbers (we will cover this in greater detail later).

  • randn generates random numbers from a Normal(mean=0, var=1) distribution.
  • rand generates random numbers from a Uniform(low=0, high=1) distribution.
  • permutation generates a random permutation of a sequence of numbers.
In [12]:
np.random.randn(3,2)
Out[12]:
array([[ 0.3601399 ,  1.31206686],
       [-0.95112397,  0.62475726],
       [-1.24179768,  1.63392069]])
In [13]:
np.random.rand(3,2)
Out[13]:
array([[ 0.2703233 ,  0.26231929],
       [ 0.44311229,  0.06318871],
       [ 0.31311525,  0.40825175]])
In [14]:
np.random.permutation(range(0,10))
Out[14]:
array([4, 2, 5, 9, 1, 0, 8, 3, 6, 7])

Properties of Arrays

Shape

Arrays have a shape, which corresponds to the number of rows, columns, fibers, and so on.

In [15]:
A = np.array([[1., 2., 3.], [4., 5., 6.]])
print(A)
A.shape
[[ 1.  2.  3.]
 [ 4.  5.  6.]]
Out[15]:
(2, 3)

Type

Arrays have a dtype, which corresponds to the type of data they contain.

In [16]:
A.dtype
Out[16]:
dtype('float64')
In [17]:
np.arange(1,5).dtype
Out[17]:
dtype('int64')
In [18]:
(np.array([True, False])).dtype
Out[18]:
dtype('bool')
In [19]:
np.array(["Hello", "Worlddddd!"]).dtype
Out[19]:
dtype('<U10')

What does a dtype like <U6 (or the <U10 above) mean?

  • < little-endian byte order
  • U Unicode
  • 6 the length of the longest string in the array
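A quick sketch showing where that number comes from (it is determined by the longest string in the array):

print(np.array(["a", "bc"]).dtype)       # <U2
print(np.array(["a", "abcdef"]).dtype)   # <U6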

We can also change the type of an array:

In [20]:
np.array([1,2,3]).astype(float)
Out[20]:
array([ 1.,  2.,  3.])
In [21]:
np.array(["1","2","3"]).astype(int)
Out[21]:
array([1, 2, 3])

Learn more about numpy array types

Polymorphism

Can an array have more than one type?

In [22]:
np.array([1,2,3, "cat", True])
Out[22]:
array(['1', '2', '3', 'cat', 'True'], 
      dtype='<U21')

Does the following command work:

x = np.array([1,2,3,4])
x[3] = "cat"
In [23]:
x = np.array([1,2,3,4])
# x[3] = "cat" # <-- uncomment this line to find out

Jagged Arrays

Is the following valid?

A = np.array([[1, 2, 3], [4, 5], [6]])
In [24]:
A = np.array([[1, 2, 3], [4, 5], [6]])
A
Out[24]:
array([[1, 2, 3], [4, 5], [6]], dtype=object)

What happened?

In [25]:
A.shape
Out[25]:
(3,)
In [26]:
print(A.dtype)
object
In [27]:
print(A[0])
print(A[1])
print(A[2])
[1, 2, 3]
[4, 5]
[6]

Issues with Jagged Arrays

Jagged arrays can be problematic:

  1. Difficult to index (for example, extracting a column); see the sketch after this list.
    A[0,1] 
    > Error
    A[0][1] 
    > 2
    
  2. Not as efficiently represented in contiguous memory.
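A small sketch of point 1; note that on newer NumPy versions the jagged array must be created with dtype=object explicitly:

A = np.array([[1, 2, 3], [4, 5], [6]], dtype=object)
print(A[0][1])   # 2 -- A[0] is a plain Python list, so we index into it with [1]
# A[0, 1]        # IndexError: the object array only has one dimension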

Reshaping

Often you will need to reshape matrices. Suppose you have the following array:

In [28]:
np.arange(1,13)
Out[28]:
array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12])

What will the following produce:

np.arange(1,13).reshape(4,3)

Option A:

array([[ 1,  2,  3],
       [ 4,  5,  6],
       [ 7,  8,  9],
       [10, 11, 12]])

Option B:

array([[ 1,  5,  9],
       [ 2,  6, 10],
       [ 3,  7, 11],
       [ 4,  8, 12]])

Solution

In [29]:
A = np.arange(1,13).reshape(4,3)
A
Out[29]:
array([[ 1,  2,  3],
       [ 4,  5,  6],
       [ 7,  8,  9],
       [10, 11, 12]])

Flattening a Matrix

Flattening a matrix (higher dimensional array) produces a one dimensional array.

In [30]:
A.flatten()
Out[30]:
array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12])

Advanced: Array representation

Numpy stores data contiguously in memory

In [31]:
print(A.dtype)
A.data.tobytes()
int64
Out[31]:
b'\x01\x00\x00\x00\x00\x00\x00\x00\x02\x00\x00\x00\x00\x00\x00\x00\x03\x00\x00\x00\x00\x00\x00\x00\x04\x00\x00\x00\x00\x00\x00\x00\x05\x00\x00\x00\x00\x00\x00\x00\x06\x00\x00\x00\x00\x00\x00\x00\x07\x00\x00\x00\x00\x00\x00\x00\x08\x00\x00\x00\x00\x00\x00\x00\t\x00\x00\x00\x00\x00\x00\x00\n\x00\x00\x00\x00\x00\x00\x00\x0b\x00\x00\x00\x00\x00\x00\x00\x0c\x00\x00\x00\x00\x00\x00\x00'

Numpy stores matrices in row-major order (by rows)

In [32]:
print(np.arange(1,13).reshape(4,3, order='C'))
print()
print(np.arange(1,13).reshape(4,3, order='F'))
[[ 1  2  3]
 [ 4  5  6]
 [ 7  8  9]
 [10 11 12]]

[[ 1  5  9]
 [ 2  6 10]
 [ 3  7 11]
 [ 4  8 12]]

What does the 'F' mean?

Fortran ordering. BLAS libraries are specified for both the Fortran and C programming languages, which differ in how they lay out arrays in memory: column-major (Fortran) versus row-major (C).
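A small sketch for inspecting and controlling the memory layout:

C = np.arange(1, 13).reshape(4, 3)   # row-major (C order) by default
F = np.asfortranarray(C)             # same values, column-major (Fortran) layout
print(C.flags['C_CONTIGUOUS'])       # True
print(F.flags['F_CONTIGUOUS'])       # True
print(C.ravel(order='C'))            # read row by row:       [ 1  2  3  4 ... 12]
print(C.ravel(order='F'))            # read column by column: [ 1  4  7 10  2  5 ...]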

Slicing

From homework 1 you should already be pretty good at slicing, so let's test your slicing knowledge.

  • Program 1: x[:, 0]

    Answer: B

  • Program 2: x[0, :]

    Answer: A

  • Program 3: x[:2, 1:]

    Answer: H

  • Program 4: x[0::2, :]

    Answer: D

Understanding the slice syntax

begin:end:stride
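To check the answers above, here is a small sketch assuming x is the 4×3 matrix np.arange(1, 13).reshape(4, 3) (the same matrix used in the next cell):

x = np.arange(1, 13).reshape(4, 3)
print(x[:, 0])     # the first column:      [ 1  4  7 10]
print(x[0, :])     # the first row:         [1 2 3]
print(x[:2, 1:])   # the top-right 2x2 block
print(x[0::2, :])  # every other row, starting at row 0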

Modifying a Slice

Suppose I wanted to set all the entries in the top-right corner of my matrix to 0, as in (H) above.

In [33]:
H = np.arange(1,13).reshape(4,3)
print("Before:\n", H)
Before:
 [[ 1  2  3]
 [ 4  5  6]
 [ 7  8  9]
 [10 11 12]]
In [34]:
H[:2, 1:] = 0
print("After:\n", H)
After:
 [[ 1  0  0]
 [ 4  0  0]
 [ 7  8  9]
 [10 11 12]]

Boolean Indexing

We can apply boolean operations to arrays. This is essential when trying to select and modify individual elements.

Question: Given the following definition of A:

[[   1.    2.    3.]
 [   4.    5. -999.]
 [   7.    8.    9.]
 [  10. -999. -999.]]

what will the following output:

A > 3
  • Option A:
False
  • Option B:
array([[False, False, False],
       [ True,  True, False],
       [ True,  True,  True],
       [ True, False, False]], dtype=bool)
In [35]:
A = np.array([[  1.,   2.,   3.],
       [  4.,   5.,   -999.0],
       [  7.,   8.,   9.],
       [ 10.,  -999.0,  -999.0]])

A > 3.
Out[35]:
array([[False, False, False],
       [ True,  True, False],
       [ True,  True,  True],
       [ True, False, False]], dtype=bool)

Question: What will the following output

A = np.array([[   1.,    2.,    3.],
       [   4.,    5., -999.],
       [   7.,    8.,    9.],
       [  10., -999., -999.]])

A[A > 3]
  • Option A:
array([ 4,  7, 10,  5,  8, 11,  6,  9, 12])
  • Option B:
array([  4.,   5.,   7.,   8.,   9.,  10.])
  • Option C:
array([[  nan,   nan,  nan],
       [  4.,    5.,   nan],
       [  7.,    8.,   9.],
       [ 10.,    nan,  nan]])
In [36]:
A = np.array([[  1.,   2.,   3.],
       [  4.,   5.,   -999.0],
       [  7.,   8.,   9.],
       [ 10.,  -999.0,  -999.0]])

A[A > 3] 
Out[36]:
array([  4.,   5.,   7.,   8.,   9.,  10.])

Question: Replace the -999.0 entries with np.nan.

array([[   1.,    2.,    3.],
       [   4.,    5., -999.],
       [   7.,    8.,    9.],
       [  10., -999., -999.]])

Solution

In [37]:
A = np.array([[  1.,   2.,   3.],
       [  4.,   5.,   -999.0],
       [  7.,   8.,   9.],
       [ 10.,  -999.0,  -999.0]])
  • Construct a boolean array that indicates where the value is -999.0:
In [38]:
ind = (A == -999.0)
print(ind)
[[False False False]
 [False False  True]
 [False False False]
 [False  True  True]]
  • Assign np.nan to all the True entries:
In [39]:
A[ind] = np.nan
A
Out[39]:
array([[  1.,   2.,   3.],
       [  4.,   5.,  nan],
       [  7.,   8.,   9.],
       [ 10.,  nan,  nan]])

Question: What might -999.0 represent? Why might I want to replace the -999.0 with a np.nan?

Solution: -999.0 is most likely a placeholder for missing data. Replacing it with np.nan is safer in calculations, since a stray -999.0 would silently skew results. For example, when computing the mean of the transformed A we get:

In [40]:
print(A)
np.mean(A)
[[  1.   2.   3.]
 [  4.   5.  nan]
 [  7.   8.   9.]
 [ 10.  nan  nan]]
Out[40]:
nan

Perhaps instead we want:

In [41]:
np.nanmean(A)
Out[41]:
5.4444444444444446
In [42]:
help(np.nanmean)
Help on function nanmean in module numpy.lib.nanfunctions:

nanmean(a, axis=None, dtype=None, out=None, keepdims=<class 'numpy._globals._NoValue'>)
    Compute the arithmetic mean along the specified axis, ignoring NaNs.
    
    Returns the average of the array elements.  The average is taken over
    the flattened array by default, otherwise over the specified axis.
    `float64` intermediate and return values are used for integer inputs.
    
    For all-NaN slices, NaN is returned and a `RuntimeWarning` is raised.
    
    .. versionadded:: 1.8.0
    
    Parameters
    ----------
    a : array_like
        Array containing numbers whose mean is desired. If `a` is not an
        array, a conversion is attempted.
    axis : int, optional
        Axis along which the means are computed. The default is to compute
        the mean of the flattened array.
    dtype : data-type, optional
        Type to use in computing the mean.  For integer inputs, the default
        is `float64`; for inexact inputs, it is the same as the input
        dtype.
    out : ndarray, optional
        Alternate output array in which to place the result.  The default
        is ``None``; if provided, it must have the same shape as the
        expected output, but the type will be cast if necessary.  See
        `doc.ufuncs` for details.
    keepdims : bool, optional
        If this is set to True, the axes which are reduced are left
        in the result as dimensions with size one. With this option,
        the result will broadcast correctly against the original `a`.
    
        If the value is anything but the default, then
        `keepdims` will be passed through to the `mean` or `sum` methods
        of sub-classes of `ndarray`.  If the sub-classes methods
        does not implement `keepdims` any exceptions will be raised.
    
    Returns
    -------
    m : ndarray, see dtype parameter above
        If `out=None`, returns a new array containing the mean values,
        otherwise a reference to the output array is returned. Nan is
        returned for slices that contain only NaNs.
    
    See Also
    --------
    average : Weighted average
    mean : Arithmetic mean taken while not ignoring NaNs
    var, nanvar
    
    Notes
    -----
    The arithmetic mean is the sum of the non-NaN elements along the axis
    divided by the number of non-NaN elements.
    
    Note that for floating-point input, the mean is computed using the same
    precision the input has.  Depending on the input data, this can cause
    the results to be inaccurate, especially for `float32`.  Specifying a
    higher-precision accumulator using the `dtype` keyword can alleviate
    this issue.
    
    Examples
    --------
    >>> a = np.array([[1, np.nan], [3, 4]])
    >>> np.nanmean(a)
    2.6666666666666665
    >>> np.nanmean(a, axis=0)
    array([ 2.,  4.])
    >>> np.nanmean(a, axis=1)
    array([ 1.,  3.5])

More Complex Bit Logic

Often we will want to work with multiple different arrays at once and select subsets of entries from each array. Consider the following example:

In [43]:
names = np.array(["Joey", "Henry", "Joseph", 
                  "Jim", "Sam", "Deb", "Mike", 
                  "Bin", "Joe", "Andrew", "Bob"])

favorite_number = np.arange(len(names)) 

Suppose a subset of these people are staff members:

In [44]:
staff = ["Joey",  "Deb", "Sam"]

Question:

How could we compute the sum of the staff members' favorite numbers?

One solution is to use for loops:

In [45]:
total = 0 
for i in range(len(names)):
    if names[i] in staff:
        total += favorite_number[i]

print("total:", total)
total: 9

Another solution would be to use the np.in1d function (np.isin in newer NumPy versions) to determine which people are staff.

In [46]:
is_staff = np.in1d(names, staff)
is_staff
Out[46]:
array([ True, False, False, False,  True,  True, False, False, False,
       False, False], dtype=bool)

Boolean indexing

In [47]:
favorite_number[is_staff].sum()
Out[47]:
9

Question:

What does the following expression compute:

starts_with_j = np.char.startswith(names, "J")
starts_with_j[is_staff].mean()

The fraction of staff members whose names begin with J.

In [48]:
starts_with_j = np.char.startswith(names, "J")
starts_with_j[is_staff].mean()
Out[48]:
0.33333333333333331

Question:

What does it mean to take the mean of an array of booleans?

The values True and False correspond to the integers 1 and 0 and are treated as such in mathematical expressions (e.g., mean(), sum(), as well as linear algebraic operations).
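For example:

bools = np.array([True, False, True, True])
print(bools.sum())    # 3    -- the number of True entries
print(bools.mean())   # 0.75 -- the fraction of True entries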

Question

What does the following expression compute:

favorite_number[starts_with_j & is_staff].sum()

The sum of the favorite numbers of staff members whose names start with J.

In [49]:
favorite_number[starts_with_j & is_staff].sum()
Out[49]:
0

Other Useful Operations

A Note on using Array operations

In [50]:
data = np.random.rand(1000000)

Consider the following two programs.

Program A

s = 0
c = 0
for x in data:
    if x > 0.5:
        s += x
        c += 1
result = s/c

Program B

result = data[data > 0.5].mean()
  1. What do they do?
  2. Which one is faster?
  3. Which one is clearer?






Solution

In [51]:
%%timeit
s = 0
c = 0
for x in data:
    if x > 0.5:
        s += x
        c += 1
result = s / c
10 loops, best of 3: 107 ms per loop
In [52]:
%%timeit
result = data[data > 0.5].mean()
100 loops, best of 3: 6.91 ms per loop

Important Points

Using the array abstractions instead of looping can often be:

  1. Clearer
  2. Faster

These are fundamental goals of abstraction.








Mathematical operations:

Numpy arrays support standard mathematical operations

In [53]:
A = np.arange(1., 13.).reshape(4,3)
print(A)
[[  1.   2.   3.]
 [  4.   5.   6.]
 [  7.   8.   9.]
 [ 10.  11.  12.]]
In [54]:
A * 0.5 + 3
Out[54]:
array([[ 3.5,  4. ,  4.5],
       [ 5. ,  5.5,  6. ],
       [ 6.5,  7. ,  7.5],
       [ 8. ,  8.5,  9. ]])

Notice that operations are element-wise.
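Other arithmetic operators and NumPy math functions are also applied element-wise, for example:

print(A ** 2)       # square every entry
print(np.sqrt(A))   # square root of every entry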

In [55]:
A.T
Out[55]:
array([[  1.,   4.,   7.,  10.],
       [  2.,   5.,   8.,  11.],
       [  3.,   6.,   9.,  12.]])
In [56]:
A.sum()
Out[56]:
78.0
In [57]:
A.mean()
Out[57]:
6.5




Be Careful with Floating Point Numbers

What is the value of the following: $$ A - \exp \left( \log \left( A \right) \right) $$




Solution:

In [58]:
A = np.arange(1., 13.).reshape(4,3)
print(A)

(A - np.exp(np.log(A)))
[[  1.   2.   3.]
 [  4.   5.   6.]
 [  7.   8.   9.]
 [ 10.  11.  12.]]
Out[58]:
array([[  0.00000000e+00,   0.00000000e+00,  -4.44089210e-16],
       [  0.00000000e+00,   8.88178420e-16,   0.00000000e+00],
       [  8.88178420e-16,   1.77635684e-15,  -1.77635684e-15],
       [ -1.77635684e-15,  -1.77635684e-15,   0.00000000e+00]])

What happened?!

Floating point precision is not perfect and we are applying transcendental functions.
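To test for equality up to floating point error, compare with a tolerance instead of exactly (np.isclose and np.allclose, discussed below):

A = np.arange(1., 13.).reshape(4, 3)
print(np.allclose(A, np.exp(np.log(A))))   # True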





A simpler example

What is the value of the following expression:

0.1 + 0.2 == 0.3
In [59]:
0.1 + 0.2 == 0.3
Out[59]:
False
In [60]:
print(0.1 + 0.2)
0.30000000000000004

For these situations consider using np.isclose:

In [61]:
help(np.isclose)
Help on function isclose in module numpy.core.numeric:

isclose(a, b, rtol=1e-05, atol=1e-08, equal_nan=False)
    Returns a boolean array where two arrays are element-wise equal within a
    tolerance.
    
    The tolerance values are positive, typically very small numbers.  The
    relative difference (`rtol` * abs(`b`)) and the absolute difference
    `atol` are added together to compare against the absolute difference
    between `a` and `b`.
    
    Parameters
    ----------
    a, b : array_like
        Input arrays to compare.
    rtol : float
        The relative tolerance parameter (see Notes).
    atol : float
        The absolute tolerance parameter (see Notes).
    equal_nan : bool
        Whether to compare NaN's as equal.  If True, NaN's in `a` will be
        considered equal to NaN's in `b` in the output array.
    
    Returns
    -------
    y : array_like
        Returns a boolean array of where `a` and `b` are equal within the
        given tolerance. If both `a` and `b` are scalars, returns a single
        boolean value.
    
    See Also
    --------
    allclose
    
    Notes
    -----
    .. versionadded:: 1.7.0
    
    For finite values, isclose uses the following equation to test whether
    two floating point values are equivalent.
    
     absolute(`a` - `b`) <= (`atol` + `rtol` * absolute(`b`))
    
    The above equation is not symmetric in `a` and `b`, so that
    `isclose(a, b)` might be different from `isclose(b, a)` in
    some rare cases.
    
    Examples
    --------
    >>> np.isclose([1e10,1e-7], [1.00001e10,1e-8])
    array([True, False])
    >>> np.isclose([1e10,1e-8], [1.00001e10,1e-9])
    array([True, True])
    >>> np.isclose([1e10,1e-8], [1.0001e10,1e-9])
    array([False, True])
    >>> np.isclose([1.0, np.nan], [1.0, np.nan])
    array([True, False])
    >>> np.isclose([1.0, np.nan], [1.0, np.nan], equal_nan=True)
    array([True, True])
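For example, a quick check of the case above:

print(np.isclose(0.1 + 0.2, 0.3))     # True
print(np.isclose(1.0, 1.0 + 1e-10))   # True -- within the default tolerances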





Aggregating along an axis

Summing across the rows (axis=0), producing one total per column:

In [62]:
A.sum(axis=0)
Out[62]:
array([ 22.,  26.,  30.])

This is the same as:

(nrows, ncols) = A.shape

s = np.zeros(ncols)

for i in range(nrows):
    s += A[i,:]

print(s)

Summing across the columns (axis=1), producing one total per row:

In [63]:
A.sum(axis=1)
Out[63]:
array([  6.,  15.,  24.,  33.])

This is the same as:

(nrows, ncols) = A.shape

s = np.zeros(nrows)

for i in range(ncols):
    s += A[:,i]

print(s)

Other Functions to Check Out

  • mean computes the mean
  • std standard deviation
  • var variance

and many more
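These aggregation functions also accept an axis argument; a quick sketch:

A = np.arange(1., 13.).reshape(4, 3)
print(A.mean(axis=0))   # column means: [ 5.5  6.5  7.5]
print(A.std())          # standard deviation over all 12 entries
print(A.var())          # variance over all 12 entries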








Linear Algebra

Suppose we wanted to use linear algebra to compute the sum along axis 1


$$ \texttt{A.sum( axis = 1) } = \sum_j A_{i,j} = \begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6\end{bmatrix} \cdot \begin{bmatrix} 1 \\ 1 \\ 1\end{bmatrix} $$

Will the following work?

b = np.ones(3)
A * b






Solution

In [64]:
A = np.array([[1, 2, 3], [4, 5, 6]])
b = np.ones(3)
A * b
Out[64]:
array([[ 1.,  2.,  3.],
       [ 4.,  5.,  6.]])





Explanation:

We ended up computing an element-wise product. NumPy broadcast the vector of ones: it was replicated once for each row and multiplied element-wise with that row.
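A quick sketch of this broadcasting rule at work: multiplying by a length-3 vector scales each column by the corresponding entry.

A = np.array([[1, 2, 3], [4, 5, 6]])
print(A * np.array([10, 100, 1000]))
# [[  10  200 3000]
#  [  40  500 6000]]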


The correct expression for matrix multiplication

In [65]:
A.dot(b)
Out[65]:
array([  6.,  15.])

In Python 3.5 and later you can use the infix matrix-multiplication operator @, which is often easier to read:

In [66]:
A @ b
Out[66]:
array([  6.,  15.])







Solving Linear Systems

Suppose you are asked to solve the following system of linear equations:

$$ 5x - 3y = 2 \\ -9x + 2y = -7 $$

This means we want to solve the following linear system:

$$ \begin{bmatrix} 5 & -3 \\ -9 & 2 \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix} = \begin{bmatrix} 2 \\ -7 \end{bmatrix} $$

Solving for $x$ and $y$ we get:

$$ \begin{bmatrix} x \\ y \end{bmatrix} = \begin{bmatrix} 5 & -3 \\ -9 & 2 \end{bmatrix}^{-1} \begin{bmatrix} 2 \\ -7 \end{bmatrix} $$

This can be solved numerically using NumPy:

In [67]:
A = np.array([[5, -3], [-9, 2]])
b = np.array([2,-7])
In [68]:
from numpy.linalg import inv
inv(A) @ b
Out[68]:
array([ 1.,  1.])

Preferred way to solve (more numerically stable)

In [69]:
from numpy.linalg import solve
solve(A, b)
Out[69]:
array([ 1.,  1.])

Two reasons to prefer solve over explicitly computing the inverse:

  1. Performance
  2. Numerical stability

When the matrix is not full rank, it may be necessary to use lstsq, as sketched below.
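A minimal sketch of lstsq on the same (full-rank) system, where it agrees with solve:

from numpy.linalg import lstsq

A = np.array([[5, -3], [-9, 2]])
b = np.array([2, -7])
x, residuals, rank, singular_values = lstsq(A, b, rcond=None)
print(x)   # [ 1.  1.]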