# Standard imports
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set_context("talk")
This is a brief overview of numpy for DS100. The Jupyter Notebook can be obtained here: Numpy_Review.ipynb.
It is customary to import NumPy as np:
import numpy as np
As you learned in homework one, np.array is the key data structure in NumPy for dense arrays of data.
You can build arrays from python lists.
Data 8 Compatibility: In Data 8 you used a datascience package function called make_array, which wraps the more standard np.array function we will use in this class.
np.array([[1.,2.], [3.,4.]])
np.array([x for x in range(5)])
Arrays don't have to contain numbers:
np.array([["A", "matrix"], ["of", "words."]])
np.zeros(5)
np.ones([3,2])
np.eye(4)
The np.arange(start, stop, step) function is like the Python range function.
np.arange(0, 10, 2)
You can make a range of other types as well:
np.arange(np.datetime64('2016-12-31'), np.datetime64('2017-02-01'))
The np.linspace(start, end, num) function generates num numbers evenly spaced between start and end, with both endpoints included by default.
np.linspace(0, 5, 10)
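One difference worth keeping in mind is how the two functions treat the upper bound. The following is a small sketch, using illustrative values that are not part of the notebook, showing that np.arange excludes its stop value while np.linspace includes its end value unless endpoint=False is passed:

```python
import numpy as np

# arange excludes the stop value, like Python's range
evens = np.arange(0, 10, 2)
print(evens)

# linspace includes the end value by default
grid = np.linspace(0, 5, 11)
print(grid[0], grid[-1], len(grid))

# endpoint=False drops the end value, giving a half-open grid
half_open = np.linspace(0, 5, 10, endpoint=False)
print(half_open)
```

When you care about the spacing, np.arange is usually the natural choice; when you care about the number of points, np.linspace tends to be easier to reason about.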
Learn more about working with datetime objects.
You can also generate arrays of random numbers (we will cover this in greater detail later).
randn generates random numbers from a Normal(mean=0, var=1) distribution.
rand generates random numbers from a Uniform(low=0, high=1) distribution.
permutation generates a random permutation of a sequence of numbers.
np.random.randn(3,2)
np.random.rand(3,2)
np.random.permutation(range(0,10))
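The calls above draw from NumPy's global random state, so their output changes on every run. As a hedged alternative, newer NumPy versions (1.17 and later) offer a generator object that can be seeded for reproducibility. The seed value below is an arbitrary choice made for illustration:

```python
import numpy as np

# Hypothetical seed chosen for illustration; any integer works
rng = np.random.default_rng(42)

normal_draws = rng.standard_normal((3, 2))   # Normal(mean=0, var=1)
uniform_draws = rng.random((3, 2))           # Uniform(low=0, high=1)
shuffled = rng.permutation(10)               # random ordering of 0..9

print(normal_draws.shape, uniform_draws.shape)
print(shuffled)
```

Re-creating the generator with the same seed reproduces the same draws, which is handy when debugging.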
Arrays have a shape which corresponds to the number of rows, columns, fibers, ...
A = np.array([[1., 2., 3.], [4., 5., 6.]])
print(A)
A.shape
Arrays have a type which corresponds to the type of data they contain
A.dtype
np.arange(1,5).dtype
(np.array([True, False])).dtype
np.array(["Hello", "Worlddddd!"]).dtype
What does <U10 mean?
< : little endian
U : Unicode
10 : length of the longest string
np.array([1,2,3]).astype(float)
np.array(["1","2","3"]).astype(int)
Learn more about numpy array types
Can an array have more than one type?
np.array([1,2,3, "cat", True])
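To see what NumPy does with mixed input, it helps to inspect the resulting dtype. The sketch below (the variable names are illustrative) suggests that an array holds a single type, so NumPy converts every entry to one common type that can represent them all:

```python
import numpy as np

mixed = np.array([1, 2, 3, "cat", True])
print(mixed)          # every entry has become a string
print(mixed.dtype)    # a Unicode string dtype; the exact width can vary by platform

ints_and_floats = np.array([1, 2.5])
print(ints_and_floats.dtype)   # integers are promoted to floats
```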
Does the following command work:
x = np.array([1,2,3,4])
x[3] = "cat"
x = np.array([1,2,3,4])
# x[3] = "cat" # <-- uncomment this line to find out
Is the following valid?
A = np.array([[1, 2, 3], [4, 5], [6]])
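# Recent NumPy versions raise a ValueError for ragged input unless dtype=object is given
A = np.array([[1, 2, 3], [4, 5], [6]], dtype=object)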
A
What happened?
A.shape
print(A.dtype)
print(A[0])
print(A[1])
print(A[2])
A[0,1]
> Error
A[0][1]
> 2
Often you will need to reshape matrices. Suppose you have the following array:
np.arange(1,13)
What will the following produce:
np.arange(1,13).reshape(4,3)
Option A:
array([[ 1, 2, 3],
[ 4, 5, 6],
[ 7, 8, 9],
[10, 11, 12]])
Option B:
array([[ 1, 5, 9],
[ 2, 6, 10],
[ 3, 7, 11],
[ 4, 8, 12]])
Solution
A = np.arange(1,13).reshape(4,3)
A
Flattening a matrix (higher dimensional array) produces a one dimensional array.
A.flatten()
Numpy stores data contiguously in memory
print(A.dtype)
A.data.tobytes()
Numpy stores matrices in row-major order (by rows)
print(np.arange(1,13).reshape(4,3, order='C'))
print()
print(np.arange(1,13).reshape(4,3, order='F'))
What does the 'F' mean?
Fortran ordering. BLAS libraries are specified for both the Fortran and C programming languages, which differ in how they lay out matrices in memory: by columns (Fortran) or by rows (C).
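The memory layout can be inspected directly. The following is a minimal sketch, assuming the default C ordering, that looks at the contiguity flag and the strides (the number of bytes to step to reach the next row or the next column) and then reads the same matrix out column by column:

```python
import numpy as np

A = np.arange(1, 13).reshape(4, 3)      # row-major (C order) by default
print(A.flags['C_CONTIGUOUS'])

# Moving one row ahead skips 3 items; moving one column ahead skips 1 item
print(A.strides == (3 * A.itemsize, A.itemsize))

# Reading the matrix out in Fortran (column-major) order
by_columns = A.ravel(order='F')
print(by_columns)
```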
From homework 1 you should already be pretty good at slicing, so let's test your slicing knowledge.
x[:, 0]
B
x[0, :]
A
x[:2, 1:]
H
x[0::2, :]
D
Understanding the slice syntax
begin:end:stride
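Here is a short sketch of how the three parts of a slice behave, using a small illustrative array that is not part of the quiz above. Omitted parts fall back to defaults, and end is always excluded:

```python
import numpy as np

x = np.arange(10)        # the integers 0 through 9

print(x[2:8:2])          # begin=2, end=8 (excluded), stride=2
print(x[:3])             # omitted begin defaults to the start
print(x[7:])             # omitted end runs through the last entry
print(x[::-1])           # a negative stride walks backwards

M = np.arange(1, 13).reshape(4, 3)
print(M[0::2, :])        # every other row (rows 0 and 2), all columns
```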
Suppose I wanted to make all entries in my matrix 0 in the top right corner as in (H) above.
H = np.arange(1,13).reshape(4,3)
print("Before:\n", H)
H[:2, 1:] = 0
print("After:\n", H)
We can apply boolean operations to arrays. This is essential when trying to select and modify individual elements.
Question: Given the following definition of A:
[[ 1. 2. 3.]
[ 4. 5. -999.]
[ 7. 8. 9.]
[ 10. -999. -999.]]
what will the following output:
A > 3
False
array([[False, False, False],
[ True, True, False],
[ True, True, True],
[ True, False, False]], dtype=bool)
A = np.array([[ 1., 2., 3.],
[ 4., 5., -999.0],
[ 7., 8., 9.],
[ 10., -999.0, -999.0]])
A > 3.
Question: What will the following output
A = np.array([[ 1., 2., 3.],
[ 4., 5., -999.],
[ 7., 8., 9.],
[ 10., -999., -999.]])
A[A > 3]
array([ 4, 7, 10, 5, 8, 11, 6, 9, 12])
array([ 4., 5., 7., 8., 9., 10.])
array([[ nan, nan, nan],
[ 4., 5., nan],
[ 7., 8., 9.],
[ 10., nan, nan]])
A = np.array([[ 1., 2., 3.],
[ 4., 5., -999.0],
[ 7., 8., 9.],
[ 10., -999.0, -999.0]])
A[A > 3]
Question: Replace the -999.0 entries with np.nan.
array([[ 1., 2., 3.],
[ 4., 5., -999.],
[ 7., 8., 9.],
[ 10., -999., -999.]])
Solution
A = np.array([[ 1., 2., 3.],
[ 4., 5., -999.0],
[ 7., 8., 9.],
[ 10., -999.0, -999.0]])
ind = (A == -999.0)
print(ind)
Assign np.nan to all the True entries:
A[ind] = np.nan
A
Question: What might -999.0 represent? Why might I want to replace the -999.0 with np.nan?
Solution: It could be safer in calculations. For example when computing the mean of the transformed A we get:
print(A)
np.mean(A)
Perhaps instead we want:
np.nanmean(A)
help(np.nanmean)
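The difference between the two functions can be checked on the same matrix. This is a minimal sketch (the variable names are illustrative) showing that np.mean propagates missing values, while np.nanmean averages only the entries that are present:

```python
import numpy as np

A = np.array([[1., 2., 3.],
              [4., 5., -999.],
              [7., 8., 9.],
              [10., -999., -999.]])
A[A == -999.0] = np.nan

plain_mean = np.mean(A)        # nan: a single missing value poisons the result
nan_aware = np.nanmean(A)      # averages the 9 remaining entries
n_missing = np.isnan(A).sum()  # nan != nan, so use np.isnan rather than ==
print(plain_mean, nan_aware, n_missing)
```

Had the -999.0 placeholders been left in place, np.mean would have silently returned a large negative number instead of flagging the problem.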
Often we will want to work with multiple different arrays at once and select subsets of entries from each array. Consider the following example:
names = np.array(["Joey", "Henry", "Joseph",
"Jim", "Sam", "Deb", "Mike",
"Bin", "Joe", "Andrew", "Bob"])
favorite_number = np.arange(len(names))
Suppose a subset of these people are staff members:
staff = ["Joey", "Deb", "Sam"]
How could we compute the sum of the staff members' favorite numbers?
One solution is to use for loops:
total = 0
for i in range(len(names)):
    if names[i] in staff:
        total += favorite_number[i]
print("total:", total)
Another solution would be to use the np.isin function (the replacement for the older np.in1d, which is deprecated in recent NumPy versions) to determine which people are staff.
is_staff = np.isin(names, staff)
is_staff
Boolean indexing
favorite_number[is_staff].sum()
What does the following expression compute:
starts_with_j = np.char.startswith(names, "J")
starts_with_j[is_staff].mean()
The fraction of the staff whose names begin with J?
starts_with_j = np.char.startswith(names, "J")
starts_with_j[is_staff].mean()
What does it mean to take the mean of an array of booleans?
The values True and False correspond to the integers 1 and 0 and are treated as such in mathematical expressions (e.g., mean(), sum(), as well as linear algebraic operations).
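A quick way to convince yourself is to try it on a tiny mask. The array below is a made-up example, not data from this notebook:

```python
import numpy as np

mask = np.array([True, False, True, True])

count_true = mask.sum()       # True counts as 1, so this is the number of True entries
frac_true = mask.mean()       # the fraction of entries that are True
print(count_true, frac_true)
print(np.count_nonzero(mask)) # an equivalent way to count the True entries
```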
What does the following expression compute:
favorite_number[starts_with_j & is_staff].sum()
The sum of the favorite numbers of staff starting with J
favorite_number[starts_with_j & is_staff].sum()
data = np.random.rand(1000000)
%%timeit
s = 0
count = 0
for x in data:
    if x > 0.5:
        s += x
        count += 1
result = s / count  # mean of the entries above 0.5, matching the vectorized version
%%timeit
result = data[data > 0.5].mean()
Using the array abstractions instead of looping can often be both clearer and faster, as the timing comparison above suggests.
These are fundamental goals of abstraction.
Numpy arrays support standard mathematical operations
A = np.arange(1., 13.).reshape(4,3)
print(A)
A * 0.5 + 3
Notice that operations are element-wise.
A.T
A.sum()
A.mean()
What is the value of the following: $$ A - \exp \left( \log \left( A \right) \right) $$
Solution:
A = np.arange(1., 13.).reshape(4,3)
print(A)
(A - np.exp(np.log(A)))
What happened?!
Floating point precision is not perfect and we are applying transcendental functions.
0.1 + 0.2 == 0.3
print(0.1 + 0.2)
For these situations consider using np.isclose:
help(np.isclose)
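Applied to the examples above, a tolerance-based comparison might look like the following sketch. np.allclose is the companion function that reduces the element-wise check to a single boolean:

```python
import numpy as np

print(0.1 + 0.2 == 0.3)              # exact comparison fails
print(np.isclose(0.1 + 0.2, 0.3))    # comparison within a small tolerance

A = np.arange(1., 13.).reshape(4, 3)
round_trip = np.exp(np.log(A))
print(np.allclose(A, round_trip))    # every entry agrees within tolerance
```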
A.sum(axis=0)
This is the same as:
(nrows, ncols) = A.shape
s = np.zeros(ncols)
for i in range(nrows):
    s += A[i,:]
print(s)
A.sum(axis=1)
This is the same as:
(nrows, ncols) = A.shape
s = np.zeros(nrows)
for i in range(ncols):
    s += A[:,i]
print(s)
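One way to remember the axis argument is that it names the dimension that gets collapsed. The sketch below, using the same 4 by 3 matrix of the numbers 1 through 12, checks both directions and also shows the keepdims option, which keeps the collapsed axis with length 1:

```python
import numpy as np

A = np.arange(1., 13.).reshape(4, 3)

col_sums = A.sum(axis=0)   # collapses the rows: one value per column
row_sums = A.sum(axis=1)   # collapses the columns: one value per row
print(col_sums)
print(row_sums)

row_sums_2d = A.sum(axis=1, keepdims=True)  # shape (4, 1) instead of (4,)
print(row_sums_2d.shape)
```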
A = np.array([[1, 2, 3], [4, 5, 6]])
b = np.ones(3)
A * b
Explanation:
We ended up computing an element-wise product. The vector of ones was replicated once for each row and then used to scale the entire row.
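With a vector of ones the replication is hard to see, so here is a hedged sketch with made-up scale factors. A vector of shape (3,) is matched against the columns, while reshaping a vector to shape (2, 1) matches it against the rows instead:

```python
import numpy as np

A = np.array([[1, 2, 3], [4, 5, 6]])

col_scale = np.array([1, 10, 100])            # shape (3,): one factor per column
scaled_cols = A * col_scale
print(scaled_cols)

row_scale = np.array([1, 10])[:, np.newaxis]  # shape (2, 1): one factor per row
scaled_rows = A * row_scale
print(scaled_rows)
```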
A.dot(b)
In Python 3.5 and later you can use the infix operator @, which is probably easier to read:
A @ b
Suppose you are asked to solve the following system of linear equations:
$$ \begin{aligned} 5x - 3y &= 2 \\ -9x + 2y &= -7 \end{aligned} $$
This means that we want to solve the following linear system:
$$ \begin{bmatrix} 5 & -3 \\ -9 & 2 \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix} = \begin{bmatrix} 2 \\ -7 \end{bmatrix} $$
Solving for $x$ and $y$ we get:
$$ \begin{bmatrix} x \\ y \end{bmatrix} = \begin{bmatrix} 5 & -3 \\ -9 & 2 \end{bmatrix}^{-1} \begin{bmatrix} 2 \\ -7 \end{bmatrix} $$
This can be solved numerically using NumPy:
A = np.array([[5, -3], [-9, 2]])
b = np.array([2,-7])
from numpy.linalg import inv
inv(A) @ b
Preferred way to solve (more numerically stable)
from numpy.linalg import solve
solve(A, b)
Two points:
When the matrix is not full rank it may be necessary to use lstsq.
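As a rough sketch of the rank-deficient case, consider a made-up system (not the one above) whose second equation is just twice the first. np.linalg.lstsq still returns an answer: the least-squares solution with the smallest norm, along with the rank it detected:

```python
import numpy as np

# Hypothetical rank-deficient system: the second row is twice the first
A = np.array([[1., 2.], [2., 4.]])
b = np.array([3., 6.])

# rcond=None selects the machine-precision-based cutoff for small singular values
x, residuals, rank, singular_values = np.linalg.lstsq(A, b, rcond=None)
print(rank)                     # fewer than 2, so A is not full rank
print(x)                        # minimum-norm least-squares solution
print(np.allclose(A @ x, b))    # it still satisfies the equations here
```

Many other vectors solve this system as well; lstsq simply picks the shortest one.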