Demystifying pandas and numpy filtering

In the course of analyzing data, one will inevitably want to remove items from a collection, leaving behind only the items which satisfy a condition. In vanilla python, there are two equivalent ways to spell such an operation.

In [1]:
# Functional, but utterly unpythonic

list(filter(lambda n: n % 2 == 0, range(10)))
Out[1]:
[0, 2, 4, 6, 8]
In [2]:
# Syntactic sugar makes for quite readable code

[n for n in range(10) if n % 2 == 0]
Out[2]:
[0, 2, 4, 6, 8]

Textbook Pandas Example

It might be surprising for people coming from a conventional programming background that filtering in pandas (and in numpy, the numerical magic powering many pandas operations) works quite differently.

Take this simplified and tweaked example from the excellent Data Science Handbook by Jake VanderPlas. We have the area and population for the five biggest US states and we want to filter to only those with over 100 people per square km or with an area of at least 500,000 square km.

In [3]:
import pandas as pd

area = [423967, 695662, 141297, 170312, 149995]
population = [38332521, 26448193, 19651127, 19552860, 12882135]
index = ['California', 'Texas', 'New York', 'Florida', 'Illinois']
df = pd.DataFrame({'area': area, 'population': population}, index=index)

df[(df.population / df.area > 100) | (df.area > 500000)]
Out[3]:
area population
Texas 695662 26448193
New York 141297 19651127
Florida 170312 19552860

While this syntax is easy enough to memorize, it's worth understanding why it works like this.


Basic numpy indexing

Let's go back to basics. Let's create an array with the first four positive integers.

In [4]:
import numpy as np

a = np.array([1, 2, 3, 4])

a
Out[4]:
array([1, 2, 3, 4])

We can, of course, index this all the usual ways we're used to with python lists.

In [5]:
a[0]
Out[5]:
1
In [6]:
a[1:3]
Out[6]:
array([2, 3])
In [7]:
a[-2:]
Out[7]:
array([3, 4])

With numpy arrays, we have a few additional ways to select items by index.

One very common operation is to index an array by an array of indexes; what results is the values from the first array at the indexes specified in the second. This time, let's use a utility function to create the first array with the natural numbers between 10 and 20, endpoint excluded.

In [8]:
a = np.arange(10, 20)

a
Out[8]:
array([10, 11, 12, 13, 14, 15, 16, 17, 18, 19])

When we index it with [0, 3, 4], we get back an array with the zeroth, third and fourth values.

In [9]:
indexes = np.array([0, 3, 4])

a[indexes]
Out[9]:
array([10, 13, 14])

Simple lists work as the indexing array as well. Indexing an array with a list literal results in clean-looking code.

In [10]:
a[[0, 3, 4]]
Out[10]:
array([10, 13, 14])

This is simple enough. An extension of this syntax exists for indexing matrices, but that's outside the scope of this little blog post.


Boolean indexing

Let's turn our attention back toward a goal. How do we filter a numpy array (or a Series or a DataFrame)? Well, numpy supports another indexing syntax. We can create an array of the same shape but with a dtype of bool, where each entry is True or False. When we index a by this array, we get back only the items which correspond to a True in the array of booleans. Below we set the first five values to True and make the last five False.

In [11]:
a[[
    True, True, True, True, True,
    False, False, False, False, False
]]
Out[11]:
array([10, 11, 12, 13, 14])

Equipped with this syntax, we can ask a new question: how best can we create arrays which correspond to an input array such that there's a True for all the values we we can't to keep, and a False for those we don't.

One easy way comes to mind: use a list comprehension which maps each value in the array to True or False according to our predicate. This works and it gets us something like an answer to our question.

In [12]:
a[[n < 15 for n in a]]
Out[12]:
array([10, 11, 12, 13, 14])

If this feels weird to you though, that makes sense: we've reimplemented the conditional clause in a strange way. The following is equivalent to the former, but more pythonic. For the toy examples we've been using, it's might be the right solution.

In [13]:
np.array([n for n in a if n < 15])
Out[13]:
array([10, 11, 12, 13, 14])

Element-wise operations in numpy

This is where numpy's element-wise operations become important. When you use a numpy array with a binary operator (>, <, ==, !=, +, -, *, /, **, ...), the resulting value is an array in which each value is the result of performing that operation on corresponding items. If one of those operands is a scalar, the operation is applied to that scalar and every item in the array. If they're both arrays and they have the same shape, the corresponding items are those at the same index. If they don't have the same shape, well-defined rules dictate how items are made to correspond. Those rules are called broadcasting.

This knowledge gives us enough information to understand idiomatic numpy filtering. Let's compare our array of values from 10 to 20 with the number of 15.

In [14]:
a < 15
Out[14]:
array([ True,  True,  True,  True,  True, False, False, False, False, False], dtype=bool)

The boolean array here is the same one we defined in various ways above and so we can use it as we did above.

In [15]:
a[a < 15]
Out[15]:
array([10, 11, 12, 13, 14])

A slightly more complicated example

For a more sophisticated example, let's create a matrix with $i$ rows by $j$ columns, where the items are $j^i$— that is, the item in the $i$th row and the $j$th column is the index of the column taken to the power of the index of the row.

In [16]:
bases = np.arange(1, 6)

bases
Out[16]:
array([1, 2, 3, 4, 5])
In [17]:
bases_matrix = np.repeat(bases, 5).reshape(5, 5)

bases_matrix
Out[17]:
array([[1, 1, 1, 1, 1],
       [2, 2, 2, 2, 2],
       [3, 3, 3, 3, 3],
       [4, 4, 4, 4, 4],
       [5, 5, 5, 5, 5]])
In [18]:
powers = bases_matrix**np.arange(1, 6)

powers
Out[18]:
array([[   1,    1,    1,    1,    1],
       [   2,    4,    8,   16,   32],
       [   3,    9,   27,   81,  243],
       [   4,   16,   64,  256, 1024],
       [   5,   25,  125,  625, 3125]])

Say we want to determine values of j**i where j and i are each between 1 and 5, endpoint inclusive, and the resulting value is greater than 10. This is a more difficult proposition than our smaller examples above, but it is easily accomplished using the array we just constructed and simple numpy indexing.

In [19]:
powers[powers > 10]
Out[19]:
array([  16,   32,   27,   81,  243,   16,   64,  256, 1024,   25,  125,
        625, 3125])

Noww want to select those powers which end in 4. We can do accomplish this by using element-wise modulo arithmetic and comparison.

In [20]:
powers % 10
Out[20]:
array([[1, 1, 1, 1, 1],
       [2, 4, 8, 6, 2],
       [3, 9, 7, 1, 3],
       [4, 6, 4, 6, 4],
       [5, 5, 5, 5, 5]])
In [21]:
powers % 10 == 4
Out[21]:
array([[False, False, False, False, False],
       [False,  True, False, False, False],
       [False, False, False, False, False],
       [ True, False,  True, False,  True],
       [False, False, False, False, False]], dtype=bool)
In [22]:
powers[powers % 10 == 4]
Out[22]:
array([   4,    4,   64, 1024])

Now let's do both: powers which are greater than 10 and end in 4.

In [23]:
powers[(powers % 10 == 4) & (powers > 10)]
Out[23]:
array([  64, 1024])
In [24]:
# no, wrong, don't
try:
    powers[powers % 10 == 4 and powers > 10]
except ValueError as e:
    print(e)
The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

The reason the parens and the ampersand (&) operator are necessary here is because of the nature of elementwise binary operations on boolean arrays.

Look at the values of those two arrays.

In [25]:
powers % 10 == 4
Out[25]:
array([[False, False, False, False, False],
       [False,  True, False, False, False],
       [False, False, False, False, False],
       [ True, False,  True, False,  True],
       [False, False, False, False, False]], dtype=bool)
In [26]:
powers > 10
Out[26]:
array([[False, False, False, False, False],
       [False, False, False,  True,  True],
       [False, False,  True,  True,  True],
       [False,  True,  True,  True,  True],
       [False,  True,  True,  True,  True]], dtype=bool)

We want an array which is True where both of those arrays are True. We could use this method:

In [27]:
np.logical_and(powers > 10, powers % 10 == 4)
Out[27]:
array([[False, False, False, False, False],
       [False, False, False, False, False],
       [False, False, False, False, False],
       [False, False,  True, False,  True],
       [False, False, False, False, False]], dtype=bool)

But the and operator is not element-wise and does not do the same thing as this, hence the error above. It tries to coerce both arrays to a truth value and numpy refuses its request. Operator precedence means that we need to parenthesize both sides.

In [28]:
(powers % 10 == 4) & (powers > 10)
Out[28]:
array([[False, False, False, False, False],
       [False, False, False, False, False],
       [False, False, False, False, False],
       [False, False,  True, False,  True],
       [False, False, False, False, False]], dtype=bool)

Back to DataFrames

There are two things we need to know about pandas and DataFrames before we can approach our first example with this new knowledge. First, a DataFrame is a collection of special numpy arrays called Series. Second, DataFrames can be indexed by a numpy array and the result will be the same as indexing all of its constituent series by that array.

In [29]:
area = [423967, 695662, 141297, 170312, 149995]
population = [38332521, 26448193, 19651127, 19552860, 12882135]
index = ['California', 'Texas', 'New York', 'Florida', 'Illinois']
df = pd.DataFrame({'area': area, 'population': population}, index=index)

density = df.population / df.area

density
Out[29]:
California     90.413926
Texas          38.018740
New York      139.076746
Florida       114.806121
Illinois       85.883763
dtype: float64
In [30]:
dense_states = density > 100

dense_states
Out[30]:
California    False
Texas         False
New York       True
Florida        True
Illinois      False
dtype: bool
In [31]:
large_states = df.area > 500000

large_states
Out[31]:
California    False
Texas          True
New York      False
Florida       False
Illinois      False
Name: area, dtype: bool
In [32]:
dense_or_large_states = dense_states | large_states

dense_or_large_states
Out[32]:
California    False
Texas          True
New York       True
Florida        True
Illinois      False
dtype: bool
In [33]:
df[dense_or_large_states]
Out[33]:
area population
Texas 695662 26448193
New York 141297 19651127
Florida 170312 19552860

This example is now fairly straightforward: to filter the dataset to only the states of the minimum area and population density, we perform elementwise operations on the population and area arrays, and use the result to index the entire dataset.


Conclusion

Understanding why filtering numpy arrays and pandas objects work the way it does involves understanding how boolean indexing works and how numpy element-wise operations work, both of which are key to competency with numpy itself. If you found this post interesting or edifying, I encourage you to learn more about numpy in the two resources I linked in this post: Python Data Science Handbook and the Python Numpy Tutorial.