Demystifying pandas and numpy filtering
In the course of analyzing data, one will inevitably want to remove items from a collection, leaving behind only the items which satisfy a condition. In vanilla python, there are two equivalent ways to spell such an operation.
# Functional, but utterly unpythonic
list(filter(lambda n: n % 2 == 0, range(10)))
# Syntactic sugar makes for quite readable code
[n for n in range(10) if n % 2 == 0]
Textbook Pandas Example¶
It might be surprising for people coming from a conventional programming background that filtering in pandas
(and in numpy
, the numerical magic powering many pandas operations) works quite differently.
Take this simplified and tweaked example from the excellent Data Science Handbook by Jake VanderPlas. We have the area and population for the five biggest US states and we want to filter to only those with over 100 people per square km or with an area of at least 500,000 square km.
import pandas as pd
area = [423967, 695662, 141297, 170312, 149995]
population = [38332521, 26448193, 19651127, 19552860, 12882135]
index = ['California', 'Texas', 'New York', 'Florida', 'Illinois']
df = pd.DataFrame({'area': area, 'population': population}, index=index)
df[(df.population / df.area > 100) | (df.area > 500000)]
While this syntax is easy enough to memorize, it's worth understanding why it works like this.
Basic numpy indexing¶
Let's go back to basics. Let's create an array with the first four positive integers.
import numpy as np
a = np.array([1, 2, 3, 4])
a
We can, of course, index this all the usual ways we're used to with python lists.
a[0]
a[1:3]
a[-2:]
With numpy arrays, we have a few additional ways to select items by index.
One very common operation is to index an array by an array of indexes; what results is the values from the first array at the indexes specified in the second. This time, let's use a utility function to create the first array with the natural numbers between 10 and 20, endpoint excluded.
a = np.arange(10, 20)
a
When we index it with [0, 3, 4]
, we get back an array with the zeroth, third and fourth values.
indexes = np.array([0, 3, 4])
a[indexes]
Simple lists work as the indexing array as well. Indexing an array with a list literal results in clean-looking code.
a[[0, 3, 4]]
This is simple enough. An extension of this syntax exists for indexing matrices, but that's outside the scope of this little blog post.
Boolean indexing¶
Let's turn our attention back toward a goal. How do we filter a numpy
array (or a Series
or a DataFrame
)? Well, numpy
supports another indexing syntax. We can create an array of the same shape but with a dtype
of bool
, where each entry is True
or False
. When we index a
by this array, we get back only the items which correspond to a True
in the array of booleans. Below we set the first five values to True
and make the last five False
.
a[[
True, True, True, True, True,
False, False, False, False, False
]]
Equipped with this syntax, we can ask a new question: how best can we create arrays which correspond to an input array such that there's a True
for all the values we we can't to keep, and a False
for those we don't.
One easy way comes to mind: use a list comprehension which maps each value in the array to True
or False
according to our predicate. This works and it gets us something like an answer to our question.
a[[n < 15 for n in a]]
If this feels weird to you though, that makes sense: we've reimplemented the conditional clause in a strange way. The following is equivalent to the former, but more pythonic. For the toy examples we've been using, it's might be the right solution.
np.array([n for n in a if n < 15])
Element-wise operations in numpy¶
This is where numpy's element-wise operations become important. When you use a numpy array with a binary operator (>
, <
, ==
, !=
, +
, -
, *
, /
, **
, ...), the resulting value is an array in which each value is the result of performing that operation on corresponding items. If one of those operands is a scalar, the operation is applied to that scalar and every item in the array. If they're both arrays and they have the same shape, the corresponding items are those at the same index. If they don't have the same shape, well-defined rules dictate how items are made to correspond. Those rules are called broadcasting.
This knowledge gives us enough information to understand idiomatic numpy
filtering. Let's compare our array of values from 10 to 20 with the number of 15.
a < 15
The boolean array here is the same one we defined in various ways above and so we can use it as we did above.
a[a < 15]
A slightly more complicated example¶
For a more sophisticated example, let's create a matrix with $i$ rows by $j$ columns, where the items are $j^i$— that is, the item in the $i$th row and the $j$th column is the index of the column taken to the power of the index of the row.
bases = np.arange(1, 6)
bases
bases_matrix = np.repeat(bases, 5).reshape(5, 5)
bases_matrix
powers = bases_matrix**np.arange(1, 6)
powers
Say we want to determine values of j**i
where j
and i
are each between 1 and 5, endpoint inclusive, and the resulting value is greater than 10. This is a more difficult proposition than our smaller examples above, but it is easily accomplished using the array we just constructed and simple numpy
indexing.
powers[powers > 10]
Noww want to select those powers which end in 4. We can do accomplish this by using element-wise modulo arithmetic and comparison.
powers % 10
powers % 10 == 4
powers[powers % 10 == 4]
Now let's do both: powers which are greater than 10 and end in 4.
powers[(powers % 10 == 4) & (powers > 10)]
# no, wrong, don't
try:
powers[powers % 10 == 4 and powers > 10]
except ValueError as e:
print(e)
The reason the parens and the ampersand (&
) operator are necessary here is because of the nature of elementwise binary operations on boolean arrays.
Look at the values of those two arrays.
powers % 10 == 4
powers > 10
We want an array which is True
where both of those arrays are True
. We could use this method:
np.logical_and(powers > 10, powers % 10 == 4)
But the and
operator is not element-wise and does not do the same thing as this, hence the error above. It tries to coerce both arrays to a truth value and numpy
refuses its request. Operator precedence means that we need to parenthesize both sides.
(powers % 10 == 4) & (powers > 10)
Back to DataFrames¶
There are two things we need to know about pandas
and DataFrames
before we can approach our first example with this new knowledge. First, a DataFrame
is a collection of special numpy
arrays called Series
. Second, DataFrame
s can be indexed by a numpy
array and the result will be the same as indexing all of its constituent series by that array.
area = [423967, 695662, 141297, 170312, 149995]
population = [38332521, 26448193, 19651127, 19552860, 12882135]
index = ['California', 'Texas', 'New York', 'Florida', 'Illinois']
df = pd.DataFrame({'area': area, 'population': population}, index=index)
density = df.population / df.area
density
dense_states = density > 100
dense_states
large_states = df.area > 500000
large_states
dense_or_large_states = dense_states | large_states
dense_or_large_states
df[dense_or_large_states]
This example is now fairly straightforward: to filter the dataset to only the states of the minimum area and population density, we perform elementwise operations on the population and area arrays, and use the result to index the entire dataset.
Conclusion¶
Understanding why filtering numpy
arrays and pandas
objects work the way it does involves understanding how boolean indexing works and how numpy
element-wise operations work, both of which are key to competency with numpy
itself. If you found this post interesting or edifying, I encourage you to learn more about numpy
in the two resources I linked in this post: Python Data Science Handbook and the Python Numpy Tutorial.