{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In the course of analyzing data, one will inevitably want to remove items from a collection, leaving behind only the items which satisfy a condition. In vanilla python, there are two equivalent ways to spell such an operation."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"[0, 2, 4, 6, 8]"
]
},
"execution_count": 1,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Functional, but utterly unpythonic\n",
"\n",
"list(filter(lambda n: n % 2 == 0, range(10)))"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"[0, 2, 4, 6, 8]"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Syntactic sugar makes for quite readable code\n",
"\n",
"[n for n in range(10) if n % 2 == 0]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"\n",
"## Textbook Pandas Example\n",
"\n",
"It might be surprising for people coming from a conventional programming background that filtering in `pandas` (and in `numpy`, the numerical magic powering many pandas operations) works quite differently.\n",
"\n",
"Take this simplified and tweaked example from the excellent [Data Science Handbook by Jake VanderPlas](https://jakevdp.github.io/PythonDataScienceHandbook/03.02-data-indexing-and-selection.html). We have the area and population for the five biggest US states and we want to filter to only those with over 100 people per square km or with an area of at least 500,000 square km."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"
\n",
" \n",
" \n",
" | \n",
" area | \n",
" population | \n",
"
\n",
" \n",
" \n",
" \n",
" Texas | \n",
" 695662 | \n",
" 26448193 | \n",
"
\n",
" \n",
" New York | \n",
" 141297 | \n",
" 19651127 | \n",
"
\n",
" \n",
" Florida | \n",
" 170312 | \n",
" 19552860 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" area population\n",
"Texas 695662 26448193\n",
"New York 141297 19651127\n",
"Florida 170312 19552860"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import pandas as pd\n",
"\n",
"area = [423967, 695662, 141297, 170312, 149995]\n",
"population = [38332521, 26448193, 19651127, 19552860, 12882135]\n",
"index = ['California', 'Texas', 'New York', 'Florida', 'Illinois']\n",
"df = pd.DataFrame({'area': area, 'population': population}, index=index)\n",
"\n",
"df[(df.population / df.area > 100) | (df.area > 500000)]"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"While this syntax is easy enough to memorize, it's worth understanding _why_ it works like this.\n",
"\n",
"---\n",
"## Basic numpy indexing\n",
"\n",
"Let's go back to basics. Let's create an array with the first four positive integers."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"array([1, 2, 3, 4])"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import numpy as np\n",
"\n",
"a = np.array([1, 2, 3, 4])\n",
"\n",
"a"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can, of course, index this all the usual ways we're used to with python lists."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"1"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"a[0]"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"array([2, 3])"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"a[1:3]"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"array([3, 4])"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"a[-2:]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"With numpy arrays, we have a few additional ways to select items by index.\n",
"\n",
"One very common operation is to index an array by an array of indexes; what results is the values from the first array at the indexes specified in the second. This time, let's use a utility function to create the first array with the natural numbers between 10 and 20, endpoint excluded."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"array([10, 11, 12, 13, 14, 15, 16, 17, 18, 19])"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"a = np.arange(10, 20)\n",
"\n",
"a"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"When we index it with `[0, 3, 4]`, we get back an array with the zeroth, third and fourth values."
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"array([10, 13, 14])"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"indexes = np.array([0, 3, 4])\n",
"\n",
"a[indexes]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Simple lists work as the indexing array as well. Indexing an array with a list literal results in clean-looking code."
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"array([10, 13, 14])"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"a[[0, 3, 4]]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This is simple enough. An extension of this syntax exists for indexing matrices, but that's outside the scope of this little blog post.\n",
"\n",
"---\n",
"\n",
"## Boolean indexing\n",
"\n",
"Let's turn our attention back toward a goal. How do we filter a `numpy` array (or a `Series` or a `DataFrame`)? Well, `numpy` supports another indexing syntax. We can create an array of the same shape but with a `dtype` of `bool`, where each entry is `True` or `False`. When we index `a` by this array, we get back only the items which correspond to a `True` in the array of booleans. Below we set the first five values to `True` and make the last five `False`."
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"array([10, 11, 12, 13, 14])"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"a[[\n",
" True, True, True, True, True,\n",
" False, False, False, False, False\n",
"]]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Equipped with this syntax, we can ask a new question: how best can we create arrays which correspond to an input array such that there's a `True` for all the values we we can't to keep, and a `False` for those we don't.\n",
"\n",
"One easy way comes to mind: use a list comprehension which maps each value in the array to `True` or `False` according to our predicate. This works and it gets us something like an answer to our question."
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"array([10, 11, 12, 13, 14])"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"a[[n < 15 for n in a]]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If this feels weird to you though, that makes sense: we've reimplemented the conditional clause in a strange way. The following is equivalent to the former, but more pythonic. For the toy examples we've been using, it's might be the right solution."
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"array([10, 11, 12, 13, 14])"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"np.array([n for n in a if n < 15])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"\n",
"## Element-wise operations in numpy\n",
"\n",
"This is where numpy's element-wise operations become important. When you use a numpy array with a binary operator (`>`, `<`, `==`, `!=`, `+`, `-`, `*`, `/`, `**`, ...), the resulting value is an array in which each value is the result of performing that operation on corresponding items. If one of those operands is a scalar, the operation is applied to that scalar and every item in the array. If they're both arrays and they have the same shape, the corresponding items are those at the same index. If they don't have the same shape, well-defined rules dictate how items are made to correspond. Those rules are called [broadcasting](http://cs231n.github.io/python-numpy-tutorial/#numpy-broadcasting).\n",
"\n",
"This knowledge gives us enough information to understand idiomatic `numpy` filtering. Let's compare our array of values from 10 to 20 with the number of 15."
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"array([ True, True, True, True, True, False, False, False, False, False], dtype=bool)"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"a < 15"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The boolean array here is the same one we defined in various ways above and so we can use it as we did above."
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"array([10, 11, 12, 13, 14])"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"a[a < 15]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"\n",
"## A slightly more complicated example\n",
"\n",
"For a more sophisticated example, let's create a matrix with $i$ rows by $j$ columns, where the items are $j^i$— that is, the item in the $i$th row and the $j$th column is the index of the column taken to the power of the index of the row."
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"array([1, 2, 3, 4, 5])"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"bases = np.arange(1, 6)\n",
"\n",
"bases"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"array([[1, 1, 1, 1, 1],\n",
" [2, 2, 2, 2, 2],\n",
" [3, 3, 3, 3, 3],\n",
" [4, 4, 4, 4, 4],\n",
" [5, 5, 5, 5, 5]])"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"bases_matrix = np.repeat(bases, 5).reshape(5, 5)\n",
"\n",
"bases_matrix"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"array([[ 1, 1, 1, 1, 1],\n",
" [ 2, 4, 8, 16, 32],\n",
" [ 3, 9, 27, 81, 243],\n",
" [ 4, 16, 64, 256, 1024],\n",
" [ 5, 25, 125, 625, 3125]])"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"powers = bases_matrix**np.arange(1, 6)\n",
"\n",
"powers"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Say we want to determine values of `j**i` where `j` and `i` are each between 1 and 5, endpoint inclusive, and the resulting value is greater than 10. This is a more difficult proposition than our smaller examples above, but it is easily accomplished using the array we just constructed and simple `numpy` indexing."
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"array([ 16, 32, 27, 81, 243, 16, 64, 256, 1024, 25, 125,\n",
" 625, 3125])"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"powers[powers > 10]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Noww want to select those powers which end in 4. We can do accomplish this by using element-wise modulo arithmetic and comparison."
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"array([[1, 1, 1, 1, 1],\n",
" [2, 4, 8, 6, 2],\n",
" [3, 9, 7, 1, 3],\n",
" [4, 6, 4, 6, 4],\n",
" [5, 5, 5, 5, 5]])"
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"powers % 10"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"array([[False, False, False, False, False],\n",
" [False, True, False, False, False],\n",
" [False, False, False, False, False],\n",
" [ True, False, True, False, True],\n",
" [False, False, False, False, False]], dtype=bool)"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"powers % 10 == 4"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"array([ 4, 4, 64, 1024])"
]
},
"execution_count": 22,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"powers[powers % 10 == 4]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now let's do both: powers which are greater than 10 and end in 4."
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"array([ 64, 1024])"
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"powers[(powers % 10 == 4) & (powers > 10)]"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()\n"
]
}
],
"source": [
"# no, wrong, don't\n",
"try:\n",
" powers[powers % 10 == 4 and powers > 10]\n",
"except ValueError as e:\n",
" print(e)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The reason the parens and the ampersand (`&`) operator are necessary here is because of the nature of elementwise binary operations on boolean arrays.\n",
"\n",
"Look at the values of those two arrays."
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"array([[False, False, False, False, False],\n",
" [False, True, False, False, False],\n",
" [False, False, False, False, False],\n",
" [ True, False, True, False, True],\n",
" [False, False, False, False, False]], dtype=bool)"
]
},
"execution_count": 25,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"powers % 10 == 4"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"array([[False, False, False, False, False],\n",
" [False, False, False, True, True],\n",
" [False, False, True, True, True],\n",
" [False, True, True, True, True],\n",
" [False, True, True, True, True]], dtype=bool)"
]
},
"execution_count": 26,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"powers > 10"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We want an array which is `True` where both of those arrays are `True`. We could use this method:"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"array([[False, False, False, False, False],\n",
" [False, False, False, False, False],\n",
" [False, False, False, False, False],\n",
" [False, False, True, False, True],\n",
" [False, False, False, False, False]], dtype=bool)"
]
},
"execution_count": 27,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"np.logical_and(powers > 10, powers % 10 == 4)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"But the `and` operator is _not_ element-wise and does not do the same thing as this, hence the error above. It tries to coerce both arrays to a truth value and `numpy` refuses its request. Operator precedence means that we need to parenthesize both sides."
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"array([[False, False, False, False, False],\n",
" [False, False, False, False, False],\n",
" [False, False, False, False, False],\n",
" [False, False, True, False, True],\n",
" [False, False, False, False, False]], dtype=bool)"
]
},
"execution_count": 28,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"(powers % 10 == 4) & (powers > 10)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"\n",
"## Back to DataFrames"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"There are two things we need to know about `pandas` and `DataFrames` before we can approach our first example with this new knowledge. First, a `DataFrame` is a collection of special `numpy` arrays called `Series`. Second, `DataFrame`s can be indexed by a `numpy` array and the result will be the same as indexing all of its constituent series by that array."
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"California 90.413926\n",
"Texas 38.018740\n",
"New York 139.076746\n",
"Florida 114.806121\n",
"Illinois 85.883763\n",
"dtype: float64"
]
},
"execution_count": 29,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"area = [423967, 695662, 141297, 170312, 149995]\n",
"population = [38332521, 26448193, 19651127, 19552860, 12882135]\n",
"index = ['California', 'Texas', 'New York', 'Florida', 'Illinois']\n",
"df = pd.DataFrame({'area': area, 'population': population}, index=index)\n",
"\n",
"density = df.population / df.area\n",
"\n",
"density"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"California False\n",
"Texas False\n",
"New York True\n",
"Florida True\n",
"Illinois False\n",
"dtype: bool"
]
},
"execution_count": 30,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dense_states = density > 100\n",
"\n",
"dense_states"
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"California False\n",
"Texas True\n",
"New York False\n",
"Florida False\n",
"Illinois False\n",
"Name: area, dtype: bool"
]
},
"execution_count": 31,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"large_states = df.area > 500000\n",
"\n",
"large_states"
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"California False\n",
"Texas True\n",
"New York True\n",
"Florida True\n",
"Illinois False\n",
"dtype: bool"
]
},
"execution_count": 32,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dense_or_large_states = dense_states | large_states\n",
"\n",
"dense_or_large_states"
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" area | \n",
" population | \n",
"
\n",
" \n",
" \n",
" \n",
" Texas | \n",
" 695662 | \n",
" 26448193 | \n",
"
\n",
" \n",
" New York | \n",
" 141297 | \n",
" 19651127 | \n",
"
\n",
" \n",
" Florida | \n",
" 170312 | \n",
" 19552860 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" area population\n",
"Texas 695662 26448193\n",
"New York 141297 19651127\n",
"Florida 170312 19552860"
]
},
"execution_count": 33,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df[dense_or_large_states]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This example is now fairly straightforward: to filter the dataset to only the states of the minimum area and population density, we perform elementwise operations on the population and area arrays, and use the result to index the entire dataset."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"\n",
"## Conclusion\n",
"\n",
"Understanding why filtering `numpy` arrays and `pandas` objects work the way it does involves understanding how boolean indexing works and how `numpy` element-wise operations work, both of which are key to competency with `numpy` itself. If you found this post interesting or edifying, I encourage you to learn more about `numpy` in the two resources I linked in this post: [Python Data Science Handbook](https://jakevdp.github.io/PythonDataScienceHandbook/) and the [Python Numpy Tutorial\n",
"](http://cs231n.github.io/python-numpy-tutorial/)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.5.2"
}
},
"nbformat": 4,
"nbformat_minor": 1
}