Introduction to sparse matrices¶

A sparse matrix is just a matrix that is mostly zero. Typically, when people talk about sparse matrices in numerical computations, they mean matrices that are mostly zero and are represented in a way that takes advantage of that sparsity to reduce required storage or optimize operations.

As an extreme case, imagine a $M \times N$ matrix where $M = N = 1000000$, which is entirely zero save for a single $1$ at $(42, 999999)$. It's obvious that storing a trillion values—or 64Tb of 64-bit integers—is unnecessary, and we can write software which just assumes that the value is 0 at every index besides row $42$, column $999999$. We can describe this entire matrix with 5 integers:

$M=1000000$, $N=1000000$

$v=1$, $r=42$, $c=999999$.

If we had a second value $3$ at position $(33, 34)$, the same scheme would still work reasonably well:

$M=1000000$, $N=1000000$

$v_0=1$, $r_0=42$, $c_0=999999$

$v_1=3$, $r_1=33$, $c_1=34$.

In the course of analyzing data, one will inevitably want to remove items from a collection, leaving behind only the items which satisfy a condition. In vanilla python, there are two equivalent ways to spell such an operation.

# Functional, but utterly unpythonic

list(filter(lambda n: n % 2 == 0, range(10)))

[0, 2, 4, 6, 8]

# Syntactic sugar makes for quite readable code

[n for n in range(10) if n % 2 == 0]

[0, 2, 4, 6, 8]

Sparse matrix representations in scipy

Introduction to sparse matrices¶

Demystifying pandas and numpy filtering