Sparse matrix representations in scipy

Introduction to sparse matrices

A sparse matrix is just a matrix that is mostly zero. Typically, when people talk about sparse matrices in numerical computations, they mean matrices that are mostly zero and are represented in a way that takes advantage of that sparsity to reduce required storage or optimize operations.

As an extreme case, imagine a $M \times N$ matrix where $M = N = 1000000$, which is entirely zero save for a single $1$ at $(42, 999999)$. It's obvious that storing a trillion values—or 64Tb of 64-bit integers—is unnecessary, and we can write software which just assumes that the value is 0 at every index besides row $42$, column $999999$. We can describe this entire matrix with 5 integers:

$M=1000000$, $N=1000000$

$v=1$, $r=42$, $c=999999$.

If we had a second value $3$ at position $(33, 34)$, the same scheme would still work reasonably well:

$M=1000000$, $N=1000000$

$v_0=1$, $r_0=42$, $c_0=999999$

$v_1=3$, $r_1=33$, $c_1=34$.

This is similar to the Dictionary of Keys format and the COOrdinate format.

Of course, taken to the other extreme, this is quite inefficient. If this array were fully dense, with all nonzero values, we would have to store roughly three times as many numbers than if we had just stored the values consecutively in an array.

To understand how these different representations work, let's use some toy examples constructed from small matrices. In practice, there isn't much benefit to storing anything so small or so dense as a sparse matrix, but they're useful for illustrative purposes. Below we have a $(5, 5)$ matrix in which every value is either $0$ or $1$ with most values being $0$.

In [1]:
import numpy as np

m = np.matrix([
    [0, 0, 0, 1, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 1, 0],
    [0, 0, 1, 0, 1],
    [0, 1, 0, 1, 0]
])

m
Out[1]:
matrix([[0, 0, 0, 1, 0],
        [0, 0, 0, 0, 0],
        [0, 0, 0, 1, 0],
        [0, 0, 1, 0, 1],
        [0, 1, 0, 1, 0]])

For the remainder of this post, we'll take advantage of HTML display in notebooks and the sympy pretty printer to display matrices using a little utility function.

In [2]:
# the following is purely for the purposes of pretty printing matrices
from IPython.display import display
import sympy; sympy.init_printing()

def display_matrix(m):
    display(sympy.Matrix(m))


display_matrix(m)
$$\left[\begin{matrix}0 & 0 & 0 & 1 & 0\\0 & 0 & 0 & 0 & 0\\0 & 0 & 0 & 1 & 0\\0 & 0 & 1 & 0 & 1\\0 & 1 & 0 & 1 & 0\end{matrix}\right]$$

Much better!

Our first matrix here is sparse in the strict mathematical sense — it's mostly zero — but we're using np.matrix, a dense matrix object. To make sparse matrices, we'll make use of the objects provided by scipy.sparse.

The scipy sparse matrix constructors all accept dense matrices as inputs, which will allow us to create sparse matrices from our contrived examples and take them apart and see how they work.


COO representation

scipy.sparse.coo_matrix API docs | wikipedia

First on our list is COO representation. The capitalization of the name might make it seem like an acronym, but it's just an abbreviation of coordinate, and the format itself is quite comprehensible.

In [3]:
from scipy import sparse

mat_coo = sparse.coo_matrix(m)
mat_coo
Out[3]:
<5x5 sparse matrix of type '<class 'numpy.int64'>'
	with 6 stored elements in COOrdinate format>

The repr of a sparse matrix doesn't show any of the data like a standard matrix does. And sympy doesn't understand sparse matrices of this type. To see the data, we'll have to coerce the representation back to dense.

All sparse matrix representations in scipy have a todense() method which converts the matrix to a standard numpy matrix. (Again, the traditional definition of sparse matrix is in conflict with the conventional definition—todense() just changes the representation. It does not fill the zeros in with nonzero values.)

In [4]:
display_matrix(mat_coo.todense())
$$\left[\begin{matrix}0 & 0 & 0 & 1 & 0\\0 & 0 & 0 & 0 & 0\\0 & 0 & 0 & 1 & 0\\0 & 0 & 1 & 0 & 1\\0 & 1 & 0 & 1 & 0\end{matrix}\right]$$

If you're like me, you might be tempted to dig into the scipy source to see how todense() is implemented on the various matrix representations. Unfortunately for us, the scipy source does not give itself over to inspection so easily. If you're comfortable with Fortran, LAPACK, BLAS, ATLAS, etc., the source might make more sense, but in that case, you likely have no need for this post. Instead, let's take a look at the way attributes on the COO matrix instance to see how the data is stored.

COO matrices store the value, row and column for each nonzero item in the matrix. While wikipedia describes the COO format as consisting of 3-value tuples with $(row, column, value)$ for each nonzero item, the scipy implementation stores the data, the row indices and the column indices each as their own array with a length equal to the number of nonzero items ($NNZ$).

In [5]:
mat_coo.row, mat_coo.col, mat_coo.data
Out[5]:
(array([0, 2, 3, 3, 4, 4], dtype=int32),
 array([3, 3, 2, 4, 1, 3], dtype=int32),
 array([1, 1, 1, 1, 1, 1]))

This is easy to read and understand; at row $0$, column $3$, the value is $1$. In fact, we can easily see that all nonzero values are $1$.

Let's construct a slightly-less-trivial example where the values are the integers from $1$ to $10$.

In [6]:
data = list(range(1, 11))
rows = [0, 0, 2, 2, 2, 2, 3, 3, 4, 4]
cols = [3, 4, 0, 1, 3, 4, 1, 3, 0, 4]

scipy.sparse.coo_matrix accepts data in the canonical representation as two-tuple, in which the first item is the nonzero values, and the second item is itself a two-value tuple with the rows and columns repesctively. A second argument shape is required, or else it would be unclear whether empty rows and columns existed beyond the bounds of the explicitly provided data.

In [7]:
display_matrix(
    sparse.coo_matrix((data, (rows, cols)), (5, 5)).todense())
$$\left[\begin{matrix}0 & 0 & 0 & 1 & 2\\0 & 0 & 0 & 0 & 0\\3 & 4 & 0 & 5 & 6\\0 & 7 & 0 & 8 & 0\\9 & 0 & 0 & 0 & 10\end{matrix}\right]$$

As I mentioned before, it's not easy to find and read the points in the scipy source where the various sparse representations are constructed and made dense. To illustrate how these operations and other work, let's make our own. I'm going to prefix all these simplistic sparse matrix classes with Naive because they're only for illustrative purposes. Real world sparse matrix libraries handle lots of corner cases, take advantage of sorting to optimize certain operations and call out to lower-level code to optimize other operations. Ours will do none of these things and instead focus on iteration, setting and getting values in order to make the details of these formats more intuitive.

Below is an abstract base class describing everything we want our sparse matrix classes to handle. We'll only handle a couple bits of common functionality in our base class. It accepts and validates a keyword argument for shape and saves that as an instance property; likewise for a dtype argument which sets the type of the data (eg. float, int). It also assumes that we'll define iteration, and uses that to implement a to_dense method and a utility method for pretty printing it using sympy.

In [8]:
class NaiveSparseMatrix(object):
    
    def __init__(self, shape=None, dtype=np.int64):
        if shape is None or len(shape) != 2:
            raise ValueError('shape must be provided as (M, N)')
        self.shape = shape
        self.dtype = dtype
        
    def __iter__(self):
        raise NotImplementedError
        
    def __get__(self):
        raise NotImplementedError
        
    def __set__(self):
        raise NotImplementedError
    
    def __len__(self):
        raise NotImplementedError

    def to_dense(self):
        # Our simplistic densification method constructs a numpy array, full
        # of zeros, of the same shape as the sparse matrix, and then
        # progressively fills it up by iterating over the nonzero values and
        # indices and assigning accordingly.
        mat = np.matrix(np.zeros(shape=self.shape, dtype=self.dtype))
        for row, column, value in self:
            mat[row, column] = value
        return mat
        
    def display(self):
        return display_matrix(self.to_dense())

The COO format is simple and our NaiveCOOMatrix class reflects that simplicity.

The advantages of the format are easy to see, too. The canonical representation makes it trivial to iterate over the nonzero values; as a consequence, it's easy to construct, it's easy to iterate over the nonzero values, and it's easy to set and get items by their indices.

In [9]:
class NaiveCOOMatrix(NaiveSparseMatrix):
    
    def __init__(self, items, **kwargs):
        # Accept a single positional argument which is the format described by
        # wikipedia—a list of (row, column, value) tuples.
        self.items = items
        super().__init__(**kwargs)
        
    def __iter__(self):
        # We want our iterator to iterate (row, column, value) tuples, which is
        # trivial in this format.
        for row, column, value in self.items:
            yield row, column, value
    
    def __getitem__(self, coord):
        # To find an item by coordinate, we just iterate over nonzero values
        # and look for matching coordinates.
        for row, column, value in self.items:
            if (row, column) == coord:
                return value
            
        # If we don't find it in the explicitly defined items, we know it's 0.
        return 0
            
    def __setitem__(self, coord, new_value):
        # Setting is similar to getting; we look for a matching coordinate, and
        # if we find one, we overwrite the value at the corresponding index in
        # the values array.
        for index, (row, column, value) in enumerate(self.items):
            if (row, column) == coord:
                self.items[index] = (row, column, new_value)
                break
        # If we don't find it, we can just append it to the items array.
        # Arguably, we should care about inserting it at a sorted position so
        # that iteration order makes more sense, but that's not a concern for
        # our toy examples.
        else:
            self.items.append((coord[0], coord[1], new_value))
            
    def __len__(self):
        # Counting nonzero is also easy for this representation. We have as
        # many items as we have nonzero values.
        return len(self.items)

OK! We've made a class for representing a sparse matrix. Besides the obvious optimizations, ours differs in some really important ways from scipy.sparse.coo_matrix.

  • It accepts a list of (row, column, value) tuples rather 3 arrays, one of each kind.
  • We spell it to_dense() rather than todense() because we're good people who like nice APIs.
  • scipy.sparse.coo_matrix doesn't support indexing or assignment, and does support a whole range of mathematical operations.
  • Ours supports iterating nonzero values along with their indices, but doesn't guarantee an order. It's not clear how useful this is, but all the previously-stated caveats about this being for illustrative perhaps apply here.

Since ours accepts the data in a different format, let's put our data into that format and construct it.

In [10]:
items = list(zip(rows, cols, data))

naive_coo = NaiveCOOMatrix(items, shape=(5, 5))
naive_coo.display()
$$\left[\begin{matrix}0 & 0 & 0 & 1 & 2\\0 & 0 & 0 & 0 & 0\\3 & 4 & 0 & 5 & 6\\0 & 7 & 0 & 8 & 0\\9 & 0 & 0 & 0 & 10\end{matrix}\right]$$

We did it! Let's sanity check our implementation by accessing a defined value, $5$ at position $(2, 3)$.

In [11]:
naive_coo[2, 3]
Out[11]:
$$5$$

Accessing $(0, 0)$, for which we didn't supply a value, should return $0$.

In [12]:
naive_coo[0, 0]
Out[12]:
$$0$$

We define its __len__ as the number of its nonzero values.

In [13]:
len(naive_coo)
Out[13]:
$$10$$

If we assign a nonzero value to $(0, 0)$, we should be able to access it subsequently.

In [14]:
naive_coo[0, 0] = -1
naive_coo.display()
$$\left[\begin{matrix}-1 & 0 & 0 & 1 & 2\\0 & 0 & 0 & 0 & 0\\3 & 4 & 0 & 5 & 6\\0 & 7 & 0 & 8 & 0\\9 & 0 & 0 & 0 & 10\end{matrix}\right]$$
In [15]:
naive_coo[0, 0]
Out[15]:
$$-1$$

And now the len should be a bit bigger.

In [16]:
len(naive_coo)
Out[16]:
$$11$$

Assigning a new value to a coordinate with a nonzero value should overwrite the existing value and not increase the length.

In [17]:
naive_coo[2, 3] = 99

naive_coo.display()
$$\left[\begin{matrix}-1 & 0 & 0 & 1 & 2\\0 & 0 & 0 & 0 & 0\\3 & 4 & 0 & 99 & 6\\0 & 7 & 0 & 8 & 0\\9 & 0 & 0 & 0 & 10\end{matrix}\right]$$
In [18]:
len(naive_coo)
Out[18]:
$$11$$

So that's a COO matrix. One major downside of this representation is the one mentioned in our giant example in the opening. Depending on how sparse a matrix is, and ours is not very sparse, the COO representation might actually increase the required storage. Let's look at how many values it takes to represent out matrix.

In [19]:
sum(len(t) for t in naive_coo.items)
Out[19]:
$$33$$

The COO format requires storing 33 numbers to represent 11 nonzero numbers. Storing every value consecutively would only require storing 25 numbers. Different representations take advantage of the structure of the sparsity to minimize storage and optimize operations.


DOK representation

scipy.sparse.dok_matrix API docs | wikipedia

DOK stands for dictionary of keys and it's exactly what it sounds like. Of all the formats discussed in this post, it's by far the simplest to implement using vanilla Python. Like COO, it stores 3 numbers per each non-zero number, but it uses a dictionary where the key is the pair of row and column and the value is the number.

All scipy.sparse matrix constructors support being supplied a single argument with a dense matrix, so we'll create the same example as the previous using that call signature, and then let's take it apart and see what it's made of.

In [20]:
mat_dok = sparse.dok_matrix([
    [0, 0, 0, 1, 2],
    [0, 0, 0, 0, 0],
    [3, 4, 0, 5, 6],
    [0, 7, 0, 0, 8],
    [9, 0, 0, 0, 10]
])

display_matrix(mat_dok.todense())
$$\left[\begin{matrix}0 & 0 & 0 & 1 & 2\\0 & 0 & 0 & 0 & 0\\3 & 4 & 0 & 5 & 6\\0 & 7 & 0 & 0 & 8\\9 & 0 & 0 & 0 & 10\end{matrix}\right]$$

sparse_dok implements keys(), values() and items() just like a vanilla python dict.

In [21]:
list(mat_dok.items())
Out[21]:
[((3, 1), 7),
 ((4, 4), 10),
 ((2, 1), 4),
 ((2, 0), 3),
 ((2, 3), 5),
 ((0, 4), 2),
 ((0, 3), 1),
 ((3, 4), 8),
 ((2, 4), 6),
 ((4, 0), 9)]

Implementation-wise, it doesn't get simpler. We store the dict, and we use it for iteration, lookup and assignment.

In [22]:
class NaiveDOKMatrix(NaiveSparseMatrix):
    
    def __init__(self, pairs, **kwargs):
        self.data = dict(pairs)
        super().__init__(**kwargs)
        
    def __iter__(self):
        for (row, column), value in sorted(self.data.items()):
            yield (row, column, value)

    def __getitem__(self, coord):
        return self.data.get(coord, 0)
            
    def __setitem__(self, coord, new_value):
        self.data[coord] = new_value
            
    def __len__(self):
        return len(self.data)


naive_dok = NaiveDOKMatrix(mat_dok.items(), shape=(5, 5))

naive_dok.display()
$$\left[\begin{matrix}0 & 0 & 0 & 1 & 2\\0 & 0 & 0 & 0 & 0\\3 & 4 & 0 & 5 & 6\\0 & 7 & 0 & 0 & 8\\9 & 0 & 0 & 0 & 10\end{matrix}\right]$$
In [23]:
naive_dok[2, 3]
Out[23]:
5
In [24]:
naive_dok[0, 0]
Out[24]:
$$0$$
In [25]:
naive_dok[0, 0] = 11

naive_dok.display()
$$\left[\begin{matrix}11 & 0 & 0 & 1 & 2\\0 & 0 & 0 & 0 & 0\\3 & 4 & 0 & 5 & 6\\0 & 7 & 0 & 0 & 8\\9 & 0 & 0 & 0 & 10\end{matrix}\right]$$

LIL Representation

The LIL or list of lists representation is also straightforward to understand and implement. LIL is a row-oriented representation, in which row-based operations are easier to implement and may be less complex to compute.

A LIL matrix is constructed from a single array of length $M$ (the number of rows) in which each item is a list of (column_index, value) pairs.

In [26]:
class NaiveLILMatrix(NaiveSparseMatrix):
    
    def __init__(self, items, **kwargs):
        self.items = items
        super().__init__(**kwargs)
        
    def __iter__(self):
        # It's necessary to enumerate here, because the row index is
        # not stored explicitly but rather is the index in items at
        # which the pairs of column indexes and values are stored.
        for row, row_items in enumerate(self.items):
            for column, value in row_items:
                yield (row, column, value)

    def __getitem__(self, coord):
        # To get an item, we look up the column-value pairs at the
        # supplied row index, then look for a matching column index.
        for column, value in self.items[coord[0]]:
            if column == coord[1]:
                return value
        return 0
            
    def __setitem__(self, coord, new_value):
        # Setting value sis slightly trickier. We enumerate the column
        # index, value pairs for the given row, keeping track of the
        # current index as we iterate. If a matching column is found,
        # we use that index to overwrite. Otherwise, we append the new
        # column_index, value pair to the row array.
        for index, (column, value) in enumerate(self.items[coord[0]]):
            if column == coord[1]:
                self.items[coord[0]][index] = (coord[1], new_value)
                break
        else:
            self.items[coord[0]].append((coord[1], new_value))
            
    def __len__(self):
        # The NNZ of the a LIL matrix is the sum of the length of the
        # column index, value arrays in each per-row array.
        return sum(len(row_items) for row_items in self.items)
In [27]:
items = [
    [(3, 1), (4, 2)],
    [],
    [(0, 3), (1, 4), (3, 4), (4, 6)],
    [(1, 7), (3, 8)],
    [(0, 9), (4, 10)]
]
naive_lil = NaiveLILMatrix(items, shape=(5, 5))

naive_lil.display()
$$\left[\begin{matrix}0 & 0 & 0 & 1 & 2\\0 & 0 & 0 & 0 & 0\\3 & 4 & 0 & 4 & 6\\0 & 7 & 0 & 8 & 0\\9 & 0 & 0 & 0 & 10\end{matrix}\right]$$

CSR representation

scipy.sparse.csr_matrix API docs | wikipedia

CSR stands for compressed sparse row and is good for implementing fast arithmetic operations as well as slicing by row. It's more complicated than the previous examples and it can be used to take better advantage of the sparse structure.

In [28]:
mat_csr = sparse.csr_matrix((data, (rows, cols)), (5, 5))
display_matrix(mat_csr.todense())
$$\left[\begin{matrix}0 & 0 & 0 & 1 & 2\\0 & 0 & 0 & 0 & 0\\3 & 4 & 0 & 5 & 6\\0 & 7 & 0 & 8 & 0\\9 & 0 & 0 & 0 & 10\end{matrix}\right]$$

Now let's take it apart to see what's inside.

In [29]:
mat_csr.data, mat_csr.indptr, mat_csr.indices
Out[29]:
(array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10], dtype=int64),
 array([ 0,  2,  2,  6,  8, 10], dtype=int32),
 array([3, 4, 0, 1, 3, 4, 1, 3, 0, 4], dtype=int32))

It's clear enough that data is the nonzero values in "row-major" order, which is to say, left to right then top to bottom, much like how English text is read. It's less clear, though, what the other arrays are.

The second array is nondecreasing — each value is equal to or greater than the previous. Its first value is $0$ and its last is $10$. It's hard to say what that might be, because we provided 10 input values in the range $[0, 10)$.

Let's double our nonzero values to make it more clear what's going on here.

In [30]:
doubled_data = [n * 2 for n in data]
mat_csr = sparse.csr_matrix((doubled_data, (rows, cols)), (5, 5))
display_matrix(mat_csr.todense())
$$\left[\begin{matrix}0 & 0 & 0 & 2 & 4\\0 & 0 & 0 & 0 & 0\\6 & 8 & 0 & 10 & 12\\0 & 14 & 0 & 16 & 0\\18 & 0 & 0 & 0 & 20\end{matrix}\right]$$
In [31]:
mat_csr.data, mat_csr.indptr, mat_csr.indices
Out[31]:
(array([ 2,  4,  6,  8, 10, 12, 14, 16, 18, 20], dtype=int64),
 array([ 0,  2,  2,  6,  8, 10], dtype=int32),
 array([3, 4, 0, 1, 3, 4, 1, 3, 0, 4], dtype=int32))

The data in the first array doubles and everything else stays the same. We can reasonably conclude that the second array doesn't hold the nonzero data but describes its position in the matrix.

Things become much more obvious when we look at indptr in pairs.

In [32]:
pointer_ranges = list(zip(
    mat_csr.indptr,
    mat_csr.indptr[1:]))

pointer_ranges
Out[32]:
[(0, 2), (2, 2), (2, 6), (6, 8), (8, 10)]
In [33]:
[j - i for i, j in pointer_ranges]
Out[33]:
[2, 0, 4, 2, 2]

When we subtract the latter value from the former value, we get the number of nonzero values in each row. And this is the key to how the CSR representation works. indptr has pointers to the two other arrays, describing successively where the data for each row starts and begins. The numbers in data are, as already figured out, the nonzero numbers in the matrix. The numbers in indices are the column indexes at which those corresponding nonzero numbers belong.

Equipped with this knowledge, we can access the values and the column indices by row.

In [34]:
values_by_row = [mat_csr.data[i:j] for i, j in pointer_ranges]

values_by_row
Out[34]:
[array([2, 4], dtype=int64),
 array([], dtype=int64),
 array([ 6,  8, 10, 12], dtype=int64),
 array([14, 16], dtype=int64),
 array([18, 20], dtype=int64)]
In [35]:
row_indices_by_row = [mat_csr.indices[i:j] for i, j in pointer_ranges]

row_indices_by_row
Out[35]:
[array([3, 4], dtype=int32),
 array([], dtype=int32),
 array([0, 1, 3, 4], dtype=int32),
 array([1, 3], dtype=int32),
 array([0, 4], dtype=int32)]

Now we know enough to write our own naive implementation. Like last time, we're going to change things from the way scipy works for sake of clarity.

In [36]:
class NaiveCSRMatrix(NaiveSparseMatrix):
    
    def __init__(self, values, row_extents, column_indices, **kwargs):
        # Accept the canonical representation of values, row_extents
        # and column_indices.
        self.values = values
        self.row_extents = row_extents
        self.column_indices = column_indices
        super().__init__(**kwargs)
        
    def __iter__(self):
        # Take the row_extents and zip them pairwise to get the index
        # ranges for each row, and use those to slice the values and the
        # column indices.
        pointer_ranges = zip(
            self.row_extents,
            self.row_extents[1:])
        # Since pointer_ranges is naturally M+1, or one greater than the
        # number of rows, our pairwise ranges correspond directly to the
        # row indices, and so we can iterate them with enumerate to keep
        # track of the row index.
        for row, (row_start, row_end) in enumerate(pointer_ranges):
            values = self.values[row_start:row_end]
            columns = self.column_indices[row_start:row_end]
            for value, column in zip(values, columns):
                yield row, column, value

    def __getitem__(self, coord):
        # To get items by their indices, we take the supplied row index
        # to look up the indices for values in that row.
        row, column = coord
        row_start = self.row_extents[row]
        row_end = self.row_extents[row + 1]
        
        # Then we look for the supplied column index in the column_indices
        # array, starting where the row starts and ending where the row ends.
        try:
            index = self.column_indices.index(column, row_start, row_end)
        except ValueError:
            return 0
        
        # Now we have the index of the column_indices where the column is;
        # that corresponds to the index in values where the value is.
        return self.values[index]
            
    def __setitem__(self, coord, new_value):
        # Setting an item is similar to getting it. We slice the column_indices
        # into the ones belonging to the supplied row, then we check to see if
        # the supplied column index is in that slice. If so, we look up the index
        # as above and then overwrite it. If not, we insert the value at the end
        # of the row, we insert the column index at the end of the row, and we 
        # increment all the row indices afterward.
        row, column = coord
        row_start = self.row_extents[row]
        row_end = self.row_extents[row + 1]
        column_indices = self.column_indices[row_start:row_end]
        if column in column_indices:
            index = self.column_indices.index(column, row_start, row_end)
            self.values[index] = new_value
        else:
            self.values.insert(row_end, new_value)
            self.column_indices.insert(row_end, column)
            for row_index in range(row + 1, self.shape[0] + 1):
                self.row_extents[row_index] += 1
            
    def __len__(self):
        # Counting nonzero values remains simple.
        return len(self.values)


values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
row_extents = [0, 2, 2, 6, 8, 10]
column_indices = [3, 4, 0, 1, 3, 4, 1, 3, 0, 4]

mat_csr = NaiveCSRMatrix(
    values, row_extents, column_indices,
    shape=(5, 5))

mat_csr.display()
$$\left[\begin{matrix}0 & 0 & 0 & 1 & 2\\0 & 0 & 0 & 0 & 0\\3 & 4 & 0 & 5 & 6\\0 & 7 & 0 & 8 & 0\\9 & 0 & 0 & 0 & 10\end{matrix}\right]$$
In [37]:
mat_csr[2, 3]
Out[37]:
$$5$$
In [38]:
mat_csr[1, 0]
Out[38]:
$$0$$
In [39]:
mat_csr[3, 4] = 11

mat_csr.display()
$$\left[\begin{matrix}0 & 0 & 0 & 1 & 2\\0 & 0 & 0 & 0 & 0\\3 & 4 & 0 & 5 & 6\\0 & 7 & 0 & 8 & 11\\9 & 0 & 0 & 0 & 10\end{matrix}\right]$$
In [40]:
len(mat_csr)
Out[40]:
$$11$$

CSR is a row-oriented format which makes certain row-wise operations simpler to implement and less computationally complex to execute. If we wanted to get the nonzero values and their respective column indices for each row, we could do so easily.

In [41]:
def csr_to_row_values(mat_csr):
    row_ranges = zip(mat_csr.row_extents, mat_csr.row_extents[1:])
    return [
        list(zip(
            mat_csr.column_indices[start:end],
            mat_csr.values[start:end]))
        for start, end in row_ranges
    ]

csr_to_row_values(mat_csr)
Out[41]:
$$\left [ \left [ \left ( 3, \quad 1\right ), \quad \left ( 4, \quad 2\right )\right ], \quad \left [ \right ], \quad \left [ \left ( 0, \quad 3\right ), \quad \left ( 1, \quad 4\right ), \quad \left ( 3, \quad 5\right ), \quad \left ( 4, \quad 6\right )\right ], \quad \left [ \left ( 1, \quad 7\right ), \quad \left ( 3, \quad 8\right ), \quad \left ( 4, \quad 11\right )\right ], \quad \left [ \left ( 0, \quad 9\right ), \quad \left ( 4, \quad 10\right )\right ]\right ]$$

Getting the nonzero values for each column from a CSR-represented matrix is significantly more difficult. In the implementation below, the row_extent pairs are used to create pairs of column indices and values for each row. In order to ensure that missing column show up in the result as empty arrays, we flat map those pairs into a single list, group by the column index, create a dictionary from those groups, and use that to look up the values per column.

In [42]:
import itertools

def csr_to_column_values(mat_csr):
    row_ranges = zip(mat_csr.row_extents, mat_csr.row_extents[1:])
    indices_values = [
        pair for start, end in row_ranges
        for pair in zip(
            mat_csr.column_indices[start:end],
            mat_csr.values[start:end])
    ]
    
    column_key = lambda pair: pair[0]
    column_value_dict = {
        column: [value for _, value in group]
        for column, group in itertools.groupby(
            sorted(indices_values, key=column_key), column_key)
    }
    return [
        column_value_dict.get(column_index, [])
        for column_index in range(0, mat_csr.shape[1])
    ]
    

csr_to_column_values(mat_csr)
Out[42]:
$$\left [ \left [ 3, \quad 9\right ], \quad \left [ 4, \quad 7\right ], \quad \left [ \right ], \quad \left [ 1, \quad 5, \quad 8\right ], \quad \left [ 2, \quad 6, \quad 11, \quad 10\right ]\right ]$$

CSC representation

scipy.sparse.csc_matrix API docs | wikipedia

CSC stands for 'compressed sparse column', and as you might expect, it's the sister format to CSR, except the pointer array holds the extents of the columns.

We'll make this using the signature that allows us to supply a dense matrix, which we'll make by calling the to_dense() method we defined on all these naive matrix objects.

In [43]:
mat_csc = sparse.csc_matrix(mat_csr.to_dense(), shape=(5, 5))
mat_csc
Out[43]:
<5x5 sparse matrix of type '<class 'numpy.int64'>'
	with 11 stored elements in Compressed Sparse Column format>

scipy.sparse.csc_matrix uses the same naming convention for its canonical representation as does csr_matrix. We can see that our data, which was mostly in row-major order (save for that stray 11 we assigned post-construction), is no longer mostly-ordered in its representation because it's column-major ordered now.

In [44]:
mat_csc.data, mat_csc.indptr, mat_csc.indices
Out[44]:
(array([ 3,  9,  4,  7,  1,  5,  8,  2,  6, 11, 10], dtype=int64),
 array([ 0,  2,  4,  4,  7, 11], dtype=int32),
 array([2, 4, 2, 3, 0, 2, 3, 0, 2, 3, 4], dtype=int32))

Creating a NaiveSparseCSC class is largely a matter of swapping column and row in various places in our NaiveSparseCSR class. The same goes for writing column and row slicing functionality. The implementation is left as an exercise to the reader, or at least to the reader's imagination.


BSR representation

scipy.sparse.bsr_matrix API docs

BSR stands for 'block sparse row' and it is also related to CSR.

In [45]:
mat_bsr = sparse.bsr_matrix((data, (rows, cols)), (5, 5))
mat_bsr.todense()
Out[45]:
matrix([[ 0,  0,  0,  1,  2],
        [ 0,  0,  0,  0,  0],
        [ 3,  4,  0,  5,  6],
        [ 0,  7,  0,  8,  0],
        [ 9,  0,  0,  0, 10]], dtype=int64)
In [46]:
mat_bsr.data, mat_bsr.indptr, mat_bsr.indices
Out[46]:
(array([[[ 1]],
 
        [[ 2]],
 
        [[ 3]],
 
        [[ 4]],
 
        [[ 5]],
 
        [[ 6]],
 
        [[ 7]],
 
        [[ 8]],
 
        [[ 9]],
 
        [[10]]], dtype=int64),
 array([ 0,  2,  2,  6,  8, 10], dtype=int32),
 array([3, 4, 0, 1, 3, 4, 1, 3, 0, 4], dtype=int32))

We're running into the limitations of our small contrived example, so let's borrow this example from the scipy docs.

In [47]:
indptr = np.array([0, 2, 3, 6])
indices = np.array([0, 2, 2, 0, 1, 2])
data = np.array([1, 2, 3, 4, 5, 6]).repeat(4).reshape(6, 2, 2)
mat_bsr = sparse.bsr_matrix((data,indices,indptr), shape=(6, 6))

display_matrix(mat_bsr.todense())
$$\left[\begin{matrix}1 & 1 & 0 & 0 & 2 & 2\\1 & 1 & 0 & 0 & 2 & 2\\0 & 0 & 0 & 0 & 3 & 3\\0 & 0 & 0 & 0 & 3 & 3\\4 & 4 & 5 & 5 & 6 & 6\\4 & 4 & 5 & 5 & 6 & 6\end{matrix}\right]$$

Like the name implies, BSR format represents a sparse matrix as a dense array of dense blocks.

Here, scipy infers the blocksize from the data we provide. Let's look at that data again.

In [48]:
data
Out[48]:
array([[[1, 1],
        [1, 1]],

       [[2, 2],
        [2, 2]],

       [[3, 3],
        [3, 3]],

       [[4, 4],
        [4, 4]],

       [[5, 5],
        [5, 5]],

       [[6, 6],
        [6, 6]]])

data is an array in which every value is a $2 \times 2$ array—effectively, a little matrix. BSR requires that this block size divides the matrix's dimensions evenly, which allows the other indices to be relative to that block size. So our $6 \times 6$ matrix is indexable as a $3 \times 3$ matrix in which the items are not individual values but block matrices themselves. Once we make that jump in indexability, most of the rest is CSR-like.

In [49]:
class NaiveBSRMatrix(NaiveSparseMatrix):
    
    def __init__(self, blocks, row_extents, column_indices, **kwargs):
        # Accept the canonical representation of values, row_extents
        # and column_indices.
        self.blocks = blocks
        # Assume that all blocks are the same size and the block size
        # divides the matrix evenly.
        self.block_size = (len(blocks[0]), len(blocks[0][0]))
        self.row_extents = row_extents
        self.column_indices = column_indices
        super().__init__(**kwargs)
        
    def _scale_column(self, column_index, offset=0):
        # Take a column block index and an offset within that block and 
        # return an absolute column index.
        return column_index * self.block_size[1] + offset
    
    def _scale_coord(self, coord, offset=(0, 0)):
        # Take a block coordinate and an offset within that block and 
        # return an absolute coordinate.
        return (
            self._scale_row(coord[0], offset[0]),
            self._scale_column(coord[1], offset[1])
        )
    
    def _scale_row(self, row_index, offset=0):
        # Take a row block index and an offset within that block and 
        # return an absolute row index.
        return row_index * self.block_size[0] + offset
    
    def _unscale_column(self, column_index):
        # Take an absolute column index and return the column index of
        # that block and the relative column index within that block.
        return (
            column_index // self.block_size[1],
            column_index % self.block_size[1]
        )
    
    def _unscale_coord(self, coord):
        # Take an absolute coordinate and return the coordinate of
        # that block and the relative coordinate within that block.
        row, row_offset = self._unscale_row(coord[0])
        column, column_offset = self._unscale_column(coord[1])
        return (row, column), (row_offset, column_offset)
    
    def _unscale_row(self, row_index):
        # Take an absolute row index and return the row index of
        # that block and the relative row index within that block.
        return (
            row_index // self.block_size[0],
            row_index % self.block_size[0]
        )
        
    def __iter__(self):
        # Zip the row extents pairwise to get the index ranges for
        # each block row.
        pointer_ranges = zip(
            self.row_extents, self.row_extents[1:])

        for row_index, (start, end) in enumerate(pointer_ranges):
            blocks = self.blocks[start:end]
            column_indices = self.column_indices[start:end]
            # By zipping the blocks with themselves, we transpose the blocks
            # that belong to the same rows, thereby making a list that is as long
            # as the height of the block size, in which item is a tuple with the
            # same length as the number of blocks in a row, containing the lists
            # for that row of that block.
            # So [
            #     [1, 2, 0, 0, 5, 6],
            #     [3, 4, 0, 0, 7, 8],
            #     ...
            # ]
            #
            # which is represented as
            # [
            #     [[1, 2], [3, 4]],
            #     [[5, 6], [7, 8]]
            # ]
            #
            # becomes [[[1, 2], [5, 6]], [[3, 4], [7, 8]].
            #
            # By enumerating these rows with indices, we keep track of the
            # row offset within the block.
            for row_offset, row_block in enumerate(zip(*blocks)):
                # By zipping together those transposed row blocks with the column
                # indices for this block, we get lists of block rows, per blocks,
                # with their column indices alongside.
                # So [[[1, 2], [5, 6]], [[3, 4], [7, 8]]
                # 
                # becomes [[0, [[1, 2], [5, 6]]], [2, [[3, 4], [7, 8]]]]
                for column_index, block in zip(column_indices, row_block):
                    # As we enumerate the values inside those blocks above, we can keep track
                    # of the column offset within the bock. Now we have the value, the row offset,
                    # the column offset, and the block coordinate, and can use all of these
                    # to generate the tuple of absolute row, column and value.
                    for column_offset, value in enumerate(block):
                        yield (
                            self._scale_row(row_index, row_offset),
                            self._scale_column(column_index, column_offset),
                            value
                        )

    def __getitem__(self, coord):
        # Getting works a lot like it does for the CSR representation with
        # the addition of a coordinate transformation.
        #
        # First, "unscale" the absolute coordinate to get block-relative
        # indexes and offsets.
        (row, column), (row_offset, column_offset) = self._unscale_coord(coord)

        # From here, the logic is nearly identical to the CSR representation.
        row_start = self.row_extents[row]
        row_end = self.row_extents[row + 1]

        try:
            index = self.column_indices.index(column, row_start, row_end)
        except ValueError:
            return 0
        
        # The index that we have is the index in the blocks array where our
        # block is. The positions of the value inside that block is just
        # the row and column offset we got by unscaling the input coordinate.
        return self.blocks[index][row_offset][column_offset]
            
    def __setitem__(self, coord, new_value):
        # Setting an existing value is quite similar to CSR.
        (row, column), (row_offset, column_offset) = self._unscale_coord(coord)

        row_start = self.row_extents[row]
        row_end = self.row_extents[row + 1]
        column_indices = self.column_indices[row_start:row_end]
        if column in column_indices:
            index = self.column_indices.index(column, row_start, row_end)
            # We want to set just one value within the block, and the input
            # coordinate dictates where within that block once we find the 
            # block to change.
            self.blocks[index][row_offset][column_offset] = new_value
        else:
            # But we can't just add a value if it's not in an existing block.
            # We have to initialize a a new empty block and add it to our blocks.
            new_block = [
                [0 for _ in range(self.block_size[1])]
                for _ in range(self.block_size[0])
            ]
            new_block[row_offset][column_offset] = new_value
            self.blocks.insert(row_end, new_block)
            self.column_indices.insert(row_end, column)
            for row_index in range(row + 1, self._unscale_row(self.shape[0])[0] + 1):
                self.row_extents[row_index] += 1
            
    def __len__(self):
        # Our len here is a bit more complicated! It's the number of blocks
        # times the number of values inside each block.
        return len(self.blocks) * self.block_size[0] * self.block_size[1]


data = [
    [[1,  2],  [3,  4]],
    [[5,  6],  [7,  8]],
    [[9,  10], [11, 12]],
    [[13, 14], [15, 16]],
    [[17, 18], [19, 20]],
    [[21, 22], [23, 24]]
]
row_extents = [0, 2, 3, 6]
column_indices = [0, 2, 2, 0, 1, 2]

naive_bsr = NaiveBSRMatrix(data, row_extents, column_indices, shape=(6, 6))

naive_bsr.display()
$$\left[\begin{matrix}1 & 2 & 0 & 0 & 5 & 6\\3 & 4 & 0 & 0 & 7 & 8\\0 & 0 & 0 & 0 & 9 & 10\\0 & 0 & 0 & 0 & 11 & 12\\13 & 14 & 17 & 18 & 21 & 22\\15 & 16 & 19 & 20 & 23 & 24\end{matrix}\right]$$
In [50]:
naive_bsr[4, 1]
Out[50]:
$$14$$
In [51]:
naive_bsr[2, 2]
Out[51]:
$$0$$
In [52]:
naive_bsr[3, 2] = 11

naive_bsr.display()
$$\left[\begin{matrix}1 & 2 & 0 & 0 & 5 & 6\\3 & 4 & 0 & 0 & 7 & 8\\0 & 0 & 0 & 0 & 9 & 10\\0 & 0 & 11 & 0 & 11 & 12\\13 & 14 & 17 & 18 & 21 & 22\\15 & 16 & 19 & 20 & 23 & 24\end{matrix}\right]$$

DIA Representation

DIA format, short for diagonal, represents the data as a series of vectors along different diagonals, the diagonals themselves being indicated by relative offsets from the main diagonal.

The identity matrix provides a simple example.

In [53]:
identity_10 = np.matrix(np.identity(n=10), dtype=np.int64)

display_matrix(identity_10)
$$\left[\begin{matrix}1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0\\0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0\\0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0\\0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0\\0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0\\0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0\\0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0\\0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0\\0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0\\0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1\end{matrix}\right]$$

We can construct the scipy sparse version by passing this dense matrix to the dia_matrix constructor.

In [54]:
mat_dia = sparse.dia_matrix(identity_10)

mat_dia
Out[54]:
<10x10 sparse matrix of type '<class 'numpy.int64'>'
	with 10 stored elements (1 diagonals) in DIAgonal format>

Looking at its canonical representation, we find that we have an array of our main diagonal and a single offset value of zero.

In [55]:
mat_dia.data, mat_dia.offsets
Out[55]:
(array([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]), array([0], dtype=int32))

To get a better handle on how this works, let's make a $5 \times 5$ matrix which has $1$ on its main diagonal, even numbers above it, odd numbers below it, up to $5$.

In [56]:
data = [
    [1, 2, 4, 0, 0],
    [3, 1, 2, 4, 0],
    [5, 3, 1, 2, 4],
    [0, 5, 3, 1, 2],
    [0, 0, 5, 3, 1]
]
mat_dia = sparse.dia_matrix(data, shape=(5, 5))

display_matrix(mat_dia.todense())
$$\left[\begin{matrix}1 & 2 & 4 & 0 & 0\\3 & 1 & 2 & 4 & 0\\5 & 3 & 1 & 2 & 4\\0 & 5 & 3 & 1 & 2\\0 & 0 & 5 & 3 & 1\end{matrix}\right]$$
In [57]:
mat_dia.data, mat_dia.offsets
Out[57]:
(array([[5, 5, 5, 0, 0],
        [3, 3, 3, 3, 0],
        [1, 1, 1, 1, 1],
        [0, 2, 2, 2, 2],
        [0, 0, 4, 4, 4]]), array([-2, -1,  0,  1,  2], dtype=int32))

Looking inside this one we can see that our diagonals are each in their own array, with an array indicating their offsets from the main diagonal—negative being below, positive being above, and 0 being the main diagonal.

In [58]:
class NaiveDIAMatrix(NaiveSparseMatrix):
    
    def __init__(self, diagonals, offsets, **kwargs):
        self.diagonals = diagonals
        self.offsets = offsets
        super().__init__(**kwargs)
        
    def __iter__(self):
        # Iterate a range (0, rows) to iterate by row index.
        for row in range(self.shape[0]):
            # Enumerate the offsets with index so that we can use that
            # index to access the diagonals.
            for index, offset in enumerate(self.offsets):
                # Whether a diagonal is visible is relative to the current row.
                # All the positive-offset diagonals are visible in the first row,
                # and all the negative-offset diagonals are visible in the last
                # row. The inequality check below handles those cases and the
                # in-between cases.
                if row * -1 <= offset < self.shape[1] - row:
                    yield row, row + offset, self.diagonals[index][row + offset]

    def __getitem__(self, coord):
        # As mentioned above, offsets are relative to row. In fact, the offset
        # is equal to the column index minus the row index. So we can calculate
        # which diagonal a provided coordinate would belong to by calculating
        # that offset, which we use to check the offsets array to find the value.
        row, column = coord
        offset = column - row
        try:
            index = self.offsets.index(offset)
        except ValueError:
            return 0
        
        return self.diagonals[index][column]
            
    def __setitem__(self, coord, new_value):
        # Setting and item is similar to getting an item, so long as the diagonal
        # already exists.
        row, column = coord
        offset = column - row
        if offset in self.offsets:
            index = self.offsets.index(offset)
            self.diagonals[index][column] = new_value
        # If the diagonal does not yet exist, we have to create an empty one and
        # set the provided index to the provided value.
        else:
            new_diagonal = [0 for _ in range(self.shape[0])]
            new_diagonal[column] = new_value
            self.diagonals.append(new_diagonal)
            self.offsets.append(offset)
            
    def __len__(self):
        # Our len here is a bit more complicated! It's the number of blocks
        # times the number of values inside each block.
        return sum(self.shape[0] - abs(offset) for offset in self.offsets)


diagonals = [
    [5, 5, 5, 0, 0],
    [3, 3, 3, 3, 0],
    [1, 1, 1, 1, 1],
    [0, 2, 2, 2, 2],
    [0, 0, 4, 4, 4]
]

offsets = [-2, -1,  0,  1,  2]
naive_dia = NaiveDIAMatrix(diagonals, offsets, shape=(5, 5))

naive_dia.display()
$$\left[\begin{matrix}1 & 2 & 4 & 0 & 0\\3 & 1 & 2 & 4 & 0\\5 & 3 & 1 & 2 & 4\\0 & 5 & 3 & 1 & 2\\0 & 0 & 5 & 3 & 1\end{matrix}\right]$$
In [59]:
naive_dia[3, 4]
Out[59]:
$$2$$
In [60]:
naive_dia[2, 2]
Out[60]:
$$1$$
In [61]:
naive_dia[4, 0]
Out[61]:
$$0$$
In [62]:
len(naive_dia)
Out[62]:
$$19$$
In [63]:
naive_dia[4, 0] = 9

len(naive_dia)
Out[63]:
$$20$$
In [64]:
naive_dia.display()
$$\left[\begin{matrix}1 & 2 & 4 & 0 & 0\\3 & 1 & 2 & 4 & 0\\5 & 3 & 1 & 2 & 4\\0 & 5 & 3 & 1 & 2\\9 & 0 & 5 & 3 & 1\end{matrix}\right]$$

Conclusion

Sparse matrices might seem mystifying at first glance, but they're straightforward enough that every flavor of representation available in scipy.sparse is easy enough to read and write using (mostly) pure Python.

That being said, there's a lot more to learn about sparse matrices representations — particularly, how the various representations can be leveraged to optimize mathematical operations and the complexity of constructing and converting between these representations. I wanted to provide examples in this post of how sparse matrices naturally arise from certain applications and I still hope to do so in a future post.

Thanks for reading this far! You can ask questions, file corrections or otherwise make noise towards me on Twitter and I'll edit this post to link to the HN thread when and if it comes into existence.