The Basics of Linear Algebra for Data Scientists

Linear algebra is behind the powerful machine learning & deep learning algorithms we are so familiar with

Oretes
Nov 10, 2020

Linear algebra is a field of mathematics that is widely used in various disciplines. The field of data science also leans on many different applications of linear algebra. This does not mean that every data scientist needs to have an extraordinary mathematical background since the amount of math you will be dealing with depends a lot on your role.

However, a good understanding of linear algebra greatly enhances your grasp of many machine learning algorithms. Above all, linear algebra is essential for really understanding deep learning algorithms.

This article introduces the most important basic linear algebra concepts and shows two relevant data science applications of linear algebra.

Matrices and Vectors

In short, we can say that linear algebra is the ‘math of vectors and matrices’. We make use of such vectors and matrices since these are convenient mathematical ways of representing large amounts of information.

A matrix is an array of numbers, symbols or expressions, made up of rows and columns. A matrix is characterized by the number of rows, m, and the number of columns, n, it has. In general, a matrix of order ‘m x n’ (read: “m by n”) has m rows and n columns. As an example, consider a 2 x 3 matrix A (a concrete version appears in the NumPy sketch below).

We can refer to individual elements of the matrix through its corresponding row and column. For example, A[1, 2] = 2, since in the first row and second column the number 2 is placed.

A matrix with only a single column is called a vector. For example, every column of the matrix A above is a vector. Let us take the first column of matrix A as the vector v.

In a vector, we can also refer to individual elements. Here, we only have to make use of a single index. For example, v[2] = 4, since 4 is the second element of the vector v.
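As a concrete illustration, here is a minimal NumPy sketch of a 2 x 3 matrix A and its first column v. The entries are assumed example values, chosen to be consistent with the element references above (A[1, 2] = 2 and v[2] = 4). Note that NumPy uses zero-based indexing, whereas the mathematical notation above is one-based.

import numpy as np

# Example 2 x 3 matrix; the entries are illustrative, chosen to match A[1, 2] = 2 and v[2] = 4
A = np.array([[1, 2, 3],
              [4, 5, 6]])

# The first column of A as the vector v
v = A[:, 0]

print(A.shape)   # (2, 3): m = 2 rows, n = 3 columns
print(A[0, 1])   # 2  -> A[1, 2] in one-based notation
print(v)         # [1 4]
print(v[1])      # 4  -> v[2] in one-based notation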

Matrix Operations

Our ability to analyze and solve particular problems within the field of linear algebra will be greatly enhanced when we can perform algebraic operations with matrices. Here, the most important basic tools for performing these operations are listed.

(i) Matrix Sums

If A and B are m x n matrices, then the sum A+B is the m x n matrix whose columns are the sums of the corresponding columns in A and B. The sum A+B is defined only when A and B are the same size.

Of course, subtraction of the matrices, A-B, works in the same way, where the columns in B are subtracted from the columns in A.
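As an illustration, the short sketch below (with assumed example matrices) adds and subtracts two matrices of the same size with NumPy:

import numpy as np

A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])

print(A + B)   # entry-wise sum:        [[ 6  8] [10 12]]
print(A - B)   # entry-wise difference: [[-4 -4] [-4 -4]]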

(ii) Scalar Multiples

If r is a scalar, then the scalar multiple of the matrix A is r*A, which is the matrix whose columns are r times the corresponding columns in A.

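A quick NumPy illustration of a scalar multiple (the scalar and matrix are assumed examples):

import numpy as np

r = 3
A = np.array([[1, 2], [3, 4]])

print(r * A)   # every entry of A is multiplied by r: [[ 3  6] [ 9 12]]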

(iii) Matrix-Vector Multiplication

If the matrix A is of size m x n (thus, it has n columns), and u is a vector of size n, then the product of A and u, denoted by Au, is the linear combination of the columns of A using the corresponding entries in u as weights.

Note: The product Au is defined only if the number of columns of the matrix A equals the number of entries in the vector u!

Properties: If A is an m x n matrix, u and v are vectors of size n and r is a scalar, then:

  • A(u + v) = Au + Av;
  • A(r*u) = r*(Au).
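The sketch below (with assumed example values) checks that Au computed by NumPy coincides with the linear combination of the columns of A, using the entries of u as weights:

import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6]])
u = np.array([2, 0, 1])

# Linear combination of the columns of A with the entries of u as weights
combo = 2 * A[:, 0] + 0 * A[:, 1] + 1 * A[:, 2]

print(A @ u)    # [ 5 14]
print(combo)    # [ 5 14] -- the same result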

(iv) Matrix Multiplication

If A is an m x n matrix and B = [b1, b2, …, bp] is an n x p matrix where bi is the i-th column of the matrix B, then the matrix product AB is the m x p matrix whose columns are Ab1, Ab2, …, Abp. So, essentially, we perform the same procedure as in (iii) with matrix-vector multiplication, where each column of the matrix B is a vector.

In other words: AB = A[b1, b2, …, bp] = [Ab1, Ab2, …, Abp].

Note: The number of columns in A must match the number of rows in B in order to perform matrix multiplication.

Properties: Let A be an m x n matrix, let B and C be matrices with sizes such that the sums and products below are defined, and let r be a scalar. Then:

  • A(BC) = (AB)C (associativity);
  • A(B + C) = AB + AC (left distributivity);
  • (B + C)A = BA + CA (right distributivity);
  • r(AB) = (r*A)B = A(r*B);
  • I_m A = A = A I_n, where I_m and I_n are identity matrices.
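The following sketch (with assumed example matrices) verifies that the columns of AB are exactly A times the corresponding columns of B:

import numpy as np

A = np.array([[1, 2],
              [3, 4]])        # 2 x 2
B = np.array([[5, 6, 7],
              [8, 9, 10]])    # 2 x 3

AB = A @ B                    # 2 x 3 product

# Each column of AB equals A times the corresponding column of B
print(np.allclose(AB[:, 0], A @ B[:, 0]))   # True
print(AB)                                   # [[21 24 27] [47 54 61]]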

(v) Powers of a Matrix

If A is an n x n matrix and k is a positive integer, then A^k (A to the power k) is the product of k copies of A: A^k = A * A * … * A (k factors).
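In NumPy, repeated multiplication of a square matrix can be computed with numpy.linalg.matrix_power (the example matrix is assumed):

import numpy as np
from numpy.linalg import matrix_power

A = np.array([[1, 1],
              [0, 1]])

print(matrix_power(A, 3))   # the same as A @ A @ A: [[1 3] [0 1]]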

(vi) Matrix Transpose

Suppose we have a matrix A of size m x n, then the transpose of A (denoted by A^T) is the n x m matrix whose columns are formed from the corresponding rows of A.

Properties: Let A and B be matrices whose sizes are appropriate for the following sums and products. Then:

  • (A^T)^T = A;
  • (A + B)^T = A^T + B^T;
  • (r*A)^T = r*A^T for any scalar r;
  • (AB)^T = B^T A^T.
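A short NumPy check (with assumed example matrices) of the transpose and of the rule (AB)^T = B^T A^T:

import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6]])    # 2 x 3
B = np.array([[1, 0],
              [0, 1],
              [2, 2]])       # 3 x 2

print(A.T)                                 # the 3 x 2 transpose of A
print(np.allclose((A @ B).T, B.T @ A.T))   # True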

Matrix Inverse

Matrix algebra provides tools for manipulating matrices and creating various useful formulas in ways similar to doing ordinary algebra with real numbers. For example, the (multiplicative) inverse of a real number, say 3, is 3^-1, or 1/3. This inverse satisfies the equations 3^-1 * 3 = 1 and 3 * 3^-1 = 1.

This concept can be generalized for square matrices. An n x n matrix A is said to be invertible if there is an n x n matrix C such that CA = I and AC = I,

where I is the n x n identity matrix. An identity matrix is a square matrix with 1’s on the diagonal and 0’s elsewhere. Below, the 5 x 5 identity matrix is shown:

1 0 0 0 0
0 1 0 0 0
0 0 1 0 0
0 0 0 1 0
0 0 0 0 1

Going back to the invertibility principle above, we call the matrix C an inverse of A. In fact, C is uniquely determined by A, because if B were another inverse of A, then B = BI = B(AC) = (BA)C = IC = C. This unique inverse is denoted by A^-1, so that A^-1 A = I and A A^-1 = I.

Properties: Let A and B be invertible n x n matrices. Then:

  • (A^-1)^-1 = A;
  • (AB)^-1 = B^-1 A^-1;
  • (A^T)^-1 = (A^-1)^T.
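With NumPy, the inverse of an invertible square matrix can be computed with numpy.linalg.inv; the example matrix below is assumed:

import numpy as np
from numpy.linalg import inv

A = np.array([[4.0, 7.0],
              [2.0, 6.0]])

A_inv = inv(A)

print(A_inv)
print(np.allclose(A @ A_inv, np.eye(2)))   # True: A A^-1 = I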

Orthogonal Matrix

An orthogonal matrix is a square matrix whose columns and rows are orthogonal unit vectors. That is, an orthogonal matrix is an invertible matrix, let us call it Q, for which Q^T Q = Q Q^T = I.

This leads to the equivalent characterization: a matrix Q is orthogonal if its transpose is equal to its inverse, that is, Q^T = Q^-1.
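A classic example of an orthogonal matrix is a 2D rotation matrix. The sketch below (with an arbitrarily chosen angle) checks both characterizations:

import numpy as np

theta = np.pi / 4                               # arbitrary rotation angle
Q = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

print(np.allclose(Q.T @ Q, np.eye(2)))          # True: the columns are orthonormal
print(np.allclose(Q.T, np.linalg.inv(Q)))       # True: Q^T equals Q^-1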

Applications of Linear Algebra within Data Science

To show the relevance of linear algebra in the field of data science, we briefly go through two relevant applications.

Singular Value Decomposition (SVD)

The singular value decomposition (SVD) is a very important concept within the field of data science. Some important applications of the SVD are image compression and dimensionality reduction. Let us focus on the latter application here. Dimensionality reduction is the transformation of data from a high-dimensional space into a lower-dimensional space, in such a way that the most important information of the original data is still retained. This is desirable because analyzing the data can become computationally intractable when its dimension is too high.

The SVD decomposes a matrix M into a product of three individual matrices:

M = U Σ V^T

where, assuming the matrix M is an m x n matrix:

  • U is an m x m orthogonal matrix of left singular vectors;
  • Σ is an m x n matrix whose leading diagonal entries are the singular values of M (if M has r nonzero singular values, they form an r x r diagonal block D in the upper-left corner of Σ, and all other entries are zero);
  • V is an n x n orthogonal matrix of right singular vectors.

The singular values can be used to understand the amount of variance that is explained by each of the singular vectors. The more variance a singular vector captures, the more information it accounts for. We can therefore keep only as many singular vectors as are needed to capture the amount of variance we wish to retain.

It is possible to calculate the SVD by hand, but this quickly becomes an intensive process as the matrices grow larger, and in practice one is dealing with huge amounts of data. Luckily, we can easily compute the SVD in Python by making use of NumPy. To keep the example simple, we define a 3 x 3 matrix M:

import numpy as np
from numpy.linalg import svd

# define the matrix as a numpy array
M = np.array([[4, 1, 5], [2, -3, 2], [1, 2, 3]])

# decompose M into U (left singular vectors), Sigma (singular values) and V^T (right singular vectors)
U, Sigma, VT = svd(M)

print("Left Singular Vectors:")
print(U)
print("Singular Values:")
print(np.diag(Sigma))
print("Right Singular Vectors:")
print(VT)

Output:

Left Singular Vectors:
[[-0.84705289  0.08910901 -0.52398567]
 [-0.32885778 -0.8623538   0.38496556]
 [-0.41755714  0.49840295  0.75976347]]
Singular Values:
[[7.62729138 0.         0.        ]
 [0.         3.78075422 0.        ]
 [0.         0.         0.72823326]]
Right Singular Vectors:
[[-0.58519913 -0.09119802 -0.80574494]
 [-0.23007807  0.97149302  0.0571437 ]
 [-0.77756419 -0.21882468  0.58949953]]

So, in this small example, the singular values (usually denoted by σ) are σ1 = 7.627, σ2 = 3.781, σ3 = 0.728. Thus, when only using the first two singular vectors, we explain (σ1² + σ2²) / (σ1² + σ2² + σ3²) ≈ 99.3% of the variance!
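Continuing the example above, the share of variance explained by the leading singular vectors can be computed directly from the singular values returned by svd:

import numpy as np
from numpy.linalg import svd

M = np.array([[4, 1, 5], [2, -3, 2], [1, 2, 3]])
_, Sigma, _ = svd(M)

# cumulative share of variance explained by the first k singular vectors
explained = np.cumsum(Sigma**2) / np.sum(Sigma**2)
print(explained)   # the second entry is ~0.993, i.e. the first two vectors explain ~99.3%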

Principal Component Analysis (PCA)

Like the singular value decomposition, principal component analysis (PCA) is a technique to reduce dimensionality. The objective of PCA is to create new, uncorrelated variables, called principal components, that maximize the captured variance. So, the idea is to reduce the dimensionality of the data set while preserving as much ‘variability’ (that is, information) as possible. This problem reduces to solving an eigenvalue-eigenvector problem.

Since eigenvalues and eigenvectors are beyond the scope of this article, we will not dive into the mathematical explanation of PCA.
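Although the mathematics is out of scope here, PCA is straightforward to apply in practice, for example with scikit-learn (which is not covered in this article; the small data set below is an assumed example):

import numpy as np
from sklearn.decomposition import PCA

# Assumed example data set: 6 samples, 3 features
X = np.array([[2.5, 2.4, 0.5],
              [0.5, 0.7, 1.5],
              [2.2, 2.9, 0.3],
              [1.9, 2.2, 0.8],
              [3.1, 3.0, 0.1],
              [2.3, 2.7, 0.6]])

pca = PCA(n_components=2)             # keep the first two principal components
X_reduced = pca.fit_transform(X)      # project the data onto those components

print(X_reduced.shape)                # (6, 2): dimensionality reduced from 3 to 2
print(pca.explained_variance_ratio_)  # variance captured by each component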

Conclusion

Math is everywhere within the field of data science. To be a successful data scientist, you definitely do not need to know all the ins and outs of the math behind each concept. However, to make better decisions when dealing with data and algorithms, you need a solid understanding of the math and statistics behind them. This article focused on the basics of linear algebra, a very important branch of math for a data scientist to understand.
