Friday 8 May 2015

Hucking Cool Matrix Factorization

Matrix Factorization is hucking cool. Three times over. First, because it is a frequently mentioned requirement in data science jobs. For example, take a look at this LinkedIn job description for Lead Machine Learning Scientist and this Apple job description for Data Scientist.

In mathematics, factors are numbers you can multiply together to get another number. So what are the factors of 12? One pair is 3 and 4 because 3 x 4 = 12. By finding two numbers whose product is 12, we have factorized 12.


Matrix factorization is just the same. We take a matrix and find two matrices which, multiplied together, give us our original matrix. So what’s the big deal with this? The big deal is when some of the numbers are zeros. Let’s say we have a 5 x 4 matrix R, that is, a matrix with 5 rows and 4 columns, as follows:
            5,3,0,1
            4,0,0,1
            1,1,0,5
            1,0,0,4
            0,1,5,4
Suppose that this matrix represents 5 users who rated 4 items on a website. Since some users did not rate some items, those unrated cells are given zeros (0s) in the matrix above. When we factorize our matrix R, we get two matrices, say P and Q. Multiplying P and Q gives us a matrix R’ as follows:
5.00839412  2.91392021  3.67942538  0.99803027
3.95686757  2.31050044  3.10775098  0.99685988
1.07421623  0.81558798  5.36465946  4.96127023
0.96378225  0.71250988  4.35147006  3.97222341
1.86851879  1.23430652  4.90605483  4.0382986
The cells that were zeros in the original R now hold numbers in R’. These are interpreted as predicted ratings for the items a user has not yet rated. In other words, you are making predictions for user ratings and can use them for recommendations.
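To make the shapes concrete, here is a tiny NumPy sketch. The rank K = 2 and the random P and Q below are placeholders; the real factors come out of the training procedure described next.

    import numpy

    # The 5 x 4 rating matrix above; zeros mark unrated items.
    R = numpy.array([[5, 3, 0, 1],
                     [4, 0, 0, 1],
                     [1, 1, 0, 5],
                     [1, 0, 0, 4],
                     [0, 1, 5, 4]], dtype=float)

    K = 2                           # number of latent factors (a free choice)
    P = numpy.random.rand(5, K)     # 5 users   x K factors
    Q = numpy.random.rand(K, 4)     # K factors x 4 items

    R_dash = numpy.dot(P, Q)        # same 5 x 4 shape as R, every cell filled
    print R_dash.shape              # (5, 4)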

Yay, we are now talking recommendation systems. These are a staple application in the data analytics space. That’s the second hucking cool thing about matrix factorization.

Performing the factorization involves gradient descent, and to refine it, we use regularization to avoid overfitting. Stop here and read the previous sentence once again, preferably out loud. Done? Now proceed.

Man, if you use the terms matrix factorization, gradient descent, regularization, and overfitting all in one sentence, you are a data scientist. And that’s what makes matrix factorization hucking cool thrice over.

So, how is this accomplished? Simple: initialize, loop, calculate, check, and repeat until a stopping condition is met. To elaborate, from “Matrix Factorization: A Simple Tutorial and Implementation in Python”:
First initialize the two matrices with some values, calculate how ‘different’ their product is to M [our R above], and then try to minimize the difference iteratively.
The difference is usually called the error between the estimated rating and the real rating, and it is squared. At each step of the loop, we need to know the gradient at the current values, so we differentiate the squared error equation. This method is called gradient descent, and it aims to find a local minimum of the difference.
Having obtained the gradient, we formulate the update rules and then apply them iteratively until the error converges to a minimum. We can compute the overall error with an equation and use it to decide when to stop the process.
Finally, regularization is introduced into the main equation to avoid overfitting. This is done by adding a parameter β and modifying the squared error accordingly.
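In symbols (following the tutorial's notation, where r_ij is a known rating, p_ik and q_kj are entries of P and Q, α is the learning rate, and β the regularization parameter), the error, the update rules, and the regularized overall error take roughly this form:

    e_{ij} = r_{ij} - \sum_{k=1}^{K} p_{ik} q_{kj}

    p_{ik} \leftarrow p_{ik} + \alpha \left( 2 e_{ij} q_{kj} - \beta p_{ik} \right)
    q_{kj} \leftarrow q_{kj} + \alpha \left( 2 e_{ij} p_{ik} - \beta q_{kj} \right)

    E = \sum_{(i,j):\, r_{ij} > 0} \Big( r_{ij} - \sum_{k=1}^{K} p_{ik} q_{kj} \Big)^{2}
        + \frac{\beta}{2} \sum_{i,k} p_{ik}^{2} + \frac{\beta}{2} \sum_{k,j} q_{kj}^{2}

Training stops when E drops below a small threshold or after a fixed number of steps.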
You can get all the mathematical details in that article. The authors have given an implementation in Python. As usual, I translated it to Ruby. Both the programs are given below.

The Python code uses the NumPy library. For my Ruby program I used NMatrix, available through the gem ‘nmatrix’. On Ubuntu, the steps to install nmatrix, as given on its installation page, are:
  • sudo apt-get install libatlas-base-dev
  • sudo apt-get --purge remove liblapack-dev liblapack3 liblapack3gf
  • sudo gem install nmatrix
It was not just NumPy; there were other Python specifics that had to be translated. The Ruby equivalents I used for the Python keywords and idioms are:

Operation               Python                        Ruby
Matrix element          P[i][j]                       P[i, j]
Generate random values  numpy.random.rand(N,K)        NMatrix.random([n, k])
Dot product             numpy.dot(P,Q)                p.dot q
Matrix transpose        Q.T                           q.transpose
Loop                    for step in xrange(steps):    0.upto steps-1 do |step|
Get ith row of P        P[i, :]                       p.row(i)
Get jth column of Q     Q[:, j]                       q.column(j)
Exponentiation          pow                           **
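For reference, here is the NumPy side of those idioms in one throwaway Python 2 snippet (the array sizes and indices are arbitrary):

    import numpy

    P = numpy.random.rand(5, 2)      # generate random values
    Q = numpy.random.rand(2, 4)

    R_dash = numpy.dot(P, Q)         # dot product
    Qt = Q.T                         # matrix transpose
    elem = P[0][1]                   # matrix element
    row_i = P[0, :]                  # ith row of P (here i = 0)
    col_j = Q[:, 0]                  # jth column of Q (here j = 0)
    sq = pow(2.0, 2)                 # exponentiation, same as 2.0 ** 2

    for step in xrange(3):           # the loop construct used in the program
        pass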

I timed the runs of both programs, 10 times each, on my ageing Compaq laptop running Ubuntu 14.04 LTS. Surprisingly, Ruby (v2.2) was faster than Python (2.7.6) this time. On average, Python took 6s and Ruby took 4.7s, so Ruby took about 21.67% less time.


Code
Python
Ruby
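For orientation, here is a minimal Python sketch of the routine, along the lines of the tutorial; the function name and the defaults for K, steps, alpha and beta are illustrative, not necessarily the ones used for the timings above.

    import numpy

    def matrix_factorization(R, P, Q, K, steps=5000, alpha=0.0002, beta=0.02):
        # Factorize R (zeros = unrated) into P (users x K) and Q (items x K)
        # by gradient descent with regularization.
        Q = Q.T
        for step in xrange(steps):
            for i in xrange(len(R)):
                for j in xrange(len(R[i])):
                    if R[i][j] > 0:              # only observed ratings contribute
                        eij = R[i][j] - numpy.dot(P[i, :], Q[:, j])
                        for k in xrange(K):
                            # update rules: gradient step plus the beta regularization term
                            P[i][k] = P[i][k] + alpha * (2 * eij * Q[k][j] - beta * P[i][k])
                            Q[k][j] = Q[k][j] + alpha * (2 * eij * P[i][k] - beta * Q[k][j])
            # overall regularized squared error; stop once it is small enough
            e = 0
            for i in xrange(len(R)):
                for j in xrange(len(R[i])):
                    if R[i][j] > 0:
                        e = e + pow(R[i][j] - numpy.dot(P[i, :], Q[:, j]), 2)
                        for k in xrange(K):
                            e = e + (beta / 2) * (pow(P[i][k], 2) + pow(Q[k][j], 2))
            if e < 0.001:
                break
        return P, Q.T

    R = numpy.array([[5, 3, 0, 1],
                     [4, 0, 0, 1],
                     [1, 1, 0, 5],
                     [1, 0, 0, 4],
                     [0, 1, 5, 4]], dtype=float)

    K = 2
    P = numpy.random.rand(len(R), K)
    Q = numpy.random.rand(len(R[0]), K)

    nP, nQ = matrix_factorization(R, P, Q, K)
    print numpy.dot(nP, nQ.T)        # the filled-in matrix R'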
I have given below the Wikipedia definitions of all the hucking cool data science terms used in this blog post --

Matrix Factorization
In the mathematical discipline of linear algebra, a matrix decomposition or matrix factorization is a factorization of a matrix into a product of matrices. There are many different matrix decompositions; each finds use among a particular class of problems.

Gradient Descent
Gradient descent is a first-order optimization algorithm. To find a local minimum of a function using gradient descent, one takes steps proportional to the negative of the gradient (or of the approximate gradient) of the function at the current point.

Regularization
Regularization, in mathematics and statistics and particularly in the fields of machine learning and inverse problems, refers to a process of introducing additional information in order to solve an ill-posed problem or to prevent overfitting.

Overfitting
In statistics and machine learning, overfitting occurs when a statistical model describes random error or noise instead of the underlying relationship. Overfitting generally occurs when a model is excessively complex, such as having too many parameters relative to the number of observations. A model that has been overfit will generally have poor predictive performance, as it can exaggerate minor fluctuations in the data.
