Friday, 8 January 2016

A Ruby Program On Logistic Regression

I wrote, or rather translated, a small Ruby program for logistic regression. There is a GitHub repository, "python-ML-minimal", with programs written for the assignments in Prof. Andrew Ng's machine learning class. The cool thing is that they use only core language features, with no dependency on external libraries.

That is good for learning and understanding the basic concepts. But Python has a construct called the list comprehension, which reads like a for loop written in reverse: the expression comes first and the loop after it. I don't find this construct easy to follow when I am reading code to understand the logic behind it. So I promptly translated one program, the one on logistic regression, into Ruby.

Ruby isn't known as a primary language for math. But its functional syntax for operating on collections and its ability to handle formatted files cleanly make it an elegant choice for understanding what an algorithm is doing. Further, in this particular program I used straightforward loops and array appends in place of the Python list comprehensions, which makes it doubly easy to follow: no dependency on non-core libraries and no list comprehensions.


But first a recap: in logistic regression, you aim to classify an entity into one of two mutually exclusive classes, such as spam / not-spam or sick / not-sick. The starting concept is the odds, the ratio whose numerator is the probability of the event of interest and whose denominator is 1 minus that probability. The log of the odds is called the logit, and the model expresses it as a weighted sum of the input features. Feeding that value into the logistic (sigmoid) function, the inverse of the logit, gives a probability in the range 0 to 1. Using a threshold for the probability, say 0.5, you do the classification.
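To make the recap concrete, here is a minimal sketch in plain Ruby. The helper names (odds, logit, sigmoid, classify) and the probability 0.8 are made up for illustration and are not part of the translated program.

```ruby
# Illustrative helpers; these names are hypothetical, not from the program.

# Odds of an event with probability prob: prob / (1 - prob)
def odds(prob)
  prob / (1.0 - prob)
end

# Logit: the natural log of the odds
def logit(prob)
  Math.log(odds(prob))
end

# Logistic (sigmoid) function: the inverse of the logit,
# mapping any real number back to a probability between 0 and 1
def sigmoid(z)
  1.0 / (1.0 + Math.exp(-z))
end

# Classify as 1 when the probability reaches the threshold, else 0
def classify(probability, threshold = 0.5)
  probability >= threshold ? 1 : 0
end

p_event = 0.8              # made-up probability
z = logit(p_event)         # log-odds of the event
puts odds(p_event)         # roughly 4, i.e. odds of 4 to 1
puts sigmoid(z)            # recovers roughly 0.8
puts classify(sigmoid(z))  # class 1, since 0.8 is above the 0.5 threshold
```

Going from a probability to log-odds and back again shows that the logit and the sigmoid undo each other, which is why a model that predicts log-odds can still report a probability.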

The data file required to run the program is “logistic_regression_data.txt”, and here are a few sample rows:
34.62365962451697,78.0246928153624,0
30.28671076822607,43.89499752400101,0
35.84740876993872,72.90219802708364,0
60.18259938620976,86.30855209546826,1
79.0327360507101,75.3443764369103,1

Each line is a training example. In machine learning, you have pre-classified data, which is called training data, for some reason that I don’t know of. In my programmer-speak it is a "known" entity.

The program reads the file line by line, splits each line, and stores the fields up to the second-to-last one as a feature array and the last field as the label. The feature arrays are collected in x and the labels in y, so x holds all the training examples, which in my programmer-speak is the array of known entities. m is the number of training examples, and n is the number of features, which in my programmer-speak is the number of attributes of each entity.
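The reading step can be sketched roughly as follows. This is not the program's own code: the helper name parse_examples is made up, and it takes an array of lines so that the sample can use two of the rows shown above instead of the real file. Here x holds one feature array per example, y holds the labels, and m is simply the count of examples, which is how the functions further down use it.

```ruby
# Hypothetical helper, for illustration only: split each comma-separated
# line into a feature array (all fields but the last) and a label (the last).
def parse_examples(lines)
  x = []  # one feature array per training example
  y = []  # one label per training example
  lines.each do |line|
    fields = line.strip.split(',')
    next if fields.empty?            # skip blank lines
    x << fields[0...-1].map(&:to_f)  # fields up to the second-to-last
    y << fields[-1].to_i             # last field is the class, 0 or 1
  end
  [x, y]
end

sample = [
  "34.62365962451697,78.0246928153624,0",
  "60.18259938620976,86.30855209546826,1"
]
x, y = parse_examples(sample)
m = x.length     # number of training examples
n = x[0].length  # number of features per example
puts "m=#{m} n=#{n} y=#{y.inspect}"
```

With the real data file redirected to standard input, the same helper could be fed STDIN.readlines instead of the sample array.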

The program then calls the function scalefeatures, passing x, m and n as arguments. What does scalefeatures do? It first calculates the mean and standard deviation of the data. Then it does feature scaling, a.k.a. data normalization.

You do data normalization when different variables have very different ranges of values. For example, age will be a two-digit number whereas salary can be a five- or six-digit number. Hence you bring them all to comparable values by applying one of various normalization techniques. The one used here is feature scaling that converts the input data to standard scores: subtract the mean of each feature and divide by its standard deviation.
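As a rough illustration of standard scores, here is a small sketch. It is not the program's scalefeatures function: the name standardize, the made-up ages and salaries, and the choice of the population standard deviation (dividing by m rather than m - 1) are all assumptions of mine.

```ruby
# Hypothetical sketch of standard-score scaling; not the program's scalefeatures.
# x is an array of examples, each an array of n feature values.
def standardize(x)
  m = x.length
  n = x[0].length
  # Mean of each feature column
  means = Array.new(n) { |j| x.sum { |row| row[j] } / m.to_f }
  # Population standard deviation of each feature column (assumption: divide by m)
  stds = Array.new(n) do |j|
    variance = x.sum { |row| (row[j] - means[j])**2 } / m.to_f
    Math.sqrt(variance)
  end
  scaled = x.map do |row|
    row.each_with_index.map do |value, j|
      # A constant column has zero deviation; map it to 0.0 instead of dividing by zero
      stds[j].zero? ? 0.0 : (value - means[j]) / stds[j]
    end
  end
  [scaled, means, stds]
end

# Made-up ages and salaries, mirroring the example above
people = [[25, 30_000], [35, 60_000], [45, 90_000]]
scaled, means, stds = standardize(people)
puts means.inspect
puts stds.inspect
puts scaled.inspect
```

After scaling, the age and salary columns hold the same small values even though the raw salaries are about a thousand times larger than the ages, which is the point of the exercise.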

The remaining part of the program is expressed in three functions, each corresponding to a mathematical formula in the logistic regression lesson, as given below:
h_logistic_regression
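def h_logistic_regression(theta, x, n)
    # Weighted sum of the features: i runs from 0 to n,
    # so theta and x each hold n + 1 entries
    theta_t_x = 0
    0.upto n do |i|
        theta_t_x += theta[i] * x[i]
    end

    # Ruby's Math.exp returns Infinity on overflow instead of raising,
    # so no rescue is needed; k just becomes 0.0 for very negative inputs
    k = 1.0 / (1 + Math.exp(-theta_t_x))

    # Keep k strictly between 0 and 1 so that Math.log
    # in the cost function stays finite
    if k >= 1.0
        k = 0.99999
    elsif k <= 0.0
        k = 0.00001
    end

    return k
end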
gradientdescent_logistic
def gradientdescent_logistic(theta, x, y, m, n, alpha, iterations)
    0.upto iterations-1 do |i|
        # Work on a copy so every theta[j] is updated from the same old theta
        thetatemp = theta.clone
        0.upto n do |j|
            # Sum the prediction error times feature j over all m examples
            summation = 0.0
            0.upto m-1 do |k|
                summation += (h_logistic_regression(theta, x[k], n) - y[k]) *
                             x[k][j]
            end
            thetatemp[j] = thetatemp[j] - alpha * summation / m
        end
        theta = thetatemp.clone
    end
    return theta
end
cost_logistic_regression
def cost_logistic_regression(theta, x, y, m, n)
    summation = 0.0
    0.upto m-1 do |i|
        # Compute the prediction once per example and reuse it
        h = h_logistic_regression(theta, x[i], n)
        summation += y[i] * Math.log(h) + (1 - y[i]) * Math.log(1 - h)
    end
    return -summation / m
end
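For reference, the three functions above appear to correspond to the standard logistic regression formulas below. They are written in the code's own indexing for the features (\(\theta\) and the features run from 0 to n); the examples are numbered 1 to m here, where the code counts from 0 to m-1. The notation is mine, not copied from the lesson.

```latex
% Hypothesis (h_logistic_regression)
h_\theta(x) = \frac{1}{1 + e^{-\theta^T x}}, \qquad \theta^T x = \sum_{i=0}^{n} \theta_i x_i

% Gradient descent update, applied simultaneously for j = 0, \dots, n (gradientdescent_logistic)
\theta_j := \theta_j - \frac{\alpha}{m} \sum_{k=1}^{m} \left( h_\theta(x^{(k)}) - y^{(k)} \right) x_j^{(k)}

% Cost (cost_logistic_regression)
J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log h_\theta(x^{(i)}) + \left(1 - y^{(i)}\right) \log\left(1 - h_\theta(x^{(i)})\right) \right]
```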
The program starts with an initial array of \(\theta\) values, all set to 0. It computes the initial cost with these, then applies the gradient descent algorithm for 4000 iterations to arrive at a final cost. It then prints the initial cost and the final cost.

Here's the full program:
You run it as:
ruby logistic-regression.rb < logistic_regression_data.txt
The data file is available -> here
