Data Mining Using Machine Learning
In this digital age, data is everywhere. It is therefore all the more important to recognize the patterns in that data and to see whether they can be put to beneficial use, for example to support decisions. Data mining is the process of identifying patterns in data, then extracting and storing them for future use. In this article, we will see how machine learning algorithms help with data analysis, using logistic regression as the example.
Data mining is a field of computer science where machine learning algorithms help extract information that can be saved as knowledge for future use. Machine learning algorithms can be used to recognize a data pattern and extract the information.
Logistic regression models the probability of a binary outcome with a sigmoid curve, or 'S' curve. Let's look at some of the necessary concepts and then the logistic regression formula.
Probability is the chance of something happening. For example, if we toss a coin, the probability of getting heads is 0.5 (favourable outcomes / possible outcomes, i.e. 1/2). Similarly, the probability of getting a 6 when rolling a die is 1/6.
Odds are the probability of an event occurring divided by the probability of that event not occurring. If p is the probability that the event occurs, then 1 - p is the probability that it does not.
The formula for odds is:
$$\text{odds}(p) = \frac{p}{1 - p}$$
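The odds formula can be sketched in a few lines of Python, reusing the coin and die examples from the text. The function name `odds` and the input check are illustrative assumptions, not part of any library.

```python
def odds(p: float) -> float:
    """Return the odds p / (1 - p) for a probability p with 0 <= p < 1."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must satisfy 0 <= p < 1")
    return p / (1.0 - p)


coin_odds = odds(0.5)    # fair coin: heads is as likely as tails
die_odds = odds(1 / 6)   # rolling a 6: one favourable face against five others

print(f"odds of heads: {coin_odds:.2f}")      # about 1.00, i.e. 1 to 1
print(f"odds of rolling a 6: {die_odds:.2f}") # about 0.20, i.e. 1 to 5
```

Note that odds are not probabilities: they can exceed 1, and they grow without bound as p approaches 1.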
The odds ratio is the ratio between two odds, where the predictor variable x is increased by 1 unit between them:
$$\text{odds ratio} = \frac{\text{odds at } x + 1}{\text{odds at } x}$$
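As a small numeric illustration of comparing two odds, the sketch below assumes hypothetical probabilities of 0.5 before and 0.6 after a one-unit increase in some variable. The function names and the two probabilities are assumptions made only for this example, and the ratio is taken as the odds after the increase divided by the odds before it.

```python
def odds(p: float) -> float:
    """Odds of an event with probability p (0 <= p < 1)."""
    return p / (1.0 - p)


def odds_ratio(p_before: float, p_after: float) -> float:
    """Ratio of the odds after a one-unit increase to the odds before it."""
    return odds(p_after) / odds(p_before)


# Hypothetical probabilities, chosen only for illustration.
ratio = odds_ratio(p_before=0.5, p_after=0.6)
print(f"odds ratio: {ratio:.2f}")  # about 1.50: the odds grew by roughly half
```

A ratio above 1 means the event became more likely after the increase, a ratio below 1 means it became less likely, and a ratio of exactly 1 means no change.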
In logistic regression, the purpose is to find the value of 'p'. The formula for logit(p) is:
$$\text{logit}(p) = \ln(\text{odds}(p)) = \ln\left(\frac{p}{1 - p}\right)$$
where ln(x) is equal to $\log_e(x)$.
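A minimal sketch of the logit as the natural log of the odds, using only Python's standard `math` module. The function name `logit` is an illustrative choice.

```python
import math


def logit(p: float) -> float:
    """Natural log of the odds, ln(p / (1 - p)); defined only for 0 < p < 1."""
    if not 0.0 < p < 1.0:
        raise ValueError("logit is undefined outside 0 < p < 1")
    return math.log(p / (1.0 - p))


# Probabilities below 0.5 give negative values, above 0.5 positive values.
for p in (0.1, 0.5, 0.9):
    print(f"logit({p}) = {logit(p):+.3f}")
```

The logit takes a probability between 0 and 1 and stretches it over the whole real line, which is what lets it be set equal to a linear expression in x.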
In the logit function, the probability values between 0 and 1 lie along the x axis. The requirement is to have these values on the y axis instead, which is achieved by taking the inverse of the logit.
If the inverse is plotted, it gives a sigmoid curve whose p values on the y axis range between 0 and 1. The curve approaches 0 and 1 but never reaches them, because the logit is undefined at p = 0 and p = 1.
Figure 1: Sigmoid curve as shown in a graph
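The inverse of the logit can be sketched as follows. The name `sigmoid` and the two-branch form (which avoids overflow for large negative inputs) are implementation choices made for this example, not something the text prescribes.

```python
import math


def sigmoid(z: float) -> float:
    """Inverse of the logit: maps a real number z to a probability.

    Mathematically the result lies strictly between 0 and 1; in floating
    point it can round to exactly 0.0 or 1.0 for very large |z|.
    """
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)  # safe: z < 0, so exp(z) cannot overflow
    return ez / (1.0 + ez)


# Tabulate a few points along the 'S' curve.
for z in (-6, -3, 0, 3, 6):
    print(f"z = {z:+d}  ->  p = {sigmoid(z):.4f}")
```

The printed values rise from near 0 to near 1 and pass through 0.5 at z = 0, which is the 'S' shape of the curve.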
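The equation of the logistic regression is as follows:
$$\text{logit}(p) = \ln\left(\frac{p}{1 - p}\right) = \theta_0 + \theta_1 x$$
Solving this equation for p results in:
1. $\frac{p}{1 - p} = e^{\theta_0 + \theta_1 x}$
2. $p = (1 - p) \, e^{\theta_0 + \theta_1 x}$
3. $p = e^{\theta_0 + \theta_1 x} - e^{\theta_0 + \theta_1 x} \, p$
4. $p + e^{\theta_0 + \theta_1 x} \, p = e^{\theta_0 + \theta_1 x}$
5. $p \, (1 + e^{\theta_0 + \theta_1 x}) = e^{\theta_0 + \theta_1 x}$
6. $p = \frac{e^{\theta_0 + \theta_1 x}}{1 + e^{\theta_0 + \theta_1 x}}$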
Line 6 gives us the equation for logistic regression, where
θ0 is the intercept and
θ1 is the coefficient of x
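The derivation can be checked numerically: computing p from the final equation and then taking the log of its odds should give back θ0 + θ1x. The sketch below does this for a few inputs. The function name `probability` and the parameter values are arbitrary assumptions chosen only for the check.

```python
import math


def probability(x: float, theta0: float, theta1: float) -> float:
    """Line 6 of the derivation: p = e^(theta0 + theta1*x) / (1 + e^(theta0 + theta1*x))."""
    e = math.exp(theta0 + theta1 * x)
    return e / (1.0 + e)


# Arbitrary illustrative parameters (not fitted to any data).
theta0, theta1 = 0.5, -0.25

for x in (-3.0, 0.0, 2.0, 10.0):
    p = probability(x, theta0, theta1)
    log_odds = math.log(p / (1.0 - p))   # apply the logit to the result
    linear = theta0 + theta1 * x         # right-hand side of the logit equation
    assert math.isclose(log_odds, linear, rel_tol=1e-9, abs_tol=1e-9)
    print(f"x = {x:5.1f}  p = {p:.4f}  logit(p) = {log_odds:+.4f}")
```

If any step of the algebra were wrong, the assertion inside the loop would fail for at least one of the inputs.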
Using this equation, we can now solve our problem statement. To reiterate, the problem in this example is to determine whether a student seeking admission will be successful or not.
To begin with, let's assume there is a historic collection of students' overall marks, and that based on the overall marks a student either gets admission or doesn't. Using the above formula, we determine the probability of a student getting admission. For this example, assume
θ0 = -0.54
θ1 = 0.01
If the student's marks are 800, the probability of the student getting admission is:
$$p = \frac{e^{-0.54 + 0.01 \times 800}}{1 + e^{-0.54 + 0.01 \times 800}}$$
Solving this equation gives us a probability, that is, the percentage chance the student has of getting admission. Here the exponent is -0.54 + 8 = 7.46, so p is approximately 0.9994, or about a 99.94% chance.
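The admission example can be reproduced with a short script. The values of θ0 and θ1 are the ones assumed in the text. The function name `admission_probability` and the extra marks values in the loop are illustrative assumptions added for comparison.

```python
import math

THETA0 = -0.54  # intercept assumed in the example
THETA1 = 0.01   # coefficient for overall marks assumed in the example


def admission_probability(marks: float) -> float:
    """Probability of admission from the logistic regression equation."""
    z = THETA0 + THETA1 * marks
    return math.exp(z) / (1.0 + math.exp(z))


p_800 = admission_probability(800)
print(f"marks = 800 -> p = {p_800:.4f}")  # roughly 0.9994

# Hypothetical additional marks, to show how p moves along the sigmoid curve.
for marks in (0, 54, 200, 400):
    print(f"marks = {marks:3d} -> p = {admission_probability(marks):.4f}")
```

With these assumed parameters, the probability crosses 0.5 at 54 marks, where the exponent θ0 + θ1x equals zero. A binary admit or reject decision could then be made by comparing p with a threshold such as 0.5, although the choice of threshold is a separate modelling decision.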
In this scenario, the outcome was a binary condition. In other scenarios, other algorithms can be used to recognize patterns and analyze them. Data mining is generally used to make decisions based on the analysis of historic data. Say a new product is being launched during the holiday season: based on the history of consumers' interests, combo offers can be launched. The same history can also be used to recommend other items, games, and so forth. Two popular open source data mining tools are WEKA and R.