Today, analytics is at the forefront of innovation. From fighting global pandemics to managing businesses, it drives growth and optimization everywhere.
In the past, tools like Microsoft Office 365’s Excel popularized the use of data and simplified data analysis for everyone. Now, leveraging data has become essential in every field, to the point that simple data analysis is no longer enough.
Data often has underlying patterns that are impossible to detect using Excel charts alone. We need tools that can analyze data in depth and identify the hidden relationships within it. This is where machine learning helps us. What’s machine learning, you ask? Well, that is exactly what we will discuss today.
Machine learning is a set of statistical models that extract deeper insights from data than conventional data analysis tools can. Of course, coding statistical models is not easy. People feel intimidated when you ask them to define a statistical model, let alone code it.
Thanks to Python, this no longer has to be the case. Python is a robust, easy-to-learn programming language used by data scientists all over the world. It has a rich collection of libraries that allow you to implement machine learning algorithms. Conveniently, you can use these algorithms without diving into the statistical quagmire underneath.
In this article, we will introduce you to machine learning and teach you how you can use Python to create your first machine learning project.
What is Machine Learning?
In theory, machine learning is a subset of computer science that uses data and algorithms to mimic the way humans learn. You see, computers are not like humans. They can only do what we explicitly program them to do.
What makes machine learning unique is that it programs machines to learn as we do. Through machine learning, computers and software can improve using new information.
Although machine learning is extremely useful in the modern world, it can’t teach computers to behave as humans do. Your computer can’t learn to dribble with a soccer ball while also learning to play chess.
A machine learning algorithm only trains a program to tackle one practical problem at a time. In most cases, this problem is studying, analyzing, and finding patterns in large amounts of data.
Structured vs. Unstructured Data in Machine Learning (ML)
How we process data in machine learning depends upon the type of data we are analyzing. There are two types of data: structured and unstructured. Let’s discuss structured data first.
Structured data is usually stored in structures such as Excel sheets, tables, or databases. We can easily map it into designated fields. It consists of information like names, zip codes, phone numbers, bank balances, or other data.
Think of data collected during surveys. The response to each question in a survey is saved in a separate column. Each survey respondent acts as a row, whereas every question serves as a column of the dataset. In machine learning, each column is known as a separate ‘feature’.
Data from the 2017 Developer Survey is a great example of structured data. The dataset has several columns, including Respondent, Professional, ProgramHobby, and Country. Each of these columns represents a question asked during the survey.
For instance, in the ‘Professional’ column, the respondents answered if they coded professionally. Likewise, ProgramHobby shows whether a respondent programs as a hobby or not. Each column is essential for analysis as it provides us with a unique bit of information.
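To make the row and column picture concrete, here is a minimal sketch of a survey table in pandas. The column names are the ones mentioned above, but the three responses are invented purely for illustration and are not taken from the real survey.

```python
import pandas as pd

# Hypothetical survey responses; the values are made up for illustration.
survey = pd.DataFrame(
    {
        "Respondent": [1, 2, 3],
        "Professional": ["Professional developer", "Student", "Professional developer"],
        "ProgramHobby": ["Yes", "No", "Yes"],
        "Country": ["United States", "United Kingdom", "Canada"],
    }
)

# Each row is one respondent; each column is one feature (one survey question).
n_rows, n_columns = survey.shape
feature_names = list(survey.columns)

print(n_rows, n_columns)
print(feature_names)
```

Every question becomes a named column, which is what makes this kind of data easy to map into designated fields.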
On the other hand, unstructured data has no predefined format or fixed set of fields. It is also usually unlabeled, which means it doesn’t have an output variable. This makes raw unstructured data a poor fit for models that need output data to improve.
The vast majority of data we generate is unstructured, with Forbes estimating it to be up to 90%. Social media and websites, emails, mobile and communications, text files, and media are the primary sources of unstructured data. In other words, unstructured data can be a combination of anything from web pages, emails, and text messages to audio and video.
Because unstructured data exists in several formats, it’s hard for traditional software to ingest, process, and analyze it. Most tools can only perform simple content searches across textual data, and nothing else. This is why we need different machine learning techniques for structured and unstructured data.
Types of Machine Learning
There are three main types of machine learning: supervised, unsupervised, and reinforcement learning.
Supervised Machine Learning
Supervised learning is an approach to machine learning that uses labeled data for training. The algorithm’s job is to learn the ‘common theme’ inside the data. After identifying this ‘theme’ or pattern, the supervised algorithm uses it as a reference for future predictions.
We can further divide supervised learning into classification and regression. Classification is useful when the output variable (also called the label) is a category. Think of a dataset outlining the success of an ad campaign on the web. The output variable here is whether a person clicked the ad or not. These are two distinct categories.
The supervised algorithm first learns what’s common among people who clicked the ad and what’s common among people who didn’t. After analyzing the dataset and the output variable, the algorithm can classify new audience members into one of these two categories.
Similarly, in a survey asking people their preference for tea or coffee, the output variable has three values: people who prefer tea, those who like coffee, and those who drink neither of them.
One estimate says 84% of Britain’s population consumes tea every day. If a supervised algorithm is trained on data that reflects this, it will classify around 84% of people as tea drinkers. People who drink coffee or neither of the beverages will be classified into separate categories. The algorithm relies heavily on the training data.
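As a rough illustration of the ad-click example, the sketch below trains a classifier on a tiny invented dataset. The single feature (minutes spent on the page) and all values are assumptions made for this example; a real campaign dataset would have many more rows and features.

```python
from sklearn.linear_model import LogisticRegression

# Invented data: minutes each visitor spent on the page (feature)
# and whether they clicked the ad (label: 1 = clicked, 0 = did not click).
minutes_on_page = [[1], [2], [3], [4], [10], [11], [12], [13]]
clicked = [0, 0, 0, 0, 1, 1, 1, 1]

classifier = LogisticRegression()
classifier.fit(minutes_on_page, clicked)

# Classify two new visitors the model has not seen before.
new_predictions = classifier.predict([[2.5], [11.5]])
print(new_predictions)
```

The model learns from the labeled examples and then assigns each new visitor to one of the two categories.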
On the other hand, in a regression problem, the output variable isn’t a collection of categories, but a single continuous value. Imagine that your rich uncle has invested in a house and wants to learn how its price will change in two years. A supervised regression algorithm is perfect for estimating the future house price. All it needs is the historical data of the neighborhood.
Example of neighborhood historical data.
Source: House Prices Kaggle Notebook
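The house price idea can be sketched with a simple linear regression. The yearly prices below are invented and perfectly linear, so this only illustrates the mechanics; real neighborhood data would be noisier and would need more features.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Invented yearly prices for one house (in thousands); not real market data.
years = np.array([[2015], [2016], [2017], [2018], [2019]])
prices = np.array([200.0, 210.0, 220.0, 230.0, 240.0])

model = LinearRegression()
model.fit(years, prices)

# Estimate the price two years after the last observation.
predicted_price = model.predict(np.array([[2021]]))[0]
print(round(predicted_price, 1))
```

Unlike the classification case, the output here is a single continuous number rather than a category.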
You will notice that supervised learning only uses labeled data. That’s because a supervised learning algorithm can only learn through the output variable.
Supervised learning can help us solve several real-world problems. From disease detection to predicting stock market prices, supervised learning can do it all.
Unsupervised Machine Learning
Unsupervised learning algorithms process unlabeled data, which often includes unstructured data. They can discover the differences and similarities between sets of data, making it easier to identify data groupings and hidden patterns without manual intervention.
Contrary to supervised learning, unsupervised learning doesn’t need labels. Instead, the algorithm groups records with similar attributes together into separate ‘clusters.’
Unsupervised learning is excellent at discovering cross-selling strategies, and image and pattern recognition. It can also help data scientists during exploratory data analysis. Exploratory data analysis is a phase where we analyze data before applying machine learning.
Imagine an outbreak of avian flu has spread in your local district. You have a dataset listing all the pets registered in the district. An unsupervised algorithm will analyze the dataset and break it down into separate clusters.
Residents who own cats, dogs, turtles, snakes, rabbits, birds, etc. will be separated into different clusters. This will make it easier to identify which residents currently own a bird, allowing you to inform them about the ongoing outbreak. The algorithm will help residents take necessary precautions on time.
Overall, unsupervised learning algorithms are ideal for resolving clustering problems.
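Here is a minimal clustering sketch using k-means, one common unsupervised algorithm (the article does not name a specific one, so this choice is an assumption). The six points are invented and deliberately form two obvious groups.

```python
import numpy as np
from sklearn.cluster import KMeans

# Invented 2-D points that form two well-separated groups; no labels are given.
points = np.array(
    [[1.0, 1.0], [1.2, 0.8], [0.8, 1.1], [8.0, 8.0], [8.2, 7.9], [7.9, 8.1]]
)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(points)

# Points in the same group receive the same cluster id.
print(cluster_ids)
```

The cluster ids themselves are arbitrary; what matters is which records end up grouped together.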
Reinforcement Learning
Reinforcement learning is a family of algorithms that use a feedback loop during training. The algorithm takes actions, receives rewards or penalties, and uses this feedback to make more accurate decisions over time. This type of learning is popular for training robots.
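The feedback loop can be sketched with a very small example: a program repeatedly chooses between two actions and learns which one pays off more often. This uses an epsilon-greedy strategy, which is one simple technique among many, and the reward probabilities are hypothetical values chosen for the illustration.

```python
import random

rng = random.Random(0)            # fixed seed so the run is repeatable
reward_probability = [0.2, 0.8]   # hypothetical chance that each action pays off
estimates = [0.0, 0.0]            # running estimate of each action's average reward
pulls = [0, 0]                    # how often each action was tried
epsilon = 0.1                     # fraction of steps spent exploring

for _ in range(2000):
    if rng.random() < epsilon:
        action = rng.randrange(2)                  # explore: pick a random action
    else:
        action = estimates.index(max(estimates))   # exploit: best action so far
    reward = 1.0 if rng.random() < reward_probability[action] else 0.0
    pulls[action] += 1
    # Feedback loop: move the estimate toward the observed reward.
    estimates[action] += (reward - estimates[action]) / pulls[action]

best_action = estimates.index(max(estimates))
print(best_action, pulls)
```

No one tells the program which action is better; it works this out from the rewards it receives.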
Logistic Regression Using Python
Today, we will implement a fundamental supervised learning algorithm in Python called Logistic Regression. It’s fast and somewhat uncomplicated, so it’s great for anyone who wants to understand machine learning.
We will use the Iris Species dataset on Kaggle to implement Logistic Regression.
The Iris dataset comes from a 1936 paper by Ronald Fisher, the British statistician, eugenicist, and biologist.
A Look into Data for Logistic Regression
The Iris dataset on Kaggle consists of 150 entries of iris plants. In total, there are three species of iris plants, with each species having 50 entries in the dataset. Besides an Id column, the dataset has four feature columns for each flower:
- SepalLengthCm
- SepalWidthCm
- PetalLengthCm
- PetalWidthCm
Together, these four measurements help distinguish which species an iris flower belongs to. The Species column is the output variable or label.
Usually, classification algorithms are used to classify data into two categories like Yes or No, people who clicked the ad vs. people who didn’t. This type of classification is called binary classification. However, classification algorithms can also classify data into multiple categories. In machine learning, this type of classification is known as multiclass classification.
Importing the Data
Python processes data through one of its libraries called Pandas. The pandas library imports data in a 2-D structured format called a DataFrame. We will begin the project by importing data through the pandas library.
import pandas as pd

# Raw string, so the backslash in the Windows-style path is not read as an escape
Data = pd.read_csv(r'filepath\iris_Data.csv')
Data.head(150)
- We start by importing the pandas library under the alias pd.
- Pandas (or pd) has a function known as read_csv(). This function helps Python read data from comma-separated files and returns a DataFrame. Here we saved the DataFrame to a variable called Data.
- Replace ‘filepath‘ inside the read_csv function with the file path of the Iris data file in your computer.
- The head(150) function displays the top 150 entries in the Data.
This is the DataFrame of the Iris dataset
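If you don't have the Kaggle CSV at hand, scikit-learn ships its own copy of the Iris data. The sketch below loads that copy as a DataFrame. Note that this is an alternative to the approach above, not the same file: the bundled version has no Id column, uses different column names, and stores the species as the numbers 0, 1, and 2 in a column called target.

```python
from sklearn.datasets import load_iris

# Load the copy of the Iris data bundled with scikit-learn as a DataFrame.
iris = load_iris(as_frame=True)
frame = iris.frame   # four feature columns plus a numeric 'target' column

print(frame.shape)
print(frame.head())

# Count how many flowers belong to each species.
species_counts = frame["target"].value_counts()
print(species_counts)
```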
Divide the Data into Features and Labels
Just to recall, the columns of a dataset are known as features; they are independent variables. On the other hand, the output variable is known as the label and it is dependent on the features. After saving the data on the DataFrame, we will divide it into features and labels.
# Features: every column except the first and the last
x = Data.iloc[:, 1:-1].values
# Label: the last column
y = Data.iloc[:, -1].values
print(x[:15])
print(y[:15])
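- Pandas allows us to access different columns and rows of a DataFrame by position through the iloc indexer. Just pass the row and column positions you want inside the square brackets, and append “.values” to obtain the actual values as an array. Here, 1:-1 selects every column except the first (the Id column in the Kaggle file) and the last, while -1 selects only the last column. We save the features to variable x, and the label is saved to y.
- x has 150 rows of features. We see the first 15 rows in the output.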
- y has the 150 corresponding labels. We can see the first 15 labels in the output.
These are the first 15 rows of the features and labels.
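To see what the iloc slices select, here is a small sketch on a three-row stand-in for the dataset. The column layout is assumed to match the Kaggle file (Id first, Species last), and the three rows are only a sample for illustration.

```python
import pandas as pd

# A three-row stand-in with the assumed Kaggle column layout.
sample = pd.DataFrame(
    {
        "Id": [1, 2, 3],
        "SepalLengthCm": [5.1, 7.0, 6.3],
        "SepalWidthCm": [3.5, 3.2, 3.3],
        "PetalLengthCm": [1.4, 4.7, 6.0],
        "PetalWidthCm": [0.2, 1.4, 2.5],
        "Species": ["Iris-setosa", "Iris-versicolor", "Iris-virginica"],
    }
)

features = sample.iloc[:, 1:-1].values   # all rows, columns 1 up to (not including) the last
labels = sample.iloc[:, -1].values       # all rows, last column only

print(features.shape)
print(labels)
```

The Id column is left out of the features because it only numbers the rows and carries no information about the species.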
Splitting Data into Training and Test Data
During training, a machine learning model gets familiar with the data. When we introduce the same data again, it can make near-perfect predictions. However, the model’s accuracy may drop sharply when we introduce new data. Splitting data into training and test sets helps us measure the model’s performance on data it has not seen before.
from sklearn.model_selection import train_test_split

# Hold out 30% of the rows for testing; fix the seed for repeatable results
train_x, test_x, train_y, test_y = train_test_split(x, y, test_size=0.3, random_state=99)
The scikit-learn (also called sklearn) library is the primary library for machine learning in Python. You will use it several times as you implement machine learning projects. Here we import train_test_split from the model_selection module of sklearn. We use train_test_split to split data into training and test sets.
Here, we pass four arguments to the function:
- Features (x)
- Labels (y)
- test_size (what percentage should we choose as test data, here it’s 0.3 or 30%)
- random_state (you can use any number, but using 99 will help you get the same results as I did).
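The effect of test_size and random_state can be checked with a short sketch. To stay self-contained, it uses the copy of the Iris data bundled with scikit-learn rather than the Kaggle file, so the variable names here are separate from the ones in the project.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

features, labels = load_iris(return_X_y=True)

# Split twice with the same random_state to show the split is repeatable.
first = train_test_split(features, labels, test_size=0.3, random_state=99)
second = train_test_split(features, labels, test_size=0.3, random_state=99)

train_x, test_x, train_y, test_y = first
print(train_x.shape, test_x.shape)

# The test rows are identical across both runs.
same_split = np.array_equal(first[1], second[1])
print(same_split)
```

With 150 rows and test_size=0.3, 45 rows go to the test set and the remaining 105 are used for training.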
Standardizing Data in Machine Learning
Standardization is an important step in machine learning that helps us improve the model’s performance. For now, don’t worry about how it works (we will cover it later). We will rescale every feature so that it has a mean of 0 and a standard deviation of 1.
Many machine learning models train faster or perform better on standardized features. Logistic regression, which scikit-learn fits with regularization by default, is one such model.
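from sklearn.preprocessing import StandardScaler

Standard = StandardScaler()
# Fit the scaler on the training data only, then scale it
train_x = Standard.fit_transform(train_x)
# Reuse the fitted scaler on the test data
test_x = Standard.transform(test_x)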
- From sklearn.preprocessing, we import the StandardScaler class.
- We make an instance of StandardScaler and call it Standard.
- After that, we call the fit_transform() method from Standard. This method fits the scaler to the training data (it learns each feature’s mean and standard deviation) and then returns the scaled data, which we save back to the train_x variable.
- In the end, we scale the test data too. We already fitted the scaler to the training data, so there’s no need to fit it again. Instead, we called the transform() method to return the transformed test data.
The first scaled values of the features training set. Each feature now has a mean of 0 and a standard deviation of 1.
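As a quick check of what StandardScaler does, the sketch below scales a tiny invented column and compares the result with the formula (value - mean) / standard deviation computed by hand.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Invented single-feature data for illustration.
train_values = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
test_values = np.array([[6.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train_values)  # learns mean and std, then scales
test_scaled = scaler.transform(test_values)        # reuses the training mean and std

# The same computation by hand: subtract the mean, divide by the standard deviation.
manual_scaled = (train_values - train_values.mean(axis=0)) / train_values.std(axis=0)

print(train_scaled.ravel())
print(test_scaled.ravel())
print(np.allclose(train_scaled, manual_scaled))
```

The scaled test value is larger than 2, which shows that standardized values are centered around 0 but are not confined to a fixed range.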
Creating and Training the Logistic Regression Model
Next, we will implement the logistic regression model for a multiclass problem:
from sklearn.linear_model import LogisticRegression

# 'ovr' (one-vs-rest) fits one binary classifier per species
Log_Classifier = LogisticRegression(multi_class='ovr', random_state=99)
Log_Classifier.fit(train_x, train_y)
- First, we import LogisticRegression from sklearn.linear_model.
- We create an instance of the LogisticRegression class called Log_Classifier. For this, we call LogisticRegression() and pass multi_class='ovr' and random_state=99 as parameters.
- Logistic regression is, at its core, a binary classifier. Setting multi_class='ovr' (one-vs-rest) extends it to a multiclass problem by fitting one binary classifier per class.
- Again, you can give random_state any number, but using 99 will help you get the same result as I did.
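Newer scikit-learn releases flag the multi_class argument as deprecated. If you run into that warning, one possible alternative is to wrap the model in OneVsRestClassifier, which applies the same one-vs-rest idea. The sketch below is a self-contained illustration on the copy of the Iris data bundled with scikit-learn, not a drop-in replacement for the project code, and its results may differ slightly from the ones reported here.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

features, labels = load_iris(return_X_y=True)

# Wrap a binary logistic regression so that one copy is trained per class.
ovr_classifier = OneVsRestClassifier(LogisticRegression(max_iter=1000))
ovr_classifier.fit(features, labels)

# One fitted binary classifier exists for each of the three species.
n_binary_models = len(ovr_classifier.estimators_)
predictions = ovr_classifier.predict(features[:3])

print(n_binary_models)
print(predictions)
```

Each of the three internal models answers a yes-or-no question ("is this flower species k or not?"), and the wrapper picks the species whose model is most confident.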
Now for the final step.
Analyzing Results of the Logistic Regression Model in Python
from sklearn.metrics import confusion_matrix, accuracy_score

results = Log_Classifier.predict(test_x)
# Both functions expect the true labels first, then the predictions
matrix = confusion_matrix(test_y, results)
print(matrix)
accuracy_score(test_y, results)
- First, we import the confusion_matrix and accuracy_score. Both these metrics tell us how accurate our prediction was.
- We get the predictions from the fitted Log_Classifier and save them to the variable results.
- The confusion_matrix function takes two arguments: the expected values from the dataset and the values we predicted through our model. Therefore, it takes the values from test_y and results. The same variables are used in accuracy_score too.
- We print the confusion matrix, which we saved to the variable matrix.
- In a notebook, the return value of accuracy_score is displayed automatically because it is the last expression in the cell. In a plain script, wrap the call in print() to see it.
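To see how the two metrics are read, here is a small worked example on six hand-written labels. The true and predicted values are invented so that exactly one prediction is wrong.

```python
from sklearn.metrics import confusion_matrix, accuracy_score

# Invented labels: one versicolor flower is misclassified as virginica.
true_labels = ["setosa", "setosa", "versicolor", "versicolor", "virginica", "virginica"]
predicted_labels = ["setosa", "setosa", "versicolor", "virginica", "virginica", "virginica"]

# Rows are the true species, columns are the predicted species (alphabetical order).
example_matrix = confusion_matrix(true_labels, predicted_labels)
example_accuracy = accuracy_score(true_labels, predicted_labels)

print(example_matrix)
print(example_accuracy)
```

Correct predictions sit on the diagonal of the matrix. The single off-diagonal entry is the misclassified flower, and the accuracy is 5 correct predictions out of 6.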
Yes! We achieved an accuracy score of 0.9111, which means the model predicted the right species for 91.11% of the test samples (41 of 45). Congratulations on completing your first machine learning project!
Today, we aimed to introduce readers to machine learning and help them implement a basic machine learning project in Python. Machine learning is a highly specialized field of data science. You need sound statistical knowledge and a firm understanding of algorithms to excel in it. Hopefully, this article helped you understand the fundamentals of machine learning.