Machine learning (ML) is the study of computational methods and construction of computer programs that automatically improve performance based on experience. ML is the most multi-disciplinary field you can ever imagine. It uses ideas from Artificial Intelligence (AI), Soft-Computing, Probability, Statistics, Information Theory and Signal Processing, Computational Complexity, Computational Mathematics, Philosophy, Control Systems Theory, Cognitive Psychology, Biology, Economics, Linguistics, Operations Research (OR), Physics, and other areas that have been explored in the last decade.
The idea of creating intelligent machines has attracted humans since ancient times. Today, with the huge development of computers and 50 years of research into AI programming techniques, the dream of intelligent machines is closer to becoming a reality. The successful application of machine learning (ML) recently involves data-mining programs that learn to monitor and detect international money laundering or fraudulent credit card transactions, information filtering systems that learn the user’s reading preferences, and autonomous vehicles that learn to drive on their own on public highways in the presence of other vehicles (note: this is not a remote control vehicle).
This field has long held potential for automating the time-consuming and expensive process of knowledge acquisition, but only in the past decade have techniques from machine learning started to play a significant role in industry.
Machine learning (ML) has just been in widespread use in different industries within the last seven or eight years. This is the approximate length of time that the Internet has been widely adopted. The use of machine learning is expected to grow exponentially over the coming years, and any company that is going to keep a tap on this revolutionary technology will produce software products that are very competitive in the market.
Humans learn by being taught, that is, by a teacher; or they learn by self-discovery, as in the case of a researcher. A physics teacher at a high school teaches Newton’s Laws of Motion to his/her students. The students learn all these concepts of physics from the teacher. A group of Ph.D. computer scientists or a group of Ph.D. physicists at Cal Tech, for example, do their learning by self-discovery. They read ACM computer journals or the Physics Review Letters about peer publications in their respective fields and figure out how to discover things that have not been done before (nor been published before).
Computer scientists and software developers, in general, apply the notion of learning to machines, as well. Machines can learn by being taught (a computer programmer, or an expert in a specific domain such as a lawyer, doctor, and so forth) or learn from self-discovery by observing data (such as data from a marketing database, patients’ management database, and so on) and being able to discover and extract predictive hidden information. They also are able to acquire new knowledge by predicting the outcome of new cases that are encountered during the process of learning.
Examples of learning in machines is Artificial Neural Network (ANN). It is a type of mathematical algorithm implemented in different types of software applications. The software systems, or specifically ANN, are fed in data for classifications, such as lung cancer patient records in the last five years from a specific hospital. There is no prior knowledge at all about the patient records’ data. The task of ANN is to discover patterns and classify new instances of the data (such as newly admitted patients whose medical conditions have not been accurately determined from an initial diagnosis) according to patterns it has discovered from the training data, during the learning stage. So, the Artificial Neural Network (ANN) is a paradigm of machine learning that learns by self-discovery. In contrast to ANN, another method of machine learning called K-Nearest Neighbour or KNN for short (Case-Based Reasoning—CBR, a more specialised form of KNN), is learn-by-memorising. This is a form of learning by being taught, so prior knowledge is needed. For machines to use CBR, a database of knowledge about known instances is required; that is, the memory of the machine has to be taught or know about external world facts before it can classify new instances of knowledge that recently arrived. (There is more on ANN, KNN, and CBR later.)
For machines, the term “learning” usually corresponds to a designed model, such as ANN mentioned above. Through the process of learning, we are improving model prediction accuracy as fast and well as possible. After learning, we expect the best fitting model of input data. Some methods that are not very robust give very good results on training data, but on testing or real data perform poorly. When this occurs, we are talking about over-fitting. The rate of over-fitting is also very dependent on input data. This mostly happens when we do not have much data compared with the number of attributes, or when the data is noisy. The noise gets into data by subjective data providing, uncertainty, acquisition errors, and so on. If we have too many attributes and not much data, the state space for finding the optimal model is too wide and we can easily lose the right way and finish in local optimum. The problem of over-fitting can be partially eliminated by suitable pre-processing and by using adequate learning methods as well as providing good input data.
Main Methods in ML
Understanding intelligence and creating intelligent artefacts, the twin goals of AI, represent two of the final frontiers of modern science. Several of the early pioneers of computer science, such as Turing, Von Neumann, and Shannon, were captivated by the idea of creating a form of machine intelligence. The questions and issues considered back then are still relevant today, perhaps even more so.
The following lists are the main methods employed in ML:
- Decision Trees
- Artificial Neural Networks (ANN)
- Bayesian Methods
- Reinforcement Learning
- Inductive Logic Programming (ILP)
- Case-Based Reasoning (CBR)
- Genetic Algorithms (GA)
- Support Vector Machines (SVM)
In recent years, attention has been paid to generating and combining different but still homogeneous classifiers with techniques called bagging, boosting, or bootstrapping. They are based on repeated generation of the same type of model over evolving training data. These methods enable reduction of model variance.
Decision tree learning is a method for approximating discrete functions by a decision tree. In the nodes of trees are attributes and in the leaves are values of discrete function. The decision tree can be rewritten in a set of if-then rules. Trees’ learning methods are popular inductive inference algorithms, mostly used for variety of classification tasks (for example, for diagnosing medical cases). Table 1 shown below shows a set of fictitious conditions that are suitable for playing tennis.
Table 1: (Fictitious Weather Data)
A set of rules can be learned from the data in Table 1, which might look like the following:
|If outlook = sunny||and humidity = high||then play_tennis = no|
|If outlook = rain||and wind = strong||then play_tennis = no|
|If outlook = overcast||then play_tennis = yes|
|If humidity = normal||then play_tennis = yes|
|If none of the above||then play_tennis = no|
The above sets of rules are shown as a decision tree in Figure 1. The rules are meant to be interpreted in order; that is, the first ones first, then if it does not apply, the second, and so on. This is called a decision list, the interpretation of a set of rules in sequence. As a decision list, the rules correctly classify all of the examples in the table, whereas taken individually, out of context, some of the rules are incorrect. For example,
|If humidity = normal||then play_tennis = yes|
This is inconsistent with row 7 of Table 1. The rules we had mentioned are classification rules where they predict the classification of the example in terms of whether to play tennis or not. It is equally possible to look for any rules that strongly associate different attribute values. These are called association rules and many can be derived from data in Table 1:
Figure 1 shows a decision tree for the concept of PlayTennis.
Artificial Neural Networks (ANN)
Neural networks learning methods provide a robust approach to approximating real-valued, discrete-valued, and vector-valued functions. The well-known algorithm Back-Propagation uses gradient descent to tune network parameters to best fit a training set with an input-output pair. This method is inspired by neurobiology. It imitates the function of the brain, where many neurons are interconnected. The instances are represented by many input-output pairs. ANN learning is robust to errors in training data and has been successfully applied to problems such as speech recognition, digital signal processing, face recognition, object recognition (extract a specific object from an image in the presence of other objects), character recognition such as OCR (Optical Character Recognition), and so on.
Bayesian reasoning provides a probabilistic approach to inference. Bayesian reasoning provides the basis for learning algorithms that directly manipulate with probabilities, as well as a framework for analysing the operation of other algorithms. A Bayesian learning algorithm that calculates explicit probabilities for hypothesis, such as the Naive-Bayes algorithm, are among the most practical approaches to certain type of learning problems. The Bayes classifier is competitive with other ML algorithms in many cases. For example, for learning to classify text documents, the Naive-Bayes classifier is one of the most effective classifiers.
The difficulty in applying Bayesian methods is that they typically require initial knowledge of many probabilities. When these probabilities are not known in advance, they are often estimated based on background knowledge, previous available data, and assumptions about the underlying probability distributions. The second practical difficulty is the significant computational cost required to determine optimal hypothesis in the general case.
In the area of Speech Processing, Bayesian methods are used extensively with ANN, Fuzzy Logic, and Digital Signal Processing techniques to process language. The smallest speech patterns that have linguistic representation in a language are called phonemes. Phonemes have three major conventional groups: vowels, semi-vowels, and consonants. Prior probabilities are assigned to phonemes depending on wether they are vowels, semi-vowels, or consonants. Speech is a sequence of waves that are transmitted over time through a medium and are characterised by physical features such as intensity and frequency. In humans, the perceived speech activates small oscillations in the inner ear that are transmitted to the brain for further processing. Java has a speech API (JSAPI), which is merely a reference implementation to be used with speech engines. Speech Engines are where the difficulty lies, in terms of technology and software development. Software developers involved in developing Speech Engines must be well versed in techniques of Machine Learning and Digital Signal Processing (DSP). The JSAPI, has no algorithm in Bayesian, DSP, ANN, or Machine Learning in general. These algorithm are all being developed at the Speech Engines level.
One area of difficulty in speech processing is filtering. Humans are quite good at filtering conversation in a noisy room. If you talk to your friend in a noisy environment like a bar, your brain filters (blocks) out other noises or speeches from other people who are talking at the same time as your friend, but receiving only the speech of your friend with whom you are currently having a conversation. Machines are still a long way away from this differential speech (signal) filtering capability of humans. Electrical engineers, computer scientists, and physicists are researching the hardware and software requirements of machine speech filtering.
Speech technology is important now and for the future. Soon we will speak to machines and they will be able to understand (Natural Language Representation). A manager will just speak such a command as: “What is the sales forecast for next month?” and the machine will responded by either printing a report on screen or spoken texts (text-to-speech). Computing with words is made possible with Fuzzy Logic.
Reinforcement learning solves the task and addresses how the agent, that senses and acts in an environment, can learn to choose optimal actions to reach its goal. Each time the agent performs an action in its environment, a trainer may provide a reward or penalty to indicate the convenience of the resulting state. For example, when an agent is trained to play a game, the trainer might provide a positive reward when the game is won, a negative reward when it is lost, and zero reward in other states. The task of the agent is to learn from this delayed reward, to choose sequences of actions that produce the greatest cumulative reward. An algorithm that can acquire optimal control strategies from delayed reward is called Q-learning.
This method can solve problems such as learning to control a mobile robot, learning to optimise operations in factories, and learning to plan therapeutic procedures. Consider building a learning robot or agent that has a set of sensors to observe the state of its environment, and a set of actions it can perform to alter this state. The sensors might be a digital camera and sonar where there are such actions as moving forward and turning. The robot’s task is to learn a control strategy for choosing actions that achieve its goals. Perhaps one goal is docking onto its battery charger whenever its battery level is low. The goal of docking to the battery charger can be captured by assigning a positive reward, say a number of +10 to state-action transitions that immediately result in a connection to the charger, and a reward of zero for every other state-action transition. The reward function may be built into the robot or known only to an external teacher who provides the reward value for each action performed by the robot. The task of the robot is to perform sequences of actions , observe their consequences, and learn a control policy. The control policy that is desired is the one that, from any initial state, chooses actions that maximize the reward accumulated over time by the agent.
The robot example is a generalized type of reinforcement learning. Maximizing cumulative reward covers many problems beyond robot learning tasks. In general, the problem is one of learning to control sequential processes. This includes, for example, manufacturing optimization problems in which a sequence of manufacturing actions must be chosen, and the reward to be maximized is the value of the goods produced minus the costs involved. It also includes software applications for call centres such as sequential scheduling problems in choosing which taxis to send for passengers where the reward to be maximized is a function of wait time of the passengers and the total fuel costs of the taxi fleet.
Inductive Logic Programing (ILP)
Inductive logic programming has its roots in concept learning from examples, a relatively straightforward form of induction. The aim of concept learning is to discover, from a given set of pre-classified examples, a set of classification rules with high predictive power. The theory of ILP is based on proof theory and model theory for the first order predicate calculus. Inductive hypothesis formation is characterized by techniques including inverse resolution, relative least general generalisations, inverse implication, and inverse entailment. This method can be used to create logical programs from training data set. The final program should be able to generate that data back. The creation of logical programs is very dependent on task complexity. In many cases, this method is not usable without many restrictions posed on the final program. ILP is used successfully in Data Mining for finding rules in huge databases.
Case-Based Reasoning (CBR)
Case-Based Reasoning is a lazy learning algorithm that classifies a new query instance by analysing similar instances while ignoring instances that are very different from the query. This method holds all previous instances in case memory. The instances/cases can be represented by values, symbols, trees, various hierarchical structures, or other structures. It is a non-generalization approach. The CBR works in the cycle: case retrieval—reuse—solution testing—learning. This method is inspired by human reasoning using knowledge from old, similar situations. This learning method is also known as Learning by Analogy. There is more on this subject from my previous article here at Gamelan: http://www.developer.com/java/article.php/10922_1491641_1.
Genetic Algorithms (GA)
Genetic algorithms provide a learning method motivated by an analogy to biological evolution. The search for an appropriate hypothesis begins with a population of initial hypothesis. Members of the current population give rise to the next generation population by operations such as selection, crossover, and mutation. At each step, a collection of hypotheses, called the current population, is updated by replacing some fraction of the population by offspring of the most fit current hypothesis. Genetic algorithms have been applied successfully to a variety of learning tasks and optimisation problems.
The following are some examples of the use of Genetic algorithms (GA):
- Used for tuning the optimal parameter’s setting when it is combined with other ML methods, such as Neural Network or Instance-Based Learning.
- Used to learn collections of rules for robot control.
- Used to recognize objects (object-recognition) from a visual scene (or image)—this field is known as Machine Vision or Computer Vision. For human visual scene identifications, we basically can tell and identify different objects from a scene or an image. Say you have an image of your friend’s study room. Suppose, in this image, there is a stack of books at one corner, a table in the middle of the room, a computer desk, a collection of CDs on the computer desk, and a heater. It is quite easy for a human to see that image and identify all the objects in the room, but can a machine do the same thing? The answer is YES, but currently it is limited; this is a growing field and perhaps machines in the future will completely achieve human-like vision (scary, huh?).
The simplest form of Computer Vision is OCR (Optical Character Recognition), where handwritten characters on a sheet of paper are scanned into a computer and the output is ASCII characters. More sophisticated applications of Computer Vision are in industrial automation visual inspection systems, for quality control and medical imaging diagnosis systems. More complex still are the applications used by the military for real-time intelligence satellite image gathering, or missile guiding systems for identifying ground targets. In fact, it was the military who were first to develop such technologies even before commercial applications started appearing on the market.
- Identification of credit card fraud, where the card is a stolen one or card numbers is being used for a fake one. If you have experienced using your credit card for shopping and when you think the transaction is accepted, all of a sudden you being asked for some more identification, such as what is your mother’s or father’s name, or what school you went to? This is GA and ANN (Artificial Neural Network). In the database of your credit company, the GA program learns the characteristics of fraud from its millions of transactions and is able to evaluate whether a transaction is fraudulent or not.
Support Vector Machines (SVM)
The term Support Vector Machines (SVM) is a misnomer. These are not a machines or anything like that, but they are algorithms. SVM has become a very popular method for classification and optimisation in recent times. SVMs were introduced in 1992. This method combines two main ideas. The first one is the concept of an optimum linear margin classifier that constructs a separating hyper-plane that maximizes distances to the training point. The second one is the concept of a kernel. In its simplest form, the kernel is a function that calculates the dot product of two training vectors. Kernels calculate these dot products in feature space, often without explicitly calculating the feature vectors, operating directly on the input vectors instead. When using feature transformation, which reformulates input vectors into new features, the dot product is calculated in feature space, even if the new feature space has higher dimensionality. The linear classifier is unaffected.
Margin maximization provides a useful trade-off with classification accuracy that can easily lead to over-fitting of the training data. SVMs are suitable for solving learning tasks where the number of attributes is large with respect to the number of training examples. Some applications are listed below:
- Computational Biology: Learning to predict protein-protein interactions from primary structure.
- Computer Vision: In Medical Decision Support and Diagnosis, automated software applications that are used to learn from photomicrographs of sputum smears to diagnose Tuberculosis.
- Information Categorization and Retrieval: Text categorization represents an interesting challenge to statistical inference and ML communities due to the growing demand for automatic information categorization and retrieval systems. SVM has been successfully applied to this task. Given a vast number of text documents to categorize and classify, it is a daunting task for a human to read through all these documents and categorize them according to their subject title, content, and so on. Imagine the huge amounts of paperwork for court systems or a television network that has to store different types of programs. If a new documentary arrives, how do you classify it? Because this film has to be archived, automatic archiving is done by software and SVM techniques are at the core of such applications. Similarly, the retrieval process is faster and more accurate using SVM compared to traditional SQL searches.
Machine Learning and Data Mining (DM)
Data Mining (DM) is also known as Knowledge Discovery in Databases (KDD). It has been defined as “The nontrivial extraction of implicit, previously unknown, and potentially useful information from data.” It uses machine learning, statistical, and visualization techniques to discover and present knowledge in a form that is easily comprehensible to humans. DM is built on ML, and there are currently more new methods from ML that are being built in to Data Mining software.
Data Mining is the extraction of hidden predictive information from large databases. It is a powerful new technology with great potential. For example, it can help companies and institutions focus on the most important information in their data warehouses. Data mining tools predict future trends and behaviours, allowing businesses to make proactive, knowledge-driven decisions. DM tools can answer business questions that were traditionally too time consuming to resolve. They scour databases for hidden patterns, finding predictive information that experts may miss because it lies outside their expectations.
DM technology can generate new business opportunities by providing these capabilities:
- Automated prediction of trends and behaviours. Data mining automates the process of finding predictive information in large databases. Questions that traditionally required extensive hands-on analysis can now be answered directly from the data quickly.
- Automated discovery of previously unknown patterns. Data mining tools sweep through databases and identify previously hidden patterns in one step. An example of pattern discovery is the analysis of retail sales data to identify seemingly unrelated products that are often purchased together.
The most commonly used techniques from Machine Learning that apply in Data Mining are:
- Artificial neural networks
- Decision trees—Classification And Regression Trees (CART), Chi Square Automatic Interaction Detection (CHAID).
- Genetic algorithms
- Nearest neighbour method
- Rule induction—The extraction of useful if-then rules from data based on statistical significance.
In a business enterprise, managers and decision makers rely on timely, correct, and predictive information that is vital for running the business in terms of competitive advantage. There is a term that has emerged from the business and IT community over the last decade known as Business Intelligence, or BI for short. What this term means is open to interpretation by different individuals and professionals. In fact, this term is viewed by the ML community as a Data Mining Application because it is built on Machine Learning. Expert comments from Intelligent Enterprise (http://www.intelligentEnterprise.com), which is a Web site for BI communities, CEO, managers, and the likes predict that software applications for the enterprise level that do not have an analytic component (this basically means a Data Mining component) will lose out to to competitors who do have them. Business Analytics applications are quite big at the moment for systems such as CRM (Customer Relation Management) and SCM (Supply Chain Management). Leaders in this market, such as SAP, Siebel, PeopleSoft, Clarify, Oracle, Cognos, and many more are jumping to develop Business Analytics applications. The heart and core of BI is Data Mining in which the foundation is ML.
Java Data Mining API (JDMAPI)
The implementation of ML in Java is in the Java Data Mining API (JDMAPI) of JSR-73 that is javax.datamining, which has just been made available for Public Review. The purpose of this API is to create, store, access, and maintain data and metadata supporting data mining models, data scoring, and data mining results serving J2EE-compliant application servers. Currently, there is no widely agreed upon, standard API for data mining. By using JDMAPI, implementers of data mining applications can expose a single, standard API that will be understood by a wide variety of client applications and components running on the J2EE Platform. Data Mining clients can be coded against a single API that is independent of the underlying data mining system. The ultimate goal of JDMAPI is to provide for data mining systems what JDBC did for relational databases. The JDMAPI will support OLAP (Online Analytical Processing). This API is designed for business enterprise application only.
Here are some ML methods that are implemented in this Java Data Mining API (JDMAPI), and they are not exhaustive.
- Decisition Tree is found in the package javax.datamining.modeldetail.decisiontree
- Artificial Neural Network (ANN)—javax.datamining.algorithm.feedforwardneuralnet:
The only type of ANNs available in the JDMAPI is the Feedforward Neural Network and its block diagram is shown in Figure 2 below.
Some of the popular ANNs that are missing include:
GA (Genetic Algorithm) and SVM (Support Vector Machines) are not implemented in JDMAPI version 1. I made a comment to the Expert Group that designed JDMAPI during the public review period, about the inclusion of GA and SVM. The reply from Mark. F. Honick of Oracle who is the lead spec of JDMAPI group, said they will be implemented in JDMAPI version 2. He did not clearly say when is this going to come out, but obviously, there will be more ML methods that would be included in future versions. Least Square Support Vector Machines (LS-SVM) will be in version 2, plus the Expert Group would be looking to adopt modern numerical analysis techniques as Wavelet for parameter tuning of GA, ANN, BBN, or SVM. In modern data mining applications (the last 6 years), Wavelet has been incorporated successfully into Machine Learning methods such as ANN and SVM which drastically improved their performance.
Overall, JDMAPI is a killer API in the area of Data Mining. As companies are looking to move to .NET or stick with Java for application development, decision makers should examine the type of application that they want to develop. If the applications involve Business Intelligence, Web Intelligence, Business Analytics, and Data Mining in general, it is pointless to argue whether to go with .NET or J2EE. The answer is quite obvious; stick with Java. It costs millions to develop a Data Mining API from the ground up because the algorithms are so complex, plus if you are using the API for application development, expect to pump in millions more. That is one of the factors in which to decide which platform to develop application (.NET vs. J2EE). Microsoft has no equivalent API in Data Mining that is available in Visual Studio for .NET that you (the software developer) could use. You have to do it from the ground up.
Open Source Machine Learning Tools in Java
There are a number of Machine Learning Tools written in Java, but the followings are the most popular with the ML community. These GPL tools are widely adopted for Computer Science courses in Machine Learning at universities around the world (refer to the resources for download links).
The Philosophical Debate Goes On
Bill Joy, the chief scientist at Sun Microsystems, along with other computer scientists, predicts that the future will be machines dominating humans. Humans will become an endangered species. This is caused by humans trying to accelerate inventions of new technologies and that one day it will turn against us. He warned that the pace of new technology should be slowed down. This comment is from his article at Wired Magazine found here: http://www.wired.com/wired/archive/8.04/joy.html. A comment from a theologian suggests that while the academic community does weigh broader ethical considerations, Silicon Valley has a more narrow point of view. “It’s much more about building a company that can go IPO (Initial Public Offering) as fast as possible,” noting a Gold Rush mentality that produces a very short-term view of the world.
Bill Joy realized that the big advances in information technology come not from the work of computer scientists, computer architects, or electrical engineers, but from that of physical scientists (physicists). It was physicists who invented semi-conductors and transistors that are components of every electronic gadget on the planet. Computers were first developed for use by physicists in scientific calculations. The top current research topic by physicists is in Quantum Computers, based on the theory of Quantum Physics, which was first proposed and inspired by Physics Nobel Prize winner Richard Feynman of Cal Tech at a symposium at MIT in 1982. If Quantum Computers are achieved within the next 40 years, the nightmare of machines dominating humans will be on the horizon. What current computers (conventional) can compute in billions of years will take less than a second for a Quantum Computer to compute because of its massive parallelism. If you have an expert systems wit, say, more than 10,000 rules, you will notice a slow performance in the system. Increasing the number of rules will probably crash the system. An expert system that is deployed in a Quantum Computer can think fast, even when the number of rules are in the millions.
Do the experts think that machines will ever acquire intelligence? Not so, according to top Physicist and Mathematician Roger Penrose from Cambridge University, UK. Roger Penrose is well known in mathematics and physics circles for his theoretical work on extending the theory of Relativity and combining it with Quantum Physics, leading to what is now known as Quantum Gravity (a special branch of physics). He worked with Stephen Hawking in the field of cosmology in the late 1960s. From his book titled The Emperor’s New Mind, he argued that machines will never ever understand the theory of knowledge and existence of reality nor acquire consciousness. As an aside, the mathematics used in AI or Machine Learning is not nearly as complex as the mathematics used in Relativity and Quantum Physics. The kinds of argument that Professor Penrose used in his book are consistent theoretical proofs that are commonly found in mathematical inductions techniques.
I believe that it is best to leave such debate to computer scientists, philosophers, mathematicians, and physicists. The job of a software developer is to produce software applications that benefit human society now.
About the Author
Sione Palu has developed software for publishing systems, imaging, Web applications, amd symbolic computer algebra systems for secondary education. Palu graduated from the University of Auckland, New Zealand, with a science degree (B.Sc.) in mathematics and computing. His interests involve the application of Java and mathematics in the fields of mathematical modelling and simulations, symbolic AI and soft-computing, numerical analysis, image processing, wavelets, digital signal processing, control systems and computational finance.