The Use of Java in Machine Learning
Support Vector Machines (SVM)
The term Support Vector Machines (SVM) is a misnomer. These are not machines in any physical sense; they are algorithms. SVM has become a very popular method for classification and optimisation in recent times. SVMs were introduced in 1992. The method combines two main ideas. The first is the concept of an optimum linear margin classifier, which constructs a separating hyper-plane that maximizes the distance to the nearest training points. The second is the concept of a kernel. In its simplest form, a kernel is a function that calculates the dot product of two training vectors. More generally, kernels calculate these dot products in a feature space, often without explicitly calculating the feature vectors, by operating directly on the input vectors instead. When a feature transformation reformulates the input vectors into new features, the dot product is calculated in the new feature space, even if that space has higher dimensionality. The linear classifier itself is unaffected: it is still linear, only in the feature space.
Margin maximization provides a useful counterbalance to classification accuracy on the training set, which on its own can easily lead to over-fitting of the training data. SVMs are well suited to learning tasks where the number of attributes is large relative to the number of training examples. Some applications are listed below:
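To make the kernel idea concrete, here is a minimal sketch in Java. It assumes a degree-2 polynomial kernel on two-dimensional inputs; the class and method names (KernelSketch, polyKernel, featureMap) are made up for illustration and are not part of any library. It shows that the kernel, computed directly on the input vectors, gives the same value as the dot product of the explicitly transformed feature vectors.

```java
class KernelSketch {
    /** Plain dot product of two vectors of equal length. */
    static double dot(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }

    /** Degree-2 polynomial kernel K(x, z) = (x . z)^2, computed on the input vectors. */
    static double polyKernel(double[] x, double[] z) {
        double d = dot(x, z);
        return d * d;
    }

    /**
     * Explicit feature map for 2-D inputs that corresponds to polyKernel:
     * (x1, x2) -> (x1^2, sqrt(2) * x1 * x2, x2^2). Illustrative only.
     */
    static double[] featureMap(double[] x) {
        return new double[] { x[0] * x[0], Math.sqrt(2.0) * x[0] * x[1], x[1] * x[1] };
    }

    public static void main(String[] args) {
        double[] x = {1.0, 2.0};
        double[] z = {3.0, 4.0};
        // Kernel evaluated in input space (2 dimensions).
        System.out.println("kernel on inputs:        " + polyKernel(x, z));
        // Same quantity evaluated in feature space (3 dimensions).
        System.out.println("dot product of features: " + dot(featureMap(x), featureMap(z)));
    }
}
```

Both lines print the same value up to floating-point rounding, yet the first never builds the three-dimensional feature vectors. For kernels whose feature space is very large, that saving is the whole point.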
- Computational Biology: Learning to predict protein-protein interactions from primary structure.
- Computer Vision: In medical decision support and diagnosis, automated software applications learn from photomicrographs of sputum smears to diagnose tuberculosis.
- Information Categorization and Retrieval: Text categorization represents an interesting challenge to the statistical inference and ML communities due to the growing demand for automatic information categorization and retrieval systems. SVM has been successfully applied to this task. Given a vast number of text documents, it is a daunting task for a human to read through them all and categorize them according to their subject title, content, and so on. Imagine the huge amounts of paperwork for court systems, or a television network that has to store different types of programs. If a new documentary arrives, how do you classify it? Because the film has to be archived, the archiving is done automatically by software, and SVM techniques are at the core of such applications. Similarly, retrieval by content can be faster and more accurate using SVM than with traditional SQL searches.
Machine Learning and Data Mining (DM)
Data Mining is the extraction of hidden predictive information from large databases. It is a powerful new technology with great potential. For example, it can help companies and institutions focus on the most important information in their data warehouses. Data mining tools predict future trends and behaviours, allowing businesses to make proactive, knowledge-driven decisions. DM tools can answer business questions that were traditionally too time-consuming to resolve. They scour databases for hidden patterns, finding predictive information that experts may miss because it lies outside their expectations.
DM technology can generate new business opportunities by providing these capabilities:
- Automated prediction of trends and behaviours. Data mining automates the process of finding predictive information in large databases. Questions that traditionally required extensive hands-on analysis can now be answered directly from the data quickly.
- Automated discovery of previously unknown patterns. Data mining tools sweep through databases and identify previously hidden patterns in one step. An example of pattern discovery is the analysis of retail sales data to identify seemingly unrelated products that are often purchased together.
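The retail example above can be sketched in a few lines of Java. This is only an illustration of the idea, counting how often each pair of products appears in the same basket; it is not how any particular data mining product works. The class name PairCountSketch and the sample baskets are invented for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

class PairCountSketch {
    /**
     * Counts, for every pair of distinct products, the number of baskets
     * that contain both. Keys have the form "productA + productB" with the
     * two names in alphabetical order.
     */
    static Map<String, Integer> countPairs(List<List<String>> baskets) {
        Map<String, Integer> counts = new TreeMap<>();
        for (List<String> basket : baskets) {
            // Remove duplicates and sort so each pair is counted once per basket.
            List<String> items = new ArrayList<>(new TreeSet<>(basket));
            for (int i = 0; i < items.size(); i++) {
                for (int j = i + 1; j < items.size(); j++) {
                    counts.merge(items.get(i) + " + " + items.get(j), 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical sales data: one inner list per customer basket.
        List<List<String>> baskets = Arrays.asList(
            Arrays.asList("bread", "butter", "milk"),
            Arrays.asList("butter", "bread"),
            Arrays.asList("milk", "nappies"));
        for (Map.Entry<String, Integer> e : countPairs(baskets).entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue());
        }
    }
}
```

Pairs with unexpectedly high counts are candidates for "purchased together" patterns. A real tool would also weigh each count against how common the individual products are.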
The most commonly used techniques from Machine Learning that apply in Data Mining are:
- Artificial neural networks
- Decision trees—Classification And Regression Trees (CART), Chi Square Automatic Interaction Detection (CHAID).
- Genetic algorithms
- Nearest neighbour method
- Rule induction—The extraction of useful if-then rules from data based on statistical significance.
In a business enterprise, managers and decision makers rely on timely, correct, and predictive information that is vital for running the business and maintaining a competitive advantage. Over the last decade, the term Business Intelligence, or BI for short, has emerged from the business and IT community. What the term means is open to interpretation by different individuals and professionals. The ML community views BI as a Data Mining application because it is built on Machine Learning. Expert commentary from Intelligent Enterprise (http://www.intelligentEnterprise.com), a Web site for BI communities, CEOs, managers, and the like, predicts that enterprise-level software applications without an analytic component (this basically means a Data Mining component) will lose out to competitors that have one. Business Analytics applications are in strong demand at the moment for systems such as CRM (Customer Relationship Management) and SCM (Supply Chain Management). Leaders in this market, such as SAP, Siebel, PeopleSoft, Clarify, Oracle, Cognos, and many more, are racing to develop Business Analytics applications. The heart and core of BI is Data Mining, whose foundation is ML.
Java Data Mining API (JDMAPI)
The implementation of ML in Java is the Java Data Mining API (JDMAPI) of JSR-73, in the package javax.datamining, which has just been made available for Public Review. The purpose of this API is to create, store, access, and maintain data and metadata supporting data mining models, data scoring, and data mining results serving J2EE-compliant application servers. Currently, there is no widely agreed upon, standard API for data mining. By using JDMAPI, implementers of data mining applications can expose a single, standard API that will be understood by a wide variety of client applications and components running on the J2EE Platform. Data Mining clients can be coded against a single API that is independent of the underlying data mining system. The ultimate goal of JDMAPI is to provide for data mining systems what JDBC did for relational databases. The JDMAPI will support OLAP (Online Analytical Processing). This API is designed for business enterprise applications only.
Here are some of the ML methods implemented in the Java Data Mining API (JDMAPI); the list is not exhaustive.
- Decision Tree is found in the package javax.datamining.modeldetail.decisiontree
- Artificial Neural Network (ANN)—javax.datamining.algorithm.feedforwardneuralnet:
The only type of ANN available in the JDMAPI is the Feedforward Neural Network; its block diagram is shown in Figure 2 below.
Figure 2 is a Feedforward Neural Network with 4 neuron layers, two of which are hidden layers. The signal flows from left to right in Figure 2, as the arrows show. There are two input blocks (the left-most light blue blocks, X1 and X2). Next is the first hidden layer with 3 neurons: N3 (top red block), N4 (middle black block), and N5 (lower black block). The second hidden layer is made up of 2 neurons: N6 (top green block) and N7 (bottom black block). The last layer is the output layer, composed of 2 neurons: N8 (top black block) and N9 (bottom block). Note that X8 and X9 are not in a different layer; they just display the outputs of the N8 and N9 blocks. Note one connection in particular: one output of N3 (top red block, that is Out1 of N3) is forward connected to one of the inputs of block N8 in the output layer (that is, In1 of N8). So, this connection jumps ahead from hidden layer one to the output layer instead of connecting to hidden layer two. The network is still a Feedforward Neural Network because every connection, including this one, carries the signal forward and none feeds back to an earlier layer. The analysis of ANN is easily understood by using block diagrams or similar visual methods. In physics and electrical engineering, ANNs are most commonly treated and simulated using block diagrams. You might notice that these are similar to Control Systems and Signal Processing block diagrams. They are one and the same: concepts such as Transfer Functions, Error Rate, and Momentum Rate are also found in Signal Processing and Control Systems. It is much faster to tune the network parameters with visual methods (such as block diagrams) until you find optimal values and then, once you are satisfied, start coding your ANN in a high-level, object-oriented language such as Java. Engineers and scientists normally use a popular industry and academic simulation tool such as MATLAB to do this sort of visual simulation of the neural network before they even code in C, C++, or Java.
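For readers who prefer code to block diagrams, the forward pass through a network shaped like the one in Figure 2 might be sketched as follows. This is plain Java, not JDMAPI code. The weights and biases are arbitrary illustrative numbers, the sigmoid activation is an assumption, and so is the full connectivity between adjacent layers, because the figure description does not specify any of these. The only detail taken from the description is the topology: two inputs, neurons N3 to N5, then N6 and N7, then N8 and N9, with the extra forward connection from N3 into N8.

```java
class FeedforwardSketch {
    /** Logistic (sigmoid) activation; an assumed choice of transfer function. */
    static double sigmoid(double v) {
        return 1.0 / (1.0 + Math.exp(-v));
    }

    /** One neuron: weighted sum of its inputs plus a bias, passed through the sigmoid. */
    static double neuron(double[] inputs, double[] weights, double bias) {
        double sum = bias;
        for (int i = 0; i < inputs.length; i++) {
            sum += inputs[i] * weights[i];
        }
        return sigmoid(sum);
    }

    /** Propagates inputs X1 and X2 through the network and returns {out of N8, out of N9}. */
    static double[] forward(double x1, double x2) {
        double[] in = {x1, x2};

        // Hidden layer one: N3, N4, N5 each receive both inputs (weights are made up).
        double n3 = neuron(in, new double[] {0.5, -0.4}, 0.1);
        double n4 = neuron(in, new double[] {-0.3, 0.8}, 0.0);
        double n5 = neuron(in, new double[] {0.2, 0.2}, -0.1);
        double[] hidden1 = {n3, n4, n5};

        // Hidden layer two: N6 and N7 receive the outputs of hidden layer one.
        double n6 = neuron(hidden1, new double[] {0.7, -0.6, 0.1}, 0.0);
        double n7 = neuron(hidden1, new double[] {-0.2, 0.4, 0.9}, 0.05);

        // Output layer: N8 also receives N3's output directly (the connection
        // that jumps over hidden layer two); N9 sees only N6 and N7.
        double n8 = neuron(new double[] {n6, n7, n3}, new double[] {0.6, -0.5, 0.3}, 0.0);
        double n9 = neuron(new double[] {n6, n7}, new double[] {-0.4, 0.9}, 0.0);

        return new double[] {n8, n9};
    }

    public static void main(String[] args) {
        double[] out = forward(1.0, 0.0);
        System.out.println("N8 output: " + out[0]);
        System.out.println("N9 output: " + out[1]);
    }
}
```

Every value is computed from values produced earlier in the method, and nothing is fed back. That is what makes the network feedforward, whether or not a connection skips a layer.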
More on ANN can be found from a tutorial by Jeff Heaton here at GAMELAN: http://www.developer.com/java/other/article.php/1546201
Some of the popular ANNs that are missing include:
- Recurrent Network (Feedback Neural Network): unlike a Feedforward Neural Network, a Recurrent Neural Network contains feedback connections, so signals can loop back to earlier neurons.
- Competitive Learning Networks
- Hebbian Learning Networks
- Principal Component Networks
- Hopfield Network
- Bayesian Methods—javax.datamining.algorithm.naivebayes:
The type of Bayes methods adopted in the JDMAPI is Naive Bayes. JDMAPI lacks the full capability of Bayesian Belief Network (BBN).
- Inductive Logic Programming—javax.datamining.associationrules
- K-Nearest Neighbour and CBR—javax.datamining.algorithm.kmeans:
The related distance-based method implemented in JDMAPI is k-means clustering, which groups instances around cluster centers (the center being the mean point of a cluster of instances) using a distance measure such as Euclidean distance.
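As a rough illustration of what k-means does, here is a minimal sketch in plain Java. It does not use the javax.datamining classes; the class name KMeansSketch, the fixed number of iterations, and the caller-supplied starting centers are simplifying assumptions. Each pass assigns every instance to its nearest center by Euclidean distance and then moves each center to the mean of the instances assigned to it.

```java
class KMeansSketch {
    /** Squared Euclidean distance between two points of equal dimension. */
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int d = 0; d < a.length; d++) {
            double diff = a[d] - b[d];
            sum += diff * diff;
        }
        return sum;
    }

    /** Index of the center closest to the given point. */
    static int nearest(double[] point, double[][] centers) {
        int best = 0;
        for (int c = 1; c < centers.length; c++) {
            if (squaredDistance(point, centers[c]) < squaredDistance(point, centers[best])) {
                best = c;
            }
        }
        return best;
    }

    /** Runs a fixed number of assign-and-update passes and returns the final centers. */
    static double[][] cluster(double[][] points, double[][] initialCenters, int iterations) {
        int k = initialCenters.length;
        int dim = points[0].length;
        double[][] centers = new double[k][];
        for (int c = 0; c < k; c++) {
            centers[c] = initialCenters[c].clone(); // do not modify the caller's array
        }
        for (int it = 0; it < iterations; it++) {
            double[][] sums = new double[k][dim];
            int[] counts = new int[k];
            // Assignment step: add each point to the running sum of its nearest center.
            for (double[] p : points) {
                int c = nearest(p, centers);
                counts[c]++;
                for (int d = 0; d < dim; d++) {
                    sums[c][d] += p[d];
                }
            }
            // Update step: move each non-empty cluster's center to the mean of its points.
            for (int c = 0; c < k; c++) {
                if (counts[c] == 0) {
                    continue; // an empty cluster keeps its previous center
                }
                for (int d = 0; d < dim; d++) {
                    centers[c][d] = sums[c][d] / counts[c];
                }
            }
        }
        return centers;
    }

    public static void main(String[] args) {
        // Made-up data: two visibly separate groups of three points each.
        double[][] points = {{1, 1}, {1, 2}, {2, 1}, {8, 8}, {9, 8}, {8, 9}};
        double[][] start = {{1, 1}, {8, 8}};
        for (double[] center : cluster(points, start, 10)) {
            System.out.println(center[0] + ", " + center[1]);
        }
    }
}
```

Note that this is clustering, not classification. The instances carry no class labels, which is the main difference from a K-Nearest Neighbour classifier.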
GA (Genetic Algorithm) and SVM (Support Vector Machines) are not implemented in JDMAPI version 1. During the public review period, I sent a comment to the Expert Group that designed JDMAPI about the inclusion of GA and SVM. The reply from Mark F. Hornick of Oracle, the specification lead of the JDMAPI group, said they will be implemented in JDMAPI version 2. He did not say clearly when version 2 will come out, but evidently more ML methods will be included in future versions. Least Squares Support Vector Machines (LS-SVM) will be in version 2, and the Expert Group will be looking to adopt modern numerical analysis techniques such as wavelets for parameter tuning of GA, ANN, BBN, or SVM. In modern data mining applications (the last 6 years), wavelets have been incorporated successfully into Machine Learning methods such as ANN and SVM, which drastically improved their performance.
Overall, JDMAPI is a killer API in the area of Data Mining. As companies decide whether to move to .NET or stick with Java for application development, decision makers should examine the type of application that they want to develop. If the applications involve Business Intelligence, Web Intelligence, Business Analytics, and Data Mining in general, it is pointless to argue whether to go with .NET or J2EE. The answer is quite obvious: stick with Java. It costs millions to develop a Data Mining API from the ground up because the algorithms are so complex; if you then use that API for application development, expect to pump in millions more. That is one of the factors to weigh when deciding which platform to develop applications on (.NET vs. J2EE). Microsoft has no equivalent Data Mining API available in Visual Studio for .NET that you (the software developer) could use. You have to build it from the ground up.