Instance-Based Learning: A Java Implementation

  • By Sione Palu

Now we have a Euclidean geometry of 4-dimensional space (n = 4), because we have four numeric attributes: < weight, height, test_scores, and age >. Note that the name attribute is omitted here because it is symbolic (non-numeric). We can find the Euclidean distance between the instance David and the instance John by using Equation 1. The coordinates in our 4D space are David = (108.4, 177, 78, 31) and John = (104, 198, 55, 38), with the calculation shown below:

d(David, John) = sqrt( (108.4 - 104)^2 + (177 - 198)^2 + (78 - 55)^2 + (31 - 38)^2 )
               = sqrt( 19.36 + 441 + 529 + 49 )
               = sqrt( 1038.36 )
               = 32.2236 (rounded to 4 decimal places)
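As a minimal sketch (my own illustration, not part of the Selection Engine), the Euclidean distance of Equation 1 can be written in Java like this:

```java
// Euclidean distance between two instances represented as numeric vectors.
public class EuclideanDistance {
    public static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;     // squared difference per attribute
        }
        return Math.sqrt(sum);
    }

    public static void main(String[] args) {
        double[] david = {108.4, 177, 78, 31};
        double[] john  = {104, 198, 55, 38};
        System.out.printf("%.4f%n", distance(david, john)); // prints 32.2236
    }
}
```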

The 2D plot in Figure 2 shows a query point xq amongst the instances {A, B, C, D} of the training examples. When k=1, we have a 1-Nearest Neighbour, shown by the blue circle centered at point xq. Point C is within the radius of the blue circle; therefore, C is the 1-Nearest Neighbour. The magenta circle centered at point xq represents k=2, that is, 2-Nearest Neighbours. The 2-Nearest Neighbours of the query point xq are points B and C.

Figure 2

The concept of similarity or closeness has no obvious analogy in the world of SQL search. Suppose that a target instance called Mark, who is not stored in our hypothetical database from Table 1, has the following attributes: < weight=96, height=187, test_scores=91, and age=34 >. If you try to do an SQL search for the closest person whose attributes match Mark's attributes, you will get an empty result set. The SQL statement is shown in Listing 1.

Listing 1
     select *
     from people
     where weight = 96
     and height = 187
     and test_scores = 91
     and age = 34

In an SQL search, you either get nothing back or are overwhelmed with a tremendous number of matches (say, in a typical utility company's customer database), which you then have to sift through to sort out what you want. SQL is generally a hit-or-miss scenario with no in-between (the result is either true or false). This dilemma of boolean (bivalent) logic is addressed by instance-based learning such as k-Nearest Neighbour, and also by Fuzzy Logic (where a hit or miss is qualified with a membership or certainty level). In the paradigm of Instance-Based Learning and Classification, such as k-Nearest Neighbour, there is always a neighbour as long as there are training examples (data).
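To make the contrast concrete, here is a small sketch (my own illustration, not SE code) of a k-nearest-neighbour search over the two instances quoted earlier. Unlike Listing 1, it always returns a result, ranking David as closer to Mark than John:

```java
import java.util.*;
import java.util.stream.Collectors;

// Unlike the exact-match SQL query in Listing 1, a nearest-neighbour
// search always yields an answer as long as training examples exist.
public class NearestNeighbourDemo {
    public static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }

    // Returns the names of the k training instances closest to the query.
    public static List<String> kNearest(Map<String, double[]> training,
                                        double[] query, int k) {
        return training.entrySet().stream()
            .sorted(Comparator.comparingDouble(
                (Map.Entry<String, double[]> e) -> distance(e.getValue(), query)))
            .limit(k)
            .map(Map.Entry::getKey)
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Map<String, double[]> people = new HashMap<>();
        people.put("David", new double[]{108.4, 177, 78, 31});
        people.put("John",  new double[]{104, 198, 55, 38});
        double[] mark = {96, 187, 91, 34};             // no exact match exists
        System.out.println(kNearest(people, mark, 1)); // David is the nearest
    }
}
```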

The k-Nearest Neighbour algorithm is refined to weight the contribution of each of the k neighbours according to their distance to the query point xq, giving greater weight to closer neighbours. The weighting factor is shown by Equation 2:

Equation 2

     wi = 1 / d(xq, xi)^2

The Distance-Weighted Nearest Neighbour algorithm is the same as Algorithm 1, except that Step 2 is now multiplied by the weighting factor wi. The Selection Engine (SE) does not calculate the weighting factor using Equation 2; instead, the weight values are assigned arbitrarily (left up to the user). Since SE is extensible, you can write a class that implements distance weighting if you want a variation of the k-Nearest Neighbour algorithm.
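A common textbook choice for the weighting factor is the inverse square of the distance, so that closer neighbours contribute more; the sketch below assumes that form (the SE itself leaves the weights to the user, so this is only one possible implementation):

```java
// Inverse-square distance weighting for the k-Nearest Neighbour
// refinement: closer neighbours receive larger weights.
public class DistanceWeight {
    public static double weight(double distance) {
        // If the query coincides exactly with a training instance,
        // treat that instance's weight as infinite (a perfect match).
        if (distance == 0.0) return Double.POSITIVE_INFINITY;
        return 1.0 / (distance * distance);
    }

    public static void main(String[] args) {
        // A neighbour at distance 2 gets weight 0.25; one at distance 4
        // gets 0.0625, so the nearer neighbour dominates.
        System.out.println(weight(2.0)); // prints 0.25
        System.out.println(weight(4.0)); // prints 0.0625
    }
}
```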

The other two IBL methods, the Locally Weighted Regression and Radial Basis Functions, are not discussed in this article because they are not implemented in the Selection Engine (refer to the resources for cited publications if you are interested).

Case-Based Reasoning (CBR)

Instance-based methods such as k-Nearest Neighbour and locally weighted regression share three key properties. First, they are lazy learning methods in that they defer decisions on how to generalize beyond the training data until a new query instance is observed. Second, they classify new query instances by analyzing similar instances while ignoring instances that are very different from the query. Third, they represent instances as real-valued points in an n-dimensional Euclidean space.

Case-based reasoning is a learning paradigm based on the first two of these principles, but not the third. CBR is an artificial intelligence methodology that uses specific encapsulated prior experiences as a basis for reasoning about similar new situations. CBR systems rely on various knowledge containers, such as the case-base of prior experiences and similarity criteria for comparing situations and retrieving the most relevant cases. Explicit or implicit changes in the reasoning environment, task focus, and user base may influence the fit of the current knowledge state to the task context, which can affect the quality and efficiency of reasoning results. Over time, the knowledge containers may need to be updated to maintain or improve performance in response to changes in task or environment.

A case-based reasoner learns after each problem-solving session by retaining relevant information from a problem just solved, making the new experience available for future problem solving. Crucial steps in a CBR process include finding a good match to a new problem, adapting a previous solution to successfully solve the new problem, and deciding how to index and store a new case for later effective retrieval.

  • Development: Case-based reasoning was originally stimulated by research into how people remember information and how they are in turn reminded of it. It was observed that people commonly solve problems by remembering how they solved similar problems in the past. Case-based reasoning was developed as a methodology that judges a new problem by comparing it to already classified cases. The strength of a CBR engine lies in the fact that it does not require an exact match on the criteria of its new problem case, the target case.

    The target case is specified by a set of attributes and values that uniquely define that case. A CBR retrieval engine retrieves the most similar case or cases from the case base and adapts the retrieved solutions to fit the specifics of the target case. Hence, to create a CBR system, you only need records of previously solved problems. It is not necessary to know how these problems were solved in the first place.

  • Context for CBR: Case-based reasoning has proven itself to be a methodology suited to solving problems where it is difficult or impossible to elicit first-principle rules from which solutions may be created. In industry, these problem areas have very often been solved by one or two human experts whose experience with similar problems has made them best qualified to provide new solutions. This effectively means that industries in which such problems occur are dependent on the knowledge collated by a small number of individuals and are vulnerable should those human experts leave.

CBR expert systems reason by memory rather than by rules. The driving force behind case-based methods in AI is the machine learning community, and CBR is regarded as a subfield of machine learning. CBR learning occurs as a natural by-product of problem solving. It is a waste of effort to use similarity-based matching such as CBR if you know exactly what you are looking for, such as an item's "key": a part's part number, a person's social security number, an order's order number, and so on. Similarity-based matching is meant to be used when you don't know what options you have, or when you want your items ranked.

Selection Engine (CBR Java Implementation)

Selection Engine (SE) is an open source CBR project written in Java. The library is not yet documented with Javadocs, but the author has included some sample code and tutorial notes with the distribution. The capabilities of SE are listed below:

  • does standard data filtering
  • computes similarity of instances
  • computes percent similarity of instances
  • handles weightings
  • sorts data by percent similarity
  • handles searches on subsets of attributes
  • generic enough to deal with any arbitrary data sent to it
  • easily integrated into larger applications
  • works with Java and is stateless
  • generic data loading and display managers

By default, the Selection Engine (SE) reads the data, metadata, and search query from text files and sends the output to stdout; this should make it usable by anyone. Anyone using the engine is encouraged to extend or customize it, for example so that it reads from a database, or to develop Swing applications and applets around it.

The main Java class in the SE is named SimilarityEngine; it computes the similarity of a collection of items (training examples) to a specified target (query instance). The SimilarityEngine's only method is computeSimilarity(), as shown in Listing 2; it takes instances of the Items, SimilarityCriteria, and SimilarityWeights objects as its arguments. This method is the main computation method for the Selection Engine, and it is overloaded: the four-argument version takes a DataSetStatistics object as its fourth argument. The Item object has a collection of Traits, and the Items collection has a link to the metadata collection TraitDescriptors.

Items is a collection of Item objects, which is basically a standard collection. Item's attributes are defined at run time rather than compile time. Also, Item contains a Traits collection, which contains several Trait objects. A Trait object is just a "name/value" pair. TraitDescriptors represents metadata and it is a collection of TraitDescriptor objects. Each Item object will have one TraitDescriptor. A TraitDescriptor is also just a "name/value" pair. The SE recognizes primitive datatypes such as integers, floating-point numbers, strings, and boolean values.
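The name/value structure described above can be sketched as follows. This is a hypothetical illustration of the data model only; the real Selection Engine classes (Trait, Item, TraitDescriptors) may differ in their actual fields and method names:

```java
import java.util.*;

// Hypothetical sketch of the name/value data model described in the text,
// not the Selection Engine's actual source.
public class TraitModelSketch {
    // A Trait is just a "name/value" pair; SE recognizes primitive
    // datatypes such as integers, floats, strings, and booleans.
    public static class Trait {
        public final String name;
        public final Object value;
        public Trait(String name, Object value) {
            this.name = name;
            this.value = value;
        }
    }

    // An Item holds a collection of Traits; its attributes are
    // defined at run time rather than compile time.
    public static class Item {
        public final List<Trait> traits = new ArrayList<>();
        public void add(String name, Object value) {
            traits.add(new Trait(name, value));
        }
    }

    public static void main(String[] args) {
        Item david = new Item();
        david.add("weight", 108.4);
        david.add("height", 177);
        david.add("test_scores", 78);
        david.add("age", 31);
        System.out.println(david.traits.size()); // prints 4
    }
}
```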


This article was originally published on October 31, 2002
