JavaData & JavaUsing Unicode Order to Create Data Aware Alphanumeric Search Trees

# Using Unicode Order to Create Data Aware Alphanumeric Search Trees

Trees are one of the most commonly used data structures in software engineering. They are also probably most fascinating of all. In this article, we introduce the general tree and then build an algorithm and design to solve a typical problem, which leads us to a multi-way tree.

### Binary Search Trees

Binary trees are one of the most commonly used data structures where the data has some sort of natural order and fast searching is one of the requirements. A binary search tree is a binary tree organized specifically for searches. It is called so because a binary decision (decision with two outcomes) is made at each node. Each node of a binary tree has two pointers to the nodes below, one to the left and one to the right. The search starts at root at the top and descends level by level until the node is found. If the current node is greater than the desired node, the left pointer is followed to a level down; otherwise, a right pointer is followed. This traversal continues until the desired node is found or it is established that the node is not present in the tree. In any case, if the tree is balanced, searching is a log(n) operation or one longest branch in the worst case.

The point to note here is that binary trees make use of two values, greater than (>) or less than (<) to create/traverse the tree. These two are the property of the data of the tree; for example, each node is either greater than or less than any other node.

Java has two implementations of Tree, java.util.TreeMap with a Map interface and java.util.TreeSet with a Set interface.

A slightly different approach is introduced in this article to solve a typical problem in which a “tree” is constructed which uses more than the absolute value of the node and is not restricted with a binary decision of < or > to traverse.

But first, the problem.

### Problem

Let us say we have a sequence of digits, which are to be searched in another sequence of digits—like 236543 to be searched in 23654329876533.

Let us get a more concrete example. Let us say a gift card company has a distinct identifier 54327, so all the gift cards it issues are like 5432-7XXX-XXXX-XXXX. (A larger company may have a four or a three digit identifier.) So, any purchase made using these cards should be directed to this company. Let a purchase be made from a card with number 5432765487690087; this card number evidently belongs to our company in question because the card number “starts with” the number 54327. The rest of the number 65487690087 identifies the client or the individual who made the purchase. The problem is to “match” the string 54327 with the card number and the strip off this pattern and find the remainder of the number.

### Solutions

Of course there are different approaches to solving this problem. One very simple solution is to use the startsWith() method of the String class. Let’s say we have a TreeSet object having all the possible patterns (belonging presumably to different companies), when we have to find pattern which matches we can iterate over all the patterns and use startsWith() to find the one that matches. This is a painfully slow approach because of two reasons: one because we do not know at the outset till what digit position to match in the string 5432765487690087. Consequently, we are not able to use the tree property. And two, because startsWith() is inherently slow.

The terms used so far like “Pattern,” “Sequence,” and “Match” reminds us of regular expressions. Thankfully, since JDK1.4 regular expressions are now part of the Java platform. Making use of regular expressions, constructs, and classes, we can have another solution to this problem. Consider the following code snippet.

```import java.util.regex.*;

public class MatchTest {
public static void main (String args[]) {
Pattern p = Pattern.compile ("54327");
String[] splitResult = p.split("5432765487690087");
System.out.println(splitResult);
}
}
```

We start by creating a Pattern object, which is a compiled representation of a regular expression. This Pattern object can be used to create a Matcher object on which the boolean matches() method can be called to find if the pattern matches with the matcher. Or we could use the String[] split (CharSequence string) method on the Pattern object, which splits the CharSequence “around” the pattern.

The above code snippet prints 65487690087, which identifies the client from the gift card number. This is fine because it returns the final result with just two lines of code. But consider a situation where there are a number of patterns and you have to “match” against each one of them.

### Alternative

Here we introduce the concept of a data-aware tree to solve this problem. This “tree” is suited to search from within a list of contiguous characters. Each leaf node is not a representation of the complete data of the node. The entire tree branch from root to the leaf node contains the information. The information is sort of normalized across the tree. Let us clarify this by an example. The following example is suited to match the pattern of a string of numbers and return the result for our problem in question. This makes use of the fact that characters ‘0’,’1′,….’9′ are sequential Unicode/ASCII characters.

Let us introduce a class called Element, which represents a “node” of our data-aware tree.

```// Class Element
// File Element.java

import java.util.*;

/**
* This class represents a node of the tree.
* Each node has the information of the child branches.
* Any node having no child branches is the leaf node
*/
class Element {

private Element subtree [];
private BitSet values;
private boolean leaf=true;

/**
* The constructor initializes this element to have
* 10 children.
* The presence of each child is known by the
* corresponding bit in a order 10 BitSet and there
* corresponds an Element in the
* subtree array
*/
Element () {
values  = new  BitSet(10);
subtree = new Element;
}

/**
* Insert the element at the given index
* Any insertion will set the node as non-leaf node
*/
public void insertElementAt (Element element,int index) {
subtree[index] = element;
values.set(index);
leaf = false;
}

/**
* Returns the Element at the requested node,
* if the Element is not present, returns a null
*/
public Element getElementAt (int index) {
return ( (values.get(index)) ? subtree[index] : null);
}

/**
* Returns a boolean indicating that this node is a leaf
* node or not
*/
public boolean isLeaf() {
return leaf;
}
}
```

Each Element object has an array of Element[] and a BitSet object which indicates which positions (out of a possible 10) are taken by sub-treesl in other words, have other nodes below this. Each Element is created as if it were a leaf node (last node(s) of the tree). When a new Element object is added to any existing Element, the position of entry is marked by setting that bit.

The use of the Element object will become clear when we use it as a constituent of our tree.

In this tree at each node there are not just two branches, but “ten” branches to go to. Each one of these branches is decided upon by the value of the “next digit” in pattern.

The tree starts with having the first node called the zeroth node of the tree. This is an Element object that does not have a predecessor.

The following piece of code adds a number to the tree.

```public void insert (String number) {
try {
char[] numberArray = number.toCharArray();
Element currentElement = zerothNode;
Element tempElement = null;
for(int i=0; i<numberArray.length;i++) {
if ((tempElement=currentElement.getElementAt(numberArray[i]-'0'))
== null) {
tempElement = new Element();
currentElement.insertElementAt
(tempElement,(numberArray[i]-'0'));
}
currentElement = tempElement;
}
}
catch (Exception e) {
System.out.println("Unable to insert number "+number+
" Error : "+e.getMessage());
}
}
```

The adding of a pattern to the tree is like the compile() method of the Pattern class. The first element itself adds up to ‘n’ levels in the tree; each subsequent addition in the tree adds up elements to the elements above it. If there is a repetition in the pattern to be added, the existing elements in the tree are reused. The tree branches out where the pattern differs. Of course patterns, which are substrings of other patterns, are not allowed, because in pattern ‘234563’ the last ‘3’ is the leaf node and if ‘2345’ is being added, the ‘5’ at the end will never become a leaf node. This limitation can be very easily removed by adding some more information to our Element class so that elements can be characterized as leaf, non-leaf, and non-leaf-terminators.

The following code matches the already added patterns to the tree and returns the substring after removing the leading pattern, if matched.

```  public String stripPatternFromNumber(String number) {
try {
char[] numberArray = number.toCharArray();
Element currentElement = zerothNode;
Element tempElement = null;
for(int i=0; i<numberArray.length;i++) {
if ((numberArray[i] >= '0') & (numberArray[i] <= '9')) {
if ((tempElement=currentElement.getElementAt
(numberArray[i]-'0')) == null){
if((currentElement.isLeaf())&
(currentElement != zerothNode)){
return number.substring(i);
}
else {
return null;
}
}
else {
currentElement = tempElement;
}
}
else {
return null;
}
}
return null;
}
catch (Exception e) {
e.printStackTrace();
System.out.println("Error in searching number "+number+" "+
e.getMessage());
return null;
}
}
```

This searching code traverses the tree as the given number matches with the existing pattern in the tree. If the pattern is matched, the remainder of the string (customer id in our case) is returned. If the pattern does not match then a null is returned.

The complete code for class DATree.java can be viewed here.

The test program Driver.java is used to demonstrate the DATree in action.

The same program can be adapted to search for any character string, only they have to be contiguous so that we can make use of this property in creating our tree, like [a-z] or [A-Z].

This n-nary tree is different from the conventional tree data structure because here the data is not just added to the tree, but the data itself constitutes the tree. The numbers (or any character sequences) are dispersed in the tree. The value of the node is known as we traverse the tree to its leaf node on any branch. The maximum number of iterations in searching for a pattern is limited to the maximum length of any pattern in the tree.

The algorithm and design here is inspired by Trie and PATRICIA tree, both of which are multi-way trees suited for alphanumeric data. Further information on these can be found at http://www.csse.monash.edu.au/~lloyd/tildeAlgDS/Tree/Trie.html and http://www.csse.monash.edu.au/~lloyd/tildeAlgDS/Tree/PATRICIA.html, respectively.