Introduction:
The Nearest Neighbor Classifier, while robust and capable of handling streaming data, is sensitive to outlying data points and to irrelevant features. One critical part of designing a good nearest neighbor classifier is deciding which feature set to use in classifying new data points. One great way to choose the correct subset is through search. Exhaustive search isn’t realistic, though, for data sets with large numbers of features, since the number of possible feature subsets is exponential: 2^F, where F is the number of features in the data.
The purpose of this program is to search through the space of possible feature subsets in a faster way, that is, in polynomial time or better, without sacrificing too much accuracy in the classification. The first two methods we use are fairly straightforward: Forward Selection and Backward Elimination. The former begins with the empty set of features and adds one feature at a time, while the latter begins with all the features and removes one feature at a time.
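For concreteness, here is a simplified sketch of Forward Selection driven by leave-one-out nearest-neighbor accuracy. The data layout (a list of (label, features) pairs), the stopping rule, and the names are illustrative, not necessarily how the program itself is structured:

```python
# Sketch: forward selection scored by leave-one-out 1-NN accuracy.
def loo_accuracy(data, feature_subset):
    correct = 0
    for i, (label, feats) in enumerate(data):
        best_dist, best_label = float("inf"), None
        for j, (other_label, other_feats) in enumerate(data):
            if i == j:
                continue
            # Squared Euclidean distance over the chosen features only.
            dist = sum((feats[f] - other_feats[f]) ** 2 for f in feature_subset)
            if dist < best_dist:
                best_dist, best_label = dist, other_label
        if best_label == label:
            correct += 1
    return correct / len(data)

def forward_selection(data, num_features):
    selected, best_overall = [], 0.0
    remaining = set(range(num_features))
    while remaining:
        best_feature, best_acc = None, -1.0
        for f in remaining:
            acc = loo_accuracy(data, selected + [f])
            if acc > best_acc:
                best_feature, best_acc = f, acc
        if best_acc <= best_overall:  # stop once adding a feature no longer helps
            break
        selected.append(best_feature)
        remaining.remove(best_feature)
        best_overall = best_acc
    return selected, best_overall
```

Backward Elimination is the mirror image: start with the full feature set and repeatedly drop the feature whose removal hurts accuracy the least.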
The third, original, method to search for a good subset of features requires some explanation. By relaxing our criteria for what constitutes the “nearest neighbor”, we are able to avoid some of the calculations that make searching this space expensive. In other words, we sacrifice some accuracy of the classifier in order to gain a great deal of speed in computing the subset. The algorithm works as follows:
- All of the data is normalized so that every feature’s value falls between 0 and 1.
- The user is prompted to enter a value I call “delta” to be used during the nearest-neighbor computation. To be accurate, it is no longer the “nearest” neighbor in the set that we are interested in, only a “pretty good” one. So the modified nearest neighbor selector returns the first data point that falls within delta units of the given point (a sketch of this lookup appears after this list).
- I run the Forward Selection Search using the modified nearest neighbor algorithm to return a “pretty good” subset of features.
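A simplified sketch of the two modified pieces, min-max normalization and the relaxed “pretty good” neighbor lookup, is shown below. The helper names are illustrative, and here the lookup falls back to the nearest point seen if nothing lands within delta:

```python
def normalize(data):
    """Min-max scale every feature into [0, 1]."""
    num_features = len(data[0][1])
    lo = [min(feats[f] for _, feats in data) for f in range(num_features)]
    hi = [max(feats[f] for _, feats in data) for f in range(num_features)]
    return [
        (label, [(feats[f] - lo[f]) / (hi[f] - lo[f]) if hi[f] > lo[f] else 0.0
                 for f in range(num_features)])
        for label, feats in data
    ]

def pretty_good_neighbor(data, query_index, feature_subset, delta):
    _, query_feats = data[query_index]
    best_dist, best_label = float("inf"), None
    for j, (other_label, other_feats) in enumerate(data):
        if j == query_index:
            continue
        dist = sum((query_feats[f] - other_feats[f]) ** 2 for f in feature_subset)
        if dist <= delta:
            return other_label     # accept the first point that is "close enough"
        if dist < best_dist:
            best_dist, best_label = dist, other_label
    return best_label              # no point within delta: use the nearest seen
```

Because the scan can stop at the first point within delta, most queries avoid a full pass over the data set, which is where the speedup comes from.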
Here I’ve omitted a sample trace of the program because it is lengthy, but you can download the source and run it in the Python interpreter just as easily.
Running these three algorithms on various data sets yielded the following statistics (for Special Search, the number in parentheses is the delta value used):
Large Data Set – 1000 points, 30 Features

| Method | Best Set | Acc. % | Time (s) |
| --- | --- | --- | --- |
| Forward | {7, 9, 12} | 87.40 | 14255.89 |
| Backward | {7, 9, 17, 25} | 93.00 | 20507.24 |
| Special(1) | {0, 1, 4, 7, 9, 10, 14, 15, 16, 17, 18, 19, 20, 27, 28} | 71.60 | 4909.35 |
| Special(2) | {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 28} | 69.60 | 647.04 |
Small Data Set – 600 points, 16 Features

| Method | Best Set | Acc. % | Time (s) |
| --- | --- | --- | --- |
| Forward | {2, 4, 9} | 89.66 | 1120.73 |
| Backward | {2, 4} | 88.17 | 1519.89 |
| Special(0.25) | {2, 4, 5, 11, 12, 13, 14} | 81.66 | 449.00 |
| Special(0.5) | {1, 2, 3, 4, 5, 6, 7, 11, 12, 13, 14} | 78.83 | 221.63 |
| Special(1) | {0, 1, 2, 4, 5, 6, 7, 9, 10, 13, 14} | 73.66 | 47.09 |
| Special(2) | {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15} | 75.33 | 8.27 |
Small Special Set – 1000 points, 19 Features

| Method | Best Set | Acc. % | Time (s) |
| --- | --- | --- | --- |
| Forward | {0, 1, 2, 3} | 86.60 | 2205.22 |
| Backward | {0, 1, 2, 3} | 86.60 | 3806.77 |
| Special(0.25) | {0, 1, 2, 3, 5, 7, 13, 14, 15} | 67.20 | 1025.89 |
| Special(0.5) | {0, 1, 4, 8, 12, 14, 16} | 61.10 | 590.12 |
| Special(1) | {0, 1, 2, 3, 4, 7, 8, 10, 11, 12, 14, 15, 16, 17} | 61.10 | 128.71 |
| Special(2) | {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 13, 14, 15, 16, 17, 18} | 60.00 | 14.13 |
Analysis:
It is worth noting that running Special Search with delta = 0 is equivalent to running Forward Selection. As delta increases beyond a certain threshold (data set dependent), Special Search degenerates rapidly. This is because for every data point in the set, we will simply choose the next point we look at as the nearest neighbor. With two classes of points, this method will essentially yield 50% accuracy. Hence, it is possible to wind up with a feature set that is less accurate than “leave-one-out” evaluation with all the features.
Therefore, the selection of delta is important to achieve good performance in Special Search. There are some things we can say, quantitatively, about the selection of delta. First, it must by definition fall between 0 and F, where F is the number of features in the data set. The lower bound holds because two points cannot be closer than 0 units in Euclidean space. The upper bound comes from the fact that when the data are normalized to values between 0 and 1, each point can differ by no more than 1 unit per feature; the bound is F rather than √F because the distance function I am using skips the square-root step to save time.
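A short illustration of the bound, assuming the square-root shortcut described above (the names here are mine, for illustration only):

```python
# Why 0 <= delta <= F: with every feature scaled to [0, 1] and the square root
# skipped, each feature contributes at most 1.0 to the distance.
def squared_distance(a, b, feature_subset):
    return sum((a[f] - b[f]) ** 2 for f in feature_subset)

F = 30
a, b = [0.0] * F, [1.0] * F                    # two maximally separated points
assert squared_distance(a, b, range(F)) == F   # the largest value delta ever needs
```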
It is apparent that the “proper” choice of delta will vary greatly from data set to data set. If the data have great separation, a large delta value will suffice and lead to faster computation. Conversely, a data set with tightly grouped instances of opposite classes will require a smaller delta value to preserve a high degree of accuracy.
It is possible to do some simple preprocessing of the data before we begin the search in order to give us some idea of the data points’ separation. In fact, I used the following strategy to give me a ballpark figure for a reasonable delta value: First, I selected an arbitrary number of points at random from the set. For each point, I then recorded the class of that point, the distance to each other point in the set, and the class of each other point. I then sorted these data by class and ascending distance.
From this data, I got an idea of what value, on average, would allow us to correctly classify points without searching every point for the nearest neighbor. For the above example, a delta value of 1.75 would not sacrifice any accuracy, because the closest point of the opposite class was at a distance of 1.757 units. It must be mentioned, though, that randomly sampling the data in this way will not give us perfect information for the choice of delta, only a subjective “feel” for the data’s separation.
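The preprocessing step looks roughly like this (a simplified sketch; the sample size and helper names are illustrative, not taken from the program):

```python
import random

# Sample a few points, record each one's distance to every other point along
# with the class labels, and sort by class and ascending distance.
def sample_separation(data, feature_subset, num_samples=10):
    report = []
    for label, feats in random.sample(data, num_samples):
        rows = []
        for other_label, other_feats in data:
            if other_feats is feats:
                continue
            dist = sum((feats[f] - other_feats[f]) ** 2 for f in feature_subset)
            rows.append((other_label, dist))
        rows.sort(key=lambda row: (row[0], row[1]))  # by class, then distance
        report.append((label, rows))
    return report

# The smallest distance from a sampled point to a point of the opposite class
# gives a rough ceiling for a "safe" delta.
def rough_delta(report):
    return min(dist for label, rows in report
               for other_label, dist in rows if other_label != label)
```

Picking a value slightly below this rough ceiling, as with the 1.75 versus 1.757 figures above, leaves some margin for points the sample missed.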
It is up to the user to weigh the accuracy of the classifier versus the speed at which it classifies new points. Some data sets will be very conducive to Special Search while others will perform very poorly.