As I planned yesterday, here is a summary of nearest neighbor search (NNS).
Problem statement: Given a dataset, find the nearest neighbor(s) of a query point q.
1. Linear search
Scan the dataset linearly and return the point with the shortest distance to q. Although the time complexity is linear in the data size, this does not scale when the dataset is extremely large, e.g., billions of web pages.
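A minimal sketch in Python (the function name is my own, not from any library):

import math

def linear_search(points, q):
    # Scan every point, keeping the one with the smallest distance to q.
    best, best_dist = None, float("inf")
    for p in points:
        d = math.dist(p, q)  # Euclidean distance (Python 3.8+)
        if d < best_dist:
            best, best_dist = p, d
    return best, best_dist

# e.g. linear_search([(1, 2), (3, 4), (0, 0)], (1, 1)) returns ((1, 2), 1.0)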
2. The use of tree structures
Two methods are covered here. Let k denote the dimension:
kd-tree: short for k-dimensional tree.
The construction of a kd-tree is quite similar to that of a binary search tree, except that each node is a k-dimensional data point. When we split the data points at depth i, we select as the node the point whose (i mod k)-th coordinate is the median, put the points whose (i mod k)-th coordinate is smaller into the left subtree, and the rest into the right subtree. We keep splitting until there are no points left; a minimal construction sketch follows.
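A sketch in Python (class and function names are my own):

class KDNode:
    def __init__(self, point, left=None, right=None):
        self.point = point   # the k-dimensional data point stored at this node
        self.left = left
        self.right = right

def build_kdtree(points, depth=0, k=2):
    # Split on dimension (depth mod k), using the median point as the node.
    if not points:
        return None
    axis = depth % k
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return KDNode(points[mid],
                  left=build_kdtree(points[:mid], depth + 1, k),
                  right=build_kdtree(points[mid + 1:], depth + 1, k))

Re-sorting at every level costs O(n log^2 n) overall; a production build would use linear-time median selection instead.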
Adding/deleting elements and updating the tree are omitted for simplicity.
When the kd-tree is used for NNS, a pruning technique is employed. The search proceeds in a depth-first manner: at each node we check whether the data points under it could possibly be closer to the query point q than the best candidate found so far; if not, we prune the whole subtree. A sketch follows.
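Continuing the sketch above (same KDNode, Euclidean distance):

import math

def kd_nearest(node, q, depth=0, k=2, best=None):
    # Depth-first search with pruning.
    if node is None:
        return best
    if best is None or math.dist(node.point, q) < math.dist(best, q):
        best = node.point
    axis = depth % k
    near, far = ((node.left, node.right) if q[axis] < node.point[axis]
                 else (node.right, node.left))
    best = kd_nearest(near, q, depth + 1, k, best)
    # Descend into the far subtree only if the ball around q with radius
    # equal to the current best distance crosses the splitting plane;
    # otherwise the whole far subtree is pruned.
    if abs(q[axis] - node.point[axis]) < math.dist(best, q):
        best = kd_nearest(far, q, depth + 1, k, best)
    return best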
It has been pointed out that the kd-tree performs poorly for NNS in high dimensions.
R-tree: it is similar to the B-tree. All data points are stored in the leaf nodes, whereas in a kd-tree the internal nodes are also data points. The R-tree also has efficient update algorithms, which makes it suitable for dynamically changing data. Each non-leaf node stores two pieces of data: pointers to its child nodes, and the bounding box enclosing all the data points beneath them; see the node sketch below.
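A minimal sketch of that node layout (field names are my own; insertion and search are beyond this post):

class Rect:
    # Axis-aligned minimum bounding box.
    def __init__(self, mins, maxs):
        self.mins, self.maxs = mins, maxs

    def contains(self, p):
        return all(lo <= x <= hi for lo, x, hi in zip(self.mins, p, self.maxs))

class RTreeNode:
    def __init__(self, is_leaf):
        self.is_leaf = is_leaf
        # Leaf nodes hold data points; internal nodes hold
        # (bounding Rect, child RTreeNode) pairs.
        self.entries = []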
The creation of the R-tree: to be continued.
Reference: "Nearest neighbor search", Wikipedia: http://en.wikipedia.org/wiki/Nearest_neighbor_search