Efficient Similarity Join Algorithm over Weight Vectors

1. Overview

Similarity join, an operation that finds all pairs of similar objects in a large collection, is widely used in many application domains, including data cleansing and integration, information retrieval, collaborative filtering, clustering, pattern recognition, and bioinformatics. Existing similarity join algorithms use an inverted index with filtering techniques to avoid unnecessary similarity computations. However, they filter out dissimilar pairs inefficiently, especially when element weights must be considered. We devised an efficient algorithm for similarity joins over weight vectors. It extends easily to other similarity predicates based on aggregate weighted similarity functions. Our algorithm builds largely on All-pairs and improves its filtering performance by computing tight similarity upper bounds with little overhead.
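To make the idea of an inverted index combined with upper-bound filtering concrete, here is a minimal sketch of a baseline similarity join with a cosine predicate. It is not the algorithm from our publications: it only illustrates a simple per-dimension maximum-weight bound used to keep part of each vector out of the index, and the tighter upper bounds mentioned above are not shown. The names `normalize`, `similarity_join`, `max_weight`, and `unindexed` are hypothetical, and the sketch assumes non-empty vectors with positive weights and a threshold in (0, 1].

```python
import math
from collections import defaultdict


def normalize(vec):
    """Scale a sparse vector (dict: dimension -> weight) to unit length."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {dim: w / norm for dim, w in vec.items()}


def similarity_join(vectors, threshold):
    """Return (i, j, cosine) for all pairs i < j with cosine >= threshold.

    Illustrative baseline only: assumes non-empty vectors with positive
    weights and 0 < threshold <= 1.
    """
    vecs = [normalize(v) for v in vectors]

    # Largest weight seen in each dimension over the whole collection.
    max_weight = defaultdict(float)
    for v in vecs:
        for dim, w in v.items():
            max_weight[dim] = max(max_weight[dim], w)

    index = defaultdict(list)  # dimension -> [(vector id, weight)]
    unindexed = []             # per vector: the part kept out of the index
    results = []

    for xid, x in enumerate(vecs):
        # Candidate generation: accumulate partial dot products through
        # the inverted index of previously processed vectors.
        acc = defaultdict(float)
        for dim, w in x.items():
            for yid, yw in index.get(dim, ()):
                acc[yid] += w * yw

        # Verification: add the contribution of the unindexed part of y.
        for yid, partial in acc.items():
            rest = unindexed[yid]
            sim = partial + sum(w * rest[d] for d, w in x.items() if d in rest)
            if sim >= threshold:
                results.append((yid, xid, sim))

        # Indexing: a leading part of x whose best possible dot product
        # with any vector stays below the threshold cannot produce a match
        # on its own, so it is left out of the index.
        bound = 0.0
        prefix = {}
        for dim in sorted(x):
            bound += x[dim] * max_weight[dim]
            if bound >= threshold:
                index[dim].append((xid, x[dim]))
            else:
                prefix[dim] = x[dim]
        unindexed.append(prefix)

    return results


if __name__ == "__main__":
    data = [{1: 1.0, 2: 1.0}, {1: 1.0, 2: 1.0, 3: 0.1}, {4: 1.0}]
    for i, j, sim in similarity_join(data, 0.9):
        print(i, j, round(sim, 4))
```

With the three toy vectors in the usage example and a threshold of 0.9, the only reported pair is the first two vectors. The third vector shares no dimension with the others and is never considered as a candidate.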

2. People

3. Publications

  1. Dongjoo Lee, Jaehui Park, Jouho Shim, Sang-goo Lee, An Efficient Similarity Join Algorithm with Cosine Similarity Predicate, In Proc. of DEXA 2010, 2010
  2. Dongjoo Lee, Jouho Shim, Sang-goo Lee, An Efficient Algorithm for Similarity Join over Weight Vectors. Submitted for publication.

4. Experiments

Setup

Environment

Datasets

Similarity Measures

Tokenization methods

Results

5. Download

Source code

Data