Tags: algorithmic properties, business practices, database design, database fields, databases, divesh srivastava, flavors, incorrect addresses, information repositories, insertion, integrity, match, methodologies, motivation, multitude, nick koudas, optimization problems, pediments, poor quality, university of toronto,
Approximate Joins: Concepts and Techniques
Nick Koudas Divesh Srivastava
University of Toronto AT&T LabsResearch
koudas@cs.toronto.edu divesh@research.att.com
1 Motivation 2 Tutorial Outline
The quality of the data residing in information repositories The tutorial is example driven, and organized as follows.
and databases gets degraded due to a multitude of reasons.
Formally define the various flavors of the approximate
Such reasons include typing mistakes during insertion (e.g.,
join problem as optimization problems.
character transpositions), lack of standards for recording
database fields (e.g., addresses), and various errors intro-
Contrast different approximate match predicates,
duced by poor database design (e.g., missing integrity con- based on their algorithmic properties, computational
straints). Data of poor quality can result in significant im- overhead, and adaptability.
pediments to popular business practices: sending products
or bills to incorrect addresses, inability to locate customer
Review multiple methodologies for determining
records during service calls, inability to correlate customers whether tuples of attributes are approximately the
across multiple services, etc. same, and techniques for adapting this decision prob-
In the presence of data quality errors, a problem central lem into a join framework between two data sets.
in this context is the ability to identify whether two entities
Identify key research areas requiring further work.
(e.g., relational tuples) are approximately the same. De-
pending on the type of data under consideration, various 3 Professional Biographies
"similarity metrics" (approximate match predicates) have
been defined to quantify the closeness of a pair of data en- Nick Koudas is a faculty member at the University of
tities in a way that common mistakes are captured. A key Toronto, department of computer science. He holds a Ph.D.
operation in this context is, given two large multi-attribute from the University of Toronto, an M.Sc. from the Univer-
data sets, identify all pairs of entities (tuples) in the two sets sity of Maryland at College Park, and a B.Tech. from the
that are approximately the same. This operation has been University of Patras in Greece. He serves as an associate
well studied through the years and it is known under vari- editor for the Information Systems journal and the IEEE
ous names, including record linkage, entity identification, TKDE journal. He is the recipient of the 1998 ICDE Best
entity reconciliation and approximate join, to name a few. Paper award. His research interests include core database
Given the significance and the inherent difficulty of the ap- management, data quality, metadata management and its
proximate join problem, a plethora of techniques have been applications to networking.
developed in various communities, including the statistics, Divesh Srivastava is the head of the Database Research
pattern matching, and the database communities, deploying Department at AT&T Labs-Research. He received his
diverse approximate match predicates. Ph.D. from the University of Wisconsin, Madison, and his
The objective of this tutorial is to provide a compre- B.Tech. from the Indian Institute of Technology, Bombay,
hensive and cohesive overview of the key research results, India. He is on the editorial board of the ACM SIGMOD
techniques, and tools used for approximate joins. It com- Digital Review, and is a co-chair of the ACM SIGMOD
plements recent data quality tutorials that present broad Workshop on Information Quality in Information Systems,
overviews of various aspects of data quality, and don't 2005. His current research interests include data quality, IP
delve into the details of approximate join technology [2, 1]. network data management, and XML databases.
Permission to copy without fee all or part of this material is granted pro- References
vided that the copies are not made or distributed for direct commercial
advantage, the VLDB copyright notice and the title of the publication and [1] C. Batini, T. Catarci, and M. Scannapieco. A survey
its date appear, and notice is given that copying is by permission of the of data quality issues in cooperative information sys-
Very Large Data Base Endowment. To copy otherwise, or to republish, tems. Pre-conference ER tutorial, 2004.
requires a fee and/or special permission from the Endowment.
Proceedings of the 31st VLDB Conference, [2] T. Johnson and T. Dasu. Data quality and data clean-
Trondheim, Norway, 2005 ing: An overview. SIGMOD tutorial, 2003.
1363