A maximum dimension partitioning approach for efficiently finding all similar pairs

Jia Ling Koh, Shao Chun Peng

Research output: Chapter in Book/Report/Conference proceedingConference contribution

1 Citation (Scopus)

Abstract

For solving the All Pair Similarity Search (APSS) problem efficiently, this paper provides a maximum dimension partitioning approach to effectively filter non-similar pairs in an early stage. At first, for each data point, the dimension with the maximum value is used to decide the corresponding segment of data partition. An adjusting method is designed to balance the number of elements in each data segment. The similar pairs consist of inter-segment similar pairs and intra-segment similar pairs, where most effort of computing APSS comes from the computation of finding inter-segment similar pairs. For speeding up the computation, a pilot-vector is used to represent each segment for estimating the upper bound of similarity between each segment pair. Only the segment pairs, whose upper bounds of similarity are larger than the given similarity threshold, need to generate the inter-segment data pairs as candidates. Moreover, based on the proposed partitioning method, we designed a MapReduce framework to solve the APSS problem in parallel. The performance evaluation results show the proposed method provides better pruning effectiveness on non-similar data pairs than the related works. Moreover, the proposed partition-based method can properly fit into the MapReduce programming scheme to effectively reduce the response time of solving the APSS problem.

Original languageEnglish
Title of host publicationBig Data Analytics and Knowledge Discovery - 18th International Conference, DaWaK 2016, Proceedings
PublisherSpringer Verlag
Pages163-178
Number of pages16
Volume9829 LNCS
ISBN (Print)9783319439457
DOIs
Publication statusPublished - 2016 Jan 1
Event18th International Conference on Big Data Analytics and Knowledge Discovery, DaWaK 2016 - Porto, Portugal
Duration: 2016 Sep 62016 Sep 8

Publication series

NameLecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume9829 LNCS
ISSN (Print)0302-9743
ISSN (Electronic)1611-3349

Other

Other18th International Conference on Big Data Analytics and Knowledge Discovery, DaWaK 2016
CountryPortugal
CityPorto
Period16/9/616/9/8

Fingerprint

Partitioning
Similarity Search
Search Problems
MapReduce
Partition
Upper bound
Pruning
Response Time
Performance Evaluation
Programming
Filter
Computing

ASJC Scopus subject areas

  • Theoretical Computer Science
  • Computer Science(all)

Cite this

Koh, J. L., & Peng, S. C. (2016). A maximum dimension partitioning approach for efficiently finding all similar pairs. In Big Data Analytics and Knowledge Discovery - 18th International Conference, DaWaK 2016, Proceedings (Vol. 9829 LNCS, pp. 163-178). (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 9829 LNCS). Springer Verlag. https://doi.org/10.1007/978-3-319-43946-4_11

A maximum dimension partitioning approach for efficiently finding all similar pairs. / Koh, Jia Ling; Peng, Shao Chun.

Big Data Analytics and Knowledge Discovery - 18th International Conference, DaWaK 2016, Proceedings. Vol. 9829 LNCS Springer Verlag, 2016. p. 163-178 (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 9829 LNCS).

Research output: Chapter in Book/Report/Conference proceedingConference contribution

Koh, JL & Peng, SC 2016, A maximum dimension partitioning approach for efficiently finding all similar pairs. in Big Data Analytics and Knowledge Discovery - 18th International Conference, DaWaK 2016, Proceedings. vol. 9829 LNCS, Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 9829 LNCS, Springer Verlag, pp. 163-178, 18th International Conference on Big Data Analytics and Knowledge Discovery, DaWaK 2016, Porto, Portugal, 16/9/6. https://doi.org/10.1007/978-3-319-43946-4_11
Koh JL, Peng SC. A maximum dimension partitioning approach for efficiently finding all similar pairs. In Big Data Analytics and Knowledge Discovery - 18th International Conference, DaWaK 2016, Proceedings. Vol. 9829 LNCS. Springer Verlag. 2016. p. 163-178. (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)). https://doi.org/10.1007/978-3-319-43946-4_11
Koh, Jia Ling ; Peng, Shao Chun. / A maximum dimension partitioning approach for efficiently finding all similar pairs. Big Data Analytics and Knowledge Discovery - 18th International Conference, DaWaK 2016, Proceedings. Vol. 9829 LNCS Springer Verlag, 2016. pp. 163-178 (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)).
@inproceedings{b56b6db6305e4595839a0cabee1c3ef4,
title = "A maximum dimension partitioning approach for efficiently finding all similar pairs",
abstract = "For solving the All Pair Similarity Search (APSS) problem efficiently, this paper provides a maximum dimension partitioning approach to effectively filter non-similar pairs in an early stage. At first, for each data point, the dimension with the maximum value is used to decide the corresponding segment of data partition. An adjusting method is designed to balance the number of elements in each data segment. The similar pairs consist of inter-segment similar pairs and intra-segment similar pairs, where most effort of computing APSS comes from the computation of finding inter-segment similar pairs. For speeding up the computation, a pilot-vector is used to represent each segment for estimating the upper bound of similarity between each segment pair. Only the segment pairs, whose upper bounds of similarity are larger than the given similarity threshold, need to generate the inter-segment data pairs as candidates. Moreover, based on the proposed partitioning method, we designed a MapReduce framework to solve the APSS problem in parallel. The performance evaluation results show the proposed method provides better pruning effectiveness on non-similar data pairs than the related works. Moreover, the proposed partition-based method can properly fit into the MapReduce programming scheme to effectively reduce the response time of solving the APSS problem.",
author = "Koh, {Jia Ling} and Peng, {Shao Chun}",
year = "2016",
month = "1",
day = "1",
doi = "10.1007/978-3-319-43946-4_11",
language = "English",
isbn = "9783319439457",
volume = "9829 LNCS",
series = "Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)",
publisher = "Springer Verlag",
pages = "163--178",
booktitle = "Big Data Analytics and Knowledge Discovery - 18th International Conference, DaWaK 2016, Proceedings",

}

TY - GEN

T1 - A maximum dimension partitioning approach for efficiently finding all similar pairs

AU - Koh, Jia Ling

AU - Peng, Shao Chun

PY - 2016/1/1

Y1 - 2016/1/1

N2 - For solving the All Pair Similarity Search (APSS) problem efficiently, this paper provides a maximum dimension partitioning approach to effectively filter non-similar pairs in an early stage. At first, for each data point, the dimension with the maximum value is used to decide the corresponding segment of data partition. An adjusting method is designed to balance the number of elements in each data segment. The similar pairs consist of inter-segment similar pairs and intra-segment similar pairs, where most effort of computing APSS comes from the computation of finding inter-segment similar pairs. For speeding up the computation, a pilot-vector is used to represent each segment for estimating the upper bound of similarity between each segment pair. Only the segment pairs, whose upper bounds of similarity are larger than the given similarity threshold, need to generate the inter-segment data pairs as candidates. Moreover, based on the proposed partitioning method, we designed a MapReduce framework to solve the APSS problem in parallel. The performance evaluation results show the proposed method provides better pruning effectiveness on non-similar data pairs than the related works. Moreover, the proposed partition-based method can properly fit into the MapReduce programming scheme to effectively reduce the response time of solving the APSS problem.

AB - For solving the All Pair Similarity Search (APSS) problem efficiently, this paper provides a maximum dimension partitioning approach to effectively filter non-similar pairs in an early stage. At first, for each data point, the dimension with the maximum value is used to decide the corresponding segment of data partition. An adjusting method is designed to balance the number of elements in each data segment. The similar pairs consist of inter-segment similar pairs and intra-segment similar pairs, where most effort of computing APSS comes from the computation of finding inter-segment similar pairs. For speeding up the computation, a pilot-vector is used to represent each segment for estimating the upper bound of similarity between each segment pair. Only the segment pairs, whose upper bounds of similarity are larger than the given similarity threshold, need to generate the inter-segment data pairs as candidates. Moreover, based on the proposed partitioning method, we designed a MapReduce framework to solve the APSS problem in parallel. The performance evaluation results show the proposed method provides better pruning effectiveness on non-similar data pairs than the related works. Moreover, the proposed partition-based method can properly fit into the MapReduce programming scheme to effectively reduce the response time of solving the APSS problem.

UR - http://www.scopus.com/inward/record.url?scp=84981225779&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84981225779&partnerID=8YFLogxK

U2 - 10.1007/978-3-319-43946-4_11

DO - 10.1007/978-3-319-43946-4_11

M3 - Conference contribution

AN - SCOPUS:84981225779

SN - 9783319439457

VL - 9829 LNCS

T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

SP - 163

EP - 178

BT - Big Data Analytics and Knowledge Discovery - 18th International Conference, DaWaK 2016, Proceedings

PB - Springer Verlag

ER -