A maximum dimension partitioning approach for efficiently finding all similar pairs

Jia Ling Koh*, Shao Chun Peng

*此作品的通信作者

研究成果: 書貢獻/報告類型會議論文篇章

1 引文 斯高帕斯(Scopus)

摘要

For solving the All Pair Similarity Search (APSS) problem efficiently, this paper provides a maximum dimension partitioning approach to effectively filter non-similar pairs in an early stage. At first, for each data point, the dimension with the maximum value is used to decide the corresponding segment of data partition. An adjusting method is designed to balance the number of elements in each data segment. The similar pairs consist of inter-segment similar pairs and intra-segment similar pairs, where most effort of computing APSS comes from the computation of finding inter-segment similar pairs. For speeding up the computation, a pilot-vector is used to represent each segment for estimating the upper bound of similarity between each segment pair. Only the segment pairs, whose upper bounds of similarity are larger than the given similarity threshold, need to generate the inter-segment data pairs as candidates. Moreover, based on the proposed partitioning method, we designed a MapReduce framework to solve the APSS problem in parallel. The performance evaluation results show the proposed method provides better pruning effectiveness on non-similar data pairs than the related works. Moreover, the proposed partition-based method can properly fit into the MapReduce programming scheme to effectively reduce the response time of solving the APSS problem.

原文英語
主出版物標題Big Data Analytics and Knowledge Discovery - 18th International Conference, DaWaK 2016, Proceedings
編輯Sanjay Madria, Takahiro Hara
發行者Springer Verlag
頁面163-178
頁數16
ISBN(列印)9783319439457
DOIs
出版狀態已發佈 - 2016
事件18th International Conference on Big Data Analytics and Knowledge Discovery, DaWaK 2016 - Porto, 葡萄牙
持續時間: 2016 9月 62016 9月 8

出版系列

名字Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
9829 LNCS
ISSN(列印)0302-9743
ISSN(電子)1611-3349

其他

其他18th International Conference on Big Data Analytics and Knowledge Discovery, DaWaK 2016
國家/地區葡萄牙
城市Porto
期間2016/09/062016/09/08

ASJC Scopus subject areas

  • 理論電腦科學
  • 一般電腦科學

指紋

深入研究「A maximum dimension partitioning approach for efficiently finding all similar pairs」主題。共同形成了獨特的指紋。

引用此