TY - JOUR
T1 - A hybrid method for sequence clustering
AU - Hsu, Jia Lien
AU - Wu, Yu Shu
AU - Wu, I. Chin
PY - 2014/9/1
Y1 - 2014/9/1
N2 - Sequence clustering is a fundamental research topic. However, most algorithms are dedicated to single-label clustering. In this paper, we propose sequence clustering algorithms that can find multiple labels for variable-length sequences. We first map sequences to vectors in the feature space by applying the DCT to each sliding window of a sequence. The large number of feature vectors is then reduced using the histogram concept and a quantization technique. Next, we use a hierarchical clustering algorithm to determine sequence labels. We also apply minimum bounding rectangle (MBR) techniques to approximate the distribution of feature vectors, which reduces the elapsed time. In our experiments, the accuracy measured by the Rand index reaches up to 88% for single-label clustering in the equal-length case. With the MBR techniques, the elapsed time of the improved approach is reduced to as little as one sixth of that of the original approach, while the accuracy remains at 86%. For multi-label clustering, the accuracy reaches up to 85%, and the elapsed time is about one fifth of that of the single-label case.
AB - Sequence clustering is a fundamental research topic. However, most algorithms are dedicated to single-label clustering. In this paper, we propose sequence clustering algorithms that can find multiple labels for variable-length sequences. We first map sequences to vectors in the feature space by applying the DCT to each sliding window of a sequence. The large number of feature vectors is then reduced using the histogram concept and a quantization technique. Next, we use a hierarchical clustering algorithm to determine sequence labels. We also apply minimum bounding rectangle (MBR) techniques to approximate the distribution of feature vectors, which reduces the elapsed time. In our experiments, the accuracy measured by the Rand index reaches up to 88% for single-label clustering in the equal-length case. With the MBR techniques, the elapsed time of the improved approach is reduced to as little as one sixth of that of the original approach, while the accuracy remains at 86%. For multi-label clustering, the accuracy reaches up to 85%, and the elapsed time is about one fifth of that of the single-label case.
KW - Multi-label
KW - Quantization
KW - Sequence clustering
KW - Subsequence
KW - Variable-length sequence
UR - http://www.scopus.com/inward/record.url?scp=84906971523&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84906971523&partnerID=8YFLogxK
M3 - Article
AN - SCOPUS:84906971523
SN - 1016-2364
VL - 30
SP - 1483
EP - 1503
JO - Journal of Information Science and Engineering
JF - Journal of Information Science and Engineering
IS - 5
ER -