Mining browsing behaviors for objectionable content filtering

Lung Hao Lee, Yen Cheng Juan, Wei Lin Tseng, Hsin Hsi Chen, Yuen-Hsien Tseng

Research output: Contribution to journal, Article

7 Citations (Scopus)

Abstract

This article explores users' browsing intents to predict the category of a user's next access during web surfing and applies the results to filter objectionable content, such as pornography, gambling, violence, and drugs. Users' access trails in terms of category sequences in click-through data are employed to mine users' web browsing behaviors. Contextual relationships of URL categories are learned by the hidden Markov model. The top-level domains (TLDs) extracted from URLs themselves and the corresponding categories are caught by the TLD model. Given a URL to be predicted, its TLD and current context are empirically combined in an aggregation model. In addition to the uses of the current context, the predictions of the URL accessed previously in different contexts by various users are also considered by majority rule to improve the aggregation model. Large-scale experiments show that the advanced aggregation approach achieves promising performance while maintaining an acceptably low false positive rate. Different strategies are introduced to integrate the model with the blacklist it generates for filtering objectionable web pages without analyzing their content. In practice, this is complementary to the existing content analysis from users' behavioral perspectives.
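The pipeline summarized in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it swaps the hidden Markov model for a plain first-order transition table over categories, assumes a simple linear interpolation for the "empirical" combination of context and TLD scores, and uses hypothetical names throughout (`train`, `predict`, `majority`, the `weight` parameter, and the example URLs and category labels).

```python
from collections import Counter, defaultdict
from urllib.parse import urlparse


def tld_of(url):
    """Return the last dot-separated label of the URL's host, e.g. 'com'."""
    host = urlparse(url).hostname or ""
    return host.rsplit(".", 1)[-1]


def normalize(counter):
    """Turn a Counter of category counts into a probability dict."""
    total = sum(counter.values())
    return {cat: n / total for cat, n in counter.items()} if total else {}


def train(trails):
    """Count category transitions and TLD/category pairs.

    trails: list of sessions, each a list of (url, category) pairs in access order.
    """
    context = defaultdict(Counter)  # previous category -> next-category counts
    by_tld = defaultdict(Counter)   # TLD -> category counts
    for trail in trails:
        for url, cat in trail:
            by_tld[tld_of(url)][cat] += 1
        for (_, prev), (_, nxt) in zip(trail, trail[1:]):
            context[prev][nxt] += 1
    return context, by_tld


def predict(prev_cat, url, context, by_tld, weight=0.5):
    """Interpolate context and TLD scores (assumed scheme); return the top category."""
    p_ctx = normalize(context[prev_cat])
    p_tld = normalize(by_tld[tld_of(url)])
    cats = sorted(set(p_ctx) | set(p_tld))  # sorted for deterministic tie-breaking
    if not cats:
        return None

    def score(cat):
        return weight * p_ctx.get(cat, 0.0) + (1 - weight) * p_tld.get(cat, 0.0)

    return max(cats, key=score)


def majority(predictions):
    """Majority rule over predictions made for the same URL in other contexts."""
    return Counter(predictions).most_common(1)[0][0] if predictions else None


# Hypothetical click-through trails, for illustration only.
trails = [
    [("http://news.example.com/a", "news"), ("http://casino.example.bet/b", "gambling")],
    [("http://news.example.org/c", "news"), ("http://poker.example.bet/d", "gambling")],
    [("http://shop.example.com/e", "shopping"), ("http://mail.example.com/f", "news")],
]
context, by_tld = train(trails)
label = predict("news", "http://new.example.bet/z", context, by_tld)
```

In this toy setup the unseen URL is labeled from its TLD and the preceding category alone, without fetching the page, which mirrors the abstract's point that filtering can proceed without analyzing content. URLs whose majority label is objectionable could then be added to a blacklist.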

Original language: English
Pages (from-to): 930-942
Number of pages: 13
Journal: Journal of the Association for Information Science and Technology
Volume: 66
Issue number: 5
DOI: 10.1002/asi.23217
Publication status: Published - 2015 May 1

Keywords

  • collaborative filtering

ASJC Scopus subject areas

  • Information Systems
  • Computer Networks and Communications
  • Information Systems and Management
  • Library and Information Sciences

Cite this

Mining browsing behaviors for objectionable content filtering. / Lee, Lung Hao; Juan, Yen Cheng; Tseng, Wei Lin; Chen, Hsin Hsi; Tseng, Yuen-Hsien.

In: Journal of the Association for Information Science and Technology, Vol. 66, No. 5, 01.05.2015, p. 930-942.

Lee, Lung Hao ; Juan, Yen Cheng ; Tseng, Wei Lin ; Chen, Hsin Hsi ; Tseng, Yuen-Hsien. / Mining browsing behaviors for objectionable content filtering. In: Journal of the Association for Information Science and Technology. 2015 ; Vol. 66, No. 5. pp. 930-942.
@article{b2cce998c0674bcdacc1e980b9a013fc,
title = "Mining browsing behaviors for objectionable content filtering",
abstract = "This article explores users' browsing intents to predict the category of a user's next access during web surfing and applies the results to filter objectionable content, such as pornography, gambling, violence, and drugs. Users' access trails in terms of category sequences in click-through data are employed to mine users' web browsing behaviors. Contextual relationships of URL categories are learned by the hidden Markov model. The top-level domains (TLDs) extracted from URLs themselves and the corresponding categories are caught by the TLD model. Given a URL to be predicted, its TLD and current context are empirically combined in an aggregation model. In addition to the uses of the current context, the predictions of the URL accessed previously in different contexts by various users are also considered by majority rule to improve the aggregation model. Large-scale experiments show that the advanced aggregation approach achieves promising performance while maintaining an acceptably low false positive rate. Different strategies are introduced to integrate the model with the blacklist it generates for filtering objectionable web pages without analyzing their content. In practice, this is complementary to the existing content analysis from users' behavioral perspectives.",
keywords = "collaborative filtering",
author = "Lee, {Lung Hao} and Juan, {Yen Cheng} and Tseng, {Wei Lin} and Chen, {Hsin Hsi} and Tseng, Yuen-Hsien",
year = "2015",
month = "5",
day = "1",
doi = "10.1002/asi.23217",
language = "English",
volume = "66",
pages = "930--942",
journal = "Journal of the Association for Information Science and Technology",
issn = "2330-1635",
publisher = "John Wiley and Sons Ltd",
number = "5",

}

TY  - JOUR
T1  - Mining browsing behaviors for objectionable content filtering
AU  - Lee, Lung Hao
AU  - Juan, Yen Cheng
AU  - Tseng, Wei Lin
AU  - Chen, Hsin Hsi
AU  - Tseng, Yuen-Hsien
PY  - 2015/5/1
Y1  - 2015/5/1
N2  - This article explores users' browsing intents to predict the category of a user's next access during web surfing and applies the results to filter objectionable content, such as pornography, gambling, violence, and drugs. Users' access trails in terms of category sequences in click-through data are employed to mine users' web browsing behaviors. Contextual relationships of URL categories are learned by the hidden Markov model. The top-level domains (TLDs) extracted from URLs themselves and the corresponding categories are caught by the TLD model. Given a URL to be predicted, its TLD and current context are empirically combined in an aggregation model. In addition to the uses of the current context, the predictions of the URL accessed previously in different contexts by various users are also considered by majority rule to improve the aggregation model. Large-scale experiments show that the advanced aggregation approach achieves promising performance while maintaining an acceptably low false positive rate. Different strategies are introduced to integrate the model with the blacklist it generates for filtering objectionable web pages without analyzing their content. In practice, this is complementary to the existing content analysis from users' behavioral perspectives.
AB  - This article explores users' browsing intents to predict the category of a user's next access during web surfing and applies the results to filter objectionable content, such as pornography, gambling, violence, and drugs. Users' access trails in terms of category sequences in click-through data are employed to mine users' web browsing behaviors. Contextual relationships of URL categories are learned by the hidden Markov model. The top-level domains (TLDs) extracted from URLs themselves and the corresponding categories are caught by the TLD model. Given a URL to be predicted, its TLD and current context are empirically combined in an aggregation model. In addition to the uses of the current context, the predictions of the URL accessed previously in different contexts by various users are also considered by majority rule to improve the aggregation model. Large-scale experiments show that the advanced aggregation approach achieves promising performance while maintaining an acceptably low false positive rate. Different strategies are introduced to integrate the model with the blacklist it generates for filtering objectionable web pages without analyzing their content. In practice, this is complementary to the existing content analysis from users' behavioral perspectives.
KW  - collaborative filtering
UR  - http://www.scopus.com/inward/record.url?scp=84944315865&partnerID=8YFLogxK
UR  - http://www.scopus.com/inward/citedby.url?scp=84944315865&partnerID=8YFLogxK
U2  - 10.1002/asi.23217
DO  - 10.1002/asi.23217
M3  - Article
VL  - 66
SP  - 930
EP  - 942
JO  - Journal of the Association for Information Science and Technology
JF  - Journal of the Association for Information Science and Technology
SN  - 2330-1635
IS  - 5
ER  -