Phishing URL Detection with Oversampling based on Text Generative Adversarial Networks

Ankesh Anand, Kshitij Gorde, Joel Ruben Antony Moniz, Noseong Park, Tanmoy Chakraborty, Bei Tseng Chu

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

31 Citations (Scopus)

Abstract

The problem of imbalanced classes arises frequently in binary classification tasks. If one class outnumbers another, trained classifiers become heavily biased towards the majority class. For phishing URL detection, it is very natural that the number of collected benign URLs (i.e., the majority class) is much larger than the number of collected phishy URLs (i.e., the minority class). Oversampling the minority class can be a powerful tool to overcome this situation. However, existing methods perform the oversampling task in the feature space where the original data format is removed and URLs are succinctly represented by vectors. These methods are successful only if feature definitions are correct and the dataset is diverse and not too sparse. In this paper, we propose an oversampling technique in the data space. We train text generative adversarial networks (text-GANs) with URLs in the minority class and generate synthetic URLs that can be made part of the training set. We crawl a crowd-sourced URL repository to collect recently discovered phishy and benign URLs. Our experiments demonstrate significant performance improvements after using the proposed oversampling technique. Interestingly, some of the original test URLs are exactly regenerated by the proposed text generative model.
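The data-space oversampling idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: a small character-level Markov chain stands in for the text-GAN, purely to show the pipeline of fitting a generator on minority-class URLs and sampling synthetic URLs until the classes are balanced. The function names (`fit_char_model`, `sample_url`, `oversample_minority`) and all parameters are hypothetical.

```python
import random
from collections import defaultdict

# Sentinel characters assumed not to occur inside URLs.
START, END = "\x02", "\x03"


def fit_char_model(urls, order=3):
    """Record which character follows each `order`-character context."""
    model = defaultdict(list)
    for url in urls:
        padded = START * order + url + END
        for i in range(len(padded) - order):
            model[padded[i:i + order]].append(padded[i + order])
    return model


def sample_url(model, rng, order=3, max_len=200):
    """Sample one URL character by character from the fitted model."""
    context, chars = START * order, []
    while len(chars) < max_len:
        nxt = rng.choice(model[context])
        if nxt == END:
            break
        chars.append(nxt)
        context = context[1:] + nxt
    return "".join(chars)


def oversample_minority(benign, phishy, order=3, seed=0):
    """Generate synthetic minority-class URLs until both classes match in size.

    The generator is a stand-in (Markov chain), not a text-GAN.
    """
    rng = random.Random(seed)
    model = fit_char_model(phishy, order)
    needed = max(0, len(benign) - len(phishy))
    return [sample_url(model, rng, order) for _ in range(needed)]
```

The synthetic URLs would be appended to the minority class of the training set only; the test set stays untouched so that evaluation reflects real URLs. Because generation happens on raw strings rather than feature vectors, any feature extractor can be applied afterwards, which is the contrast with feature-space oversampling that the abstract draws.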

Original language: English
Title of host publication: Proceedings - 2018 IEEE International Conference on Big Data, Big Data 2018
Editors: Yang Song, Bing Liu, Kisung Lee, Naoki Abe, Calton Pu, Mu Qiao, Nesreen Ahmed, Donald Kossmann, Jeffrey Saltz, Jiliang Tang, Jingrui He, Huan Liu, Xiaohua Hu
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 1168-1177
Number of pages: 10
ISBN (Electronic): 9781538650356
Publication status: Published - 2019 Jan 22
Event: 2018 IEEE International Conference on Big Data, Big Data 2018 - Seattle, United States
Duration: 2018 Dec 10 → 2018 Dec 13

Publication series

Name: Proceedings - 2018 IEEE International Conference on Big Data, Big Data 2018

Conference

Conference: 2018 IEEE International Conference on Big Data, Big Data 2018
Country/Territory: United States
City: Seattle
Period: 18/12/10 → 18/12/13

Bibliographical note

Funding Information:
*Equally contributed and listed in alphabetical order; †Corresponding author; This work was partially supported by the Office of Naval Research under the MURI grant N00014-18-1-2670, and the Indo-UK Collaborative Project DST/INT/UKP158/2017.

Publisher Copyright:
© 2018 IEEE.

All Science Journal Classification (ASJC) codes

  • Computer Science Applications
  • Information Systems
