# Datasets

## Overview

The datasets are the collections of data used to train and evaluate the model. Well-structured datasets are crucial for keeping the development process effective and efficient.

## Structure

The datasets are organized into the following directories:

- `raw/`: Contains the raw, unprocessed data.
- `processed/`: Contains the **well-structured** processed data ready for training.
- `scripts/`: Contains scripts for data processing and augmentation.

## Data Sources

Classic datasets:

- [Assassin](https://spamassassin.apache.org/old/publiccorpus) (Apache Open-Source Dataset, 2005)
- [CEAS_08](https://plg.uwaterloo.ca/~gvcormac/ceascorpus/) (University of Waterloo, 2008)
- [Enron](https://www.cs.cmu.edu/~enron/) (Carnegie Mellon University, 2000 - 2015)
- [Nazario](https://monkey.org/~jose/phishing/) (José Nazario, 2005 - 2024)
- Nigerian (1998 - 2007)
- [TREC](https://plg.uwaterloo.ca/~gvcormac/treccorpus07/) (University of Waterloo, 2007)

Improved datasets:

- [Curated Datasets 1](https://figshare.com/articles/dataset/Phishing_Email_11_Curated_Datasets/24952503/1)
- [Curated Datasets 2](https://figshare.com/articles/dataset/Curated_Dataset_-_Phishing_Email/24899952)

The improved datasets are enhanced and optimized versions of the classic datasets. They are still subsets of the classic datasets, however, and do not include more recent phishing patterns.

## Processing

Data processing is completed in two steps. The first step is performed in the datasets directory using scripts: the raw datasets are cleaned, extracted, and organized into a well-structured, class-balanced format. The second step is carried out in the corresponding dataset's `LightningDataModule`, which handles tokenization, collation, and other operations, ultimately producing a `DataLoader` for model training and testing.

## Dataset-specific Information

> WARNING! Corpus may contain viruses, fraudulent solicitations, and other files that may pose a security risk.
> Do not view any files in the folder with an ordinary browser or email client. Also note that virus or adware removal tools may damage the corpus.

### Classic Datasets

An all-in-one Kaggle solution is available at [link](https://www.kaggle.com/datasets/naserabdullahalam/phishing-email-dataset).

#### [Assassin](https://spamassassin.apache.org/old/publiccorpus) (Apache Open-Source Dataset, 2005)

This dataset is contributed by the internet community and is one of the earliest spam email datasets, covering a variety of topics.

- spam: 500 spam messages, all received from non-spam-trap sources.
- easy_ham: 2500 non-spam messages. These are typically quite easy to differentiate from spam, since they frequently do not contain any spammish signatures (such as HTML).
- hard_ham: 250 non-spam messages which are closer in many respects to typical spam: use of HTML, unusual HTML markup, coloured text, "spammish-sounding" phrases, etc.
- easy_ham_2: 1400 non-spam messages. A more recent addition to the set.
- spam_2: 1397 spam messages. Again, more recent.

Total count: 6047 messages, with about a 31% spam ratio.

#### [CEAS_08](https://plg.uwaterloo.ca/~gvcormac/ceascorpus/) (University of Waterloo, 2008)

This dataset comes from the CEAS 2008 email detection challenge. All emails during the competition were collected in real time and cover a wide range of topics.

The structure of the original dataset is relatively complex. You can instead use an existing CEAS_08 version available on Kaggle, such as [link](https://www.kaggle.com/datasets/doryanay/ceas-08).

It contains 39154 samples in total: 17312 (44.2%) benign and 21842 (55.8%) spam.

#### [Enron](https://www.cs.cmu.edu/~enron/) (Carnegie Mellon University, 2000 - 2015)

This dataset is a collection of emails from the Enron Corporation, which was involved in a major corporate scandal.

A Kaggle version can be found at [link](https://www.kaggle.com/datasets/wcukierski/enron-email-dataset).

It contains 517401 samples in total.
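
The class ratios quoted for the Assassin and CEAS_08 corpora follow directly from the per-folder counts. The snippet below is a minimal sketch that recomputes them. The names `ASSASSIN`, `CEAS_08`, and `class_summary` are hypothetical and not part of this repository, and grouping the `easy_ham`, `hard_ham`, and `easy_ham_2` folders (and the CEAS_08 benign samples) under a single `ham` label is an assumption made for illustration.

```python
from collections import Counter

# (folder, label, message count), using the counts quoted in this README.
# Mapping every non-spam folder to the single label "ham" is an assumption.
ASSASSIN = [
    ("spam", "spam", 500),
    ("easy_ham", "ham", 2500),
    ("hard_ham", "ham", 250),
    ("easy_ham_2", "ham", 1400),
    ("spam_2", "spam", 1397),
]
CEAS_08 = [
    ("benign", "ham", 17312),
    ("spam", "spam", 21842),
]


def class_summary(rows):
    """Return (total, counts per label, fraction of total per label)."""
    counts = Counter()
    for _folder, label, n in rows:
        counts[label] += n
    total = sum(counts.values())
    ratios = {label: n / total for label, n in counts.items()}
    return total, dict(counts), ratios


if __name__ == "__main__":
    for name, rows in (("Assassin", ASSASSIN), ("CEAS_08", CEAS_08)):
        total, counts, ratios = class_summary(rows)
        shares = {label: f"{r:.1%}" for label, r in ratios.items()}
        print(f"{name}: total={total} counts={counts} shares={shares}")
```

A check of this kind can be rerun after the first processing step to see how far the class-balanced output has moved from the raw ratios.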
#### [Nazario](https://monkey.org/~jose/phishing/) (José Nazario, 2005 - 2024)

This dataset is a collection of phishing emails curated by José Nazario over nearly two decades. Its most notable feature is that it is entirely constructed from phishing emails received in his personal inbox.

#### Nigerian (1998 - 2007)

These "Nigerian" datasets typically focus on "Nigerian fraud emails," that is, phishing and scam emails, rather than the more common advertising or junk mail types found in spam/ham classification.

#### [TREC](https://plg.uwaterloo.ca/~gvcormac/treccorpus07/) (University of Waterloo, 2007)

### Improved Datasets

#### [Curated Datasets 1](https://figshare.com/articles/dataset/Phishing_Email_11_Curated_Datasets/24952503/1)

#### [Curated Datasets 2](https://figshare.com/articles/dataset/Curated_Dataset_-_Phishing_Email/24899952)