Analyzing Dataset (Email) by Using Classification Approach to Authorship Identification
Keywords:
RapidMiner, K-Nearest Neighbor (K-NN), Authorship Analysis, Stylometric Features, Classification Analysis, Authorship Categorization
Abstract
With the widespread adoption of internet technologies and applications, the misuse of email for illicit purposes has become a significant concern. Authorship identification, a core text analysis task, involves determining the most likely author of a given document. Applied to email, it attributes a message to its originator based on writing style, word choice, and other linguistic features. Classification is a common approach to authorship identification: a machine learning model is trained on texts by known authors and then used to predict the author of an unseen text.
The anonymity of email makes identities hard to trace, which compounds the problem. Cybercriminals use the internet for activities ranging from simple spamming to sophisticated phishing attacks. Authorship analysis is one measure for countering such activity. This study addresses authorship identification on a dataset of emails, with the aim of determining whether an anonymous email was written by a given suspect [1].
The primary objective of this project is to identify the authors of anonymous emails using stylometric features, which include vocabulary richness, sentence length, and writing style. By examining a dataset of emails with known authors, the study aims to distinguish between candidate authors and confirm the authorship of anonymous emails. Authorship analysis has proven effective both in countering illegal cyber activities and in revealing the true authorship of anonymous emails. This research contributes to ongoing efforts to strengthen cybersecurity and to address the misuse of online communication.
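As a rough illustration of the approach described above, the following Python sketch extracts three simple stylometric features and classifies a query by majority vote among its nearest neighbors. It is not the study's pipeline (the keywords indicate RapidMiner was used): the function names `stylometric_features` and `knn_predict`, the particular features (type-token ratio as a stand-in for vocabulary richness, mean sentence length, mean word length), and the use of unscaled Euclidean distance are all assumptions made for this example.

```python
import math
import re
from collections import Counter


def stylometric_features(text):
    """Return (type-token ratio, mean sentence length in words, mean word length).

    These three features are illustrative choices, not the paper's feature set.
    """
    words = re.findall(r"[A-Za-z']+", text.lower())
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not words or not sentences:
        return (0.0, 0.0, 0.0)
    type_token_ratio = len(set(words)) / len(words)  # vocabulary richness proxy
    mean_sentence_len = len(words) / len(sentences)
    mean_word_len = sum(len(w) for w in words) / len(words)
    return (type_token_ratio, mean_sentence_len, mean_word_len)


def knn_predict(train, query, k=3):
    """Predict an author label by majority vote among the k nearest neighbors.

    train: list of (feature_tuple, author_label) pairs from known emails.
    query: feature tuple of the anonymous email.
    Distance is plain Euclidean distance on the raw features (an assumption;
    in practice the features would usually be scaled first).
    """
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(author for _, author in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical usage: feature vectors from known emails, then one anonymous email.
known = [
    (stylometric_features("Hi. Call me. Thanks."), "author_a"),
    (stylometric_features("Ok. See you. Bye now."), "author_a"),
    (stylometric_features("Please find attached the quarterly report for your detailed review."), "author_b"),
]
anonymous = stylometric_features("Yes. Will do. Cheers.")
predicted = knn_predict(known, anonymous, k=1)
```

Because sentence length varies over a much wider range than the type-token ratio, it dominates the unscaled distance here; normalizing each feature before computing distances would be a natural refinement.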
This work is licensed under a Creative Commons Attribution 4.0 International License.