Thesis

Privacy-Preserving Spam Detection: Exploring Synthetic Email Generation for Model Training

Supervisor

Malte Josten, M.Sc.

Thesis: Master’s Thesis

Detecting spam and phishing requires large amounts of training data, but real emails often contain sensitive information, which makes their use legally and ethically problematic. This thesis explores how synthetic data generation (e.g., using Generative AI or classical techniques) can be used to create realistic yet anonymized “ham” emails (legitimate emails). The goal is to improve spam detection models without exposing real user data.

What Will You Do?

Generate synthetic data: Experiment with techniques like GANs, LLMs, or rule-based approaches to create realistic emails.
Ensure privacy compliance: Develop or apply methods to guarantee that generated data cannot be traced back to real users (e.g., through differential privacy or anonymization).
Train and evaluate models: Compare the performance of spam classifiers trained on synthetic vs. real data.
Assess practical applicability: Analyze whether synthetic data can serve as a valid alternative to sensitive training datasets.
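
The generate, train, and compare steps above can be sketched in a few lines of Python. This is a minimal illustration only, assuming a rule-based generator and a small hand-written Naive Bayes classifier. The templates, slot values, and all example emails are invented placeholders (no real data), and the names generate_ham, train_nb, predict, and fit are hypothetical. A thesis would use real corpora, stronger generators (e.g., GANs or LLMs), and established libraries.

```python
import math
import random
from collections import Counter

# Hypothetical templates and slot values, invented for illustration only.
TEMPLATES = [
    "Hi {name}, can we move the {event} to {day}?",
    "Dear {name}, please find the {doc} attached.",
    "Hello {name}, thanks for the {doc}, see you on {day}.",
]
SLOTS = {
    "name": ["Alex", "Sam", "Kim"],
    "event": ["meeting", "seminar", "call"],
    "day": ["Monday", "Thursday", "Friday"],
    "doc": ["report", "agenda", "invoice"],
}

def generate_ham(n, seed=0):
    """Rule-based generation: fill random templates with random slot values."""
    rng = random.Random(seed)
    emails = []
    for _ in range(n):
        template = rng.choice(TEMPLATES)
        values = {slot: rng.choice(options) for slot, options in SLOTS.items()}
        emails.append(template.format(**values))  # unused slots are ignored
    return emails

def tokenize(text):
    words = (w.strip(".,!?:;").lower() for w in text.split())
    return [w for w in words if w]

def train_nb(texts, labels):
    """Collect per-class word counts for a multinomial Naive Bayes model."""
    counts = {label: Counter() for label in set(labels)}
    docs = Counter(labels)
    for text, label in zip(texts, labels):
        counts[label].update(tokenize(text))
    vocab = {w for c in counts.values() for w in c}
    return counts, docs, vocab

def predict(model, text):
    counts, docs, vocab = model
    total_docs = sum(docs.values())
    best_label, best_score = None, -math.inf
    for label in sorted(counts):
        total = sum(counts[label].values())
        score = math.log(docs[label] / total_docs)
        for w in tokenize(text):
            # Add-one smoothing, with one extra bucket for unseen words.
            score += math.log((counts[label][w] + 1) / (total + len(vocab) + 1))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

def accuracy(model, test_set):
    hits = sum(predict(model, text) == label for text, label in test_set)
    return hits / len(test_set)

# Toy placeholder data standing in for a real corpus (invented, not real emails).
SPAM_TRAIN = [
    "Win a free prize now, click here",
    "Cheap loans, click now for free money",
    "You won a free gift card, claim your prize",
]
REAL_HAM_TRAIN = [
    "Hi Tom, can we move the call to Tuesday?",
    "Please find the agenda attached",
    "Thanks for the report, see you on Friday",
]
REAL_TEST = [
    ("Claim your free prize now", "spam"),
    ("Hello Kim, the meeting is on Monday", "ham"),
    ("Click here for cheap money", "spam"),
    ("Dear Sam, the invoice is attached", "ham"),
]

def fit(ham):
    texts = ham + SPAM_TRAIN
    labels = ["ham"] * len(ham) + ["spam"] * len(SPAM_TRAIN)
    return train_nb(texts, labels)

# Train once on "real" ham and once on synthetic ham; test both on real emails.
model_real = fit(REAL_HAM_TRAIN)
model_syn = fit(generate_ham(len(REAL_HAM_TRAIN)))
acc_real = accuracy(model_real, REAL_TEST)
acc_syn = accuracy(model_syn, REAL_TEST)
print(f"trained on real ham:      {acc_real:.2f}")
print(f"trained on synthetic ham: {acc_syn:.2f}")
```

On such a tiny toy set the two accuracies say nothing about real performance. The point is the evaluation setup, in which both models are scored on the same held-out real emails.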

Prerequisites

Required

Basic understanding of machine learning and artificial intelligence, e.g., autoencoders or GANs (you have completed the course Foundations of Artificial Intelligence)
Familiarity with Natural Language Processing techniques
Proficiency in at least one programming language (preferably Python)
Proficiency in using LaTeX

Optional

You have taken the following courses:
- Internettechnologies & Web Engineering
- Advanced Methods of Machine Learning
- Security in Communication Networks
Familiarity with evaluation metrics for AI models
Basic knowledge of privacy-preserving techniques
Basic knowledge of principles related to spam and phishing

Verteilte Systeme

Faculty of Computer Science