Thesis
Privacy-Preserving Spam Detection: Exploring Synthetic Email Generation for Model Training
Supervisor
Thesis: Master’s Thesis
Detecting spam and phishing requires large amounts of training data, but real emails often contain sensitive information, which makes their use legally and ethically problematic from a privacy perspective. This thesis explores how synthetic data generation (e.g., using Generative AI or classical techniques) can be used to create realistic yet anonymized “ham” emails (legitimate emails). The goal is to improve spam detection models without compromising real user data.
What Will You Do?
- Generate synthetic data: Experiment with techniques like GANs, LLMs, or rule-based approaches to create realistic emails.
- Ensure privacy compliance: Develop or apply methods to guarantee that generated data cannot be traced back to real users (e.g., through differential privacy or anonymization).
- Train and evaluate models: Compare the performance of spam classifiers trained on synthetic vs. real data.
- Assess practical applicability: Analyze whether synthetic data can serve as a valid alternative to sensitive training datasets.
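To make the tasks above more concrete, the following is a minimal sketch of a "train on synthetic, test on real" comparison. It is an illustration under stated assumptions, not the method prescribed for the thesis: the templates, slot fillers, toy emails, and all function names (`generate_ham`, `train_naive_bayes`, `predict`, `accuracy`) are hypothetical, the "real" emails are hand-written stand-ins, and a small Naive Bayes classifier stands in for whatever model is eventually chosen. It does not implement any privacy guarantee such as differential privacy.

```python
import math
import random
from collections import Counter

# Hypothetical templates and slot fillers for rule-based synthetic "ham".
TEMPLATES = [
    "hi {name} can we move the {event} to {day}",
    "please find the {doc} attached for the {event}",
    "thanks {name} the {doc} looks good see you {day}",
]
SLOTS = {
    "name": ["alex", "sam", "kim"],
    "event": ["meeting", "review", "workshop"],
    "day": ["monday", "tuesday", "friday"],
    "doc": ["report", "agenda", "invoice"],
}

def generate_ham(n, seed=0):
    """Fill random templates with random slot values (no real user data involved)."""
    rng = random.Random(seed)
    emails = []
    for _ in range(n):
        template = rng.choice(TEMPLATES)
        fillers = {slot: rng.choice(words) for slot, words in SLOTS.items()}
        emails.append(template.format(**fillers))  # unused fillers are ignored
    return emails

def train_naive_bayes(texts, labels):
    """Count words per class for a multinomial Naive Bayes classifier."""
    word_counts = {"ham": Counter(), "spam": Counter()}
    doc_counts = Counter(labels)
    for text, label in zip(texts, labels):
        word_counts[label].update(text.split())
    vocab = set(word_counts["ham"]) | set(word_counts["spam"])
    return word_counts, doc_counts, vocab

def predict(model, text):
    """Return the class with the higher log-probability under the model."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    scores = {}
    for label in ("ham", "spam"):
        total_words = sum(word_counts[label].values())
        score = math.log(doc_counts[label] / total_docs)
        for word in text.split():
            # Add-one smoothing, with one extra slot for words unseen in training.
            score += math.log((word_counts[label][word] + 1) / (total_words + len(vocab) + 1))
        scores[label] = score
    return max(scores, key=scores.get)

def accuracy(model, texts, labels):
    correct = sum(predict(model, t) == y for t, y in zip(texts, labels))
    return correct / len(texts)

# Toy stand-ins for real data (hand-written for this sketch, not real emails).
SPAM_TRAIN = [
    "win a free prize now click here",
    "cheap loans click now limited offer",
    "free money claim your prize today",
    "urgent verify your account password now",
]
REAL_HAM_TRAIN = [
    "hi kim can we move the review to friday",
    "please find the agenda attached for the meeting",
    "thanks sam the report looks good see you monday",
    "can you send the invoice before the workshop",
]
TEST_TEXTS = [
    "hi alex see you at the meeting on tuesday",
    "the report for the review is attached",
    "click here to claim your free prize",
    "urgent limited offer cheap loans now",
]
TEST_LABELS = ["ham", "ham", "spam", "spam"]

# Same spam data in both runs; only the source of the ham emails differs.
synthetic_ham = generate_ham(len(REAL_HAM_TRAIN))
spam_labels = ["spam"] * len(SPAM_TRAIN)
synthetic_model = train_naive_bayes(
    synthetic_ham + SPAM_TRAIN, ["ham"] * len(synthetic_ham) + spam_labels
)
real_model = train_naive_bayes(
    REAL_HAM_TRAIN + SPAM_TRAIN, ["ham"] * len(REAL_HAM_TRAIN) + spam_labels
)

synthetic_acc = accuracy(synthetic_model, TEST_TEXTS, TEST_LABELS)
real_acc = accuracy(real_model, TEST_TEXTS, TEST_LABELS)
print(f"trained on synthetic ham: accuracy = {synthetic_acc:.2f}")
print(f"trained on real ham:      accuracy = {real_acc:.2f}")
```

The point of the sketch is the evaluation protocol: both classifiers are scored on the same held-out test set, so any gap in accuracy can be attributed to the ham source. In an actual study, the toy lists would be replaced by a proper corpus and a generator (GAN, LLM, or rule-based), and accuracy would likely be complemented by metrics such as precision and recall.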
Prerequisites
Required
- Basic understanding of machine learning and artificial intelligence, e.g., autoencoders or GANs (completed the course Foundations of Artificial Intelligence)
- Familiarity with Natural Language Processing techniques
- Proficiency in at least one programming language (preferably Python)
- Proficiency in using LaTeX
Optional
- You took the following courses:
  - Internettechnologies & Web Engineering
  - Advanced Methods of Machine Learning
  - Security in Communication Networks
- Familiarity with evaluation metrics for AI models
- Basic knowledge of privacy-preserving techniques
- Basic knowledge of principles related to spam and phishing