Thesis

Privacy-Preserving Spam Detection: Exploring Synthetic Email Generation for Model Training

Supervisor

Thesis: Master’s Thesis

Detecting spam and phishing requires large amounts of training data, but real emails often contain sensitive information, which makes their use legally and ethically problematic. This thesis explores how synthetic data generation (e.g., using generative AI or classical techniques) can produce realistic yet anonymized “ham” (legitimate) emails. The goal is to improve spam detection models without compromising real user data.

What Will You Do?

  • Generate synthetic data: Experiment with techniques like GANs, LLMs, or rule-based approaches to create realistic emails.
  • Ensure privacy compliance: Develop or apply methods to guarantee that generated data cannot be traced back to real users (e.g., through differential privacy or anonymization).
  • Train and evaluate models: Compare the performance of spam classifiers trained on synthetic vs. real data.
  • Assess practical applicability: Analyze whether synthetic data can serve as a valid alternative to sensitive training datasets.
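
To make the tasks above concrete, the following is a minimal sketch of the workflow in Python, under strong simplifying assumptions: a rule-based generator fills hypothetical templates to produce synthetic ham, and a small Naive Bayes classifier trained on that synthetic ham is evaluated on a hand-written held-out set. All templates, slot values, example emails, and function names are invented for illustration and are not part of the thesis specification. The sketch offers no formal privacy guarantee such as differential privacy; in the thesis, the generator could be replaced by a GAN or LLM, and the toy test set by real data.

```python
import math
import random
import re
from collections import Counter

# Hypothetical templates and slot values (assumptions for illustration only).
TEMPLATES = [
    "Hi {name}, can we move the {event} to {day}?",
    "Dear {name}, please find the {doc} for the {event} attached.",
    "Hello {name}, thanks for sending the {doc} on {day}.",
]
SLOTS = {
    "name": ["Alex", "Sam", "Kim"],
    "event": ["meeting", "workshop", "review"],
    "day": ["Monday", "Wednesday", "Friday"],
    "doc": ["report", "agenda", "slides"],
}

def generate_ham(n, seed=0):
    """Create n synthetic ham emails by filling templates with random slot values."""
    rng = random.Random(seed)  # seeded for reproducibility
    emails = []
    for _ in range(n):
        template = rng.choice(TEMPLATES)
        fillers = {slot: rng.choice(values) for slot, values in SLOTS.items()}
        emails.append(template.format(**fillers))  # unused slots are ignored
    return emails

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def train_naive_bayes(texts, labels):
    """Collect per-class word counts and document counts."""
    word_counts = {"ham": Counter(), "spam": Counter()}
    doc_counts = Counter(labels)
    for text, label in zip(texts, labels):
        word_counts[label].update(tokenize(text))
    vocab = set(word_counts["ham"]) | set(word_counts["spam"])
    return word_counts, doc_counts, vocab

def predict(model, text):
    """Return the class with the higher log-probability (Laplace smoothing)."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    scores = {}
    for label in ("ham", "spam"):
        total_words = sum(word_counts[label].values())
        score = math.log(doc_counts[label] / total_docs)
        for token in tokenize(text):
            count = word_counts[label][token]
            score += math.log((count + 1) / (total_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

# Toy spam examples and a toy held-out set (hand-written placeholders).
SPAM_TRAIN = [
    "Win a free prize now, click this link",
    "Cheap loans approved instantly, click here",
    "You won the lottery, claim your free money now",
]
TEST_SET = [
    ("Hi Jo, can we move the review to Friday?", "ham"),
    ("Please find the agenda for the meeting attached.", "ham"),
    ("Claim your free prize, click here now", "spam"),
    ("Instantly approved loans, win money", "spam"),
]

synthetic_ham = generate_ham(20, seed=42)
train_texts = synthetic_ham + SPAM_TRAIN
train_labels = ["ham"] * len(synthetic_ham) + ["spam"] * len(SPAM_TRAIN)
model = train_naive_bayes(train_texts, train_labels)

predictions = [predict(model, text) for text, _ in TEST_SET]
correct = sum(pred == label for pred, (_, label) in zip(predictions, TEST_SET))
accuracy = correct / len(TEST_SET)
print(f"Accuracy on the toy held-out set: {accuracy:.2f}")
```

Training the same classifier once on synthetic ham and once on real ham, then scoring both on the same held-out set, would give the kind of synthetic vs. real comparison described in the third task. A meaningful study would also report precision, recall, and F1 rather than accuracy alone, since spam datasets are typically imbalanced.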

Prerequisites

Required

  • Basic understanding of machine learning and artificial intelligence, e.g., autoencoders or GANs (completed the course Foundations of Artificial Intelligence)
  • Familiarity with Natural Language Processing techniques
  • Proficiency in at least one programming language (preferably Python)
  • Proficiency in using LaTeX

Optional

  • You took the following courses:
    • Internettechnologies & Web Engineering
    • Advanced Methods of Machine Learning
    • Security in Communication Networks
  • Familiarity with evaluation metrics for AI models
  • Basic knowledge of privacy-preserving techniques
  • Basic knowledge of principles related to spam and phishing