
Dataset Watermarking ML & SEC 


Suppose Alice has collected a valuable dataset. She licenses it to Bob for specific purposes (e.g., computing various aggregate statistics, or evaluating different data anonymization techniques). Alice does not want Bob to build a machine learning (ML) model using her dataset and monetize it, for example, by making its prediction API accessible via a cloud server to paying customers. Alice wants a mechanism by which, if Bob misbehaves by building such a model and making it available to customers, she can demonstrate to an independent verifier that Bob's model was in fact built using her dataset. This is the problem of dataset watermarking.
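To make the setting concrete, below is a minimal sketch of one simple way a dataset could be watermarked, loosely in the spirit of backdoor-based schemes such as [3]: before licensing the data, Alice stamps a small trigger pattern onto a few samples and relabels them, so that any model trained on the data learns the trigger. All parameters here (the patch trigger, the poison rate, the target label) are illustrative assumptions, not the exact method from the cited work.

    import numpy as np

    def watermark_dataset(images, labels, target_label, rate=0.01, seed=0):
        """Stamp a hypothetical trigger patch onto a small random subset of
        images and relabel them, so a model trained on the released data
        learns to associate the trigger with target_label."""
        rng = np.random.default_rng(seed)
        images, labels = images.copy(), labels.copy()
        idx = rng.choice(len(images), size=int(rate * len(images)), replace=False)
        # Assumed trigger: a white 3x3 patch in the bottom-right corner
        # (images assumed to be an (N, H, W) float array in [0, 1]).
        images[idx, -3:, -3:] = 1.0
        labels[idx] = target_label
        return images, labels, idx  # idx doubles as Alice's secret trigger set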

A related problem is ML model watermarking, where Alice inserts a watermark into the ML models she builds, which she can later use to demonstrate her ownership of a model. Watermarking is intended to deter model theft. In recent years, a number of ML model watermarking techniques have been proposed, against both white-box model theft [1], where the adversary (Bob) steals a copy of the model itself, and black-box model extraction [2], where the adversary builds a surrogate model by querying the prediction API of Alice's model.
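In the black-box setting, verification typically reduces to querying the suspect model's prediction API on a secret trigger set and checking whether it behaves in a way that would be very unlikely by chance. The sketch below illustrates the idea; query_api is a hypothetical stand-in for Bob's API, and the decision rule is a deliberately simplified placeholder for a proper statistical test.

    import numpy as np

    def verify_watermark(query_api, trigger_images, target_label, num_classes=10):
        """Query the suspect API on Alice's trigger set and flag the model if
        its hit rate on target_label is far above the ~1/num_classes chance
        level (a real verifier would use a rigorous hypothesis test)."""
        preds = np.array([query_api(x) for x in trigger_images])
        hit_rate = float(np.mean(preds == target_label))
        return hit_rate, hit_rate > 5.0 / num_classes  # crude, illustrative margin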

While model watermarking has been studied extensively, dataset watermarking remains relatively under-explored. Recently, two simple dataset watermarking methods have been proposed, for image [3] and audio [4] classification datasets.

In this internship project, you will have the opportunity to work with senior researchers to analyse existing techniques for dataset watermarking, attempt to design circumvention methods against them (for example, by statistical analysis of the latent space), and explore whether existing techniques can be improved, or whether it is possible to design entirely new techniques that are effective and robust against such circumvention approaches.
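As one illustration of what a circumvention attempt might look like, the sketch below flags samples whose latent representations lie unusually far from their class centroid, on the hypothesis that trigger-carrying samples are statistical outliers in feature space. The embedding source and the z-score cutoff are assumptions made for illustration, not a method from the cited papers.

    import numpy as np

    def flag_suspect_samples(features, labels, cutoff=3.0):
        """Flag samples whose embedding is unusually far from its class
        centroid; `features` is assumed to come from some feature extractor,
        e.g. the penultimate layer of a pretrained network."""
        suspects = []
        for c in np.unique(labels):
            idx = np.where(labels == c)[0]
            feats = features[idx]
            dists = np.linalg.norm(feats - feats.mean(axis=0), axis=1)
            z = (dists - dists.mean()) / (dists.std() + 1e-8)
            suspects.extend(idx[z > cutoff])  # outliers by z-score on distance
        return np.array(suspects, dtype=int)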

Notes:

1) You will be required to complete a programming pre-assignment as part of the interview process.

2) This topic can be structured as either a short (3-4 month) internship or a thesis topic, depending on the background and skills of the selected student.

If you are a CCIS MSc student (majoring in security, ML, or a related discipline) and have the "required skills" listed below, this internship will suit you. The "nice-to-have" skills listed below are strong pluses.

Required skills:

  • Basic understanding of standard ML techniques as well as deep neural networks (you should have completed at least the Machine Learning: Basic Principles course from the SCI department, or a similar course)

  • Good mathematical and algorithmic skills

  • Strong programming skills in Python/Java/Scala/C/C++/JavaScript (Python preferred, as the de facto language of ML)

  • Strong ability to work and interact in English

Nice to have:

  • Familiarity with statistical signal processing

  • Familiarity with deep learning frameworks (PyTorch, TensorFlow, Keras, etc.)


References: 

[1] Adi, Yossi, et al. 2018. "Turning your weakness into a strength: Watermarking deep neural networks by backdooring." 27th USENIX Security Symposium.

...

Unlike traditional centralized machine learning, federated learning trains models on edge devices (referred to as clients or data owners) and then aggregates the clients' local models into a single global model stored on a server [4]. Watermarking solutions can be integrated into federated learning when the server is the model owner and the clients are data owners [5]. However, unlike post-processing attacks on a finished model, adversaries acting as malicious clients can manipulate the model directly during training to remove the effect of the watermark from the global model. In this work, you will design an adaptive attacker that tries to remove the watermark during the training phase of federated learning, and propose new defense strategies against this type of attacker.
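For intuition, here is a toy sketch of this threat model, with model weights represented as flat NumPy vectors: fed_avg is standard federated averaging on the server, while malicious_update shows one simple removal strategy a colluding client might try, namely fine-tuning the received global model on clean (non-watermarked) data only before returning it. Both functions are illustrative assumptions, not the attack or defense studied in the cited works.

    import numpy as np

    def fed_avg(client_updates, client_sizes):
        """Server-side aggregation: weighted average of client models."""
        w = np.asarray(client_sizes, dtype=float)
        w /= w.sum()
        return sum(wi * ui for wi, ui in zip(w, client_updates))

    def malicious_update(global_model, clean_grad_fn, lr=0.1, steps=10):
        """Adaptive client: fine-tune the received global model on clean
        data only, nudging it away from the watermarked optimum before
        sending the update back to the server."""
        model = global_model.copy()
        for _ in range(steps):
            model -= lr * clean_grad_fn(model)  # gradient on clean data only
        return model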

Notes:

1) You will be required to complete a programming pre-assignment as part of the interview process.

2) This topic can be structured as either a short (3-4 month) internship or a thesis topic, depending on the background and skills of the selected student.

Required skills:

  • MSc students in security, computer science, machine learning, robotics or automated systems

  • Basic understanding of standard ML techniques as well as deep neural networks (you should have completed at least the Machine Learning: Basic Principles course from the SCI department, or a similar course)

  • Good mathematical and algorithmic skills

  • Strong programming skills in Python/Java/Scala/C/C++/JavaScript (Python preferred, as the de facto language of ML)

  • Sufficient skills to work and interact in English

Nice to have:

  • Familiarity with adversarial machine learning, federated learning or statistics

  • Familiarity with deep learning frameworks (PyTorch, TensorFlow, Keras, etc.)

References: 

[1] Adi, Yossi, et al. 2018. "Turning your weakness into a strength: Watermarking deep neural networks by backdooring." 27th USENIX Security Symposium.

...