
Introduction to Privacy and Data Protection

  • Privacy is a sociological concept that emerged as a result of urbanisation. Rural and agricultural societies had only a very limited notion of privacy; only a large mass of people provides relative anonymity, which in turn is an enabler for privacy. In a city, you could go to a bar and drink and you could mostly count on nobody knowing who you were.
  • Information technology introduces concepts such as databases, machine learning and rapid correlation of information, as well as machine-assisted identification. The old, “human”, privacy norms are difficult to enforce when the limits of the human brain are no longer the limiting factor.
  • We are talking about information privacy here. This is about privacy-enabling treatment of information that is collected or processed through information technology. The information in question is information about a person, who is called a data subject.
  • In the European Union, privacy in information technology is mainly referred to as data protection (in Finnish, “tietosuoja”). This has its roots in the misuse of various registries during the Second World War. Data protection and privacy are used mostly interchangeably in the European English-speaking privacy circles, but it is useful to understand their differences. Data protection for a bank does not mean anonymity-style privacy at all, but anonymity-preserving technologies can be used to provide data protection. Often the real meaning is only visible from the context.
  • Communications privacy is also in our scope as far as the communication takes place through information technology. There are also other, non-electronic types of communication that can be private, but they are not covered here.
  • It is an interesting question whether non-natural persons would have anything like privacy - is there, for example, privacy for organisations or sentient neural networks that do not reside in human wetware? For the latter, the concept will have to evolve in the future. For organisational privacy, the general assumption is that it does not exist. Organisations have rights to trade secrets which are specifically regulated. However, human members of organisations have privacy rights as a part of them being humans, and their relationship to an organisation may be a privacy-related fact.
    • Privacy of dead people is particularly interesting as these are not natural persons, but data on their genetic makeup may be personal data of a living person.
  • There are also other privacy aspects such as bodily privacy (right to one’s own body) and territorial privacy (right to be left alone, privacy of home). These sometimes spill over to information privacy. Bodily privacy could be violated by the release of health data, and spam (unsolicited email) or online bullying violate the right to be left alone. For the purposes of security risk analysis of an information system, we are treating these as system-specific requirements and risks, and these should be taken into account through their information privacy angle.
  • Privacy is based on security: There is a saying that you can have security without privacy, but no privacy without security. As an easy example, consider a banking system: it can be very secure even though the bank knows exactly who its customers are, but it could not keep customer data private if it were not secure.
  • Many privacy topics can be expressed in terms of the security properties that are needed to implement them. For example, a “leak of private data” from a system is in essence a confidentiality failure, and in STRIDE modelling, would be discussed under the “I” (Information Disclosure).

Discuss: What makes it then specifically a privacy issue? Why aren’t all confidentiality issues also privacy issues?

The General Data Protection Regulation (GDPR)

  • What counts as a privacy issue is usually defined through legislation. The EU personal data concept covers any data that is linked, or can be linked, to a natural person. In the US, the term is usually Personally Identifiable Information or PII.
  • The US (federal) legislation is “sectoral”, meaning that the US has multiple privacy laws and regulations that apply to specific contexts, such as HIPAA in the health sector. EU privacy legislation is sector-neutral, and all use of personal data falls under the legislation irrespective of sector.
  • Privacy is a fairly large topic and would be a topic of a course of its own. The “OECD Guidelines”, a well-known information privacy guideline from 1980, was updated in 2013. Part 2 of the guidelines lists a number of principles that are usually understood as defining what privacy is. These principles are also enshrined in the EU legislation, so it is useful to understand them. The principles are as follows:
    • Collection limitation principle - Private information cannot be collected without limits, and the person whom the data is about should be aware of the collection and may need to consent to it.
    • Data quality principle - Private information must be accurate and up-to-date.
    • Purpose specification principle - Private information needs to be collected for a specified purpose, and collected information cannot be used for another purpose that was not known at the collection time.
    • Use limitation principle - Private information must not be used or disclosed to others than authorised parties.
    • Security safeguards principle - Private information must be protected by appropriate information security controls.
    • Openness principle - Private information collection must be transparent.
    • Individual participation principle - Individuals whose information has been collected have the right to know this fact, know what data has been collected, and request corrections (or even deletion) of the data.
    • Accountability principle - There needs to be accountability for following these principles, that is, enforcement of some sort.
  • From 25 May 2018, the European data protection legislation is based on the GDPR (General Data Protection Regulation). As a regulation, it is directly applicable in all member countries.
    • We will look at the GDPR from a software engineering and security angle.
  • Here is a simple look at the GDPR.
    • GDPR applies to most personal data use. Only a handful of areas are not in scope, namely
      • EU common foreign policy
      • Natural persons conducting household activities
      • Authorities’ activities in criminal investigations, penalties and public security
      • Manual processing of personal data when the data is not in a “filing system” (for example, a pile of papers that are not structured according to some criteria).
    • GDPR regulates personal data processing. The term “processing” covers practically anything that one can do to personal data - including storage.
    • Someone is responsible for the data processing. The main actors are:
      • Controllers, who are the main responsible parties. Controllers are defined through their ability to define how data is processed and what it is used for.
      • Processors, who are parties that process personal data under the guidance of a Controller.
      • There is also a third concept of a Joint Controller which is essentially two Controllers mutually liable for processing (but it is definitely less common).
    • From a software engineering perspective, the product management should be very clear on whether you are a Controller (usually when you have customer relationships or use personal data for your own business purposes), or a Processor (usually when you provide a service to someone and get their personal data while doing that).
    • Data may only be processed if a specific legal basis is fulfilled. The legal bases are:
      • Consent (i.e., permission). This is fairly tricky, because consent must be freely given - there are many ways in which a user experience may make consent non-freely given.
      • Performance of a contract. Probably one of the most straightforward, if the personal data really is needed for, e.g., providing a service that you do under a contract.
      • Legal obligations. A common example is invoicing data that is needed for taxation purposes.
      • Vital interests. “Vital” here means life or death; some healthcare applications may well qualify.
      • Exercise of official authority or public interest. This happens when an authority (legally tasked with something) needs to process personal data in order to do whatever they’ve been tasked with.
      • Legitimate interests. This is probably the “freest” way of using data, covering marketing and other various business needs. The catch here is that legitimate interest use must go through a balancing test where the interests are balanced against the ill effects on privacy. A legitimate interest fails if it causes privacy issues.
    • Some types of data are classified as “special categories” of data - also often referred to as “sensitive” data.
      • This covers, for example, health data including genetic data, racial or ethnic data, political, religious and philosophical beliefs, and information on sex life.
      • Processing of these special categories is subject to even stricter rules. Notably, the legitimate interest basis cannot be used.
    • Once you have defined your role (Controller or Processor), and have a legal basis for the data processing, then you need to fulfill processing requirements:
      • Compliance with the security requirements.
      • Compliance with the data subject rights. These include:
        • If you use consent, a consent revocation UX in which withdrawing consent is as easy as giving it;
        • You may need to give a copy of the personal data you hold to the data subject;
        • This may have to be done in a machine-readable form;
        • You may have to rectify or erase data;
        • You may have to restrict personal data use temporarily.
      • These requirements result in both functional and non-functional requirements.
        • Software security in general; especially access control and confidentiality protection such as cryptography
        • Auditability and breach management preparations. This often ties into the general IT security of the operational environment, but in DevOps teams, it is very much the development teams’ area.
        • Data retention periods and the actual technical implementation of removal of data after it has expired.
        • Suitable data storage support - and in some cases, a suitable UI for administrator users, for rectification and removal.
        • UX for informing the user about the use of personal data, and for giving and withdrawing consent.
        • Mechanics for restricting data use temporarily.
        • Tracking recipients of data (when transferring data from one Controller to another).
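As a rough illustration of how the data subject rights listed above turn into functional requirements, here is a minimal Python sketch. The `PersonalDataStore` and `SubjectRecord` names, the in-memory dictionary, and the choice of JSON as the machine-readable format are all assumptions made for this example; a real system would add authentication of the requester, audit logging, and propagation of erasure to backups and data recipients.

```python
import json
from dataclasses import dataclass, field


@dataclass
class SubjectRecord:
    """Personal data held about one data subject (hypothetical schema)."""
    data: dict
    restricted: bool = False  # True while processing is temporarily restricted


@dataclass
class PersonalDataStore:
    """In-memory stand-in for the system's personal data storage."""
    records: dict = field(default_factory=dict)

    def export(self, subject_id: str) -> str:
        """Copy of the personal data in a machine-readable form (JSON)."""
        return json.dumps(self.records[subject_id].data, sort_keys=True)

    def rectify(self, subject_id: str, key: str, value) -> None:
        """Correct a single field on request."""
        self.records[subject_id].data[key] = value

    def erase(self, subject_id: str) -> None:
        """Remove everything held about the data subject."""
        self.records.pop(subject_id, None)

    def restrict(self, subject_id: str, restricted: bool = True) -> None:
        """Switch temporary restriction of processing on or off."""
        self.records[subject_id].restricted = restricted

    def read_for_processing(self, subject_id: str) -> dict:
        """Normal business use of the data; refused while restricted."""
        record = self.records[subject_id]
        if record.restricted:
            raise PermissionError("processing is restricted for this data subject")
        return record.data
```

The point of the sketch is that each right maps to an operation that has to exist somewhere in the system, whether as an API, an administrator UI, or a documented manual procedure.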

Privacy Impact Assessment

  • Privacy Impact Assessment (PIA) is an activity where the potential privacy impacts are determined.
  • One could roughly define it as “threat modelling for privacy” and not be at all far off.
  • The GDPR specifies a special type of PIA called Data Protection Impact Assessment, or DPIA. In real life, there are non-DPIA PIAs and DPIAs.
    • A DPIA is likely to be done on the level of service design and business case determination.
    • “Plain” PIAs can be done anywhere.
  • There is a recent ISO standard for conducting PIAs: ISO/IEC 29134.
    • The vast majority of PIAs in the industry today are done through various home-cooked templates.
  • The main aspects of a typical PIA are:
    • Determine what personal data is being processed
    • Whether the lawful basis exists and is understood
    • Whether the data subject is being sufficiently informed
    • Where it is stored, and where it is sent
      • Unlike in security, where the responsibility for data is usually with the receiver of data, in privacy, the responsibility is usually with the sender
    • The lifecycle of the data
      • How long is it retained
      • How is it deleted after it expires
      • While it is retained, how the data subject rights can be executed (right to obtain data, rectify data, cease processing, erase data)
    • Whether the security and auditability for the identified processing is appropriate
  • In a typical PIA, this results in a Data Flow Diagram not unlike the one that is being used for threat modelling
    • It usually makes a lot of sense to bundle a PIA into the threat modelling activity, because the PIA has to answer security-relevant questions as well

Extending data flow analysis beyond security: Data mapping for privacy

  • Data flow analysis that was used for threat modelling can be extended to provide the technical part of a PIA.
  • One can extend a data flow diagram by superimposing a set of privacy domains in addition to security domains, and by making sure that all the personal data flows and stores are visible.
    • Often the security and privacy domains are the same. For example, in a simple web application, the server can be its own security domain and it could also be the privacy domain (if the server is controlled by a single entity).
  • However, sometimes the security and privacy domains can differ. A particular example could be when the server is a cloud service. The cloud server is in the possession of a third party (not the application vendor), so even though the data is security-wise in the application, privacy-wise it is in the cloud provider’s domain.
  • It is very useful to draw the privacy domains on the data flow diagram, annotated with the regulatory domain (e.g., European Union, United States, or something else).

    Privacy domains drawn on a DFD
  • Once this has been done, you can follow the TRIM method. It is a kind of extension to STRIDE.
    • Note that I have changed the definition of TRIM in 2018, so older material has a different definition of I.
    • Transferring data across borders - When a data flow extends from a privacy domain to another, does either the source or destination privacy domain consider this data to be personal data according to their legislation or contractual needs? If so, have you fulfilled the requirements for such transfer?
      • For example, the European Union by default forbids the transfer of personal data out of the EU/EEA unless you have specific permission for the transfer.
      • Also, in order to transfer data from a Controller to a Processor, a certain set of contractual clauses must be used. This is confusingly often called a Data Protection Agreement (or Addendum), DPA. (A DPA is also the Data Protection Authority.)
    • Retention periods - This is applicable to any data storage component. The question is whether you have actually specified a maximum retention period for the data that is being stored, and whether you have the facility to track the use of data for ‘right to erasure’ requests. In most cases, you will have to define a maximum storage period and enforce purging of the data after the period has expired, as well as keep tabs on any data extracts. If you have not defined a retention period, you may be in violation of the purpose specification principle - you might not have a valid cause to store data indefinitely if its purpose is limited in time.
    • Interconnections and inference - Whether the personal data can be aggregated or combined to infer more personal data, or decrease the anonymity or pseudonymity. This is an important consideration, and it will be discussed in detail in the next section.
    • Minimisation - Again, when data is being transferred out from a privacy domain, you should discuss whether the data that is being transferred is the minimum set that is technically required to fulfill the usage scenario. This is related to the collection limitation and purpose specification principles. If the data that is being transferred contains more data than what is technically necessary, it should not be transferred.
  • Another option to use in conjunction with STRIDE is LINDDUN (Deng; see the Threat modeling book, chapter 6). Like TRIM, it has multiple considerations that can be applied to data flows and stores. We’re not going to go into that in detail here.
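The mechanical parts of a TRIM walk-through can be sketched as a check over a list of data flows and data stores taken from the data flow diagram. The following Python sketch is only an illustration: the `DataFlow` and `DataStore` types, the field names, and the wording of the findings are assumptions made for this example, and the "I" consideration (interconnections and inference) is left out because it needs human judgement rather than a simple rule.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass(frozen=True)
class DataFlow:
    source_domain: str        # privacy domain the data leaves
    destination_domain: str   # privacy domain the data enters
    fields_sent: frozenset    # personal data fields carried by the flow
    fields_needed: frozenset  # fields the usage scenario technically requires


@dataclass(frozen=True)
class DataStore:
    name: str
    retention: Optional[timedelta]  # None means no retention period is defined


def trim_findings(flows, stores):
    """Return findings for the T, R and M considerations as text lines."""
    findings = []
    for flow in flows:
        if flow.source_domain == flow.destination_domain:
            continue  # data stays inside one privacy domain
        route = f"{flow.source_domain} -> {flow.destination_domain}"
        findings.append(f"T: {route}: check transfer requirements")
        surplus = flow.fields_sent - flow.fields_needed
        if surplus:
            findings.append(f"M: {route}: surplus fields {sorted(surplus)}")
    for store in stores:
        if store.retention is None:
            findings.append(f"R: {store.name}: no retention period defined")
    return findings


def is_expired(stored_at: datetime, retention: timedelta, now: datetime) -> bool:
    """True when a stored item has outlived its retention period."""
    return now - stored_at > retention
```

A purge job could call `is_expired` for each stored item and delete the ones for which it returns true; the findings list is meant as input for discussion in the PIA, not as a compliance verdict.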

Privacy engineering concepts and patterns


  • Two of the key concepts of privacy theory are anonymity and pseudonymity.
  • Anonymity, strictly speaking, means that a personal data point cannot be correlated with any other point. Even with a large number of data points, there would be no patterns emerging.
  • In real life, almost nothing is anonymous. In most cases, we have pseudonymity, which means that there is some correlation with some data points.
    • An identifier that is not enough to identify a person, but can be correlated with other data and the total grouping may be used as an identifier, is known as a quasi-identifier.
  • At some point, a pseudonym becomes so strong that it can be readily correlated with an actual person. The term verinym has been used for this, although it is rare.
    • Theoretically, a government-issued name is also a pseudonym, although a fairly strong one. Whether there is a real “true name” would probably delve into a discussion about identity in the psychological sense.
  • In the context of GDPR interpretation, anonymous data is not personal data; pseudonymous data is. It is likely that there are practical levels of pseudonymity that will be seen as “anonymous enough”, although the current opinions and guidelines do not leave room for this.
  • One key goal of privacy engineering is managing the level of pseudonymity.
    • Data collection, storage, and analysis have a tendency to move the level of pseudonymity away from anonymity and closer to a “real identity”.
    • Keeping the data closer to anonymity usually requires active work.

Differential privacy and privacy-preserving statistics

  • Personal data can be made less personal (closer to anonymity) by:
    • Processing individual quasi-identifiers separately. For example, if we have age and location, and need statistics of each separately but not together, we can separate age and location technically.
    • Grouping data into larger groups. The earlier this happens in the collection phase, the less identifying the data is.
    • Reducing the specificity or resolution of the data. As an example, reduce the number of decimals in a geolocation or just use the name of a locality.
    • Applying a random permutation.
    • Adding noise.
  • Robustness of pseudonymity needs to pass three tests:
    • Whether a pseudonym can be mapped to a specific identified person; and
    • Whether two pseudonyms can be mapped to the same, but arbitrary, person; and
    • Whether, given a quasi-identifier of a pseudonym, a value of another quasi-identifier (that is not in the pseudonym) can be inferred.
  • Anonymity metrics:
    • k-anonymity: There is a block of k persons, within which the pseudonyms are not separable. (All the k persons have the same quasi-identifiers.)
    • l-diversity: In real-world situations, if there is not much diversity in a block of k persons, the pseudonymity can be broken by knowing the block of k persons that the target person belongs to. If that block now has an attribute that we are interested in, we now know the target person’s attribute. Here, l-diversity guarantees that in the block of k persons, the “interesting” or “sensitive” attributes have at least l different values, so knowing the k-anonymous group still does not leak the attribute of interest to the attacker.
    • t-closeness: The distribution of the interesting attributes in an l-diverse k-anonymous group is similar to the distribution of those attributes in the complete data set (with a threshold, t).
  • Differential privacy is a concept whose goal is to ensure that privacy risks to a data subject are not heightened by the inclusion of that data in a data collection.
    • A practical example could be a survey that asks whether a person has a diagnosed sexually transmitted disease. It would be useful if people participated and answered truthfully. Without privacy guarantees, though, people would either lie or not participate.
    • Differential privacy works either in database queries (as in the original paper) or the data collection phase, by adding noise to the data. Continuing the above example, if noise were added to the disease status, an individual status would not be leaked as it could have been “flipped” by noise.
    • Still, because the shape of the noise is known, it can be eliminated in statistical analysis.
    • The amount of noise must be large enough to be able to mask a single person’s attribute, or if added at the database query stage, large enough to mask the change to the query output that one person could cause.
  • Adding differential privacy to database queries may not be straightforward at all, and essentially requires a very specific control on the queries.
    • Applying the practice on the inputs, on the other hand, is less risky. Google RAPPOR is one such example that works well in telemetry collection (see the list of additional material). A measurement is taken client-side, noise is added, and the measurement with noise is sent to the server and the database.
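The anonymity metrics and the noise-at-collection idea above can be made concrete with a small Python sketch. It is an illustration under stated assumptions, not a reference implementation: rows are assumed to be plain dictionaries, the function names are made up for this example, and the noise mechanism is classic coin-flip randomized response (a much simpler relative of what RAPPOR does), chosen because its noise shape is easy to invert.

```python
import random
from collections import defaultdict


def blocks_by_quasi_identifiers(rows, quasi_ids):
    """Group rows into blocks that share the same quasi-identifier values."""
    blocks = defaultdict(list)
    for row in rows:
        blocks[tuple(row[q] for q in quasi_ids)].append(row)
    return blocks


def k_anonymity(rows, quasi_ids):
    """Size of the smallest block; the data set is k-anonymous for this k."""
    blocks = blocks_by_quasi_identifiers(rows, quasi_ids)
    return min(len(block) for block in blocks.values())


def l_diversity(rows, quasi_ids, sensitive):
    """Smallest number of distinct sensitive values found in any block."""
    blocks = blocks_by_quasi_identifiers(rows, quasi_ids)
    return min(len({row[sensitive] for row in block}) for block in blocks.values())


def randomized_response(truth: bool, rng: random.Random) -> bool:
    """First coin: heads means answer truthfully; tails means a second coin decides."""
    if rng.random() < 0.5:
        return truth
    return rng.random() < 0.5


def estimate_true_rate(answers):
    """Remove the known noise: P(yes) = 0.5 * p + 0.25, so p = 2 * P(yes) - 0.5."""
    observed = sum(answers) / len(answers)
    return 2 * observed - 0.5
```

With this mechanism, any single "yes" could have come from the coin rather than from the respondent, yet the population-level rate can still be estimated because the shape of the noise is known. The estimate only becomes useful with a large number of answers.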

Breaking identifier correlation

  • In practice, systems use identifiers a lot. Often, there is a single key used for database lookups, such as a customer number.
  • If the system gives access to data to multiple parties, one way to reduce personal data exposure is to change the identifiers. For example, if a person is referred to using identifier A by one party, and B by the other party, these parties may not be able to deduce that the person is the same one (barring access to additional quasi-identifiers).
  • This method is used on both Android and iOS as the Advertising Identifier. This is a random identifier that can be reset at any time by the user, hopefully then breaking the aggregation of data points into the current identifier.
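One way to give each party a different identifier for the same person is to derive the identifier with a keyed hash. The sketch below is an assumption about how this could be done, not a description of how Android or iOS implement their advertising identifiers: it uses HMAC-SHA256 from the Python standard library with a separate secret key per receiving party, and the function and variable names are hypothetical.

```python
import hashlib
import hmac
import secrets


def party_identifier(internal_id: str, party_key: bytes) -> str:
    """Derive the identifier shown to one party from the internal identifier."""
    digest = hmac.new(party_key, internal_id.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()


# One secret key per receiving party, kept by the system and never shared.
key_for_party_a = secrets.token_bytes(32)
key_for_party_b = secrets.token_bytes(32)

customer = "customer-12345"  # internal lookup key, never given out as such
id_seen_by_a = party_identifier(customer, key_for_party_a)
id_seen_by_b = party_identifier(customer, key_for_party_b)
```

Each party sees a stable identifier of its own, but without the keys the two identifiers cannot be linked to each other or to the internal customer number. Replacing a party's key changes all identifiers that party sees from then on, which is similar in effect to resetting an identifier. As noted above, other quasi-identifiers in the data may still allow correlation.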

Mixing and anonymity networks

  • Mix networks are a way to provide privacy both in telecommunications and for specific data items.
  • A mix network consists of proxy servers (mixes) that receive and send traffic forward. Each mix will randomly permute the traffic.
    • In some cases, the mixes may also perform traffic shaping, by generating decoy messages to ensure that the traffic volumes cannot be used to trace the messages.
  • Mixes can be used for connection-oriented communications (like in Tor), or for object passing (as in the Mixmaster remailers and electronic voting systems, where votes may be shuffled using a mix network).
  • Onion routing is a type of mix network where the mixes are selected by the sender, close to the idea of source routing.
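The permutation step of a mix can be sketched in a few lines of Python. This is a toy model with made-up names (`BatchMix`, `send_through`): it only shows batching and shuffling, and it deliberately leaves out the layered encryption that a real mix needs so that messages cannot be matched by their content on the way in and out.

```python
import random


class BatchMix:
    """Toy mix: collects messages and releases them in a random order."""

    def __init__(self, batch_size: int, rng=None):
        self.batch_size = batch_size
        # SystemRandom draws from the operating system's randomness source.
        self.rng = rng or random.SystemRandom()
        self.pool = []

    def receive(self, message: bytes) -> list:
        """Accept a message; return a shuffled batch once the pool is full."""
        self.pool.append(message)
        if len(self.pool) < self.batch_size:
            return []  # keep waiting, nothing is forwarded yet
        batch, self.pool = self.pool, []
        self.rng.shuffle(batch)
        return batch


def send_through(mixes, messages):
    """Pass messages through a cascade of mixes, one mix after another."""
    current = list(messages)
    for mix in mixes:
        forwarded = []
        for message in current:
            forwarded.extend(mix.receive(message))
        current = forwarded
    return current
```

Because a mix holds messages until its batch is full, an observer cannot use arrival order to link an outgoing message to an incoming one. The price is latency, which is one reason why connection-oriented systems make different trade-offs than remailers or voting systems.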

Additional material

  • The GDPR text itself is obviously a useful source. It is a bit difficult to digest at one sitting.
  • There are few good textbooks at the moment on the GDPR. The International Association of Privacy Professionals (IAPP) have published a couple of them, and for someone requiring GDPR understanding, I think these are close to the best bet at the moment:
    • Eduardo Ustaran (ed.): European Data Protection. IAPP 2017.
    • Hands-On Guide to GDPR Compliance. IAPP 2018.
  • On differential privacy ideas, Google RAPPOR is a clear and practical one to start with: Erlingsson, Pihur, Korolova: RAPPOR: Randomized Aggregatable Privacy-Preserving Ordinal Response. The term entered public discourse in Dwork: Differential Privacy.
  • The papers, respectively, on k-anonymity, l-diversity, and t-closeness.
  • For a discussion on data flows in cloud services and their legal basis, see Christopher Millard (ed.): Cloud Computing Law, Chapter 10, How Do Restrictions on International Data Transfer Work in Clouds?


This is lecture support material for the course on Software Security held at Aalto University, Spring 2018. It is not intended as standalone study material.

Created by Antti Vähä-Sipilä, @anttivs. Big thanks to Sini Ruohomaa and Prof. N. Asokan.
