Approach to Privacy

Replica’s approach to delivering insight to our customers is rooted in using the composite of data sources described above to do advanced modeling and simulation of the built environment. But we also know that this type of approach can raise its own challenges, particularly as it relates to personal privacy and the ethical use of the insights generated by our platform.

That’s why in all of our work, we are uncompromising in our belief that better insight should never come at the expense of personal privacy. At Replica, all of our work adheres to the following privacy-protecting principles:

  • Only procure de-identified data from our source vendors. We never receive, use, or output personally identifiable information.
  • Never share raw locational data with our customers — or any other third parties.
  • Build models from different data sources independently so that we abstract out potentially identifying details of any individual before combining these models into our aggregate outputs.
  • Never join data sources on keys containing sensitive data.
  • Incorporate proven techniques, like statistical noise injection, into our algorithms to ensure that (1) it is impossible to ascertain whether an individual’s information is part of our source data by inspecting our modeled outputs; and (2) it is impossible to learn which specific locations were visited by an individual whose information was part of our source data by inspecting our modeled outputs. (A sketch of the noise-injection technique follows this list.)
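
To illustrate the noise-injection technique named above, here is a minimal sketch of the standard Laplace mechanism from differential privacy (the function name and epsilon value are ours for illustration, not Replica’s production code):

    import numpy as np

    def laplace_noised_count(true_count: float, epsilon: float = 0.5) -> float:
        """Return an epsilon-differentially-private version of a count.

        Adding or removing one person's record changes a count by at most
        1 (sensitivity = 1), so Laplace noise with scale 1/epsilon gives
        epsilon-differential privacy for this query.
        """
        sensitivity = 1.0
        return true_count + np.random.laplace(loc=0.0, scale=sensitivity / epsilon)

    # A raw count of 42 visits might be published as 44.7 or 39.1; the
    # published value never reveals whether any one visit is included.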

We utilize the Data Protection Addendum linked in our customer agreements to codify these principles in our business contracts.

The implementation of these principles throughout our technical methodology is described in the following sections. In this section we call out three specific protective measures.

Data Management Principles

Replica never provides third parties with access to raw locational data. The movement databases published via Replica’s platform are populated with synthetic data for an averaged “typical” weekday and weekend, where each day is a modeled representation of activities and travel on a “modeled-typical” day rather than on any actual day within the modeled season.

These databases are produced by privacy-preserving travel synthesis algorithms, further separating sensitive data from the end users. Internally, location data processing is entirely automated. Only a restricted number of employees have access rights to develop and modify the algorithms underlying this automated process, under strict guidelines to avoid internal misuse of sensitive information.
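
As a highly simplified sketch of this separation, one can imagine fitting a trip model to counts that were noised upstream and then sampling synthetic trips from it (the data, function names, and simple categorical model below are illustrative assumptions, not Replica’s actual synthesis algorithms):

    import numpy as np

    def fit_od_model(noised_od_counts):
        """Fit a simple categorical trip distribution over origin-destination
        pairs, using counts that were already noised upstream."""
        pairs = list(noised_od_counts)
        weights = np.clip([noised_od_counts[p] for p in pairs], 0.0, None)
        return pairs, weights / weights.sum()

    def synthesize_typical_day(pairs, probs, n_trips):
        """Sample synthetic trips from the fitted model: every published
        trip is a draw from the model, never any actual person's record."""
        idx = np.random.choice(len(pairs), size=n_trips, p=probs)
        return [pairs[i] for i in idx]

    pairs, probs = fit_od_model({("A", "B"): 118.6, ("B", "C"): 64.2, ("A", "C"): 12.9})
    trips = synthesize_typical_day(pairs, probs, n_trips=1000)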

Physical Separation of Data Storage and Restriction of Access

System components that have access to sensitive data are logically separated in software and deployed as secure containers on isolated hardware systems. The only interface to sensitive data is established via automated processes that, on request, return de-identified model parameters and predictions with privacy-protecting measures already applied upstream. No data leaves the secure perimeter, nor can the data be queried or inspected other than through a set of predefined and monitored requests.
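
A minimal sketch of the predefined-and-monitored request pattern described above (the request names and store methods are hypothetical; this illustrates the pattern, not Replica’s actual interface):

    import logging

    # The only request types allowed to cross the secure perimeter; every
    # request is logged, and anything not on the list is rejected.
    ALLOWED_REQUESTS = {
        "model_parameters": lambda store: store.deidentified_parameters(),
        "noised_predictions": lambda store: store.noised_predictions(),
    }

    def handle_request(request_type, store):
        logging.info("perimeter request: %s", request_type)  # monitored audit trail
        if request_type not in ALLOWED_REQUESTS:
            raise PermissionError(f"request type not allowed: {request_type}")
        return ALLOWED_REQUESTS[request_type](store)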

Data services that provide auxiliary information (demographic information, land use, accessibility) for training travel behavior models are separated and deployed in isolated projects and are initialized within separate databases, isolating potentially sensitive query history logs.

Mathematical Methods with Strong Privacy Guarantees

Privacy and de-anonymization risks related to locational data are well studied and understood. Techniques exist to synthesize and publish locational data with strong privacy guarantees, in the form of both origin-destination pairs and complete trajectories. Replica algorithms include a set of measures that protect against (1) membership inclusion attacks (to learn if a particular individual with certain distinctive habits has been in the dataset) and (2) location inference attacks (to learn which locations individual data contributors have visited).

The former guarantees that, even if an individual’s information is in the original mobile data sources, it is impossible for anyone to establish this with any certainty by inspecting the synthetic outputs. To achieve this, we use mathematical noise-injection techniques, which ensure that locational information cannot be obtained through any combination of queries designed by a malicious party to extract the travel itineraries and locations of individual people.
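
One standard way to bound what any combination of queries can reveal is a privacy budget: under differential privacy’s sequential composition property, the privacy losses (epsilons) of successive queries add up, so a system can refuse further queries once a preset budget is exhausted. A minimal sketch, under that assumption:

    class PrivacyAccountant:
        """Track cumulative privacy loss across queries. Once the total
        budget is spent, no further queries are answered, so no sequence
        of queries, however cleverly designed, can isolate an individual."""

        def __init__(self, total_budget: float):
            self.total_budget = total_budget
            self.spent = 0.0

        def charge(self, epsilon: float) -> None:
            if self.spent + epsilon > self.total_budget:
                raise PermissionError("privacy budget exhausted")
            self.spent += epsilon  # basic sequential composition

    accountant = PrivacyAccountant(total_budget=1.0)
    accountant.charge(0.5)   # first noised query
    accountant.charge(0.5)   # second noised query
    # a further accountant.charge(0.1) would now raise PermissionError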

Aggregate ground-truth observations of resident activity are “noised” with sufficient magnitude to guarantee protection against singling out any individual through a range of targeted queries. For example, all of the observed visit counts used to estimate the attractiveness of particular venues and points of interest at different times of day and days of the week are noised before they are used in modeling. Following differential privacy principles, the amount of added noise guarantees that if any one record were removed from the original mobile data source, the resulting Replica model outputs would be statistically indistinguishable from those produced with the record present.
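
This guarantee can be made concrete with a toy demonstration (all values below are invented): for two “neighboring” datasets that differ by exactly one person’s record, the distributions of the noised counts are nearly identical.

    import numpy as np

    rng = np.random.default_rng(0)
    epsilon = 0.5

    visits_with_record = 120     # venue visits including one person's record
    visits_without_record = 119  # the same count with that record removed

    noised_with = visits_with_record + rng.laplace(0.0, 1 / epsilon, 10_000)
    noised_without = visits_without_record + rng.laplace(0.0, 1 / epsilon, 10_000)

    # At any output value the two noise densities differ by at most a factor
    # of e**epsilon (about 1.65 here), so published counts cannot reveal
    # whether the record was present.
    print(round(noised_with.mean(), 1), round(noised_without.mean(), 1))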