About The Project

A massive digitalization of personal information is currently underway. Individuals receive an ever-increasing number of important documents in digital form (financial, professional, medical, insurance-related, administrative, linked to daily consumption, etc.), issued by their employers, banks, insurance companies, civil authorities, hospitals, schools, internet providers, telcos, etc. Legal obligations require such official documents to be kept (e.g., one year for bank statements), and these documents serve as evidence when performing subsequent administrative tasks (e.g., paying taxes) or applying for services (e.g., a loan).

Indeed, many services require those documents to tailor the service agreement to the particular situation of each applicant. For example, the characteristics of a personal loan (rate, duration, insurance fee…) are defined according to proofs of income, employment, title deeds, personal references, forms of collateral, medical records, past lines of credit, etc. Likewise, taking out insurance (health, car, job protection, etc.) or applying for social assistance or a tax refund also requires providing evidence of one’s own specific situation.

Although privacy-intrusive, evaluating the particular situation of an applicant is unquestionably necessary and is in the interest of both the service provider and the customer. However, the requested set of documents must be restricted to the minimum needed to make the decision. The first reason is to protect the privacy of the applicant. Privacy legislation and directives worldwide have enacted the Limited Data Collection principle to this end: organizations should only collect the personal data strictly required to achieve a goal the user consents to [1], [2]. The second reason is to limit the cost of information leakage. Indeed, all too often personal data ends up being disclosed through negligence or hacking. For the year 2011, the Open Security Foundation reported 369 data loss incidents affecting 126 million records. The Privacy Rights Clearinghouse tracked 557 data breaches with a total of more than 30 million records involved. Such a leak is not only a serious privacy incident, but also a potential financial disaster for the companies in charge of the data. A recent study [3] estimates the cost of data breaches for US companies at an average of $194 per compromised record, a value that has kept increasing since 2006. In addition, The New York Times reports that 90% of companies experienced at least one data breach last year. Security companies have made breach cost calculators available online to draw attention to this phenomenon: their conclusion is that the more data exposed, the greater the cost in the event of a data breach.

The goal of this study is to restrict the set of documents exposed to third parties to a minimum subset, in accordance with the Limited Data Collection principle, for specific use cases (e-administration, banking, insurance, etc.).

The Minimum Exposure project faces challenging problems at the intersection of databases, data mining, secure data computation, and operational research.

  1. The collection rules attached to application forms must cover any decision-making system, ranging from simple disjunctions of conjunctions of predicates enacted by law (e.g., e-administration scenario) to highly complex systems based on data mining techniques (e.g., neural networks used in credit scoring for bank loan applications).
  2. The decision-making process is often private to the service provider, and the related collection rules must not be revealed, which leads to the adoption of secure tokens (like smart cards) in the architecture.
  3. Identifying the minimum set of information to be sent while still preserving the final decision is an NP-hard problem. Application forms can be very large in practice (e.g., hundreds of fields in loan application forms or tax declarations), which calls for approximation algorithms based on heuristics adapted to the topology of these rules.
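
As an illustration of the first and third points, the sketch below encodes collection rules as disjunctions of conjunctions of predicates, then compares an exhaustive search with a simple greedy heuristic. It is a minimal sketch under stated assumptions, not the project's algorithm: the rule names, field names, thresholds, and the functions `satisfied_conjunctions`, `exact_minimum_exposure`, and `greedy_exposure` are all hypothetical, and the greedy strategy is only one possible heuristic.

```python
from itertools import product

# Hypothetical rules. A predicate is (field, test); a conjunction is a list of
# predicates; a rule (one per advantage) is a list of conjunctions, i.e. a
# disjunction of conjunctions.
RULES = {
    "low_rate": [
        [("income", lambda v: v >= 30000), ("employed", lambda v: v is True)],
        [("collateral", lambda v: v >= 50000)],
    ],
    "long_duration": [
        [("age", lambda v: v < 40), ("income", lambda v: v >= 30000)],
        [("collateral", lambda v: v >= 50000), ("age", lambda v: v < 60)],
    ],
}


def satisfied_conjunctions(rule, data):
    """Return the field sets of the conjunctions of `rule` that hold on `data`."""
    return [
        frozenset(field for field, _ in conj)
        for conj in rule
        if all(field in data and test(data[field]) for field, test in conj)
    ]


def exact_minimum_exposure(rules, data):
    """Exhaustive search: one satisfied conjunction per rule, smallest union.

    The number of combinations grows exponentially with the number of rules,
    so this is only usable on small forms.
    """
    options = [satisfied_conjunctions(rule, data) for rule in rules.values()]
    if any(not opts for opts in options):
        return None  # at least one advantage cannot be triggered
    return min((frozenset().union(*combo) for combo in product(*options)), key=len)


def greedy_exposure(rules, data):
    """Heuristic: for each rule, pick the conjunction adding the fewest new fields.

    Runs in time linear in the total size of the rules, but the result is not
    guaranteed to be minimal.
    """
    exposed = set()
    for rule in rules.values():
        options = satisfied_conjunctions(rule, data)
        if not options:
            return None
        exposed |= min(options, key=lambda fields: len(fields - exposed))
    return exposed


# Hypothetical applicant data.
data = {"income": 45000, "employed": True, "collateral": 80000, "age": 35}
print(sorted(exact_minimum_exposure(RULES, data)))
print(sorted(greedy_exposure(RULES, data)))
```

On this sample data both functions expose only the `age` and `collateral` fields, although four fields would be sent by a form that collects everything. The two results need not coincide in general, since the greedy choice made for one rule can force extra fields for a later one.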

Basic Minimum Exposure Framework

We consider the general scenario depicted in the figure above, which involves three main parties: Data Producers, Users, and Service Providers. Data Producers act as data sources. They include, for example, banks, employers, hospitals, or administrations. The information they deliver to users has an official value and is signed to prove integrity and origin (e.g., salary forms, bank records history, tax receipts, etc.). Users store the documents they receive in their personal digital spaces (their own PC, cloud storage, secure personal devices, etc.). Service Providers may include banks, insurance companies, public welfare agencies, administrations, etc. They offer customized services to their clients, such as bank loans, health insurance, social benefits, etc.

We call Minimum Exposure (ME) the process that identifies the minimum subset of the documents produced by Data Producers that a User must expose to a Service Provider to trigger the desired service with the advantage(s) she can, and wants to, obtain. ME requires matching the set of documents owned by the user against the advantages and their associated collection rules, which describe the information requested by the service provider. ME must be executed on the user’s side to escape the limited data collection paradox: any other system in charge of running ME, including the service provider itself, would need to collect more documents than the minimum subset computed by ME.

The general scenario is thus as follows. When a user wants to obtain a service, she

  1. downloads collection rules published by the service provider
  2. computes locally the advantages she can obtain based on the documents she owns
  3. selects among the advantages the ones she desires to obtain
  4. uses ME to compute the minimum set of documents to expose to obtain the service with those advantages
  5. exposes these documents to the service provider, where their integrity and origin are checked.

Note that, without loss of generality, we assume in what follows that a user triggers all advantages and wants to obtain them all.
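
The five steps above can be sketched end to end as follows. This is a simplified illustration under several assumptions: rules refer to whole document types rather than predicates on their content, the minimum subset is found by brute force, and an HMAC with a shared demo key stands in for the Data Producer's signature (a real deployment would presumably use asymmetric signatures, so that the service provider cannot forge documents). All names, document types, and rules are hypothetical.

```python
import hashlib
import hmac
import json
from itertools import combinations

PRODUCER_KEY = b"demo-key"  # hypothetical stand-in for the producer's signing key


def sign(doc):
    """Data Producer side: attach a tag proving integrity and origin."""
    payload = json.dumps(doc, sort_keys=True).encode()
    tag = hmac.new(PRODUCER_KEY, payload, hashlib.sha256).hexdigest()
    return {"doc": doc, "tag": tag}


def verify(signed):
    """Service Provider side: recompute the tag before trusting the document."""
    payload = json.dumps(signed["doc"], sort_keys=True).encode()
    expected = hmac.new(PRODUCER_KEY, payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signed["tag"])


# Step 1: collection rules as downloaded from the service provider. Each
# advantage lists alternative sets of document types that justify it.
COLLECTION_RULES = {
    "reduced_rate": [{"payslip", "work_contract"}, {"title_deed"}],
    "no_insurance_fee": [{"title_deed", "medical_record"},
                         {"payslip", "medical_record"}],
}


def obtainable(rules, owned):
    """Step 2: advantages for which the user owns at least one full alternative."""
    return {adv for adv, alts in rules.items() if any(alt <= owned for alt in alts)}


def triggers(rules, advantage, exposed):
    """True if the exposed document types satisfy one alternative of the rule."""
    return any(alt <= exposed for alt in rules[advantage])


def minimum_exposure(rules, wanted, owned):
    """Step 4: smallest subset of owned types triggering every wanted advantage.

    Subsets are tried by increasing size, so the first hit is minimal.
    """
    for size in range(len(owned) + 1):
        for subset in combinations(sorted(owned), size):
            if all(triggers(rules, adv, set(subset)) for adv in wanted):
                return set(subset)
    return None


# The user's personal digital space: signed documents indexed by type.
wallet = {d["type"]: sign(d) for d in [
    {"type": "payslip", "net": 2400},
    {"type": "title_deed", "value": 150000},
    {"type": "medical_record", "status": "ok"},
]}

owned = set(wallet)
available = obtainable(COLLECTION_RULES, owned)                 # step 2
wanted = available                                              # step 3: all of them
to_expose = minimum_exposure(COLLECTION_RULES, wanted, owned)   # step 4
sent = [wallet[t] for t in sorted(to_expose)]                   # step 5: user side
accepted = all(verify(s) for s in sent)                         # step 5: provider side
print(sorted(to_expose), accepted)
```

In this example the payslip stays on the user's side: the title deed and the medical record are enough to trigger both advantages.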

References

[1] Directive 95/46/EC of the European Parliament and of the Council of 24 October 1995 on the protection of individuals with regard to the processing of personal data. Official Journal of the EC, 23, 1995.
[2] OECD Guidelines on the Protection of Privacy and Transborder Flows of Personal Data, 23 September 1980.
[3] Ponemon Institute, LLC. 2011 Annual Study: U.S. Cost of a Data Breach. 2012.