






Data-Purpose Algebra: Modeling Data Usage Policies
Chris Hanson (cph@csail.mit.edu) Tim Berners-Lee (timbl@w3.org)
Lalana Kagal (lkagal@csail.mit.edu) Gerald Jay Sussman (gjs@mit.edu)
Daniel Weitzner (djweitzner@csail.mit.edu)
Massachusetts Institute of Technology
Computer Science and Artificial Intelligence Laboratory
Cambridge, MA, USA
Abstract ance.
Data about individuals can be obtained easily through a
Data is often encumbered by restrictions on the ways in variety of ways including tracking user behavior on web-
which it may be used. These restrictions on usage may be sites, monitoring purchases on credit cards, querying gov-
determined by statute, by contract, by custom, or by com- ernment and commercial databases, and even using environ-
mon decency, and they are used to control collection of data, mental sensors. This data can be used to make inferences
diffusion of data, and the inferences that can be made over (of uncertain accuracy) about the individuals for a range
the data. In this paper, we present a data-purpose alge- of purposes from market research and customized market-
bra that can be used to model these kinds of restrictions ing to verifying the time during which the individual was
in various different domains. We demonstrate the utility of at work. These inferences can also lead to adverse conse-
our approach by modeling part of the Privacy Act (5 USC quences. For example, a person might be incorrectly iden-
§552a)1 , which states that data collected about US citizens tified as a terrorist and prevented from boarding a plane.
can be used only for the purposes for which it was collected. However, there are usually restrictions on the ways data can
We show (i) how this part of the Privacy act can be repre- be collected and used. These encumbrances may be deter-
sented as a set of restrictions on data usage, (ii) how the mined by statute, by contract, by custom, or by common
authorized purposes of data flowing through different gov- decency. Some of these restrictions are intended to control
ernment agencies can be calculated, and (iii) how these pur- the diffusion of the data, while others are intended to delimit
poses can be used to determine whether the Privacy Act is the consequences of actions predicated on that data.
being enforced appropriately. Data can also be sent from one individual or organiza-
tion to another. In this case, the allowable uses of data may
be further restricted by the sender: "I am telling you this
1. Introduction information in confidence. You may not use it to compete
with me, and you may not give it to any of my competitors."
Data may also be restricted by the receiver: "I don't want to
Privacy protection in large scale, decentralized informa- know anything about this that I may not tell my spouse."
tion systems such as the Web and large commercial data Although the details may become complex as data is
mining applications requires new technical tools and public passed from one individual or organization to another, the
policy approaches [15]. In particular, we believe that new restrictions on the uses to which the data may be put are
systems approaches are needed to enable assessment of ac- changed in ways that can often be formulated as algebraic
countability to support policies that govern uses of personal expressions. These expressions describe how the restric-
information. To this end, we describe here a data-purpose tions on the use of a particular data item may be computed
algebra that enables the expression of data-purpose and us- from the history of its transmission: the encumbrances that
age restrictions. Our goal is to build systems in which the are added or deleted at each step. A formalization of this
specific uses of personal data are transparent to authorized process is a Data-Purpose Algebra description of the pro-
observers and are subject to effective accountability assess- cess.
ments by those who seek to assure privacy policy compli-
One pervasive assumption behind our formalization is
1 For an overview of the Privacy Act of 1974 see the website: that the restrictions on a data item do not depend solely
http://www.usdoj.gov/oip/04 7 1.html on the content of the item, but rather on the content and
its annotated provenance. For example, a law-enforcement of Records (SORs), specified by the associated Systems of
official may not act on improperly obtained evidence, but Records Notices (SORNs), as defined by the Privacy Act (5
if the same information were redundantly obtained through USC §552a).
lawful channels the official may act, assuming that the in- Let r be a data repository; if the repository is a SOR,
dependence meets certain constitutional requirements. it has an associated SORN n = N (r), which gives infor-
mation about the permissible uses of the data in the SOR.
2. Data-Purpose Algebra The SORN specifies input conditions: the allowed sources
OS (n) from which data may be collected, the data cate-
To formally describe the ways that the use of data may gories KS (n) that may be collected, and the purposes PS (n)
be restricted and the way in which the restrictions are trans- for which data that is collected may be used. It also speci-
formed as the data is processed and passed from one agent fies a set of routine uses U (n) for data extracted from that
to another, we annotate each data item with extra informa- SOR.
tion. Each data item i, in addition to its content q = QD (i), Each routine use u U (n) specifies a set of possible re-
is annotated with its agent a = AD (i), a category k = cipient organizations OR (u), categories of data KR (u) that
KD (i), and a set of purposes p = PD (i) for which it can may be transferred to those organizations, and the set of au-
be used. An item is constructed from its content, agent, thorized purposes PR (u) for which the specified recipient
category, and purposes i = I(q, a, k, p). The agent is the organizations may use the given data. Any particular re-
producer of this data. The category is a set of data items cipient r1 may be a sub-organization of a possible recipient
containing this particular item. This set may be named but organization r2 specified in a SORN. This (non-strict) rela-
is not likely to be enumerated; for example a typical legal tion is notated r1 r2 . Likewise, any particular category
category is "US citizen." The set of purposes is explicit; a k1 may be a subset of a category k2 specified in a SORN.
typical purpose in the set of purposes is "criminal law en- This relation is notated k1 k2 .
forcement." The purposes allowed for data i that has been transferred
A data item i may be processed by some agent a to pro- from a SOR s to a SOR r depend on the purposes that came
duce a new data item. (See figure 1.) The new data item with the data and the input conditions on the SORN for r.
has the same kinds of annotations as the original one, but So, if s is not one of the allowed sources or the category of
the process generates new content and new annotations as the data is not one of the allowed categories the data may
functions of the original. The functions are specific to the not be used for any purpose:
kind of process performed by the agent.
For example, medical data about an identified person RIN (i, s, r) = PS (r) if o((o OS (N (r))) (s o))
may be severely restricted as to its allowable uses. But it k((k KS (N (r))) (KD (i) k))
may be anonymized for use in the education of medical doc- = {} otherwise
tors. In such a case, the allowed purposes of the anonymized
data may be wider than those of the original data, and the The set of applicable routine uses A(i, s, r) for transfer
category of the data will be different. On the other hand, of data item i from a SOR s to a recipient r is just the set
when a person enters a medical establishment for treatment of those entries for which the recipient is allowed by the
a record is made of the patient's name, address, and date of SORN for s and for which the category of the data KD (i) is
birth. This data itself typically has few usage restrictions, in the allowed categories KR (u) for that routine use u:
but the fact of it appearing in a medical record adds restric-
tions required by the HIPAA law. A(i, s, r) = {u U (N (s)) |
An agent may combine data from multiple agents to pro- o((o OR (u)) (r o))
duce new data. (See figure 2.) In this case the functions may
be considerably more complex. For example, a person at the k((k KR (u)) (KD (i) k))}
medical office may use a public source, such as a telephone
The restriction on authorized purposes of a transfer from
directory, to verify the recorded telephone number of a pa-
a source to a recipient is that the purposes must be autho-
tient. This process combines highly restricted information
rized by one or more of the applicable routine uses.
from a medical record with unrestricted public information,
but the result remains restricted.
ROUT (i, s, r) = PR (u)
uA(i,s,r)
3. An Example Formalization: Privacy Act
The authorized purposes Z(i, s, r) to which a recipient
In the illustration that follows we consider a simplified r may put a data item i extracted from a source s is then
formulation of the rules for data passed among Systems restricted to be those purposes particular to that data item
2
Q
Q ( Q d ( i ) )
Q d ( i )
c o n t e n t :
K
A d ( i )
a '
s o u r c e :
K
( K d ( i ) )
K d ( i )
c a t e g o r y :
P
P d ( i )
( P ( i ) , A ( i ) , ,
d d
P
p u r p o s e :
a '
K ( i ) )
d
e n t a '
D a t a i t e m ( i ) A g
N e w d a t a i t e m
Figure 1. Unary Process: A data item i may be processed by some agent a to produce a new data
item. The new content is some function Q(QD (i)) of the given content. The agent of the new data
item is a , the new category is a function K(KD (i)) of the given category, and the allowed purposes of
the new data item is a more complex function P(PD (i), AD (i), a , KD (i)) that may depend on the original
purposes, the agents, and the category of the original data.
that are also allowed by the authorized purposes specified are reached in violation of the Privacy Act) that generates a
in the SORN: series of adverse consequences for him.
A transaction log that leads up to an adverse
Z(i, s, r) = PD (i) RIN (i, s, r) ROUT (i, s, r) consequence--such as John Doe being arrested--is passed
through the data-purpose algebra of the Privacy Act. This
So Z(i, s, r) is the set of purposes of the new item held causes each data item in the log to be annotated with its au-
by the recipient r with the content of the old item i held by thorized purposes, as described in the earlier section. A data
the source s. item that has an empty list of purposes shows the point at
The result of a transfer process AXFER of an item i from a which a policy violation occurred. All further uses of these
source s to a recipient r is a new item: data items without purposes, including inferences made, in
the log are also invalid.
I(QD (i), AXFER , KD (i), Z(i, s, r)) We have built an implementation of the data-purpose al-
gebra in Scheme [10]. To illustrate how closely the program
4. Implementation corresponds to the mathematics, figure 3 shows an imple-
mentation of ROUT (i, s, r) and A(i, s, r) in Scheme. The
overall correspondence between code and mathematics is
We are constructing a system [15] that uses the data-
quite close.
purpose algebra to derive usage purposes for data, and con-
In order to enable users to understand where and how
sequently analyze usage of the data by government agen-
the Privacy Act has been violated in a particular transaction
cies. Our system expects as input the historical log of data
log, we are developing a user interface for our system. The
collection, analysis, and transfer between individuals and
UI provides several views into the annotated log, including:
government agencies. Based upon current government ef-
(a) a time-line of events; (b) flow of data through the records
forts, we presume that this log as well as case activities
systems; (c) transaction details; and (d) data-purpose calcu-
will exist in XML [2]. We plan to automatically convert
lations.
the XML transactional data into the Resource Description
Framework (RDF) [9]; annotate the transactional data with
agent, category, and purposes as described in this paper; and 5. Related Work
then encode the derivation steps in the Proof Markup Lan-
guage [5], thereby providing an interoperable justification Our algebra is closely related to work on privacy policy
representation for explanation and accountability. We are representation and enforcement. W3C's P3P framework al-
also defining notices (in RDF) for the Systems of Records lows websites to publish their privacy policy, which can be
that appear in the transaction log. In our hypothetical sce- matched against users' privacy preferences [16]. Comple-
nario, a traveler named John Doe from New York boards a mentary languages for defining privacy preferences such as
flight in New York and sets in motion a chain of inferences APPEL [4] and XPREF [1] have also been defined. The P3P
(some of which are factually incorrect and some of which framework is specifically aimed at web privacy in terms of
3
D a t a i t e m ( i )
Q d ( i )
A d ( i )
K d ( i )
P d ( i )
Q ( Q d ( i ) , Q d ( j ) )
Q
a ' '
K
( K d ( i ) , K d ( j ) )
K
P
( P d ( i ) , P d ( j ) , A d ( i ) ,
Q d ( j )
A d ( j ) , a ' ' , K d ( i ) ,
P
A d ( j )
K d ( j ) )
K d ( j )
P d ( j )
A g e n t a ' '
N e w d a t a i t e m
D a t a i t e m ( j )
Figure 2. Binary Process: Two data items, i and j, may be processed by some agent a to produce
a new data item. The new content is some function Q(QD (i), QD (j)) of the content of i and j. The
agent of the new data item is a and the new category is a function K(KD (i), KD (j)) of the categories
of i and j. The allowed purposes of the new data item is a complex function that may depend on the
original purposes, the agents, and the categories of i and j.
collecting, storing, and sharing user information and deals plicit restrictions on data usage may result in contradictory
solely with cookies and clickstream data. In contrast, our conclusions about what purpose restrictions apply. Further
data-purpose algebra has a broader scope and models re- work is required to define means by which these conflicting
strictions on how data can be used in general. results can be resolved.
While P3P is concerned with privacy protection issues Anonymization is often used to remove identifying char-
for web users, EPAL 1.1 [8] specifies enterprise-level poli- acteristics of data. However, it has been shown that sets
cies for data objects in the enterprise. Like P3P, EPAL has a of "anonymized" data can often be combined to discover
limited scope and does not provide a general model of data the identities of the parties [11]. Is the combined data re-
restrictions. stricted? It depends on the laws. If the law requires that
Policy languages such as Extensible Access Control medical records are restricted, then that conclusion is in-
Markup Language (XACML) [7] and KAoS [13, 14] are dependent of how they are derived. On the other hand, it
used to describe access control policies. Using these lan- is possible to combine two restricted pieces of information
guages it is possible to define restrictions on actions that to produce a less restricted deduction. For example, if the
deal with specific data thereby enforcing restrictions on data same information is available from two different sources,
usage. However, access control policies are enforced at the then the restriction on the combination may be relaxed to
time of the user request, whereas our work is focused on be the uses allowed by each source separately, or it may
accountability. not, depending on the details.
Privacy approaches such as ContextBroker [3] and Se- The data-purpose algebra is designed to compute al-
mantic eWallet [6] are being developed for pervasive com- lowed purposes based on how data is derived, but as in
puting environments where the behavior of users may be the case of re-constituted medical records, this is not al-
monitored and used. These approaches enforce users' pri- ways sufficient. We believe that this model can be extended
vacy preferences by preventing their data from being used to handle such cases by adding content-dependent global
by or modifying the data available to context aware services. rules, for example by using the category of a data item to
indicate when such a rule should be applied.
6. Future Work
7. Summary
The example in this paper shows how to cover many
kinds of formalizable requirements, such as those of the Pri- The algebraic approach is well suited to modeling the
vacy Act. But there are harder problems. Informal and im- allowable uses of information when the restrictions on that
4
(define (restriction-on-output record source recipient)
(rdf-set:union*
(append-map ts:routine-use.purposes
(applicable-routine-uses record source recipient))))
(define (applicable-routine-uses record source recipient)
(keep-matching-items (ts:sorn.routine-uses (ts:sor.notice source))
(lambda (use)
(and (there-exists? (ts:routine-use.recipients use)
(lambda (recipients)
(org-elt? recipient recipients)))
(there-exists? (ts:data-record.categories record)
(lambda (category)
(there-exists? (ts:routine-use.categories use)
(lambda (category*)
(rdf-subset? category category*)))))))))
Figure 3. Scheme implementation of ROUT (i, s, r) and A(i, s, r).
use are determined by the path by which the information is [5] P. P. da Silva, D. L. McGuinness, and R. Fikes. A proof
obtained, but it is not so good at dealing with restrictions markup language for semantic web services. Information
that are time dependent or inherent in the content of the Systems, 31(45):381395, JuneJuly 2006.
information, independent of the path. [6] F. Gandon and N. Sadeh. Semantic web technologies to rec-
oncile privacy and context awareness. Web Semantics Jour-
When formalized algebraically, computations are di-
nal, 2004.
rectly representable as purely functional computer pro- [7] S. Godik and T. Moses. OASIS extensible access control
grams. This makes it easy to verify that a program that markup language (XACML). OASIS Committee Specifica-
implements the data-purpose algebraic computations is cor- tion cs-xacml-specification-1.0, Nov. 2002.
rect. [8] IBM. EPAL 1.1. http://www.zurich.ibm.com/security/enterprise-
privacy/epal/Specification/index.html, 2003.
[9] F. Manola and E. Miller. RDF primer.
Acknowledgements http://www.w3.org/TR/rdf-primer/.
[10] G. L. Steele, Jr. and G. J. Sussman. The revised report on
scheme, a dialect of lisp. MIT AI Memo 452, Jan. 1978.
Hal Abelson, Deborah McGuinness, and K. Krasnow [11] L. Sweeney. Guaranteeing anonymity when sharing medical
Waterman read drafts of this paper and provided many use- data, the datafly system. Journal of the American Medical
ful and insightful comments. The work reported in this pa- Informatics Association, 1997.
per is supported by the US National Science Foundation Cy- [12] U. S. Department of Justice and Department of Homeland
bertrust (05-518) program. Security. National information exchange model (NIEM).
http://www.niem.gov/.
[13] A. Uszok, J. Bradshaw, P. Hayes, R. Jeffers, M. Johnson,
References S. Kulkarni, M. Breedy, J. Lott, and L. Bunch. DAML real-
ity check: A case study of KAoS domain and policy ser-
[1] R. Agrawal, J. Kiernan, R. Srikant, and Y. Xu. XPref: A vices. In International Semantic Web Conference (ISWC
preference language for P3P. Comput. Networks, 48(5):809 03), Sanibel Island, Florida, 2003.
827, 2005. [14] A. Uszok, J. M. Bradshaw, R. Jeffers, M. Johnson, A. Tate,
J. Dalton, and S. Aitken. Policy and contract management
[2] T. Bray, J. Paoli, C. M. Sperberg-McQueen, E. Maler,
for semantic web services. In AAAI Spring Symposium, First
and F. Yergeau. Extensible markup language (XML) 1.0.
International Semantic Web Services Symposium, 2004.
http://www.w3.org/TR/REC-xml/.
[15] D. Weitzner, H. Abelson, T. Berners-Lee, C. Hanson,
[3] H. Chen, T. Finin, and A. Joshi. An intelligent broker for
J. Hendler, L. Kagal, D. McGuinness, G. Sussman, and
context-aware systems. In Adjunct Proceedings of Ubicomp
K. K. Waterman. Transparent accountable inferencing for
2003, Seattle, Washington, USA, October 1215, 2003, Oct.
privacy risk management. In The Semantic Web meets eGov-
2003.
ernment, AAAI Spring Symposium, 2006.
[4] L. Cranor, M. Langheinrich, and M. Marchiori. [16] World Wide Web Consortium. Platform for privacy prefer-
A P3P preference exchange language (APPEL). ences (P3P) project. http://www.w3.org/P3P/.
http://www.w3.org/TR/P3P-preferences/.
5