DAS Implementation Details
We use the term "measurement" to describe a noisy estimator of a
function of the enumerated confidential data; that is, a noisy
estimator of a function computed on the 2020 Census Edited File. In
the DAS, queries are generalized marginal queries: individual
attributes may be coarsened (aggregating some of their levels
together, and optionally dropping some levels), then arbitrary
Cartesian products of these coarsened attributes may be used as
predicates for the query. This procedure yields a special class of
counting queries for which the levels within each query are mutually
exclusive (adding or deleting a single record can alter at most the
count corresponding to one scalar level in each query), and in which
the unbounded (add-delete-a-record) L1 sensitivity of each query is
at most 1. Although the theory underlying the DAS (see theorems in Census TopDown Algorithm: Differentially Private Data, Incremental Schemas, and Consistency with Public Knowledge) can support queries with
arbitrary finite sensitivity, these special "coarsened marginal"
queries correspond closely to the queries specified in the
published tables from a decennial census, and have been the primary
focus of development on the DAS system and its tuning. Coarsened
marginal queries correspond to the SumOverQuery and SumOverGroupedQuery
classes in programs/queries/querybase.py, which are the only two query
types returned by the makeTabularGroupQuery staticmethod in that same
file; that staticmethod is in turn the only query constructor recognized
by the current repository's implementation of Schema.buildQuery in
programs/schema/schema.py. Consequently, users must limit queries to
the coarsened marginal type.
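To make the idea of coarsening concrete, here is a minimal sketch (using our own toy names, not the DAS classes above): a coarsened marginal query over one attribute can be written as a 0/1 matrix whose rows are mutually exclusive groups of levels, applied to the histogram of counts.

```python
import numpy as np

# Hypothetical illustration (not the DAS API): an attribute with 4 levels
# is coarsened into 2 groups, {0, 1} and {2, 3}. Each row of the query
# matrix selects one group; the rows are mutually exclusive.
coarsening = np.array([
    [1, 1, 0, 0],   # group A aggregates levels 0 and 1
    [0, 0, 1, 1],   # group B aggregates levels 2 and 3
])

histogram = np.array([5, 3, 2, 7])   # confidential counts per level
answer = coarsening @ histogram      # counts per coarsened group
print(answer)                        # -> [8 9]

# Each record falls in exactly one group, so adding or deleting one
# record changes the answer vector by at most 1 in L1 norm.
```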
To give a concrete example, measuring the variable 'cenrace' for a
particular county results in a query whose values are any of the 63
census race combinations and whose counts are the number of
individuals in the county who reported each particular combination of
races. This query can be implemented by fully coarsening all
dimensions except CENRACE, and not coarsening the CENRACE dimension at
all. Compound queries indicating multiple attributes involve Cartesian
products of the individual attributes. For example, the query cenrace x
hispanic has 63 × 2 = 126 possible values, since hispanic has two
levels (yes or no).
Evaluation and management of these queries are typically further optimized through efficient Kronecker product representations, which are especially convenient when working with matrices as representations of Cartesian products of individual coarsened attributes.
Noise injection consists of adding the random output of the Discrete Gaussian Mechanism to each count query generated by the process described in the measurement section. The result of adding the random perturbations to the queries on the confidential data is called a "noisy measurement." The noisy measurements provably protect the original data (in the sense described on the Background page of this wiki). The confidential attributes of individuals are protected with a guarantee provided for each attribute singly, and for each combination of attributes, depending on the amount and distribution of the privacy-loss budget. The noisy measurements may be negative and mutually inconsistent. They are post-processed into a single, consistent microdata representation that is used for publication of the final tabulations.
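A simplified sketch of the noise-injection step follows. This is not the production sampler: the real DAS uses an exact discrete Gaussian sampler, while here we draw integer noise from a truncated discrete Gaussian (pmf proportional to exp(-k²/(2σ²)), support cut off at ±10σ) purely for illustration.

```python
import numpy as np

def sample_discrete_gaussian(sigma, size, rng):
    """Truncated discrete Gaussian sampler (illustrative shortcut only)."""
    support = np.arange(-int(10 * sigma), int(10 * sigma) + 1)
    pmf = np.exp(-support.astype(float) ** 2 / (2 * sigma ** 2))
    pmf /= pmf.sum()
    return rng.choice(support, size=size, p=pmf)

rng = np.random.default_rng(0)
true_counts = np.array([8, 9, 0, 4])       # exact query answers
noise = sample_discrete_gaussian(sigma=2.0, size=true_counts.shape, rng=rng)
noisy_measurements = true_counts + noise   # integer-valued, may be negative
print(noisy_measurements)
```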
Mathematical optimization is used to infer a "best-fitting" set of microdata (conveniently represented as a histogram) from the noisy measurements. These same optimization procedures also ensure that designated invariants and constraints over the resulting MDF are satisfied.
Optimization is organized in the DAS into Nonnegative Least Squares (NNLS) passes, which generally return float-valued estimates of histograms, and Rounder passes, which target a less rich set of queries and use the L1 rather than L2 norm, but, under suitable technical conditions, efficiently and provably generate nonnegative integer-valued histograms. These nonnegative integer-valued histograms can then be interpreted as counts of the number of records of each type, in the current geographic unit. They are essentially identical to microdata, but written in a form more convenient for matrix algebra than for human reading and review.
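The two-pass structure can be sketched in miniature. The real passes solve large constrained problems per geographic unit; here we use scipy's off-the-shelf NNLS solver for the first pass, and a simple largest-remainder rounding (our own stand-in, not the DAS's L1-optimization Rounder) for the second.

```python
import numpy as np
from scipy.optimize import nnls

# NNLS pass: find a nonnegative float histogram x minimizing ||Ax - b||_2,
# where b stacks the noisy measurements and A stacks the query matrices.
A = np.vstack([np.eye(4), np.ones((1, 4))])   # detailed query + total query
b = np.array([7.4, 2.8, -0.5, 4.1, 14.0])     # noisy, possibly negative
x_float, _ = nnls(A, b)                       # nonnegative float estimate

# Rounder pass (simplified): produce nonnegative integers while
# preserving the rounded total, by bumping the largest fractional parts.
def round_preserving_total(x):
    floor = np.floor(x).astype(int)
    remainder = int(round(x.sum())) - floor.sum()
    order = np.argsort(-(x - floor))          # largest fractional parts first
    floor[order[:remainder]] += 1
    return floor

x_int = round_preserving_total(x_float)
print(x_int)   # nonnegative integers, interpretable as record counts
```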
Once the privacy-protected histograms have been generated down to the
census block geographic level, they are used to generate a
privacy-protected microdata file. This transformation is just a change
in format: each index in the numpy ndarray representing each histogram
corresponds to a record type, and its nonnegative count indicates the
number of records of that type. The DAS writers are responsible for
performing this conversion, generating a number of microdata records
identical to the count at each ndarray index. The privacy-protected
microdata records generated in this way can then be used for
conventional statistical tabulations, or be easily read by humans, etc.
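The histogram-to-microdata conversion described above can be sketched in a few lines (a hypothetical illustration, not the DAS writers' code): emit one record per unit of count at each histogram cell, with the cell's indices giving the record's attribute levels.

```python
import numpy as np

# Toy 2-attribute block-level histogram of nonnegative integer counts.
hist = np.array([[2, 0],
                 [1, 3]])

cells = np.argwhere(hist > 0)                # record types with nonzero counts
counts = hist[hist > 0]                      # how many records of each type
records = np.repeat(cells, counts, axis=0)   # one row per microdata record

# Each row lists the attribute levels of one record;
# the number of rows equals hist.sum().
print(records)
assert records.shape == (hist.sum(), hist.ndim)
```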