
DAS Implementation Details

Philip Leclerc edited this page Jun 22, 2023 · 8 revisions

Measurement

We use the term "measurement" to describe a noisy estimator of a function of the enumerated confidential data; that is, a noisy estimator of a function computed on the 2020 Census Edited File. In the DAS, queries are generalized marginal queries: individual attributes may be coarsened (aggregating some of their levels together, and optionally dropping some levels), and arbitrary Cartesian products of these coarsened attributes may then be used as predicates for the query. This procedure yields a special class of counting queries in which the levels within each query are mutually exclusive (adding or deleting a single record can alter at most the count corresponding to one scalar level in each query), so that the unbounded (add-or-delete-a-record) L1 sensitivity of each query is at most 1.

Although the theory underlying the DAS (see the theorems in Census TopDown Algorithm: Differentially Private Data, Incremental Schemas, and Consistency with Public Knowledge) can support queries with arbitrary finite sensitivity, these special "coarsened marginal" queries correspond closely to the queries specified in the published tables of a decennial census, and they have been the primary focus of development and tuning of the DAS.

Coarsened marginal queries correspond to the SumOverQuery and SumOverGroupedQuery classes in programs/queries/querybase.py. These are the only two query types returned by the makeTabularGroupQuery staticmethod in that same file, which in turn is the only query constructor used by the current repository's implementation of Schema.buildQuery in programs/schema/schema.py. Consequently, users must limit queries to the coarsened marginal type.
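As a minimal sketch (using a hypothetical 4-level attribute, not an actual DAS schema), coarsening an attribute can be viewed as multiplying its histogram by a 0/1 aggregation matrix; the mutual exclusivity of levels, and hence the sensitivity bound of 1, corresponds to each column of that matrix containing at most one nonzero entry:

```python
import numpy as np

# Hypothetical 4-level attribute coarsened to 2 levels by aggregating
# levels {0, 1} and {2, 3}; rows are coarsened levels, columns are
# original levels.
coarsen = np.array([
    [1, 1, 0, 0],
    [0, 0, 1, 1],
])

# A histogram over the original 4 levels (count of records per level).
hist = np.array([5, 2, 7, 1])

# The coarsened query answer.
answer = coarsen @ hist  # [7, 8]

# Adding or deleting one record changes exactly one entry of `hist` by 1,
# which changes at most one entry of `answer` by 1, because every column
# of the query matrix has at most one nonzero (unit) entry.
assert (coarsen.sum(axis=0) <= 1).all()
```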

To give a concrete example, measuring the variable 'cenrace' for a particular county results in a query whose levels are the 63 census race combinations and whose counts are the numbers of individuals in the county who reported each particular combination of races. This query can be implemented by fully coarsening all dimensions except CENRACE, and not coarsening the CENRACE dimension at all. Compound queries involving multiple attributes are Cartesian products of the individual attributes. For example, the query cenrace x hispanic has 126 possible levels (63 x 2, since hispanic has two levels, yes or no).

Evaluation and management of these queries is typically further optimized through efficient Kronecker product representations, which are especially convenient for working with matrices as representations of Cartesian products of individual coarsened attributes.
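To illustrate the Kronecker representation (with toy attribute sizes, not the real schema), the matrix for a compound query over a flattened joint histogram is the Kronecker product of per-attribute query matrices, so a marginal over one attribute can be expressed without building the joint query matrix by hand:

```python
import numpy as np

# Toy attributes: a race attribute coarsened to 3 levels, HISP with 2.
q_race = np.eye(3)             # keep each coarsened race level separately
q_total_hisp = np.ones((1, 2)) # fully coarsen (sum over) HISP

# Query matrix for the race marginal over the flattened (race, hisp)
# joint histogram, via the Kronecker product of the per-attribute pieces.
q_marginal = np.kron(q_race, q_total_hisp)  # shape (3, 6)

# A joint histogram over (race, hisp), flattened in row-major order.
hist = np.array([4, 1, 0, 3, 2, 2])

answer = q_marginal @ hist  # race marginal: [5, 3, 4]

# Same answer as reshaping and summing over the HISP axis directly.
assert np.array_equal(answer, hist.reshape(3, 2).sum(axis=1))
```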

Noise Injection

Noise injection consists of adding the random output of the Discrete Gaussian Mechanism to the count queries generated by the process described in the Measurement section. The result of adding the random perturbations to the queries on the confidential data is called a "noisy measurement." The noisy measurements provably protect the original data (in the sense described on the Background page of this wiki). The confidential attributes of individuals are protected with a guarantee provided for each attribute singly and for each combination of attributes, with the strength of the guarantee depending on the amount and distribution of the privacy-loss budget. The noisy measurements may be negative and mutually inconsistent. They are post-processed into a single, consistent microdata representation that is used for publication of the final tabulations.
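A rough sketch of the mechanism (not the production sampler, which draws from the discrete Gaussian exactly; here the distribution is approximated by normalizing exp(-x²/2σ²) over a wide finite support) shows the key properties: the noise is integer-valued, symmetric about zero, and can push counts negative:

```python
import math
import random

def sample_discrete_gaussian(sigma, truncation=None, rng=random):
    """Draw from a truncated discrete Gaussian on the integers.

    Illustration only: weights proportional to exp(-x^2 / (2 sigma^2))
    are normalized over [-truncation, truncation], a close approximation
    when truncation is many multiples of sigma.
    """
    if truncation is None:
        truncation = int(10 * sigma) + 1
    support = list(range(-truncation, truncation + 1))
    weights = [math.exp(-(x * x) / (2 * sigma * sigma)) for x in support]
    return rng.choices(support, weights=weights, k=1)[0]

# Add noise to a vector of confidential query answers.
true_counts = [12, 0, 7]
noisy = [c + sample_discrete_gaussian(sigma=2.0) for c in true_counts]
# The noisy measurements remain integers but may be negative.
```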

Microdata-as-Histograms from Noisy Measurements

Mathematical optimization is used to infer a "best-fitting" set of microdata (conveniently represented as a histogram) from the noisy measurements. These same optimization procedures also ensure that designated invariants and constraints over the resulting MDF are satisfied.

Optimization in the DAS is organized into Nonnegative Least Squares (NNLS) passes, which generally return float-valued estimates of histograms, and Rounder passes, which target a less rich set of queries and use the L1 rather than the L2 norm but, under suitable technical conditions, efficiently and provably generate nonnegative integer-valued histograms. These nonnegative integer-valued histograms can then be interpreted as counts of the number of records of each type in the current geographic unit. They are essentially identical to microdata, but written in a form more convenient for matrix algebra than for human reading and review.
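The "floats in, consistent integers out" contract of a Rounder pass can be illustrated with a much simpler stand-in (the actual Rounder solves an L1 optimization; this largest-remainder scheme is only an analogue) that rounds an NNLS-style float histogram to nonnegative integers while preserving an integral total:

```python
import math

def round_preserving_total(values):
    """Round nonnegative floats to nonnegative integers whose sum equals
    the (assumed integral) sum of the inputs, via largest remainders.

    A minimal stand-in for a DAS Rounder pass, which instead solves an
    L1-norm optimization subject to constraints.
    """
    floors = [math.floor(v) for v in values]
    shortfall = round(sum(values)) - sum(floors)
    # Give the leftover units to the entries with the largest remainders.
    order = sorted(range(len(values)),
                   key=lambda i: values[i] - floors[i],
                   reverse=True)
    for i in order[:shortfall]:
        floors[i] += 1
    return floors

nnls_estimate = [2.4, 0.9, 3.7]  # float histogram from an NNLS-style pass
rounded = round_preserving_total(nnls_estimate)  # integers summing to 7
```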

Histograms to Microdata

Once the privacy-protected histograms have been generated down to the census block geographic level, they are used to generate a privacy-protected microdata file. This transformation is just a change in format: each index in the numpy ndarray representing each histogram corresponds to a record type, and its nonnegative count indicates the number of records of that type. The DAS writers are responsible for performing this conversion, generating a number of microdata records identical to the count in each ndarray index. The privacy-protected microdata records generated in this way can then be used for conventional statistical tabulations, easily read by humans, etc.
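The format change can be sketched as follows (using a toy 2 x 3 histogram over hypothetical attribute levels, not a real DAS schema): each ndarray index becomes a record type, repeated as many times as its count:

```python
import numpy as np

# A tiny privacy-protected block-level histogram over two hypothetical
# attributes with 2 and 3 levels respectively.
hist = np.array([[1, 0, 2],
                 [0, 3, 0]])

# Emit one microdata record (here, just the index tuple of its record
# type) per unit of count at each histogram cell.
records = [idx for idx in np.ndindex(hist.shape)
           for _ in range(int(hist[idx]))]

# One record per counted person in the histogram.
assert len(records) == int(hist.sum())
```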