Skip to content

Commit

Permalink
feat: add GCMS and OCMS.
Browse files Browse the repository at this point in the history
  • Loading branch information
PrivacyGo committed Jan 24, 2025
1 parent 260c45b commit 1ab6781
Show file tree
Hide file tree
Showing 14 changed files with 806 additions and 0 deletions.
6 changes: 6 additions & 0 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -23,6 +23,12 @@ MPC-DualDP is a distributed protocol for generating shared differential privacy
## AnonPSI
The widely used ECDH-PSI, while keeping all data encrypted, discloses the size of the intersection set during protocol execution. We refer to such protocols as size-revealing PSI. AnonPSI offers a framework for systematically assessing the privacy of intersection-size-revealing PSI protocols by employing carefully designed set membership inference attacks. It enables an adversary to infer whether a targeted individual is in the intersection, which is also known as membership information. For more detailed information, please refer to [this folder](./anonpsi). AnonPSI was recently accepted for NDSS24, and we look forward to engaging in discussions during the offline sessions at NDSS.

## GCMS/OCMS
Local Differential Privacy (LDP) protocols enable the collection of randomized client messages for data analysis, without the necessity of a trusted data curator. Such protocols have been successfully deployed in real-world scenarios by major tech companies like Google, Apple, and Microsoft.

We propose a Generalized Count Mean Sketch (GCMS) protocol that captures many existing frequency estimation protocols. Our method significantly improves the three-way trade-offs between communication, privacy, and accuracy. We also introduce a general utility analysis framework that enables optimizing parameter designs. Based on that, we propose an
Optimal Count Mean Sketch (OCMS) framework that minimizes the variance for collecting items with targeted frequencies. Moreover, we present a novel protocol for collecting data within unknown domain, as our frequency estimation protocols only work effectively with known data domain. For more detailed information, please refer to [this folder](./ldp).

## Contribution

Please check [Contributing](CONTRIBUTING.md) for more details.
Expand Down
123 changes: 123 additions & 0 deletions ldp/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,123 @@
# Privacy-Preserving Data Collection with Local Differential Privacy

Local Differential Privacy (LDP) is a privacy-preserving technique in which data is randomized before being shared, ensuring that each individual’s information remains hidden even from data collectors. By adding controlled noise or transformations on the client side, LDP enables useful aggregate analysis without compromising user privacy.

## General Count Mean Sketch Protocol for Local Differential Privacy

We propose the Generalized Count-Mean-Sketch (GCMS) protocol, which builds on Apple's CMS by accounting for randomness in responses and hash collisions, improving accuracy under the LDP model. GCMS reduces communication costs while maintaining privacy and provides a utility guarantee for optimal parameter design.

To further enhance privacy, we use the Encryption-Shuffling-Analysis (ESA) framework. Clients' encoded outputs are encrypted and shuffled by a trusted shuffler, breaking the link between clients and responses and amplifying privacy. The server then aggregates and analyzes the shuffled data.

The overview of the protocol is as follows:

Preparation:
Before data collection, the server generates $k$ independent hash functions $\mathcal{H} \overset{\Delta}{=} \{h_1,h_2,...,h_k\}$, where each hash function deterministically maps any input to a discrete number in $[\mathbf{m}]=[1,...,m]$, with $m$ being the hashing range, $\mathcal{H}$ being the universe of hash functions. The server then sends $\mathcal{H}$ and its public key $pk$ to each client.

<div align="center">
<img src="./images/LDP_CMS.png" alt="My local image 2" title="Protocol Overview" width="600" />
</div>

The entire data collection pipeline is illustrated in Figure above. It consists of three distinct phases, each operating on different platforms:
1. The initial phase involves all on-device algorithms, including hash encoding and privatization, and encryption with the server's public key $pk$.
2. The encrypted privatized data is then transmitted through an end-to-end encrypted channel and received by the shuffler. The shuffler then forwards the data to the server after a random shuffling operation.
3. Finally, the server decrypts the input data, performs data aggregation, and obtains the Frequency Lookup Table - a sketch matrix $\mathcal{M}$.

Phase 1: On-device operations.

We present the process for a single data privatization and encryption, which consists of the following steps.
1. Hash Encoding: Each client first uniformly selects a hash function from $\{h_1, h_2, \ldots, h_k\}$;and calculates a hashed value $r=h_j(d)$ of their raw data $d$.
2. Probabilistic Inclusion: Each client initializes their privatized vector $\textbf{x}$ as an empty set, then adds $r$ to $\textbf{x}$ with probability $p \in [0.5,1]$.
3. Probabilistic Extension: Set the extension domain for each client as $[\mathbf{m}]/r = \{1, 2, \ldots, r-1, r+1, \ldots, m\}$. If $r$ is added to $\textbf{x}$, then uniformly select $s-1$ elements from $[\mathbf{m}]/r$ and append them to $\textbf{x}$. If $r$ is not added to $\textbf{x}$, then uniformly select $s$ elements from $[\mathbf{m}]/r$ and append them to $\textbf{x}$. Return the privatized vector along with the selected hash index $\langle \textbf{x}, j \rangle$.
4. Encryption and Release: Each client encrypts $\langle \textbf{x}, j \rangle$ with the server's public key $pk$ to obtain $v = E_{pk}[\textbf{x}, j]$, then releases $v$ to the shuffler.

Phase 2: Shuffler's anonymization and shuffling.

This phase, which is standard, involves Anonymization and Shuffling.

Phase 3: Server's aggregation and frequency estimation.

The server first decrypts messages with the secret key and obtains: π(⟨x₁, j₁⟩, ⟨x₂, j₂⟩, …, ⟨xₙ, jₙ⟩).
Then, the server-side algorithm constructs a sketch matrix, in which the rows are indexed by hash functions, and each row $j$ is the sum of the privatized vector of clients who selected the hash function $h_j$. To estimate the frequency of a message $d$, the server calculates all the hashed values of $d$: $h_j[d]^k_{j=1}$, and then aggregates the total count from the sketch matrix $\mathcal{M}$:
$C(d) = \sum_{j=1}^k \mathcal{M}_j[h_j[d]].$
An unbiased estimator for estimating the numbers of $d$ occurring is
$\hat{f}(d) = \frac{C(d) - \frac{p n}{m} - q n\left(1 - \frac{1}{m}\right)}{(p - q)\left(1 - \frac{1}{m}\right)}$

where $q= \frac{s-p}{m - 1}$.

For more details, please refer to our [latest paper](https://arxiv.org/pdf/2412.17303).

## Privacy-Preserving Data Collection from Unknown Domains

Collecting data from unknown domains is a common challenge in real-world applications, such as identifying new words in text inputs, or tracking emerging URLs. Current solutions often suffer from high computational and communication costs. To address this, we propose a Quasi-Local Privacy Protocol for collecting items from unknown domains. Our protocol combines central Differential Privacy (DP) techniques with cryptographic tools within the Encryption-Shuffling-Analysis (ESA) framework. This approach provides privacy guarantees similar to Local Differential Privacy (LDP) while significantly reducing computational overhead.

Our protocol uses an auxiliary server to construct histograms without accessing the original data. This allows the protocol to achieve accuracy similar to the central DP model, while still providing privacy protections akin to Local Differential Privacy (LDP). The technique leverages a stability-based histogram method integrated into the ESA framework, where the auxiliary server builds histograms from encrypted messages, ensuring privacy without tracing the original data. By eliminating the need for message segmentation and reconstruction, our protocol delivers central DP-level accuracy while maintaining LDP-like privacy guarantees.

### Protocol Overview
<div align="center">
<img src="./images/unknown_domain.png" alt="My local image" title="Protocol Overview" width="600" />
</div>

Our protocol uses the stability-based histogram technique [1] for collecting new data. To prevent direct data collection and tracing by the server, we integrate the ESA framework. Additionally, we introduce an auxiliary server to construct the histogram, similar to a central curator in the central DP model, but with a crucial distinction: the auxilary server does not access the original data messages. Instead, it only receives an encrypted version of the data $d$ encrypted with the server's public key $E_{pk1}[d]$, along with its hashed value $H[d]$. This ensures the auxiliary server gains no knowledge of the original message, except for the irreversible hash.
To construct the histogram, the auxiliary server counts the number of message $d$ by counting the corresponding hashed value $H[d]$. Each historgram bin is represented by a sampled $E_{pk1}[d]$, which can only be decrypted by the server. To enhance security, the messages passing through the shuffler between the auxiliary server and clients are further encrypted using the auxiliary server's public key: $E_{pk2} [E_{pk1}[d] || H[d]]$.

The overall framework contains four phases and is shown in Figure 2. We now describe each step in detail.

Before data collection begins, the server and the auxiliary server send their public keys, $pk1$ and $pk2$, to each client. They also agree on a set of privacy parameters $(\epsilon,\delta)$ for the DP guarantees for the release.

Phase 1: On-Device Processing

1. Data Encryption with Server's Public Key: The client encrypts the data with the server's public key, and the cyphertext is denoted as $E_{pk1}[d]$.
2. Data Hashing: The private data is then hashed by a hash function, denoted as $H$ (which is unique for each client but identical across all clients). The hashed result is denoted as $H[d]$.
3. Encryption with the Auxiliary Server's Public Key: Finally, the encrypted data and its hashed value are encrypted with the auxiliary server's public key. The encrypted message is denoted as $v = E_{pk2} [E_{pk1}[d] || H[d]]$

Phase 2: Shuffler's Anonymization and Shuffling

1. The encrypted message is passed on to the shuffler through an end-to-end encrypted channel. The shuffler then performs anonymization and shuffling after receiving each client's input, which is standard.

Phase 3: The auxiliary server applies Differential Privacy (DP) protection when releasing the item names.
1. Decrypt Messages: The first step is to decrypt the messages received from the shuffler using the secret key. The server then observes the following encrypted data: $\{E_{pk1}[d_1] || H[d_1],...,E_{pk1}[d_n] || H[d_n]\}$
2. Hash Frequency Calculation: Since each client uses the same hash function, the hashed results for different clients with identical items must be identical. The auxiliary server calculates the frequency of each hashed result and attaches the corresponding encrypted data to it.
3. Add DP Noise: The auxiliary server then adds Laplacian noise with scale $b$ to the frequency of each hashed result. For those hashed results with noisy frequencies above a threshold $T$, the auxiliary server randomly samples from the corresponding encrypted data and releases it to the server.

Phase 4: Server Decrypts Messages

1. In the final phase, the server decrypts the messages received from the auxiliary server to obtain the plaintext, which contains the item names.
Privacy and Utility Analysis
Privacy Analysis: The proposed protocol ensures differential privacy. Specifically, the released encrypted item set $S$ is $(\epsilon,\delta)$-differentially private with
$\epsilon = \max\left(\frac{1}{b}, \log\left(1 + \frac{1}{2 e^{(T-1)/b} - 1}\right)\right)$

and $\delta=\frac{1}{2}\exp(\epsilon(1-T)).$

For a detailed analysis, please refer to our [latest paper](https://arxiv.org/pdf/2412.17303).

## How to Use

### Repository Structure

The `src` folder contains the following files:
- `gcms.py`: A demo illustration of General Count Mean Sketch (GCMS) Framework.
- `gcms_client.py`: On-device LDP Algorithm in General Count Mean Sketch (GCMS) Framework.
- `gcms_server.py`: Server in General Count Mean Sketch (GCMS) Framework.
- `gcms_shuffler.py`: Shuffler in General Count Mean Sketch (GCMS) Framework.
- `unknown_domain.py`: A demo illustration of Privacy-Preserving Data Collection with Unknown Domain.
- `unknown_domain_client.py`: On-device algorithm in Privacy-Preserving Data Collection with Unknown Domain.
- `unknown_domain_aux_server.py`: Auxiliary Server in Privacy-Preserving Data Collection with Unknown Domain.
- `unknown_domain_server.py`: Server in Privacy-Preserving Data Collection with Unknown Domain.

### Requirements
- Python3
- Install all dependencies via `python3 -m pip install -r requirements.txt`

### How to simulate two protocols
Here we give two examples to simulate the protocol of GCMS and Privacy-Preserving Data Collection with Unknown Domain respectively. You can run it via `python3 gcms.py` and `python3 unknown_domain.py`.

## License

PrivacyGo is Apache-2.0 License licensed, as found in the [LICENSE](LICENSE) file.

## Disclaimers

This software is not an officially supported product of TikTok. It is provided as-is, without any guarantees or warranties, whether express or implied.

## Reference
[1] A. Korolova, K. Kenthapadi, N. Mishra, and A. Ntoulas, “Releasing search queries and clicks privately,” in Proceedings of the 18th international conference on World wide web, 2009, pp. 171–180.
Binary file added ldp/images/LDP_CMS.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added ldp/images/unknown_domain.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
3 changes: 3 additions & 0 deletions ldp/requirements.txt
Original file line number Diff line number Diff line change
@@ -0,0 +1,3 @@
numpy
pandas
cryptography
83 changes: 83 additions & 0 deletions ldp/src/gcms.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,83 @@
# Copyright 2023 TikTok Pte. Ltd.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
'''A demo illustration of General Count Mean Sketch (GCMS) Framework.'''

from cryptography.hazmat.primitives.asymmetric import rsa

from typing import List
from gcms_utils import Paremeters
from gcms_client import GCMSClient
from gcms_shuffler import GCMSShuffler
from gcms_server import GCMSServer

import math


def gcms_demo(data: List[str], bench_nums: int, estimate_message: str):
'''
A demo illustration of General Count Mean Sketch (GCMS) Framework.
Args:
data: The raw data to be privatized and encrypted.
bench_nums: The number of bench times.
estimate_message: The message to be estimated.
'''
#0. Prepare the parameters for the algorithm.
ocms_parms = Paremeters(k=1000, m=1024, s=56, p=0.5)
epsilon = math.log((ocms_parms.m - ocms_parms.s) * ocms_parms.p / ((1 - ocms_parms.p) * ocms_parms.s))
print("epsilin is ", epsilon)

server_private_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
server_public_key = server_private_key.public_key()

estimate_frequencies = []
estimate_frequencies_debug = []
for i in range(bench_nums):
#1. clients perform on-device ldp operations on the raw data.
encrypted_messages, hash_indexs_debug, plaintext_messages_debug = GCMSClient().on_device_ldp_algorithm(
ocms_parms.k, ocms_parms.m, data, ocms_parms.s, ocms_parms.p, server_public_key)

#2. Shuffler shuffle the encrypted messages.
shuffled_message = GCMSShuffler.shuffle(encrypted_messages)

#3.0 Init Server
server = GCMSServer(ocms_parms.k, ocms_parms.m, server_private_key)

#3.1 Server decrypt the shuffled messages.
plaintext_messages, hash_indexs = server.decrypt_message(shuffled_message)

#3.2 Server construct the sketch matrix.
server.construct_sketch_matrix(plaintext_messages, hash_indexs)
# for debug
# print(sorted(hash_indexs_debug) == sorted(hash_indexs))
# print(sorted(plaintext_messages_debug) == sorted(plaintext_messages_debug))

#3.3. Server estimate the frequency of the specific message.
estimate_frequency_i = server.estimate_frequency(estimate_message, ocms_parms.p, ocms_parms.s)
estimate_frequencies.append(estimate_frequency_i)

server_debug = GCMSServer(ocms_parms.k, ocms_parms.m, server_private_key)
server_debug.construct_sketch_matrix(plaintext_messages_debug, hash_indexs_debug)
estimate_frequency_i_debug = server_debug.estimate_frequency(estimate_message, ocms_parms.p, ocms_parms.s)
estimate_frequencies_debug.append(estimate_frequency_i_debug)

print(f"average estimated frequency of {estimate_message}:", sum(estimate_frequencies) / bench_nums)
print(f"average debug estimated frequency of {estimate_message}:", sum(estimate_frequencies_debug) / bench_nums)


if __name__ == "__main__":
data = ['123' for i in range(100)]
data.extend(['456' for i in range(50)])
data.extend(['789' for i in range(25)])
gcms_demo(data, 10, '123')
Loading

0 comments on commit 1ab6781

Please sign in to comment.