diff --git a/README.md b/README.md
index 127e256..58756d9 100644
--- a/README.md
+++ b/README.md
@@ -23,6 +23,12 @@ MPC-DualDP is a distributed protocol for generating shared differential privacy
## AnonPSI
The widely used ECDH-PSI, while keeping all data encrypted, discloses the size of the intersection set during protocol execution. We refer to such protocols as size-revealing PSI. AnonPSI offers a framework for systematically assessing the privacy of intersection-size-revealing PSI protocols by employing carefully designed set membership inference attacks. It enables an adversary to infer whether a targeted individual is in the intersection, which is also known as membership information. For more detailed information, please refer to [this folder](./anonpsi). AnonPSI was recently accepted for NDSS24, and we look forward to engaging in discussions during the offline sessions at NDSS.
+## GCMS/OCMS
+Local Differential Privacy (LDP) protocols enable the collection of randomized client messages for data analysis, without the necessity of a trusted data curator. Such protocols have been successfully deployed in real-world scenarios by major tech companies like Google, Apple, and Microsoft.
+
+We propose a Generalized Count Mean Sketch (GCMS) protocol that captures many existing frequency estimation protocols. Our method significantly improves the three-way trade-off between communication, privacy, and accuracy. We also introduce a general utility analysis framework that enables optimized parameter design. Based on that, we propose an
+Optimal Count Mean Sketch (OCMS) framework that minimizes the variance for collecting items with targeted frequencies. Moreover, because our frequency estimation protocols only work effectively with a known data domain, we present a novel protocol for collecting data from unknown domains. For more detailed information, please refer to [this folder](./ldp).
+
## Contribution
Please check [Contributing](CONTRIBUTING.md) for more details.
diff --git a/ldp/README.md b/ldp/README.md
new file mode 100644
index 0000000..983e11d
--- /dev/null
+++ b/ldp/README.md
@@ -0,0 +1,123 @@
+# Privacy-Preserving Data Collection with Local Differential Privacy
+
+Local Differential Privacy (LDP) is a privacy-preserving technique in which data is randomized before being shared, ensuring that each individual’s information remains hidden even from data collectors. By adding controlled noise or transformations on the client side, LDP enables useful aggregate analysis without compromising user privacy.
+
+## Generalized Count Mean Sketch Protocol for Local Differential Privacy
+
+We propose the Generalized Count-Mean-Sketch (GCMS) protocol, which builds on Apple's CMS by accounting for randomness in responses and hash collisions, improving accuracy under the LDP model. GCMS reduces communication costs while maintaining privacy and provides a utility guarantee for optimal parameter design.
+
+To further enhance privacy, we use the Encryption-Shuffling-Analysis (ESA) framework. Clients' encoded outputs are encrypted and shuffled by a trusted shuffler, breaking the link between clients and responses and amplifying privacy. The server then aggregates and analyzes the shuffled data.
+
+The overview of the protocol is as follows:
+
+Preparation:
+Before data collection, the server generates $k$ independent hash functions $\mathcal{H} \overset{\Delta}{=} \{h_1,h_2,\ldots,h_k\}$, where each hash function deterministically maps any input to an integer in $[\mathbf{m}]=\{1,\ldots,m\}$, $m$ is the hashing range, and $\mathcal{H}$ is the universe of hash functions. The server then sends $\mathcal{H}$ and its public key $pk$ to each client.
+
+
+
+[Figure 1: GCMS protocol overview]
+
+
+The entire data collection pipeline is illustrated in the figure above. It consists of three distinct phases, each running on a different platform:
+1. The first phase covers all on-device algorithms: hash encoding, privatization, and encryption with the server's public key $pk$.
+2. The encrypted privatized data is then transmitted through an end-to-end encrypted channel to the shuffler, which randomly shuffles the messages and forwards them to the server.
+3. Finally, the server decrypts the input data, performs data aggregation, and obtains the Frequency Lookup Table, a sketch matrix $\mathcal{M}$.
+
+Phase 1: On-device operations.
+
+We present the process for privatizing and encrypting a single data item, which consists of the following steps.
+1. Hash Encoding: Each client first selects a hash function $h_j$ uniformly at random from $\{h_1, h_2, \ldots, h_k\}$ and calculates the hashed value $r=h_j(d)$ of their raw data $d$.
+2. Probabilistic Inclusion: Each client initializes their privatized vector $\textbf{x}$ as an empty set, then adds $r$ to $\textbf{x}$ with probability $p \in [0.5,1]$.
+3. Probabilistic Extension: Set the extension domain for each client as $[\mathbf{m}] \setminus \{r\} = \{1, 2, \ldots, r-1, r+1, \ldots, m\}$. If $r$ was added to $\textbf{x}$, uniformly select $s-1$ distinct elements from $[\mathbf{m}] \setminus \{r\}$ and append them to $\textbf{x}$; otherwise, uniformly select $s$ distinct elements from $[\mathbf{m}] \setminus \{r\}$ and append them to $\textbf{x}$. Either way, $\textbf{x}$ contains exactly $s$ elements. Return the privatized vector along with the selected hash index, $\langle \textbf{x}, j \rangle$.
+4. Encryption and Release: Each client encrypts $\langle \textbf{x}, j \rangle$ with the server's public key $pk$ to obtain $v = E_{pk}[\textbf{x}, j]$, then releases $v$ to the shuffler.
+
+Phase 2: Shuffler's anonymization and shuffling.
+
+This phase is standard: the shuffler anonymizes the received messages and shuffles them.
+
+Phase 3: Server's aggregation and frequency estimation.
+
+The server first decrypts the messages with its secret key and obtains the shuffled reports $\pi(\langle \textbf{x}_1, j_1 \rangle, \langle \textbf{x}_2, j_2 \rangle, \ldots, \langle \textbf{x}_n, j_n \rangle)$, where $\pi$ denotes the shuffler's random permutation.
+Then, the server-side algorithm constructs a sketch matrix $\mathcal{M}$ whose rows are indexed by hash functions: row $j$ is the sum of the privatized vectors of the clients who selected hash function $h_j$. To estimate the frequency of a message $d$, the server calculates all hashed values of $d$, $\{h_j(d)\}_{j=1}^k$, and then aggregates the total count from the sketch matrix $\mathcal{M}$:
+$C(d) = \sum_{j=1}^k \mathcal{M}_j[h_j(d)].$
+An unbiased estimator of the number of occurrences of $d$ among the $n$ collected messages is
+$\hat{f}(d) = \frac{C(d) - \frac{p n}{m} - q n\left(1 - \frac{1}{m}\right)}{(p - q)\left(1 - \frac{1}{m}\right)}$
+
+where $q= \frac{s-p}{m - 1}$.
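+
+To see why this estimator is unbiased, a short derivation helps. It is a sketch under two assumptions: each hash function spreads a fixed item uniformly over the $m$ buckets, and clients act independently. The symbol $f(d)$, the true count of $d$, is introduced here only for illustration. A client holding $d$ reports $h_j(d)$ with probability $p$. A client holding $d' \neq d$ collides with $d$ under $h_j$ with probability $\frac{1}{m}$ and then reports $h_j(d)$ with probability $p$; otherwise $h_j(d)$ is one of the $m-1$ non-target buckets, each of which is reported with probability $q$. Hence
+
+$\mathbb{E}[C(d)] = p\, f(d) + (n - f(d))\left(\frac{p}{m} + q\left(1 - \frac{1}{m}\right)\right),$
+
+and subtracting $\frac{p n}{m} + q n\left(1 - \frac{1}{m}\right)$ leaves $f(d)\,(p - q)\left(1 - \frac{1}{m}\right)$, so $\mathbb{E}[\hat{f}(d)] = f(d)$. The value of $q$ follows from the expected number of non-target elements in $\textbf{x}$ spread over the $m-1$ non-target buckets: $q = \frac{p(s-1) + (1-p)s}{m-1} = \frac{s-p}{m-1}$.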
+
+For more details, please refer to our [latest paper](https://arxiv.org/pdf/2412.17303).
+
+## Privacy-Preserving Data Collection from Unknown Domains
+
+Collecting data from unknown domains is a common challenge in real-world applications, such as identifying new words in text inputs, or tracking emerging URLs. Current solutions often suffer from high computational and communication costs. To address this, we propose a Quasi-Local Privacy Protocol for collecting items from unknown domains. Our protocol combines central Differential Privacy (DP) techniques with cryptographic tools within the Encryption-Shuffling-Analysis (ESA) framework. This approach provides privacy guarantees similar to Local Differential Privacy (LDP) while significantly reducing computational overhead.
+
+Our protocol uses an auxiliary server to construct histograms without accessing the original data. This allows the protocol to achieve accuracy similar to the central DP model, while still providing privacy protections akin to Local Differential Privacy (LDP). The technique leverages a stability-based histogram method integrated into the ESA framework, where the auxiliary server builds histograms from encrypted messages, ensuring privacy without tracing the original data. By eliminating the need for message segmentation and reconstruction, our protocol delivers central DP-level accuracy while maintaining LDP-like privacy guarantees.
+
+### Protocol Overview
+
+
+[Figure 2: Unknown-domain protocol overview]
+
+
+Our protocol uses the stability-based histogram technique [1] for collecting new data. To prevent direct data collection and tracing by the server, we integrate the ESA framework. Additionally, we introduce an auxiliary server to construct the histogram, similar to a central curator in the central DP model, but with a crucial distinction: the auxiliary server does not access the original data messages. Instead, it only receives the data $d$ encrypted with the server's public key, $E_{pk1}[d]$, along with its hashed value $H[d]$. This ensures the auxiliary server gains no knowledge of the original message, except for the irreversible hash.
+To construct the histogram, the auxiliary server counts the occurrences of a message $d$ by counting the corresponding hashed value $H[d]$. Each histogram bin is represented by a sampled $E_{pk1}[d]$, which can only be decrypted by the server. To enhance security, the messages passing through the shuffler between the clients and the auxiliary server are further encrypted using the auxiliary server's public key: $E_{pk2} [E_{pk1}[d] || H[d]]$.
+
+The overall framework contains four phases and is shown in Figure 2. We now describe each step in detail.
+
+Before data collection begins, the server and the auxiliary server send their public keys, $pk1$ and $pk2$, to each client. They also agree on the privacy parameters $(\epsilon,\delta)$ of the DP guarantee for the release.
+
+Phase 1: On-Device Processing
+
+1. Data Encryption with Server's Public Key: The client encrypts the data with the server's public key, and the ciphertext is denoted as $E_{pk1}[d]$.
+2. Data Hashing: The private data is then hashed by a hash function, denoted as $H$ (the same hash function is used by all clients). The hashed result is denoted as $H[d]$.
+3. Encryption with the Auxiliary Server's Public Key: Finally, the encrypted data and its hashed value are encrypted with the auxiliary server's public key. The encrypted message is denoted as $v = E_{pk2} [E_{pk1}[d] || H[d]]$.
+
+Phase 2: Shuffler's Anonymization and Shuffling
+
+1. The encrypted message is passed on to the shuffler through an end-to-end encrypted channel. The shuffler then performs anonymization and shuffling after receiving each client's input, which is standard.
+
+Phase 3: The auxiliary server applies Differential Privacy (DP) protection when releasing the item names.
+1. Decrypt Messages: The auxiliary server first decrypts the messages received from the shuffler using its secret key. It then observes the following data, in which each item is still encrypted under the server's public key: $\{E_{pk1}[d_1] || H[d_1],...,E_{pk1}[d_n] || H[d_n]\}$
+2. Hash Frequency Calculation: Since each client uses the same hash function, the hashed results for different clients with identical items must be identical. The auxiliary server calculates the frequency of each hashed result and attaches the corresponding encrypted data to it.
+3. Add DP Noise: The auxiliary server then adds Laplacian noise with scale $b$ to the frequency of each hashed result. For those hashed results with noisy frequencies above a threshold $T$, the auxiliary server randomly samples from the corresponding encrypted data and releases it to the server.
+
+Phase 4: Server Decrypts Messages
+
+1. In the final phase, the server decrypts the messages received from the auxiliary server to obtain the plaintext, which contains the item names.
+
+### Privacy and Utility Analysis
+
+Privacy Analysis: The proposed protocol ensures differential privacy. Specifically, the released encrypted item set $S$ is $(\epsilon,\delta)$-differentially private with
+$\epsilon = \max\left(\frac{1}{b}, \log\left(1 + \frac{1}{2 e^{(T-1)/b} - 1}\right)\right)$
+
+and $\delta=\frac{1}{2}\exp(\epsilon(1-T)).$
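+
+As a worked instance of these formulas, take the scale $b = \frac{1}{\epsilon}$, which is the choice made in `src/unknown_domain.py`, and assume the first term of the maximum dominates. The threshold then follows by solving the expression for $\delta$:
+
+$\delta = \frac{1}{2}\exp(\epsilon(1-T)) \iff T = 1 + \frac{1}{\epsilon}\log\frac{1}{2\delta}.$
+
+For example, $\epsilon = 1$ and $\delta = 10^{-6}$ give $b = 1$ and $T = 1 + \log(5 \times 10^{5}) \approx 14.1$. The second term of the maximum is then $\log\left(1 + \frac{1}{2e^{13.1} - 1}\right) \approx 10^{-6}$, far below $\frac{1}{b} = 1$, so the assumption holds for these values.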
+
+For a detailed analysis, please refer to our [latest paper](https://arxiv.org/pdf/2412.17303).
+
+## How to Use
+
+### Repository Structure
+
+The `src` folder contains the following files:
+- `gcms.py`: A demo illustration of General Count Mean Sketch (GCMS) Framework.
+- `gcms_client.py`: On-device LDP Algorithm in General Count Mean Sketch (GCMS) Framework.
+- `gcms_server.py`: Server in General Count Mean Sketch (GCMS) Framework.
+- `gcms_shuffler.py`: Shuffler in General Count Mean Sketch (GCMS) Framework.
+- `gcms_utils.py`: Utils in General Count Mean Sketch (GCMS) Framework.
+- `unknown_domain.py`: A demo illustration of Privacy-Preserving Data Collection with Unknown Domain.
+- `unknown_domain_client.py`: On-device algorithm in Privacy-Preserving Data Collection with Unknown Domain.
+- `unknown_domain_aux_server.py`: Auxiliary Server in Privacy-Preserving Data Collection with Unknown Domain.
+- `unknown_domain_server.py`: Server in Privacy-Preserving Data Collection with Unknown Domain.
+
+### Requirements
+- Python3
+- Install all dependencies via `python3 -m pip install -r requirements.txt`
+
+### How to simulate the two protocols
+We provide two examples that simulate the GCMS protocol and Privacy-Preserving Data Collection with Unknown Domain, respectively. Run them from the `src` folder via `python3 gcms.py` and `python3 unknown_domain.py`.
+
+## License
+
+PrivacyGo is licensed under the Apache-2.0 License, as found in the [LICENSE](LICENSE) file.
+
+## Disclaimers
+
+This software is not an officially supported product of TikTok. It is provided as-is, without any guarantees or warranties, whether express or implied.
+
+## Reference
+[1] A. Korolova, K. Kenthapadi, N. Mishra, and A. Ntoulas, “Releasing search queries and clicks privately,” in Proceedings of the 18th international conference on World wide web, 2009, pp. 171–180.
diff --git a/ldp/images/LDP_CMS.png b/ldp/images/LDP_CMS.png
new file mode 100644
index 0000000..6e1c1ea
Binary files /dev/null and b/ldp/images/LDP_CMS.png differ
diff --git a/ldp/images/unknown_domain.png b/ldp/images/unknown_domain.png
new file mode 100644
index 0000000..a8b5e40
Binary files /dev/null and b/ldp/images/unknown_domain.png differ
diff --git a/ldp/requirements.txt b/ldp/requirements.txt
new file mode 100644
index 0000000..d2e5ad6
--- /dev/null
+++ b/ldp/requirements.txt
@@ -0,0 +1,3 @@
+numpy
+pandas
+cryptography
diff --git a/ldp/src/gcms.py b/ldp/src/gcms.py
new file mode 100644
index 0000000..c1fb680
--- /dev/null
+++ b/ldp/src/gcms.py
@@ -0,0 +1,83 @@
+# Copyright 2023 TikTok Pte. Ltd.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+'''A demo illustration of General Count Mean Sketch (GCMS) Framework.'''
+
+from cryptography.hazmat.primitives.asymmetric import rsa
+
+from typing import List
+from gcms_utils import Paremeters
+from gcms_client import GCMSClient
+from gcms_shuffler import GCMSShuffler
+from gcms_server import GCMSServer
+
+import math
+
+
+def gcms_demo(data: List[str], bench_nums: int, estimate_message: str):
+    '''
+    A demo illustration of General Count Mean Sketch (GCMS) Framework.
+
+    Args:
+        data: The raw data to be privatized and encrypted.
+        bench_nums: The number of benchmark runs.
+        estimate_message: The message whose frequency is estimated.
+    '''
+    #0. Prepare the parameters for the algorithm.
+    ocms_parms = Paremeters(k=1000, m=1024, s=56, p=0.5)
+    epsilon = math.log((ocms_parms.m - ocms_parms.s) * ocms_parms.p / ((1 - ocms_parms.p) * ocms_parms.s))
+    print("epsilon is ", epsilon)
+
+    server_private_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
+    server_public_key = server_private_key.public_key()
+
+    estimate_frequencies = []
+    estimate_frequencies_debug = []
+    for i in range(bench_nums):
+        #1. Clients perform on-device LDP operations on the raw data.
+        encrypted_messages, hash_indexs_debug, plaintext_messages_debug = GCMSClient().on_device_ldp_algorithm(
+            ocms_parms.k, ocms_parms.m, data, ocms_parms.s, ocms_parms.p, server_public_key)
+
+        #2. Shuffler shuffles the encrypted messages.
+        shuffled_message = GCMSShuffler.shuffle(encrypted_messages)
+
+        #3.0 Init Server
+        server = GCMSServer(ocms_parms.k, ocms_parms.m, server_private_key)
+
+        #3.1 Server decrypts the shuffled messages.
+        plaintext_messages, hash_indexs = server.decrypt_message(shuffled_message)
+
+        #3.2 Server constructs the sketch matrix.
+        server.construct_sketch_matrix(plaintext_messages, hash_indexs)
+        # for debug
+        # print(sorted(hash_indexs_debug) == sorted(hash_indexs))
+        # print(sorted(plaintext_messages_debug) == sorted(plaintext_messages))
+
+        #3.3 Server estimates the frequency of the specific message.
+        estimate_frequency_i = server.estimate_frequency(estimate_message, ocms_parms.p, ocms_parms.s)
+        estimate_frequencies.append(estimate_frequency_i)
+
+        server_debug = GCMSServer(ocms_parms.k, ocms_parms.m, server_private_key)
+        server_debug.construct_sketch_matrix(plaintext_messages_debug, hash_indexs_debug)
+        estimate_frequency_i_debug = server_debug.estimate_frequency(estimate_message, ocms_parms.p, ocms_parms.s)
+        estimate_frequencies_debug.append(estimate_frequency_i_debug)
+
+    print(f"average estimated frequency of {estimate_message}:", sum(estimate_frequencies) / bench_nums)
+    print(f"average debug estimated frequency of {estimate_message}:", sum(estimate_frequencies_debug) / bench_nums)
+
+
+if __name__ == "__main__":
+    data = ['123' for i in range(100)]
+    data.extend(['456' for i in range(50)])
+    data.extend(['789' for i in range(25)])
+    gcms_demo(data, 10, '123')
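The `epsilon` printed by `gcms_demo` appears to be the log of the worst-case likelihood ratio of the on-device mechanism. The sketch below spells that out under this reading: an output containing the true bucket has probability `p / C(m-1, s-1)`, one without it has probability `(1-p) / C(m-1, s)`, and the ratio simplifies to `p(m-s) / ((1-p)s)`. The helper names `gcms_epsilon` and `binomial_ratio` are hypothetical and not part of the patch.

```python
import math
from fractions import Fraction


def gcms_epsilon(m: int, s: int, p: float) -> float:
    """Epsilon implied by hashing range m, output size s and inclusion probability p."""
    if not (0 < s < m) or not (0.0 < p < 1.0):
        raise ValueError("require 0 < s < m and 0 < p < 1")
    # Same expression as in gcms_demo: log(p * (m - s) / ((1 - p) * s)).
    return math.log(p * (m - s) / ((1 - p) * s))


def binomial_ratio(m: int, s: int) -> Fraction:
    """Exact value of C(m-1, s) / C(m-1, s-1), which equals (m - s) / s."""
    return Fraction(math.comb(m - 1, s), math.comb(m - 1, s - 1))


if __name__ == "__main__":
    # Parameters used by the demo: m=1024, s=56, p=0.5.
    print(binomial_ratio(1024, 56))      # exact ratio of the two binomials
    print(gcms_epsilon(1024, 56, 0.5))   # log of that ratio when p = 0.5
```

With `p = 0.5` the factor `p / (1 - p)` is 1, so epsilon is driven entirely by `(m - s) / s`; raising `s` lowers epsilon at the cost of a larger message.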
diff --git a/ldp/src/gcms_client.py b/ldp/src/gcms_client.py
new file mode 100644
index 0000000..20b0a6c
--- /dev/null
+++ b/ldp/src/gcms_client.py
@@ -0,0 +1,93 @@
+# Copyright 2023 TikTok Pte. Ltd.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""On-device LDP Algorithm in General Count Mean Sketch (GCMS) Framework."""
+
+from gcms_utils import hash_encode
+from gcms_utils import bernoulli_sample
+from gcms_utils import serialize_integers_to_bytes
+
+from typing import List, Tuple
+import pandas as pd
+import secrets
+
+from cryptography.hazmat.primitives.asymmetric import padding
+from cryptography.hazmat.primitives.asymmetric import rsa
+from cryptography.hazmat.primitives import hashes
+
+
+class GCMSClient:
+    '''Client is responsible for privatizing and encrypting the raw messages using the on-device LDP algorithm.
+    '''
+
+    @staticmethod
+    def on_device_ldp_algorithm(k: int, m: int, d: List[str], s: int, p: float,
+                                pk: rsa.RSAPublicKey) -> Tuple[List[bytes], List[int], List[List[int]]]:
+        """On device LDP Algorithm in GCMS.
+        The detailed process for data privatization and encryption.
+
+        Args:
+            k: the number of hash functions.
+            m: the modulus of the hash encode function.
+            d: the raw messages.
+            s: the size of the result messages.
+            p: the inclusion probability.
+            pk: the public key of the server.
+
+        Returns:
+            A tuple of the encrypted privatized messages, the hash indexes and the plaintext messages (both debug only).
+        """
+
+        hash_indexs = []  # for debug only
+        plaintext_messages = []  # for debug only
+        encrypted_messages = []
+
+        for raw_message in d:
+            if not (pd.isna(raw_message)):
+                # Randomly select a hash function index in [0, k).
+                random_index = secrets.randbelow(k)
+                hash_indexs.append(random_index)
+                # Calculate the hashed value r.
+                hash_result_r = hash_encode(raw_message, random_index, m)
+
+                # Initiate output vector x as an empty set.
+                message_x = []
+
+                # Add r to x with probability of p.
+                if bernoulli_sample(p):
+                    # Add r, then randomly select s - 1 distinct elements from [m] \ {r}.
+                    message_x.append(hash_result_r)
+                    while len(message_x) < s:
+                        random_element = secrets.randbelow(m)
+                        if random_element != hash_result_r and random_element not in message_x:
+                            message_x.append(random_element)
+                else:
+                    # Randomly select s distinct elements from [m] \ {r}.
+                    while len(message_x) < s:
+                        random_element = secrets.randbelow(m)
+                        if random_element != hash_result_r and random_element not in message_x:
+                            message_x.append(random_element)
+                plaintext_messages.append(message_x)  # for debug only
+
+                # Encrypt with the server's public key.
+                m_bytes_length = (m.bit_length() + 7) // 8
+                k_bytes_length = (k.bit_length() + 7) // 8
+
+                plaintext_x = serialize_integers_to_bytes(message_x, m_bytes_length)
+                plaintext_x.extend(random_index.to_bytes(k_bytes_length, byteorder='big'))
+
+                ciphertext = pk.encrypt(
+                    bytes(plaintext_x),
+                    padding.OAEP(mgf=padding.MGF1(algorithm=hashes.SHA256()), algorithm=hashes.SHA256(), label=None))
+                encrypted_messages.append(ciphertext)
+        return encrypted_messages, hash_indexs, plaintext_messages
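The two rejection loops in `on_device_ldp_algorithm` draw distinct buckets other than `r`. A possible alternative, shown here only as a sketch and not as the patch's method, is to sample without replacement from the extension domain in one call. The function name `privatize` is hypothetical; buckets are 0-indexed as in the code above, and `secrets.SystemRandom` is assumed to be an acceptable randomness source.

```python
import secrets
from typing import List

_rng = secrets.SystemRandom()  # OS-backed randomness, same family as secrets.randbelow


def privatize(r: int, m: int, s: int, p: float) -> List[int]:
    """Return s distinct buckets from range(m); r is included with probability p."""
    if not (0 <= r < m) or not (0 < s < m):
        raise ValueError("require 0 <= r < m and 0 < s < m")
    # Extension domain [m] \ {r}: every bucket except the true one.
    others = [v for v in range(m) if v != r]
    if _rng.random() < p:
        # r is reported, padded with s - 1 distinct decoys.
        return [r] + _rng.sample(others, s - 1)
    # r is withheld; all s reported buckets are decoys.
    return _rng.sample(others, s)


if __name__ == "__main__":
    x = privatize(r=7, m=1024, s=56, p=0.5)
    print(len(x), 7 in x)
```

Sampling without replacement gives the same output distribution as the rejection loops while avoiding the repeated `not in message_x` scans; whether that matters depends on how large `s` is relative to `m`.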
diff --git a/ldp/src/gcms_server.py b/ldp/src/gcms_server.py
new file mode 100644
index 0000000..6a2622d
--- /dev/null
+++ b/ldp/src/gcms_server.py
@@ -0,0 +1,106 @@
+# Copyright 2023 TikTok Pte. Ltd.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+'''Server in General Count Mean Sketch (GCMS) Framework.'''
+
+from cryptography.hazmat.primitives.asymmetric import padding
+from cryptography.hazmat.primitives.asymmetric import rsa
+from cryptography.hazmat.primitives import hashes
+
+from gcms_utils import deserialize_integers_from_bytes
+from gcms_utils import hash_encode
+
+import numpy as np
+
+from typing import List, Tuple
+
+
+class GCMSServer:
+    '''Server is responsible for aggregation and frequency estimation.
+
+    Attributes:
+        k: the number of hash functions.
+        m: the modulus of the hash encode function.
+        sk: the private key of the server.
+        matrix: the sketch matrix.
+        m_bytes_length: the length of m in bytes.
+        k_bytes_length: the length of k in bytes.
+        n: the number of messages.
+    '''
+
+    def __init__(self, k: int, m: int, sk: rsa.RSAPrivateKey) -> None:
+        self.k = k
+        self.m = m
+        self.sk = sk
+        self.matrix = np.zeros((k, m))
+        self.m_bytes_length = (m.bit_length() + 7) // 8
+        self.k_bytes_length = (k.bit_length() + 7) // 8
+        self.n = 0
+
+    def decrypt_message(self, encrypted_messages: List[bytes]) -> Tuple[List[List[int]], List[int]]:
+        '''Decrypt encrypted messages with the server's secret key.
+
+        Args:
+            encrypted_messages: A list of encrypted messages.
+
+        Returns:
+            A tuple of the plaintext messages and the hash indexes.
+        '''
+        hash_indexs = []
+        plaintext_messages = []
+        for message in encrypted_messages:
+            decrypted_message = self.sk.decrypt(
+                message, padding.OAEP(mgf=padding.MGF1(algorithm=hashes.SHA256()),
+                                      algorithm=hashes.SHA256(),
+                                      label=None))
+            hash_index = int.from_bytes(decrypted_message[-self.k_bytes_length:], byteorder='big')
+            plaintext_message = deserialize_integers_from_bytes(decrypted_message[:-self.k_bytes_length],
+                                                                self.m_bytes_length)
+            hash_indexs.append(hash_index)
+            plaintext_messages.append(plaintext_message)
+
+        return plaintext_messages, hash_indexs
+
+    def construct_sketch_matrix(self, plaintext_messages: List[List[int]], hash_indexs: List[int]) -> None:
+        '''Construct the sketch matrix.
+
+        Args:
+            plaintext_messages: A list of plaintext messages.
+            hash_indexs: A list of hash indexes.
+        '''
+        self.n += len(plaintext_messages)
+        for i in range(len(plaintext_messages)):
+            for j in plaintext_messages[i]:
+                self.matrix[hash_indexs[i]][j] += 1
+
+    def estimate_frequency(self, message: str, p: float, s: int) -> float:
+        '''Estimate the frequency of the given message.
+
+        Args:
+            message: The message to be estimated.
+            p: The inclusion probability.
+            s: The size of the result messages.
+
+        Returns:
+            The estimated frequency of the given message.
+        '''
+        hash_k = []
+        total_count = 0
+        for i in range(self.k):
+            hash_result_i = hash_encode(message, i, self.m)
+            total_count += self.matrix[i][hash_result_i]
+            hash_k.append(hash_result_i)
+        q = (p * (s - 1) + (1 - p) * s) / (self.m - 1)
+        estimated_frequency = (total_count - (p * self.n / self.m) - (q * self.n *
+                                                                      (1 - 1 / self.m))) / ((p - q) * (1 - 1 / self.m))
+        return estimated_frequency
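The estimator in `estimate_frequency` can be checked without any encryption. The sketch below simulates clients and the sketch matrix in plain Python and averages the estimate over repeated runs. It mirrors `hash_encode` with SHA-256, but the function names (`bucket`, `estimate_once`, `average_estimate`), the seeded non-cryptographic `random.Random`, and the choice `k=64` are assumptions made for illustration only.

```python
import hashlib
import random
from typing import List


def bucket(d: str, j: int, m: int) -> int:
    """Bucket of item d under hash function j (same construction as hash_encode)."""
    digest = hashlib.sha256(f"{d}$$${j}".encode("utf-8")).hexdigest()
    return int(digest, 16) % m


def privatize(r: int, m: int, s: int, p: float, rng: random.Random) -> List[int]:
    """s distinct buckets; r is included with probability p."""
    include = rng.random() < p
    picks = rng.sample(range(m - 1), s - 1 if include else s)
    others = [v if v < r else v + 1 for v in picks]  # shift indices to skip bucket r
    return ([r] if include else []) + others


def estimate_once(data: List[str], target: str, k: int, m: int, s: int, p: float,
                  rng: random.Random) -> float:
    """One full collection round followed by the debiased count for target."""
    matrix = [[0] * m for _ in range(k)]
    for d in data:
        j = rng.randrange(k)
        for v in privatize(bucket(d, j, m), m, s, p, rng):
            matrix[j][v] += 1
    n = len(data)
    total = sum(matrix[j][bucket(target, j, m)] for j in range(k))
    q = (s - p) / (m - 1)
    return (total - p * n / m - q * n * (1 - 1 / m)) / ((p - q) * (1 - 1 / m))


def average_estimate(data: List[str], target: str, trials: int = 100, seed: int = 0,
                     k: int = 64, m: int = 1024, s: int = 56, p: float = 0.5) -> float:
    rng = random.Random(seed)  # seeded for reproducibility; not for production use
    return sum(estimate_once(data, target, k, m, s, p, rng) for _ in range(trials)) / trials


if __name__ == "__main__":
    data = ["123"] * 100 + ["456"] * 50 + ["789"] * 25
    print(average_estimate(data, "123"))  # should land near the true count of 100
```

A single run is noisy because each holder of the target reports it only with probability `p`; averaging over runs, as `gcms_demo` does with `bench_nums`, is what brings the value close to the true count.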
diff --git a/ldp/src/gcms_shuffler.py b/ldp/src/gcms_shuffler.py
new file mode 100644
index 0000000..f93e648
--- /dev/null
+++ b/ldp/src/gcms_shuffler.py
@@ -0,0 +1,39 @@
+# Copyright 2023 TikTok Pte. Ltd.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""Shuffler in General Count Mean Sketch (GCMS) Framework."""
+
+from typing import Any, List
+import secrets
+
+
+class GCMSShuffler:
+    """Shuffler is responsible for shuffling and anonymizing the input list from clients."""
+
+    @staticmethod
+    def shuffle(data: List[Any]) -> List[Any]:
+        """Shuffle the input list.
+
+        Args:
+            data: The input list to be shuffled.
+
+        Returns:
+            The shuffled list.
+        """
+        data_copy = data[:]
+
+        for i in range(len(data_copy) - 1, 0, -1):
+            j = secrets.randbelow(i + 1)
+            data_copy[i], data_copy[j] = data_copy[j], data_copy[i]
+
+        return data_copy
diff --git a/ldp/src/gcms_utils.py b/ldp/src/gcms_utils.py
new file mode 100644
index 0000000..66433d7
--- /dev/null
+++ b/ldp/src/gcms_utils.py
@@ -0,0 +1,113 @@
+# Copyright 2023 TikTok Pte. Ltd.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""Utils in General Count Mean Sketch (GCMS) Framework."""
+
+from typing import List
+from hashlib import sha256
+import secrets
+import numpy as np
+
+
+def hash_encode(message: str, index: int, modulus: int) -> int:
+    """Hash the message with a hash function index, and reduce it with a modulus.
+
+    Args:
+        message: A message to be encoded.
+        index: An index of the hash function.
+        modulus: The modulus of the encoded result.
+
+    Returns:
+        A hash encoded message.
+    """
+    sha_input = message + "$$$" + str(index)
+    return int(sha256(sha_input.encode('utf-8')).hexdigest(), 16) % modulus
+
+
+def bernoulli_sample(p: float) -> int:
+    """Generate a sample value of 0 or 1 with probability p.
+
+    Args:
+        p: The probability of generating 1.
+
+    Returns:
+        0 or 1
+    """
+    return 1 if secrets.randbelow(1000000) < p * 1000000 else 0
+
+
+def serialize_integers_to_bytes(integers: List[int], fixed_length: int) -> bytearray:
+    '''Serialize a list of integers to a bytearray in big endian.
+
+    Args:
+        integers: A list of integers.
+        fixed_length: The fixed length of each integer.
+
+    Returns:
+        A bytearray of the serialized integers.
+    '''
+    serialized_data = bytearray()
+    for num in integers:
+        padded_bytes = num.to_bytes(fixed_length, byteorder='big')
+        serialized_data.extend(padded_bytes)
+    return serialized_data
+
+
+def deserialize_integers_from_bytes(serialized_data: bytearray, fixed_length: int) -> List[int]:
+    '''Deserialize a list of integers from a bytearray in big endian.
+
+    Args:
+        serialized_data: A bytearray of the serialized integers.
+        fixed_length: The fixed length of each integer.
+
+    Returns:
+        A list of integers.
+    '''
+    integers = []
+    num_integers = len(serialized_data) // fixed_length
+    for i in range(num_integers):
+        start_idx = i * fixed_length
+        end_idx = start_idx + fixed_length
+        number = int.from_bytes(serialized_data[start_idx:end_idx], byteorder='big')
+        integers.append(number)
+    return integers
+
+
+class Paremeters:
+    '''
+    Attributes:
+        k: the number of hash functions.
+        m: the modulus of the hash encode function.
+        s: the size of the result messages.
+        p: the inclusion probability.
+    '''
+
+    def __init__(self, k: int, m: int, s: int, p: float) -> None:
+        self.k = k
+        self.m = m
+        self.s = s
+        self.p = p
+
+
+def generate_laplace_noise(loc: float, scale: float) -> float:
+    '''Generate Laplace noise with the given location and scale.
+    This is only for experimental use, please do not use in production environment.
+
+    Args:
+        loc: The location of the Laplace noise.
+        scale: The scale of the Laplace noise.
+
+    Returns:
+        A Laplace noise sample.
+    '''
+    return np.random.laplace(loc=loc, scale=scale)
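Because the client serializes `s` fixed-width integers plus the hash index into a single RSA-OAEP plaintext, the parameters are bounded by the key size. The sketch below estimates that bound using the standard OAEP limit of `key_bytes - 2 * hash_bytes - 2`; the helper names are hypothetical, and SHA-256 (32 bytes) is assumed because that is what the client passes to `padding.OAEP`.

```python
def payload_bytes(k: int, m: int, s: int) -> int:
    """Size of the serialized <x, j> message produced by the client."""
    m_len = (m.bit_length() + 7) // 8  # bytes per bucket value, as in gcms_client.py
    k_len = (k.bit_length() + 7) // 8  # bytes for the hash index
    return s * m_len + k_len


def oaep_capacity(key_bits: int, hash_bytes: int = 32) -> int:
    """Largest plaintext, in bytes, that RSA-OAEP can encrypt with this key and hash."""
    return key_bits // 8 - 2 * hash_bytes - 2


def max_output_size(k: int, m: int, key_bits: int) -> int:
    """Largest s whose serialized message still fits in one OAEP block."""
    m_len = (m.bit_length() + 7) // 8
    k_len = (k.bit_length() + 7) // 8
    return (oaep_capacity(key_bits) - k_len) // m_len


if __name__ == "__main__":
    # Demo parameters: k=1000, m=1024, s=56 with a 2048-bit key.
    print(payload_bytes(1000, 1024, 56), oaep_capacity(2048), max_output_size(1000, 1024, 2048))
```

Under these assumptions the demo's 114-byte message fits comfortably in the 190 bytes a 2048-bit key allows, while an `s` above 94 would make `pk.encrypt` fail and would call for a larger key or hybrid encryption.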
diff --git a/ldp/src/unknown_domain.py b/ldp/src/unknown_domain.py
new file mode 100644
index 0000000..828750f
--- /dev/null
+++ b/ldp/src/unknown_domain.py
@@ -0,0 +1,73 @@
+# Copyright 2023 TikTok Pte. Ltd.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+'''A demo illustration of Privacy-Preserving Data Collection with Unknown Domain.'''
+
+from cryptography.hazmat.primitives.asymmetric import rsa
+
+from typing import List
+from gcms_shuffler import GCMSShuffler
+from unknown_domain_client import UnknownDomainClient
+from unknown_domain_aux_server import UnknownDomainAuxServer
+from unknown_domain_server import UnknownDomainServer
+
+import math
+
+
+def unknown_domain_demo(data: List[str], bench_nums: int, delta: float, epsilon: float):
+    '''
+    A demo illustration of Privacy-Preserving Data Collection with Unknown Domain.
+
+    Args:
+        data: The raw data to be privatized and encrypted.
+        bench_nums: The number of benchmark iterations to run.
+        delta, epsilon: The (epsilon, delta)-DP parameters; they set the release threshold and the Laplace scale.
+    '''
+    # 0. Prepare the parameters for the algorithm.
+    T = 1 + 1 / epsilon * math.log(1 / (2 * delta))
+    scale = 1 / epsilon
+
+    server_private_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
+    server_public_key = server_private_key.public_key()
+    # The aux key is larger so that the 256-byte inner ciphertext plus the 32-byte hash fits in one OAEP block.
+    aux_server_private_key = rsa.generate_private_key(public_exponent=65537, key_size=3072)
+    aux_server_public_key = aux_server_private_key.public_key()
+
+    for _ in range(bench_nums):
+        # 1. Clients perform on-device LDP operations on the raw data.
+        encrypted_messages = UnknownDomainClient.on_device_algorithm(data, server_public_key, aux_server_public_key)
+
+        # 2. The shuffler shuffles the encrypted messages.
+        shuffled_message = GCMSShuffler.shuffle(encrypted_messages)
+
+        # 3.0 Initialize the auxiliary server.
+        aux_server = UnknownDomainAuxServer(aux_server_private_key)
+
+        # 3.1 The auxiliary server applies DP protection.
+        encrypted_messages_with_server_pk = aux_server.dp_protection(shuffled_message, T, scale)
+
+        # 4.0 Initialize the server.
+        server = UnknownDomainServer(server_private_key)
+
+        # 4.1 The server decrypts the released messages.
+        result = server.decrypt_message(encrypted_messages_with_server_pk)
+        print("result", result)
+
+
+if __name__ == "__main__":
+    data = ['123'] * 100
+    data.extend(['456'] * 50)
+    data.extend(['789'] * 25)
+    deltas = [1 / (10 * len(data)), 1 / (100 * len(data))]
+    epsilon_list = [x / 10 for x in range(1, 100)]
+    unknown_domain_demo(data, bench_nums=1, delta=deltas[0], epsilon=epsilon_list[0])
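Review note on the parameter setup in `unknown_domain_demo`: the threshold `T = 1 + (1/epsilon) * ln(1/(2*delta))` and the scale `1/epsilon` can be checked in isolation. The sketch below is not part of the patch; the helper names `release_threshold` and `singleton_release_probability` are hypothetical. It relies on the standard Laplace tail bound, and the reading that the threshold is chosen so that an item reported by a single client is released with probability `delta` is an interpretation, not something the patch states.

```python
import math


def release_threshold(epsilon: float, delta: float) -> float:
    """Threshold T = 1 + (1/epsilon) * ln(1/(2*delta)), as computed in the demo."""
    if epsilon <= 0 or not 0 < delta < 0.5:
        raise ValueError("need epsilon > 0 and 0 < delta < 0.5")
    return 1 + math.log(1 / (2 * delta)) / epsilon


def singleton_release_probability(epsilon: float, delta: float) -> float:
    """P(1 + Laplace(0, 1/epsilon) >= T), via the Laplace tail 0.5 * exp(-x / b) for x >= 0."""
    scale = 1 / epsilon
    gap = release_threshold(epsilon, delta) - 1  # distance from a count of 1 to T
    return 0.5 * math.exp(-gap / scale)


if __name__ == "__main__":
    n = 175  # size of the demo data set: 100 + 50 + 25
    epsilon, delta = 0.1, 1 / (10 * n)
    print("threshold:", release_threshold(epsilon, delta))
    print("singleton release probability:", singleton_release_probability(epsilon, delta))
    print("delta:", delta)
```

With the demo's first parameter pair (`epsilon = 0.1`, `delta = 1/1750`) the threshold lands between 68 and 69, so of the three demo items only `'123'` (count 100) is expected to pass reliably.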
diff --git a/ldp/src/unknown_domain_aux_server.py b/ldp/src/unknown_domain_aux_server.py
new file mode 100644
index 0000000..a247a92
--- /dev/null
+++ b/ldp/src/unknown_domain_aux_server.py
@@ -0,0 +1,66 @@
+# Copyright 2023 TikTok Pte. Ltd.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+'''Auxiliary Server in Privacy-Preserving Data Collection with Unknown Domain.'''
+
+from cryptography.hazmat.primitives.asymmetric import padding
+from cryptography.hazmat.primitives.asymmetric import rsa
+from cryptography.hazmat.primitives import hashes
+
+from gcms_utils import generate_laplace_noise
+
+import secrets
+
+from typing import List
+
+
+class UnknownDomainAuxServer:
+    '''Auxiliary server responsible for providing DP protection when releasing the item names.
+
+    Attributes:
+        sk: the secret key of the auxiliary server.
+        hash_len: the length in bytes of the hash value appended to each message.
+    '''
+
+    def __init__(self, sk: rsa.RSAPrivateKey, hash_len: int = 32) -> None:
+        self.sk = sk
+        self.hash_len = hash_len
+
+    def dp_protection(self, encrypted_messages: List[bytes], t: float, b: float) -> List[bytes]:
+        '''Release one ciphertext per item whose noisy count, len + Laplace(0, b), reaches the threshold t.
+
+        Args:
+            encrypted_messages: A list of shuffled messages encrypted under the auxiliary server's public key.
+
+        Returns:
+            A list of released messages, still encrypted under the server's public key.
+        '''
+        encrypted_messages_with_server_pk = []
+        key_value_dict = {}
+        for message in encrypted_messages:
+            decrypted_message = self.sk.decrypt(
+                message, padding.OAEP(mgf=padding.MGF1(algorithm=hashes.SHA256()),
+                                      algorithm=hashes.SHA256(),
+                                      label=None))
+            key = (decrypted_message[-self.hash_len:]).hex()
+            if key in key_value_dict:
+                key_value_dict[key].append(decrypted_message[:-self.hash_len])
+            else:
+                key_value_dict[key] = [decrypted_message[:-self.hash_len]]
+
+        for key, value in key_value_dict.items():
+            if len(value) + generate_laplace_noise(0, b) >= t:
+                idx = secrets.randbelow(len(value))
+                encrypted_messages_with_server_pk.append(value[idx])
+
+        return encrypted_messages_with_server_pk
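Review note on `dp_protection`: stripped of the RSA layer, the method groups payloads by their trailing hash, adds Laplace noise to each group size, and releases one randomly chosen payload per group that reaches the threshold. The standalone sketch below illustrates only that logic. It is not the patch's implementation: `laplace_noise` is a hypothetical stand-in for `generate_laplace_noise` from `gcms_utils` (whose internals are not shown here), and it uses the `random` module for reproducibility, whereas the patch uses `secrets` for the index choice.

```python
import random
from collections import defaultdict
from typing import Dict, List, Optional, Tuple


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) as the difference of two exponentials with mean `scale`."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def noisy_threshold_release(items: List[Tuple[str, bytes]], threshold: float, scale: float,
                            rng: Optional[random.Random] = None) -> List[bytes]:
    """Release one payload per key whose noisy count reaches the threshold.

    `items` holds (key, payload) pairs; in the patch the key is the hex hash
    and the payload is the inner ciphertext.
    """
    rng = rng or random.Random()  # not cryptographically secure; illustration only
    groups: Dict[str, List[bytes]] = defaultdict(list)
    for key, payload in items:
        groups[key].append(payload)

    released = []
    for payloads in groups.values():
        if len(payloads) + laplace_noise(scale, rng) >= threshold:
            released.append(rng.choice(payloads))
    return released


if __name__ == "__main__":
    demo = [("h1", b"A")] * 100 + [("h2", b"B")] * 50 + [("h3", b"C")] * 25
    print(noisy_threshold_release(demo, threshold=68.7, scale=10.0))
```

Only one payload per item is forwarded, so the server learns which items are frequent but not their counts.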
diff --git a/ldp/src/unknown_domain_client.py b/ldp/src/unknown_domain_client.py
new file mode 100644
index 0000000..074da00
--- /dev/null
+++ b/ldp/src/unknown_domain_client.py
@@ -0,0 +1,51 @@
+# Copyright 2023 TikTok Pte. Ltd.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+'''On-device algorithm in Privacy-Preserving Data Collection with Unknown Domain.'''
+
+from typing import List
+from hashlib import sha256
+
+from cryptography.hazmat.primitives.asymmetric import padding
+from cryptography.hazmat.primitives.asymmetric import rsa
+from cryptography.hazmat.primitives import hashes
+
+
+class UnknownDomainClient:
+    '''Client is responsible for encrypting the raw messages using the on-device LDP algorithm.
+    '''
+
+    @staticmethod
+    def on_device_algorithm(d: List[str], pk1: rsa.RSAPublicKey, pk2: rsa.RSAPublicKey) -> List[bytes]:
+        '''On-device LDP algorithm for collecting strings from an unknown domain.
+
+        Args:
+            d: the raw messages.
+            pk1: the public key of the server (inner encryption layer).
+            pk2: the public key of the auxiliary server (outer encryption layer).
+
+        Returns:
+            A list of doubly encrypted messages, one per raw message.
+        '''
+        encrypted_messages = []
+        for message in d:
+            hash_message = sha256(message.encode('utf-8')).digest()
+
+            ciphertext_with_pk1 = pk1.encrypt(
+                message.encode('utf-8'),
+                padding.OAEP(mgf=padding.MGF1(algorithm=hashes.SHA256()), algorithm=hashes.SHA256(), label=None))
+            ciphertext_with_pk2 = pk2.encrypt(
+                ciphertext_with_pk1 + hash_message,
+                padding.OAEP(mgf=padding.MGF1(algorithm=hashes.SHA256()), algorithm=hashes.SHA256(), label=None))
+            encrypted_messages.append(ciphertext_with_pk2)
+        return encrypted_messages
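Review note on the double encryption in `on_device_algorithm`: the outer RSA-OAEP layer must carry the whole inner ciphertext plus the SHA-256 digest in a single block, which constrains the two key sizes. The arithmetic sketch below uses the RSAES-OAEP length limit from PKCS #1 (`k - 2*hLen - 2` bytes for a `k`-byte modulus); the helper name `oaep_max_plaintext_len` and the constants are hypothetical and simply mirror the key sizes chosen in the demo.

```python
def oaep_max_plaintext_len(key_bits: int, hash_len: int = 32) -> int:
    """Largest plaintext, in bytes, that RSA-OAEP can encrypt in one block."""
    return key_bits // 8 - 2 * hash_len - 2


SERVER_KEY_BITS = 2048  # inner layer (pk1), as in the demo
AUX_KEY_BITS = 3072     # outer layer (pk2), as in the demo
HASH_LEN = 32           # SHA-256 digest length in bytes

# An RSA ciphertext is as long as the modulus, so the inner layer yields 256 bytes.
inner_ciphertext_len = SERVER_KEY_BITS // 8
# The client encrypts inner ciphertext + digest under pk2.
outer_plaintext_len = inner_ciphertext_len + HASH_LEN

if __name__ == "__main__":
    print("max raw message bytes under pk1:", oaep_max_plaintext_len(SERVER_KEY_BITS))
    print("outer plaintext bytes:", outer_plaintext_len)
    print("fits a 3072-bit aux key:", outer_plaintext_len <= oaep_max_plaintext_len(AUX_KEY_BITS))
    print("fits a 2048-bit aux key:", outer_plaintext_len <= oaep_max_plaintext_len(SERVER_KEY_BITS))
```

Under these assumptions a raw message is limited to 190 UTF-8 bytes, and an auxiliary key of the same size as the server key would be too small for the outer layer.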
diff --git a/ldp/src/unknown_domain_server.py b/ldp/src/unknown_domain_server.py
new file mode 100644
index 0000000..9db9b55
--- /dev/null
+++ b/ldp/src/unknown_domain_server.py
@@ -0,0 +1,50 @@
+# Copyright 2023 TikTok Pte. Ltd.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+'''Server in Privacy-Preserving Data Collection with Unknown Domain.'''
+
+from cryptography.hazmat.primitives.asymmetric import padding
+from cryptography.hazmat.primitives.asymmetric import rsa
+from cryptography.hazmat.primitives import hashes
+
+from typing import List
+
+
+class UnknownDomainServer:
+    '''Server is responsible for decrypting messages received from the auxiliary server.
+
+    Attributes:
+        sk: the secret key of the server, matching the public key
+            that clients use for the inner encryption layer.
+    '''
+
+    def __init__(self, sk: rsa.RSAPrivateKey) -> None:
+        self.sk = sk
+
+    def decrypt_message(self, encrypted_messages: List[bytes]) -> List[str]:
+        '''Decrypt the encrypted messages received from the auxiliary server.
+
+        Args:
+            encrypted_messages: A list of messages encrypted under the server's public key.
+
+        Returns:
+            A list of plaintext messages.
+        '''
+        result = []
+        for message in encrypted_messages:
+            decrypted_message = self.sk.decrypt(
+                message, padding.OAEP(mgf=padding.MGF1(algorithm=hashes.SHA256()),
+                                      algorithm=hashes.SHA256(),
+                                      label=None))
+            result.append(decrypted_message.decode('utf-8'))
+        return result