Open Access
Research Article
Issue
Security and Safety
Volume 1, 2022
Article Number 2022008
Number of page(s) 18
Section Social Governance
DOI https://doi.org/10.1051/sands/2022008
Published online 25 July 2022

© The Author(s) 2022. Published by EDP Sciences and China Science Publishing & Media Ltd.

Licence: Creative Commons. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

1. Introduction

Recent years have witnessed the rapid development of data mining techniques [1], which have become popular in a wide variety of fields, from disease prevention [2, 3] and credit evaluation [4, 5] to marketing analysis [6, 7] and anomaly detection [8, 9]. To provide accurate services and make effective decisions, both public and private organizations are committed to collecting and analyzing individual data [10]. For example, Walmart collected the shopping-basket information of customers and found that beer was the commodity most commonly bought together with diapers [11]. However, the intentional or accidental disclosure of private information [12], such as personal credit information, online transaction history, or medical records, has raised concerns about personal privacy and has even caused widespread social panic and significant economic losses. Netflix released a dataset of movie ratings from 500 000 subscribers, intended for researchers of recommender systems. Although the subscribers were anonymous, Narayanan et al. [13] were still able to link some records in the Netflix database to known individuals by using the Internet Movie Database (IMDb) as background knowledge. When the Facebook database was hacked, private information from more than 50 million users was leaked, leading to both huge financial losses and severe administrative litigation for the company. Hence, it is pressing to design more effective privacy protection methods that meet both legal requirements and user expectations.

As a matter of fact, a variety of work has been proposed for privacy-preserving issues. These methods govern the explicit use of privacy attributes and can be roughly divided as follows: (1) Data distortion approaches [14, 15], which work by adding noise that obeys a specific distribution to privacy attributes. (2) Data anonymity approaches [16, 17], which map the values of a privacy attribute to a more generalized space to protect privacy information. (3) Data encryption approaches [18–20], which protect privacy attributes by encrypting the privacy information or processing it in the ciphertext space.

The above-mentioned privacy-preserving methods are designed for traditional privacy attributes. However, there exists another special and imperceptible kind of privacy, strongly related to privacy attributes, that can also lead to privacy disclosure. We motivate the study of this kind of privacy with a set of examples.

The department store Target often uses customers' shopping history to infer their pregnancy status and markets baby products according to this information. The pregnancy status is an explicit privacy attribute, which represents confidential and sensitive personal information. In contrast, the purchase record is not defined as an explicit privacy attribute, but it is strongly related to the explicit privacy attribute. If an attribute or a combination of attributes can potentially infer an explicit privacy attribute, we call these attributes implicit privacy attributes. As another example, the popularity of Internet medical treatment has prompted users to search for their diseases (the explicit privacy attribute) through search engines, so users' browsing history is closely related to their diseases. Hence, the browsing history is an implicit privacy attribute that implies a user's diseases. Besides, warfarin is a drug for the prevention and treatment of thromboembolic diseases [21]. Patients with different genotypes take different doses of warfarin, so there is a strong correlation between the warfarin dosage and the patient's genotype (the explicit privacy attribute). Therefore, the warfarin dosage is an implicit privacy attribute that indicates the patient's genotype. Although traditional privacy protection methods achieve significant performance on explicit privacy, they cannot guarantee good data utility while eliminating the strong correlation between implicit and explicit privacy attributes [22–24]. Furthermore, implicit privacy attributes are associated not only with explicit privacy attributes but also with class labels; changing their values to a constant or deleting them directly may lead to low data utility [25].
Therefore, for implicit privacy protection, we need to eliminate the association between implicit and explicit privacy attributes while preserving, as much as possible, the association between implicit privacy attributes and the class labels of the data.

Recently, the Generative Adversarial Network (GAN) and its improved versions have proved powerful in a wide range of applications [26–29], including privacy protection. For privacy protection, these GAN-based approaches are mainly proposed for explicit privacy. They can be roughly classified into two categories: (1) Gradient perturbation approaches, which add noise to the gradient descent process for model optimization. For example, DPGAN [30] adds noise to the discriminator's gradient during training to achieve differential privacy guarantees. (2) Output perturbation approaches, which add noise to the output of the model. PATE-GAN [31] applies the Private Aggregation of Teacher Ensembles (PATE) framework to GAN and bounds the impact of any single sample on the model to provide rigorous differential privacy guarantees.

Although proved useful, existing studies only consider explicit privacy protection. This paper proposes an ex-ante implicit privacy-preserving framework based on data generation, called IMPOSTER, which is composed of two modules. The implicit privacy detection module uses normalized mutual information to detect the attributes that are strongly related to explicit privacy attributes. The implicit privacy protection module is constructed by adding a discriminator to a GAN equipped with a Variational AutoEncoder (VAE). It contains one encoder E, one generator (or decoder) G, and two discriminators D1 and D2. As in the standard GAN, G is trained to generate artificial data that simulate the distribution of real samples Pr. While D1 is trained to judge whether the generated data are close to the real data, D2 is trained to eliminate the correlation between explicit privacy attributes and implicit ones. Together, E and G constitute the VAE, which is trained to minimize the reconstruction errors between real samples and fake samples generated by G. When the model converges, the generator produces high-utility data without implicit privacy. It is worth noting that attribute inference attacks [21] require attackers to infer explicit privacy attributes by accessing trained models and non-privacy attributes. In contrast, this paper focuses on inference through the strong correlation between explicit privacy attributes and other attributes: attackers can infer explicit privacy attributes from other attributes with a certain probability, which indirectly discloses private information. The implicit privacy disclosure problem does not require attackers to access a trained model.

We summarize our main contributions as follows:

  • We address a special and imperceptible class of privacy disclosure problem, called implicit privacy, and give a definition of implicit privacy attributes.

  • We propose an ex-ante implicit privacy-preserving framework based on data generation, called IMPOSTER, which contains an implicit privacy detection module and a protection module. Specifically, the former uses normalized mutual information to detect the attributes that are strongly related to privacy attributes. Based on the idea of data generation, the latter equips the GAN framework with an additional discriminator, which is used to eliminate the correlation between explicit privacy attributes and implicit privacy attributes.

  • We illustrate the superiority and effectiveness of IMPOSTER by theoretical derivation and experimental results on a public dataset.

The other sections of this work are arranged as follows: Section 2 reviews the related research. Section 3 introduces the preliminaries of the IMPOSTER framework. In Section 4, we detail our proposed framework for implicit privacy preservation based on data generation. In Section 5, we conduct detailed experiments on a public dataset and illustrate the effectiveness of our framework. Finally, in Section 6, we give the conclusion and future work.

2. Related work

We summarize the previous work related to traditional privacy protection approaches.

2.1. Traditional privacy-preserving methods

According to different methods of data processing, the existing privacy-preserving approaches mainly include: data distortion approaches, data encryption approaches, and data anonymity approaches.

Data distortion approaches. Data distortion approaches mainly include randomization, condensation, and differential privacy. Randomization works by injecting random noise into raw data and then publishing the disturbed data. Warner [32] proposed the randomized response (RR) mechanism, which provides plausible deniability for individuals with sensitive information. In order to reconstruct the data distribution without disclosing individual privacy, Aggarwal et al. [33] proposed a condensation approach that transforms the original dataset into a new anonymized one that preserves the correlation between different dimensions. Considering background knowledge attacks and differential attacks that steal individual privacy information, Dwork et al. [34] presented differential privacy, which provides mathematically provable guarantees for privacy preservation.

Data anonymity approaches. Data anonymity approaches mainly include k-anonymity, l-diversity, and t-closeness. Specifically, Sweeney et al. [35] proposed the k-anonymity algorithm, which ensures that any sample cannot be distinguished from the other k − 1 samples in the same equivalence class, so as to alleviate privacy leakage caused by linking attacks. Since k-anonymity does not impose any constraints on sensitive attribute columns, attackers can use homogeneity attacks and background knowledge attacks to discover users' corresponding sensitive data, resulting in privacy disclosure. To overcome this shortcoming of k-anonymity, Machanavajjhala et al. [16] proposed the l-diversity algorithm, which guarantees that sensitive attributes have at least l different values in the same equivalence class. In order to defend against similarity attacks, Li et al. [17] presented a privacy-preserving method named t-closeness, which guarantees that the difference between the distribution of sensitive attribute values in each equivalence class and that in the original dataset does not exceed a threshold t.
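The k-anonymity and l-diversity conditions above can be checked mechanically on a released table. The sketch below (with illustrative attribute names, not from the paper) verifies that every equivalence class, i.e., every group of rows sharing the same quasi-identifier values, has size at least k, and that each class contains at least l distinct sensitive values:

```python
from collections import defaultdict

def is_k_anonymous(rows, quasi_ids, k):
    """Every combination of quasi-identifier values must appear in
    at least k rows (each equivalence class has size >= k)."""
    classes = defaultdict(int)
    for row in rows:
        classes[tuple(row[a] for a in quasi_ids)] += 1
    return all(count >= k for count in classes.values())

def is_l_diverse(rows, quasi_ids, sensitive, l):
    """Every equivalence class must contain at least l distinct
    values of the sensitive attribute."""
    classes = defaultdict(set)
    for row in rows:
        classes[tuple(row[a] for a in quasi_ids)].add(row[sensitive])
    return all(len(vals) >= l for vals in classes.values())

# Toy table: generalized quasi-identifiers and a sensitive diagnosis.
table = [
    {"age": "20-30", "zip": "130**", "diagnosis": "flu"},
    {"age": "20-30", "zip": "130**", "diagnosis": "cold"},
    {"age": "40-50", "zip": "148**", "diagnosis": "flu"},
    {"age": "40-50", "zip": "148**", "diagnosis": "flu"},
]
print(is_k_anonymous(table, ["age", "zip"], 2))           # True: both classes have 2 rows
print(is_l_diverse(table, ["age", "zip"], "diagnosis", 2))  # False: second class only has "flu"
```

The second check illustrates exactly the homogeneity attack discussed above: the table is 2-anonymous, yet every member of the second equivalence class is known to have the flu.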

Data encryption approaches. Data encryption approaches conceal sensitive data through encryption technologies. Representative methods include secure multi-party computation (SMC), homomorphic encryption (HE), and federated learning (FELE). SMC [36] refers to the scenario in which multiple participants, each holding private data, jointly execute a computation (such as finding a maximum) and obtain the result in the absence of a trusted third party, so that no participant discloses its own data. HE [37, 38] uses an encryption algorithm satisfying homomorphic properties over ciphertexts: particular calculations performed on the ciphertext yield, after decryption, the same result as performing the corresponding calculations directly on the plaintext. FELE [18] is a distributed machine learning technology that breaks data islands; by exchanging encrypted intermediate results, it enables participants to build joint models without privacy disclosure.

2.2. Summary

Traditional privacy-preserving methods have proved useful for explicit privacy. However, they cannot guarantee good data utility while eliminating the strong correlation between explicit privacy and other attributes. In fact, traditional privacy protection methods cause low data utility if applied to implicit privacy directly. Therefore, further efforts are needed to protect implicit privacy.

3. Preliminaries

3.1. Randomized response

The Randomized Response (RR) mechanism was first proposed by Warner [32] to provide plausible deniability for individuals responding to sensitive questions. For example, consider the questionnaire question "Do you smoke?" Under RR, each respondent secretly flips an unbiased coin and tells the truth if it comes up heads; otherwise, the respondent flips the coin again and answers according to the second toss, i.e., answers "Yes" if it comes up heads and "No" otherwise. This paper uses the k-ary Randomized Response (kRR) mechanism [39] as a benchmark to compare with our proposed framework. Specifically, consider n individual records item1, item2, …, itemn in a dataset D, where each record itemi has an attribute value si ∈ 𝕊 for an attribute S. 𝕊 denotes the value space of S, and ℝ denotes the output alphabet of the sanitized S (𝕊 = ℝ and |𝕊| = k). We map si stochastically to ri ∈ ℝ by Eq (1), i.e., an attribute value si remains unchanged with probability e^ε/(e^ε + k − 1) and flips to each other value in the same attribute value space with probability 1/(e^ε + k − 1). The specific implementation process of the kRR mechanism is shown in Figure 1.

$$P(r_i = r \mid s_i = s) = \begin{cases} \dfrac{e^{\varepsilon}}{e^{\varepsilon}+k-1}, & \text{if } r = s \\[4pt] \dfrac{1}{e^{\varepsilon}+k-1}, & \text{if } r \neq s \end{cases} \tag{1}$$

Figure 1.

An illustration of the k-ary Randomized Response (kRR) mechanism

where ε denotes the privacy budget. In general, the smaller the ε, the higher the level of privacy protection and the lower the data utility.
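The kRR mechanism of Eq (1) can be sketched as a small sampling routine (a minimal illustration under the definitions above, not the paper's implementation):

```python
import math
import random

def krr(value, domain, epsilon, rng=random):
    """k-ary randomized response: keep the true value with probability
    e^eps / (e^eps + k - 1); otherwise flip to a uniformly chosen
    other value in the domain."""
    k = len(domain)
    p_keep = math.exp(epsilon) / (math.exp(epsilon) + k - 1)
    if rng.random() < p_keep:
        return value
    return rng.choice([v for v in domain if v != value])

# With a large budget, the true value is kept almost surely (low privacy).
print(krr("a", ["a", "b", "c"], epsilon=50.0))  # "a" with overwhelming probability

# With a small budget, the keep probability drops toward uniform (high privacy):
# for k = 3 and eps = 1, p_keep = e / (e + 2) ≈ 0.576.
rng = random.Random(0)
keeps = sum(krr("a", ["a", "b", "c"], 1.0, rng) == "a" for _ in range(10000))
print(keeps / 10000)
```

This matches the trade-off described above: shrinking ε pushes the output toward the uniform distribution over the k values, raising privacy and lowering utility.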

3.2. GAN and CGAN

As a representative generative model, the Generative Adversarial Network (GAN) [40] has the following advantages: (1) it does not depend on prior assumptions; (2) it generates synthetic samples similar to the distribution of real samples. GAN produces high-quality output through the adversarial game between a discriminator D and a generator G. Specifically, G is trained to learn the distribution of real samples from a noise distribution, while D distinguishes synthetic data produced by G from real data. In general, the objective function of D can be expressed as follows:

$$\max_{D}\; \mathbb{E}_{x\sim P_r}[\log D(x)] + \mathbb{E}_{z\sim P_z(z)}[\log(1-D(G(z)))] \tag{2}$$

where Pr represents the distribution of real samples, D(x) denotes the probability that x obeys Pr rather than the distribution of generated samples Pg, Pz(z) represents a prior distribution of the noise variable z, and G(z) denotes the synthetic sample that G produces from the prior distribution Pz. The objective function of G can be expressed as follows:

$$\min_{G}\; \mathbb{E}_{z\sim P_z(z)}[\log(1-D(G(z)))] \tag{3}$$

Therefore, G and D play the minimax game with a value function V(G, D), which is given by:

$$\min_{G}\max_{D} V(G, D) = \mathbb{E}_{x\sim P_r}[\log D(x)] + \mathbb{E}_{z\sim P_z(z)}[\log(1-D(G(z)))] \tag{4}$$

Figure 2 illustrates the structure of GAN. Theoretical analysis shows that GAN aims to minimize the distance between Pg and Pr, and V(G, D) has a global optimal value for Pr = Pg [40].

Figure 2.

An illustration of GAN
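The claim that V(G, D) attains its global optimum when Pr = Pg can be checked numerically for toy discrete distributions: for a fixed G, the optimal discriminator is D*(x) = Pr(x)/(Pr(x) + Pg(x)) [40], which outputs 1/2 everywhere once the generator matches the real distribution, and the value function then equals −2 log 2 ≈ −1.386. A minimal sketch:

```python
import math

def value_function(pr, pg, d):
    """V(G, D) = E_{x~Pr}[log D(x)] + E_{x~Pg}[log(1 - D(x))]
    for discrete distributions given as dicts: value -> probability."""
    v = sum(p * math.log(d(x)) for x, p in pr.items())
    v += sum(p * math.log(1 - d(x)) for x, p in pg.items())
    return v

pr = {0: 0.5, 1: 0.5}
pg = {0: 0.5, 1: 0.5}  # the generator has matched the real distribution

# Optimal discriminator for a fixed generator: D*(x) = Pr(x) / (Pr(x) + Pg(x)).
d_star = lambda x: pr[x] / (pr[x] + pg[x])

print(d_star(0))                       # 0.5: D* cannot tell real from fake
print(value_function(pr, pg, d_star))  # -2 log 2 ≈ -1.386, the global optimum
```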

As an improvement of the traditional GAN, Conditional Generative Adversarial Network (CGAN) [41] takes auxiliary information ζ as a condition to guide G and D to realize a conditional generative model. The objective function of CGAN is given by:

$$\min_{G}\max_{D} V(G, D) = \mathbb{E}_{x\sim P_r}[\log D(x \mid \zeta)] + \mathbb{E}_{z\sim P_z(z)}[\log(1-D(G(z \mid \zeta)))] \tag{5}$$

Notably, CGAN combines the noise distribution Pz(z) and ζ into a joint latent representation to input into G. Similarly, x and ζ are also input into D.

4. Overview of our approach

4.1. Problem statement

To clarify explicit privacy and implicit privacy, we take the case of the department store Target as an example.

Consider a scenario in which retailers advertise products to customers. Figure 3a shows the use of explicit privacy. Retailers illegally obtain female customers' medical records and then recommend product advertisements according to the customers' pregnancy status (an explicit privacy attribute). Such explicit privacy disclosure threatens users' lives and property, e.g., through bullying or credit card fraud. The disclosure of explicit privacy attributes can be prevented with representative techniques such as differential privacy (DP). In addition to the explicit privacy mentioned above, there is a special and imperceptible class of non-privacy attributes, called implicit privacy, which is strongly associated with privacy attributes. As shown in Figure 3b, retailers do not directly use the pregnancy status in the medical records but use the customer's recent purchase records. If the customer has often purchased pregnancy-related products P1 and P2 recently, retailers can conclude that the customer or one of her family members is pregnant and then recommend pregnancy-related product advertisements to her. If the customer often purchases products O1 and O2, retailers will recommend other types of product advertisements. Here, the purchase record is an implicit privacy attribute, through which the customer's pregnancy status can be accurately inferred. In contrast to explicit privacy, implicit privacy is not defined as a privacy attribute, but it correlates strongly with privacy attributes. Attackers can use it to infer explicit privacy indirectly, resulting in a series of privacy disclosure problems. Based on the above case, we define explicit privacy attributes and implicit privacy attributes as follows:

Figure 3.

A scenario in which retailers recommend product advertisements to customers.

Note that explicit privacy attributes refer to traditional privacy attributes in this paper.

Generally speaking, the stronger the correlation between xp and s, the stronger the predictive ability from xp to s. For example, in feature engineering, we also select the features that are strongly related to the class label [47–49]. In a word, our ultimate goal is to eliminate the correlation between explicit and implicit privacy attributes while preserving good data utility.

4.2. Our framework IMPOSTER

As shown in Figure 4, our framework IMPOSTER consists of two modules: (1) the implicit privacy detection module, and (2) the implicit privacy protection module.

Figure 4.

The framework of IMPOSTER

4.2.1. Implicit privacy detection module

An explicit privacy attribute can be inferred from other attributes in the dataset, which also results in users' privacy disclosure. As shown in Figure 5, attribute x3 can be used to infer an explicit privacy attribute s with a certain probability, so x3 is an implicit privacy attribute for s. Therefore, it is necessary to determine the implicit privacy attributes for s in advance and protect them. Commonly used metrics for measuring the correlation between two random variables are the Pearson correlation coefficient, mutual information (MI), and normalized mutual information (NMI). The Pearson correlation coefficient mainly measures the degree of linear correlation between two random variables. Both MI and NMI can measure linear as well as nonlinear correlation. Further, NMI normalizes the MI score to scale the results between 0 (statistical independence) and 1 (perfect correlation), which reduces the adverse effects of abnormal sample data. Therefore, we use NMI to measure the correlation between the explicit privacy attribute and other attributes. The formula of NMI is as follows:

$$\mathrm{NMI}(s; x_i) = \frac{2\,I(s; x_i)}{H(s) + H(x_i)} \tag{7}$$

Figure 5.

An illustration of implicit privacy detection

where H(·) denotes Shannon entropy and I(·;·) denotes mutual information. We calculate the relevance between the explicit privacy attribute s and each other attribute xi, and obtain the attribute set that is strongly related to s. According to Definition 2, we then use a classification algorithm f to measure the prediction ability from this attribute set to s, and finally obtain the implicit privacy attribute set.
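The NMI-based detection step can be sketched as follows (a toy illustration with hypothetical attribute names; the subsequent classifier-based check of prediction ability is omitted). NMI is computed as 2·I(X;Y)/(H(X)+H(Y)) from empirical frequencies:

```python
import math
from collections import Counter

def entropy(xs):
    """Empirical Shannon entropy of a discrete sample."""
    n = len(xs)
    return -sum(c / n * math.log(c / n) for c in Counter(xs).values())

def mutual_info(xs, ys):
    """Empirical mutual information I(X; Y) of two paired samples."""
    n = len(xs)
    px, py = Counter(xs), Counter(ys)
    return sum(c / n * math.log((c / n) / ((px[a] / n) * (py[b] / n)))
               for (a, b), c in Counter(zip(xs, ys)).items())

def nmi(xs, ys):
    """Normalized mutual information in [0, 1]: 2 I(X;Y) / (H(X) + H(Y))."""
    hx, hy = entropy(xs), entropy(ys)
    if hx == 0 or hy == 0:
        return 0.0
    return 2 * mutual_info(xs, ys) / (hx + hy)

def detect_implicit(data, s, theta):
    """Return attributes whose NMI with the explicit privacy
    attribute s exceeds the threshold theta."""
    return [name for name, col in data.items() if nmi(col, s) > theta]

s = [0, 0, 1, 1]                  # explicit privacy attribute
data = {"x1": [0, 0, 1, 1],       # perfectly correlated with s -> NMI = 1
        "x2": [0, 1, 0, 1]}       # independent of s            -> NMI = 0
print(detect_implicit(data, s, theta=0.5))  # ['x1']
```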

4.2.2. Implicit privacy protection module

Based on the idea of data generation, the implicit privacy protection module adds a discriminator to the GAN framework equipped with a VAE model to eliminate the association between implicit and explicit privacy attributes. Although GAN is trained so that the distribution of synthetic samples resembles the distribution of real samples, it is not good at capturing the element-wise errors between synthetic and real samples. To alleviate this limitation, our model incorporates a VAE [50] to minimize the reconstruction errors between real and synthetic samples. The VAE includes an encoder E that compresses the original input xp to a latent representation zl ∼ E(xp) = p(zl|xp) and a decoder G that decompresses zl to a reconstructed output x̃p. The total loss function of the VAE includes the reconstruction errors and a regularization term, which can be expressed as:

$$\mathcal{L}_{\mathrm{VAE}} = -\,\mathbb{E}_{z_l\sim E(x_p)}\big[\log p(x_p \mid z_l)\big] + \mathrm{KL}\big(p(z_l \mid x_p)\,\|\,p(z_l)\big) \tag{8}$$

where KL(·∥·) denotes the Kullback–Leibler divergence.
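When the encoder outputs a diagonal Gaussian and the latent prior is standard normal, the KL regularization term in Eq (8) has a well-known closed form, and the penalty vanishes exactly when the encoder matches the prior. A minimal numeric check (this Gaussian parameterization is the standard VAE choice, assumed here rather than stated in the paper):

```python
import math

def kl_gaussian_to_standard_normal(mu, sigma):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ) for a diagonal Gaussian:
    0.5 * sum(mu^2 + sigma^2 - 2*log(sigma) - 1) over latent dimensions."""
    return 0.5 * sum(m * m + s * s - 2 * math.log(s) - 1
                     for m, s in zip(mu, sigma))

print(kl_gaussian_to_standard_normal([0.0, 0.0], [1.0, 1.0]))  # 0.0: encoder equals prior
print(kl_gaussian_to_standard_normal([1.0, 0.0], [1.0, 1.0]))  # 0.5: shifted mean is penalized
```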

After the data pass through the VAE, to eliminate the correlation between implicit and explicit privacy attributes, we adopt an improved GAN, which consists of one generator (the decoder of the VAE) G and two discriminators D1 and D2. G generates synthetic samples x̃p from the prior noise distribution Pz so that their distribution Pg matches Pr:

$$\tilde{x}_p = G(z), \quad z \sim P_z(z) \tag{9}$$

D1 is a classifier that distinguishes a real sample xp from a generated fake sample x̃p. Here, s is a binary attribute; therefore, the discriminator D2 is also a binary classifier, and it plays the key role in eliminating the correlation between implicit and explicit privacy attributes. The specific game among G, D1, and D2 is detailed below.

Therefore, our improved CGAN sub-module in the implicit privacy protection module is formalized as a minimax game and the value function is given by:

$$\min_{G}\max_{D_1, D_2} V(G, D_1, D_2) = V_1(G, D_1) + \lambda\, V_2(G, D_2) \tag{10}$$

where

$$V_1(G, D_1) = \mathbb{E}_{x_p\sim P_r}[\log D_1(x_p)] + \mathbb{E}_{z\sim P_z(z)}[\log(1-D_1(G(z)))] \tag{11}$$

$$V_2(G, D_2) = \mathbb{E}_{z\sim P_z(z),\, s}\big[s\log D_2(G(z)) + (1-s)\log(1-D_2(G(z)))\big] \tag{12}$$

The hyperparameter λ is a trade-off coefficient, which is used to balance data utility and privacy level of generated data.

Similar to the traditional CGAN, the value function V1 indicates that G and D1 play a zero-sum game. Specifically, D1 learns to accurately distinguish generated samples from real samples, while G learns to generate fake samples similar to real data to fool D1. To make the generated samples contain as little information as possible for predicting the value of the explicit privacy attribute, the second value function V2 shows that D2 and G also play a zero-sum game. Specifically, D2 learns to accurately predict the value of s, while G learns to fool D2.

The total objective function of the implicit privacy protection module can be formalized as follows:

$$\min_{E, G}\max_{D_1, D_2}\; \mathcal{L}_{\mathrm{VAE}} + V_1(G, D_1) + \lambda\, V_2(G, D_2) \tag{13}$$

Once the implicit privacy protection module converges and the reconstruction errors between real and generated samples in the VAE are within an acceptable range, the synthetic samples generated by G approximately obey the distribution of real samples, and the correlation between explicit and implicit privacy attributes is eliminated as far as possible.

4.3. Algorithm

Algorithm 1 displays the pseudo code of the implicit privacy detection module. Algorithm 2 displays the pseudo code of the implicit privacy protection module.

Firstly, we sample a minibatch of samples from the output of E and a minibatch of noise samples from the prior Pz to train D1 and G (Lines 2 to 7). Secondly, we sample two further minibatches of samples to train D2 and G (Lines 8 to 10). Finally, when the model converges, we obtain a generator G that produces data without implicit privacy.
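The alternating schedule described above can be sketched as a training-loop skeleton; the update functions are placeholders for the actual gradient steps of D1, D2, and G (Algorithm 2 itself is not reproduced here):

```python
def train(steps, sample_real, sample_noise, upd_d1, upd_g_adv, upd_d2, upd_g_priv):
    """Skeleton of the alternating game. Each step:
    (1) update D1 on real vs. generated minibatches, then G to fool D1;
    (2) update D2 to predict s from generated samples, then G to fool D2
        (the privacy term is weighted by lambda in the full objective)."""
    for _ in range(steps):
        real, noise = sample_real(), sample_noise()
        upd_d1(real, noise)   # D1: real vs. fake
        upd_g_adv(noise)      # G: fool D1
        batch = sample_real()
        upd_d2(batch)         # D2: predict the explicit privacy attribute
        upd_g_priv(batch)     # G: fool D2

# Stub updates that just record the call order, to show the schedule.
calls = []
train(2,
      sample_real=lambda: "real", sample_noise=lambda: "z",
      upd_d1=lambda r, z: calls.append("d1"),
      upd_g_adv=lambda z: calls.append("g_adv"),
      upd_d2=lambda b: calls.append("d2"),
      upd_g_priv=lambda b: calls.append("g_priv"))
print(calls)  # ['d1', 'g_adv', 'd2', 'g_priv', 'd1', 'g_adv', 'd2', 'g_priv']
```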

4.4. Theoretical analysis

Different from the traditional GAN, the implicit privacy protection module in IMPOSTER adds an additional discriminator to protect implicit privacy. In addition, we introduce a VAE to capture the element-wise errors between synthetic and real samples, so as to keep them as close as possible. We therefore analyze the convergence of the implicit privacy protection module under the assumption that the reconstruction errors between real and generated samples in the VAE are within an acceptable range.

Note that the training objective of D1 is to estimate whether xp comes from Pr or Pg, while D2 is used to eliminate the association between the explicit privacy attribute s and the implicit privacy attributes xp. Given a fixed encoder E and the optimal discriminators D1* and D2*, Eq (13) can be rewritten as follows:

(16)

The detailed derivation process is described in the Appendix of the Supporting Information.

The objective function of the VAE includes cross entropy and the Kullback–Leibler divergence. For Eq (16), since the Jensen–Shannon divergence, the Kullback–Leibler divergence, and cross entropy are convex functions [51], C(G) can converge to a global minimum. Therefore, for C(G), we give a theorem as follows:

5. Experimental evaluation

We evaluate the effectiveness of our proposed framework IMPOSTER from the following aspects: (1) whether the generated synthetic data eliminate the correlation between explicit privacy attributes and implicit privacy attributes; (2) whether the generated synthetic data preserve good data utility; (3) parameter sensitivity analysis.

5.1. Dataset

We evaluate our proposed privacy-preserving framework IMPOSTER on a real-world dataset. We present some details and statistics of the dataset as follows:

UCI Adult dataset. The dataset contains 48 842 instances. Each instance contains 7 numerical and 7 categorical variables, and the class label represents whether the annual income exceeds $50k.

5.2. Evaluation metrics

This paper adopts several metrics to verify the performance of our framework. These metrics are listed as follows.

Accuracy. Accuracy measures the proportion of correctly predicted samples among all predicted samples and can be expressed as follows:

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{17}$$

where TP refers to true positives, FP to false positives, TN to true negatives, and FN to false negatives.

F1-score. Both precision and recall are important performance metrics in classification problems. However, they are often in tension: in general, the higher the recall, the lower the precision. To consider both indicators comprehensively, we adopt the F1-score, the harmonic mean of precision and recall, to measure the prediction performance of classifiers. It is given by:

$$F_1 = \frac{2 \times \mathrm{precision} \times \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}} \tag{18}$$

where precision = TP/(TP + FP) and recall = TP/(TP + FN).
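Both metrics follow directly from the confusion counts; a minimal sketch with toy numbers:

```python
def accuracy(tp, fp, tn, fn):
    """Fraction of correct predictions among all predictions."""
    return (tp + tn) / (tp + fp + tn + fn)

def f1_score(tp, fp, fn):
    """Harmonic mean of precision = TP/(TP+FP) and recall = TP/(TP+FN)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Toy confusion counts: 40 TP, 10 FP, 45 TN, 5 FN.
print(accuracy(40, 10, 45, 5))  # 0.85
print(f1_score(40, 10, 5))      # precision 0.8, recall 8/9 -> 16/19 ≈ 0.8421
```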

5.3. Experimental setting

For the purpose of exploring the effectiveness of our framework, we adopt several state-of-the-art and representative classifiers: XGBoost [52], gradient boosting decision tree (GBDT) [53, 54], random forest (RF) [55], multi-layer perceptron classifier (MLP) [56] and logistic regression (LR) [57]. We train several classifiers on two different settings.

  • Setting A: classifiers are trained on real samples and tested on real samples.

  • Setting B: classifiers are trained on generated samples and tested on generated samples.

We are mainly concerned with two comparisons. On the one hand, compared with setting A, if the classifiers trained on synthetic data have poor prediction performance for the explicit privacy attribute on synthetic data (setting B), our framework IMPOSTER is able to eliminate the correlation between implicit and explicit privacy attributes. On the other hand, compared with setting A, if the classifiers trained on synthetic data have good prediction performance for the class label on synthetic data (setting B), the generated synthetic data can capture the corresponding relationship between attributes and labels, and the association between attributes, that is, data utility.

5.4. Correlation elimination

In this subsection, we conduct elaborate experiments to illustrate whether our framework IMPOSTER is able to eliminate the correlation between implicit and explicit privacy attributes. Specifically, we first use the implicit privacy detection module to explore the correlation between the explicit privacy attribute and other attributes in the original dataset. Figure 6 illustrates the normalized mutual information between corresponding attributes in Adult. Here, we treat “gender” as an explicit privacy attribute and set θ = 0.01. We then adopt implicit privacy attributes to construct classifiers to infer the attribute “gender” on setting A and B. From Table 1, we can observe that, compared with setting A, all the classifiers trained on synthetic data generated by IMPOSTER have a significantly decreased performance in predicting the explicit privacy attribute (setting B). Therefore, our framework IMPOSTER is able to eliminate the correlation between implicit and explicit privacy attributes.

Figure 6.

The normalized mutual information between corresponding attributes in Adult

Table 1.

Accuracy and F1-score of predicting “gender” (the explicit privacy attribute) and “income” (the class label) in classifiers on setting A and B

5.5. Data utility

In order to evaluate whether our framework can guarantee the data utility, we use the synthetic dataset generated by IMPOSTER to predict the class label “income” on setting A and B. From Table 1, we can see that, compared with setting A, even though the performance of all classifiers trained on setting B in predicting the class label decreases slightly, the prediction accuracy of training on data generated by IMPOSTER is higher than 81%. This result reflects that synthetic samples generated by IMPOSTER have captured the relationship between attributes and class labels well. Therefore, our framework can guarantee data utility while protecting implicit privacy.

5.6. Comparative experiment

To verify the superiority of IMPOSTER, we use the kRR mechanism, a representative differential privacy algorithm, as a traditional privacy protection baseline for implicit privacy. Specifically, given a privacy budget ε, we keep the value of each implicit privacy attribute unchanged with high probability and flip it to another value in the same attribute value space with low probability. Then, we use the data disturbed by kRR to train a GBDT classifier to predict the class label and the explicit privacy attribute, respectively.

Figure 7 shows the prediction performance, on the class label and the explicit privacy attribute, of data perturbed by the comparative kRR mechanism. From Figure 7, we can observe that, as the privacy budget ε increases from 1 to 100, the prediction performance of the disturbed implicit privacy attributes for both the class label and the explicit privacy attribute increases. Comparing Table 1 and Figure 7, when the accuracy of predicting the explicit privacy attribute reaches 68.91% on the data disturbed by kRR, the accuracy of predicting the class label is only 75.17%; on the data generated by IMPOSTER, the accuracy of predicting the explicit privacy attribute and the class label is 67.26% and 82.29%, respectively. We can conclude that the highest prediction performance for "gender" and "income" does not exceed the prediction performance on the original data, whether on data disturbed by the kRR mechanism or on data generated by IMPOSTER. However, compared with the kRR mechanism, our proposed framework maintains better data utility while providing the same privacy-preserving level.

Figure 7.

The prediction performance of data perturbed by the kRR mechanism on the class label and the explicit privacy attribute, respectively. The x-axis denotes different values of privacy budget ε, and the y-axis denotes F1-score and accuracy

5.7. Parameter sensitivity analysis

In this section, we evaluate the sensitivity of the parameters θ and λ. Except for the parameter being explored, all other parameters take default values. As a correlation threshold, a larger θ means that each selected attribute has a higher correlation with the explicit privacy attribute. We first obtain the candidate set of implicit privacy attributes by varying θ, and then train an XGBoost classifier to evaluate the accuracy and F1-score of predicting “gender” (the explicit privacy attribute) and “income” (the class label) on settings A and B, as shown in Table 2. We can observe that as θ decreases, the accuracy and F1-score of predicting “gender” and “income” tend to increase on setting A, i.e., smaller values of θ yield higher accuracy and F1-score. On setting B, however, as θ decreases, the accuracy and F1-score of predicting “gender” remain below 67% and 46%, respectively, while the accuracy and F1-score of predicting “income” are only slightly lower than those on setting A under the same θ. Therefore, our framework IMPOSTER maintains good data utility while protecting implicit privacy effectively.
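The θ-based candidate selection can be sketched as follows. The normalization used here, I(X;Y)/√(H(X)H(Y)), is one common definition of normalized mutual information and is an assumption on our part; the exact normalization in the paper, and the helper names below, are illustrative.

```python
import math
from collections import Counter

def entropy(col):
    """Shannon entropy (natural log) of a categorical column."""
    n = len(col)
    return -sum((c / n) * math.log(c / n) for c in Counter(col).values())

def nmi(x, y):
    """Normalized mutual information I(X;Y) / sqrt(H(X) * H(Y))."""
    n = len(x)
    hx, hy = entropy(x), entropy(y)
    if hx == 0 or hy == 0:
        return 0.0
    px, py = Counter(x), Counter(y)
    mi = sum((c / n) * math.log(n * c / (px[a] * py[b]))
             for (a, b), c in Counter(zip(x, y)).items())
    return mi / math.sqrt(hx * hy)

def implicit_candidates(columns, explicit_name, theta):
    """Attributes whose NMI with the explicit privacy attribute
    is at least the correlation threshold theta."""
    explicit = columns[explicit_name]
    return [name for name, col in columns.items()
            if name != explicit_name and nmi(col, explicit) >= theta]
```

Lowering θ admits more weakly correlated attributes into the candidate set, which is why smaller θ improves prediction on setting A but widens the set that must be protected.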

Table 2.

Accuracy and F1-score of predicting “gender” (the explicit privacy attribute) and “income” (the class label) with different θ on settings A and B

The trade-off coefficient λ is an important parameter of IMPOSTER, used to balance the data utility and privacy level of the synthetic data. We evaluate how λ affects the synthetic datasets generated by IMPOSTER along two dimensions: correlation elimination and data utility. For correlation elimination, as λ grows, IMPOSTER tends to generate synthetic data with a lower correlation between the implicit and explicit privacy attributes. In essence, the game between G and D1 aims to generate synthetic data similar to the original data, including its inter-attribute correlations, which limits the extent to which IMPOSTER can eliminate the correlation. Therefore, once λ increases beyond a certain point, the accuracy and F1-score fluctuate within a certain interval. This can be observed in Figure 8a, where we train an XGBoost classifier on the synthetic data generated by IMPOSTER to predict the explicit privacy attribute “gender”. For data utility, we use the synthetic data generated by IMPOSTER to build an XGBoost classifier that predicts the class label “income”. Figure 8b shows the performance curves for predicting “income” with different λ: as λ increases, accuracy and F1-score remain relatively steady with only slight fluctuations. Together, Figures 8a and 8b illustrate that IMPOSTER can eliminate the correlation between implicit and explicit privacy attributes while preserving data utility. Note that IMPOSTER achieves its best performance when λ is around 1, and when λ = 0 it degenerates to a CGAN equipped with a VAE, which cannot remove the correlation between implicit and explicit privacy attributes.
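The role of λ can be illustrated with a simplified per-sample generator objective. This sketch is our own reading of the two-discriminator setup, not the paper's exact loss: D1 judges real vs. synthetic (the realism term), while D2 tries to recover the explicit privacy attribute from the synthetic sample (the λ-weighted privacy term). All names and the specific log-loss form are hypothetical.

```python
import math

def generator_loss(d1_prob_real, d2_prob_correct, lam):
    """Hypothetical per-sample generator objective in an
    IMPOSTER-style two-discriminator GAN:
      realism: small when D1 believes the synthetic sample is real;
      privacy: small when D2 fails to recover the explicit privacy
               attribute, weighted by the trade-off coefficient lam."""
    realism = -math.log(d1_prob_real)
    privacy = -math.log(1.0 - d2_prob_correct)
    return realism + lam * privacy
```

With lam = 0 the privacy term vanishes and the objective reduces to an ordinary conditional-GAN generator loss, consistent with the λ = 0 degeneration described in the text; larger lam trades realism (and hence utility) for stronger decorrelation.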

Figure 8.

The parameter sensitivity of IMPOSTER with different λ, where the x-axis denotes different values of trade-off coefficient λ, whereas the y-axis denotes F1-score and accuracy. (a) Correlation elimination (performance of the predicted explicit privacy attribute “gender”) in synthetic datasets from IMPOSTER with different λ. (b) Data utility (performance of the predicted class label “income”) in synthetic datasets from IMPOSTER with different λ

6. Conclusion and future work

This paper addresses a special and imperceptible class of privacy, called implicit privacy, and proposes an ex-ante implicit privacy-preserving framework based on data generation, called IMPOSTER, which consists of implicit privacy detection and protection modules. Specifically, the former uses normalized mutual information to detect attributes strongly correlated with explicit privacy attributes. The latter equips the standard GAN framework with an additional discriminator, which is used to eliminate the association between explicit and implicit privacy attributes. Experimental results demonstrate that IMPOSTER can learn a generator that produces data free of implicit privacy while preserving good data utility.

In future work, on the one hand, we will adopt the Rényi entropy [58] to explore the correlation between multiple attributes and explicit privacy attributes, which remains an open and interesting question. On the other hand, we will apply the proposed IMPOSTER framework to the implicit privacy issue of time-series data in financial risk control scenarios: generating data that eliminate implicit privacy as much as possible to meet user expectations and regulatory requirements, and replacing the real data with the generated data to train financial anti-fraud models and improve the robustness of financial risk control systems.

Conflict of Interest

The authors declare that they have no conflict of interest.

Data Availability

The data supporting the findings of this study are publicly available in the UCI Machine Learning Repository at https://archive.ics.uci.edu/ml/datasets/adult.

Authors’ Contributions

Qing Yang contributed to the theoretical development, experimental design and manuscript drafting. Cheng Wang made a significant contribution to the theoretical development and to revising the manuscript for intellectual content. Teng Hu contributed to the experimental design and revised the content of the manuscript. Xue Chen contributed to revising and polishing the content of the manuscript. Changjun Jiang made a significant contribution to the theoretical development and jointly wrote this paper.

Acknowledgments

We are very grateful to all editors and reviewers for their constructive comments. In addition, we thank Jipeng Cui for revising and polishing the content of the manuscript.

Funding

This work was supported in part by the National Key Research and Development Program of China under Grant 2018YFB2100801, in part by the National Natural Science Foundation of China (NSFC) under Grant 61972287, and in part by the Fundamental Research Funds for the Central Universities under Grant 22120210524.

Supporting Information

Appendix: (Access here)


References

  1. Han J, Pei J and Kamber M. Data Mining: Concepts and Techniques. The Netherlands: Elsevier, 2011.
  2. Jia JS, Lu X and Yuan Y et al. Population flow drives spatio-temporal distribution of COVID-19 in China. Nature 2020; 582: 389–94.
  3. Park Y and Ho JC. Tackling overfitting in boosting for noisy healthcare data. IEEE Trans Knowl Data Eng 2021; 33: 2995–3006.
  4. Wang W, Lesner C and Ran A et al. Using small business banking data for explainable credit risk scoring. Proc AAAI Conf Artif Intell 2020; 34: 13396–401.
  5. Liu Y, Ao X and Zhong Q et al. Alike and unlike: Resolving class imbalance problem in financial credit risk assessment. In: Proc. 29th ACM Int. Conf. Inf. Knowl. Manag., Virtual Event, Ireland, Oct. 19-23, 2020, 2125–8.
  6. De Montjoye YA, Radaelli L and Singh VK et al. Unique in the shopping mall: On the reidentifiability of credit card metadata. Science 2015; 347: 536–9.
  7. Zhang L, Shen J and Zhang J et al. Multimodal marketing intent analysis for effective targeted advertising. IEEE Trans Multim 2022; 24: 1830–43.
  8. Deng A and Hooi B. Graph neural network-based anomaly detection in multivariate time series. Proc AAAI Conf Artif Intell 2021; 35: 4027–35.
  9. Hu W, Gao J and Li B et al. Anomaly detection using local kernel density estimation and context-based regression. IEEE Trans Knowl Data Eng 2018; 32: 218–33.
  10. Wu FJ and Luo T. Crowdprivacy: Publish more useful data with less privacy exposure in crowdsourced location-based services. ACM Trans Priv Secur 2020; 23: 6:1–25.
  11. Holt JD and Chung SM. Efficient mining of association rules in text databases. In: Proc. 1999 ACM CIKM Int. Conf. Inf. Knowl. Manag., Kansas City, Missouri, USA, Nov. 2-6, 1999, 234–42.
  12. Sankar L, Rajagopalan SR and Poor HV. Utility-privacy tradeoffs in databases: An information-theoretic approach. IEEE Trans Inf Forensics Secur 2013; 8: 838–52.
  13. Narayanan A and Shmatikov V. Robust de-anonymization of large sparse datasets. In: 2008 IEEE Symp. Secur. Priv. (SP), Oakland, CA, USA, May 18-22, 2008, 111–25.
  14. Li S, Ji X and You W. A personalized differential privacy protection method for repeated queries. In: 2019 IEEE 4th Int. Conf. Big Data Anal. (ICBDA), Suzhou, China, Mar. 15-18, 2019, 274–80.
  15. Dwork C and Roth A. The algorithmic foundations of differential privacy. Found Trends Theor Comput Sci 2014; 9: 211–407.
  16. Machanavajjhala A, Kifer D and Gehrke J et al. L-diversity: Privacy beyond k-anonymity. ACM Trans Knowl Discov Data 2007; 1: 3.
  17. Li N, Li T and Venkatasubramanian S. t-closeness: Privacy beyond k-anonymity and l-diversity. In: Proc. 23rd Int. Conf. Data Eng., Istanbul, Turkey, Apr. 15-20, 2007, 106–15.
  18. Yang Q, Liu Y and Chen T et al. Federated machine learning: Concept and applications. ACM Trans Intell Syst Technol 2019; 10: 12:1–19.
  19. Mohassel P and Zhang Y. SecureML: A system for scalable privacy-preserving machine learning. In: 2017 IEEE Symp. Secur. Priv. (SP), San Jose, CA, USA, May 22-26, 2017, 19–38.
  20. Chen H, Dai W and Kim M et al. Efficient multi-key homomorphic encryption with packed ciphertexts with application to oblivious neural network inference. In: Proc. 2019 ACM SIGSAC Conf. Comput. Commun. Secur., CCS 2019, London, UK, Nov. 11-15, 2019, 395–412.
  21. Fredrikson M, Lantz E and Jha S et al. Privacy in pharmacogenetics: An end-to-end case study of personalized warfarin dosing. In: Proc. 23rd USENIX Secur. Symp., San Diego, CA, USA, Aug. 20-22, 2014, 17–32.
  22. Krause A and Horvitz E. A utility-theoretic approach to privacy and personalization. In: Proc. Twenty-Third AAAI Conf. Artif. Intell., Chicago, Illinois, USA, July 13-17, 2008, 1181–8.
  23. Gross R, Airoldi E and Malin B et al. Integrating Utility into Face De-identification. Berlin, Heidelberg: Springer, 2006.
  24. Yang Q, Wang C and Wang C et al. Fundamental limits of data utility: A case study for data-driven identity authentication. IEEE Trans Comput Soc Syst 2020; 8: 398–409.
  25. Datta A, Fredrikson M and Ko G et al. Use privacy in data-driven systems: Theory and experiments with machine learnt programs. In: Proc. 2017 ACM SIGSAC Conf. Comput. Commun. Secur., CCS 2017, Dallas, TX, USA, Oct. 30-Nov. 3, 2017, 1193–210.
  26. Tseng BW and Wu PY. Compressive privacy generative adversarial network. IEEE Trans Inf Forensics Secur 2020; 15: 2499–513.
  27. Kim H, Park J and Min K et al. Anomaly monitoring framework in lane detection with a generative adversarial network. IEEE Trans Intell Transp Syst 2020; 22: 1603–15.
  28. Ruffino C, Hérault R and Laloy E et al. Pixel-wise conditioned generative adversarial networks for image synthesis and completion. Neurocomputing 2020; 416: 218–30.
  29. Zhang K, Zhong G and Dong J et al. Stock market prediction based on generative adversarial network. Proc Comput Sci 2019; 147: 400–6.
  30. Xie L, Lin K and Wang S et al. Differentially private generative adversarial network. arXiv preprint arXiv:1802.06739, 2018.
  31. Jordon J, Yoon J and van der Schaar M. PATE-GAN: Generating synthetic data with differential privacy guarantees. In: 7th Int. Conf. Learn. Represent., ICLR 2019, New Orleans, LA, USA, May 6-9, 2019.
  32. Warner SL. Randomized response: A survey technique for eliminating evasive answer bias. J Am Stat Assoc 1965; 60: 63–9.
  33. Aggarwal CC and Philip SY. A Condensation Approach to Privacy Preserving Data Mining. Berlin, Heidelberg: Springer, 2004.
  34. Dwork C. Differential privacy. In: Autom. Lang. Program. 33rd Int. Colloq. ICALP 2006, Venice, Italy, Jul. 10-14, 2006, Proc. Part II, 2006, 1–12.
  35. Sweeney L. k-anonymity: A model for protecting privacy. Int J Uncertain Fuzziness Knowl Based Syst 2002; 10: 557–70.
  36. Zhao C, Zhao S and Zhao M et al. Secure multi-party computation: Theory, practice and applications. Inf Sci 2019; 476: 357–72.
  37. Elgamal T. A public key cryptosystem and a signature scheme based on discrete logarithms. IEEE Trans Inf Theor 1985; 31: 469–72.
  38. Ullah S, Li XY and Hussain MT et al. Kernel homomorphic encryption protocol. J Inf Secur Appl 2019; 48: 102366.
  39. Kairouz P, Oh S and Viswanath P. Extremal mechanisms for local differential privacy. J Mach Learn Res 2016; 17: 492–542.
  40. Goodfellow I, Pouget-Abadie J and Mirza M et al. Generative adversarial nets. In: Adv. Neural Inf. Process. Syst. 27: Annu. Conf. Neural Inf. Process. Syst. 2014, Montreal, Quebec, Canada, Dec. 8-13, 2014, 2672–80.
  41. Mirza M and Osindero S. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014.
  42. Torra V. Data Privacy: Foundations, New Developments and the Big Data Challenge. Heidelberg: Springer, 2017.
  43. Xu L, Jiang C and Wang J et al. Information security in big data: Privacy and data mining. IEEE Access 2014; 2: 1149–76.
  44. Estévez PA, Tesmer M and Perez CA et al. Normalized mutual information feature selection. IEEE Trans Neural Netw 2009; 20: 189–201.
  45. Jaynes ET. Information theory and statistical mechanics. Phys Rev 1957; 106: 620.
  46. Benesty J, Chen J and Huang Y et al. Pearson correlation coefficient. Berlin, Heidelberg: Springer, 2009.
  47. Nargesian F, Samulowitz H and Khurana U et al. Learning feature engineering for classification. In: Proc. Twenty-Sixth Int. Jt. Conf. Artif. Intell., IJCAI 2017, Melbourne, Australia, Aug. 19-25, 2017, 2529–35.
  48. Datta A, Sen S and Zick Y. Algorithmic transparency via quantitative input influence: Theory and experiments with learning systems. In: 2016 IEEE Symp. Secur. Priv. (SP), San Jose, CA, USA, May 22-26, 2016, 598–617.
  49. Hild II KE, Erdogmus D and Torkkola K et al. Feature extraction using information-theoretic learning. IEEE Trans Pattern Anal Mach Intell 2006; 28: 1385–92.
  50. Kingma DP and Welling M. Auto-encoding variational bayes. In: Bengio Y and LeCun Y, editors, 2nd Int. Conf. Learn. Represent., ICLR 2014, Banff, AB, Canada, Apr. 14-16, 2014, Conf. Track Proc., 2014.
  51. Menéndez ML, Pardo JA and Pardo L et al. The Jensen-Shannon divergence. J Frankl Inst 1997; 334: 307–18.
  52. Chen T and Guestrin C. XGBoost: A scalable tree boosting system. In: Proc. 22nd ACM SIGKDD Int. Conf. Knowl. Discov. Data Mining, San Francisco, CA, USA, Aug. 13-17, 2016, 785–94.
  53. Friedman JH. Greedy function approximation: A gradient boosting machine. Annal Stat 2001; 29: 1189–232.
  54. Li Q, Wen Z and He B. Practical federated gradient boosting decision trees. Proc AAAI Conf Artif Intell 2020; 34: 4642–9.
  55. Breiman L. Random forests. Mach Learn 2001; 45: 5–32.
  56. Anthony M and Bartlett PL. Neural Network Learning: Theoretical Foundations. Cambridge: Cambridge University Press, 2009.
  57. Pan X and Xu Y. A safe feature elimination rule for L1-regularized logistic regression. IEEE Trans Pattern Anal Mach Intell 2021.
  58. Fehr S and Berens S. On the conditional Rényi entropy. IEEE Trans Inf Theor 2014; 60: 6801–10.
Qing Yang

Qing Yang is now a Ph.D. student at the Department of Computer Science and Technology, Tongji University in Shanghai, China. His research interests include data mining, data privacy and identity authentication.

Cheng Wang

Cheng Wang received his Ph.D. degree from the Department of Computer Science and Technology, Tongji University in 2011. He is currently a professor at the Department of Computer Science and Technology, Tongji University. His research interests include cyberspace security and intelligent information service.

Teng Hu

Teng Hu is now a Ph.D. student at the Department of Computer Science and Technology, Tongji University in Shanghai, China. His research interests include data mining, machine learning and fraud detection.

Xue Chen

Xue Chen is now a Ph.D. student at the Department of Computer Science and Technology, Tongji University in Shanghai, China. Her research interests include data privacy and machine learning.

Changjun Jiang

Changjun Jiang received his Ph.D. degree from the Institute of Automation, Chinese Academy of Sciences, Beijing, China, in 1995. He is currently the Leader of the Key Laboratory of the Ministry of Education for Embedded System and Service Computing, Tongji University, Shanghai, China. He is an academician of the Chinese Academy of Engineering, an IET Fellow and an Honorary Professor with Brunel University London, Uxbridge, England. He has been the recipient of one international prize and seven prizes in the field of science and technology.


All Tables

Table 1.

Accuracy and F1-score of predicting “gender” (the explicit privacy attribute) and “income” (the class label) in classifiers on settings A and B

Table 2.

Accuracy and F1-score of predicting “gender” (the explicit privacy attribute) and “income” (the class label) with different θ on settings A and B

All Figures

Figure 1.

An illustration of the k-ary Randomized Response (kRR) mechanism

Figure 2.

An illustration of GAN

Figure 3.

A scenario in which retailers recommend product advertisements to customers

Figure 4.

The framework of IMPOSTER

Figure 5.

An illustration of implicit privacy detection

Figure 6.

The normalized mutual information between corresponding attributes in Adult

Figure 7.

The prediction performance of data perturbed by the kRR mechanism on the class label and the explicit privacy attribute, respectively. The x-axis denotes different values of privacy budget ε, and the y-axis denotes F1-score and accuracy

Figure 8.

The parameter sensitivity of IMPOSTER with different λ, where the x-axis denotes different values of trade-off coefficient λ, whereas the y-axis denotes F1-score and accuracy. (a) Correlation elimination (performance of the predicted explicit privacy attribute “gender”) in synthetic datasets from IMPOSTER with different λ. (b) Data utility (performance of the predicted class label “income”) in synthetic datasets from IMPOSTER with different λ
