Implicit privacy preservation: a framework based on data generation

This paper addresses a special and imperceptible class of privacy, called implicit privacy. In contrast to traditional (explicit) privacy, implicit privacy has two essential properties: (1) it is not initially defined as a privacy attribute; (2) it is strongly associated with privacy attributes. In other words, attackers could utilize it to infer privacy attributes with a certain probability, indirectly resulting in the disclosure of private information. To deal with the implicit privacy disclosure problem, we give a measurable definition of implicit privacy and propose an ex-ante implicit privacy-preserving framework based on data generation, called IMPOSTER. The framework consists of an implicit privacy detection module and an implicit privacy protection module. The former uses normalized mutual information to detect implicit privacy attributes that are strongly related to traditional privacy attributes. Based on the idea of data generation, the latter equips the Generative Adversarial Network (GAN) framework with an additional discriminator, which is used to eliminate the association between traditional privacy attributes and implicit ones. We provide a theoretical analysis of the convergence of the framework. Experiments demonstrate that with the learned generator, IMPOSTER can alleviate the disclosure of implicit privacy while maintaining good data utility.


Introduction
Recent years have witnessed the rapid development of data mining techniques [1], which have become popular across a wide variety of fields, from disease prevention [2,3] and credit evaluation [4,5] to marketing analysis [6,7] and anomaly detection [8,9]. In order to provide accurate services and make effective decisions, both public and private organizations are committed to collecting and analyzing individual data [10]. For example, Walmart collected the shopping basket information of customers and found that beer was the most common commodity bought together with diapers [11]. However, the intentional or accidental disclosure of private information [12], such as personal credit information, online transaction history, or medical records, has raised concerns about personal privacy and has even caused widespread social panic and significant economic losses. Netflix, for instance, released a dataset including movie ratings from 500 000 subscribers, from which individual users were later re-identified.

In our framework, the generator G learns to produce synthetic samples that approximate the distribution of real samples P_r. While D_1 is trained to judge whether the generated data are close to the real data, D_2 is trained to eliminate the correlation between explicit privacy attributes and implicit ones. The encoder E and G together constitute a VAE, which is trained to minimize the reconstruction errors between real samples and the fake samples generated by G. When the model converges, the generator produces high-utility data without implicit privacy. It is worth noting that attribute inference attacks [21] require attackers to infer explicit privacy attributes by accessing trained models and non-privacy attributes. In contrast, this paper focuses on the setting in which attackers exploit the strong correlation between explicit privacy attributes and other attributes, that is, attackers can infer explicit privacy attributes from other attributes with a certain probability, which indirectly causes the disclosure of private information. The implicit privacy disclosure problem therefore does not require attackers to access any trained model.
We give a summary of the main contributions as follows:
• We address a special and imperceptible class of privacy disclosure problem, called implicit privacy, and give a definition of implicit privacy attributes.
• We propose an ex-ante implicit privacy-preserving framework based on data generation, called IMPOSTER, which contains an implicit privacy detection module and a protection module. Specifically, the former uses normalized mutual information to detect the attributes that are strongly related to privacy attributes. Based on the idea of data generation, the latter equips the GAN framework with an additional discriminator, which is used to eliminate the correlation between explicit privacy attributes and implicit privacy attributes.
• We illustrate the superiority and effectiveness of IMPOSTER by theoretical derivation and experimental results on a public dataset.
The other sections of this work are arranged as follows: Section 2 reviews the related research. Section 3 introduces the preliminaries of the IMPOSTER framework. In Section 4, we detail our proposed framework for implicit privacy preservation based on data generation. In Section 5, we conduct detailed experiments on a public dataset and illustrate the effectiveness of our framework. Finally, in Section 6, we give the conclusion and future work.

Related work
We summarize the previous work related to traditional privacy protection approaches.

Traditional privacy-preserving methods
According to different methods of data processing, the existing privacy-preserving approaches mainly include: data distortion approaches, data encryption approaches, and data anonymity approaches.
Data distortion approaches. Data distortion approaches mainly include randomization, condensation, and differential privacy. Randomization injects random noise into raw data and then publishes the disturbed data. Warner [32] proposed the randomized response (RR) mechanism, which provides plausible deniability for individuals with sensitive information. In order to reconstruct the data distribution without disclosing individual privacy, Aggarwal et al. [33] proposed a condensation approach that transforms the original dataset into a new anonymized one while preserving the correlation between different dimensions. Considering background knowledge attacks and differential attacks that steal individual privacy information, Dwork et al. [34] presented differential privacy, which provides mathematically provable guarantees for privacy preservation.
Data anonymity approaches. Data anonymity approaches mainly include k-anonymity, l-diversity, and t-closeness. Specifically, Sweeney et al. [35] proposed the k-anonymity algorithm, which ensures that any sample cannot be distinguished from the other k − 1 samples in the same equivalence class, so as to alleviate privacy leakage caused by linking attacks. Since k-anonymity does not impose any constraints on sensitive attribute columns, attackers can use homogeneity attacks and background knowledge attacks to discover users' corresponding sensitive data, resulting in privacy disclosure. To overcome this shortcoming of k-anonymity, Machanavajjhala et al. [16] proposed the l-diversity algorithm, which guarantees that sensitive attributes have at least l different values in the same equivalence class. In order to defend against similarity attacks, Li et al. [17] presented a privacy-preserving method named t-closeness, which guarantees that the difference between the distribution of sensitive attribute values in each equivalence class and that in the original dataset does not exceed a threshold t.
Data encryption approaches. Data encryption approaches conceal sensitive data through encryption technologies. Representative methods include secure multi-party computation (SMC), homomorphic encryption (HE), and federated learning (FELE). SMC [36] refers to the scenario in which multiple participants, each holding their own private data, jointly execute a computation (such as finding a maximum) and obtain the result without a trusted third party, so that no participant discloses its own data to the others. HE [37,38] adopts an encryption algorithm that supports homomorphic operations on ciphertexts: particular calculations can be performed directly on encrypted data, and decrypting the results yields the same output as performing the corresponding calculations on the plaintext. FELE [18] is a distributed machine learning technology that breaks data silos; by exchanging encrypted intermediate results, it enables participants to jointly build models without privacy disclosure.

Summary
Traditional privacy-preserving methods have proved useful for protecting explicit privacy. However, they cannot eliminate the strong correlation between explicit privacy attributes and other attributes while guaranteeing good data utility. As a matter of fact, traditional privacy protection methods would cause low data utility if applied directly to implicit privacy. Therefore, further efforts should be made to protect implicit privacy.

Randomized response
The Randomized Response (RR) mechanism was first proposed by Warner [32] to provide plausible deniability for individuals responding to sensitive questions. For example, consider a questionnaire item: "Do you smoke?" For this question, RR allows each respondent to flip an unbiased coin secretly; the respondent tells the truth if it comes up heads, and otherwise flips the coin again and answers "Yes" or "No" according to the result of the second toss, that is, answers "Yes" if it comes up heads and "No" otherwise. This paper uses the k-ary Randomized Response (kRR) mechanism [39] as a benchmark to compare with our proposed framework. Specifically, consider a dataset D with n individual records item_1, item_2, ..., item_n, where each record item_i has a value s_i ∈ S for a sensitive attribute, S denotes the value space of that attribute, and R denotes the output alphabet of the sanitized attribute (R = S and |S| = k). We map s_i stochastically to r_i ∈ R by Eq (1):

$$P(r_i = r \mid s_i) = \begin{cases} \dfrac{e^{\varepsilon}}{|S| - 1 + e^{\varepsilon}}, & r = s_i,\\[6pt] \dfrac{1}{|S| - 1 + e^{\varepsilon}}, & r \neq s_i, \end{cases}$$

where ε denotes the privacy budget, i.e., the attribute value s_i remains unchanged with probability $\frac{e^{\varepsilon}}{|S| - 1 + e^{\varepsilon}}$ and flips to any other value in the same attribute value space with probability $\frac{1}{|S| - 1 + e^{\varepsilon}}$. In general, the smaller the ε, the higher the level of privacy protection and the lower the data utility. The specific implementation process of the kRR mechanism is shown in Figure 1.
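To make the mechanism concrete, the following Python sketch implements the kRR perturbation described above; the function name `krr_perturb` and the example attribute values are illustrative and not taken from the paper.

```python
import math
import random

def krr_perturb(value, domain, epsilon):
    """k-ary randomized response: keep `value` with probability
    e^eps / (k - 1 + e^eps); otherwise flip to a uniformly chosen
    other value from `domain`, where k = |domain|."""
    k = len(domain)
    p_keep = math.exp(epsilon) / (k - 1 + math.exp(epsilon))
    if random.random() < p_keep:
        return value
    # each of the remaining k - 1 values is chosen with probability 1 / (k - 1 + e^eps)
    others = [v for v in domain if v != value]
    return random.choice(others)

# Example: perturb a binary attribute with privacy budget eps = 1.0
print(krr_perturb("smoker", ["smoker", "non-smoker"], epsilon=1.0))
```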

GAN and CGAN
As a representative generative model, the Generative Adversarial Network (GAN) [40] has the following advantages: (1) it does not depend on prior assumptions; (2) it generates synthetic samples similar to the distribution of real samples. GAN produces high-quality output through the mutual game learning of a discriminator D and a generator G. Specifically, G is trained to learn the distribution of real samples from a noise distribution, while D distinguishes synthetic data produced by G from real data. In general, the objective function of D can be expressed as follows:

$$\max_{D}\; \mathbb{E}_{x \sim P_r}[\log D(x)] + \mathbb{E}_{z \sim P_z(z)}[\log(1 - D(G(z)))],$$

where P_r represents the distribution of real samples, D(x) denotes the probability that x obeys P_r rather than the distribution of generated samples P_g, P_z(z) represents a prior distribution of the noise variable z, and G(z) represents the synthetic samples that G produces from the prior distribution P_z. The objective function of G can be expressed as follows:

$$\min_{G}\; \mathbb{E}_{z \sim P_z(z)}[\log(1 - D(G(z)))].$$

Therefore, G and D play a minimax game with a value function V(G, D), which is given by:

$$\min_{G}\max_{D}\; V(G, D) = \mathbb{E}_{x \sim P_r}[\log D(x)] + \mathbb{E}_{z \sim P_z(z)}[\log(1 - D(G(z)))].$$

Figure 2 illustrates the structure of GAN. Theoretical analysis shows that GAN aims to minimize the distance between P_g and P_r, and V(G, D) attains its global optimum when P_r = P_g [40].
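As a concrete illustration of these objectives (not code from the paper), the following PyTorch-style sketch computes the discriminator and generator losses for a generic discriminator D and generator G; the model definitions and tensor shapes are assumed.

```python
import torch

def gan_losses(D, G, x_real, z):
    """Losses corresponding to the minimax objectives above.
    D is assumed to output probabilities in (0, 1)."""
    x_fake = G(z)
    # D maximizes E[log D(x)] + E[log(1 - D(G(z)))], i.e., minimizes the negative
    d_loss = -(torch.log(D(x_real)).mean()
               + torch.log(1.0 - D(x_fake.detach())).mean())
    # G minimizes E[log(1 - D(G(z)))]
    g_loss = torch.log(1.0 - D(x_fake)).mean()
    return d_loss, g_loss
```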
As an improvement over the traditional GAN, the Conditional Generative Adversarial Network (CGAN) [41] takes auxiliary information ζ as a condition to guide G and D, realizing a conditional generative model. The objective function of CGAN is given by:

$$\min_{G}\max_{D}\; V(G, D) = \mathbb{E}_{x \sim P_r}[\log D(x \mid \zeta)] + \mathbb{E}_{z \sim P_z(z)}[\log(1 - D(G(z \mid \zeta) \mid \zeta))].$$

Notably, CGAN combines the noise distribution P_z(z) and ζ into a joint latent representation that is input into G. Similarly, x and ζ are also input into D.
Overview of our approach

Problem statement
To clarify explicit privacy and implicit privacy, we take the case of the department store Target as an example.

Consider a scenario in which retailers advertise products to customers. Figure 3a shows the use of explicit privacy. Retailers illegally obtain female customers' medical records and then decide which product advertisements to recommend according to the customers' pregnancy status (an explicit privacy attribute). Such explicit privacy disclosure can threaten users' lives and property, for example through bullying or credit card fraud. For the disclosure of explicit privacy attributes, we can use the representative and well-studied differential privacy (DP) techniques to prevent it. In addition to the explicit privacy mentioned above, there is a special and imperceptible class of non-privacy attributes, called implicit privacy, which is strongly associated with privacy attributes. As shown in Figure 3b, retailers do not directly use the pregnancy status in the medical records, but use the customer's recent purchase records. If the customer has recently and frequently purchased pregnancy-related products P_1 and P_2, retailers can conclude that the customer or one of her family members is pregnant, and then recommend pregnancy-related product advertisements to her. If the customer often purchases products O_1 and O_2, retailers will recommend other types of product advertisements to the customer. Here, the purchase record is an implicit privacy attribute, through which the customer's pregnancy status can be accurately inferred. In contrast to explicit privacy, implicit privacy is not defined as a privacy attribute, but it strongly correlates with privacy attributes. Attackers can use it to infer explicit privacy indirectly, resulting in a series of privacy disclosure problems. Based on the above case, we define explicit privacy attributes and implicit privacy attributes as follows:

Definition 1. (Explicit privacy attributes, i.e., privacy attributes [42,43]). In a dataset, attributes that directly represent personal confidential and sensitive information are called explicit privacy attributes (e.g., disease and salary).
Note that explicit privacy attributes refer to traditional privacy attributes in this paper.

Definition 2. ((θ, β)-implicit privacy attribute). Given a dataset D = (x, s), including the public attributes x and an explicit privacy attribute s (e.g., income, disease status), each attribute in x_p ⊆ x is said to be a (θ, β)-implicit privacy attribute for s with respect to the correlation metric ρ and the performance metric τ if there exists a classification algorithm f : x_p → s such that

$$\rho(x_i, s) \ge \theta \;\; \text{for each } x_i \in x_p, \qquad \text{and} \qquad \tau(f(x_p), s) \ge \beta,$$

where ρ(·, ·) is a function that measures the correlation between two attributes, such as the normalized mutual information [44,45] or the Pearson correlation coefficient [46], and τ represents a performance measure of classifiers, such as Accuracy or F1-score.
Here, a higher performance threshold β indicates higher prediction performance from x_p to s by f. The selection of β depends on the tolerance of implicit privacy disclosure: if a higher risk of implicit privacy disclosure can be tolerated, we can select a larger β to obtain higher data utility; otherwise, we can select a smaller β to achieve better privacy protection. Similarly, as a correlation threshold, a higher θ indicates that each selected attribute has a higher correlation with s.
Generally speaking, the stronger the correlation between x_p and s, the stronger the predictive ability from x_p to s. For example, in feature engineering, the features that are strongly related to the class label are likewise preferentially selected [47–49]. In a word, our ultimate goal is to eliminate the correlation between explicit and implicit privacy attributes while preserving good data utility.

Our framework IMPOSTER
As shown in Figure 4, our framework IMPOSTER consists of two modules: (1) the implicit privacy detection module, and (2) the implicit privacy protection module.

Implicit privacy detection module
The explicit privacy attribute can be inferred from other attributes in the dataset, which also results in the disclosure of users' privacy. As shown in Figure 5, attribute x_3 can be used to infer the explicit privacy attribute s with a certain probability, so attribute x_3 is an implicit privacy attribute for s. Therefore, it is necessary to determine the implicit privacy attributes for s in advance and protect them. Commonly used metrics to measure the correlation between two random variables are the Pearson correlation coefficient, mutual information (MI), and normalized mutual information (NMI). The Pearson correlation coefficient mainly measures the degree of linear correlation between two random variables, whereas both MI and NMI can measure linear as well as nonlinear correlation. Further, NMI normalizes the MI score by the Shannon entropies H of the two attributes, scaling the results between 0 (statistical independence) and 1 (perfect correlation), which reduces the adverse effects of abnormal sample data. Therefore, we use NMI to measure the correlation between the explicit privacy attribute and the other attributes. We calculate the relevance between the explicit privacy attribute s and each other attribute x_i and obtain the attribute set that is strongly related to s. According to Definition 2, we then use a classification algorithm f to measure the prediction ability from this attribute set to s, and finally obtain the implicit privacy attribute set.
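The advantage of NMI over a linear metric can be illustrated with a small synthetic example (not from the paper), in which an attribute determines s through a purely nonlinear relationship; numerical attributes would need to be discretized before computing NMI in practice.

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.metrics import normalized_mutual_info_score

rng = np.random.default_rng(0)
x = rng.integers(-5, 6, size=5000)     # a candidate "implicit" attribute
s = (np.abs(x) > 2).astype(int)        # s depends on x only nonlinearly

print(pearsonr(x, s)[0])                     # close to 0: the linear metric misses the dependence
print(normalized_mutual_info_score(x, s))    # clearly above 0: NMI detects it
```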

Implicit privacy protection module
Based on the idea of data generation, the implicit privacy protection module adds a discriminator into the GAN framework, equipped with a VAE model, to eliminate the association between implicit privacy attributes and explicit privacy attributes. Although GAN is trained to learn a distribution of synthetic samples similar to the distribution of real samples, it is not good at capturing the element-wise errors between synthetic samples and real samples. In order to alleviate this limitation, our model incorporates a VAE [50] to minimize the reconstruction errors between real samples and synthetic samples. The VAE includes an encoder E that compresses the original input x_p into a latent representation z_l ∼ E(x_p) = p(z_l | x_p) and a decoder G that decompresses z_l into the reconstructed output x̂_p ∼ G(z_l) = p(x_p | z_l). The total loss function of the VAE includes the reconstruction errors and a regularization term, which can be expressed as:

$$V_{\mathrm{VAE}}(E, G) = \mathbb{E}_{z_l \sim E(x_p)}\big[-\log p(x_p \mid z_l)\big] + \mathrm{KL}\big(p(z_l \mid x_p)\,\|\,p(z_l)\big),$$

where KL(·) refers to the Kullback-Leibler divergence.
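A minimal PyTorch sketch of this VAE objective is given below; it assumes a Gaussian posterior parameterized by (mu, logvar) and inputs scaled to [0, 1] so that a cross-entropy reconstruction term applies, and all names are illustrative.

```python
import torch
import torch.nn.functional as F

def vae_loss(x_p, x_hat, mu, logvar):
    """Reconstruction error plus KL regularizer for the encoder/decoder pair."""
    # Element-wise reconstruction error between real and reconstructed samples
    recon = F.binary_cross_entropy(x_hat, x_p, reduction="sum")
    # Closed-form KL( N(mu, sigma^2) || N(0, I) ) for a Gaussian posterior
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl
```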
After the data pass through the VAE, in order to eliminate the correlation between implicit and explicit privacy attributes, we adopt an improved GAN, which consists of one generator G (the decoder of the VAE) and two discriminators D_1 and D_2. G generates x̂_p ∼ P_g from a prior noise distribution P_z to match P_r. D_1 is a classifier that distinguishes a real sample x_p from a generated fake sample x̂_p = G(E(x_p)). Here, s is a binary attribute; therefore, the discriminator D_2 is also a binary classifier, which is an important part of eliminating the correlation between implicit and explicit privacy attributes. The specific game process of G, D_1, and D_2 will be given in detail later. Our improved CGAN sub-module in the implicit privacy protection module is therefore formalized as a minimax game whose value function is given by:

$$\min_{G}\max_{D_1, D_2}\; V(G, D_1, D_2) = V_1(G, D_1) + \lambda\, V_2(G, D_2),$$

where

$$V_1(G, D_1) = \sum_{s \in \{0, 1\}} \Big( \mathbb{E}_{x_p \sim P_r(x_p \mid s)}\big[\log D_1(x_p \mid s)\big] + \mathbb{E}_{\hat{x}_p \sim P_g(x_p \mid s)}\big[\log\big(1 - D_1(\hat{x}_p \mid s)\big)\big] \Big),$$

$$V_2(G, D_2) = \mathbb{E}_{\hat{x}_p \sim P_g(x_p \mid s = 1)}\big[\log D_2(\hat{x}_p)\big] + \mathbb{E}_{\hat{x}_p \sim P_g(x_p \mid s = 0)}\big[\log\big(1 - D_2(\hat{x}_p)\big)\big].$$

The hyperparameter λ is a trade-off coefficient, which is used to balance the data utility and the privacy level of the generated data. Similar to the traditional CGAN, the value function V_1 indicates that G and D_1 play a zero-sum game: D_1 learns to accurately distinguish between generated samples and real samples, while G learns to generate fake samples similar to real data to fool D_1. In order to make the generated samples contain as little information as possible that supports predicting the value of the explicit privacy attribute, the second value function V_2 shows that D_2 and G also play a zero-sum game: D_2 learns to accurately predict the value of s, while G learns to fool D_2.
The total objective function of the implicit privacy protection module can be formalized as follows:

$$\min_{E, G}\max_{D_1, D_2}\; V(E, G, D_1, D_2) = V_1(G, D_1) + \lambda\, V_2(G, D_2) + V_{\mathrm{VAE}}(E, G).$$

When the reconstruction errors between the real samples and the generated samples in the VAE are within an acceptable range and the implicit privacy protection module converges, the synthetic samples generated by G approximately obey the distribution of real samples, and the correlation between explicit and implicit privacy attributes is eliminated as much as possible.

Algorithm
Algorithm 1 displays the pseudo code of the implicit privacy detection module. Algorithm 2 displays the pseudo code of the implicit privacy protection module.
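For concreteness, the following Python sketch reconstructs the detection procedure of Algorithm 1 from the text: NMI screening with threshold θ, followed by the classifier check τ(f(Pset), s) ≥ β. The gradient-boosting classifier and all names are illustrative assumptions, and categorical attributes are assumed to be label-encoded.

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import normalized_mutual_info_score
from sklearn.model_selection import cross_val_score

def detect_implicit_privacy(df, s_col, theta, beta):
    """Sketch of Algorithm 1: return the set of implicit privacy attributes
    for the explicit privacy attribute `s_col`, or an empty set."""
    # Screen attributes whose NMI with s is at least theta
    pset = [c for c in df.columns
            if c != s_col
            and normalized_mutual_info_score(df[c], df[s_col]) >= theta]
    if not pset:
        return []
    # Keep the set only if a classifier f predicts s well enough: tau >= beta
    f = GradientBoostingClassifier()
    tau = cross_val_score(f, df[pset], df[s_col], scoring="accuracy").mean()
    return pset if tau >= beta else []
```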
Firstly, we sample a minibatch of samples from the output of E and a minibatch of noise samples from P_z to train D_1 and G (from Line 2 to 7). Secondly, we sample a minibatch of samples (x_p | s = 0) ∼ P_g(x_p | s = 0) and another minibatch of samples (x_p | s = 1) ∼ P_g(x_p | s = 1) to train D_2 and G (from Line 8 to 10), and then update E and G according to Eq (8) (Line 11). Finally, when the model converges, we obtain a generator G that produces data without implicit privacy.
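The following self-contained PyTorch sketch illustrates one iteration of this game with toy linear models; the dimensions, model architectures, and reconstruction term are illustrative assumptions, and the optimizer steps and the conditioning of G and D_1 on s are omitted for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy dimensions and models, for illustration only
d_in, d_lat = 8, 4
E  = nn.Linear(d_in, d_lat)                               # encoder
G  = nn.Sequential(nn.Linear(d_lat, d_in), nn.Sigmoid())  # generator / decoder
D1 = nn.Sequential(nn.Linear(d_in, 1), nn.Sigmoid())      # real vs. generated
D2 = nn.Sequential(nn.Linear(d_in, 1), nn.Sigmoid())      # predicts s
bce, lam = nn.BCELoss(), 1.0

x_p = torch.rand(32, d_in)                  # minibatch of implicit privacy attributes
s   = torch.randint(0, 2, (32, 1)).float()  # binary explicit privacy attribute

# (1) D1/G game: D1 separates real samples from generated ones
x_hat   = G(E(x_p))
d1_loss = bce(D1(x_p), torch.ones(32, 1)) + bce(D1(x_hat.detach()), torch.zeros(32, 1))
g1_loss = bce(D1(x_hat), torch.ones(32, 1))

# (2) D2/G game: D2 predicts s from generated samples, G tries to fool it
d2_loss = bce(D2(x_hat.detach()), s)
g2_loss = -lam * bce(D2(x_hat), s)          # G maximizes D2's prediction error

# (3) VAE-style reconstruction term couples E and G
recon   = F.mse_loss(x_hat, x_p)
g_total = g1_loss + g2_loss + recon         # used to update E and G
```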

Theoretical analysis
Different from the traditional GAN, our proposed implicit privacy protection module in IMPOSTER adds an additional discriminator to protect implicit privacy. In addition, we introduce a VAE to capture the element-wise errors between synthetic samples and real samples, so as to make them as close as possible. Therefore, we give a theoretical analysis of the convergence of the implicit privacy protection module when the reconstruction errors between real samples and generated samples in the VAE are within an acceptable range.

Proposition 1. Given a fixed encoder E and a fixed generator G, the optimal discriminators D_1^* and D_2^* can be formalized as follows:

$$D_1^{*}(x_p \mid s) = \frac{P_r(x_p \mid s)}{P_r(x_p \mid s) + P_g(x_p \mid s)}, \qquad D_2^{*}(x_p) = \frac{P_g(x_p \mid s = 1)}{P_g(x_p \mid s = 1) + P_g(x_p \mid s = 0)}.$$

Note that the training objective for D_1 is to estimate whether x_p comes from P_r or P_g, while D_2 is used to eliminate the association between the explicit privacy attribute s and the implicit privacy attributes x_p. Given a fixed encoder E and the optimal discriminators D_1^* and D_2^*, Eq (13) can be rewritten as:

$$C(G) = -(2 + \lambda)\log 4 + 2\,\mathrm{JS}\big(P_r(x_p \mid s)\,\|\,P_g(x_p \mid s)\big) + 2\lambda\,\mathrm{JS}\big(P_g(x_p \mid s = 0)\,\|\,P_g(x_p \mid s = 1)\big) + V_{\mathrm{VAE}}(E, G).$$
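For intuition, the optimal discriminators follow from the standard pointwise maximization used in GAN analysis; the following is a sketch of that step, not the paper's exact derivation:

$$\max_{D}\; a\log D + b\log(1 - D) \;\;\Longrightarrow\;\; D^{*} = \frac{a}{a + b}.$$

Applying this pointwise under the integral with a = P_r(x_p | s) and b = P_g(x_p | s) gives D_1^*, and with a = P_g(x_p | s = 1) and b = P_g(x_p | s = 0) gives D_2^*; substituting them back into the value function yields the Jensen-Shannon form of C(G) above.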
The detailed derivation process is described in the Appendix of the Supporting Information. The objective function of the VAE includes a cross-entropy term and a Kullback-Leibler divergence term. For Eq (16), since the Jensen-Shannon divergence, the Kullback-Leibler divergence, and cross entropy are convex functions [51], C(G) can converge to a global minimum. Therefore, for C(G), we give the following theorem:

Theorem 1. Given a fixed encoder E and the optimal discriminators D_1^* and D_2^*, there exists a global minimum of the function C(G).
Proof. The detailed proof is described in the Appendix of the Supporting Information.

Experimental evaluation
We evaluate the effectiveness of our proposed framework IMPOSTER from the following aspects: (1) whether the generated synthetic data eliminate the correlation between explicit privacy attributes and implicit privacy attributes; (2) whether the generated synthetic data preserve good data utility; (3) parameter sensitivity analysis.

Dataset
We evaluate our proposed privacy-preserving framework IMPOSTER on a real-world dataset. We present some details and statistics of the dataset as follows:

UCI Adult dataset. The dataset contains 48 842 instances. Each instance contains 7 numerical and 7 categorical variables, and the class label represents whether the annual income exceeds $50k.

Evaluation metrics
This paper adopts several metrics to verify the performance of our framework. These metrics are listed as follows.
Accuracy. Accuracy measures the proportion of correctly predicted samples among all predicted samples and can be expressed as follows:

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN},$$

where TP refers to true positives, FP refers to false positives, TN refers to true negatives, and FN refers to false negatives.
F1-score. Both precision and recall are important performance metrics in classification problems. However, they are often in tension: in general, the higher the recall, the lower the precision. In order to consider these two indicators comprehensively, we adopt the F1-score to measure the prediction performance of classifiers. The F1-score is the harmonic mean of precision and recall and can be given by:

$$F1 = \frac{2 \times \mathrm{precision} \times \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}},$$

where $\mathrm{precision} = \frac{TP}{TP + FP}$ and $\mathrm{recall} = \frac{TP}{TP + FN}$.
To evaluate both correlation elimination and data utility, we consider two experimental settings:
• Setting A: classifiers are trained on real samples and tested on real samples.
• Setting B: classifiers are trained on generated samples and tested on generated samples.
We are mainly concerned with two comparisons. On the one hand, compared with setting A, if the classifiers trained on synthetic data have poor prediction performance for the explicit privacy attribute on synthetic data (setting B), our framework IMPOSTER is able to eliminate the correlation between implicit and explicit privacy attributes. On the other hand, compared with setting A, if the classifiers trained on synthetic data have good prediction performance for the class label on synthetic data (setting B), the generated synthetic data can capture the corresponding relationship between attributes and labels, and the association between attributes, that is, data utility.
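As an illustration of settings A and B (not the paper's exact evaluation code), the following sketch trains and tests a classifier on whichever dataset it is given; `GradientBoostingClassifier`, the split ratio, the placeholder names `real_df`, `generated_df`, and `implicit_attrs`, and the assumption of numerically encoded features with a binary 0/1 target are all illustrative.

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

def evaluate_setting(df, target_col, feature_cols):
    """Train and test a classifier on one dataset: pass the real data for
    setting A and the IMPOSTER-generated data for setting B."""
    X_train, X_test, y_train, y_test = train_test_split(
        df[feature_cols], df[target_col], test_size=0.3, random_state=0)
    clf = GradientBoostingClassifier().fit(X_train, y_train)
    pred = clf.predict(X_test)
    return accuracy_score(y_test, pred), f1_score(y_test, pred)

# Setting A: evaluate_setting(real_df, "gender", implicit_attrs)
# Setting B: evaluate_setting(generated_df, "gender", implicit_attrs)
```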

Correlation elimination
In this subsection, we conduct elaborate experiments to illustrate whether our framework IMPOSTER is able to eliminate the correlation between implicit and explicit privacy attributes. Specifically, we first use the implicit privacy detection module to explore the correlation between the explicit privacy attribute and other attributes in the original dataset. Figure 6 illustrates the normalized mutual information between corresponding attributes in Adult. Here, we treat "gender" as an explicit privacy attribute and set θ = 0.01. We then adopt implicit privacy attributes to construct classifiers to infer the attribute "gender" on setting A and B. From Table 1, we can observe that, compared with setting A, all the classifiers trained on synthetic data generated by IMPOSTER have a significantly decreased performance in predicting the explicit privacy attribute (setting B). Therefore, our framework IMPOSTER is able to eliminate the correlation between implicit and explicit privacy attributes.

Data utility
In order to evaluate whether our framework can guarantee data utility, we use the synthetic dataset generated by IMPOSTER to predict the class label "income" on setting A and B. From Table 1, we can see that, compared with setting A, even though the performance of all classifiers trained on setting B in predicting the class label decreases slightly, the prediction accuracy of training on data generated by IMPOSTER is higher than 81%. This result reflects that the synthetic samples generated by IMPOSTER have captured the relationship between attributes and class labels well. Therefore, our framework can guarantee data utility while protecting implicit privacy.

Table 1. Accuracy and F1-score of predicting "gender" (the explicit privacy attribute) and "income" (the class label) in classifiers on setting A and B.

Comparative experiment
To verify the superiority of IMPOSTER, we use the traditional privacy protection method kRR, a state-of-the-art and representative mechanism in differential privacy, to protect implicit privacy. Specifically, given a privacy budget ε, we keep the value of each implicit privacy attribute unchanged with high probability and flip it to another value in the same attribute value space with low probability. Then, we use the data disturbed by kRR to train a GBDT classifier to predict the class label and the explicit privacy attribute, respectively. Figure 7 shows the prediction performance of the data perturbed by the comparative kRR mechanism on the class label and the explicit privacy attribute, where the x-axis denotes different values of the privacy budget ε and the y-axis denotes F1-score and accuracy. From Figure 7, we can observe that, as the privacy budget ε changes from 1 to 100, the prediction performance of the disturbed implicit privacy attributes for the class label and the explicit privacy attribute increases together. Comparing Table 1 and Figure 7, when the accuracy of predicting the explicit privacy attribute reaches 68.91% on the data disturbed by kRR, the accuracy of predicting the class label is only 75.17%, whereas the accuracy of predicting the explicit privacy attribute and the class label is 67.26% and 82.29%, respectively, on the data generated by IMPOSTER. We can conclude that the highest prediction performance for "gender" and "income" does not exceed the prediction performance on the original data, either on the data disturbed by the kRR mechanism or on the data generated by IMPOSTER. However, compared with the kRR mechanism, our proposed framework maintains better data utility while providing a comparable privacy-preserving level.

Parameter sensitivity analysis
In this part, we evaluate the sensitivity of the parameters θ and λ, respectively. Except for the parameter being explored, all the other parameters take default values. As a correlation threshold, a larger θ indicates that each selected attribute has a higher correlation with the explicit privacy attribute. We first obtain the candidate set of implicit privacy attributes by varying θ and then train an XGBoost classifier to evaluate the accuracy and F1-score of predicting "gender" (the explicit privacy attribute) and "income" (the class label) on settings A and B, as shown in Table 2. We can observe that, with the decrease in θ, the accuracy and F1-score of predicting "gender" and "income" tend to increase on setting A, i.e., smaller values of θ offer higher accuracy and F1-score. On setting B, however, with the decrease in θ, the accuracy and F1-score of predicting "gender" remain lower than 67% and 46%, respectively, while the accuracy and F1-score of predicting "income" are only slightly lower than those on setting A under the same θ. Therefore, our framework IMPOSTER can maintain good data utility while protecting implicit privacy effectively.
The trade-off coefficient λ is an important parameter of IMPOSTER, which is used to balance the data utility and privacy level of synthetic data. We evaluate how the parameter λ affects the synthetic datasets generated by IMPOSTER along two dimensions: correlation elimination and data utility. For correlation elimination, when λ becomes larger, the framework IMPOSTER tends to generate synthetic data with a lower correlation between implicit and explicit privacy attributes. In essence, the game of G and D_1 is to generate synthetic data similar to the original data, including capturing the correlation between attributes, which limits the extent to which IMPOSTER can eliminate the correlation. Therefore, when the parameter λ increases beyond a certain degree, the accuracy and F1-score fluctuate within a certain interval. This can be observed in Figure 8a, where we train an XGBoost classifier on the synthetic data generated by IMPOSTER to predict the explicit privacy attribute "gender". For data utility, we adopt the synthetic data generated by IMPOSTER to construct an XGBoost classifier to predict the class label "income". Figure 8b shows the performance curves of predicting the class label "income" with different λ. From Figure 8b, we can observe that, with the increase of λ, accuracy and F1-score remain relatively steady with only a slight fluctuation. The observations from Figures 8a and 8b illustrate that IMPOSTER can remove the correlation between implicit and explicit privacy attributes while preserving data utility. Note that when λ is around 1, our framework IMPOSTER achieves the best performance, and when λ = 0, it degenerates to a CGAN equipped with a VAE, which cannot be used to remove the correlation between implicit and explicit privacy attributes.

Conclusion and future work
This paper addresses a special and imperceptible class of privacy, called implicit privacy, and then proposes an ex-ante implicit privacy-preserving framework based on data generation, called IMPOSTER, which consists of implicit privacy detection and protection modules. Specifically, the former uses normalized mutual information to detect attributes strongly related to privacy attributes. The latter equips the standard GAN framework with an additional discriminator, which is used to eliminate the association between explicit and implicit privacy attributes. Experimental results demonstrate that IMPOSTER can learn a generator producing data without implicit privacy while preserving good data utility.
In future work, on the one hand, we will adopt the Rényi entropy [58] to explore the correlation between multi-attributes and explicit privacy attributes, which is an open and interesting question. On the other hand, we will apply the proposed IMPOSTER framework to address the implicit privacy issue of time series data in the financial risk control scenario, generate data that eliminate the implicit privacy as much as possible to meet user expectations and regulatory requirements, and replace the real data with the generated data to train the financial anti-fraud model and improve the robustness of financial risk control systems.