Commit 43bcc56

tweak syndiffix text a little
1 parent 0695782 commit 43bcc56

1 file changed: +14 -15 lines changed

_pages/syndiffix.md

Lines changed: 14 additions & 15 deletions
@@ -5,18 +5,18 @@ permalink: /syndiffix
 nav: true
 ---

-**SynDiffix** is an open-source Python package for generating statistically-accurate and strongly anonymous synthetic data from structured data. Compared to existing open-source and proprietary commercial approaches, **SynDiffix** is
+**SynDiffix** is an open-source Python package for generating statistically-accurate and strongly anonymous synthetic data from structured data. Compared to existing open-source and proprietary commercial approaches, SynDiffix is

 - many times more accurate,
 - has comparable or better ML efficacy,
 - runs as fast or faster,
 - has stronger anonymization.

-A complete description of **SynDiffix**, including its operation, performance, and anonymity, can be found on [arXiv](https://arxiv.org/abs/2311.09628). See [github.com/diffix/syndiffix](https://github.com/diffix/syndiffix).
+A complete description of SynDiffix, including its operation, performance, and anonymity, can be found on [arXiv](https://arxiv.org/abs/2311.09628). See [github.com/diffix/syndiffix](https://github.com/diffix/syndiffix).

 {% include image.html href="/syndiffix-mostlyai-ctgan" src="/assets/img/compare-link.png" alt="SynDiffix usage style" max_width="500px" %}

-Programming with **SynDiffix** can be as easy as:
+Programming with SynDiffix can be as easy as:

 ```py
 from syndiffix import Synthesizer
@@ -26,37 +26,36 @@ df_synthetic = Synthesizer(df_original).sample()

 ## Use Cases

-Synthetic data has two primary use cases:
+The high accuracy of SynDiffix makes it a good choice for descriptive analytics --- histograms, heatmaps, column correlations, basic statistics like counting, averages, standard deviations, and so on. SynDiffix works well for accurately capturing the statistics of both time-series data (events like transactions and mobility) and non-time-series data (demographics, surveys).

-1. Descriptive analytics (histograms, heatmaps, column correlations, basic statistics like counting, averages, standard deviations, and so on).
-2. Machine learning (building models, extending datasets, etc.)
+SynDiffix is ideal for releasing accurate data statistics while strongly protecting anonymity. It is far easier to use than K-anonymity and Differential Privacy, and far more accurate than other synthetic data methods for descriptive analytics.

-While **SynDiffix** serves both use cases, it is especially good at descriptive analytics. The quality of descriptive analytics is many times that of other synthetic data products.
+SynDiffix can also be used for machine learning use cases (building models, extending datasets, etc.). Its ML models are on a par with other synthetic data methods, but it is somewhat less easy to use for ML applications.

 ## Usage

-Obtaining this accuracy improvement, however, requires a different usage style compared to other products. The intended usage style of other products is "*one size fits all*": a single monolithic synthetic dataset serves all use cases. It's like expecting a vehicle to be good at city maneuverability, highway cruising, and hauling lumber.
+Obtaining its accuracy improvement requires a different usage style compared to other products. The intended usage style of other products is "*one size fits all*": a single monolithic synthetic dataset serves all use cases. It's like expecting a vehicle to be good at city maneuverability, highway cruising, and hauling lumber.

-By contrast, with **SynDiffix**, a different *tailored* synthetic dataset can be produced for each use case: like having a Smart car for the city, a Mercedes S-class for the highway, and a Ford F-150 for hauling stuff.
+By contrast, with SynDiffix, a different *tailored* synthetic dataset can be produced for each use case: like having a Smart car for the city, a Mercedes S-class for the highway, and a Ford F-150 for hauling stuff.

 {% include image.html src="/assets/img/usage.png" alt="SynDiffix usage style" max_width="550px" %}

-For instance, suppose the analyst is interested in a heatmap with columns C and E. With other synthetic data products, one would synthesize the complete table, and then make the heatmap with only columns C and E. With **SynDiffix**, one would create a synthetic table consisting of only those two columns and obtain much better results.
+For instance, suppose the analyst is interested in a heatmap with columns C and E. With other synthetic data products, one would synthesize the complete table, and then make the heatmap with only columns C and E. With SynDiffix, one would create a synthetic table consisting of only those two columns and obtain much better results.
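
The tailored style described for columns C and E could look roughly like the sketch below. The helper `synthesize_columns` and its parameter names are hypothetical and not part of the package. The sketch assumes only that the synthesizer class is constructed from a DataFrame and exposes `.sample()`, as in the snippet near the top of the page.

```py
def synthesize_columns(df_original, columns, synthesizer_cls):
    """Build a synthetic table from only the requested columns.

    Hypothetical helper: `synthesizer_cls` is assumed to behave like
    syndiffix's `Synthesizer` (constructed from a DataFrame, with `.sample()`).
    """
    columns = list(columns)
    # Fail early if a requested column is not in the original table.
    missing = [c for c in columns if c not in df_original.columns]
    if missing:
        raise KeyError(f"columns not in original table: {missing}")
    # Synthesize the column subset on its own, rather than synthesizing
    # the whole table and discarding the other columns afterwards.
    return synthesizer_cls(df_original[columns]).sample()
```

With the package installed, the heatmap table would then come from `synthesize_columns(df_original, ["C", "E"], Synthesizer)`, which amounts to `Synthesizer(df_original[["C", "E"]]).sample()`.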

-**SynDiffix** has anonymization mechanisms that allow for literally thousands of column combinations without compromising anonymity. This is not the case with other products, where anonymity is weakened with each new data synthesis.
+SynDiffix has anonymization mechanisms that allow for literally thousands of column combinations without compromising anonymity. This is not the case with other products, where anonymity is weakened with each new data synthesis.

 ## Accuracy

-Here are three scatter plots showing the synthetic and real points for a 2-column dataset for **SynDiffix**, the commercial product MostlyAI, and the open-source implementation of CTGAN by SDV.
+Here are three scatter plots showing the synthetic and real points for a 2-column dataset for SynDiffix, the commercial product MostlyAI, and the open-source implementation of CTGAN by SDV.

 {% include image.html src="/assets/img/scatter.png" alt="SynDiffix accuracy" max_width="600px" %}

-The black dots are the original data and the blue dots are the synthetic overlaid on the original data. SynDiffix is far more accurate. More examples can be found in the [arXiv paper](https://arxiv.org/abs/2311.09628).
+The black dots are the original data and the blue dots are the synthetic overlaid on the original data. SynDiffix is far more accurate. More examples can be found [here](/syndiffix-mostlyai-ctgan) or in the [arXiv paper](https://arxiv.org/abs/2311.09628).

 ## Anonymity

-We measured the anonymity of **SynDiffix** and a number of other products using the [Anonymeter](https://github.com/statice/anonymeter) tool developed by Statice. The following figure shows the results of measuring the effectiveness of 100s of attacks over multiple different tables (again, see the [arXiv paper](https://arxiv.org/abs/2311.09628) for details). Any score below 0.5 can be regarded as having strong anonymity, and scores below 0.2 are very strong. The "noAnon" plot measures attacks against the original data, and is there for calibration.
+We measured the anonymity of SynDiffix and a number of other products using the [Anonymeter](https://github.com/statice/anonymeter) tool developed by Statice. The following figure shows the results of measuring the effectiveness of 100s of attacks over multiple different tables (again, see the [arXiv paper](https://arxiv.org/abs/2311.09628) for details). Any score below 0.5 can be regarded as having strong anonymity, and scores below 0.2 are very strong. The "noAnon" plot measures attacks against the original data, and is there for calibration.

 {% include image.html src="/assets/img/privacy.png" alt="SynDiffix privacy" max_width="400px" %}

-As these results show, **SynDiffix** and most of the other products have very strong anonymization with respect to the attack used by Anonymeter.
+As these results show, SynDiffix and most of the other products have very strong anonymization with respect to the attack used by Anonymeter.
