In our last post, we discussed how to determine how many people drink pumpkin spice lattes in a given time period without learning their identifying information. But say, for example, you would like to know the total amount spent on pumpkin spice lattes this year, or the average price of a pumpkin spice latte since 2010. You’d like to detect these trends in data without being able to learn identifying information about specific customers to protect their privacy. To do this, you can use summation and average queries answered with differential privacy.
In this post, we will move beyond counting queries and dive into answering summation and average queries with differential privacy. Starting with the basics: in SQL, summation and average queries are specified using the SUM and AVG aggregation functions:
SELECT SUM(price) FROM PumpkinSpiceLatteSales WHERE year = 2020
SELECT AVG(price) FROM PumpkinSpiceLatteSales WHERE year > 2010
In Pandas, these queries can be expressed using the sum() and mean() functions, respectively. But how would we run these queries while also guaranteeing differential privacy?
In our previous post we introduced the concept of sensitivity — how much a query’s output changes when a row is added or removed in the database — and analyzed the sensitivity of counting queries. When a row is added to the database, the result of a counting query can increase by at most 1. With this simple bound on sensitivity, it’s easier to answer counting queries with differential privacy. Computing sums and averages is slightly more involved.
So, how much does a summation query change when a row is added to a database? It depends on the row! If someone spends $1 on a pumpkin spice latte, then the increase in the sum will be $1. If someone spends $10,000, the sum will increase much more.
To achieve differential privacy, we need an upper limit on the largest possible increase we could see when a row is added. For our latte query, that means we need an upper limit on the price of a pumpkin spice latte. This is a big challenge — no matter what limit we set, hypothetically, there may be a cafe somewhere that charges more than the limit.
The solution to this problem is called clipping. The idea is to enforce an upper limit, instead of assuming one. Lattes that cost more than the limit are clipped so that their price is equal to the limit. After clipping, we are guaranteed that all values in our database lie between the lower and upper limits we set. Clipping is a common operation in data analysis; Pandas even has a built-in function for it.
Since clipping enforces guaranteed lower and upper bounds on the data, we can use those bounds to determine sensitivity. If we clip the data so that lattes cost at most $10, then the largest increase we will see in the output of our summation query is $10 when we add a single latte sale to the database. Having bounded the sensitivity, we can add noise to the query’s output using the Laplace mechanism and achieve differential privacy.
The clipping parameters (the upper and lower limits) are extremely important for accuracy. If the upper limit is too high, we’ll add a lot of unnecessary noise. If it’s too low, then we’ll lose information that was present in the data by modifying too many of the data points.
The general rule of thumb is to set the upper limit as small as possible, while also high enough that very few data points are modified during clipping.
Unfortunately, bounding the sensitivity of average queries is even more difficult than it is for summation queries. How much does an average change if we add a row to the database? In addition to the upper limit on the data values themselves, the answer depends on how many things we are averaging. If we’re averaging five numbers, adding one more number might change the average by quite a bit. If we’re averaging 5 million numbers, adding one more probably won’t change the average very much.
As a general rule, however, the sensitivity of a query shouldn’t depend on the data. Otherwise, the sensitivity might itself be sensitive — meaning that it might reveal something about the data. This adds another level of complexity to bounding the sensitivity of averages.
A simple and effective solution for answering an average query using differential privacy is by taking an indirect approach: split the query into two separate queries, a summation query and a counting query. We will use the approach described above for the summation query, and the approach described in our last post for the counting query. To split our example query, we compute instead the two following queries:
SELECT SUM(price) FROM PumpkinSpiceLatteSales WHERE year > 2010
SELECT COUNT(*) FROM PumpkinSpiceLatteSales WHERE year > 2010
The average is then computed by dividing the result of the first query by the result of the second. Even if we compute each query using differential privacy, how do we know that the final average is differentially private? We don’t actually need to take any extra differential privacy steps; the result will automatically satisfy privacy. This follows from the post-processing property of differential privacy — to extend the list of properties we began in our first post — which promises that combining differentially private results in any way will also be differentially private (and with no extra effort!).
The practical systems for differential privacy referenced in the last post also provide support for sums and averages. Both Google’s differential privacy library and IBM’s Diffprivlib provide algorithms that compute differentially private bounded sums, leveraging clipping as described above. Both libraries also provide utilities for helping analysts to determine the bounds automatically.
The libraries additionally provide solutions for calculating averages. Google’s library also provides a more sophisticated algorithm, relative to the “sum divided by count” approach we’ve described above, which can sometimes give better accuracy.
Both Google’s differential privacy library and IBM’s Diffprivlib are indexed in NIST’s Privacy Engineering Collaboration Space, where you are invited to share feedback or use cases related to these tools.
Coming Up Next
So far, we’ve looked at the basics of answering counting, summation, and average queries with differential privacy. In these scenarios, we’ve assumed that you have a single question about a dataset. What if you’re interested in a more sophisticated analysis of your data, with a set of different questions, to learn about various trends from different perspectives? For example, you may want to know the average number of lattes purchased per age group, and also separately per geographic region. In the next post, we’ll welcome a guest author from academia to walk us through considerations and tools available to conduct this type of analysis with differential privacy.