Waldo: A Private Time-Series Database from Function Secret Sharing
About our Guest Speaker
Soroco invited Emma Dauterman, a current 4th year Ph.D. student studying computer science in UC Berkeley’s RISELab where she is advised by Raluca Ada Popa and Ion Stoica. Emma’s work throughout her Ph.D. has focused broadly in building secure systems using cryptography, published in IEEE and ACM conferences such as S&P (Oakland), SOSP, OSDI, and others. We invited Emma to give a talk at Soroco as her work relates very closely to technologies we build that must be both secure and scalable.
Why This Talk at Soroco
The work that Emma presented at Soroco, called Waldo, is a private time-series database focused on how to protect the privacy of user data stored in it by cryptographically securing their data, access patterns of their data, as well the query filter values. Particularly, the work also focused on how to support multi-predicate filtering which is common in database queries. All of this was published in S&P (Oakland) in “Waldo: A Private Time-Series Database from Function Secret Sharing.”
At Soroco, we build the work graph that helps organizations understand how digital work gets done. The work graph is a connected sequence of steps that teams execute. It is, in essence, a map of how teams execute digital work, and it lies at the intersection of people, work, and technology. Once discovered, the work graph enables teams to collaborate and work more effectively.
Since the work graph is a sequence of steps that teams execute and sourced from the activities that the teams perform, a major role of the work graph’s design and information access is to ensure end-user privacy is protected. How we protect end-user privacy is and will always be a focus of our system design. For these reasons, we invited Emma to give the talk so that we could learn more about furthering cryptographic storage and privacy.
Watch the Talk
Powerful ideas from Waldo
As presented in Waldo’s system design, it is focused on a time-series database design that supports write-intensive workloads that have a high ratio of updates (‘appends’ being what is supported), where multiple features and multiple predicates are supported while keeping the data cryptographically protected. There are two types of clients in the system, data producers and queriers (or clients that are both). Data producers collect real-time data and update server state with it. Data queriers query the data collected stored on the server.
The author’s work around Waldo first leverages distributed trust through multiple server deployments. For example, by storing the data in three servers present in three different trust domains. Therefore, if the data is compromised in one trust domain the other two trust domains could still be used to access the correct data. Practically, that means that these servers should be deployed in different clouds and managed by different organizations. By distributing information across these servers, if a majority of the servers are honest then a single malicious server cannot learn the data contents, query filter values, or any search access patterns. Clients need to send messages directly to each of the servers, distributing the information and trust. At Soroco, We think that this follows a powerful design pattern to protect against single or malicious compromised servers. However, it does increase system design complexity and cost. Therefore, when to leverage the distributed trust framework depends on what data is important.
Encrypting query at client and decrypting at server using Function Secret Sharing:
Waldo suggests the use of a cryptographic technique called Function Secret Sharing (FSS) to generate FSS keys for the query on the client side and evaluate it at the server. Combining this technique with replicated secret sharing ensures that the malicious attacker is not able to determine the access pattern or the filter values used to access the result. In the experiment setting, there is a medical practitioner who queries the data across two data servers using FSS keys generated at client side and the server returns the shares of data. These shares of data are again aggregated at client side to produce the final output. Although Waldo’s complete protocol uses 3 servers, we are showing a simplified example with two below. Emma covers the complete protocol in both the paper and the recording of her talk above!
Using MAC key along with FSS key:
The Waldo system design further expands on the above idea by introducing the concept of using MAC keys to verify at client side whether the result from server is correct. Here the client is only checking whether the data is compromised on the server or not. In the experimental setting, the medical practitioner sends pair of keys to the server. Waldo uses MAC techniques originating in multi-party computation to provide these security guarantees.
Using Function Secret Sharing
To ensure that a malicious party is not able to access the access pattern of the data: let’s deep dive into the problem in Soroco’s context. Please note that this is a simplified example of the Waldo system, and we encourage reading the original paper and watching the video for further real-world applications! Suppose there is an individual in a data analyst role that wants to identify how many applications were accessed between two periods of time X and Y. The query in the work graph would be:
Query = “select count(distinct application) from workgraph where time between X and Y. “
In this simplified example and strawman-based approach, we will generate secure keys for this query, e.g., K1 and K2 using a Generate method.
K1, K2 <- Gen(Query).
Sticking with this simplified model, we can then take these keys K1, K2 and initiate a request to the data servers using one key each as shown below. At the server level, we can then use a method Eval and iterate across the entire dataset. Eval method will either produce zero or one for the index in the dataset. One is produced when K1 matches the index that the analyst is interested in and zero when the index does not match. If the index match is successful, then Eval returns a “share” of data corresponding to the index back to client machine (in this case, the analyst’s). In the final Waldo system design, however, instead of just 2 pairs of keys you would further scale this up to 6 pairs of keys for a total of 12 keys.
Within the client machine, the “share” or partial output from both the servers are aggregated to produce the final output required by the analyst.
Here the Eval method uses the database contents as the key to evaluate the K1 Key (FSS key) on the database. But the server cannot simply evaluate its FSS key on the database contents because the server should not be able to view the database contents. At the same time, the servers need to evaluate their keys on identical copies of the database to produce correct outputs. Further, the author proposed the use of a search index structure to store the time series data to reduce complexity during evaluation. This allows Waldo to leverage FSS using the structure rather than the context of the search index.
Additional Thoughts and Conclusions
Waldo is a unique piece of work that we believe can help further security and privacy in technology globally. As future work, we imagine a few areas to be strengthened. For example, Waldo implementation protects the access pattern while querying the data as well as the filter values used for querying.
There are several considerations at scale in Waldo’s design, which we think lead to future work in this area. Distributed trust by distributing the data and queries across multiple servers in different trusted domains (e.g., cloud providers) provides much stronger guarantees around privacy and protects against malicious servers but obviously comes at a high cost. That is around the additional server, storage, and maintenance costs to obtain that distributed trust. For these reasons, what data is put in this distributed trust framework may need to be limited to sensitive user information as opposed to simply migrating one’s entire database to this framework.
Additionally, performance in terms of computational requirements and additional latency introduced by the cryptographic storage and distributed trust framework also needs to be considered in the application of the data. Waldo’s evaluation has shown significant improvements on latency with multiple predicates as compared to prior work (e.g., MP-SPDZ and ORAM) and we would encourage further advancements in reducing these latencies to be able to scale with the kinds of queries and data sizes we see in our application of the work graph.
We encourage our readers to watch Emma’s talk and read the author’s articles of work that was published in the conferences.