Table Of Links
4 Identifying API Privacy-relevant Methods
5 Labels for Personal Data Processing
6 Process of Identifying Personal Data
7 Data-based Ranking of Privacy-relevant Methods
8 Application to Privacy Code Review
Conclusion, Future Work, Acknowledgement And References
8 Application to Privacy Code Review
This section outlines how our approach can be applied to privacy code reviews across a diverse set of 100 open-source applications. We then delve into detailed case studies of two popular software applications to illustrate the utility of our approach.
8.1 Large-scale Analysis
To understand the prevalence and types of personal data processing in real-world applications, we analyzed 100 open-source applications. These were equally divided between Java and JavaScript/TypeScript and were selected from GitHub’s daily top-starred repositories list³. We selected applications that are popular (top-starred), non-trivial (over 300K lines of code), and predominantly written in Java or JavaScript/TypeScript (constituting over 60% of the codebase).
Additionally, we ensured these applications differed from the 30 popular libraries analyzed previously and that their primary documentation language was English for easier identification of functionalities. This selection process resulted in a dataset that is representative of real-world software applications and suitable for our analysis of personal data processing practices.
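The selection criteria above can be expressed as a simple filter. The following is a minimal sketch, not the authors' tooling: the `Repo` record, its field names, and the sample repositories are hypothetical, and counting JavaScript and TypeScript together towards the 60% threshold is an assumption.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Set

@dataclass
class Repo:
    # Hypothetical record; field names are illustrative only.
    name: str
    lines_of_code: int
    language_share: Dict[str, float]  # language -> fraction of the codebase
    doc_language: str

# Thresholds taken from the text: over 300K lines, over 60% in a target language.
MIN_LOC = 300_000
MIN_SHARE = 0.60
# Assumption: JavaScript and TypeScript shares are summed as one group.
LANGUAGE_GROUPS = {"java": ("Java",), "js_ts": ("JavaScript", "TypeScript")}

def dominant_group(repo: Repo) -> Optional[str]:
    """Return the target language group that exceeds MIN_SHARE, if any."""
    for group, languages in LANGUAGE_GROUPS.items():
        share = sum(repo.language_share.get(lang, 0.0) for lang in languages)
        if share > MIN_SHARE:
            return group
    return None

def is_eligible(repo: Repo, excluded: Set[str]) -> bool:
    """Apply the size, language, overlap, and documentation criteria."""
    return (
        repo.lines_of_code > MIN_LOC
        and dominant_group(repo) is not None
        and repo.name not in excluded          # not one of the analyzed libraries
        and repo.doc_language == "English"
    )

# Invented sample data, for illustration only.
candidates = [
    Repo("app-a", 400_000, {"Java": 0.8}, "English"),
    Repo("app-b", 350_000, {"TypeScript": 0.5, "JavaScript": 0.2}, "English"),
    Repo("lib-c", 500_000, {"Java": 0.9}, "English"),
    Repo("app-d", 100_000, {"Java": 0.9}, "English"),
]
excluded_libraries = {"lib-c"}
selected = [r.name for r in candidates if is_eligible(r, excluded_libraries)]
print(selected)
```

In this sketch, `lib-c` is dropped because it overlaps with the previously analyzed libraries and `app-d` because it is too small.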
We then examined the proportion of methods in these applications that invoke privacy-relevant methods and are involved in the flow of personal data and Personally Identifiable Information (PII). Summary statistics of our findings are listed in Table 8.
Our findings indicate that our approach can make the privacy code review process more efficient. By identifying methods that are critical for personal data and PII processing, we help reviewers focus their efforts, enabling a more targeted review.
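To illustrate the kind of proportion reported here, the sketch below computes the share of application methods that directly call at least one privacy-relevant API. It is an assumption-laden simplification: the call map, the method names, and the API identifiers are invented, and the real analysis also tracks data flow, which this sketch does not.

```python
from typing import Dict, List, Set, Tuple

def privacy_relevant_share(
    call_map: Dict[str, List[str]],
    privacy_apis: Set[str],
) -> Tuple[List[str], float]:
    """Return the methods that call a privacy-relevant API and their share.

    call_map maps each application method to the APIs it invokes directly.
    """
    flagged = sorted(
        caller
        for caller, callees in call_map.items()
        if privacy_apis.intersection(callees)
    )
    share = len(flagged) / len(call_map) if call_map else 0.0
    return flagged, share

# Invented example data; the identifiers are placeholders, not real library calls.
call_map = {
    "registerUser": ["orm.create_user", "log.info"],
    "sendInvite": ["mail.send"],
    "formatDate": ["date.format"],
    "renderPage": [],
}
privacy_apis = {"orm.create_user", "mail.send"}

flagged, share = privacy_relevant_share(call_map, privacy_apis)
print(flagged, f"{share:.0%}")
```

A reviewer could start from the `flagged` list rather than from the full set of methods, which is the sense in which the review becomes more targeted.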
8.2 In-Depth Case Studies
We validate the effectiveness of our approach through two open-source projects: Signal Desktop⁴ and Cal.com⁵. Each offers unique insights for privacy code review. Both projects were chosen due to their popularity, sensitivity, and public availability. Their open codebases ensure transparency and reproducibility, making them ideal candidates to validate our approach.
By applying our approach to these carefully selected real-world projects, we provide concrete examples that demonstrate practical value in identifying key areas to focus on during privacy code reviews.
Signal Desktop. Signal Desktop is a well-known end-to-end encrypted messaging application, primarily written in TypeScript (79.5%) and JavaScript (15.6%), with about 360K lines of code. Its reputation for strong security and privacy features makes it a useful case for showing the depth of our approach. Although the application makes limited use of popular libraries, our approach highlighted a small number of privacy-relevant method invocations (48, approximately 0.5% of total methods) from our selected APIs and native libraries that are potentially linked to personal data processing.
In our analysis, Signal stands out for using its own encryption protocol (Signal Protocol) and message transmission services, relying only minimally on external libraries. This underscores Signal’s commitment to end-to-end encryption. Our categorization highlights the primary areas of Data Processing and Transformation (DPT), Network Communication (NC), and Data Encryption and Cryptography (DEC), with most encryption methods used for local encryption of profiles and group data. Signal’s own protocol, which encrypts chats and attachments, falls outside the scope of our analysis.
Our findings show that Signal rarely transmits PII directly to the internet. Instead, encrypted system data or anonymized IDs are mainly used, reflecting Signal’s dedication to user privacy. For privacy code reviewers examining Signal Desktop, our approach underscores Signal’s limited use of popular libraries for PII processing, aligning with its privacy-focused design philosophy. This categorization helps reviewers understand how Signal handles personal data, aiding in a more streamlined review process.
Cal.com. Cal.com, a scheduling application, is designed to give users comprehensive control over their schedules. Written entirely in TypeScript, it spans about 126K lines of code. Our method identified 371 privacy-relevant methods (approximately 3.8% of total methods) that might engage in personal data processing.
Applications such as Cal.com often employ diverse frameworks for specific functionalities. For instance, Cal.com’s utilization of the popular ORM framework, Prisma, for handling user profiles and credentials, aligns with our library list. In terms of categories, Data Processing and Transformation (DPT) topped the list at 26%, followed by Identity and Access Management (IAM) at 17%, and Network Communication (NC) at 15%. Unlike Signal Desktop, Cal.com heavily leverages libraries like Prisma, next-auth, and nodemailer for processing personal data, mirroring its primary functions of user registration, email interaction, and scheduling.
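A category breakdown like the one above can be derived by tallying the label assigned to each privacy-relevant invocation. The sketch below assumes each invocation already carries one category label (DPT, IAM, NC, or DEC, as named in the text). The label list is invented for illustration and does not reproduce the Cal.com data.

```python
from collections import Counter
from typing import List, Tuple

def category_distribution(labels: List[str]) -> List[Tuple[str, float]]:
    """Return (category, percentage) pairs, most frequent category first."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return []
    return [(category, 100.0 * n / total) for category, n in counts.most_common()]

# Invented labels, one per privacy-relevant invocation.
labels = ["DPT", "DPT", "DPT", "IAM", "IAM", "NC", "DEC", "DPT"]

distribution = category_distribution(labels)
for category, percent in distribution:
    print(f"{category}: {percent:.1f}%")
```

Sorting by frequency puts the dominant processing activity first, which is the ordering a reviewer would likely use when deciding where to look first.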
Approximately 97% of privacy-relevant methods invoked by Cal.com handle PII. This attests to the capability of our method in identifying PII processing methods and subsequently guiding code reviewers efficiently. Our approach highlights the extensive use of specific libraries in applications like Cal.com, aligning with their core features. This correlation boosts reviewers’ confidence and precision. By categorizing processing activities, it provides an overview of how the application handles personal data, helping reviewers prioritize effectively. This makes the review process time-efficient and thorough.
8.3 Threats to Validity
Our study’s validity may be affected by several factors. Selecting projects based on GitHub trends could bias the dataset towards popular topics, potentially overlooking a broader range of applications. The use of Semgrep for static analysis, though efficient, has not been thoroughly validated for precision, which could affect the accuracy of our results. Reliance on regular expression matching to identify personal data risks introducing both false positives and false negatives, which affects the reliability of the results.
Additionally, the absence of manual validation for each instance of personal data processing identified might lead to inaccuracies. Furthermore, focusing only on the top 25 libraries for Java and JavaScript due to resource constraints limits the generalizability of our findings, as other privacy-relevant methods in lesser-known libraries may have been missed.
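The risk from regular expression matching noted above can be made concrete with a small example. The pattern below is hypothetical and far simpler than anything a real study would use. It only shows how keyword matching on identifiers can both over-match and under-match.

```python
import re

# Hypothetical keyword pattern for personal-data identifiers (an assumption,
# not the pattern used in the study).
PERSONAL_DATA_PATTERN = re.compile(r"(email|phone|address|name)", re.IGNORECASE)

def looks_personal(identifier: str) -> bool:
    """Return True if the identifier contains a personal-data keyword."""
    return PERSONAL_DATA_PATTERN.search(identifier) is not None

# Identifier -> whether it actually refers to personal data (hand-labeled,
# invented ground truth for illustration).
examples = {
    "userEmail": True,    # matched and personal: true positive
    "fileName": False,    # matched via "name" but not personal: false positive
    "dob": True,          # date of birth, no keyword present: false negative
    "retryCount": False,  # not matched, not personal: true negative
}

for identifier, is_personal in examples.items():
    predicted = looks_personal(identifier)
    status = "ok" if predicted == is_personal else "misclassified"
    print(f"{identifier}: predicted={predicted}, actual={is_personal} ({status})")
```

Broadening the keyword list reduces misses such as `dob` but tends to increase matches such as `fileName`, which is why manual validation of matches matters.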
Authors:
- Feiyang Tang
- Bjarte M. Østvold
This paper is