IV. RESULTS AND DISCUSSION
Our hyperparameter tuning indicated that three is the best number of topics. Similarly, after analyzing pyLDAvis’s visualizations and the top keywords for 1 to 25 topics, the two reviewers also reached a consensus on three as the number of topics. The pyLDAvis interactive visualization for the three topics is available online.[6] The reviewers named the topics by considering the general themes of the top keywords returned by LDA (see Table I). We determined that the first topic is about digital certificates and configuration issues, the second is about programming issues concerning encryption and decryption, and the third concerns passwords/hashes and basic crypto-related algorithms. As a strong indicator of topic relevance, we observed that the frequencies of the candidate tags used in the three topics are aligned with the general themes of the topics.[7] For instance, the AES, DES, Encryption, and RSA tags are mostly used in programming issues; the Hash, SHA, SHA256, MD5, XOR, and Salt tags are more frequent in the password/hash topic; and the Digital-signature, Keystore, OpenSSL, Privatekey, Public-key, Smartcard, and X509certificate tags are more common in the digital certificate topic.
For stratified sampling, we treated each topic as a stratum and selected 139, 124, and 119 documents from the first, second, and third topics, respectively. The selected documents were created on Stack Overflow within the last five years. When extracting the themes, the two reviewers achieved a Kappa score of 79%, which indicates substantial agreement.
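As a minimal sketch of how an agreement score of this kind can be computed, the following Python snippet implements Cohen's kappa for two raters. The paper does not state which Kappa variant or tooling was used, so the choice of Cohen's kappa, the function name `cohen_kappa`, and the ten example labels are assumptions made purely for illustration.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(labels_a)
    # Observed agreement: share of items that received the same label.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, derived from each rater's marginal label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used one identical label for every item
    return (p_o - p_e) / (1 - p_e)


# Hypothetical theme labels for ten posts (not the study's data).
reviewer_1 = ["cert", "cert", "ssh", "ssh", "cert", "ssh", "cert", "cert", "ssh", "cert"]
reviewer_2 = ["cert", "cert", "ssh", "cert", "cert", "ssh", "cert", "ssh", "ssh", "cert"]
kappa = cohen_kappa(reviewer_1, reviewer_2)
print(f"kappa = {kappa:.2f}")
```

In this made-up example the reviewers agree on 8 of 10 posts, but the kappa value is lower than 0.8 because part of that agreement is expected by chance. On the commonly used Landis and Koch scale, values between 0.61 and 0.80 are labelled substantial.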
1) Topic One: Digital certificate and configuration problems. The manual analysis of the first topic shows that developers discussed two main areas, namely certificate/OpenSSL (63%) and SSH (37%). For instance, the discussions were related to OpenSSL configuration, signing and verifying a signature, and generating PEM files using OpenSSL. There were also questions concerning how to generate self-signed certificates, access a certificate store, create a Certificate Signing Request (CSR), establish HTTPS and secure connections, and configure certificate-based authentication in ASP.NET. In the SSH-related questions, the majority of users had difficulty setting up SSH without a password, setting the right permissions for SSH keys, using SSH programmatically, and connecting to SSH servers on other platforms (e.g., Amazon).
2) Topic Two: Programming issues. In the programming issues topic, the three most frequently discussed programming languages were Java (44), C/C++ (31), and C# (19). In 31% of the posts, developers discussed issues related to the AES algorithm, such as different encryption modes (e.g., CBC and ECB) and key sizes (e.g., 128, 192, and 256-bit). In addition to symmetric encryption, 47% of the posts were related to asymmetric encryption (i.e., RSA). The challenges mostly concerned different padding modes (e.g., OAEP), how to calculate or understand the raw modulus and exponent numbers, and how to generate and work with different key file encodings in RSA (e.g., DER-encoded format, PEM, or XML). Another evident problem was dealing with different RSA key formats, i.e., Public Key Cryptography Standards (PKCS). Users commonly asked how to convert PKCS#8 to PKCS#1 or other standards, and how to programmatically generate or use different key standards in various crypto libraries (e.g., Bouncy Castle). Some users had problems with illegal block size errors, often because they misunderstood the suitable usage of RSA, e.g., by encrypting a long text. These discussions were resolved by responses that suggested combining AES and RSA in the encryption/decryption scenario. Another type of question was about issues in Microsoft CryptoAPI (12%). Developers reported issues with OpenSSL or with using RSA keys from other sources, e.g., importing keys from OpenSSL into CryptoAPI, converting RSA keys to be used by Bouncy Castle, verifying an OpenSSL DSA signature using CryptoAPI, finding extra fields in keys generated by PHP OpenSSL, and signing a message with pyOpenSSL in Python and verifying it with CryptoAPI. Finally, some questions (10%) asked either how to implement a scenario, e.g., encrypting a string with an RSA public key in Swift on iOS, or how to deal with problems that arise when working with more than one crypto library or programming language, e.g., encrypting a string with RSA in JavaScript and decrypting it in Java, or decrypting a string in Java that was encrypted using AES-256 on iOS.
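The illegal block size errors mentioned above follow from a hard limit on how much data a single RSA operation can encrypt. The sketch below computes that limit from the message-length bounds in RFC 8017 (PKCS #1 v2.2); the function name `max_rsa_plaintext_bytes` and its parameters are hypothetical and do not belong to any particular crypto library.

```python
import hashlib


def max_rsa_plaintext_bytes(key_bits, padding="oaep", hash_name="sha256"):
    """Largest message, in bytes, that one RSA encryption can carry."""
    k = (key_bits + 7) // 8  # modulus length in bytes
    if padding == "oaep":
        h_len = hashlib.new(hash_name).digest_size
        limit = k - 2 * h_len - 2  # RSAES-OAEP bound (RFC 8017)
    elif padding == "pkcs1v15":
        limit = k - 11  # RSAES-PKCS1-v1_5 bound (RFC 8017)
    else:
        raise ValueError(f"unsupported padding: {padding}")
    return max(limit, 0)


for bits in (1024, 2048, 4096):
    print(bits, "bit key, OAEP/SHA-256:", max_rsa_plaintext_bytes(bits), "bytes")
```

Under these bounds, a 2048-bit key with OAEP and SHA-256 can encrypt at most 190 bytes, which is too little for a long text but enough for a 32-byte AES-256 key. This is the reasoning behind the suggested combination: encrypt the data with AES and encrypt only the AES key with RSA.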
3) Topic Three: Password/hashes and basic crypto algorithms. Our findings for the password/hash topic suggest that users primarily discussed problems associated with either passwords (86%) or basic crypto algorithms (14%). Most discussions concerned different facets of producing secure passwords. First and foremost, users were uncertain which hashing algorithms (e.g., MD5, SHA-1) provide a higher level of reliability and how password length contributes to the strength of the resulting hash. Users also lacked knowledge of what a salt is and how it can strengthen the security of a hash. In addition to pointing out the pros and cons of a static salt versus a random salt, respondents encouraged users to salt passwords in order to render brute-force or rainbow table attacks prohibitively expensive. Developers were unsure which crypto functions, i.e., bcrypt(), PBKDF2(), or Scrypt(), are more secure and faster, and what key differences distinguish these three functions from other hashing algorithms, e.g., MD5 and SHA-256. Regarding the basic crypto algorithms, users contributed to responses concerning how to produce or find prime numbers, how to use the BigInteger class for RSA modular exponentiation, how to produce unique URL-safe hashes or IDs, and how to solve a Caesar cipher or substitution ciphers. Lastly, a few users discussed how to program an authentication module in web programming frameworks such as Laravel or CakePHP.
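To illustrate the random-salt advice discussed in this topic, here is a minimal sketch of salted password hashing with PBKDF2 from Python's standard library. It is not taken from the analyzed posts; the function names, the 16-byte salt length, and the iteration count are assumptions chosen for illustration and would need tuning for a real deployment.

```python
import hashlib
import hmac
import os

ITERATIONS = 200_000  # assumed work factor, not a recommendation from the paper
SALT_BYTES = 16       # assumed salt length


def hash_password(password, iterations=ITERATIONS):
    """Return (salt, digest) for a password, using a fresh random salt."""
    salt = os.urandom(SALT_BYTES)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, iterations)
    return salt, digest


def verify_password(password, salt, expected, iterations=ITERATIONS):
    """Recompute the digest with the stored salt and compare in constant time."""
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, iterations)
    return hmac.compare_digest(digest, expected)


salt, stored = hash_password("correct horse battery staple")
print(verify_password("correct horse battery staple", salt, stored))
print(verify_password("guess", salt, stored))
```

Because every password gets its own random salt, two users with the same password end up with different digests, so a precomputed rainbow table cannot be reused across accounts. The iteration count is what makes each brute-force guess expensive, which is the main difference from a single pass of MD5 or SHA-256.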
4) Topic difficulty and popularity: We examined the popularity and difficulty of each topic to determine which questions attracted more attention or took longer to receive an acceptable answer, following the approach used in the previous study [7]. We used four factors to measure the popularity of a topic, namely the average number of views of documents, the average number of comments, the average number of favorites, and the average score of documents. The four factors can be found in the CSV files,[8] namely CommentCount, FavouriteCount, Score, and ViewCount. We considered the average ViewCount as the foremost factor in judging the popularity of a topic, the question’s Score and FavouriteCount as the second most important factors, and the average number of comments as the last factor. To find the most difficult topic, we used two factors, namely the average time it takes for a document to obtain an accepted answer, and the ratio of the average number of answers per document to the average number of views. To prevent recently posted questions from skewing the analysis, we included only questions older than six months.
We infer that questions related to the usage of digital certificates and configuration problems are the most popular (highest average ViewCount and FavouriteCount), and that questions related to hashing and passwords are also popular based on the other two factors (i.e., average CommentCount and Score). From the difficulty standpoint, the programming issues topic is the most difficult, as it had the greatest average response time and the lowest ratio of average answers to average views.
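The popularity and difficulty factors described above can be sketched as follows. This is an illustrative reconstruction, not the authors' script: the field names `AnswerCount`, `CreationDate`, and `AcceptedAnswerDate`, the 182-day approximation of six months, and the sample posts are all assumptions, while `ViewCount`, `CommentCount`, `FavouriteCount`, and `Score` follow the column names given in the text.

```python
from datetime import datetime, timedelta
from statistics import mean


def topic_metrics(posts, now, min_age=timedelta(days=182)):
    """Average popularity and difficulty factors for the posts of one topic."""
    # Keep only questions older than roughly six months.
    eligible = [p for p in posts if now - p["CreationDate"] > min_age]
    if not eligible:
        raise ValueError("no posts older than the minimum age")
    # Hours between asking and the accepted answer, where one exists.
    hours = [
        (p["AcceptedAnswerDate"] - p["CreationDate"]).total_seconds() / 3600
        for p in eligible
        if p.get("AcceptedAnswerDate") is not None
    ]
    avg_views = mean(p["ViewCount"] for p in eligible)
    avg_answers = mean(p["AnswerCount"] for p in eligible)
    return {
        "avg_views": avg_views,
        "avg_favourites": mean(p["FavouriteCount"] for p in eligible),
        "avg_score": mean(p["Score"] for p in eligible),
        "avg_comments": mean(p["CommentCount"] for p in eligible),
        "avg_hours_to_accepted": mean(hours) if hours else None,
        "answers_per_view": avg_answers / avg_views if avg_views else None,
    }


# Hypothetical posts for a single topic (not the study's data).
sample_posts = [
    {"CreationDate": datetime(2020, 1, 1, 10), "AcceptedAnswerDate": datetime(2020, 1, 1, 14),
     "ViewCount": 1000, "AnswerCount": 2, "CommentCount": 3, "FavouriteCount": 5, "Score": 10},
    {"CreationDate": datetime(2020, 3, 1), "AcceptedAnswerDate": datetime(2020, 3, 2),
     "ViewCount": 3000, "AnswerCount": 4, "CommentCount": 1, "FavouriteCount": 1, "Score": 2},
    {"CreationDate": datetime(2020, 12, 1), "AcceptedAnswerDate": None,  # too recent, excluded
     "ViewCount": 50, "AnswerCount": 0, "CommentCount": 0, "FavouriteCount": 0, "Score": 0},
]
metrics = topic_metrics(sample_posts, now=datetime(2021, 1, 1))
print(metrics)
```

Running the function once per topic and comparing the resulting dictionaries mirrors the comparison in the text: a higher `avg_views` suggests a more popular topic, while a longer `avg_hours_to_accepted` together with a lower `answers_per_view` suggests a more difficult one.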
5) Summary: We studied the challenges in each theme in detail to show how developers struggle to use or comprehend various areas of cryptography. Based on our findings, we believe there are two main reasons why developers encounter problems in cryptography. The first is a distinct lack of knowledge about what they need to use to accomplish a crypto task, and why. We observed ample evidence of developers lacking the confidence to choose the best algorithm or parameter, for instance, the right and safest padding option in AES. Consequently, developers may use boilerplate code snippets from the provided answers regardless of the answers’ reliability and security. The second is that, although the fundamental concepts are the same, the way a crypto concept is implemented differs across crypto libraries, which affects developer performance. Our findings provide compelling evidence that working with more than one crypto library, because a project uses various architectures or platforms, confuses developers about how a particular problem can be resolved. They commonly have trouble creating keys with one library and importing them into another, or verifying a signature in a different crypto library. Furthermore, adequate explanations and useful examples in documentation can alleviate the difficulty of using cryptography.
This paper is available on arXiv under the CC BY 4.0 DEED license.
[6] http://185.94.98.132/~crypto/paper_data/lda.html
[7] http://185.94.98.132/~crypto/paper_data/tags-topics.csv
[8] http://185.94.98.132/~crypto/paper_data/
Authors:
(1) Mohammadreza Hazhirpasand, Oscar Nierstrasz, University of Bern, Bern, Switzerland;
(2) Mohammadhossein Shabani, Azad University, Rasht, Iran;
(3) Mohammad Ghafari, School of Computer Science, University of Auckland, Auckland, New Zealand.