III. METHODOLOGY
We first explain the data-gathering procedure, then describe how we clean the data, and briefly introduce LDA topic modeling.
A. Data Extraction
To collect crypto-related posts on Stack Overflow, we assumed that the tags attached to a question mainly reflect the question’s topic. We first used the “cryptography” tag, i.e., the base tag, to fetch crypto-related posts (11 130 posts) with the help of the Stack Exchange Data Explorer platform. We found 2 184 tags (candidate tags) that occurred in posts together with the “cryptography” tag. However, not all candidate tags were crypto-related, e.g., C#.
To find posts relevant to the base tag, we used two metrics to determine which of the candidate tags are exclusively related to it. The first metric, affinity, measures the degree to which a candidate tag (T) is exclusively associated with the base tag (BT). For each T, we used the posts-with-tags function, pwt() for brevity, to count the posts whose tags contain both T and BT, and the posts whose tags contain T. Given these two values, we compute affinity(T, BT) = |pwt(T, BT)| / |pwt(T)|, which ranges from zero to one.
The smaller the value of the first metric, the weaker the association between T and BT. For example, the “C++” and “encryption” tags appeared 639 897 and 29 737 times, respectively, in the entire Stack Overflow. The “C++” tag appeared together with BT 540 times, and “encryption” appeared 3 535 times with BT. The affinity value is therefore 0.0008 for “C++” and 0.1188 for “encryption”, which demonstrates a strong affinity between “encryption” and BT.
However, a higher affinity value for a candidate tag does not necessarily indicate that the tag is closely related to cryptography. For example, the “s60-3rd-edition” tag appeared once with the base tag and 11 times in total on Stack Overflow. Its affinity value is 0.09, close to that of the “encryption” tag, even though it appeared only once with the base tag. To resolve this issue, we introduced a second metric, coverage(T, BT) = |pwt(T, BT)| / |pwt(BT)|, which indicates the share of BT posts covered by T. For example, the coverage value of the “s60-3rd-edition” tag (0.00008) shows that the candidate tag covers almost none of the base tag’s posts, while the “C++” tag covers 0.04 of the cryptography-related questions.
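As an illustration, a minimal sketch in Python that reproduces the figures above; the post counts are taken from the text, while the pwt() queries themselves run on Stack Exchange Data Explorer and are not reproduced here:

```python
def affinity(posts_with_t_and_bt, posts_with_t):
    # affinity(T, BT) = |pwt(T, BT)| / |pwt(T)|
    return posts_with_t_and_bt / posts_with_t

def coverage(posts_with_t_and_bt, posts_with_bt):
    # coverage(T, BT) = |pwt(T, BT)| / |pwt(BT)|
    return posts_with_t_and_bt / posts_with_bt

BT_POSTS = 11130  # posts tagged "cryptography" (the base tag)

print(affinity(540, 639897))    # "c++":        ~0.0008
print(affinity(3535, 29737))    # "encryption": ~0.1188
print(coverage(540, BT_POSTS))  # "c++":        ~0.0485 of the BT posts
```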
Two authors of this paper examined various combinations of thresholds for the two metrics and manually reviewed the resulting tags. We found that thresholds of affinity above 0.025 and coverage above 0.005 select only crypto-related tags from the 2 184 candidates. Forty crypto-related tags fall within this threshold domain. The list of crypto-related tags, as well as their frequencies, is available online.[1] Next, we again used the Stack Exchange Data Explorer to extract posts containing each of the 40 selected tags but not the base tag, and recorded them in CSV files, which are available online.[2]
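The threshold step can then be sketched as a simple filter over the candidates, assuming a hypothetical tag_counts mapping from each candidate tag to its pwt() counts and the helpers defined above:

```python
AFFINITY_MIN, COVERAGE_MIN = 0.025, 0.005

# tag_counts: hypothetical dict mapping tag -> (|pwt(T, BT)|, |pwt(T)|)
crypto_tags = [tag for tag, (with_bt, total) in tag_counts.items()
               if affinity(with_bt, total) > AFFINITY_MIN
               and coverage(with_bt, BT_POSTS) > COVERAGE_MIN]
```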
B. Data Clustering via Topic Modeling
We combined the title and body of each post to create a document. We removed duplicate post IDs across the CSV files and finally obtained 91 954 unique documents, without considering when the posts were created. Evidently, each of the documents contained a large number of unnecessary text elements that could produce noise in the output of a topic modeling algorithm. We preprocessed the documents in the following steps: (1) we removed all code blocks enclosed by the “code” tag, (2) we removed all remaining HTML elements with the help of the Beautiful Soup library,[3] (3) we removed newlines and non-alphanumeric characters, (4) we used the NLTK package to eliminate English stop words from the documents, and finally (5) we used the Snowball stemmer to normalize the text by transforming words into their root forms, e.g., playing becomes play. We found 269 795 stemmed words in total. Finally, we used the CountVectorizer class in Scikit-learn to transform the words into a vector of term/token counts to feed into a machine learning algorithm.
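A sketch of this preprocessing pipeline, assuming a hypothetical posts list of (title, body) pairs; NLTK’s stop word list must be downloaded once via nltk.download("stopwords"):

```python
import re
from bs4 import BeautifulSoup                              # pip install beautifulsoup4
from nltk.corpus import stopwords                          # requires nltk.download("stopwords")
from nltk.stem import SnowballStemmer
from sklearn.feature_extraction.text import CountVectorizer

stemmer = SnowballStemmer("english")
stop_words = set(stopwords.words("english"))

def preprocess(title, body):
    soup = BeautifulSoup(f"{title} {body}", "html.parser")
    for block in soup.find_all(["code", "pre"]):           # (1) drop code blocks
        block.decompose()
    text = soup.get_text(" ")                              # (2) strip HTML elements
    text = re.sub(r"[^a-zA-Z0-9 ]", " ", text)             # (3) newlines / non-alphanumerics
    tokens = [w for w in text.lower().split()
              if w not in stop_words]                      # (4) remove English stop words
    return " ".join(stemmer.stem(w) for w in tokens)       # (5) stem words to root forms

documents = [preprocess(title, body) for title, body in posts]  # posts: hypothetical input
vectorizer = CountVectorizer()
dtm = vectorizer.fit_transform(documents)                  # document-term count matrix
```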
We used Scikit-learn,[4] a popular machine learning library in Python that provides a range of supervised and unsupervised learning algorithms. Latent Dirichlet Allocation (LDA) is an unsupervised learning algorithm based on a generative probabilistic model that considers each topic as a set of words and each document as a set of topic probabilities [11]. LDA has been used to discover latent topics in documents in a large number of prior studies [12], [7], [13].
Before training a model, LDA requires a number of important parameters to be specified. LDA asks for a fixed number of topics and then maps all the documents to those topics. The alpha parameter describes document-topic density: a higher alpha means that documents consist of more topics, producing a more precise topic distribution per document. The beta parameter describes topic-word density: a higher beta means that topics entail most of the words, producing a more specific word distribution per topic.
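In Scikit-learn, these three parameters correspond to LatentDirichletAllocation’s n_components (number of topics), doc_topic_prior (alpha), and topic_word_prior (beta). A minimal sketch, reusing the dtm matrix from the preprocessing sketch above; the prior values shown are purely illustrative:

```python
from sklearn.decomposition import LatentDirichletAllocation

lda = LatentDirichletAllocation(n_components=10,       # number of topics (illustrative)
                                doc_topic_prior=0.1,   # alpha: document-topic density
                                topic_word_prior=0.01, # beta: topic-word density
                                random_state=0)
doc_topics = lda.fit_transform(dtm)  # per-document topic probability distributions
```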
The optimal values of hyperparameters cannot be directly estimated from the data, and, more importantly, the right choice of parameters considerably improves the performance of a machine learning model [14]. We therefore used the GridSearchCV function in Scikit-learn to perform hyperparameter tuning, generating candidates from an array of values for the three aforementioned parameters, i.e., alpha, beta, and the number of topics. Since research has shown that choosing the proper number of topics for a model is not simple, an iterative approach can be employed [15]: train models with different numbers of topics and choose the number for which the model has the lowest perplexity. Perplexity is a measure of the statistical goodness of fit of a topic model [11]. We therefore varied the number of topics from 1 to 25. We also used conditional hyperparameter tuning for alpha, meaning that one hyperparameter may need to be tuned depending on the value of another [16]: following the guidelines of previous research [17], we set alpha = 50 / number of topics and beta = 0.01.
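A sketch of this conditional tuning, again assuming the dtm matrix from above. GridSearchCV cannot express alpha = 50 / number of topics directly, so the grid is written as one entry per topic count; GridSearchCV maximizes LDA’s default score (an approximate log-likelihood), which corresponds to minimizing perplexity:

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.model_selection import GridSearchCV

# One grid entry per topic count k, so that alpha = 50 / k follows k (conditional tuning).
param_grid = [{"n_components": [k],
               "doc_topic_prior": [50.0 / k],  # alpha
               "topic_word_prior": [0.01]}     # beta
              for k in range(1, 26)]           # 1 to 25 topics

search = GridSearchCV(LatentDirichletAllocation(random_state=0), param_grid)
search.fit(dtm)

best_lda = search.best_estimator_
print(search.best_params_, best_lda.perplexity(dtm))  # lower perplexity = better fit
```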
Optimizing for perplexity, however, may not always result in humanly interpretable topics [18]. To facilitate the manual interpretation of the topics, we used pyLDAvis,[5] a popular visualization package in Python. Two authors of this paper separately checked the resulting top keywords of the topics, i.e., from 1 to 25, and the associated pyLDAvis visualizations to ensure that the chosen number of topics is semantically aligned with human judgment.
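A sketch of this visualization step, assuming the best_lda, dtm, and vectorizer objects from the sketches above; note that in recent pyLDAvis releases the Scikit-learn adapter module was renamed from pyLDAvis.sklearn to pyLDAvis.lda_model:

```python
import pyLDAvis
import pyLDAvis.sklearn  # pyLDAvis.lda_model in pyLDAvis >= 3.4

# Build the interactive topic map from the fitted LDA model,
# the document-term matrix, and the fitted CountVectorizer.
panel = pyLDAvis.sklearn.prepare(best_lda, dtm, vectorizer)
pyLDAvis.save_html(panel, "crypto_topics.html")
```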
C. Data Analysis
We computed the required sample size for 91 954 documents with a confidence level of 95% and a margin of error of 5%, which yields 383 documents. We then used stratified sampling to divide the whole population into smaller groups, called strata: we considered each topic to be one stratum and randomly selected documents proportionally from the different strata. We then used thematic analysis, a qualitative research method for finding topics in text [19], to extract the frequent topics from the documents. Two authors of the paper carefully reviewed the title, question body, and answer body of each document, and each author iteratively refined the extracted topics by labeling the posts. We then calculated Cohen’s kappa, a commonly used measure of inter-rater agreement [20], between the two reviewers. Finally, the two reviewers compared their final labeling results and, in a joint session, reanalyzed the posts on which they disagreed in order to discuss and arrive at a consensus.
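A sketch of the sampling arithmetic: the paper does not name the sample-size formula it used, but Cochran’s formula with a finite-population correction (an assumption) reproduces the 383 figure, and the proportional allocation across strata follows directly:

```python
import math
from sklearn.metrics import cohen_kappa_score

def sample_size(population, z=1.96, margin=0.05, p=0.5):
    # Cochran's formula with finite-population correction (assumed; see lead-in).
    n0 = (z ** 2) * p * (1 - p) / margin ** 2
    return math.ceil(n0 / (1 + (n0 - 1) / population))

def proportional_allocation(strata_sizes, total_sample):
    # Allocate the sample across strata (topics) proportionally to their size.
    population = sum(strata_sizes.values())
    return {topic: round(total_sample * size / population)
            for topic, size in strata_sizes.items()}

n = sample_size(91954)  # -> 383

# strata_sizes would map each LDA topic to its document count, e.g.:
# proportional_allocation({"topic_0": 20000, "topic_1": 71954}, n)

# Inter-rater agreement between the two reviewers' (hypothetical) label lists:
# kappa = cohen_kappa_score(labels_reviewer_a, labels_reviewer_b)
```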
This paper is available on arXiv under the CC BY 4.0 DEED license.
[1] http://185.94.98.132/~crypto/paper_data/tags.csv
[2] http://185.94.98.132/~crypto/paper_data/
[3] https://www.crummy.com/software/BeautifulSoup/
[4] https://scikit-learn.org/
[5] https://github.com/bmabey/pyLDAvis
Authors:
(1) Mohammadreza Hazhirpasand, Oscar Nierstrasz, University of Bern, Bern, Switzerland;
(2) Mohammadhossein Shabani, Azad University, Rasht, Iran;
(3) Mohammad Ghafari, School of Computer Science, University of Auckland, Auckland, New Zealand.