Results
Study Selection
The search process began by identifying a total of 328 studies from three databases: 192 from Google Scholar, 101 from PubMed, and 35 from IEEE Xplore. After removing 57 duplicate studies, 271 unique titles and abstracts were retained for screening. During the title and abstract review, 174 studies were excluded. These exclusions were due to issues related to methodology (53 studies), scope (77 studies), and publication type (44 studies). This left 97 full-text studies to be reviewed in detail.
Upon reviewing the full texts, another 50 publications were excluded. The reasons for exclusion included being outside the scope or irrelevant (32 studies), methodological concerns (6 studies), publication type (9 studies), and unavailability (3 studies). Ultimately, 47 studies were included in the final narrative synthesis.
Figure 1 outlines how the initial pool of studies was refined down to the most relevant research for inclusion.
Characteristics of Included Studies
Key details of all 47 included studies, as listed in Table 1, are provided in an online supplementary document. The majority of the studies focus on Twitter, Reddit, and Facebook, with 32 studies centered on Twitter, 8 on Reddit, and 7 on Facebook. Additionally, 1 study examined Blued (a platform for MSM communities), and another focused on Indian Social Networking Sites (SNS). Notably, 8 studies (17.02%) involved multiple platforms. The most commonly used models include traditional machine learning approaches such as Support Vector Machines (SVM) (19 studies), tree-based models (e.g., Decision Trees in 6 studies, Random Forests in 13 studies, and eXtreme Gradient Boosting (XGBoost) in 3 studies), and Logistic Regression (6 studies). Some studies also explored deep learning models, including Convolutional Neural Networks (CNN) (9 studies), Long Short-Term Memory networks (LSTM) (5 studies), and Bidirectional Encoder Representations from Transformers (BERT) (9 studies) for depression detection.
Methodological Quality and Risk of Bias
We assessed methodological quality using the Prediction model Risk Of Bias ASsessment Tool (PROBAST) (Wolff et al., 2019), which provides a structured framework across four key domains: participants, predictors, outcome, and analysis. This comprehensive tool allowed for an in-depth evaluation of potential biases within each stage of the machine learning lifecycle, including data collection and preprocessing, model development, and model evaluation. Additionally, we evaluated biases in the reporting of study methodologies and findings, ensuring a thorough assessment of transparency and completeness. The risk of bias was assessed by applying targeted questions for each domain, which are listed in Table 2. By incorporating both PROBAST and bias evaluation in reporting, we aimed to identify common sources of bias, understand their implications for study findings, and assess the overall validity and generalizability of machine learning models used for mental health detection on social media.
Sample Selection and Representativeness (Q1 & Q2):
The reviewed studies employed diverse sampling methods across various social media platforms, primarily focusing on Twitter (63.8%) with additional data from Reddit (23.5%), Facebook (8.5%), and other social media (2.1%). Most studies (around 80%) used non-probability sampling techniques, such as convenience sampling or keyword filtering, often utilizing APIs (e.g., the Twitter API or Reddit API) to filter posts by specific mental health-related keywords like "depression" or "#MentalHealth," or leveraging pre-existing datasets from repositories like Kaggle.
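To illustrate what keyword-based API sampling typically looks like, the sketch below assumes the tweepy client for the Twitter API v2 and a placeholder bearer token; it is not the collection code of any reviewed study:

```python
# Illustrative sketch of keyword-based sampling via the Twitter API v2 using tweepy;
# not the data-collection code of any reviewed study. Requires valid API credentials.
import tweepy

client = tweepy.Client(bearer_token="YOUR_BEARER_TOKEN")  # placeholder credential

# Keyword filtering: English posts mentioning "depression" or "#MentalHealth".
query = "(depression OR #MentalHealth) lang:en -is:retweet"
response = client.search_recent_tweets(query=query, max_results=100)

posts = [tweet.text for tweet in (response.data or [])]
print(f"Collected {len(posts)} posts via keyword filtering (a non-probability sample)")
```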
The diversity in sampling criteria, sample sizes, demographic details, language focus, and geographic regions across the studies introduces potential biases. Sample sizes and levels of representation varied significantly among the studies, from small-scale studies (e.g., Study #46 which analyzed 4,124 Facebook posts from 43 undergraduate students with pre-specified criteria from the U.S.) to large-scale analyses (e.g., Study #5 which analyzed 56,411,200 tweets from 70,000 users across seven major U.S. cities). Many studies lacked detailed demographic information. The majority of studies focused predominantly on English-language posts and specific regions, such as the U.S., U.K., Japan, Spain, and Portugal, although a few studies examined posts in other languages, like Study #15, which analyzed Arabic tweets. Even within these regions and language-specific studies, demographic distribution was not always fully balanced. For example, Study #1 reported a mean participant age of 30.5 years (ranging from 18 to 68) and had a slight overrepresentation of female participants at 66.4%.
The non-representative sampling approaches observed across studies suggest limited generalizability to broader social media user populations. The primary biases identified include:
● Platform Bias: The predominance of Twitter (63.8%) over other platforms means that findings may not represent behaviors on platforms like Facebook, Instagram, or Reddit. As suggested by Olteanu et al. (2019), utilizing multi-platform data can reduce platform-specific biases and provide a more comprehensive view of user behaviors.
● Language Bias: The overwhelming focus on English-language content (over 90%) excludes insights from non-English-speaking communities, limiting the generalizability of findings across diverse linguistic groups. For instance, Study #15 was one of the few that analyzed non-English tweets, indicating the rarity of multilingual studies in this field. To address this, Danet et al. (2007) recommended leveraging multilingual analysis methods, such as machine translation, or employing multilingual research teams to capture a more diverse linguistic landscape.
● Geographic Bias: Studies often concentrated on specific regions, such as the U.S. and European countries. For example, Study #5 analyzed tweets from seven major U.S. cities, and Study #19 focused on Twitter users in Spain and Portugal. Hargittai (2015) suggests broadening the geographic scope to better represent global populations and avoid region-specific findings.
● Selection Bias: Some studies relied on keyword-based sampling, which may overlook users not explicitly mentioning mental health. Study #7, for instance, searched for tweets containing "I was diagnosed with depression." As suggested by Morstatter (2013), combining keyword-based and random sampling can capture a broader range of user behaviors and discussions.
● Self-selection Bias: Platforms like MTurk or Clickworker, used in some studies (e.g., Studies #45 and #1, respectively), may attract specific demographic or employment profiles (e.g., higher digital literacy, particular age ranges, or specific socioeconomic statuses), affecting generalizability. Chandler and Shapiro (2016) recommend combining multiple recruitment sources and using stratified sampling to achieve a more representative participant pool.
In summary, no study in the review provided a fully representative sample of all social media users or posts. Key limitations include platform-specific focus (mostly Twitter), heavy reliance on non-probability sampling techniques (e.g., approximately 80% of studies utilized convenience sampling or keyword filtering), and geographic and linguistic constraints. Notably, over 90% of the studies themselves acknowledged these limitations, recognizing the challenges of achieving representativeness in social media research. These limitations are, to a large extent, unavoidable due to the nature of social media platforms and the constraints of current data collection methodologies. This underscores the need for ongoing efforts to develop more sophisticated sampling techniques and analytical methods to mitigate these biases.
Similarly, some studies explicitly stated that their findings were intended to represent only specific populations. For instance, Study #8 and Study #21 focused on users discussing mental health or particular demographic groups on specific platforms. These limitations significantly impact the generalizability of findings to the broader population of social media users. Future research should strive for more diverse and representative sampling across platforms, languages, and geographic regions to enhance the applicability of results in the field of mental health and social media research.
Data Preprocessing with a Focus on Negative Word Handling (Q3):
Across all studies, several common preprocessing tasks were consistently performed. Tokenization was conducted in all studies to break text into individual words or tokens, and text normalization steps included converting text to lowercase, as well as removing punctuation, URLs, and special characters. Many studies also performed stop-word removal to eliminate common words that are generally not informative for modeling. Additionally, some studies applied stemming and lemmatization to reduce words to their base or root forms, thereby unifying different morphological variants. Feature extraction techniques such as Term Frequency-Inverse Document Frequency (TF-IDF) (Singh & Singh, 2022), Bag of Words (BoW) (Singh & Singh, 2022), and various word embedding methods were widely used to represent textual data numerically for modeling purposes.
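A minimal sketch of this standard pipeline, assuming NLTK and scikit-learn (the sample posts are invented and the exact steps varied by study), is:

```python
# Minimal sketch of a typical preprocessing and feature-extraction pipeline
# (illustrative only; not the exact pipeline of any reviewed study).
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer

# One-time resource downloads: nltk.download("punkt"), "stopwords", "wordnet"
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words("english"))

def preprocess(post: str) -> str:
    """Lowercase, strip URLs/punctuation, tokenize, drop stop words, lemmatize."""
    post = post.lower()
    post = re.sub(r"http\S+|www\.\S+", " ", post)   # remove URLs
    post = re.sub(r"[^a-z\s]", " ", post)           # remove punctuation and special characters
    tokens = word_tokenize(post)
    return " ".join(lemmatizer.lemmatize(t) for t in tokens if t not in stop_words)

posts = ["I can't sleep and I feel so empty... http://example.com",
         "Loving the sunny weather today!"]
cleaned = [preprocess(p) for p in posts]

# TF-IDF (or BoW via CountVectorizer) turns the cleaned text into a numeric matrix.
X = TfidfVectorizer().fit_transform(cleaned)
print(X.shape)
```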
While these standard preprocessing steps were broadly applied, certain aspects of sentiment analysis in mental health detection require additional attention. One such aspect is the effective handling of negative words, which is crucial for accurately interpreting sentiment and emotional tone, especially within this context. Among the 47 reviewed studies, approaches to negative words varied significantly:
First, only a minority of studies (11 out of 47 studies, approximately 23%) explicitly addressed negative words or negations in their preprocessing steps. Methods included standardizing all negative words to a basic form, like "not", during preprocessing, which simplifies the representation of negations and improves sentiment recognition (e.g., Studies #3 and #34). Some studies quantified negative words as features, by calculating metrics such as the user-specific average number of negative words per post. This metric captures the frequency of negative expressions per user and is then used as input for machine learning models to identify depressive emotions (e.g., Study #21). Others (e.g., Study #25) assigned a weight of -1 to negative adverbs to account for their inversion effect on sentence sentiment, ensuring more accurate sentiment quantification. Moreover, several studies employed specific methods for managing negations within their sentiment analysis frameworks. For example, some studies used sentiment analysis tools like TextBlob to determine the polarity of words in context, identifying negative words as indicators of depressive symptoms (e.g., Study #31). Others incorporated linguistic inquiry and word count (LIWC) categories related to negations and negative emotions, indirectly addressing negations through predefined lexicon categories (Studies #1, #40, #42, #46, and #47).
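The strategies above can be sketched in a few lines; the negation lexicon, regular expression, and per-user aggregation below are simplified assumptions rather than the exact implementations of Studies #3, #21, or #25:

```python
# Simplified sketches of two negation-handling strategies described above
# (illustrative assumptions; not the exact code of Studies #3, #21, or #25).
import re

NEGATIONS = {"no", "not", "never", "cannot", "neither", "nor"}  # assumed, non-exhaustive lexicon

def normalize_negations(text: str) -> str:
    """Map contracted and variant negations to a single canonical token, 'not'."""
    text = re.sub(r"\b(can't|won't|don't|didn't|isn't|wasn't|couldn't)\b", "not", text.lower())
    return " ".join("not" if tok in NEGATIONS else tok for tok in text.split())

def negative_word_rate(posts: list[str], negative_lexicon: set[str]) -> float:
    """User-level feature: average number of negative words per post."""
    counts = [sum(tok in negative_lexicon for tok in p.lower().split()) for p in posts]
    return sum(counts) / len(posts) if posts else 0.0

print(normalize_negations("I don't feel happy, never have"))
print(negative_word_rate(["I feel sad and hopeless", "nothing matters anymore"],
                         {"sad", "hopeless", "nothing"}))
```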
The importance of negation handling has also been recognized within the reviewed studies. For instance, Study #6 specifically explored the role of negation preprocessing in sentiment analysis for depression detection. By comparing datasets with and without negation handling, the authors demonstrated that addressing negations can significantly improve the accuracy of both sentiment analysis and depression detection, underscoring the critical need for comprehensive negation handling in preprocessing to enhance the reliability of machine learning models in mental health contexts.

Second, a subset of studies (9 out of 47 studies, approximately 19%) did not explicitly handle negative words but employed advanced language models capable of inherently managing negations through their contextual understanding, such as transformer-based models like Bidirectional Encoder Representations from Transformers (BERT) (Devlin, 2018) and Mental Health BERT (MentalBERT) (Owen et al., 2023) (e.g., Studies #8, #9, #15, #16, and #39). These transformer-based models capture the context of negations by processing text bidirectionally, without explicit preprocessing steps. Other studies used attention mechanisms[1] with word embeddings, such as attention layers combined with Global Vectors for Word Representation (GloVe) embeddings, allowing models to assign appropriate weights to negations through contextual embeddings (e.g., Studies #7, #10, and #13). Additionally, Embeddings from Language Models (ELMo), which capture the entire context of a word within a sentence, were also noted as capable of capturing the effect of negative words without explicit handling (Study #45).
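As a minimal sketch of this model-side handling of negation, the snippet below uses the Hugging Face transformers pipeline with its generic default sentiment model rather than the fine-tuned depression classifiers used in the studies above:

```python
# Minimal sketch: a transformer-based classifier handles negation through context,
# with no explicit negation preprocessing (illustrative; not any study's exact setup).
from transformers import pipeline

# The default sentiment model is an assumption here; in a depression-detection task,
# a fine-tuned BERT or MentalBERT classifier would be loaded instead.
classifier = pipeline("sentiment-analysis")

for text in ["I am happy with my life.", "I am not happy with my life."]:
    print(text, "->", classifier(text)[0])
# Because the model encodes each sentence bidirectionally, the negated sentence
# receives a different (negative) label without any special handling of "not".
```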
However, the majority (27 out of 47 studies, approximately 57%) neither explicitly addressed negative words in their preprocessing nor used models inherently capable of handling negations (i.e., Studies #2, #4, #5, #11, #12, #14, #17, #18, #19, #20, #22, #23, #24, #26, #27, #28, #29, #30, #32, #33, #35, #36, #37, #38, #41, #43, and #44). These studies primarily focused on standard preprocessing tasks (e.g., tokenization, lowercasing, stop-word removal, stemming, and lemmatization), feature extraction methods (e.g., TF-IDF, BoW), and basic word embeddings (e.g., Word to Vector [Word2Vec]), without any special consideration for negations.
The impact on model performance and potential bias varied depending on how negative words were handled. Studies that explicitly addressed negative word handling reported improvements in model accuracy and a more nuanced understanding of sentiment (Helmy et al., 2024). Proper handling of negations allowed these models to correctly interpret phrases where negations invert the sentiment (e.g., "not happy" versus "happy"), leading to more reliable results. In contrast, studies that did not explicitly account for negative words risked misinterpreting negated expressions, introducing bias into their findings. This oversight can cause models to incorrectly assign positive sentiment to negated negative expressions or vice versa, thus skewing the analysis. Such biases can significantly affect the overall performance and generalizability of the models, particularly in sensitive applications like depression detection. While some studies used advanced models capable of inherently handling negations (e.g., Studies #7, #8, #9, #10, #13, #15, #16, #39, and #45), reliance solely on the model's ability without explicit preprocessing might not capture all nuances of negations. Explicitly addressing negations can further enhance model performance, even when using sophisticated language models (Khandelwal and Sawant, 2020). Therefore, integrating both advanced modeling techniques and careful preprocessing of negative words may provide the most effective approach.
In summary, the review highlights a significant gap in the explicit handling of negative words during data preprocessing among studies focused on sentiment analysis and related fields. Proper management of negations is crucial, as it can substantially impact both model accuracy and reliability. Without adequately handling negative words, models may introduce bias and lose effectiveness, particularly in applications such as mental health analysis and depression detection, where understanding sentiment nuances is critical. Future studies should prioritize the inclusion of explicit negation handling techniques within their preprocessing pipelines to enhance model performance and ensure more accurate interpretations of textual data.
Model Development
Hyperparameter Tuning (Q3, Q4 & Q5)
Hyperparameter tuning is a critical aspect of optimizing machine learning models, directly impacting their performance and reliability. Our evaluation of the 47 reviewed studies focused on whether the studies reported their hyperparameters, the extent to which these hyperparameters were optimized, and whether tuning was applied consistently across all models within each study. In particular, 27 studies (approximately 57%) reported the hyperparameters they used, but not all of them performed proper tuning. Only a limited number of studies ensured consistent tuning across all models, with many opting for default settings or tuning only specific models, leaving significant performance potential unexplored (Yang & Shami, 2020). This practice suggests that while hyperparameters are acknowledged by researchers, there is still a notable gap in their comprehensive and consistent optimization across studies. The breakdown of hyperparameter reporting and tuning practices is presented in Table 3.
The absence of consistent hyperparameter tuning can result in suboptimal model performance, reduced generalizability, or biased model comparisons.
Key hyperparameters such as learning rate, regularization terms, or the number of hidden layers directly impact a model’s training process and final accuracy (Probst et al., 2019). Without proper tuning, models may overfit, meaning they perform well on training data but poorly on unseen data, or underfit, failing to capture the complexity of the data altogether. For example, Study #2 did not report any tuning, which likely affected its model's ability to generalize to unseen data, leading to reduced model performance.
When only some models are tuned, comparisons across models become biased, as those with optimized hyperparameters gain an undue advantage. In Study #1, for instance, the Elastic Net model had its hyperparameters tuned, while other models, such as random forest, were left with default settings. This discrepancy can misleadingly suggest the superiority of the Elastic Net model due to tuning alone, rather than any inherent advantage in its architecture, leading to biased model comparisons.
A significant proportion of studies either did not report their hyperparameters (approximately 40%) or failed to tune them consistently across all models (approximately 32%), which compromises the validity of their findings. For example, Studies #2 and #4 used default settings and missed opportunities to enhance performance, while Study #1 tuned hyperparameters for only one model, resulting in biased comparisons. Proper hyperparameter tuning is essential to avoid issues like overfitting or underfitting, and consistent tuning across all models ensures fair comparisons and enhances the validity of results.
Providing detailed descriptions of hyperparameter settings and optimization processes enhances transparency and reproducibility. Standardized tuning protocols, such as grid search, random search, or Bayesian optimization, should be employed to explore optimal configurations. Clearly documenting tuning strategies and any challenges encountered will provide valuable context for interpreting model performance results and strengthen the credibility of future machine learning studies. Future research should prioritize consistent tuning strategies and detailed reporting to enhance the credibility and reproducibility of their machine learning studies.
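As a minimal sketch of such a standardized protocol, assuming scikit-learn, the candidate models, parameter grids, and synthetic data below are illustrative; every model is tuned with the same grid-search and cross-validation settings so comparisons remain fair:

```python
# Sketch: tune every candidate model with the same protocol (grid search + 5-fold CV)
# so no model gains an undue advantage; grids and data are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=20, weights=[0.8], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

candidates = {
    "SVM": (SVC(), {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}),
    "Random Forest": (RandomForestClassifier(random_state=0),
                      {"n_estimators": [100, 300], "max_depth": [None, 10]}),
    "Logistic Regression": (LogisticRegression(max_iter=1000), {"C": [0.1, 1, 10]}),
}

for name, (model, grid) in candidates.items():
    search = GridSearchCV(model, grid, cv=5, scoring="f1")  # identical CV and metric for all
    search.fit(X_train, y_train)
    print(name, search.best_params_, "held-out F1:", round(search.score(X_test, y_test), 3))
```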
Data Partitioning (Q6):
Proper data partitioning is fundamental to developing robust machine learning models that generalize well to unseen data. Of the 47 reviewed studies, 32 studies (approximately 68%) adhered to recommended machine learning protocols by appropriately dividing their datasets into training, validation, and test sets or by employing cross-validation techniques. The breakdown of data partitioning practices is summarized in Table 4.
Among the studies that explicitly partitioned their datasets, such as Studies #1, #6, and #7, performance metrics were reported based on the test sets, adhering to the best practices outlined by Goodfellow et al. (2016). By evaluating their models on unseen data, they ensured that the models' performance accurately reflected their generalizability.
Seven studies used cross-validation methods instead of a traditional train/validation/test split. Techniques like k-fold cross-validation provide a robust assessment of a model's ability to generalize by iteratively training and testing on different subsets of the dataset (Hastie et al., 2009). For instance, Study #39 utilized 5-fold cross-validation, where the dataset was divided into five subsets, with each subset used as a test set once while the remaining subsets formed the training set. The reported metrics—Positive Predictive Value (PPV), Sensitivity, and F1 Score—were averaged across the five test folds in the cross-validation process, ensuring that evaluation was based on separate test data rather than solely on the training data.
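Both partitioning strategies can be sketched briefly with scikit-learn on synthetic data (a sketch only; not the exact setup of any reviewed study):

```python
# Sketch of the two partitioning strategies discussed above (synthetic data; illustrative only).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# (a) Hold-out split: 60% train / 20% validation / 20% test.
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("validation accuracy (model selection):", model.score(X_val, y_val))
print("test accuracy (reported metric):      ", model.score(X_test, y_test))

# (b) 5-fold cross-validation: metrics are averaged over the five held-out folds.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5, scoring="f1")
print("5-fold mean F1:", round(scores.mean(), 3))
```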
Conversely, as shown in Table 4, approximately 17% of studies (8 out of 47) did not report sufficient details on data partitioning or did not employ partitioning techniques. For example, Study #2 provided limited information about its dataset division and did not elaborate on how model performance was evaluated, while Study #5 applied pre-existing models without conducting new data partitioning or validation within their analysis, thereby limiting the validity of their performance assessments.
Inadequate data partitioning practices introduce a significant risk of bias, particularly overfitting. Models developed without proper data division tend to memorize the training data, leading to overly optimistic performance metrics that do not accurately reflect real-world applicability (Bishop, 2006).
According to Ng (2018), proper validation and test sets are crucial for assessing generalization and preventing overfitting. Without them, models may appear overly effective due to inflated performance metrics, which is misleading when the models are applied beyond the training context. For example, studies that evaluated models solely on training data, such as Studies #2 and #5, likely overestimate their real-world performance.
In summary, while the majority of the reviewed studies adhered to best practices in data partitioning—thereby enhancing the credibility and generalizability of their findings—a significant minority did not. The lack of proper data partitioning in approximately 17% of studies introduces risks of bias, underscoring the need for more rigorous practices. For the development of robust models, future research should consistently apply proper data partitioning and report performance based on validation or test sets to provide accurate, unbiased evaluations. Transparent data partitioning and evaluation reporting, as emphasized by Bishop (2006) and Goodfellow et al. (2016), are fundamental to enhancing reproducibility and reliability in machine learning research. By incorporating these practices, researchers can enhance the reliability of their models, ensure that findings are both valid and applicable in real-world scenarios, and contribute to the advancement of the field.
Model Evaluation: Evaluation Metrics for Imbalanced Class Scenarios (Q8, Q9 & Q10):
In the domain of depression-related emotion detection, datasets often exhibit significant class imbalance, with non-depressed cases vastly outnumbering depressed ones. This imbalance poses challenges for model evaluation, as traditional metrics like accuracy can be misleading. According to He & Garcia (2009), accuracy may not adequately reflect a model's performance in imbalanced scenarios because a model could achieve high accuracy by simply predicting the majority class. Therefore, metrics such as recall, precision, F1 score, and Area Under the Receiver Operating Characteristic Curve (AUROC or AUC) are preferred, as they provide a more balanced evaluation by accounting for both false positives and false negatives. Japkowicz (2013) further emphasizes the necessity of using these metrics, arguing that they are crucial for a comprehensive assessment of model performance in the presence of class imbalance.
In the context of depression detection, recall is particularly important, as it measures the proportion of actual positive cases (individuals with depression) that the model correctly identifies. In applications where missing a positive case could have serious consequences, such as failing to identify someone who is depressed and may need help, high recall is crucial. This prioritization ensures that the model captures as many true positive cases as possible, even if it results in more false positives.
Precision, on the other hand, is equally important because it measures the proportion of positive predictions that are correct. In depression detection, low precision indicates a high rate of false positives—incorrectly labeling non-depressed individuals as depressed, potentially causing unnecessary concern for those wrongly flagged as depressed. Therefore, balancing precision with recall is essential to ensure that the model is not only identifying true cases of depression but also minimizing the number of false alarms.
The F1 score, representing the harmonic mean of precision and recall, provides a balanced measure of both recall and precision. It is particularly useful in imbalanced datasets, where a balance between recall and precision is essential.
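For reference, the standard definitions, where TP, FP, and FN denote true positives, false positives, and false negatives, are:

```latex
\text{precision} = \frac{TP}{TP + FP}, \qquad
\text{recall} = \frac{TP}{TP + FN}, \qquad
F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}
```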
Finally, AUROC measures the model's ability to distinguish between positive and negative classes across different threshold settings, providing a comprehensive view of the model's discriminatory power. A higher AUROC indicates better capability of distinguishing between depressed and non-depressed individuals, making it a robust metric for evaluating models in this domain. Among the 47 studies reviewed, approximately 35 (Studies #1, #3, #6, #7, #8, #13, #14, #15, #16, #17, #19, #21, #22, #23, #25, #26, #27, #28, #29, #30, #31, #32, #33, #34, #35, #36, #37, #39, #40, #41, #42, #43, #44, #45, #46) utilized these preferred metrics. For example, Study #6, "Depression Detection for Twitter Users Using Sentiment Analysis in English and Arabic Tweets", employed precision, recall, F1 score, and AUC to evaluate their models, acknowledging the importance of these metrics for imbalanced data. Similarly, Study #42, "Classification of Helpful Comments on Online Suicide Watch Forums", emphasized the recall as a key metric in evaluating their model's effectiveness in identifying individuals at risk.
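A small invented example shows why accuracy alone can mislead under class imbalance while the preferred metrics expose the failure (a sketch assuming scikit-learn; the labels and scores are made up):

```python
# Made-up example: a majority-class predictor looks strong on accuracy but fails
# on the imbalance-aware metrics (assuming scikit-learn; labels/scores are invented).
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)

y_true = [0] * 90 + [1] * 10          # 90 non-depressed vs. 10 depressed cases
y_pred = [0] * 100                    # model always predicts "non-depressed"
y_score = [0.1] * 100                 # constant scores: no discriminatory power

print("accuracy :", accuracy_score(y_true, y_pred))                  # 0.90, looks good
print("recall   :", recall_score(y_true, y_pred, zero_division=0))   # 0.00, misses every case
print("precision:", precision_score(y_true, y_pred, zero_division=0))
print("F1 score :", f1_score(y_true, y_pred, zero_division=0))
print("AUROC    :", roc_auc_score(y_true, y_score))                  # 0.5, chance level
```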
Beyond the use of preferred metrics, an alternative way to address imbalanced data is to apply data balancing techniques such as resampling and reweighting. For instance, Study #6, "Depression Detection for Twitter Users Using Sentiment Analysis in English and Arabic Tweets", employed dynamic sampling methods, such as oversampling the minority class and undersampling the majority class, to balance the dataset. This approach ensured that the model had sufficient exposure to both classes before model construction and evaluation. Similarly, Study #41, "A Deep Learning Model for Detecting Mental Illness from User Content on Social Media", used the Synthetic Minority Oversampling Technique (SMOTE) to enhance the representation of the minority class, leading to improved classification performance, particularly for underrepresented classes.
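A minimal sketch of such oversampling, using the imbalanced-learn implementation of SMOTE on synthetic data (not the exact pipeline of Study #6 or #41), is:

```python
# Sketch: rebalancing the training split with SMOTE before model fitting
# (synthetic data; not the exact pipeline of Study #6 or #41).
from collections import Counter

from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, weights=[0.9], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

print("class counts before:", Counter(y_train))
X_res, y_res = SMOTE(random_state=0).fit_resample(X_train, y_train)  # oversample the minority class
print("class counts after :", Counter(y_res))
# Note: only the training split is resampled; the test split keeps its natural
# class distribution so that evaluation reflects real-world imbalance.
```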
Notably, some studies (Studies #3, #6, #13, #15, #34, #40, #41, #42, #43) applied both data balancing techniques and preferred evaluation metrics together to comprehensively address the class imbalance. For example, "Explainable Depression Detection with Multi-Aspect Features Using a Hybrid Deep Learning Model on Social Media" (Study #13) first implemented preprocessing steps to balance the dataset, enhancing the model's ability to learn from both classes equally. After addressing the class imbalance, the study then used the F1 score and related metrics to evaluate model performance, ensuring a more accurate and fair assessment. These examples indicate that researchers are increasingly aware of the class imbalance issue and are employing various approaches to address it effectively.
Conversely, some studies primarily relied on accuracy without addressing class imbalance issues. For example, Studies #2, #10, and #24 reported high accuracy but did not mention techniques to mitigate the effects of class imbalance.
In the context of depression detection, addressing class imbalance is essential for achieving reliable model evaluation. When instances of the non-depressed class significantly outnumber those of the depressed class, the resulting imbalance can skew model outcomes if not properly managed. Two primary strategies are commonly employed to mitigate this issue: the use of evaluation metrics that accommodate class imbalance and data pre-processing techniques, such as resampling and reweighting. Japkowicz and Stephen (2002) emphasize that metrics like recall, precision, and F1 score offer a more nuanced evaluation by accounting for both positive and negative classes, thus reducing potential bias. Additionally, data pre-processing methods like reweighting or resampling adjust the dataset to provide balanced exposure to both classes, enhancing model training on imbalanced data.
While some studies utilized both strategies, demonstrating a thorough approach to handling imbalance, others employed just one—either through preferred evaluation metrics or data balancing. Even when only one strategy is adopted, it can still reduce potential bias to some extent. However, solely relying on accuracy introduces a significant risk of bias, as it often leads the model to favor the majority class, thereby failing to identify depressed individuals accurately. Chawla et al. (2004) highlight that this reliance on accuracy alone can lead to misleading conclusions in imbalanced datasets, as it does not accurately reflect the model’s ability to detect minority class instances.
Out of the 47 studies analyzed, approximately 35 employed preferred metrics such as F1 score, precision, recall, or AUROC, recognizing their importance in evaluating models on imbalanced datasets. Seven studies explicitly mentioned preprocessing steps like resampling to mitigate class imbalance, even when using accuracy as an evaluation metric. However, several studies relied mainly on accuracy without addressing class imbalance, potentially introducing bias into their evaluations.
In conclusion, while a significant number of studies have adopted appropriate evaluation metrics and techniques to address class imbalance, there remains a need for broader implementation of these practices. Incorporating balanced metrics and addressing class imbalance is essential for reliable and valid model evaluations in depression detection research. As Fernández et al. (2018) recommended, employing these strategies enhances the robustness of machine learning models in domains characterized by imbalanced datasets.
Reporting: Transparency and Completeness:
Transparency and completeness in reporting are fundamental to the integrity and reproducibility of scientific research. In our examination of the 47 studies, we assessed the extent to which they transparently reported their methodologies, findings, and limitations. Notably, all studies (100%) included a limitation section, indicating an overall acknowledgment of the importance of addressing potential shortcomings. However, the depth and specificity of these disclosures varied significantly across the studies.
While every study mentioned limitations, not all of them fully recognized or disclosed all critical methodological issues that could impact their findings. For instance, as highlighted in our earlier assessments, approximately 17% of the studies (8 out of 47) did not properly partition their data or failed to report their data partitioning methods adequately (Studies #2, #5, #9, #12, #20, #27, #31, and #37). Despite this, only a few of these studies explicitly acknowledged the potential biases introduced by improper data partitioning in their limitations sections. This suggests that while researchers are generally aware of the necessity to report limitations, there is a gap in fully understanding or disclosing specific methodological shortcomings, such as data partitioning, which is crucial for model generalizability and validity.
Similarly, in the context of hyperparameter tuning, approximately 43% of the studies did not report or properly tune hyperparameters across all models used (e.g., Studies #1, #2, #4, #5, #12, #14, #17, #19, #20, #24, #27, #29, #30, #32, #34, #35, #37, #38, #42, #44, and #46). Only a few acknowledged this limitation in their reports. This lack of comprehensive reporting on hyperparameter tuning can lead to biased model comparisons and affect the reproducibility of the studies.
Incomplete or non-transparent reporting can introduce significant bias and limit the reproducibility and applicability of research findings. When critical methodological details are omitted or underreported, it hinders the ability of other researchers to replicate studies or to understand the context in which the results are valid. For instance, failing to disclose improper data partitioning can lead to overestimation of model performance due to overfitting (Bishop, 2006). Models evaluated on training data or without appropriate validation may appear to perform well, but this performance may not generalize to new, unseen data. This oversight can mislead stakeholders about the efficacy of the models and affect subsequent research or practical applications that build upon these findings.
Similarly, not reporting on hyperparameter tuning practices can result in unfair comparisons between models and misinterpretations of their relative performances (Claesen & De Moor, 2015). Models with optimized hyperparameters may outperform others not because they are inherently better but because they were given an optimization advantage. Without transparency in reporting these practices, readers cannot assess the fairness of the comparisons or replicate the optimization process.
In conclusion, while all 47 studies recognized the importance of reporting limitations, there remains a notable disparity in the thoroughness and transparency of their reporting. For the field to advance, transparent and comprehensive reporting of methodologies and limitations is essential. Future research should strive for complete disclosure of data collection, preprocessing, model development, hyperparameter tuning, and evaluation metrics. This includes acknowledging specific methodological limitations, such as data partitioning practices and sampling biases, and discussing how these limitations may impact results and generalizability. Such transparency will allow others to interpret findings accurately, replicate studies, and build upon prior work effectively.
Summary of Findings and Implications for Future Research
This systematic review evaluated biases throughout the entire lifecycle of machine learning and deep learning models for depression detection on social media. In sampling, biases arose from a predominant reliance on Twitter, English-language data, and specific geographic regions, limiting the representativeness of findings. Data preprocessing commonly showed inadequate handling of negations, which can skew sentiment analysis results. Model development was often compromised by inconsistent hyperparameter tuning and improper data partitioning, reducing model reliability and generalizability. Lastly, in model evaluation, an overreliance on accuracy without addressing class imbalance risked favoring majority class predictions, potentially misleading results. These findings highlight the importance of enhancing methodologies to bolster the validity and applicability of future research.
To address these biases, future research should improve practices across all stages of the machine learning lifecycle. Expanding data sources across multiple platforms, languages, and regions will help mitigate platform and language biases and improve representativeness. Standardizing data preprocessing, especially with explicit negation handling, and employing resampling and reweighting techniques will enhance sentiment analysis accuracy and balance datasets. Consistent hyperparameter tuning protocols are essential to ensure fair model comparisons and optimal performance. Lastly, prioritizing evaluation metrics like precision, recall, F1 score, and AUROC in imbalanced datasets, particularly for depression detection, will yield more accurate and insightful assessments. By implementing these improvements, future studies can achieve greater model robustness and generalizability, contributing to more effective mental health detection tools.
Authors:
(1) Yuchen Cao, Khoury College of Computer Sciences, Northeastern University;
(2) Jianglai Dai, Department of EECS, University of California, Berkeley;
(3) Zhongyan Wang, Center for Data Science, New York University;
(4) Yeyubei Zhang, School of Engineering and Applied Science, University of Pennsylvania;
(5) Xiaorui Shen, Khoury College of Computer Sciences, Northeastern University;
(6) Yunchong Liu, School of Engineering and Applied Science, University of Pennsylvania;
(7) Yexin Tian, Georgia Institute of Technology, College of Computing.
This paper is available on arxiv under CC BY 4.0 license.
[1] Attention mechanisms allow models to focus on specific parts of the input data by assigning different weights to different elements. This enables the model to capture and utilize relevant contextual information more effectively during processing.