Published in Vol 9, No 3 (2022): Jul-Sep

Integrating Natural Language Processing and Interpretive Thematic Analyses to Gain Human-Centered Design Insights on HIV Mobile Health: Proof-of-Concept Analysis

Original Paper

1Department of Social, Behavioral, and Population Sciences, School of Public Health and Tropical Medicine, Tulane University, New Orleans, LA, United States

2Department of Psychology, Hunter College, City University of New York, New York, NY, United States

3Department of Psychology, San Diego State University, San Diego, CA, United States

Corresponding Author:

Simone J Skeen, MA

Department of Social, Behavioral, and Population Sciences

School of Public Health and Tropical Medicine

Tulane University


1440 Canal Street

New Orleans, LA, 70112

United States

Phone: 1 504 988 1847


Background: HIV mobile health (mHealth) interventions often incorporate interactive peer-to-peer features. The user-generated content (UGC) created by these features can offer valuable design insights by revealing what topics and life events are most salient for participants, which can serve as targets for subsequent interventions. However, unstructured, textual UGC can be difficult to analyze. Interpretive thematic analyses can preserve rich narratives and latent themes but are labor-intensive and therefore scale poorly. Natural language processing (NLP) methods scale more readily but often produce only coarse descriptive results. Recent calls to advance the field have emphasized the untapped potential of combined NLP and qualitative analyses toward advancing user attunement in next-generation mHealth.

Objective: In this proof-of-concept analysis, we gain human-centered design insights by applying hybrid consecutive NLP-qualitative methods to UGC from an HIV mHealth forum.

Methods: UGC was extracted from Thrive With Me, a web app intervention for men living with HIV that includes an unstructured peer-to-peer support forum. In Python, topics were modeled by latent Dirichlet allocation. Rule-based sentiment analysis scored interactions by emotional valence. Using a novel ranking standard, the experientially richest and most emotionally polarized segments of UGC were condensed and then analyzed thematically in Dedoose. Design insights were then distilled from these themes.

Results: The refined topic model detected K=3 topics: A: disease coping; B: social adversities; C: salutations and check-ins. Strong intratopic themes included HIV medication adherence, survivorship, and relationship challenges. Negative UGC often involved strong negative reactions to external media events. Positive UGC often focused on gratitude for survival, well-being, and fellow users’ support.

Conclusions: With routinization, hybrid NLP-qualitative methods may be viable to rapidly characterize UGC in mHealth environments. Design principles point toward opportunities to align mHealth intervention features with the organically occurring uses captured in these analyses, for example, by foregrounding inspiring personal narratives and expressions of gratitude, or de-emphasizing anger-inducing media.

JMIR Hum Factors 2022;9(3):e37350




The advent of antiretroviral therapy (ART) marked an inflection point in the global AIDS epidemic, transforming HIV into a manageable chronic condition [1-3]. Because people living with HIV who maintain undetectable viral loads cannot pass the virus to their sexual partners, viral suppression through optimized ART adherence is now a key tenet of population-level HIV-prevention planning [4,5]. However, ART adherence remains a challenge for many people living with HIV, endangering their health through viral rebound [6]. These challenges are attributable to a range of interlocking factors, many of them mirroring broader societal inequities in the United States: mistrust of medical providers, logistical and financial burdens of medical appointments, and stigma [7-10]. Unreliable transit, a lack of accessible brick-and-mortar services, and trauma can compound these challenges, particularly for many Black men who have sex with men (MSM) [11,12].

These persistent challenges suggest that traditional clinic-based treatment programs may be inadequate for fulfilling the needs of many MSM living with HIV. Mobile health (mHealth) interventions, which offer tools such as informational videos, hyperlocal service guides, and peer-support forums, have shown promise in this domain [13-17], including among MSM [15]. Many mHealth interventions include user-centered adaptations to bolster their appeal to user bases who inhabit intersecting identities (eg, Messages4Men for Black and Latino MSM [18]) or undertake specific risk behaviors (eg, APP+ for stimulant-using MSM [19]).

Traditional formative methods [20-23], often guided by the principles of user- and human-centered design (HCD [24-29]), aim to incorporate the insights of prospective mHealth user bases. Focus groups, user-experience interviews, and related in-person or virtual interactions are often undertaken to gain these insights. These methods can represent important contributions toward global health equity [30,31]. However, by relying on in-depth and often iterative interactions such as “think-aloud” usability tests [32], these methods can be burdensome to members of the communities they aim to empower, requiring time and logistical commitments akin to traditional study participation [33-35]. One alternative to these immersive approaches is mining user-generated content (UGC), comprising rich, unstructured, text-based data that end users themselves contribute to platforms, often in the form of social media posts or product reviews [36,37]. Across diverse sectors [37-39], UGC is increasingly recognized as an unmediated source of experiential data, through which consumers’, citizens’, and end users’ needs can be ascertained noninvasively at scale [40,41].

The scale of UGC data can introduce analytic challenges. The extraction of meaningful units of analysis among vast unstructured data is the foremost among those challenges [42]. Natural language processing (NLP) approaches, which rely on machine-readable elements such as keyword frequencies and probabilistic distributions of keyword clusters [43], are often employed for UGC analyses [44,45]. One common NLP technique is topic modeling (TM), in which the likelihood of contextually meaningful terms to co-occur in relative proximity to each other and thus signify a discrete topic within an unstructured text is computed [46]. For example, the relative proximity of the terms “epidemic,” “antiretroviral,” and “suppression” in the opening paragraphs of this introduction would be highly unlikely to occur by chance alone. Instead, their likelihood to co-occur in those passages can be interpreted as a meaningful signifier of the topic in those passages, namely HIV treatment. The topic model itself is composed of these co-occurring terms [43]. Another widely employed NLP technique, sometimes used in concert with TM [47], is sentiment analysis (SA). SA refers to a variety of tools that map individual keywords and other syntactic units to a prevalidated human-rated lexicon, computing a crude but summative account of the prevailing emotional tenor of a text [45,48].

NLP techniques are typically incapable of preserving narrative, subtext, and nuance [49,50]. Within digital health research, recent attempts to address these shortcomings have integrated NLP with traditional qualitative methods. These methods, although fruitful, remain exploratory, and are often resource-intensive, with little evident standardization in methods. In health sciences, combined NLP and qualitative approaches have been applied, preliminarily, toward cross-validation of each respective approach. For example, Leeson et al [51] have shown that conceptual overlaps among the findings of probabilistic TM using the Gensim toolkit in Python, the neural network application Word2Vec, and open qualitative coding are broad but not uniform [51], demonstrating the value of a “both-and” versus an “either-or” approach to machine- versus human-optimized analyses of UGC. The clearest strength of the “both-and” approach is its ability to analyze very large textual data sets, while preserving important nuance. To this end, Guetterman et al [52] combined qualitative coding and an NLP semantic-similarity clustering technique to classify open-ended text message responses to the MyVoice national youth poll. Through a modified 2-arm crossover experiment, NLP, qualitative, and sequential NLP-qualitative and qualitative-NLP variations were compared. Although the latter sequential approaches proved most time-consuming, they were able to check the validity of exploratory qualitative work or cultivate more nuanced interpretations of NLP-applied topics, respectively [51]. Jones et al [53] used a sequential qualitative-NLP approach to model topics across 4,901,516 posts contributed to 5 breast cancer forums scraped (with permission) from the open web. Timimi et al [54], examining UGC from the Inspire online support communities, used a nested NLP-qualitative approach to generate “entities” (a clustering technique) across more than 11 million unique posts. 
An inductive thematic coding analysis, applied to a subset of 246 posts, aided in developing a patient-centered lexicon to identify cognitive impairment side effects related to statin use.

Specifically, within mHealth, Petersen et al [55] integrated latent Dirichlet allocation (LDA) TM and SA with standard assessments of usability within a user-centered app design process. The sentiment of formative user interviews trended more positive as development progressed, which was reflected through improvements in the System Usability Scale (though not usefulness, satisfaction, and ease of use) scores. To our knowledge, no prior studies have applied a combined NLP-qualitative approach to textual UGC derived from an interactive mHealth environment. This is despite recent calls to bridge the respective strengths of data mining, at scale, with the richly realized insights provided by end-user narratives, to advance design practices in mHealth [56]. These detailed user-experience insights are necessary to advance mHealth design within the HCD paradigm [24,57,58]. If mHealth is to play a key role in the global HIV epidemic response, its persistent adoption will require deeply humanistic, yet scalable, strategies to guide user-centered adaptation. To this end, analyses of UGC in HIV mHealth must preserve the full range of human experiences and unique needs of multiply marginalized people living with HIV.


Recent findings point to the relative strengths of the sequential NLP-qualitative approach toward characterizing large-scale UGC, while preserving experiential nuance [51-55]. We applied a variation of this approach to UGC from the peer-support forum of Thrive With Me, a web app tailored for gay and bisexual MSM living with HIV [59,60]. Blending the strengths of machine-optimized techniques using NLP analyses with the strengths of traditional qualitative analyses, our findings were guided by the following aims:

Aim 1: To demonstrate the viability of a novel, sequential, NLP-qualitative approach toward characterizing UGC contributed by the end users of Thrive With Me

Aim 2: To examine the implications of the UGC-derived insights obtained in Aim 1 toward developing user-centered design adaptations for the next generation of HIV mHealth interventions

Study Intervention

Thrive With Me is a web app–delivered intervention that combines self-monitoring tools for ART adherence, informative multimedia covering ART adherence, and asynchronous peer-to-peer support within a pseudonymous forum with the aim of improving treatment adherence among MSM living with HIV. Its components are grounded in the Information-Motivation-Behavioral skills (IMB) model of health behavior change [61]. An early iteration of Thrive With Me demonstrated preliminary efficacy versus treatment as usual in a pilot randomized controlled trial [59]. A prospective 2-arm randomized controlled trial testing a refined version of Thrive With Me versus an information-only control condition finished in 2019, with outcome analyses presently underway [60]. A screenshot of the user interface on which Thrive With Me users interacted is shown in Figure 1.

Figure 1. Illustrative screenshot of the Thrive With Me peer-support forum’s user interface. Posts and comments in the screenshot were mocked up by the study staff for demonstration purposes.

Study Population

Participants were eligible if they (1) were HIV seropositive, (2) identified as male, (3) had a self-reported detectable viral load or suboptimal (<90%) ART adherence in the past 30 days, (4) reported sex with another man in the past 12 months, (5) could read and write English, (6) resided in the New York City area, and (7) had access to the internet and SMS text messaging for the duration of the study [60]. This study analyzed UGC contributed by participants randomized to the trial’s active intervention condition (N=202), who were given access to the Thrive With Me web app for 5 months beginning at baseline. (Throughout, we use “UGC” to refer to unstructured text exclusively, distinct from paradata or usage analytics.) The subsample’s sociodemographic attributes are shown in Table 1. Full details of the Thrive With Me parent trial are available elsewhere [60].

Table 1. Baseline characteristics of Thrive With Me study participants in the intervention arm.
| Demographics | Thrive With Me intervention arm (N=202) |
|---|---|
| Age, mean (SD) | 40.1 (10.8) |
| Male, n (%) | 202 (100) |
| Race, n (%) | |
| - African American or Black | 123 (61) |
| - American Indian/Alaskan Native | 1 (0.5) |
| - Asian | 1 (0.5) |
| - Native Hawaiian or Pacific Islander | 2 (1.0) |
| - White | 54 (27) |
| - More than one race | 12 (5.9) |
| - Not reported | 9 (4.5) |
| Hispanic, n (%) | 62 (31) |
| Education, n (%) | |
| - High school or less | 59 (29) |
| - Some college/associates/technical degree | 90 (45) |
| - College/postgraduate/professional degree | 52 (26) |
| - Not reported | 1 (0.5) |
| Employment status, n (%) | |
| - Full-time | 41 (20) |
| - Part-time | 45 (22) |
| - Unemployed | 77 (38) |
| - Disabled | 35 (17) |
| - Retired | 2 (1.0) |
| - Not reported | 2 (1.0) |
| Viral load (VL) measures: VL (biological) (<20), n (%) | |
| - Detectable VL | 74 (37) |
| - Undetectable VL | 127 (63) |
| - Not reported | 1 (0.5) |

Ethics Approval

All study procedures and the use of associated data for secondary analyses were approved by the ethics review boards of the University of Minnesota (#1504S69721) and Hunter College of the City University of New York (#2015-0641).


Initially, our procedures relied on the NLP techniques of unsupervised TM and rule-based SA to capture the semantic attributes of UGC drawn from Thrive With Me. We then employed a novel ranking technique to condense the richest and most emotionally polarized UGC. Finally, the detailed insights included in this condensed UGC were explored using the qualitative technique of interpretive thematic analysis. A flowchart of our complete procedure is shown in Figure 2.

Figure 2. Flowchart of sequential machine- and human-optimized techniques. ICR: intercoder reliability; LDA: latent Dirichlet allocation; SA: sentiment analysis; TM: topic modeling; UGC: user-generated content; VADER: Valence Aware Dictionary for sEntiment Reasoner.
Data Extraction

Textual UGC from Thrive With Me’s peer-support forums were extracted by the web app’s developer, Radiant, as a structured .csv file using the Drupal content management system’s Entity Export CSV function. Original posts and the comments they accrued were handled uniformly (referred to as posts throughout) for the sake of analysis. Content generated by study staff during prelaunch testing was removed manually before preprocessing. With test content excised, the raw UGC corpus contained 4912 posts and 147,649 total words. To accommodate necessary differentiation in the preprocessing steps, 2 UGC corpora were created: the SA corpus and TM corpus.

Data Preprocessing

All subsequent data preprocessing and NLP analyses were undertaken in Python (version 3.7.10, Python Software Foundation) on the Windows 10 (Microsoft Corporation) operating system.

In the TM corpus, first, unigram frequencies were calculated, and any unigrams occurring fewer than 3 times were discarded. The 571-term SMART (System for the Mechanical Analysis and Retrieval of Text) stop list was then applied to the raw TM corpus, excising unigrams such as “the” and “of,” whose co-occurrences are not typically indicative of the underlying topics [62,63]. Punctuation was removed throughout, and all terms were converted to lowercase and then “split by whitespace” to ensure consonance in model inputs [43].
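The TM preprocessing steps can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the study's replication script: the stop list below is a tiny stand-in for the 571-term SMART list, the example posts are invented, and the function names `tokenize` and `preprocess` are hypothetical. For brevity, the sketch lowercases and strips punctuation before counting frequencies, which simplifies the order of operations described above.

```python
import re
from collections import Counter

# Tiny illustrative stop list; the study used the 571-term SMART list instead.
STOP_WORDS = {"the", "of", "a", "and", "to", "my", "is", "i", "in"}
MIN_COUNT = 3  # unigrams seen fewer than 3 times are discarded


def tokenize(post):
    """Lowercase, replace punctuation with spaces, and split on whitespace."""
    cleaned = re.sub(r"[^\w\s]", " ", post.lower())
    return cleaned.split()


def preprocess(posts, stop_words=STOP_WORDS, min_count=MIN_COUNT):
    """Return one token list per post after frequency and stop-word filtering."""
    tokenized = [tokenize(p) for p in posts]
    # Corpus-wide unigram frequencies.
    counts = Counter(tok for toks in tokenized for tok in toks)
    return [
        [t for t in toks if counts[t] >= min_count and t not in stop_words]
        for toks in tokenized
    ]


# Hypothetical posts, written for this example only.
posts = [
    "Took my meds this morning!",
    "Morning, Thrivers. Meds taken.",
    "Good morning... the meds are working.",
]
print(preprocess(posts))
```

In this toy corpus only “meds” and “morning” occur at least 3 times, so they are the only unigrams that survive filtering.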

In the SA corpus, all semantic elements were preserved. In social media environments such as the Thrive With Me forum, peculiarities in syntax may amplify or even invert the intended sentiment of a text (eg, “so happy” versus “SOO happy!!! <3” versus “sooo happy. /s”) and thus represent important model inputs to retain [64].

TM Process

All steps in TM were applied to the TM corpus. We used the unsupervised LDA algorithm native to the scikit-learn (“sklearn”) Python library [65]. LDA is a generative probabilistic model that outputs a distribution of words (termed “tokens” [66]), which characterize the discrete topics within a text corpus [46]. K, the number of topics an LDA model will detect, is a model input determined based on prior familiarity with a corpus, relevant domain expertise, and the results of exploratory analyses [43]. Replication scripts for LDA TM are provided in Multimedia Appendix 1.
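As a rough illustration of this step, the sketch below fits an LDA model with scikit-learn on a handful of invented, already-preprocessed posts. It is not the replication script from Multimedia Appendix 1. The toy posts, the `random_state` value, the use of `CountVectorizer` to build the count matrix, and the rule of assigning each post to its highest-probability topic are all assumptions made for this example.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical, already-preprocessed posts (not study data).
posts = [
    "meds doctor undetectable treatment hiv",
    "hiv meds taking treatment doctor",
    "love life people feel hard",
    "people feel love real life",
    "morning happy weekend hope today",
    "happy today morning weekend welcome",
]

K = 3  # number of topics; a model input chosen by the analyst

vectorizer = CountVectorizer()
doc_term = vectorizer.fit_transform(posts)  # posts x vocabulary count matrix

lda = LatentDirichletAllocation(n_components=K, random_state=0)
doc_topic = lda.fit_transform(doc_term)  # posts x K topic proportions

# Map column indices back to terms via the fitted vocabulary.
terms = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

TOP_N = 5  # the study inspected the top 30 tokens per topic
top_tokens = []
for weights in lda.components_:
    top_idx = weights.argsort()[::-1][:TOP_N]
    top_tokens.append([terms[i] for i in top_idx])

# Assumption: each post is assigned to its highest-probability topic.
assigned_topic = doc_topic.argmax(axis=1)

for k, tokens in enumerate(top_tokens):
    print(k, tokens)
```

With a corpus this small the recovered topics are not meaningful; the point is only to show where K enters the model and how per-topic tokens are read from the fitted components.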

Each LDA model was evaluated for coherence by the first and second authors (SJS and SSJ) aided by the pyLDAvis tool. pyLDAvis plots modeled topics in 2 dimensions represented by circles, allowing for visual inspection of intertopic distances (how thematically distinct each topic is) and topic prevalence (how much content within a corpus each topic captures). A satisfactory K is characterized visually by circles with sufficiently large radii to capture a substantive share of a corpus and negligible overlap between circles, indicating discriminant interpretability across topics [67]. Detailed documentation on the use of pyLDAvis is available elsewhere [68]. We denote this preliminary LDA model as Model 1, which advanced to first-pass thematic analysis.

Finally, informed by Schofield and colleagues [62,63] and based on the coding schema developed inductively with Model 1, we removed high-frequency, non-topic–specific n-grams to generate a more intuitive set of tokens. This second pass was used to provide more self-evidently meaningful clusters of tokens for this proof-of-concept analysis. We denote this final model as Model 2.

Topic labels were developed based on domain knowledge, visual inspection of the top 30 per-topic tokens, and their particular distribution and contextual usage within the full series of posts assigned to each topic. Labels were finalized based on consensus between the first and second authors (SJS and SSJ).

SA Process

All steps in SA were applied to the SA corpus using the vaderSentiment library in Python [69]. We used the human-validated VADER (Valence Aware Dictionary for sEntiment Reasoner) sentiment lexicon, which scores the valence and intensity of individual terms and their related semantic elements, such as emoticons (“(: ”) and abbreviations common to social media and web-based forums (“lol” and “wtf”). For each input string, VADER outputs positive, neutral, and negative polarity scores (each a proportion between 0 and 1) and a normalized compound score (on a scale of –1 to +1) [64]. For this analysis, we generated sentiment polarity and compound scores per unique post. As the richest instances of neutral-sentiment UGC were thematically redundant with the posts examined via LDA, we focused on emotionally polarized UGC captured by VADER’s positive and negative polarity scores. This focus on polarized UGC allowed us to explore sources and expressions of distress, while highlighting organically occurring positive interactions among Thrive With Me users.

Replication scripts for VADER SA are provided in Multimedia Appendix 1.


Data condensation strengthens an analytic sample by honing it to its richest, most illustrative cases [70]. To condense the raw 4912-post UGC corpora, we used a novel percentile-ranking standard, loosely informed by (and considerably simplified from) the work of Nikolenko and colleagues [49] to advance the most meaningful data toward thematic analysis.

In the TM corpus, we calculated a simple affinity score for each post by summing the number of topic-specific tokens that appeared within that post. In this context, affinity refers to the degree to which each post is representative of the topic to which it has been assigned [49]. Using the =PERCENTILE() function in Excel (Microsoft Corporation) [71], we identified the 90th percentile affinity score for each topic, discarding posts that contained fewer topic-specific tokens than the 90th percentile thresholds.
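The affinity ranking can be sketched as follows, under stated assumptions. The percentile helper mirrors the inclusive, linearly interpolated definition used by Excel's PERCENTILE function; the topic tokens and posts are invented; the function names are hypothetical; and the sketch counts every occurrence of a topic-specific token and keeps posts scoring at or above the threshold, which is one reading of the procedure described above.

```python
def percentile_inc(values, p):
    """Inclusive percentile with linear interpolation between ranks."""
    ordered = sorted(values)
    rank = p * (len(ordered) - 1)
    lower = int(rank)
    frac = rank - lower
    if lower + 1 < len(ordered):
        return ordered[lower] + frac * (ordered[lower + 1] - ordered[lower])
    return ordered[lower]


def affinity(post_tokens, topic_tokens):
    """Count occurrences of topic-specific tokens in one post."""
    return sum(1 for tok in post_tokens if tok in topic_tokens)


# Hypothetical topic tokens and posts, for illustration only.
topic_tokens = {"meds", "doctor", "hiv", "treatment", "undetectable"}
posts = {
    "p1": "took my meds then saw my doctor about treatment".split(),
    "p2": "good morning everyone".split(),
    "p3": "hiv meds keep me undetectable says my doctor".split(),
    "p4": "new treatment today".split(),
    "p5": "happy weekend".split(),
}

scores = {pid: affinity(toks, topic_tokens) for pid, toks in posts.items()}
threshold = percentile_inc(scores.values(), 0.90)

# Posts below the 90th percentile affinity score are discarded.
kept = [pid for pid, score in scores.items() if score >= threshold]
print(scores, threshold, kept)
```

The same helper applies unchanged to any other per-post score, so a different cutoff can be explored by changing the value passed as `p`.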

In the SA corpus, we relied on VADER-generated polarity scores for percentile ranking. Posts that fell below the 90th percentile polarity score for positive and negative valences were discarded.

The SA and TM corpora were percentile-ranked independently. LDA modeling, which relies on co-occurrence of terms, favors verbose UGC, whereas VADER, reliant on purer expressions of sentiment, favors concision; hence, no UGC was duplicated in the condensed TM and condensed SA corpora. Specifically, richer and verbose UGC was emphasized in the condensed TM corpus, whereas emotive and concise UGC was emphasized in the condensed SA corpus.

A 90th percentile cutoff was chosen to condense a data set such that it became compact enough to be handled by 2 human coders (SJS and CMC) for the following inductive thematic analyses.

Interpretive Thematic Analyses

The condensed data set, comprising high-affinity and high-polarity UGC, was then subdivided into .csv files for thematic analysis by human coders. We used an inductive latent-level approach to examine the underlying concepts and discursive nuances intratopically [72]. Each stable topic and the strongest positively and negatively scored posts were thus handled as a meta-theme, each within a discrete .csv file. Human coders (SJS and CMC) undertook immersive close reads of these posts, identifying emergent intratopic themes and building pilot codes, first independently and then collaboratively, informed by the RADaR (rigorous and accelerated data reduction) technique in Excel [73]. Initially, we conducted open coding in Excel to leverage the accessibility of rapid matrix analysis techniques undertaken with nonspecialized software and to facilitate the necessary sorting and ranking of posts. Codes were applied iteratively and the overall coding schema was refined in conference, until unanimity in coding applications was obtained. Then, all data were migrated to Dedoose (SocioCultural Research Consultants) for final coding of the condensed data set that included LDA Model 1, where an overall pooled intercoder reliability of κ=0.78 was achieved [70,74]. Finally, after obtaining acceptable intercoder reliability, the first author independently applied the coding schema to the condensed data set that included LDA Model 2 in Dedoose, producing the final coding applications reported here.
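For readers unfamiliar with the κ statistic, the sketch below computes plain Cohen's κ for 2 coders applying a single code to 10 posts. It is a simplified illustration with invented ratings, not the pooled statistic that Dedoose reports across the full coding schema, and the function name is hypothetical.

```python
from collections import Counter


def cohens_kappa(coder_a, coder_b):
    """Cohen's kappa for two coders rating the same items."""
    n = len(coder_a)
    # Proportion of items on which the coders agree.
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    # Agreement expected by chance, from each coder's label frequencies.
    freq_a, freq_b = Counter(coder_a), Counter(coder_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1:
        return 1.0  # both coders used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical ratings: 1 = code applied to the post, 0 = not applied.
coder_a = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
coder_b = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

kappa = cohens_kappa(coder_a, coder_b)
print(round(kappa, 2))
```

Here the coders agree on 8 of 10 posts (observed agreement 0.80) against a chance agreement of 0.50, so κ = (0.80 – 0.50) / (1 – 0.50) = 0.60.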

TM Process

The LDA model rated for optimal coherence comprised K=3 topics, each composed of 30 co-occurring tokens. Topic A, disease coping [75], encompassed all posts in which the subject of living with HIV as a chronic condition predominated. Topic B, social adversities, covered those posts explaining the difficulties of navigating the interpersonal sphere as a person living with HIV. Topic C, salutations and check-ins, covered the broad array of brief greetings and personal updates routinely shared by users of the Thrive With Me forum. From the refined model, Model 2, our condensed data set included the 67 posts that contained more than 5 topic A–specific tokens (mean 7.31, SD 1.83), 118 posts containing more than 6 topic B–specific tokens (mean 9.43, SD 2.05), and 113 posts containing more than 4 topic C–specific tokens (mean 5.81, SD 1.14).

Altering the percentile split (whose primary rationale in this study was pragmatic) would have varied the size of the condensed UGC corpus considerably. In topic A, at the 75th percentile, at >3 tokens per post, 188 posts would be carried forward to thematic analysis; at the 95th percentile, or >6 tokens per post, 38 posts would be carried forward. In topic B, at the 75th percentile, at >4 tokens per post, 270 posts would be carried forward to thematic analysis; at the 95th percentile, or >8 tokens per post, 72 posts would be carried forward. Topic C, given the sparser nature of its UGC, was more dispersed. At the 75th percentile, at >2 tokens per post, 522 posts would be carried forward to thematic analysis; at the 99th percentile, or >6 tokens per post, only 20 posts would be carried forward.

The Model 2 tokens that characterize these topics, their labels and definitions, details of their condensation including the 90th percentile affinity score thresholds, and illustrative excerpts are shown in Table 2.

The number of posts and the number of tokens detected in LDA modeling by topic and by user are tabulated in Multimedia Appendix 2.

Table 2. Machine-detected topics, token n-grams, intratopic condensation, definitions, and illustrative examples.
| Topic | Model 1 tokens | Model 2 tokens | Label | Definition | Model 2: posts per topic, n (%) (N=4912) | Model 2: 90th percentile threshold | Model 2: high-affinity posts per topic, n (%) (N=1276) | Model 2: example high-affinity posts^a |
|---|---|---|---|---|---|---|---|---|
| A | aids, care, com, doctor, don, effects, free, health, help, hiv, http, https, just, know, living, meds, need, new, people, positive, support, taking, thanks, time, took, treatment, undetectable, use, www, yes | aids, care, community, days, doctor, effects, feel, free, gay, health, hiv, living, know, meds, man, men, need, new, people, positive, really, sex, support, taking, think, time, took, treatment, undetectable, use | Disease coping | Portrayals of daily living with HIV, emphasizing serostatus awareness, ART^b regimens, and other sociomedical topics | 1028 (20.92%) | >5 topic-specific tokens per post | 67 (5.25%) | I don’t think disclosing an HIV undetectable viral load will persuade anyone who is HIV negative that we’re less likely to infect them. It can be useful for potential partners that are also HIV positive because they are more likely to understand and accept that an undetectable viral load lowers the risk for re infection. Someone looking to avoid HIV or the risk of having sex with anyone HIV infected will likely not care or understand about undetectable viral loads. |
| B | blessed, cause, com, come, day, don, feel, gay, good, https, just, know, life, like, love, make, morning, men, real, really, people, person, sex, say, things, think, time, want, way, www | better, blessed, cause, come, day, feel, gay, good, hard, know, life, live, love, make, man, men, people, person, point, need, new, real, really, say, think, time, want, way, work, year | Social adversities | Portrayals of challenges and accomplishments in navigating sociality and sexuality as a sexual minority MSM^c living with HIV | 1555 (31.65%) | >6 topic-specific tokens per post | 118 (9.25%) | Truth b told…. what i am finding hard is to find guys that want more than a hook up....(they) always seem to want to sleep together FIRST (…) just sleeping with strangers right away doesn’t turn me on like it used to...makes me feel kind of like a freak at times..... if all i wanted to do was ‘play’- I’d have NO problem finding guys to roll around with- even with my status- which i immediately and upfront disclose, both online and in person....... It’s finding guys who want conversation and dating and getting to know someone that has been hardest for me.... |
| C | better, day, days, enjoy, feel, feeling, good, great, going, got, guys, happy, hey, hope, just, like, lol, man, morning, new, really, today, time, ve, welcome, year, years, week, weekend, work | best, better, day, doing, enjoy, feeling, going, good, got, great, guys, friday, happy, hello, Hey, hope, lol, luck, monday, morning, nice, really, sunday, thanks, time, today, week, weekend, welcome, wish | Salutations and check-ins | Greetings and brief personal updates | 2329 (47.41%) | >4 topic-specific tokens per post | 113 (8.86%) | Morning Thrivers! Can say much about my weekend cause I slept through it......... I just wish this holiday season would be over already so I can get back to some kind of normal being........ anyway I wish everyone a productive week and an enjoyable thanksgiving.......... |

^a The topic-specific tokens are italicized.

^b ART: antiretroviral therapy.

^c MSM: men who have sex with men.

SA Process

For the positively valenced ([+]Pos) posts, our condensed data set included the 488 posts assigned a polarity score >0.659 by the VADER lexicon ([+]Pos intravalence mean 0.81, SD 0.12). For the negatively valenced ([–]Neg) posts, our condensed sample included the 490 posts that were assigned a polarity score >0.196 ([–]Neg intravalence mean 0.34, SD 0.16).

Details of the intravalence condensation of the strongly positive and negative posts, with illustrative examples, are shown in Table 3.

Table 3. VADER (Valence Aware Dictionary for sEntiment Reasoner)-assigned sentiment polarity, intravalence condensation, and illustrative examples.
| Sentiment polarity | 90th percentile threshold | High-affinity posts per valence, n (%) (N=1276) | Example high-affinity posts (including polarity scores) |
|---|---|---|---|
| (+)Pos^a | >0.659 (+) score | 488 (38.24%) | “Beautiful story, thanks for sharing” (0.828 Pos, 0.172 Neg); “I love you positiveness.............” (0.789 Pos, 0.000 Neg) |
| (–)Neg^b | >0.196 (–) score | 490 (38.4%) | “I hate trump (lower case)!!!” (0.000 Pos, 0.604 Neg); “Bad anxiety today. Even my blood pressure was high.” (0.000 Pos, 0.552 Neg) |

^a Positively valenced.

^b Negatively valenced.

Thematic Analyses

The condensed data set contained 1276 posts: 298 associated with the 90th percentile of the affinity of topics A, B, and C from LDA Model 2, and 978 associated with the 90th percentile of positive and negative polarity. This data set was advanced to thematic analysis. The detected intratopic and intravalence themes, their operational definitions, code co-occurrences, and illustrative excerpts are displayed in a meta-matrix in Multimedia Appendix 3.

Within topic A, most themes articulated the distinct, day-to-day obligations of living with HIV. The most frequently detected themes, which reflected informative prompts provided by the Thrive With Me web app, covered ART medications. These instances were rich enough to warrant the coding of dedicated subthemes capturing detailed adherence tips, personal antiretroviral regimens, and adverse effects. Issues of long-term survival were raised, as were various personal narratives and peer-to-peer recommendations for disclosing one’s HIV serostatus to potential sexual partners. Further, a code (“Raising Awareness”) captured the many instances in which users shared details of activist events, local resources, and HIV-tailored public health messaging.

Within topic B, diverse personal narratives were shared, including all articulations of the specific challenges that sexual minority MSM living with HIV may encounter as they seek social and sexual bonds with other men. These included mismatched expectations around relationship longevity and extradyadic pairings, life chaos attributed to partners’ alcohol and crystal meth use, and the roles of ex-partners. Trust, broken trust, discussions of self-confidence, and expressions of loneliness and isolation were emergent intratopic themes. The roles of support networks, including direct appeals to and provision of peer-to-peer social support among Thrive With Me users also emerged within this topic.

Within topic C, the overwhelming proportion of UGC was made up of brief greetings. In posts where these greetings were expanded to include personal updates and peer check-ins, 2 intratopic themes predominated. The first comprised substance use, misuse, and recovery, which included disclosures of relapse among Thrive With Me users. The second, an emergent theme of personal triumph, covered accomplishments such as new physical fitness regimens, career successes, and the attainment of treatment goals such as stable CD4 counts.

Strongly positive posts were characterized by gratitude, typically in response to peer-to-peer encouragements and affirmations occurring on the forum. Strongly negative posts were richer and more thematically heterogeneous. Many of these posts were reactions to linked external news media, which overwhelmingly provoked anger. These media often covered acts of homonegativity and racism. Another (–)Neg theme encompassed the political climate in the United States during the period of the Thrive With Me trial, when the 2016 presidential election was decided. The final intravalence theme concerned mental health, typically through expressions of acute or ongoing struggles with depression, stress, and insomnia.

Principal Findings

We combined common NLP techniques with traditional latent thematic analysis to classify UGC drawn from an interactive HIV mHealth environment. Through multiple iterations of LDA modeling, stable topics emerged: the day-to-day concerns of living with HIV; the social, romantic, and sexual tolls of aging with HIV as a sexual minority MSM; and routine greetings and daily affirmations. Using a 90th percentile cutoff, we condensed the UGC of which these topics were composed from a total of 4912 posts to a rich, illustrative subset of 1276 posts. By further analyzing this condensed UGC as a set of meta-themes, we identified latent discourses within them, through which experiential design insights could be mined.

Our work contributes to the diverse, cross-disciplinary literature exploring sequential NLP-qualitative methods [49,51-55,76], while responding to the call of Britt and colleagues [56] to explore the possibilities of integrated data mining and narrative analyses in mHealth. By sequentially combining NLP and qualitative techniques, our work resembles recent analyses that demonstrated the ability of consecutive NLP-qualitative methods to create machine-generated meta-themes from web-based forum and text message data and, in turn, preserve narrative and context through qualitative coding [51-55]. In contrast to these analyses, we used UGC derived from an interactive mHealth environment, focusing on user-centered product adaptation as a potential application. In emphasizing design applications, our work resembles that of Petersen et al [55], who applied similar NLP techniques to interviews of prospective users of an exercise-promoting wearable technology, capturing improvements in sentiment and usability at 0-, 5-, and 10-week intervals. Unlike our own, this analysis [55] fulfilled the iterative criteria of a user-centered design cycle [22,26,27], forgoing the more labor-intensive aspects of qualitative analysis [70], while demonstrating its NLP-aided, user-centered approach.

To that end, the results reported here offer partial fulfillment of Aim 1. Although the sequential methods we demonstrated did successfully characterize the prevailing themes of the peer forum, the future viability of these methods will depend on their routinization. Our procedures included a number of transformations and cross-platform migrations, each of which introduces friction, which in turn disincentivizes adoption [77]. Routine NLP-enabled mHealth monitoring would, instead, require integrated text analytics [78,79] and graphical user interfaces to ensure accessibility for investigators without coding expertise [56]. Such “no-code” (a common industry term) solutions could aid in bridging the knowledge-translation gap through evidence synthesis and translation, a lasting challenge in implementation science [80], as well as in clinically integrating mHealth interventions [81]. Alternatively, although our method demonstrates the value of maintaining human interpretability of NLP outputs, the very thematic codes we developed inductively might lend themselves, in the future, to repurposing as target labels in training HIV-domain data sets for supervised deep-learning applications [82]. The inherent potential of such “both-and” approaches remains to be explored.
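As one illustration of how inductively developed codes might be repurposed as target labels, the sketch below converts coded posts into multi-hot label vectors, a common input format for multi-label supervised classifiers. It is a hypothetical example only: the function `build_label_matrix`, the placeholder post texts, and the tuple layout are assumptions, and the code names are borrowed from the themes reported above purely for readability.

```python
def build_label_matrix(coded_posts):
    """Map each post's thematic codes to a multi-hot vector.

    coded_posts is a list of (text, codes) tuples, where codes is a list of
    code names applied by human coders. Returns the sorted code vocabulary
    and a list of (text, vector) rows aligned with that vocabulary.
    """
    vocab = sorted({code for _, codes in coded_posts for code in codes})
    index = {code: i for i, code in enumerate(vocab)}
    rows = []
    for text, codes in coded_posts:
        vector = [0] * len(vocab)
        for code in codes:
            vector[index[code]] = 1  # a post may carry several codes
        rows.append((text, vector))
    return vocab, rows


# Placeholder texts (invented, not forum excerpts) with illustrative codes.
example = [
    ("placeholder post 1", ["partnering challenges", "disclosing serostatus"]),
    ("placeholder post 2", ["survivorship"]),
    ("placeholder post 3", []),  # uncoded posts become all-zero vectors
]
vocab, rows = build_label_matrix(example)
for text, vector in rows:
    print(text, vector)
```

Keeping the vocabulary sorted makes the column order reproducible across coding rounds, which matters if labels from successive rounds are later merged into one training set.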

As for Aim 2, a range of actionable design insights surfaced from these findings to guide future iterations of Thrive With Me specifically and HIV mHealth generally. An HCD approach typically reframes these insights as “how might we” (HMW) prompts, a reframing we embrace here [24,26]. First, the seropositive MSM end users of Thrive With Me who engaged in the peer-support forum typically did so transparently and intimately, tapping their peers for encouragement, collaboratively navigating difficult subjects. These instances are most evident throughout topics A and B, specifically within the ART-related, “survivorship,” and “partnering challenges” themes and in the peer-to-peer affirmations surfaced via the (+)Pos UGC. Nevertheless, the forum was also, more problematically, a platform to express outrage at external news media. These media often recounted instances of homonegative violence and discrimination. These issues were, of course, clearly relevant to Thrive With Me users, as “reacting to media” codes, emergent within the (–)Neg condensed UGC (exclusively), occurred at twice the frequency of any other, with the exception of “partnering challenges” within topic B. However, their intrusive nature and negativity may have dampened the overall emotional tenor of the forum. These appeals to outrage may have discouraged newly enrolled or “lurking” users from interacting with the forum or disproportionately consumed their attention. In either scenario, the intended benefits of the social support provided by the forum may have been undermined. As such, HMW 1 is “How might future iterations of Thrive With Me acknowledge the anger evoked by an oppressive society without compromising the supportive aims of the peer forum?” Active content moderation, dedicated channels for current events, or even an embargo on outbound links might accomplish such an aim; however, these solutions would require prototyping and prospective end-user feedback in an HCD cycle [24].

Another topic, with several related intratopic themes, concerned relationship difficulties. In addition to the abovementioned “partnering challenges” theme emergent within topic B, unmet relationship needs were evident throughout the “trust and betrayal” and social isolation–focused “voids in my life” themes. Thus, HMW 2 is “How might we support the interpersonal needs of seropositive MSM without imposing model drift into an ART adherence intervention?” The latent need is evident, and the deliberations of end users often touched on cross-cutting topic A and (–)Neg themes; the richest instances of intertopic cross-codings are shown among the “disclosing serostatus” (topic A), “partnering challenges” (topic B), and “substance use and misuse” (topic C) themes, illustrating the entanglement of these issues in Thrive With Me users’ lives. Dedicated informational modules might address these needs more directly, tying decision-making within this domain to specific triggers for illegal drug use or missed ART doses in a manner consistent with the IMB model in which Thrive With Me is grounded [60,61].
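Cross-codings of the kind described above can be tallied programmatically once each post's codes are exported from the coding software. The following is a minimal sketch under assumptions: the function name `code_cooccurrence` and the list-of-lists input are hypothetical, and the example codes are arranged for illustration rather than taken from the study's meta-matrix.

```python
from collections import Counter
from itertools import combinations


def code_cooccurrence(coded_posts):
    """Count how often each unordered pair of codes is applied to one post."""
    pairs = Counter()
    for codes in coded_posts:
        # Deduplicate and sort so every pair has a single canonical ordering.
        for first, second in combinations(sorted(set(codes)), 2):
            pairs[(first, second)] += 1
    return pairs


# Illustrative code assignments only (not study data).
example = [
    ["disclosing serostatus", "partnering challenges"],
    ["partnering challenges", "disclosing serostatus", "substance use and misuse"],
    ["survivorship"],
]
for pair, count in code_cooccurrence(example).most_common():
    print(pair, count)
```

Counting unordered pairs per post, rather than per excerpt, is a design choice; a finer-grained tally would need excerpt-level exports.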

Finally, a desire to narrativize the personal triumphs of HIV survivorship is often evident across topics A and C, particularly within the “survivorship” and (in vivo) “other days I move mountains” codes. These narratives, which cover grief, coming out, and the lessons imparted by long-term survival, surface as an organically occurring form of UGC, pointing to their importance to Thrive With Me users, perhaps as validations of their personal resilience. Such strength-based, person-centered affirmations may hold the potential to constructively reauthor Thrive With Me users’ experiences of societal oppressions, while finding resonances within each other’s stories [83,84]. If implemented carefully, such reframing may redirect the negativity discussed in HMW 1 without invalidating the stressors that drive it, while simultaneously encouraging engagement with the peer forum. An appropriate HMW 3 is “How might we activate the potential of personal narratives toward the well-being of MSM living with HIV?” Asynchronous health recovery narratives, even those scraped from UGC on the open web, can enhance behavior-change self-efficacy and the likelihood of cancer screening [85,86]. The curation of these narratives in a dedicated portal, akin to innovations in digital psychiatry such as the NEON (Narrative Experiences Online) intervention [87], might represent an adaptable periphery of next-generation HIV mHealth [88].


These findings are subject to a range of limitations. As a proof-of-concept analysis, our methods are exploratory. Moreover, the abovementioned migrations and transformations built into our methods allow for the introduction of human error, while limiting the speed with which results can be generated. In contrast, the adoption of a single alternative developer environment such as R (R Foundation for Statistical Computing), which permits qualitative analyses via the R qualitative data analysis package [89], would enhance efficiency considerably. We were also limited by our inability to member-check our LDA modeling and thematic coding schema with Thrive With Me users themselves, which would have bolstered transactional validity and laid the groundwork for a true HCD process, incorporating iterative prototyping, design sprints, and feedback elicited from the user base whose needs we attempt to fulfill. The use of domain-expert raters employed by Nikolenko and colleagues [49] to assure the coherence and human interpretability of LDA outputs offers a template for such a member-checking approach. A de facto tension exists between HCD, which is nimble, creativity-driven, and interactive, and UGC analysis, which is static and typically archival. Innovative solutions, such as real-time syndromic surveillance on social media [90,91], point toward the possibilities of resolving this tension and toward potential innovations in interactive mHealth. Finally, through a design justice lens [31], we recognize that the approach we describe leverages analytic advancements undertaken in English, using English-language corpora, within an intervention context that requires users to receive information and interact in English [60]. Although the need for multilingual NLP is recognized within the field, progress remains limited [92]. Monolingual approaches toward capturing user-experience insights will, of course, remain narrow in scope amid the vast diversity of human speech.


mHealth interventions that fulfill the needs of multiply marginalized MSM living with HIV must accommodate a diverse array of needs and experiences. The findings of this proof-of-concept analysis suggest that combined machine- and human-optimized techniques can capture actionable insights on these needs and experiences without adding to the burdens of prospective end users. By maintaining an empathic lens and focusing on refinements in method, techniques such as those demonstrated here can contribute to future innovations in HIV mHealth.


We thank the participants for their time and effort during the study, and for the richly realized insights and personal narratives that they contributed to the Thrive With Me forum. Further, we thank AvaGrace Palazzolo for her service as a human rater assessing the interpretability of pilot latent Dirichlet allocation outputs. SJS is supported in part by a Garvin Shands Saunders Foundation scholarship. This work was supported by a grant from the National Institute on Drug Abuse (grant R01DA039950).

Authors' Contributions

SJS and SSJ designed and executed the analysis, with SJS leading the condensation, thematic analyses, and human-centered design (HCD) interpretations, and SSJ leading the preprocessing, topic modeling (TM), and sentiment analyses. CMC served as consensus coder in all rounds of thematic analysis and contributed toward the initial drafting of the manuscript. KJH designed the Thrive With Me intervention, led the development and parent trial from which these secondary analyses originated, and supervised all aspects of the work described herein. SJS wrote the initial draft of the manuscript, with SSJ, CMC, and KJH contributing to its refinement.

Conflicts of Interest

SJS is a paid advisor to Waverider, which builds customizable dialectical behavior therapy eHealth tools. SSJ, CMC, and KJH declare no conflicts of interest.

Multimedia Appendix 1

Text preprocessing, latent Dirichlet allocation topic modeling, and VADER (Valence Aware Dictionary for sEntiment Reasoner) sentiment analysis replication scripts in Python.

TXT File, 13 KB

Multimedia Appendix 2

Tokens detected per user, per topic (Model 2).

DOCX File, 49 KB

Multimedia Appendix 3

Human-detected intratopic (Model 2) and intravalence themes with definitions and illustrative examples.

DOCX File, 29 KB

  1. Vella S, Schwartländer B, Sow SP, Eholie SP, Murphy RL. The history of antiretroviral therapy and of its implementation in resource-limited areas of the world. AIDS 2012 Jun;26(10):1231-1241. [CrossRef] [Medline]
  2. HIV-CAUSAL Collaboration, Ray M, Logan R, Sterne JAC, Hernández-Díaz S, Robins JM, et al. The effect of combined antiretroviral therapy on the overall mortality of HIV-infected individuals. AIDS 2010 Jan;24(1):123-137 [FREE Full text] [CrossRef] [Medline]
  3. Samji H, Cescon A, Hogg RS, Modur SP, Althoff KN, Buchacz K, North American AIDS Cohort Collaboration on Research and Design (NA-ACCORD) of IeDEA. Closing the gap: increases in life expectancy among treated HIV-positive individuals in the United States and Canada. PLoS One 2013 Dec;8(12):e81355 [FREE Full text] [CrossRef] [Medline]
  4. Eisinger RW, Dieffenbach CW, Fauci AS. HIV viral load and transmissibility of HIV infection: undetectable equals untransmittable. JAMA 2019 Feb;321(5):451-452. [CrossRef] [Medline]
  5. Evidence of HIV treatment and viral suppression in preventing the sexual transmission of HIV. 2020.   URL: [accessed 2021-08-03]
  6. Centers for Disease Control and Prevention. Selected national HIV prevention and care outcomes in the United States. 2019.   URL: [accessed 2021-08-03]
  7. Altice F, Evuarherhe O, Shina S, Carter G, Beaubrun AC. Adherence to HIV treatment regimens: systematic literature review and meta-analysis. Patient Prefer Adherence 2019 Apr;13:475-490 [FREE Full text] [CrossRef] [Medline]
  8. Eberhart MG, Yehia BR, Hillier A, Voytek CD, Blank MB, Frank I, et al. Behind the cascade: analyzing spatial patterns along the HIV care continuum. J Acquir Immune Defic Syndr 2013 Nov;64(Suppl 1):S42-S51 [FREE Full text] [CrossRef] [Medline]
  9. Eberhart MG, Yehia BR, Hillier A, Voytek CD, Fiore DJ, Blank M, et al. Individual and community factors associated with geographic clusters of poor HIV care retention and poor viral suppression. J Acquir Immune Defic Syndr 2015 May;69(Suppl 1):S37-S43 [FREE Full text] [CrossRef] [Medline]
  10. Goswami ND, Schmitz MM, Sanchez T, Dasgupta S, Sullivan P, Cooper H, et al. Understanding local spatial variation along the care continuum: the potential impact of transportation vulnerability on HIV linkage to care and viral suppression in high-poverty areas, Atlanta, Georgia. J Acquir Immune Defic Syndr 2016 May;72(1):65-72 [FREE Full text] [CrossRef] [Medline]
  11. Tieu H, Koblin BA, Latkin C, Curriero FC, Greene ER, Rundle A, et al. Neighborhood and network characteristics and the HIV care continuum among gay, bisexual, and other men who have sex with men. J Urban Health 2020 Oct;97(5):592-608 [FREE Full text] [CrossRef] [Medline]
  12. Quinn KG, Voisin DR. ART adherence among men who have sex with men living with HIV: key challenges and opportunities. Curr HIV/AIDS Rep 2020 Aug;17(4):290-300 [FREE Full text] [CrossRef] [Medline]
  13. Daher J, Vijh R, Linthwaite B, Dave S, Kim J, Dheda K, et al. Do digital innovations for HIV and sexually transmitted infections work? Results from a systematic review (1996-2017). BMJ Open 2017 Nov;7(11):e017604 [FREE Full text] [CrossRef] [Medline]
  14. Cooper V, Clatworthy J, Whetham J, EmERGE Consortium. mHealth Interventions To Support Self-Management In HIV: A Systematic Review. Open AIDS J 2017;11:119-132 [FREE Full text] [CrossRef] [Medline]
  15. Nelson KM, Perry NS, Horvath KJ, Smith LR. A systematic review of mHealth interventions for HIV prevention and treatment among gay, bisexual, and other men who have sex with men. Transl Behav Med 2020 Oct;10(5):1211-1220 [FREE Full text] [CrossRef] [Medline]
  16. Catalani C, Philbrick W, Fraser H, Mechael P, Israelski DM. Open AIDS J 2013 Aug;7:17-41 [FREE Full text] [CrossRef] [Medline]
  17. Lee SB, Valerius J. mHealth interventions to promote anti-retroviral adherence in HIV: narrative review. JMIR Mhealth Uhealth 2020 Aug;8(8):e14739 [FREE Full text] [CrossRef] [Medline]
  18. Herbst JH, Mansergh G, Pitts N, Denson D, Mimiaga MJ, Holman J. Effects of brief messages about antiretroviral therapy and condom use benefits among Black and Latino MSM in three U.S. cities. J Homosex 2018;65(2):154-166. [CrossRef] [Medline]
  19. Horvath KJ, Lammert S, MacLehose RF, Danh T, Baker JV, Carrico AW. A pilot study of a mobile app to support HIV antiretroviral therapy adherence among men who have sex with men who use stimulants. AIDS Behav 2019 Nov;23(11):3184-3198. [CrossRef] [Medline]
  20. Aunon FM, Okada E, Wanje G, Masese L, Odeny TA, Kinuthia J, et al. Iterative development of an mHealth intervention to support antiretroviral therapy initiation and adherence among female sex workers in Mombasa, Kenya. J Assoc Nurses AIDS Care 2020 Mar;31(2):145-156 [FREE Full text] [CrossRef] [Medline]
  21. Flickinger TE, Sherbuk JE, Petros de Guex K, Añazco Villarreal D, Hilgart M, McManus KA, et al. Adapting an m-Health intervention for Spanish-speaking Latinx people living with HIV in the nonurban southern United States. Telemed Rep 2021 Feb;2(1):46-55 [FREE Full text] [CrossRef] [Medline]
  22. Marent B, Henwood F, Darking M, EmERGE Consortium. Development of an mHealth platform for HIV care: gathering user perspectives through co-design workshops and interviews. JMIR Mhealth Uhealth 2018 Oct;6(10):e184 [FREE Full text] [CrossRef] [Medline]
  23. Rosen R, Ranney M, Boyer E. Formative research for mHealth HIV adherence: the iHAART app. In: Proceedings of the 48th Hawaii International Conference on System Sciences. 2015 Presented at: 48th Hawaii International Conference on System Sciences; January 5-8, 2015; Kauai, Hawaii   URL: [CrossRef]
  24. IDEO (Firm). The Field Guide to Human-Centered Design. New York, United States: IDEO; 2015.
  25. Design thinking bootleg. 2021.   URL: [accessed 2021-08-03]
  26. Beres LK, Simbeza S, Holmes CB, Mwamba C, Mukamba N, Sharma A, et al. Human-centered design lessons for implementation science: improving the implementation of a patient-centered care intervention. J Acquir Immune Defic Syndr 2019 Dec;82(Suppl 3):S230-S243 [FREE Full text] [CrossRef] [Medline]
  27. Farao J, Malila B, Conrad N, Mutsvangwa T, Rangaka MX, Douglas TS. A user-centred design framework for mHealth. PLoS One 2020 Aug;15(8):e0237910 [FREE Full text] [CrossRef] [Medline]
  28. Schnall R, Mosley JP, Iribarren SJ, Bakken S, Carballo-Diéguez A, Brown W III. Comparison of a user-centered design, self-management app to existing mHealth apps for persons living with HIV. JMIR Mhealth Uhealth 2015 Sep;3(3):e91 [FREE Full text] [CrossRef] [Medline]
  29. Schnall R, Rojas M, Bakken S, Brown W, Carballo-Dieguez A, Carry M, et al. A user-centered model for designing consumer mobile health (mHealth) applications (apps). J Biomed Inform 2016 Apr;60:243-251. [CrossRef] [Medline]
  30. Holeman I, Kane D. Human-centered design for global health equity. Inf Technol Dev 2019 Sep;26(3):477-505 [FREE Full text] [CrossRef] [Medline]
  31. Costanza-Chock S. Design Justice: Community-Led Practices to Build the Worlds We Need. Boston, United States: The MIT Press; 2019.
  32. Fan M, Shi S, Truong K. Practices and challenges of using think-aloud protocols in industry: an international study. J Usability Stud 2020 Feb;15(2):85-102. [CrossRef]
  33. Marent B, Henwood F, Darking M, EmERGE Consortium. Ambivalence in digital health: co-designing an mHealth platform for HIV care. Soc Sci Med 2018 Oct;215:133-141. [CrossRef] [Medline]
  34. Brear M. Process and outcomes of a recursive, dialogic member checking approach: a project ethnography. Qual Health Res 2019 Jun;29(7):944-957. [CrossRef] [Medline]
  35. Sari E, Tedjasaputra A. Designing valuable products with design sprint. In: Proceedings of the 16th IFIP Conference on Human-Computer Interaction (INTERACT). Switzerland: Springer, Cham; 2017 Sep Presented at: 16th IFIP Conference on Human-Computer Interaction (INTERACT); September 25-29, 2017; Mumbai, India p. 391-394   URL: [CrossRef]
  36. Alqahtani F, Orji R. Insights from user reviews to improve mental health apps. Health Informatics J 2020 Sep;26(3):2042-2066 [FREE Full text] [CrossRef] [Medline]
  37. Baur AW. Harnessing the social web to enhance insights into people’s opinions in business, government and public administration. Inf Syst Front 2016 Jul;19(2):231-251. [CrossRef]
  38. Camacho-Otero J, Boks C, Pettersen IN. User acceptance and adoption of circular offerings in the fashion sector: insights from user-generated online reviews. J Clean Prod 2019 Sep;231:928-939. [CrossRef]
  39. Saura JR, Reyes-Menendez A, Thomas SB. Gaining a deeper understanding of nutrition using social networks and user-generated content. Internet Interv 2020 Apr;20:100312 [FREE Full text] [CrossRef] [Medline]
  40. Timoshenko A, Hauser JR. Identifying customer needs from user-generated content. Mark Sci 2019 Jan;38(1):1-20. [CrossRef]
  41. Tirunillai S, Tellis GJ. Mining marketing meaning from online chatter: strategic brand analysis of big data using latent Dirichlet allocation. J Mark Res 2014 Aug;51(4):463-479. [CrossRef]
  42. Maddox TM, Matheny MA. Natural language processing and the promise of big data: small step forward, but many miles to go. Circ Cardiovasc Qual Outcomes 2015 Sep;8(5):463-465. [CrossRef] [Medline]
  43. Bird S, Klein E, Loper E. Natural Language Processing With Python: Analyzing Text With the Natural Language Toolkit. Sebastopol, CA, United States: O'Reilly Media; 2009.
  44. Batrinca B, Treleaven PC. Social media analytics: a survey of techniques, tools and platforms. AI & Soc 2014 Jul;30(1):89-116. [CrossRef]
  45. Gonzalez-Hernandez G, Sarker A, O'Connor K, Savova G. Capturing the patient's perspective: a review of advances in natural language processing of health-related text. Yearb Med Inform 2017 Aug;26(1):214-227 [FREE Full text] [CrossRef] [Medline]
  46. Blei D, Ng A, Jordan M. Latent Dirichlet allocation. J Mach Learn Res 2003:993-1022 [FREE Full text]
  47. Reyes-Menendez A, Saura JR, Alvarez-Alonso C. Understanding #WorldEnvironmentDay user opinions in Twitter: a topic-based sentiment analysis approach. Int J Environ Res Public Health 2018 Nov;15(11):2537 [FREE Full text] [CrossRef] [Medline]
  48. Wen M, Yang D, Rosé C. Sentiment analysis in MOOC discussion forums: what does it tell us? In: Proceedings of the 7th International Conference on Educational Data Mining (EDM 2014). 2014 Presented at: 7th International Conference on Educational Data Mining (EDM 2014); July 4-7, 2014; London, United Kingdom   URL:
  49. Nikolenko SI, Koltcov S, Koltsova O. Topic modelling for qualitative studies. J Inf Sci 2016 Jul;43(1):88-102. [CrossRef]
  50. Ampofo L, Collister S, O'Loughlin B. Text mining and social media: when quantitative meets qualitative, and software meets humans. In: Halfpenny P, Procter R, editors. Innovations in Digital Research Methods. Thousand Oaks, CA, United States: SAGE; 2015:161-192.
  51. Leeson W, Resnick A, Alexander D, Rovers J. Natural language processing (NLP) in qualitative public health research: a proof of concept study. Int J Qual Methods 2019 Nov;18:160940691988702. [CrossRef]
  52. Guetterman TC, Chang T, DeJonckheere M, Basu T, Scruggs E, Vydiswaran VGV. Augmenting qualitative text analysis with natural language processing: methodological study. J Med Internet Res 2018 Jun;20(6):e231 [FREE Full text] [CrossRef] [Medline]
  53. Jones J, Pradhan M, Hosseini M, Kulanthaivel A, Hosseini M. Novel approach to cluster patient-generated data into actionable topics: case study of a web-based breast cancer forum. JMIR Med Inform 2018 Nov;6(4):e45 [FREE Full text] [CrossRef] [Medline]
  54. Timimi F, Ray S, Jones E, Aase L, Hoffman K. Patient-reported outcomes in online communications on statins, memory, and cognition: qualitative analysis using online communities. J Med Internet Res 2019 Nov;21(11):e14809 [FREE Full text] [CrossRef] [Medline]
  55. Petersen CL, Halter R, Kotz D, Loeb L, Cook S, Pidgeon D, et al. Using natural language processing and sentiment analysis to augment traditional user-centered design: development and usability study. JMIR Mhealth Uhealth 2020 Aug;8(8):e16862 [FREE Full text] [CrossRef] [Medline]
  56. Britt R, Maddox J, Kanthawala S, Hayes JL. The impact of mHealth interventions: improving health outcomes through narratives, mixed methods, and data mining strategies. In: Kim J, Song H, editors. Technology and Health: Promoting Attitude and Behavior Change. Cambridge, MA, United States: Academic Press; 2020:271-288.
  57. Bhattacharyya O, Mossman K, Gustafsson L, Schneider EC. Using human-centered design to build a digital health advisor for patients with complex needs: persona and prototype development. J Med Internet Res 2019 May;21(5):e10318. [CrossRef] [Medline]
  58. Patel D, Sarlati S, Martin-Tuite P, Feler J, Chehab L, Texada M, et al. Designing an information and communications technology tool with and for victims of violence and their case managers in San Francisco: human-centered design study. JMIR Mhealth Uhealth 2020 Aug;8(8):e15866 [FREE Full text] [CrossRef] [Medline]
  59. Horvath KJ, Oakes JM, Rosser BRS, Danilenko G, Vezina H, Amico KR, et al. Feasibility, acceptability and preliminary efficacy of an online peer-to-peer social support ART adherence intervention. AIDS Behav 2013 Jul;17(6):2031-2044 [FREE Full text] [CrossRef] [Medline]
  60. Horvath KJ, Amico KR, Erickson D, Ecklund AM, Martinka A, DeWitt J, et al. Thrive With Me: protocol for a randomized controlled trial to test a peer support intervention to improve antiretroviral therapy adherence among men who have sex with men. JMIR Res Protoc 2018 May;7(5):e10182 [FREE Full text] [CrossRef] [Medline]
  61. Amico KR, Toro-Alfonso J, Fisher JD. An empirical test of the information, motivation and behavioral skills model of antiretroviral therapy adherence. AIDS Care 2005 Aug;17(6):661-673. [CrossRef] [Medline]
  62. Schofield A, Magnusson M, Thompson L, Mimno D. Understanding text pre-processing for latent Dirichlet allocation. In: Proceedings of the 1st Workshop for Women and Underrepresented Minorities in Natural Language Processing. 2017 Presented at: The 1st Workshop for Women and Underrepresented Minorities in Natural Language Processing; July 30, 2017; Vancouver, Canada   URL:
  63. Schofield A, Magnusson M, Mimno D. Pulling out the stops: rethinking stopword removal for topic models. In: Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics. 2017 Apr Presented at: 15th Conference of the European Chapter of the Association for Computational Linguistics; April 2017; Valencia, Spain p. 432-436   URL: [CrossRef]
  64. Hutto C, Gilbert E. VADER: a parsimonious rule-based model for sentiment analysis of social media text. In: Proceedings of the 8th International AAAI Conference on Weblogs and Social Media. 2014 Presented at: 8th International AAAI Conference on Weblogs and Social Media; June 1-4, 2014; Ann Arbor, MI, United States   URL:
  65. Scikit-learn v. 0.24.2, machine learning in Python. Scikit Learn.   URL: [accessed 2021-08-02]
  66. Andrzejewski D, Zhu X, Craven M. Incorporating domain knowledge into topic modeling via Dirichlet forest priors. Proc Int Conf Mach Learn 2009 Jun;382(26):25-32 [FREE Full text] [CrossRef] [Medline]
  67. Sievert C, Shirley K. LDAvis: a method for visualizing and interpreting topics. In: Proceedings of the Workshop on Interactive Language Learning, Visualization, and Interfaces. 2014 Presented at: Workshop on Interactive Language Learning, Visualization, and Interfaces; June 27, 2014; Baltimore, MD, United States   URL: [CrossRef]
  68. Mabey B. pyLDAvis. 3.3.1 release. GitHub. 2021.   URL: [accessed 2021-08-03]
  69. vaderSentiment v. 3.3.1. Python Package Index.   URL: [accessed 2020-06-01]
  70. Miles M, Huberman A, Saldaña J. Qualitative Data Analysis: A Methods Sourcebook. Thousand Oaks, CA, United States: SAGE; 2014.
  71. Excel for Microsoft 365 v. 16.0.14228.20216 (2017). 2017.   URL: [accessed 2020-06-01]
  72. Braun V, Clarke V. Using thematic analysis in psychology. Qual Res Psychol 2006 Jan;3(2):77-101. [CrossRef]
  73. Watkins DC. Rapid and rigorous qualitative data analysis. Int J Qual Methods 2017 Jun;16(1):160940691771213. [CrossRef]
  74. SocioCultural Research Consultants, LLC. Dedoose v. 8.0.35, web application for managing, analyzing, and presenting qualitative and mixed method research data. Dedoose. Los Angeles, CA, United States; 2018.   URL: [accessed 2020-06-01]
  75. Slomka J, Lim J, Gripshover B, Daly B. How have long-term survivors coped with living with HIV? J Assoc Nurses AIDS Care 2013 Sep;24(5):449-459 [FREE Full text] [CrossRef] [Medline]
  76. Yu C, Jannasch-Pennell A, DiGangi S. Compatibility between text mining and qualitative research in the perspectives of grounded theory, content analysis, and reliability. TQR 2014 Oct;16(3):730-744. [CrossRef]
  77. Ash J, Anderson B, Gordon R, Langley P. Digital interface design and power: friction, threshold, transition. Environ Plan D 2018 Apr;36(6):1136-1153. [CrossRef]
  78. Using the natural language healthcare API. Google Cloud Healthcare API. 2021.   URL: [accessed 2021-08-05]
  79. Text analytics. Microsoft Azure. 2021.   URL: [accessed 2021-08-05]
  80. Michie S, Thomas J, Johnston M, Aonghusa PM, Shawe-Taylor J, Kelly MP, et al. The Human Behaviour-Change Project: harnessing the power of artificial intelligence and machine learning for evidence synthesis and interpretation. Implement Sci 2017 Oct;12(1):121 [FREE Full text] [CrossRef] [Medline]
  81. U.S. Department of Veterans Affairs. VA Mobile Health Practice Guide. 2021.   URL: [accessed 2022-06-01]
  82. Kelleher J. Deep Learning. Boston, MA, United States: The MIT Press; 2019.
  83. Zeligman M, Barden SM. A narrative approach to supporting clients living with HIV. J Constr Psychol 2014 Nov;28(1):67-82. [CrossRef]
  84. Ware C. ‘Things you can’t talk about’: engaging with HIV-positive gay men’s survivor narratives. Oral Hist 2018;46(2):33-40 [FREE Full text]
  85. Manuvinakurike R, Velicer WF, Bickmore TW. Automated indexing of Internet stories for health behavior change: weight loss attitude pilot study. J Med Internet Res 2014 Dec;16(12):e285.
  86. Larkey LK, McClain D, Roe DJ, Hector RD, Lopez AM, Sillanpaa B, et al. Randomized controlled trial of storytelling compared to a personal risk tool intervention on colorectal cancer screening in low-income patients. Am J Health Promot 2015 Nov;30(2):e59-e70.
  87. Slade M, Rennick-Egglestone S, Llewellyn-Beardsley J, Yeo C, Roe J, Bailey S, et al. Recorded mental health recovery narratives as a resource for people affected by mental health problems: development of the Narrative Experiences Online (NEON) intervention. JMIR Form Res 2021 May;5(5):e24417.
  88. Damschroder LJ, Aron DC, Keith RE, Kirsh SR, Alexander JA, Lowery JC. Fostering implementation of health services research findings into practice: a consolidated framework for advancing implementation science. Implement Sci 2009 Aug;4:50.
  89. What is RQDA and what are its features? RQDA. 2021.   URL: [accessed 2021-08-05]
  90. Șerban O, Thapen N, Maginnis B, Hankin C, Foot V. Real-time processing of social media with SENTINEL: a syndromic surveillance system incorporating deep learning for health classification. Inf Process Manage 2019 May;56(3):1166-1184.
  91. Fung IC, Tse ZTH, Fu K. The use of social media in public health surveillance. Western Pac Surveill Response J 2015 Jun;6(2):3-6.
  92. Wali E, Chen Y, Mahoney C, Middleton T, Babaeianjelodar M, Nije M, et al. Is machine learning speaking my language? A critical look at the NLP-pipeline across 8 human languages. ArXiv Preprint posted online on July 11, 2020.

ART: antiretroviral therapy
HCD: human-centered design
HMW: how might we
IMB: Information-Motivation-Behavioral skills
LDA: latent Dirichlet allocation
mHealth: mobile health
MSM: men who have sex with men
(–)Neg: negatively valenced
NLP: natural language processing
(+)Pos: positively valenced
SA: sentiment analysis
TM: topic modeling
UGC: user-generated content
VADER: Valence Aware Dictionary and sEntiment Reasoner

Edited by A Kushniruk; submitted 16.02.22; peer-reviewed by A Sharma, J Pry; comments to author 23.05.22; revised version received 13.06.22; accepted 13.06.22; published 21.07.22


©Simone J Skeen, Stephen Scott Jones, Carolyn Marie Cruse, Keith J Horvath. Originally published in JMIR Human Factors, 21.07.2022.

This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Human Factors, is properly cited. The complete bibliographic information, a link to the original publication, as well as this copyright and license information must be included.