This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Human Factors, is properly cited. The complete bibliographic information, a link to the original publication on https://humanfactors.jmir.org, as well as this copyright and license information must be included.
Chatbots are computer programs that present a conversation-like interface through which people can access information and services. The COVID-19 pandemic has driven a substantial increase in the use of chatbots to support and complement traditional health care systems. However, despite this increased uptake, evidence to support the development and deployment of chatbots in public health remains limited. Recent reviews have focused on the use of chatbots during the COVID-19 pandemic and the use of conversational agents in health care more generally. This paper complements this research and addresses a gap in the literature by assessing the breadth and scope of research evidence for the use of chatbots across the domain of public health.
This scoping review had 3 main objectives: (1) to identify the application domains in public health in which there is the most evidence for the development and use of chatbots; (2) to identify the types of chatbots that are being deployed in these domains; and (3) to ascertain the methods and methodologies by which chatbots are being evaluated in public health applications. This paper explored the implications for future research on the development and deployment of chatbots in public health in light of the analysis of the evidence for their use.
Following the PRISMA-ScR (Preferred Reporting Items for Systematic Reviews and Meta-Analyses Extension for Scoping Reviews) guidelines for scoping reviews, relevant studies were identified through searches conducted in the MEDLINE, PubMed, Scopus, Cochrane Central Register of Controlled Trials, IEEE Xplore, ACM Digital Library, and Open Grey databases from mid-June to August 2021. Studies were included if they used or evaluated chatbots for the purpose of prevention or intervention and for which the evidence showed a demonstrable health impact.
Of the 1506 studies identified, 32 were included in the review. The results show a substantial increase in interest in chatbots over the past few years, beginning shortly before the pandemic. Half (16/32, 50%) of the research evaluated chatbots applied to mental health or COVID-19. The studies suggest promise in the application of chatbots, especially to easily automated and repetitive tasks, but overall, the evidence for the efficacy of chatbots for prevention and intervention across all domains is limited at present.
More research is needed to fully understand the effectiveness of using chatbots in public health. Concerns with the clinical, legal, and ethical aspects of the use of chatbots for health care are well founded given the speed with which they have been adopted in practice. Future research on their use should address these concerns through the development of expertise and best practices specific to public health, including a greater focus on user experience.
Sundar Pichai, the chief executive officer of Google, expressed in a recent interview his view that artificial intelligence (AI) will have a more profound impact on humanity than the advent of fire, the internet, or electricity [
Chatbots—software programs designed to interact in human-like conversation—are being applied increasingly to many aspects of our daily lives. Made to mimic natural language conversations to facilitate interaction between humans and computers, they are also referred to as “conversational agents,” “dialog assistants,” or “intelligent virtual assistants,” and they can support speech and text conversation. Notable early chatbots include ELIZA (1966; “a mock Rogerian psychotherapist”), PARRY (1972; a chatbot simulating a person with paranoid schizophrenia, developed by a psychiatrist in response to ELIZA), and ALICE (1995; a general conversational chatbot, inspired by ELIZA) [
The ongoing COVID-19 pandemic has further driven the rapid uptake and deployment of chatbots [
The use of AI for symptom checking and triage at scale has now become the norm throughout much of the world, signaling a move away from human-centered health care [
In light of the huge growth in the deployment of chatbots to support public health provision, there is a pressing need for research to help guide their strategic development and application [
What does the evidence tell us about the use of chatbots in public health?
In which fields of public health have chatbots been used the most frequently?
What are the types of chatbots that have been used in public health?
How have chatbots in public health been evaluated?
What are the potential lessons to be learned from the evidence for the use of chatbots in public health?
We carried out a scoping review of studies on the use of chatbots in public health. We followed the PRISMA-ScR (Preferred Reporting Items for Systematic Reviews and Meta-Analyses Extension for Scoping Reviews) guidelines and methodological framework by Arksey and O'Malley [
The use of AI and digital technologies and the roles in which they are deployed in health tend to blur the boundaries between population and clinical health—that is, chatbots that are used to service individual health needs are often equally relevant to population-level health in application. In this respect, the synthesis between population-based prevention and clinical care at an individual level [
Our selection methodology was as follows. One author screened the titles and abstracts of the studies identified through the database search, selecting the studies deemed to match the eligibility criteria. The second author then screened a random 50% of the same set of identified studies to validate the first author’s selection. The papers meeting the criteria for inclusion at the title- and abstract-screening stage were retrieved and reviewed independently by both authors, with discrepancies discussed and resolved to arrive at an agreed-upon list of included studies.
Our inclusion criteria covered studies that used or evaluated chatbots for the purpose of prevention or intervention and for which the evidence showed a demonstrable health impact. We included experimental studies in which chatbots were trialed and showed health impacts. We also included feasibility studies for agents that are being rolled out; randomized controlled trials (RCTs) informing the feasibility of conversational agents with obvious applicability for scalability and potential for population-level interventions; and comparative analyses of in-service chatbots. We chose not to distinguish between embodied conversational agents and text-based agents, including both of these modalities, as well as chatbots with cartoon-based interfaces.
We excluded thought experiments, design outlines and reflections on systems that have yet to be implemented, descriptions of proposed chatbots and conversational agents, prototypes of system architecture, surveys and predesign analyses, frameworks, commentaries, validation studies, technical papers that introduced agents explaining their architecture and design that have yet to be trialed, and papers exploring perceptions of digital agents or their acceptability or validity among users. We also excluded studies comparing the effect of differences in technical approaches (eg, messaging) and studies that used “Wizard of Oz” protocols—a protocol used to test users’ reactions in which a human responds to users through an interface in which they think they are interacting with a computer. The review selection process is shown in
Review selection process.
In total, 32 studies met the inclusion criteria. These studies included RCTs (n=12), user analytics studies (n=8), user experience studies (n=3), an experimental pilot (n=1), a descriptive study (n=1), comparative analyses (n=2), a case-control study (n=1), design processes (n=2), and feasibility studies (n=2). These studies were distributed across 11 application domains.
Mental health and COVID-19 dominated the application domains. This result is possibly an artifact of the maturity of the research that has been conducted in mental health on the use of chatbots and the massive surge in the use of chatbots to help combat COVID-19. The graph in
Distribution of included publications across application domains. Mental health research and COVID-19 form the majority of the studies. Due to the small numbers of papers, percentages must be interpreted with caution and only indicate the presence of research in the area rather than an accurate distribution of research.
The timeline for the studies, illustrated in
Similarly, one can see the rapid response to COVID-19 through the use of chatbots, reflecting both the practical requirements of using chatbots in triage and informational roles and the timeline of the pandemic.
Distribution of included publications across application domains and publication year. Mental health research has a continued interest over time, with COVID-19–related research showing strong recent interest as expected.
Studies that detailed any user-centered design methodology applied to the development of the chatbot were among the minority (3/32, 9%) [
One study that stands out is the work of Bonnevie and colleagues [
Two-thirds (21/32, 66%) of the chatbots in the included studies were built on custom-developed web platforms [
All the included studies tested textual input chatbots, where the user is asked to type to send a message (free-text input) or select a short phrase from a list (single-choice selection input). Only 4 studies included chatbots that responded in speech [
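To make this distinction concrete, the two text-input modalities can be sketched in a few lines of Python. This is a purely illustrative example; the function names and logic are ours and are not drawn from any chatbot in the included studies.

```python
# Illustrative sketch of the two text-input modalities: single-choice
# selection constrains the reply to a predefined option list, whereas
# free-text input accepts anything and defers interpretation to
# downstream natural language processing. Hypothetical example only.

def single_choice_turn(options, reply):
    """Return the reply if it matches a predefined option, else None."""
    return reply if reply in options else None

def free_text_turn(raw):
    """Accept any typed message; interpreting it is left to the NLP layer."""
    return raw.strip()
```

A single-choice design keeps user input predictable and easy to handle, whereas free-text entry shifts the interpretive burden onto the chatbot's language-understanding component, which may explain why some of the included chatbots combined both methods.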
The majority (28/32, 88%) of the studies contained very little description of the technical implementation of the chatbot, which made it difficult to classify the chatbots from this perspective. Most (19/32, 59%) of the included papers included screenshots of the user interface. However, some only provided sketches of the interface, and often, the text detailing chatbot capabilities was not congruent with the accompanying picture (eg, the chatbot was described as accepting free-text entry, but the screenshot showed a single-choice selection). In such cases, we marked the chatbot as using a combination of input methods (see
Surprisingly, there is no obvious correlation between application domains, chatbot purpose, and mode of communication (see
The presentation of the chatbot persona (see
Distribution of chatbot platforms in the included studies. PC: personal computer.
The ways in which users could message the chatbot were either by choosing from a set of predefined options or freely typing text as in a typical messaging app.
Presentation of the chatbot avatar.
As noted above, the included studies consisted of RCTs (n=12), user analytics studies (n=8), user experience studies (n=3), an experimental pilot (n=1), a descriptive study (n=1), comparative analyses (n=2), a case-control study (n=1), design processes (n=2), and feasibility studies (n=2).
For the RCTs, the number of participants ranged from 20 to 927, whereas user analytics studies considered data from between 129 and 36,070 users. Overall, the evidence found was positive, showing some beneficial effect, or mixed, showing little or no effect. Evidence was predominantly preliminary and context specific. Most (21/32, 66%) of the included studies established that the chatbots were usable, albeit with some differences in the user experience, and that they can provide some positive support across the different health domains.
Moderate positive results were found across several studies in regard to knowledge and skill improvement [
Studies on the use of chatbots for mental health, in particular anxiety and depression, also seem to show potential, with users reporting positive outcomes on at least some of the measurements taken [
Chatbots were found to have improved medical service provision by reducing screening times [
The evidence for the use of chatbots to support behavior change is mixed. One study found that any effect was limited to users who were already contemplating such change [
Mixed findings were also reported regarding adherence. One study found that there was no effect on adherence to a blood pressure–monitoring schedule [
Research on the use of chatbots in public health service provision is at an early stage. Although preliminary results do indicate positive effects in a number of application domains, reported findings are for the most part mixed. Moreover, varying user engagement with the chatbots (though not necessarily correlated with the effect [
The majority (26/32, 81%) of the studies used quantitative methods and methodologies for the evaluation of chatbot design and their impact in relation to health outcomes. For the most part, qualitative methods were used to examine the acceptability of chatbots to patients and their self-reported experience in using them alongside other quantitative usability metrics [
By far the most prevalent means of assessing health impacts of chatbot-led interventions were RCTs (n=12). Studies that focused on the effectiveness of chatbots with regard to an assigned task, such as triage and symptom checking, lent themselves more easily to evaluation through user analytics (n=8). There was, however, limited evaluation of user experience (n=3), and chatbot development was rarely design-led—although there were notable exceptions, with 1 study identifying user principles in development [
No included studies reported direct observation (in the laboratory or in situ; eg, ethnography) or in-depth interviews as evaluation methods. Given the recognized need for observational study in chatbots deployed for public health [
Although research on the use of chatbots in public health is at an early stage, developments in technology and the exigencies of combatting COVID-19 have contributed to the huge upswing in their use, most notably in triage roles. Studies on the use of chatbots for mental health, in particular depression, also seem to show potential, with users reporting positive outcomes [
Notably, people seem more likely to share sensitive information in conversation with chatbots than with another person [
Although the COVID-19 pandemic has driven the use of chatbots in public health, of concern is the degree to which governments have accessed information under the rubric of security in the fight against the disease. For example, in South Korea, the implementation of integrated technological responses, including personalized communication chatbots and the use of personal data gathered for contact tracing [
The evidence cited in most of the included studies either measured the effect of the intervention or captured surface-level, self-reported user satisfaction. There was little qualitative experimental evidence that would offer a more substantive understanding of human-chatbot interactions, such as from participant observations or in-depth interviews. In this respect, we should remember that chatbots are complex systems, and chatbot deployment in public health is a technology design activity (the design of the platform, the communication modality, and content) as much as it is a medical intervention (the design of the intervention and the setting up of measures for its effectiveness). As an interdisciplinary subject of study for both HCI and public health research, studies must meet the standards of both fields, which are at times contradictory [
Studies in the existing research often do not provide sufficient information about the design of the chatbot being tested to be reproducible, including by RCT standards, as the chatbot description is not sufficient for an equivalent chatbot to be implemented. There are further confounding factors in the intervention design that are not directly chatbot related (eg, daily notifications for inputting mood data) or include aspects such as the chatbot’s programmed personality that affect people differently [
Few of the included studies discussed how they handled safeguarding issues, even if only at the design stage. Of those that did, the studies mentioned that they could not provide a person to support the chatbot (ie, conversations with the chatbot are not monitored by a person), so the chatbot was programmed to direct users to contact official health authorities if they had an issue (eg, directing the user to call 911). This approach is a particular concern when chatbots are used at scale or in sensitive situations such as mental health. In this respect, chatbots may be best suited as supplements to be used alongside existing medical practice rather than as replacements [
Although the use of NLP is a new territory in the health domain [
Most of the included papers contained screenshots of the chatbots. However, some of these were sketches of the interface rather than the final user interface, and most of the screenshots were accompanied by insufficient description of the chatbot's capabilities. Although the technical descriptions of chatbots might constitute separate papers in their own right, these descriptions were outside the scope of our focus on evidence in public health. However, a previously published scoping review [
Future research on chatbots would benefit from including more details as to how the chatbot is implemented and what type of NLP it uses and cross-referencing the equivalent technical paper describing the system implementation and technical contribution, if it is available.
More broadly, in a rapidly developing technological field in which there is substantial investment from industry actors, there is a need for better reporting frameworks detailing the technologies and methods used for chatbot development. Similarly, given the huge range of chatbot deployments across a wide variety of public health domains, there is a need for standards of comparative criteria to facilitate a better evaluation and validation of these agents and the methods and approaches that they use to improve health and well-being. Finally, there is a need to understand and anticipate the ways in which these technologies might go wrong and ensure that adequate safeguarding frameworks are in place to protect and give voice to the users of these technologies.
Given the immaturity of the research on chatbots, the huge investment in their development and use for health, and the dynamic nature of AI and HCI, our study does not capture the abundance of chatbots, commercial and otherwise, that have been developed across the domains of public health application. There is a substantial lag between the production of academic knowledge on chatbot design and health impacts and the progression of the field.
Research on the recent advances in AI that allow conversational agents to interact with humans more realistically is still in its infancy in the public health domain. Studies show potential, especially for easily automated and repetitive tasks, but at the same time, concerns with the clinical, legal, and ethical aspects of the use of conversational agents for health care are well founded given the speed with which they have been adopted in practice. There is still little evidence in the form of clinical trials and in-depth qualitative studies to support widespread chatbot use; such evidence is particularly necessary in domains as sensitive as mental health. Most of the chatbots used in supporting areas such as counseling and therapeutic services are still experimental or in trial as pilots and prototypes. Where there is evidence, it is usually mixed or promising, but there is substantial variability in the effectiveness of the chatbots. This finding may be due in part to the large variability in chatbot design (such as differences in content, features, and appearance) but also to the large variability in users’ responses to engaging with a chatbot.
There is no doubting the extent to which the use of AI, including chatbots, will continue to grow in public health. The ethical dilemmas this growth presents are considerable, and we would do well to be wary of the enchantment of new technologies [
Search terms.
Chatbot application domain, purpose, interaction type, and findings summary.
artificial intelligence
human-computer interaction
natural language processing
Preferred Reporting Items for Systematic Reviews and Meta-Analyses Extension for Scoping Reviews
randomized controlled trial
World Health Organization
The original study that is the basis of this paper was commissioned by the World Health Organization. MM is a contractor of Katikati, a social technology start-up enabling 1-to-1 human-led conversations at scale over SMS text messaging, instant messaging, or web.