Alignment between Heart Rate Variability from Fitness Trackers and Perceived Stress: Perspectives from a Large-Scale in Situ Longitudinal Study of Information Workers

Background: Stress can have adverse effects on health and well-being. Informed by laboratory findings that heart rate variability (HRV) decreases in response to an induced stress response, recent efforts to monitor perceived stress in the wild have focused on HRV measured using wearable devices. However, it is not clear that the well-established association between perceived stress and HRV replicates in naturalistic settings without explicit stress inductions and research-grade sensors. Objective: This study aims to quantify the strength of the associations between HRV and perceived daily stress using wearable devices in real-world settings. Methods: In the main study, 657 participants wore a fitness tracker and completed 14,695 ecological momentary assessments (EMAs) assessing perceived stress, anxiety, positive affect, and negative affect across 8 weeks. In the follow-up study, approximately a year later, 49.8% (327/657) of the same participants wore the same fitness tracker and completed 1373 EMAs assessing perceived stress at the most stressful time of the day over a 1-week period. We used mixed-effects generalized linear models to predict EMA responses from HRV features calculated


Motivation and Overview
The World Health Organization classified stress as a 21st-century epidemic [1], as chronic stress can have adverse effects on health and well-being. Stress is the perceived imbalance in demands and resources and is experienced when a situation is appraised as personally significant and taxes or exceeds resources for coping [2]. In the short term, stress is associated with negative feelings, decreased performance and productivity, and muscular problems such as tension and headaches [3,4]. In the long term, stress can lead to significant health problems, including cardiovascular disease, impaired immunity functions, and lower overall quality of life [5,6]. Therefore, the ability to monitor stress through unobtrusive means could help improve health outcomes and well-being.
Approaches to stress measurement fall roughly into two broad categories: (1) measuring stress directly through physiological markers such as heart rate (HR) variability (HRV) [7,8], cortisol [9], or electrodermal activity [10] and (2) using physiological data to predict perceived stress, with self-reports as ground truth [11-14]. Theories on the role of appraisal in the stress response suggest a positive relationship between perceived stress (through appraising a situation as threatening or demanding) and physiological reactions such as changes in cortisol (ie, the stress hormone), respiration, and HR [2,15-17]. Laboratory studies generally confirm this relationship (see the Background section). However, measuring perceived stress in daily life remains an exceedingly challenging task.
Gold standard biological measures of stress such as cortisol (a stress hormone) tend to be time consuming, expensive, and intrusive; they do not allow continuous measurement and may not align with self-reports [18,19]. Researchers have considered other physiological measures associated with the stress response such as HRV, electrodermal activity, and respiration, which can be obtained using less intrusive means such as wearable sensors [20][21][22]. Wearable sensors are some of the least intrusive methods of measuring physiological stress and yield continuous measures with increased frequency and finer temporal granularity than self-reports or cortisol samples. In recent years, the increased quality and battery life and the low cost of wrist-worn wearables have made it possible for studies to focus on the alignment between physiological (HRV) and self-reported measures in daily life [12,23,24], bringing to light some of the limitations of translating laboratory findings to real-world settings.
Although laboratory studies that induced stress supported an association between HRV and perceived stress (eg, using the Stroop Color-Word Interference Test and mental arithmetic problems [25][26][27][28]; also see the study by Kim et al [29] for a review that found differences in HRV in response to stress), studies in daily life settings with and without wearables have yielded mixed results. For instance, in a study of 223 male white-collar workers, Kageyama et al [30] found that daily job stressors did not correlate with short-term electrocardiogram (ECG)-derived HRV features. In contrast, in a study of 909 participants, Sin et al [31] found that ECG-derived HRV features negatively correlated with longer-term (as opposed to daily) perceived stress measured over a period of 8 days. Similarly, Hynynen et al [32] found that HRV measured in an orthostatic test (sitting up after a period of sleep) but not during night sleep was related to longer-term self-reported (global) stress over the past month. Specifically, HRV features were lower in the group with high stress than in the group with lower stress, whereas HR was higher in the group with high stress. Furthermore, in a study of 20 surgeons monitored continuously over 24 hours, Rieger et al [33] separated surgeons into groups experiencing high and low stress and found significantly higher HR and lower HRV during sleep in the group with high stress.
In real-world settings involving wearables, few studies have used HRV to predict perceived stress, and those that have also found mixed results. Hernandez [23] collected physiological and behavioral data to predict self-reported momentary stress (high vs low) from 15 participants during 5 regular days of work. A support vector machine model using HRV features achieved an average accuracy of 56%, slightly better than the 50% baseline. Similarly, in a 4-month study of 35 participants, Muaremi et al [12] achieved a classification accuracy of 59% in a 3-level prediction task of perceived stress (low, moderate, and high), against a 40% baseline. In a simpler classification task of high versus low stress, Wu et al [24] found that HRV features yielded a classification accuracy of 78% in a 2-week study of 8 participants, in a data set where 59% of the samples corresponded to low stress.
These studies demonstrate that HRV associations with perceived stress obtained in situ and with wearables are less consistent than in laboratory studies. The evidence is inconclusive as to whether HRV in real-life settings could reflect daily or momentary perceived stress, as is often assumed in popular applications [8,34-37]. The greatest success comes from a few small-scale studies with simplified (eg, binarized from ordinal ratings with the removal of the more difficult middle cases) stress classification tasks. Given the recency of incorporating HRV measurement in consumer-grade wearable devices to track stress in daily life and the lack of large-scale studies addressing this issue, we report on a main study, where we collected HRV data from wrist-worn wearables, as well as self-reports, for 657 participants across 8 weeks, and a follow-up with 327 (49.8%) of the same participants over 1 week approximately a year later.
We extend previous studies that predicted stress from wearable HRV data in two ways: (1) we collected HRV data in a large-scale longitudinal study in a naturalistic setting (ie, without control over what stressors occur and when); and (2) we incorporated retrospective stress evaluations, including measures of the timing of stressful periods, to investigate whether contextual knowledge of when stress occurs could help predict perceived stress. Our studies also aimed to shed light on potential factors that could explain why self-reports of stress often do not correlate with physiological measures. Specifically, we aimed to understand the extent to which HRV predicts perceived stress in naturalistic settings. Furthermore, given that HRV is a measure of arousal, we also examined the extent to which HRV is specific to stress beyond other high-arousal affective states, including anxiety, negative affect, and positive affect.
The contributions of this study are as follows:
1. We quantified the degree of association between HRV and perceived stress in a longitudinal large-scale in situ study with information workers.
2. HRV can be calculated in many ways over many time scales (eg, 5 minutes to 24 hours). We identified low frequency (LF)/high frequency (HF) ratio, very LF (VLF), triangular index, and SD of the averages of normal-to-normal intervals (SDANN) calculated between 8 AM and 6 PM as the HRV features most strongly associated with perceived stress. Using these optimal features, we found that HRV is a predictor of perceived stress; however, the relationship is not as strong as in the laboratory, indicating that HRV is limited as a sole indicator of perceived stress, as it is often used in modern applications.
3. We found that the same features that indicate stress also predict anxiety, negative affect, and positive affect. However, HRV still uniquely predicts stress after accounting for the shared variance of these related constructs with stress.
4. We describe the limitations of using HRV to measure perceived stress in situ and offer suggestions to improve perceived stress measurement.

Background
Stress is defined as the physiological response to maintain homeostasis in unexpected situations or when perceiving a threat [38][39][40][41]. The stress response is manifested in 2 systems: the autonomic nervous system (ANS), through the sympathetic nervous system (SNS) and parasympathetic nervous system (PNS), and the hypothalamic-pituitary-adrenal (HPA) axis [42]. The SNS outputs epinephrine, which promotes rapid and widespread physiological changes such as increased HR [43,44], whereas the PNS generally does the opposite [40,45-47]. The HPA axis outputs cortisol, a stress hormone, which supports the SNS by increasing available glucose by suppressing other body systems such as immune function and growth [5,48,49]. In general, SNS activity ends when a stressor ends, whereas HPA axis activity may persist for up to 90 minutes after the stressor ends [50][51][52]. Thus, especially over time and with chronic stressors (eg, caregivers of patients with dementia), there may be a sustained cortisol response in the absence of specific SNS activity [53][54][55]. Many of the chronic detrimental effects of stress, such as the increased risk of heart disease, diabetes, and mortality, are associated with increased cortisol [5,56-58].
HRV is a measure of ANS activity and has been associated with health and physical and mental stress [25,29,59-65]. HRV measurement relies on the detection of RR intervals; that is, the time between upward deflections in an ECG. Effective clinical ECG measurements require the assistance of a trained clinician to ensure correct electrode placement. A more user-friendly version for (fitness conscious) consumers is chest straps (eg, Zephyr Bioharness [66,67]) that capture waveforms in the same manner as an ECG and do not require a clinician while still being vulnerable to improper positioning.
At the other end of the spectrum, photoplethysmography sensors approximate the measurement of RR intervals by detecting beat-to-beat intervals (BBI) evidenced by volumetric changes in the microvascular bed of tissue [68,69]. Traditionally used in wearable equipment such as fitness trackers, smartwatches, and armbands, they are easy to fit and have extended battery life, therefore allowing for continuous measurement of BBI and, in consequence, HRV. This has enabled a myriad of applications that use these sensors to measure HRV and provide a measurement of "stress" [8,34-37]. However, although HRV is associated with stress in laboratory studies, as discussed previously, HRV only measures one component of the stress response: ANS activity. Although short-duration, acute stressors may evoke a strong SNS response, chronic stressors that are characterized by increased cortisol in the absence of an SNS response may not be detected by HRV alone but could still influence self-reports of perceived stress.
The differences between SNS and HPA axis activity, their measurement, and the time courses of responses may play a role in when (or whether) a relationship is found between physiological responses and self-reported stress (eg, cortisol assessed via blood shows faster responses than cortisol measured by saliva). For instance, one study [51] induced stress and found that self-reported stress was associated with physiological stress (increased HR and cortisol) only if assessed during the stressor task. Self-reported stress before or after the stressor did not correlate with physiological stress during the same period. Other studies suggest there may be a lag between perceived and physiological stress where subjective stress responses precede cortisol (endocrine) responses [70]. Gaab et al [71] found that anticipatory but not retrospective cognitive appraisal of stress (self-report) is an important determinant of the cortisol stress response, indicating that the timing of the self-report in relation to the stressor affects whether a relationship is found between perceived and physiological stress. In contrast, Oldehinkel et al [72] found that perceived stress before a social stressor in the laboratory did not predict physiological responses, although changes in perceived arousal and unpleasantness were associated with changes in HR, respiratory sinus arrhythmia, and cortisol during the stressor. Furthermore, perceived stress measured after the stressor was inversely associated with HR during the stressor.
Regarding field studies, in a literature review on the association between salivary cortisol and self-reported stress, Hjortskov et al [18] reported a lack of sufficient evidence of an association between self-reported mental stress and the cortisol response in field studies. The review suggested that the large diversity in study designs and stress measurements possibly obscured any potential relationship. However, these findings from previous studies on the association between perceived and physiological stress indicate a relationship that may be dependent on the temporal resolution of both measurements.
Taken together, the data suggest that HRV is a reliable measure of perceived stress during stressful tasks in the laboratory. However, reliability can be eroded in naturalistic studies for several reasons. First, ecological momentary assessments (EMAs) for stress may not occur (or be answered) during a stressor, which may reduce the accuracy of physiological signals for predicting self-reported stress. Second, HRV-based measures of stress would require a stressor that evokes an HR or HRV response rather than a chronic stressor that may influence self-reports but not HR (eg, a chronic illness). Third, self-reported stress may be reflecting memory biases or coping responses (eg, see the studies by Redelmeier and Kahneman [73] and Scheier et al [74]). Fourth, there are contradictory results for the best time to measure the physiological response of a self-reported stressor (albeit possibly because of methodological differences), coupled with the lack of precise and complete information on stressors that influence the perceived stress level themselves. Finally, HRV measured from wearable sensors might not be sufficiently reliable and might be too sensitive to noise (eg, motion artifacts), thereby obfuscating any potential relationship [70]. Given these challenges, this study sought to investigate the relationship between HRV measured through wearable sensors and perceived stress in a large sample across an extended period and in situ.

Data Collection
These data were collected as part of the larger Tesserae Project [75]. Most participants came from 4 distinct organizations (denoted by O1, O2, O3, and O4), and others from various organizations (denoted by U). Participants were enrolled both on site and remotely. The characteristics of the participants, sensing streams, and study details of the Tesserae study are described in the study by Mattingly et al [75].
Participants were enrolled between January and July of 2018 for the main study, where psychological and physiological measurements of 657 participants were collected during the first 56 days of study participation. These data were used to analyze associations between HRV and self-reported perceived stress.
On the basis of the results from this study, we conducted a 1-week follow-up study with 49.8% (327/657) of the same participants in April 2019 to ascertain whether the link between HRV and perceived stress could be improved by refining the self-reporting procedure.

Demographics
Demographics were collected from a survey administered at the onset of participation (Table 1).

Main Study
Stress was measured using the question, "Overall, how would you rate your current level of stress?" on a 5-point Likert scale ranging from 1 (no stress at all) to 5 (a great deal of stress). The responses were distributed as follows: 5303 responses were 1s (no stress at all); 5108 responses were 2s (very little stress); 3593 responses were 3s (some stress); 573 responses were 4s (a lot of stress); and 118 were 5s (a great deal of stress). This item was validated in an unpublished study [76] (available upon request) with 991 Mechanical Turk participants (Table S10 in Multimedia Appendix 1 provides correlations with other measures). Affect was measured using the 10-item Positive and Negative Affect Short inventory [77,78]. The distribution of the responses is available in Figure 1. Anxiety was measured using a validated single-item omnibus measure of anxiety, "Please select the response that shows how anxious you feel at the moment," on a 5-point Likert scale ranging from 1 (not at all anxious) to 5 (extremely anxious) [79]. The anxiety responses were distributed as follows: 7501 responses were 1s (not at all anxious); 5081 responses were 2s (a little anxious); 1659 were 3s (moderately anxious); 354 were 4s (very anxious); and 100 were 5s (extremely anxious). EMAs were administered once a day through Qualtrics Surveys at 8 AM, 12 PM, or 4 PM over 8 weeks. Participants were prompted to answer the EMAs through SMS text messages.
Given that the variables were measured repeatedly for each participant throughout the study, we used the repeated-measures correlations [80] procedure to correlate the response variables in the main study. The correlations are shown in Table 2.

Table 2. Repeated-measures correlation between response measures in the main study and 95% CI.
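To illustrate what a repeated-measures correlation captures, the following is a minimal sketch, not the code of the cited procedure [80]. It assumes the common formulation in which the coefficient equals the Pearson correlation of participant-mean-centered values, with N - k - 1 degrees of freedom for N observations from k participants. The function name and input layout are hypothetical.

```python
import math
from collections import defaultdict

def repeated_measures_correlation(participant_ids, x, y):
    """Within-participant correlation of x and y (sketch, assumed formulation)."""
    groups = defaultdict(list)
    for pid, a, b in zip(participant_ids, x, y):
        groups[pid].append((a, b))

    # Center each variable on the participant's own mean so that
    # between-participant differences in level do not drive the estimate.
    cx, cy = [], []
    for rows in groups.values():
        mean_a = sum(a for a, _ in rows) / len(rows)
        mean_b = sum(b for _, b in rows) / len(rows)
        cx.extend(a - mean_a for a, _ in rows)
        cy.extend(b - mean_b for _, b in rows)

    sxy = sum(a * b for a, b in zip(cx, cy))
    sxx = sum(a * a for a in cx)
    syy = sum(b * b for b in cy)
    r = sxy / math.sqrt(sxx * syy)
    dof = len(cx) - len(groups) - 1  # N observations, k participants
    return r, dof
```

In this formulation, two participants with very different average stress levels but the same day-to-day co-variation between two measures contribute equally to the coefficient.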

Follow-up Study
In the follow-up study, EMAs were sent at 4 PM every day over a week (Monday to Sunday). We collected stress by asking the same item as in the main study along with the following questions: "When did the most stressful part of your day start?" (answered by entering hours and minutes in free-form fields); "When did the most stressful part of your day end?" (also answered by entering hours and minutes in free-form fields); and "How stressful was that time?" (answered on a 5-point Likert scale ranging from 1 [no stress at all] to 5 [a great deal of stress]). The responses to the stress question as stated in the main study were distributed as follows: 205 responses were 1s (no stress at all); 530 responses were 2s (very little stress); 484 responses were 3s (some stress); 22 responses were 4s (a lot of stress); and 132 were 5s (a great deal of stress).
The responses to the question "How stressful was that time?" were distributed as follows: 36 responses were 1s (no stress at all); 254 responses were 2s (very little stress); 732 responses were 3s (some stress); 71 responses were 4s (a lot of stress); and 280 were 5s (a great deal of stress).
From the timings provided by participants, we calculated the duration of the reported most stressful time of the day, as well as the length of time between the end of that moment and when the participant answered the survey. We refer to the stress question asked in the same way as in the main study as perceived stress at the time of survey response, whereas we refer to the item introduced in the follow-up study as perceived stress at the reported most stressful time of the day. Figures 2 and 3 provide the distribution of responses, and Table 3 shows the correlation of the responses [80].
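The two derived timing variables can be sketched as follows. This is an illustration only: the function name, the (hour, minute) tuple input, and the choice to return minutes are assumptions, and the source does not state how implausible entries (eg, an end time before the start time) were handled, so this sketch simply returns negative values for them.

```python
from datetime import date, datetime

def stress_timing(day, start_hm, end_hm, answered_at):
    """Return (duration_min, elapsed_min) for one daily follow-up response.

    day         -- calendar date of the survey
    start_hm    -- (hour, minute) when the most stressful period started
    end_hm      -- (hour, minute) when the most stressful period ended
    answered_at -- datetime when the participant answered the survey
    """
    start = datetime(day.year, day.month, day.day, *start_hm)
    end = datetime(day.year, day.month, day.day, *end_hm)
    duration_min = (end - start).total_seconds() / 60
    elapsed_min = (answered_at - end).total_seconds() / 60
    # Negative values flag entries that would need review (assumption).
    return duration_min, elapsed_min

# Example: stressful period 9:30-11:00, survey answered at 16:15.
duration, elapsed = stress_timing(
    date(2019, 4, 8), (9, 30), (11, 0), datetime(2019, 4, 8, 16, 15)
)
```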

Physiological Measures
Wearables can accurately detect HR, especially in conditions of rest or mild exercise [81], although they can have missing data [82]. To measure HR and BBI, from which HRV is computed, participants wore the Garmin vivosmart 3 fitness band (24/7) for the duration of their participation. The same sensors were used in the main study and the follow-up.
In both studies, we examined the associations between HRV and the psychological measures in our sample. To do so, we derived a series of HRV features by adopting standards for the measurement, physiological interpretation, and clinical use of HRV from the North American Society of Pacing and Electrophysiology [29]. In total, we computed 16 HRV features across different time windows using the "hrvanalysis" Python library [83], each with a minimum and maximum recording time within the recommended ranges established by Shaffer and Ginsberg [84]. Of these features, 5 were from time domain analyses, which measure variation in HR over time, or the intervals between HR cycles [29]. Triangular index was the single geometric method used [85]. A total of 7 features were from frequency domain analyses [24] where the power spectral density analysis of the HRV frequency domain provides information about how power in a signal is distributed as a function of frequency, which allows the autonomic balance to be quantified at a specific time [29]. The remaining 3 features were nonlinear HRV features, which characterize changes in HRV [86][87][88]. In this study, we focused on features derived from the Poincaré plot (ie, the scatter plot of successive BBIs: BBI(n) vs BBI(n+1)). Table 4 shows the mean and SD of the features across 3 different time windows. As HRV features have different applications but are nevertheless correlated among themselves to varying degrees [84,89], we examined previous studies to select which features to include in our modeling. We started by selecting the three time domain features and one geometric method feature recommended by the Task Force of the European Society of Cardiology and the North American Society of Pacing and Electrophysiology [85]: SD of normal-to-normal intervals (SDNN), root mean square of successive differences (RMSSD), SDANN, and triangular index.
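As an illustration of the time domain and Poincaré features named above, the following sketch computes a few of them directly from a list of BBIs in milliseconds. It is not the "hrvanalysis" implementation used in the study. The function name is hypothetical, and the SD1 and SD2 formulas are the commonly used variance-based definitions, assumed here.

```python
import math
from statistics import mean, stdev, variance

def basic_hrv_features(bbi_ms):
    """Compute a few HRV features from beat-to-beat intervals (ms)."""
    if len(bbi_ms) < 3:
        raise ValueError("need at least 3 intervals")
    diffs = [b - a for a, b in zip(bbi_ms, bbi_ms[1:])]

    sdnn = stdev(bbi_ms)                                # SD of the intervals
    rmssd = math.sqrt(mean([d * d for d in diffs]))     # RMS of successive differences
    pnn50 = 100.0 * sum(abs(d) > 50 for d in diffs) / len(diffs)

    # Poincare descriptors (assumed definitions): SD1 is the spread across the
    # identity line, SD2 the spread along it, so SD1^2 + SD2^2 = 2 * SDNN^2.
    half_diff_var = 0.5 * variance(diffs)
    sd1 = math.sqrt(half_diff_var)
    # Clamp at 0 because very short or erratic series can make the term negative.
    sd2 = math.sqrt(max(2 * sdnn ** 2 - half_diff_var, 0.0))

    return {"mean_rr": mean(bbi_ms), "sdnn": sdnn, "rmssd": rmssd,
            "pnn50": pnn50, "sd1": sd1, "sd2": sd2}

features = basic_hrv_features([800, 820, 850, 870, 860, 830, 790, 770])
```

Under these definitions, SD1 is approximately RMSSD divided by the square root of 2 when the mean successive difference is near zero, so the two features carry the same information up to scale.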
As RMSSD and SD1 are identical, as are SDNN and SD2, we only entered RMSSD and SDNN in the models [84]. LF power in normalized units and HF power in normalized units are redundant with each other and capture the same information as LF/HF; therefore, we only included LF/HF in the models to estimate the ratio between SNS and PNS activity [84,90]. HF is also strongly correlated with PNN50 and RMSSD; therefore, we did not include it in the models. Despite eliminating SD1, SD2, HF, and LF, we decided to keep the ratios SD2/SD1 and LF/HF, as they could capture additional information compared with the individual measures [84]. The correlations among the final set of features across long-term (24 hours) and short-term (5 minutes) windows are shown in Tables 5 and 6. Finally, as HRV measurements explain different phenomena depending on the time window, we decided to use variance inflation factor (VIF) feature elimination [91] to determine the set of features for each particular model and time window and address concerns with multicollinearity.
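A VIF-based elimination loop of the kind described can be sketched as follows. This is an assumption-laden illustration rather than the study's code: it takes each feature's VIF from the diagonal of the inverse correlation matrix, drops the feature with the largest VIF, and repeats until all remaining VIFs are at or below a threshold (3 is used as the default here). All function names are hypothetical.

```python
import math

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def invert(matrix):
    """Gauss-Jordan inverse with partial pivoting (small matrices only)."""
    n = len(matrix)
    aug = [list(row) + [float(i == j) for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("perfectly collinear features")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def vifs(features):
    """VIF per feature = diagonal of the inverse correlation matrix."""
    names = list(features)
    corr = [[pearson(features[a], features[b]) for b in names] for a in names]
    inverse = invert(corr)
    return {name: inverse[i][i] for i, name in enumerate(names)}

def eliminate_by_vif(features, threshold=3.0):
    """Iteratively drop the feature with the highest VIF above the threshold."""
    kept = dict(features)
    while len(kept) > 1:
        scores = vifs(kept)
        worst = max(scores, key=scores.get)
        if scores[worst] <= threshold:
            break
        del kept[worst]
    return list(kept)
```

Because the loop is rerun for each model and time window, the retained feature set can differ between windows, which matches the rationale given above.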

Data Exclusion
To account for missing EMA or smartwatch data during both studies (eg, dead battery or device not worn), days were excluded from the sample if any value was missing from the predictors for that day. This resulted in a final data set of 14,695 entries in the main study and 1373 in the follow-up study of matching psychological and physiological measures.

Main Study
The main purpose of this study was to examine the relationship between HRV and perceived stress as assessed by a daily stress survey. Many of the HRV features calculated are suited for short time frame measurements (eg, 2 minutes), as well as the long term (eg, 24 hours); however, Shaffer and Ginsberg [84] cautioned that these are not to be used interchangeably. Therefore, given the conflicting evidence presented in the related work as to when it is best to measure HRV in relation to a stressful event, we tested a series of models for predicting the daily stress survey response, with HRV features derived (1) 5 minutes before completing the survey, (2) 30 minutes before, (3) 5 minutes after, (4) 30 minutes after, (5) using time windows of varying length (5 minutes, 30 minutes, 1 hour, 2 hours, 4 hours, 8 hours, and 24 hours) centered on the moment the survey was started, (6) during the entire 24 hours on the day a participant answered the survey, and (7) during the "work day" from 8 AM to 6 PM. For the sake of brevity, we report all the coefficients only for the model using the time frame with the best fit in the main results, whereas the coefficients of the models across all other time windows are reported in the form of density plots in Figure S5 in Multimedia Appendix 2. Finally, we examined the overall variance explained in the outcome measure of daily perceived stress from the HRV features.
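The window definitions listed above can be made concrete with a short sketch. The function names, the string labels for window types, and the half-open interval convention are assumptions for illustration; the source does not specify how window boundaries were handled.

```python
from datetime import datetime, time, timedelta

def window_bounds(survey_start, kind, minutes=None):
    """Return (start, end) datetimes of an HRV window relative to a survey."""
    day = survey_start.date()
    if kind == "before":      # eg, 5 or 30 minutes before the survey
        return survey_start - timedelta(minutes=minutes), survey_start
    if kind == "after":       # eg, 5 or 30 minutes after the survey
        return survey_start, survey_start + timedelta(minutes=minutes)
    if kind == "centered":    # window of the given length centered on the survey
        half = timedelta(minutes=minutes) / 2
        return survey_start - half, survey_start + half
    if kind == "day":         # the entire 24 hours of the survey day
        midnight = datetime.combine(day, time(0))
        return midnight, midnight + timedelta(days=1)
    if kind == "workday":     # 8 AM to 6 PM on the survey day
        return datetime.combine(day, time(8)), datetime.combine(day, time(18))
    raise ValueError(f"unknown window kind: {kind}")

def select_intervals(beats, start, end):
    """Keep BBIs whose timestamp falls in [start, end); beats = [(timestamp, bbi_ms), ...]."""
    return [bbi for ts, bbi in beats if start <= ts < end]

survey = datetime(2018, 3, 5, 12, 0)
start, end = window_bounds(survey, "centered", minutes=30)
```

Features would then be computed on the intervals returned by `select_intervals` for each window, yielding one feature set per window and survey response.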
To determine whether HRV specifically predicts stress or simply indicates arousal, which correlates with other psychological measures, we first built models to examine whether our derived HRV features predicted other survey measures that are known to have a relationship with psychological stress or arousal: positive affect, negative affect, and anxiety [92,93]. Then, to understand whether there is specificity in predicting perceived stress, we further built two models: a model predicting stress using anxiety, positive affect, and negative affect as predictors, and a second model incorporating HRV as an additional predictor.

Follow-up Study
In the analysis conducted in the follow-up study, we leveraged the additional information gained from participants related to their perceived stress duration and evaluated how well the HRV features can predict perceived stress at the reported most stressful time of the day and again predict perceived stress at the time of survey response (the same question asked in the main study). For predicting perceived stress at the reported most stressful time of the day and perceived stress at the time of survey response, in this study, we computed the HRV features in the same manner as in the main study and used the best performing time window found earlier while also considering HRV features calculated during participants' reported most stressful periods for that day. We proceeded to compare these 2 models in predicting both perceived stress at the reported most stressful time of the day and perceived stress at the time of survey response. In addition, we considered the duration of perceived stress at the reported most stressful time of the day as an outcome measure in itself to better understand whether HRV is related to the saliency (score) of the stress events or the duration.

Modeling Strategy
As our data comprise repeated observations for each participant, and stress and anxiety are ordinal variables, we used cumulative link mixed-effects models [94] with a random intercept for the participant. We considered using random slopes in our models but decided against it because of model convergence issues in the main study and not having enough observations to support such a random effects structure in the follow-up study. For positive affect and the duration of perceived stress at the reported most stressful time of the day (follow-up study), we used linear mixed-effects models [95,96], as these variables can be considered continuous. For negative affect, we used a negative binomial generalized linear mixed-effects model, given the distribution of the variable (Figure 1). As stated earlier, we used VIF [97] feature elimination to iteratively remove features with VIF >3 to address multicollinearity [91,98]. As the predictors were on vastly different scales, all predictor variables were z score standardized before being entered into the models. Pseudo-R² values for both marginal (fixed effects alone) and conditional (random and fixed) effects are reported using the method described by Nakagawa and Schielzeth [99].
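To make the structure of a cumulative link model with a random intercept concrete, the sketch below shows how standardized predictors, fixed-effect coefficients, and a participant-specific intercept map to probabilities over the 5 response categories. It assumes a logit link with the common parameterization logit P(Y <= k) = theta_k - eta; the link function used in the study is not stated in this section, and all names, thresholds, and coefficient values are hypothetical. It does not fit a model.

```python
import math

def zscore(values):
    """Standardize a predictor to mean 0 and sample SD 1."""
    n = len(values)
    mu = sum(values) / n
    sd = math.sqrt(sum((v - mu) ** 2 for v in values) / (n - 1))
    return [(v - mu) / sd for v in values]

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

def linear_predictor(z_features, betas, random_intercept):
    """eta = sum of beta * standardized feature, plus the participant's intercept."""
    return sum(b * z for b, z in zip(betas, z_features)) + random_intercept

def category_probabilities(eta, thresholds):
    """P(Y = k) for k = 1..K, given K - 1 increasing thresholds (assumed logit link)."""
    cumulative = [logistic(t - eta) for t in thresholds] + [1.0]
    probs, previous = [], 0.0
    for c in cumulative:
        probs.append(c - previous)
        previous = c
    return probs

# Hypothetical values: 2 standardized HRV features, 4 thresholds for a 5-point scale.
eta = linear_predictor([0.5, -1.2], betas=[0.10, -0.08], random_intercept=0.3)
probs = category_probabilities(eta, thresholds=[-1.0, 0.5, 2.0, 3.5])
```

A larger linear predictor shifts probability mass toward higher stress categories, and the random intercept lets each participant have their own baseline tendency to report stress.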

Ethics Approval
The study protocol was approved by the University of Notre Dame Institutional Review Board (17-5-3870).

Figure 4 provides a density plot of the variance explained (pseudo-R²) by the HRV features across all periods. On average, HRV explained a small portion (approximately 1%) of the variability in perceived stress. We also found that the model with features computed during the hours of 8 AM to 6 PM had the lowest Akaike information criterion (AIC) and explained the highest variance (Figure 4), although this was still modest (2.2%). Coefficients for this model are reported in Table 7, whereas density plots of coefficients across all time windows are included in Figure S5 (Multimedia Appendix 2).

Main Study
Regarding whether HRV predicts perceived stress specifically or simply predicts arousal, we found that the directionality of most of the associations was the same for stress, anxiety, positive affect, and negative affect (Tables 7 and 8). Mean RR interval was a significant predictor of anxiety and positive affect but was not significant in predicting stress. LF to HF ratio and triangular index were both significant predictors of stress; however, LF to HF ratio was not a significant predictor of negative affect, and triangular index was not a significant predictor of positive affect.
In addition, after controlling for positive affect, negative affect, and anxiety, most HRV features were still significant predictors of perceived stress, and when compared against a model that only considers the measures of affect and anxiety, a model containing HRV provided a better fit (Table 7), as confirmed by likelihood ratio tests and AIC (χ²(5)=157.8; P<.001; AIC 23,561 vs 23,709).
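The two fit statistics reported here are linked for nested models: because AIC = 2k - 2 ln L, the AIC difference between the reduced and full model equals the likelihood ratio statistic minus twice the number of added parameters. The short check below applies this identity to the reported values; the function name is hypothetical, and small discrepancies are expected from rounding.

```python
def aic_difference_from_lrt(chi_square, extra_params):
    """AIC(reduced) - AIC(full) implied by a likelihood ratio test on nested models."""
    return chi_square - 2 * extra_params

implied = aic_difference_from_lrt(157.8, 5)   # about 147.8
reported = 23709 - 23561                      # 148, from the rounded AIC values
```

The implied and reported differences agree to within rounding, which indicates that the two statistics describe the same comparison between the affect-only model and the model with 5 additional HRV terms.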

Follow-up Study
We first assessed whether using the context provided by participants to determine an HRV window to calculate the features provided a benefit over the previously found best time window of work hours of the day. Our outcome variables were perceived stress at the time of survey response and perceived stress at the reported most stressful time of the day. In the case of perceived stress at the time of survey response, the model of HRV during work hours (reported in Table 9) achieved the best fit with an R² of 0.032 versus 0.022 and AIC of 3465 versus 3475, therefore favoring the model with HRV features calculated during the workday, as in the main study. It also replicates findings from the main study, which found an R² of 0.022. Similar results were obtained when predicting perceived stress at the reported most stressful time of the day (R² of 0.023 vs 0.015), with the model based on HRV during work hours reported in Table 9 and a full comparison available in Tables S11 to S12 in Multimedia Appendix 3. Thus, we did not observe benefits from computing HRV features based on the self-reported most stressful time of the day compared with the entire workday. We also found that HRV during work hours was predictive of the duration of perceived stress at the reported most stressful time of the day (Table 9), although the fit was weak.
As the duration of perceived stress at the reported most stressful time of the day was correlated with perceived stress at the time of survey response and perceived stress at the reported most stressful time of the day scores (Table 3), we conducted a post hoc analysis to investigate whether HRV could predict the saliency of the perceived stress while controlling for the effects of the duration of the event and elapsed time since it occurred (contextual features provided through self-report). Including HRV features along with contextual features provided a better fit (R² of 0.064 vs 0.050) over simply using the contextual features. This was further confirmed by likelihood ratio tests and AIC (χ²(5)=22.9; P<.001; AIC 3242 vs 3255; see Tables S13 and S14 in Multimedia Appendix 4 for the full models).

Table 9. Prediction of perceived stress at the time of survey response, perceived stress at the reported most stressful time of the day, and duration of perceived stress at the reported most stressful time of the day with the same predictors (heart rate variability during work hours) as in the best model in the main study.

Principal Findings
Stress is associated with many negative outcomes [3-6], thereby making accurate measurement and management of it an important aspect of improving both physical and mental health outcomes. To this end, the ubiquitous computing and mobile health communities have turned to wearables and, more specifically, identified wearable-sensed HRV as an attractive method for passively sensing stress [12,23,24,29]. However, does the evidence support associating HRV (as measured with wearables in the wild) with stress, as perceived by the user?
We found that the best model yielded a marginal R² of 2.2%, which approximately corresponds to a correlation of 0.15 and a Cohen d of 0.30, which lies between a small (Cohen d=0.20) and a medium (Cohen d=0.50) effect [100,101]. Thus, HRV was weakly, although significantly, associated with perceived stress when measured using a wearable in naturalistic settings. The size of this effect is, to some degree, expected, given that HRV only measures ANS activity and not HPA activity, thus being an incomplete assessment of stress, even in ideal conditions. That said, we would have expected a stronger relationship between perceived stress and HRV a priori, given its popular use in assessing stress [8,34-37]. Nevertheless, despite the small magnitude of the effect, we also found some evidence for incremental prediction in that HRV uniquely predicted perceived stress above and beyond self-reported positive affect, negative affect, and anxiety (Table 7).
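The correspondence between the marginal R², the correlation, and the Cohen d can be reproduced with a short calculation. The sketch below assumes the standard conversions r = √R² and d = 2r/√(1 - r²); the text does not state which formulas were used, so these are an assumption, and the function names are illustrative.

```python
import math

def r_from_r_squared(r_squared):
    """Correlation magnitude implied by a proportion of explained variance."""
    return math.sqrt(r_squared)

def cohen_d_from_r(r):
    """Standard conversion from a correlation r to Cohen d (assumed here)."""
    return 2 * r / math.sqrt(1 - r ** 2)

marginal_r2 = 0.022  # marginal R-squared of the best model, as reported

r = r_from_r_squared(marginal_r2)
d = cohen_d_from_r(r)

print(f"r = {r:.2f}, Cohen d = {d:.2f}")  # prints "r = 0.15, Cohen d = 0.30"
```

Under these assumed conversions, the rounded values match the reported correlation of 0.15 and Cohen d of 0.30.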
We do not believe the small effect size is because of how perceived stress was assessed, as using validated assessments of related constructs, such as negative affect and anxiety, which were highly correlated with stress (Table 2), yielded similar results (Table 8). Our findings suggest that the signal provided by wearable-measured HRV is of limited use in predicting perceived stress in the wild in the absence of clear and isolated stressors (such as those provided in laboratory studies).
Regarding the optimal temporal association between HRV and perceived stress, we found that HRV features measured around the time of the survey response, when participants were assessing their current stress level, yielded a lower fit than a generic time window covering the workday (ie, between 8 AM and 6 PM). This is different from the results in laboratory settings, which suggest the optimal time window to be shorter and closer to the assessment of stress, given the quick SNS response to induced stress. Although the length of the time window in which HRV is measured can affect what contributes to the changes in the HRV features (eg, circadian rhythms might be captured with longer-term HRV but not short-term HRV [84]), the estimates found within the "workday" time window of 8 AM to 6 PM were generally consistent in directionality with previous literature for changes in HRV because of stress.
Specifically, triangular index and SDANN were both negatively associated with perceived stress. Both of these match the expectation that lower HRV would indicate higher stress [29]. VLF was positively associated with perceived stress, which is to be expected as SNS activity because of stress (among other reasons) modulates the amplitude and frequency of HRV measured in this band [84,102]. Finally, the ratio of LF to HF was negatively associated with perceived stress in the work hours time window, which might be considered counterintuitive. In controlled conditions, LF/HF can be used as a measure of autonomic balance; that is, it is assumed that PNS and SNS activity contributes to LF, and PNS largely contributes to HF [84]. Therefore, one could have expected a higher LF/HF ratio to equate to higher perceived stress, as it would indicate more SNS than PNS activity. Nevertheless, as highlighted in the study by Shaffer and Ginsberg [84], because of the complex relationship between SNS and PNS activity, LF/HF ratio will not always index autonomic balance. Thus, it is possible that in the conditions of this study, either a higher LF/HF was an indicator of higher PNS activity over SNS activity, or a higher PNS activity was a better marker for the saliency of a previous stressful event from which the participant was recovering at the time of the survey response.
In the follow-up study, our modified stress survey aimed to identify and compute HRV based on participants' most stressful time of the day. Although this is impractical for a real-world use case, it does allow measurement of HRV closer to the stressor, as in many laboratory studies. Nevertheless, measuring HRV during the most stressful time of the day yielded a lower model fit than using the generic 8 AM to 6 PM time window (Multimedia Appendix 3). Therefore, we believe the small effect of HRV as a predictor of stress ostensibly resides in the conditions of measurement themselves. Specifically, in laboratory-based studies, the measurements of changes in HRV because of stress occur in the presence of clear and isolated stressors (eg, stress being induced by the study conditions, causing an increase in SNS activity), which, in turn, implies that HRV changes because of stress, and these changes can often cease with the end of a stressor [51]. Discrete and isolated stressors of the kind used in controlled laboratory studies may not be as common in naturalistic settings, making results obtained under controlled conditions not fully applicable to daily life settings.
In naturalistic settings, identifying perceived stress at the precise moment of a clear and isolated stressor would be difficult to achieve from HRV alone for several reasons. First, physiological stress is different from perceived stress. For instance, physical exertion or exercise is generally classified as a physiological stressor (and would exhibit increases in HR, decreased HRV, and increases in cortisol); however, it is well known that exercise can reduce perceived stress [103] and generally would not be reported as stressful by participants. Second, self-reports are subject to emotional perception and expression biases [104-107], as well as memory biases and/or coping responses [73,74]. Finally, EMAs are designed to measure stress at either random or specific times, although participants may not respond at the designated time (eg, at the end of a stressor as opposed to the middle of a stressor).
In summary, our main conclusion is that the reported association between HRV and perceived stress may depend on laboratory conditions. In naturalistic studies, there are no clear and direct links between isolated stressors and SNS responses. Although there is still an observable association between wearable-measured HRV and perceived stress, it is weak, and it suggests that HRV alone should not be considered a valid proxy measure of perceived stress in naturalistic studies.

Implications of This Study
Although HRV has been shown to be a useful biomarker of perceived stress in laboratory studies, we have shown that in the wild, perceived stress does not always align strongly with physiological stress. This is of special importance as an increasing number of studies and commercial applications in the ubiquitous computing community use wearables to measure stress using HRV, sometimes under the assumption that there is a very strong alignment between the two, when, in fact, the alignment is more tenuous. Although it is beneficial to have wearables capable of providing continuous measurement of HRV unobtrusively, we caution against the use of HRV features as sole or main indicators of "stress" in user-facing applications, as the results may not align with perceived stress. This level of inaccuracy risks an increase of distrust in health and well-being applications at a minimum. It can have more profound negative effects as well, and based on the present findings, labeling HRV as "stress" without proper validity data would be highly suspect. Therefore, we would encourage future work in the scientific community to investigate complementary sensing streams that could serve as markers of stress and use those in conjunction with HRV.
To realize the goal of monitoring the health of individuals, such sensing streams should be rigorously vetted through longitudinal studies to appropriately measure their predictive power for capturing intraindividual differences over time. Nevertheless, it is unlikely that any single physiological sensing stream would be able to perfectly align with perceived stress. Therefore, rather than looking at a single biomarker of the ANS, such as HRV, a more complete view of the ANS response could perhaps delineate a viable strategy for unobtrusive health monitoring in the wild. More broadly, approaches based on multimodality are more likely to yield successful outcomes in health monitoring, as recent studies show in other fields such as sleep monitoring [108], job performance monitoring [109,110], and personality prediction [111].

Limitations
It is important to note that this study has limitations. First, our sample comprised information workers who might be less likely to have movement artifacts that could affect the wearable measurements of HRV. Second, our sample was fairly homogeneous, with participants whose income and education levels were above the US average (low-income and lower education populations were underrepresented). Third, we are unable to determine the accuracy of self-reported stress durations and timing of stress. Similarly, the duration of the most stressful time of the day was correlated with the perceived stress at that time, and it is possible that participants' response to one question influenced the answer to the other (ie, judging stressors that last longer as more intense). Finally, the items introduced in the follow-up study were not validated in this or other studies. Addressing these limitations is a goal for future work.

Conclusions
We examined the alignment of physiological stress (HRV), as measured with a consumer-grade wearable device, and perceived stress in an 8-week study with information workers from multiple organizations across the United States. We found a weak but significant association between HRV and perceived stress, which was replicated in a week-long follow-up study a year later. Computing HRV across the workday outperformed other time windows, including those based on self-reported stressful events. Overall, our findings suggest that wearable-based HRV should not be used as a sole biomarker for perceived stress in naturalistic settings. Instead, it might best be used in conjunction with other measures to capture this complex phenomenon in the wild.