07. GLOBEM
Data Information
Data Group
The datasets from the four years are named:
- INS-W_1 (2018) - 155 participants
- INS-W_2 (2019) - 218 participants
- INS-W_3 (2020) - 137 participants
- INS-W_4 (2021) - 195 participants
Data Type
Each dataset consists of three parts:
- Feature Data
- Survey Data
  a. Pre/Post Surveys: collected at the start/end of the study
  b. EMA Surveys: collected regularly during the study
- Participant Info Data
Behavior Feature Data
- Source: a mobile phone and a wearable fitness tracker
- Period: 24×7.
- Feature data types
- Location
- PhoneUsage
- Call
- Bluetooth
- PhysicalActivity
- Sleep
Participant Info Data
- smartphone platform
- demographics (e.g., age, gender, racial group)
Data Structure
Each dataset is a folder with a unique name, and every dataset has three folders:
- SurveyData: a list of files containing participants' survey responses, including pre/post long surveys and weekly short EMA surveys.
- FeatureData: behavior feature vectors from all data types, using RAPIDS as the feature extraction tool.
- ParticipantInfoData: some additional information about participants, e.g., device platform (iOS or Android).
. root of a dataset folder
├── SurveyData
│ ├── dep_weekly.csv
│ ├── dep_endterm.csv
│ ├── pre.csv
│ ├── post.csv
│ └── ema.csv
├── FeatureData
│ ├── rapids.csv
│ ├── location.csv
│ ├── screen.csv
│ ├── call.csv
│ ├── bluetooth.csv
│ ├── steps.csv
│ ├── sleep.csv
│ └── wifi.csv
└── ParticipantsInfoData
    └── platform.csv
Survey Data
File Instructions
The SurveyData folder contains five files, all indexed by pid and date:
- pre.csv: The file contains all questionnaires that participants filled in right before the start of the data collection study (thus pre-study).
- post.csv: The file contains all questionnaires that participants filled in right after the end of the data collection study (thus post-study).
- ema.csv: The file contains all EMA surveys that participants filled in during the study. Some EMAs were delivered on Wednesdays, while some were delivered on Sundays.
Our current benchmark takes depression detection as the main task. Thus we also prepare the following two files. We envision future work can be extended to other modeling targets as well.
- dep_weekly.csv: The specific file for depression labels (column dep) combining post and EMA surveys.
- dep_endterm.csv: The specific file for depression labels (column dep) only in post surveys. Some prior depression detection tasks focus on end-of-term depression prediction.
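As a minimal sketch of how a label file might be read, the snippet below parses a small in-memory stand-in for dep_weekly.csv using only the standard library. The sample rows, the pid format, and the assumption that the dep column is stored as True/False text are all illustrative and not taken from the real files.

```python
import csv
import io

# Hypothetical stand-in for SurveyData/dep_weekly.csv; the rows are made up.
SAMPLE_DEP_WEEKLY = """pid,date,dep
INS-W_001,2019-04-03,False
INS-W_001,2019-04-10,True
INS-W_002,2019-04-03,False
"""

def load_labels(handle):
    """Return {(pid, date): dep} from a label file indexed by pid and date."""
    labels = {}
    for row in csv.DictReader(handle):
        # Assumption: dep is serialized as the text "True" or "False".
        labels[(row["pid"], row["date"])] = row["dep"].strip().lower() == "true"
    return labels

labels = load_labels(io.StringIO(SAMPLE_DEP_WEEKLY))
# Share of labelled weeks flagged as depressed in this toy sample.
dep_rate = sum(labels.values()) / len(labels)
```

For a real dataset folder, the same function could be given an open file handle instead of the in-memory sample.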
Pre/Post Survey
| Name | Full Name | Available Datasets | Description |
|---|---|---|---|
| BFI10 | The Big-Five Inventory-10 | INS-1, INS-2, INS-3, INS-4 | A 10-item scale measuring the Big Five personality traits: Extroversion, Agreeableness, Conscientiousness, Emotional Stability, and Openness. The higher the score, the stronger the corresponding personality trait. |
| CHIPS | Cohen-Hoberman Inventory of Physical Symptoms | INS-1, INS-2, INS-3, INS-4 | A 33-item scale measuring the perceived burden from physical symptoms and the resulting psychological effect during the past 2 weeks. Higher values indicate more perceived burden from physical symptoms. |
| UCLA | Short-form UCLA Loneliness Scale | INS-1, INS-2, INS-3, INS-4 | A 10-item scale measuring one's subjective feelings of loneliness as well as social isolation. Items 2, 6, 10, 11, 13, 14, 16, 18, 19, and 20 of the original scale are included in the short form. Higher values indicate more subjective loneliness. |
| SocialFit | Sense of Social and Academic Fit Scale | INS-1, INS-2, INS-3, INS-4 | A 17-item scale measuring the sense of social and academic fit of students at the institution where this study was conducted. Higher values indicate a stronger sense of belonging. |
| 2-Way SSS | 2-Way Social Support Scale | INS-1, INS-2, INS-3, INS-4 | A 21-item scale measuring social support from four aspects: (a) giving emotional support, (b) giving instrumental support, (c) receiving emotional support, and (d) receiving instrumental support. Higher values indicate more social support. |
| PSS | Perceived Stress Scale | INS-1, INS-2, INS-3, INS-4 | A 14-item scale used to assess stress levels during the last month. Note that Year 1 used the 10-item version. Higher values indicate more perceived stress. |
| ERQ | Emotion Regulation Questionnaire | INS-1, INS-2, INS-3, INS-4 | A 10-item scale assessing individual differences in the habitual use of two emotion regulation strategies: (a) cognitive reappraisal and (b) expressive suppression. Higher scores indicate more habitual use of reappraisal/suppression. |
| BRS | Brief Resilience Scale | INS-1, INS-2, INS-3, INS-4 | A 6-item scale assessing the ability to bounce back or recover from stress. Higher scores indicate greater resilience to stress. |
| STAI | State-Trait Anxiety Inventory for Adults | INS-1, INS-2, INS-3, INS-4 | A 20-item scale measuring State-Trait anxiety. Year 1 used the State version, while other years used the Trait version. Higher values indicate higher anxiety. |
| CES-D | Center for Epidemiologic Studies Depression Scale Cole version | INS-1, INS-2, INS-3, INS-4 | A 10-item scale measuring current level of depressive symptomatology, with emphasis on the affective component, depressed mood. Year 2 used the 9-item version. Higher scores indicate more depressive symptoms. |
| BDI2 | Beck Depression Inventory-II | INS-1, INS-2, INS-3, INS-4 | A 21-item scale detecting depressive symptoms. Higher values indicate more depressive symptoms. 0-13: minimal to none, 14-19: mild, 20-28: moderate, and 29-63: severe. |
| MAAS | Mindful Attention Awareness Scale | INS-1, INS-2, INS-3, INS-4 | A 15-item scale assessing a core characteristic of mindfulness. Year 1 used a 7-item version, while other years used the full version. Higher values indicate higher mindfulness. |
| Brief-COPE | Brief Coping Orientation to Problems Experienced | INS-2, INS-3, INS-4 | A 28-item scale measuring (a) adaptive and (b) maladaptive ways to cope with a stressful life event. Higher values indicate more effective/ineffective ways to cope with a stressful life event. |
| GQ | Gratitude Questionnaire | INS-2, INS-3, INS-4 | A 6-item scale assessing individual differences in the proneness to experience gratitude in daily life. Higher scores indicate a greater tendency to experience gratitude. |
| FSPWB | Flourishing Scale & Psychological Well-Being Scale | INS-2, INS-3, INS-4 | An 8-item scale measuring psychological well-being. Higher scores indicate a person with "more psychological resources and mental strengths". |
| EDS | Everyday Discrimination Scale | INS-2, INS-3, INS-4 | A 9-item scale assessing everyday discrimination. Higher values indicate more frequent experience of discrimination. |
| CEDH | Chronic Work Discrimination and Harassment | INS-2, INS-3, INS-4 | A 12-item scale assessing experiences of discrimination in educational settings. Higher values indicate more frequent experience of discrimination in the work environment. |
| B-YAACQ | The Brief Young Adult Alcohol Consequences Questionnaire (optional) | INS-2, INS-3, INS-4 | A 24-item scale measuring the alcohol problem severity continuum in college students. Higher values indicate more severe alcohol problems. |
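The BDI2 severity bands listed in the table above can be expressed as a small lookup. This is only a sketch: the function name is hypothetical, and it assumes the severe band starts at 29 so that the bands do not overlap.

```python
def bdi2_severity(score):
    """Map a BDI-II total score (21 items, each scored 0-3) to a severity band."""
    if not 0 <= score <= 63:
        raise ValueError("BDI-II total scores range from 0 to 63")
    if score <= 13:
        return "minimal"
    if score <= 19:
        return "mild"
    if score <= 28:
        return "moderate"
    # Assumption: severe covers 29-63, directly above the moderate band.
    return "severe"
```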
Info
Note: Because the study design was iterated over the years, some questionnaires are not available in all studies. Moreover, some questionnaires have different versions across years. We clarify them using column names. For example, INS-W_2 only has CESD_9items_POST, while others have CESD_10items_POST. CESD_9items_POST is also calculated in other datasets to make the modeling target comparable across datasets.
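When working across years, one way to handle the differing column names is to prefer the version that exists everywhere and fall back otherwise. The helper below is a hypothetical sketch built only on the two column names mentioned in the note above.

```python
def pick_cesd_post_column(columns):
    """Choose a CES-D post-survey column, preferring the 9-item score.

    The 9-item score is preferred because the note states it is computed
    for every dataset, which keeps the target comparable across years.
    """
    for name in ("CESD_9items_POST", "CESD_10items_POST"):
        if name in columns:
            return name
    raise KeyError("no CES-D post-survey column found")
```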
EMA Surveys
Weekly Ecological Momentary Assessment (EMA) surveys were delivered during the study to collect in-the-moment self-report data. They mainly focus on capturing participants' recent sense of their mental health.
| Name | Full Name | Available Datasets | Description |
|---|---|---|---|
| PHQ-4 | Patient Health Questionnaire 4 | INS-2, INS-3, INS-4 | A 4-item scale assessing (a) mental health, (b) anxiety, and (c) depression. Higher values indicate a higher risk of mental health problems, anxiety, and depression. |
| PSS-4 | Perceived Stress Scale 4 | INS-2, INS-3, INS-4 | A 4-item scale assessing stress levels during the last month. Higher values indicate more perceived stress. |
| PANAS | Positive and Negative Affect Schedule | INS-2, INS-3, INS-4 | A 10-item scale measuring the level of (a) positive and (b) negative affect. Higher values indicate a greater extent of the corresponding affect. |
Feature Data
File Instructions
- rapids.csv: The complete feature file that contains all features.
- location.csv: The feature file that contains all Location features.
- screen.csv: The feature file that contains all PhoneUsage features.
- call.csv: The feature file that contains all Call features.
- bluetooth.csv: The feature file that contains all Bluetooth features.
- steps.csv: The feature file that contains all PhysicalActivity features.
- sleep.csv: The feature file that contains all Sleep features.
- wifi.csv: The feature file that contains all WiFi features. Note that this feature type is not used by any existing algorithms and often has a high data missing rate.
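To illustrate how the per-type files relate to rapids.csv, the sketch below groups column names by their feature-type prefix. The prefix-to-file mapping follows the naming format documented on this page; the example column names are illustrative only, and the WiFi prefix is left out because it is not documented here.

```python
# Prefixes as listed under "Name format"; the WiFi prefix is not documented.
PREFIX_TO_FILE = {
    "f_loc": "location.csv",
    "f_screen": "screen.csv",
    "f_call": "call.csv",
    "f_blue": "bluetooth.csv",
    "f_steps": "steps.csv",
    "f_slp": "sleep.csv",
}

def group_columns_by_file(columns):
    """Group feature columns by the per-type file they would belong to."""
    groups = {}
    for col in columns:
        prefix = col.split(":", 1)[0]
        file_name = PREFIX_TO_FILE.get(prefix)
        if file_name is not None:  # index columns and unknown prefixes are skipped
            groups.setdefault(file_name, []).append(col)
    return groups

# Illustrative header, not copied from a real file.
example_columns = [
    "pid",
    "date",
    "f_loc:phone_locations_doryab_timeathome:allday",
    "f_slp:fitbit_sleep_summary_rapids_sumdurationasleepmain_norm:14dhist",
    "f_loc:phone_locations_barnett_hometime:morning",
]
groups = group_columns_by_file(example_columns)
```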
Processing
- Time segment
| variable | desc |
|---|---|
| morning | 6 am - 12 pm, calculated daily |
| afternoon | 12 pm - 6 pm, calculated daily |
| evening | 6 pm - 12 am, calculated daily |
| night | 12 am - 6 am, calculated daily |
| allday | 24 hrs from 12 am to 11:59 pm, calculated daily |
| 7-day history | calculated daily |
| 14-day history | calculated daily |
| weekdays | calculated once per week on Friday |
| weekend | calculated once per week on Sunday |
- Numeric value
| variable | desc |
|---|---|
| normalized | subtracted by each participant's median and divided by the 5-95 quantile range |
| discretized | low/medium/high split by 33/66 quantile of each participant's feature value |
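The two numeric versions described above can be sketched for a single participant's values of one feature. This is an illustration rather than the pipeline's actual code: the linear-interpolation quantile helper, the `<=` boundary handling, the zero-spread fallback, and all function names are assumptions.

```python
import statistics

def quantile(sorted_values, q):
    """Linearly interpolated quantile of a non-empty sorted list (0 <= q <= 1)."""
    pos = q * (len(sorted_values) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    return sorted_values[lo] + (sorted_values[hi] - sorted_values[lo]) * (pos - lo)

def normalize(values):
    """Subtract the participant's median, then divide by the 5-95 quantile range."""
    ordered = sorted(values)
    spread = quantile(ordered, 0.95) - quantile(ordered, 0.05)
    if spread == 0:
        # Assumption: a constant feature is mapped to zeros.
        return [0.0 for _ in values]
    center = statistics.median(ordered)
    return [(v - center) / spread for v in values]

def discretize(values):
    """Label each value low/medium/high using the participant's 33/66 quantiles."""
    ordered = sorted(values)
    low_cut = quantile(ordered, 0.33)
    high_cut = quantile(ordered, 0.66)
    return [
        "low" if v <= low_cut else "medium" if v <= high_cut else "high"
        for v in values
    ]

# Toy participant whose feature takes the values 0, 1, ..., 100.
toy_values = list(range(101))
toy_norm = normalize(toy_values)
toy_levels = discretize(toy_values)
```

Both functions work per participant, so values from different participants should be processed separately.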
Name format
All features follow a consistent naming format: [feature_type]:[feature_name][version]:[time_segment]
- feature_type: It corresponds to the six data types: location - f_loc, screen - f_screen, call - f_call, bluetooth - f_blue, steps - f_steps, sleep - f_slp.
- feature_name: The name of the feature provided by RAPIDS, i.e., the second column of the following figure, plus some additional information. A typical format is [SensorType]_[CodeProvider]_[featurename]. Please refer to RAPIDS's naming format for more details.
- version: It has three versions:
  - nothing, just empty "";
  - normalized, _norm;
  - discretized, _dis.
- time_segment: It corresponds to the specific time segment: morning - morning, afternoon - afternoon, evening - evening, night - night, allday - allday, 7-day history - 7dhist, 14-day history - 14dhist, weekday - weekday, weekend - weekend.
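For the four within-day segments, the hour boundaries given under Processing can be turned into a simple lookup. The sketch below assumes half-open intervals (for example, 12 pm belongs to afternoon rather than morning); the source does not state how boundaries are handled, and the function name is hypothetical.

```python
def daily_segment(hour):
    """Map an hour of day (0-23) to its within-day time_segment suffix."""
    if not 0 <= hour <= 23:
        raise ValueError("hour must be in 0..23")
    if hour < 6:
        return "night"      # 12 am - 6 am
    if hour < 12:
        return "morning"    # 6 am - 12 pm
    if hour < 18:
        return "afternoon"  # 12 pm - 6 pm
    return "evening"        # 6 pm - 12 am
```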
Example
A participant's normalized sumdurationunlock feature in mornings is f_screen:phone_screen_rapids_sumdurationunlock_norm:morning.
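The naming format can be parsed and rebuilt mechanically. The following is a minimal sketch with hypothetical helper names; it assumes that a trailing _norm or _dis always marks the version, so a raw feature whose own name happened to end in one of those suffixes would be misread. The column names used with it are illustrative.

```python
from typing import NamedTuple

class FeatureColumn(NamedTuple):
    feature_type: str
    feature_name: str
    version: str       # "", "_norm", or "_dis"
    time_segment: str

def parse_feature_column(column):
    """Split [feature_type]:[feature_name][version]:[time_segment] into parts."""
    parts = column.split(":")
    if len(parts) != 3:
        raise ValueError(f"expected 3 colon-separated parts, got {column!r}")
    feature_type, name_and_version, time_segment = parts
    for suffix in ("_norm", "_dis"):
        if name_and_version.endswith(suffix):
            name = name_and_version[: -len(suffix)]
            return FeatureColumn(feature_type, name, suffix, time_segment)
    return FeatureColumn(feature_type, name_and_version, "", time_segment)

def build_feature_column(feature_type, feature_name, version, time_segment):
    """Inverse of parse_feature_column."""
    return f"{feature_type}:{feature_name}{version}:{time_segment}"

parsed = parse_feature_column("f_blue:phone_bluetooth_rapids_countscans_norm:7dhist")
```

Parsing column names this way makes it straightforward to select, for instance, only the normalized 7-day-history columns of one feature type.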
Features
Location
| Feature Name | Unit | Description |
|---|---|---|
| hometime | minutes | Time spent at home in minutes. Home is the most visited significant location between 8 pm and 8 am, including any pauses within a 200-meter radius. |
| disttravelled | meters | Total distance traveled over a day (flights). |
| rog | meters | The Radius of Gyration (rog) is a measure in meters of the area covered by a person over a day. A centroid is calculated for all the places (pauses) visited during a day, and a weighted distance between all the places and that centroid is computed. The weights are proportional to the time spent in each place. |
| maxdiam | meters | The maximum diameter is the largest distance between any two pauses. |
| maxhomedist | meters | The maximum distance from home in meters. |
| siglocsvisited | locations | The number of significant locations visited during the day. Significant locations are computed using k-means clustering over pauses found in the whole monitoring period. The number of clusters is found by iterating k from 1 to 200, stopping when the centroids of two significant locations are within 400 meters of one another. |
| avgflightlen | meters | Mean length of all flights. |
| stdflightlen | meters | Standard deviation of the length of all flights. |
| avgflightdur | seconds | Mean duration of all flights. |
| stdflightdur | seconds | The standard deviation of the duration of all flights. |
| probpause | - | The fraction of a day spent in a pause (as opposed to a flight). |
| siglocentropy | nats | Shannon’s entropy measurement is based on the proportion of time spent at each significant location visited during a day. |
| circdnrtn | - | A continuous metric quantifying a person’s circadian routine that can take any value between 0 and 1, where 0 represents a daily routine completely different from any other sensed days and 1 a routine the same as every other sensed day. |
| wkenddayrtn | - | Same as circdnrtn but computed separately for weekends and weekdays. |
| locationvariance | meters² | The sum of the variances of the latitude and longitude columns. |
| loglocationvariance | - | Log of the sum of the variances of the latitude and longitude columns. |
| totaldistance | meters | Total distance traveled in a time segment using the haversine formula. |
| avgspeed | km/hr | Average speed in a time segment considering only the instances labeled as Moving. This feature is 0 when the participant is stationary during a time segment. |
| varspeed | km/hr | Speed variance in a time segment considering only the instances labeled as Moving. This feature is 0 when the participant is stationary during a time segment. |
| numberofsignificantplaces | places | Number of significant locations visited. It is calculated using the DBSCAN/OPTICS clustering algorithm which takes in EPS and MIN_SAMPLES as parameters to identify clusters. Each cluster is a significant place. |
| numberlocationtransitions | transitions | Number of movements between any two clusters in a time segment. |
| radiusgyration | meters | Quantifies the area covered by a participant. |
| timeattop1location | minutes | Time spent at the most significant location. |
| timeattop2location | minutes | Time spent at the 2nd most significant location. |
| timeattop3location | minutes | Time spent at the 3rd most significant location. |
| movingtostaticratio | - | Ratio between stationary time and total location sensed time. A lat/long coordinate pair is labeled as stationary if its speed (distance/time) to the next coordinate pair is less than 1km/hr. A higher value represents a more stationary routine. |
| outlierstimepercent | - | Ratio between the time spent in non-significant clusters and the time spent in all clusters (i.e., stationary time, since only stationary samples are clustered). A higher value represents more time spent in non-significant clusters. |
| maxlengthstayatclusters | minutes | Maximum time spent in a cluster (significant location). |
| minlengthstayatclusters | minutes | Minimum time spent in a cluster (significant location). |
| avglengthstayatclusters | minutes | Average time spent in a cluster (significant location). |
| stdlengthstayatclusters | minutes | Standard deviation of time spent in a cluster (significant location). |
| locationentropy | nats | Shannon Entropy computed over the row count of each cluster (significant location), it is higher the more rows belong to a cluster (i.e., the more time a participant spent at a significant location). |
| normalizedlocationentropy | nats | Shannon Entropy computed over the row count of each cluster (significant location) divided by the number of clusters; it is higher the more rows belong to a cluster (i.e., the more time a participant spent at a significant location). |
| timeathome | minutes | Time spent at home. |
| timeat[PLACE] | minutes | Time spent at [PLACE], which can be living, exercise, study, greens. |
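Two of the location features above rest on standard formulas: Shannon entropy in nats over the share of time spent at each significant location, and the haversine great-circle distance. The sketch below shows both under stated assumptions: the function names are hypothetical, the Earth radius of 6,371,000 m is a common mean value rather than one taken from the source, and how the features handle missing or empty data is not specified there.

```python
import math

def location_entropy(minutes_per_location):
    """Shannon entropy (nats) of the share of time spent at each location."""
    total = sum(minutes_per_location)
    if total <= 0:
        return 0.0  # assumption: no recorded time yields zero entropy
    shares = [m / total for m in minutes_per_location if m > 0]
    return -sum(p * math.log(p) for p in shares)

def haversine_m(lat1, lon1, lat2, lon2, radius_m=6371000.0):
    """Great-circle distance in meters between two lat/long points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * radius_m * math.asin(math.sqrt(a))

def total_distance_m(points):
    """Sum of haversine distances between consecutive (lat, lon) points."""
    return sum(haversine_m(*a, *b) for a, b in zip(points, points[1:]))
```

Time split evenly between two locations gives an entropy of ln 2 (about 0.693 nats), while spending all of the time at one location gives 0.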
PhoneUsage
| Feature Name | Unit | Description |
|---|---|---|
| sumduration | minutes | Total duration of all unlock episodes. |
| maxduration | minutes | Longest duration of any unlock episode. |
| minduration | minutes | Shortest duration of any unlock episode. |
| avgduration | minutes | Average duration of all unlock episodes. |
| stdduration | minutes | Standard deviation of the duration of all unlock episodes. |
| countepisode | episodes | Number of all unlock episodes. |
| firstuseafter | minutes | Minutes until the first unlock episode. |
| sumduration[PLACE] | minutes | Total duration of all unlock episodes. [PLACE] can be living, exercise, study, greens. Same below. |
| maxduration[PLACE] | minutes | Longest duration of any unlock episode. |
| minduration[PLACE] | minutes | Shortest duration of any unlock episode. |
| avgduration[PLACE] | minutes | Average duration of all unlock episodes. |
| stdduration[PLACE] | minutes | Standard deviation of the duration of all unlock episodes. |
| countepisode[PLACE] | episodes | Number of all unlock episodes. |
| firstuseafter[PLACE] | minutes | Minutes until the first unlock episode. |
Call
| Feature Name | Unit | Description |
|---|---|---|
| count | calls | Number of calls of a particular call_type (incoming/outgoing) that occurred during a particular time_segment. |
| distinctcontacts | contacts | Number of distinct contacts that are associated with a particular call_type for a particular time_segment. |
| meanduration | seconds | The mean duration of all calls of a particular call_type during a particular time_segment. |
| sumduration | seconds | The sum of the duration of all calls of a particular call_type during a particular time_segment. |
| minduration | seconds | The duration of the shortest call of a particular call_type during a particular time_segment. |
| maxduration | seconds | The duration of the longest call of a particular call_type during a particular time_segment. |
| stdduration | seconds | The standard deviation of the duration of all the calls of a particular call_type during a particular time_segment. |
| modeduration | seconds | The mode of the duration of all the calls of a particular call_type during a particular time_segment. |
| entropyduration | nats | The estimate of the Shannon entropy for the duration of all the calls of a particular call_type during a particular time_segment. |
| timefirstcall | minutes | The time in minutes between 12:00am (midnight) and the first call of call_type. |
| timelastcall | minutes | The time in minutes between 12:00am (midnight) and the last call of call_type. |
| countmostfrequentcontact | calls | The number of calls of a particular call_type during a particular time_segment of the most frequent contact throughout the monitored period. |
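As a rough illustration of how several of the call features above could be derived for one time segment, the sketch below aggregates a toy call log. The tuple layout, the function name, and the use of None for segments without calls are assumptions; the source does not specify how missing values are encoded.

```python
from statistics import mean

def call_features(calls, call_type):
    """Aggregate calls of one call_type.

    calls: iterable of (call_type, contact_id, duration_seconds) tuples
    that all fall inside the same time segment.
    """
    selected = [c for c in calls if c[0] == call_type]
    durations = [duration for _, _, duration in selected]
    return {
        "count": len(selected),
        "distinctcontacts": len({contact for _, contact, _ in selected}),
        "sumduration": sum(durations),
        "meanduration": mean(durations) if durations else None,
        "minduration": min(durations) if durations else None,
        "maxduration": max(durations) if durations else None,
    }

# Made-up call log for a single time segment.
toy_calls = [
    ("incoming", "a", 60),
    ("incoming", "b", 120),
    ("outgoing", "a", 30),
    ("incoming", "a", 30),
]
incoming = call_features(toy_calls, "incoming")
```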
Bluetooth
| Feature Name | Unit | Description |
|---|---|---|
| countscans | scans | Number of scans (rows) from the devices sensed during a time segment instance. The more scans a Bluetooth device has, the longer it remained within range of the participant’s phone. |
| uniquedevices | devices | Number of unique bluetooth devices sensed during a time segment instance as identified by their hardware addresses. |
| meanscans | scans | Mean of the scans of every sensed device within each time segment instance. |
| stdscans | scans | Standard deviation of the scans of every sensed device within each time segment instance. |
| countscansmostfrequentdevicewithinsegments | scans | Number of scans of the most sensed device within each time segment instance. |
| countscansleastfrequentdevicewithinsegments | scans | Number of scans of the least sensed device within each time segment instance. |
| countscansmostfrequentdeviceacrosssegments | scans | Number of scans of the most sensed device across time segment instances of the same type. |
| countscansleastfrequentdeviceacrosssegments | scans | Number of scans of the least sensed device across time segment instances of the same type per device. |
| countscansmostfrequentdeviceacrossdataset | scans | Number of scans of the most sensed device across the entire dataset of every participant. |
| countscansleastfrequentdeviceacrossdataset | scans | Number of scans of the least sensed device across the entire dataset of every participant. |
WiFi
| Feature Name | Unit | Description |
|---|---|---|
| countscans | devices | Number of scanned WiFi access points connected during a time_segment; an access point can be detected multiple times over time, and these appearances are counted separately. |
| uniquedevices | devices | Number of unique access points during a time_segment as identified by their hardware address. |
| countscansmostuniquedevice | scans | Number of scans of the most scanned access point during a time_segment across the whole monitoring period. |
PhysicalActivity
| Feature Name | Unit | Description |
|---|---|---|
| maxsumsteps | steps | The maximum daily step count during a time segment. |
| minsumsteps | steps | The minimum daily step count during a time segment. |
| avgsumsteps | steps | The average daily step count during a time segment. |
| mediansumsteps | steps | The median of daily step count during a time segment. |
| stdsumsteps | steps | The standard deviation of daily step count during a time segment. |
| sumsteps | steps | The total step count during a time segment. |
| maxsteps | steps | The maximum step count during a time segment. |
| minsteps | steps | The minimum step count during a time segment. |
| avgsteps | steps | The average step count during a time segment. |
| stdsteps | steps | The standard deviation of step count during a time segment. |
| countepisodesedentarybout | bouts | Number of sedentary bouts during a time segment. |
| sumdurationsedentarybout | minutes | Total duration of all sedentary bouts during a time segment. |
| maxdurationsedentarybout | minutes | The maximum duration of any sedentary bout during a time segment. |
| mindurationsedentarybout | minutes | The minimum duration of any sedentary bout during a time segment. |
| avgdurationsedentarybout | minutes | The average duration of sedentary bouts during a time segment. |
| stddurationsedentarybout | minutes | The standard deviation of the duration of sedentary bouts during a time segment. |
| countepisodeactivebout | bouts | Number of active bouts during a time segment. |
| sumdurationactivebout | minutes | Total duration of all active bouts during a time segment. |
| maxdurationactivebout | minutes | The maximum duration of any active bout during a time segment. |
| mindurationactivebout | minutes | The minimum duration of any active bout during a time segment. |
| avgdurationactivebout | minutes | The average duration of active bouts during a time segment. |
| stddurationactivebout | minutes | The standard deviation of the duration of active bouts during a time segment. |
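The bout features above depend on splitting a step series into consecutive sedentary and active runs. A minimal sketch of that run-length idea follows; the per-minute input, the rule that a minute is active when its step count reaches the threshold, and the threshold value of 10 are assumptions for illustration, not values stated in this document.

```python
from itertools import groupby

def bout_durations(steps_per_minute, threshold=10):
    """Return (sedentary, active) lists of bout lengths in minutes.

    A bout is a maximal run of consecutive minutes on the same side of
    the threshold. The threshold of 10 steps is a hypothetical default.
    """
    sedentary, active = [], []
    for is_active, run in groupby(steps_per_minute, key=lambda s: s >= threshold):
        length = sum(1 for _ in run)
        (active if is_active else sedentary).append(length)
    return sedentary, active

# Toy per-minute step counts for a 7-minute window.
sedentary, active = bout_durations([0, 0, 5, 20, 30, 0, 12])
features = {
    "countepisodesedentarybout": len(sedentary),
    "sumdurationsedentarybout": sum(sedentary),
    "countepisodeactivebout": len(active),
    "maxdurationactivebout": max(active),
}
```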
Sleep
We leverage sleep-related features from RAPIDS-Fitbit-Sleep, including high-level summary features (total duration of being asleep or in bed), and low-level features about the statistics (count, mean, max, min) of episodes of being asleep, restless, and awake during the sleep.
| Feature Name | Unit | Description |
|---|---|---|
| countepisode[LEVEL][TYPE] | episodes | Number of [LEVEL][TYPE] sleep episodes. [LEVEL] is one of awake and asleep and [TYPE] is one of main, nap, and all. Same below. |
| sumduration[LEVEL][TYPE] | minutes | Total duration of all [LEVEL][TYPE] sleep episodes. |
| maxduration[LEVEL][TYPE] | minutes | Longest duration of any [LEVEL][TYPE] sleep episode. |
| minduration[LEVEL][TYPE] | minutes | Shortest duration of any [LEVEL][TYPE] sleep episode. |
| avgduration[LEVEL][TYPE] | minutes | Average duration of all [LEVEL][TYPE] sleep episodes. |
| medianduration[LEVEL][TYPE] | minutes | Median duration of all [LEVEL][TYPE] sleep episodes. |
| stdduration[LEVEL][TYPE] | minutes | Standard deviation of the duration of all [LEVEL][TYPE] sleep episodes. |
| firstwaketimeTYPE | minutes | First wake time for a certain sleep type during a time segment. Wake time is number of minutes after midnight of a sleep episode’s end time. |
| lastwaketimeTYPE | minutes | Last wake time for a certain sleep type during a time segment. Wake time is number of minutes after midnight of a sleep episode’s end time. |
| firstbedtimeTYPE | minutes | First bedtime for a certain sleep type during a time segment. Bedtime is number of minutes after midnight of a sleep episode’s start time. |
| lastbedtimeTYPE | minutes | Last bedtime for a certain sleep type during a time segment. Bedtime is number of minutes after midnight of a sleep episode’s start time. |
| countepisodeTYPE | episodes | Number of sleep episodes for a certain sleep type during a time segment. |
| avgefficiencyTYPE | scores | Average sleep efficiency for a certain sleep type during a time segment. |
| sumdurationafterwakeupTYPE | minutes | Total duration the user stayed in bed after waking up for a certain sleep type during a time segment. |
| sumdurationasleepTYPE | minutes | Total sleep duration for a certain sleep type during a time segment. |
| sumdurationawakeTYPE | minutes | Total duration the user stayed awake but still in bed for a certain sleep type during a time segment. |
| sumdurationtofallasleepTYPE | minutes | Total duration the user spent to fall asleep for a certain sleep type during a time segment. |
| sumdurationinbedTYPE | minutes | Total duration the user stayed in bed (sumdurationtofallasleep + sumdurationawake + sumdurationasleep + sumdurationafterwakeup) for a certain sleep type during a time segment. |
| avgdurationafterwakeupTYPE | minutes | Average duration the user stayed in bed after waking up for a certain sleep type during a time segment. |
| avgdurationasleepTYPE | minutes | Average sleep duration for a certain sleep type during a time segment. |
| avgdurationawakeTYPE | minutes | Average duration the user stayed awake but still in bed for a certain sleep type during a time segment. |
| avgdurationtofallasleepTYPE | minutes | Average duration the user spent to fall asleep for a certain sleep type during a time segment. |
| avgdurationinbedTYPE | minutes | Average duration the user stayed in bed (sumdurationtofallasleep + sumdurationawake + sumdurationasleep + sumdurationafterwakeup) for a certain sleep type during a time segment. |
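Two definitions in the table above are simple arithmetic: wake time and bedtime are expressed as minutes after midnight, and time in bed is the sum of four component durations. The helper names below are hypothetical, and the sketch ignores seconds, which the source does not discuss.

```python
from datetime import datetime

def minutes_after_midnight(timestamp):
    """Express a sleep episode's start or end time as minutes after midnight."""
    return timestamp.hour * 60 + timestamp.minute

def duration_in_bed(to_fall_asleep, awake, asleep, after_wakeup):
    """Time in bed as the sum of its four component durations, in minutes."""
    return to_fall_asleep + awake + asleep + after_wakeup

# An episode ending at 7:30 am has a wake time of 450 minutes.
wake_time = minutes_after_midnight(datetime(2019, 4, 3, 7, 30))
# 15 min to fall asleep + 20 min awake + 420 min asleep + 10 min after waking.
in_bed = duration_in_bed(15, 20, 420, 10)
```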
Participant Info Data
The ParticipantInfoData folder contains files with additional information about participants.
- platform.csv: The file contains each participant's major smartphone platform (iOS or Android), indexed by pid.
- demographics.csv: Due to privacy concerns, demographic data (age, gender, racial group, etc.) are only available upon special request. Please reach out to us (uw-exp@uw.edu) with a clear research plan involving demographic data.
