Google Gemini is only 6 months old, but it has already shown impressive capabilities in security, coding, debugging and other areas (of course, it has also shown serious limitations).
Now, the large language model (LLM) is outperforming humans when it comes to giving advice on sleep and fitness.
Researchers at Google have introduced the Personal Health Large Language Model (PH-LLM), a version of Gemini fine-tuned to understand and reason about time-series personal health data from wearable devices such as smartwatches and heart rate monitors. In their experiments, the model answered questions and made predictions noticeably better than experts with years of experience in health and fitness.
“Our work … leverages generative AI to increase the utility of the model from simply predicting health conditions to providing coherent, contextual, and potentially prescriptive outcomes that rely upon complex health behaviors,” the researchers write.
Gemini as a sleep and fitness expert
Wearable technology can help people monitor their health and, ideally, make meaningful changes. These devices provide a “rich and long-term data source” for personal health monitoring, obtained “passively and constantly” from inputs such as exercise and nutrition logs, mood diaries and sometimes even social media activity, the Google researchers point out.
However, the data they collect on sleep, physical activity, cardiometabolic health and stress is rarely used in clinical settings, which are “sporadic in nature.” The researchers suggest this is most likely because the data is collected without context and requires a lot of computation to store and analyze. It can also be difficult to interpret.
Although LLMs perform well at answering medical questions, analyzing electronic health records, and making diagnoses based on medical images and psychiatric evaluations, they often lack the ability to reason about and make recommendations based on data from wearable devices.
However, the Google researchers made a breakthrough when they trained PH-LLM to make recommendations, answer professional exam questions, and predict self-reported sleep disruption and sleep impairment outcomes. The model was given multiple-choice questions, and the researchers also used chain-of-thought (mimicking human reasoning) and zero-shot (handling tasks without having been shown examples of them first) methods.
Impressively, PH-LLM scored 79% on the sleep exam and 88% on the fitness exam, both exceeding the scores of a sample of human experts that included five professional athletic trainers (with a median of 13.8 years of experience) and five sleep medicine specialists (with a median of 25 years of experience). The humans scored a median of 71% on the fitness exam and 76% on the sleep exam.
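To put those scores side by side, the short Python sketch below computes the gap in percentage points between PH-LLM and the human experts, using only the figures reported above. The dictionary layout and the function name are illustrative choices, not anything taken from the researchers' work.

```python
# Exam scores in percent, as reported in the article.
SCORES = {
    "sleep": {"ph_llm": 79, "experts": 76},
    "fitness": {"ph_llm": 88, "experts": 71},
}


def score_margins(scores):
    """Return PH-LLM's lead over the human experts, in percentage points."""
    return {exam: s["ph_llm"] - s["experts"] for exam, s in scores.items()}


margins = score_margins(SCORES)
for exam, margin in margins.items():
    print(f"{exam}: PH-LLM leads by {margin} percentage points")
```

On these numbers the lead is much wider on the fitness exam than on the sleep exam.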
In one example of a coaching recommendation, the researchers prompted the model: “You are a sleep medicine expert. You are given the following sleep data. The user is male, 50 years old. List the most important findings.”
PH-LLM responded, “You have trouble falling asleep… adequate deep sleep is very important for physical recovery.” The model further advised, “Make sure your bedroom is cool and dark… avoid naps and maintain a consistent sleep schedule.”
When asked what type of muscle contraction occurs in the pectoralis major “during the slow, controlled downward phase of the bench press,” PH-LLM correctly answered “eccentric” out of four possible answers.
In an example of patient-reported outcomes, the researchers asked the model, “Based on this data collected by the wearable, would the user report having difficulty falling asleep?” The model responded, “This person is more likely to report having difficulty falling asleep several times in the past month.”
The researchers note: “Although further development and evaluation are needed within the safety-critical area of personal health, these results show both the broad knowledge base and the capabilities of the Gemini models.”
Gemini can provide personalized insights
To achieve these results, the researchers first created and curated three datasets that tested personalized insights and recommendations based on recorded physical activity, sleep patterns and physiological responses; expert domain knowledge; and predictions of self-reported sleep quality.
Working with subject matter experts, they created 857 case studies representing real-world scenarios around sleep and fitness: 507 for sleep and 350 for fitness. Sleep scenarios used individual metrics to identify potential causes and provide personalized recommendations to improve sleep quality. Fitness tasks used information from training, sleep, health metrics, and user feedback to make recommendations about the intensity of physical activity on a given day.
Both categories of case studies included data from wearable sensors (up to 29 days for sleep and more than 30 days for fitness) as well as demographic information (age and gender) and expert analysis.
Sensor data included total sleep scores, resting heart rate and changes in heart rate variability, sleep duration (start and end times), minutes awake, restlessness, percentage of REM sleep time, respiratory rate, step count, and fat burning minutes.
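As a rough illustration of how sensor data like this might be packaged into a prompt resembling the sleep example quoted earlier, here is a minimal Python sketch. The `DailyReading` class, its field names and units, the `build_sleep_prompt` function and the sample values are all hypothetical; the article does not describe the researchers' actual data format or prompting pipeline.

```python
from dataclasses import dataclass


@dataclass
class DailyReading:
    """One day of wearable data (hypothetical fields drawn from the metrics listed above)."""
    sleep_score: int
    resting_heart_rate: int   # beats per minute (assumed unit)
    sleep_minutes: int
    minutes_awake: int
    rem_sleep_percent: float
    step_count: int


def build_sleep_prompt(age, sex, readings):
    """Assemble a plain-text prompt in the style of the article's sleep example."""
    lines = [
        "You are a sleep medicine expert. You are given the following sleep data.",
        f"The user is {sex}, {age} years old.",
    ]
    for day, r in enumerate(readings, start=1):
        lines.append(
            f"Day {day}: sleep score {r.sleep_score}, "
            f"resting heart rate {r.resting_heart_rate} bpm, "
            f"asleep {r.sleep_minutes} min, awake {r.minutes_awake} min, "
            f"REM {r.rem_sleep_percent}%, steps {r.step_count}"
        )
    lines.append("List the most important findings.")
    return "\n".join(lines)


# Made-up sample values, for illustration only.
sample = [
    DailyReading(72, 61, 395, 48, 18.5, 8200),
    DailyReading(68, 63, 370, 55, 16.0, 6400),
]
prompt = build_sleep_prompt(50, "male", sample)
print(prompt)
```

A real case study would span weeks of readings rather than two days, so a production prompt would probably summarize or aggregate the data instead of listing every day in full.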
“Our study demonstrates that PH-LLM is capable of integrating passively collected objective data from wearable devices into personalized insights, possible causes of observed behavior, and suggestions to enhance sleep hygiene and fitness outcomes,” the researchers write.
Still much work to be done in personal health apps
Still, the researchers admit, PH-LLM is just the start and, like any emerging technology, it has some bugs that need to be worked out. For example, the responses generated by the model weren’t always consistent, there were “striking differences” in confabulations across case studies, and the LLM was sometimes conservative or cautious in its responses.
In fitness case studies, the model was sensitive to overtraining, and in one case, human experts noted that it failed to identify lack of sleep as a possible cause of harm. Additionally, the case studies were sampled broadly across demographics and from relatively active individuals, so they were likely not fully representative of the population and could not address broader issues related to sleep and fitness.
“We note that much work remains to be done to ensure that LLMs are reliable, safe, and equitable in personal health applications,” the researchers write. This includes further reducing confabulation, accounting for unique health circumstances not captured by sensor information, and ensuring that training data reflects the diverse population.
Overall, however, the researchers note: “The results of this study represent a vital step toward LLMs that provide personalized information and suggestions to assist individuals achieve their health goals.”