Comparative analysis of ChatGPT-4o mini, ChatGPT-4o and Gemini Advanced in the treatment of postmenopausal osteoporosis
BMC Musculoskeletal Disorders volume 26, Article number: 369 (2025)
Abstract
Background
Osteoporosis is a disease whose burden differs markedly by sex, and postmenopausal osteoporosis (PMOP) has become a focus of public health research worldwide. The purpose of this study was to evaluate the quality and readability of the responses generated by three artificial intelligence large-scale language models (AI-LLMs), ChatGPT-4o mini, ChatGPT-4o and Gemini Advanced, to questions related to PMOP.
Methods
We collected 48 PMOP frequently asked questions (FAQs) through offline counseling and online medical community forums. We also prepared 24 specific questions about PMOP based on the Management of Postmenopausal Osteoporosis: 2022 ACOG Clinical Practice Guideline No. 2 (2022 ACOG-PMOP Guideline). The FAQs were submitted to the AI-LLMs (ChatGPT-4o mini, ChatGPT-4o, Gemini Advanced), and the responses were randomly assigned to four professional orthopedic surgeons, who independently rated their satisfaction with each response on a 5-point Likert scale. In addition, a Flesch Reading Ease (FRE) score was calculated for each response to assess the readability of the text generated by each model.
Results
For questions related to PMOP and the 2022 ACOG-PMOP Guideline, ChatGPT-4o and Gemini Advanced provided more concise answers than ChatGPT-4o mini. For the overall PMOP FAQs, ChatGPT-4o had a significantly higher accuracy rate than ChatGPT-4o mini and Gemini Advanced. For questions related to the 2022 ACOG-PMOP Guideline, both ChatGPT-4o mini and ChatGPT-4o answered with significantly higher accuracy than Gemini Advanced. ChatGPT-4o mini, ChatGPT-4o, and Gemini Advanced all showed good levels of self-correction.
Conclusions
Our research shows that Gemini Advanced and ChatGPT-4o provide more concise and intuitive answers. ChatGPT-4o performed best in answering the frequently asked questions related to PMOP. On questions related to the 2022 ACOG-PMOP Guideline, ChatGPT-4o mini and ChatGPT-4o responded significantly better than Gemini Advanced. All three models demonstrated a strong ability to self-correct.
Clinical trial number
Not applicable.
Background
PMOP is a systemic bone disease in which women experience increased bone resorption and reduced bone formation [1]. It is a consequence of the decline in estrogen levels after menopause, resulting in reduced bone mass, destruction of the bone microstructure, increased bone fragility, and susceptibility to fracture [2, 3]. Approximately one in two women older than 50 years will experience an osteoporotic fracture [4]. One study reported that the prevalence of PMOP was 11.4% for women aged < 65 years and 26.4% for those aged 65–69 years [5]. In recent years, the medical community has paid markedly more attention to the treatment and prevention of PMOP, and system-level and national health care programs have been implemented worldwide. However, studies have demonstrated that 80–90% of adults do not receive appropriate osteoporosis management, even in secondary prevention [6, 7]. Additionally, many postmenopausal women do not take timely preventive measures, such as calcium and vitamin D supplementation or regular bone density testing [8]. This suggests a lack of awareness of the severity of the disease, its prevention, and the associated health risks. It can reasonably be deduced that increased public awareness of the disease would support early prevention and intervention strategies, which would in turn reduce the associated health risks.
AI-LLMs are sophisticated neural network programs based on deep learning techniques that are capable of reading and comprehending text, as well as learning from vast quantities of textual data, thereby continuously increasing their capacity to understand and generate language [9,10,11]. AI-LLMs have the potential to suggest personalized treatment plans and assist physicians in making more appropriate treatment decisions [12]. In a study conducted by Delsoz et al., AI-LLMs demonstrated the potential to assist physicians in the primary triage of glaucoma patients and in the clinical practice of eye care by analyzing patients' histories and symptoms [13]. In a study conducted by Potapenko et al., AI-LLMs were used to respond to queries about prevalent retinal conditions, and the findings indicated that ChatGPT furnished precise responses [14]. In a study conducted by Grünebaum et al., AI-LLMs were used to respond to queries in obstetrics and gynecology [15]. Such applications markedly increase the efficacy and caliber of health care services, offering robust technical support for the advancement of personalized and precision medicine [16]. Nevertheless, it remains uncertain whether AI-LLMs can provide up-to-date information and support clinical decisions in the context of postmenopausal osteoporosis. The objective of this study was to evaluate the performance of the latest AI-LLMs (ChatGPT-4o mini, ChatGPT-4o, Gemini Advanced) in generating professional and clinically accurate responses to common clinical questions about postmenopausal osteoporosis and to questions based on the 2022 ACOG-PMOP Guideline. This study also sought to ascertain whether AI-LLMs can improve postmenopausal osteoporosis patients' comprehension of the disease and self-management abilities, advance personalized medicine, and alleviate the scarcity of health care resources.
Methods
The FAQs were selected to investigate the applicability of different AI-LLMs to common clinical settings. The 48 FAQs (Supplementary Table 1a) related to PMOP were divided into six domains: clinical manifestation, diagnosis, pathogenesis, treatment, prevention, and risk factors. This division was performed to explore the ability of different AI-LLMs to address different aspects of PMOP. The sources of these questions include MedlinePlus, the Cochrane Library, the National Osteoporosis Foundation, the Mayo Clinic, UpToDate, WebMD, and offline patient counseling. To further examine the models' comprehension of the specialized PMOP guideline, 24 additional questions were formulated on the basis of the 2022 ACOG-PMOP Guideline (Supplementary Table 1b). The most current versions of the AI-LLMs were used: ChatGPT-4o mini (July 18, 2024), ChatGPT-4o (May 13, 2024), and Gemini Advanced (Gemini 1.5 Pro, June 24, 2024). A new dialog was opened for each question and reset after each query, and the responses were collated at the conclusion of the interaction. The replies were converted to plain text, any information identifying the AI-LLMs was removed, and the anonymized responses were randomly assigned to four orthopedic specialists with expertise in the treatment of osteoporosis for Likert scale scoring. In addition, we counted the total characters, total words, total syllables, and total sentences of each response and calculated its FRE score as FRE = 206.835 − 1.015 × (total words / total sentences) − 84.6 × (total syllables / total words). FRE scores are interpreted as follows: 90–100, very easy to read; 80–89, easy to read; 70–79, fairly easy to read; 60–69, standard reading difficulty; 50–59, difficult, suitable for college students or professionals; 30–49, difficult, suitable for experts or readers in a specific field; 0–29, very difficult, typical of academic papers or legal documents [17]. Figure 1 illustrates the design flow of the present study.
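For illustration, the FRE calculation is easy to script. The following is a minimal sketch in Python, not the authors' tooling; the vowel-group syllable counter is a simplifying assumption (dedicated readability libraries use more refined syllable rules), so scores may differ slightly from published calculators.

```python
import re

def count_syllables(word: str) -> int:
    """Approximate syllables as runs of consecutive vowels (assumption;
    dictionary-based counters are more accurate)."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_reading_ease(text: str) -> float:
    """FRE = 206.835 - 1.015*(words/sentences) - 84.6*(syllables/words)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (206.835
            - 1.015 * (len(words) / max(1, len(sentences)))
            - 84.6 * (syllables / max(1, len(words))))

# Short, plain sentences score high (easy); dense clinical prose scores low.
print(round(flesch_reading_ease("Calcium helps bones. Take it daily."), 1))
```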
Four orthopedic surgeons with expertise in specialized fields evaluated the AI-LLM responses via a 5-point Likert scale [18, 19] (1, fully disagree; 2, partially disagree; 3, neither agree nor disagree; 4, partially agree; 5, fully agree). An average score (AS) ≤ 2 indicates poor performance; 2 < AS ≤ 3, fair performance; 3 < AS ≤ 4, good performance; and AS > 4, excellent performance. The consistency of the four surgeons' ratings of ChatGPT-4o mini, ChatGPT-4o, and Gemini Advanced was evaluated via Fleiss's kappa coefficient.
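As a minimal sketch, this AS banding amounts to a simple threshold rule (our illustration; the boundary handling follows the cut-offs stated above):

```python
def grade(avg_score: float) -> str:
    """Map a mean Likert score (AS) to the study's performance band."""
    if avg_score <= 2:
        return "poor"
    if avg_score <= 3:
        return "fair"
    if avg_score <= 4:
        return "good"
    return "excellent"

assert grade(1.78) == "poor" and grade(4.22) == "excellent"
```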
We further explored the ability of the AI-LLMs to self-correct. For questions rated 'poor' (AS ≤ 2), an orthopedic specialist pointed out the incorrect part of the answer and posed a follow-up prompt along the lines of 'You do not seem to have answered that correctly, can you answer it again?' The replies were collected, converted to plain text, and stripped of any information identifying the model, and were then randomly assigned to the four raters to re-evaluate the corrected content. This round of evaluation took place two weeks after the final round of initial scoring, and the raters were not informed that the responses were self-corrected versions.
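The re-query protocol can be thought of as a short conversational loop. The sketch below is a hypothetical reconstruction using the OpenAI Python client rather than the chat interfaces the authors actually used; the model name, the error_note parameter carrying the specialist's flagged error, and the exact prompt wiring are our assumptions.

```python
from openai import OpenAI  # pip install openai; assumes OPENAI_API_KEY is set

client = OpenAI()
RETRY_PROMPT = ("You do not seem to have answered that correctly, "
                "can you answer it again?")

def ask(history: list[dict]) -> str:
    """Send the running conversation and return the model's reply."""
    resp = client.chat.completions.create(model="gpt-4o-mini", messages=history)
    return resp.choices[0].message.content

def self_correct(question: str, error_note: str) -> str:
    """Initial answer, then one corrective re-query citing the flagged error."""
    history = [{"role": "user", "content": question}]
    first = ask(history)
    history += [{"role": "assistant", "content": first},
                {"role": "user", "content": f"{RETRY_PROMPT} Note: {error_note}"}]
    return ask(history)
```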
Statistical analysis
The data analysis was conducted with SPSS 26 software (IBM Corp.). Normally distributed data are expressed as the mean ± standard deviation, whereas non-normally distributed data are expressed as the median (25th–75th percentile) (M(P25–P75)). The Kruskal–Wallis H test was used to test for differences in the FRE score and AS among ChatGPT-4o mini, ChatGPT-4o, and Gemini Advanced; when a significant difference was detected, Dunn's test with Bonferroni correction was applied to identify specific pairwise differences. Paired t tests were employed to compare the initial AS with the self-corrected AS. For categorical outcomes, ratings were dichotomized into 'excellent' vs. 'other', and Pearson's chi-square tests were used to compare the rating distributions among the three models; when a significant difference was detected, pairwise comparisons were performed at a Bonferroni-adjusted significance level (α = 0.05/3 ≈ 0.0167). The consistency of the four senior orthopedic surgeons' ratings of the three models was assessed via Fleiss's kappa, whose coefficient ranges between 0 and 1. According to established criteria, agreement is classified as poor for 0–0.2, fair for 0.2–0.4, moderate for 0.4–0.6, strong for 0.6–0.8, and very strong for 0.8–1.0.
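A minimal sketch of this analysis pipeline in Python, assuming per-question scores are held in NumPy arrays; the data here are synthetic placeholders, and scikit-posthocs and statsmodels are assumed as stand-ins for the SPSS procedures (Dunn's test with Bonferroni correction and Fleiss's kappa).

```python
import numpy as np
from scipy import stats
import scikit_posthocs as sp  # pip install scikit-posthocs
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Hypothetical mean Likert scores per question for the three models.
as_mini, as_4o, as_gemini = np.random.default_rng(0).uniform(1, 5, (3, 48))

# Kruskal-Wallis H test across the three models, then Dunn's post hoc test.
H, p = stats.kruskal(as_mini, as_4o, as_gemini)
if p < 0.05:
    dunn = sp.posthoc_dunn([as_mini, as_4o, as_gemini], p_adjust="bonferroni")

# Chi-square on ratings dichotomized into 'excellent' (AS > 4) vs. 'other'.
table = [[np.sum(a > 4), np.sum(a <= 4)] for a in (as_mini, as_4o, as_gemini)]
chi2, p_chi, _, _ = stats.chi2_contingency(table)

# Fleiss's kappa for the four raters (rows = questions, columns = raters).
ratings = np.random.default_rng(1).integers(1, 6, (48, 4))
counts, _ = aggregate_raters(ratings)  # question-by-category counts
kappa = fleiss_kappa(counts)
```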
Results
Length and FRE score of the responses from ChatGPT-4o mini, ChatGPT-4o, and Gemini Advanced
Table 1 shows the average total characters, total words, total syllables, total sentences, and FRE scores of the AI-LLMs' responses to the PMOP FAQs in the different subject areas. The corresponding values for the individual questions are shown in Supplementary Tables 2a-c, and the P-values from Dunn's test with Bonferroni correction are shown in Supplementary Tables 2d-g. For the topics "Clinical Manifestation", "Treatment", and "Prevention", there were no significant differences among the models in total characters, total words, total syllables, total sentences, or FRE scores. In the FAQ responses on "Diagnosis", Gemini Advanced produced significantly fewer total characters (1484.88 ± 377.67 vs. 2355.00 ± 796.14), total words (227.00 ± 55.68 vs. 338.13 ± 110.78), total syllables (425.00 ± 124.67 vs. 677.25 ± 225.17), and total sentences (13.63 ± 3.62 vs. 25.25 ± 10.18) than ChatGPT-4o mini (P < 0.05), and the FRE score of ChatGPT-4o was significantly greater than that of ChatGPT-4o mini (47.55 ± 17.08 vs. 23.24 ± 5.10) (P < 0.05). In the responses on "Pathogenesis", ChatGPT-4o and Gemini Advanced used significantly fewer total syllables than ChatGPT-4o mini (384.88 ± 158.79 and 368.88 ± 133.67 vs. 623.00 ± 266.78) (P < 0.05), and ChatGPT-4o used significantly fewer total sentences than ChatGPT-4o mini (11.00 ± 3.82 vs. 23.13 ± 10.45) (P < 0.05). In the responses on "Risk Factor", Gemini Advanced had a significantly greater FRE score than ChatGPT-4o mini (55.11 ± 15.21 vs. 24.06 ± 12.63) (P < 0.05). Across all FAQ responses, Gemini Advanced produced significantly fewer total syllables and total sentences than ChatGPT-4o mini (409.90 ± 168.96 vs. 562.56 ± 230.88; 16.08 ± 8.27 vs. 22.31 ± 9.84) (P < 0.05), and the FRE score of ChatGPT-4o was significantly greater than that of ChatGPT-4o mini (37.23 ± 13.18 vs. 28.88 ± 11.69) (P < 0.05). Table 2 shows the same length measures and FRE scores for the models' responses to the questions related to the 2022 ACOG-PMOP Guideline. Gemini Advanced used significantly fewer total words and total syllables than ChatGPT-4o and ChatGPT-4o mini (221.67 ± 57.33 vs. 316.04 ± 120.65 and 329.96 ± 128.78; 344.79 ± 91.34 vs. 618.13 ± 237.32 and 642.46 ± 258.46) (P < 0.05), and its FRE score was significantly greater than those of ChatGPT-4o and ChatGPT-4o mini (63.40 ± 2.58 vs. 27.00 ± 9.46 and 25.77 ± 13.33) (P < 0.05).
AS and grading of the ChatGPT-4o mini, ChatGPT-4o, and Gemini Advanced responses
Table 3 shows the AS on the Likert scale for ChatGPT-4o mini's, ChatGPT-4o's, and Gemini Advanced's responses to the PMOP FAQs on different topics. The four reviewers' Likert scores for the individual questions are shown in Supplementary Tables 2a-c, and the P-values from Dunn's test with Bonferroni correction are shown in Supplementary Table 2h. For "Diagnosis", Gemini Advanced's AS (2.53 ± 1.15) was significantly lower than that of ChatGPT-4o (4.19 ± 0.78) (P < 0.05). For "Pathogenesis", ChatGPT-4o mini had a significantly lower AS (2.84 ± 0.86) than ChatGPT-4o (4.28 ± 0.47) (P < 0.05). For "Risk Factor", Gemini Advanced had a significantly greater AS (4.09 ± 0.79) than ChatGPT-4o mini (2.34 ± 0.72) (P < 0.05). Across all PMOP FAQs, the AS of ChatGPT-4o was significantly greater than those of ChatGPT-4o mini and Gemini Advanced (4.01 ± 0.81 vs. 3.21 ± 1.02 and 3.42 ± 1.16). Table 4 shows the mean Likert scores of the three models for the questions related to the 2022 ACOG-PMOP Guideline: Gemini Advanced's AS was significantly lower than those of ChatGPT-4o mini and ChatGPT-4o (2.56 ± 0.90 vs. 3.49 ± 0.98 and 3.90 ± 0.79) (P < 0.05).
Tables 5 and 6 show the chi-square tests for the overall comparisons of the ratings of ChatGPT-4o, ChatGPT-4o mini, and Gemini Advanced across topics. The P-values from pairwise comparisons at the Bonferroni-adjusted significance level (α = 0.05/3 ≈ 0.0167) are shown in Table 7. For "Pathogenesis", ChatGPT-4o was rated significantly better than ChatGPT-4o mini (P < 0.0167). For "Risk Factor", Gemini Advanced performed significantly better than ChatGPT-4o mini (P < 0.0167). Overall, ChatGPT-4o performed well in answering the PMOP FAQs, significantly better than ChatGPT-4o mini and Gemini Advanced (P < 0.0167), with only two "poor" responses and 28 "excellent" ratings (Fig. 2a). For the questions from the 2022 ACOG-PMOP Guideline, ChatGPT-4o similarly outperformed Gemini Advanced (P < 0.0167); ChatGPT-4o had the highest percentage of "excellent" answers (62.5%), whereas Gemini Advanced had the lowest (12.5%) (Fig. 2b). The full content of the models' answers to all the questions is shown in Supplementary Tables 3a-b.
Self-correcting capacity of ChatGPT-4o mini, ChatGPT-4o, and Gemini Advanced
Table 8 shows the changes after self-correction by ChatGPT-4o mini for questions with an AS ≤ 2: the mean AS rose from 1.78 ± 0.15 for the initial responses to 4.22 ± 0.26 for the self-corrected responses, and the ratings improved significantly (P < 0.05). Table 9 shows the corresponding changes for ChatGPT-4o: the mean AS after self-correction was significantly greater than the initial mean AS (4.08 ± 0.14 vs. 1.75 ± 0.25, P < 0.05). Table 10 shows the changes after self-correction by Gemini Advanced: the mean AS after self-correction was significantly greater than the initial mean AS (4.03 ± 0.61 vs. 1.75 ± 0.24), and the ratings also improved significantly (P < 0.05). These findings suggest that ChatGPT-4o mini, ChatGPT-4o, and Gemini Advanced all have strong self-correcting abilities. Supplementary Tables 4a-c show the post-self-correction content of ChatGPT-4o mini, ChatGPT-4o, and Gemini Advanced for questions with AS ≤ 2; the erroneous parts of the initial responses are highlighted in yellow, and a professional orthopedic surgeon evaluated and flagged the incorrect parts of the initial content.
Discussion
Owing to the rapid decline in estrogen levels and the marked acceleration of bone loss after menopause, women are at risk of developing osteoporosis earlier than men, with a correspondingly increased risk of fracture [20]. According to the European Vertebral Osteoporosis Study (EVOS), the prevalence of vertebral fractures in women aged 50–79 years is approximately 12.0%, and after 50 years of age the prevalence increases with age [21]. The National Health and Nutrition Examination Survey (NHANES) suggested that in the United States, 6.2% of adults aged 65 years and over had osteoporosis at the lumbar spine or femur neck, and the prevalence of osteoporosis at either skeletal site was higher among women (24.8%) than among men (5.6%) [22]. In an epidemiological survey of postmenopausal osteoporosis in white women, Cummings et al. estimated that a 50-year-old woman has a 15–20% lifetime risk of hip fracture and a 50% risk of any osteoporotic fracture [23]. Hip fractures can result in poor quality of life, a dependent living situation, and an increased risk of death [24]. Postmenopausal osteoporosis seriously affects women's work and quality of life, and it has become a public health problem that urgently needs to be solved [25]. In recent years, the treatment and prevention of postmenopausal osteoporosis have received increasing attention in the medical field [26].
With the development of artificial intelligence, AI-LLMs have become more widely used in medical fields such as radiology, medical care, and medical education [27,28,29]. According to a study by Balel et al., ChatGPT-4o is a valuable tool for suggesting systematic review topics to oral and maxillofacial surgeons [30]. The ability of AI-LLMs (ChatGPT-4o and Claude 3-Opus) to process images can also help medical researchers quickly assess whether tumors are benign or malignant, showing promise for future applications in medical imaging [31]. In a study by Is et al., the performance of ChatGPT-4o and Google Gemini in answering rheumatology board-level questions was evaluated: ChatGPT-4o answered significantly more accurately than Google Gemini, whereas Google Gemini was more self-correcting than ChatGPT-4o. This result suggests that AI-LLMs perform differently when faced with different questions and prompts [32]. However, no study had tested the performance of AI-LLM chatbots in answering questions related to postmenopausal osteoporosis.
When the general PMOP FAQs were answered, Gemini Advanced gave more concise answers than ChatGPT-4o mini on "Diagnosis", and ChatGPT-4o was significantly more readable than ChatGPT-4o mini. On "Pathogenesis", ChatGPT-4o and Gemini Advanced used significantly fewer total syllables than ChatGPT-4o mini. On "Risk Factor", Gemini Advanced was significantly more readable than ChatGPT-4o mini. Across the general FAQs, Gemini Advanced used significantly fewer total syllables and total sentences than ChatGPT-4o mini, and the overall readability of ChatGPT-4o was better than that of ChatGPT-4o mini. In answering the questions related to the 2022 ACOG-PMOP Guideline, Gemini Advanced was more concise, with fewer total words and total syllables, and more readable than ChatGPT-4o mini and ChatGPT-4o. These results suggest that, owing to their different algorithms and versions, the AI-LLMs process information and answer questions differently: Gemini Advanced and ChatGPT-4o may focus more on providing concise, direct answers to improve the efficiency of information delivery. In addition, ChatGPT-4o is more readable, which may reflect the fact that ChatGPT-4o is an optimized version of ChatGPT-4 focused on efficiency and performance; it may use simpler sentence structures and common vocabulary, avoiding complex terminology and lengthy expressions to make the content more understandable. Notably, Gemini Advanced's responses were also very concise, with illustrations in some paragraphs to help readers understand the text more fully; however, according to the reviewers who rated them as "poor", some of Gemini Advanced's responses were too concise to provide clear answers to certain questions, resulting in shorter answers (Supplementary Tables 2, 3a-c). In contrast, ChatGPT-4o mini provided more comprehensive information, attempting to cover more context and detail to ensure that the user fully understood, which increased the number of characters and words and, conversely, greatly reduced readability.
On "Diagnosis", Gemini Advanced had a significantly lower AS than ChatGPT-4o (P < 0.05). On "Pathogenesis", ChatGPT-4o mini was also significantly less accurate than ChatGPT-4o. On "Risk Factor", Gemini Advanced had a significantly greater AS than ChatGPT-4o mini. For the PMOP FAQs overall, ChatGPT-4o had a significantly higher AS than ChatGPT-4o mini and Gemini Advanced. On the questions related to the 2022 ACOG-PMOP Guideline, Gemini Advanced had a significantly lower AS than ChatGPT-4o mini and ChatGPT-4o (P < 0.05) (Table 4). The score differences among ChatGPT-4o mini, ChatGPT-4o, and Gemini Advanced may be due to a number of factors. ChatGPT-4o, introduced by OpenAI in May 2024, is an optimized version of ChatGPT-4 with more parameters and computational power and a stronger focus on efficiency and performance. ChatGPT-4o mini, introduced by OpenAI in July 2024, is a smaller, more cost-effective variant of ChatGPT-4o that supports a wide range of tasks at low cost and low latency; its exposure to data is more limited, especially in specialized domains, so it tends to miss some of the most recent or more detailed medical information, although its academic benchmarks in textual intelligence and multimodal inference outperform GPT-3.5 Turbo and other small models. Gemini Advanced (Gemini 1.5 Pro), released by Google in February 2024, is a new generation of AI-LLM with a context window of millions of tokens that can comprehend long texts, audio, and video; it specializes in logical reasoning, code generation, and access to up-to-date web information, and it differs fundamentally from ChatGPT-4o mini and ChatGPT-4o (developed by OpenAI) in how it processes information and answers questions. In our study, however, apart from "Risk Factor", where Gemini Advanced had a higher AS than ChatGPT-4o mini and ChatGPT-4o, Gemini Advanced's answers were not very satisfactory, especially for questions related to the 2022 ACOG-PMOP Guideline, where it performed worse than ChatGPT-4o mini and ChatGPT-4o. This may be because ChatGPT-4o's model architecture and optimization strategy place more emphasis on knowledge understanding and reasoning in the medical domain, for example by optimizing specifically for medical terminology and logical relationships, making it more accurate on related questions.
In our study, we compared the self-correcting abilities of ChatGPT-4o mini, ChatGPT-4o, and Gemini Advanced on questions rated "poor". ChatGPT-4o mini had nine responses rated "poor" across all the questions, with a mean AS of 1.78 ± 0.15 that rose to a significantly greater 4.22 ± 0.26 after correction (P < 0.05). ChatGPT-4o had three "poor" responses, with a mean AS of 1.75 ± 0.25 before correction and 4.08 ± 0.14 after correction, indicating significantly higher scores and grades (P < 0.05). Gemini Advanced had sixteen "poor" responses, and its scores and grades also changed significantly before and after correction (P < 0.05). According to the professional orthopedic surgeons, the "poor" responses were primarily due to a lack of specific detail, failure to follow the guidelines, and an inability to answer the questions at a professional level. These findings suggest that the AI-LLMs failed to cover the latest medical advances and treatment options, struggled with complex and specialized medical issues, and showed limited reasoning ability and insufficient depth. After correction, however, the scores and ratings of all the models increased significantly, suggesting that when asked a second time they may make better use of their knowledge base to generate more accurate responses, apply more effective information retrieval strategies, and "rethink" their previous answers. ChatGPT-4o mini, ChatGPT-4o, and Gemini Advanced all possess strong self-correcting abilities.
Overall, this study evaluated the performance of three AI-LLMs (ChatGPT-4o mini, ChatGPT-4o, and Gemini Advanced) in answering FAQs about PMOP and the 2022 ACOG-PMOP Guideline. The results showed that the quality of ChatGPT-4o's responses was superior to that of the other models overall, with a higher proportion of "excellent" ratings (P < 0.05). ChatGPT-4o's superior performance provides further evidence that more advanced models deliver more reliable and clinically relevant output. Our findings are consistent with previous studies of AI-LLMs' potential in healthcare. For example, Wang et al. found that advanced AI-LLMs such as ChatGPT-4 showed greater consistency and accuracy in answering specialized medical questions about osteoarthritis [16]. Similarly, in the study of Lim et al., ChatGPT-4.0 showed higher potential for providing accurate and comprehensive answers to myopia-related queries [33]. In our study, ChatGPT-4o demonstrated a greater ability to generate evidence-based treatment recommendations, which is critical for supporting clinical decision-making in PMOP care, and these findings have important implications for integrating AI-LLMs into clinical practice. For example, ChatGPT-4o can serve as a valuable tool for clinicians seeking quick access to evidence-based guidelines or patient education materials. However, caution is needed: clinicians should resolve uncertainties in diagnosis and treatment by repeating questions or cross-referencing AI-LLM output against credible resources, especially in complex cases.
Strengths and limitations
Although we collected PMOP questions from multiple sources, the number of questions was small, their coverage may have been incomplete, and this study was based on simulated Q&A scenarios rather than real clinical data. Future work should compare these results against actual patient questions and answers to enhance generalizability.

The scoring system used in this study was a Likert scale. Even among experienced orthopedic specialists, there may be differences in how criteria such as accuracy, completeness, and clarity are understood and weighted: some experts may be more concerned with the accuracy of an answer, whereas others may care more about whether it is easy for the patient to understand. We therefore tested the agreement of the four specialists' ratings with Fleiss's kappa; the values were 0.238, 0.105, and 0.290, indicating relatively low consistency, which may be related to the specialists' differing clinical experience and research interests.

Moreover, because of the time constraints of our study, and because the training data and algorithms of AI models are constantly updated, the test results reflect the models' performance only at a specific point in time. The training data of each AI-LLM may introduce regional, temporal, or demographic biases, which may affect the models' performance and applicability in different clinical environments. Our study also covered a limited set of AI-LLMs and did not include other large models (e.g., DeepSeek). Model performance may change over time, and because of model updates, different results may be obtained even when testing with the same questions, which affects the reproducibility of the study.

In conclusion, as the technology continues to evolve, large-scale language models will play an increasingly important role in the treatment of postmenopausal osteoporosis. They can assist physicians in diagnosis and treatment, provide patient education and support, facilitate medical research, and promote telemedicine. To realize these prospects, the accuracy, reliability, and safety of the models must be further improved, as must their integration with the health care system. Ethical and social implications also deserve attention to ensure that AI technologies are applied in accordance with human values and interests.
Conclusion
Our study revealed that ChatGPT-4o's and Gemini Advanced's answers to PMOP-related questions were more concise, clear, and understandable, and that Gemini Advanced's answers featured illustrations that helped patients comprehend the text; however, Gemini Advanced's answers were often too concise and insufficiently tailored to the question, which needs to be improved. ChatGPT-4o significantly outperformed ChatGPT-4o mini and Gemini Advanced in answering PMOP-related FAQs, and ChatGPT-4o mini and ChatGPT-4o significantly outperformed Gemini Advanced in answering questions related to the 2022 ACOG-PMOP Guideline. Our results also suggest that ChatGPT-4o mini, ChatGPT-4o, and Gemini Advanced have strong self-corrective abilities, a finding that may be related to the strong feedback mechanisms and dynamic adaptability of current AI-LLMs.
Data availability
All data generated or analysed during this study are included in this published article [and its supplementary information files].
Abbreviations
- PMOP: Postmenopausal osteoporosis
- AI-LLMs: Artificial intelligence large-scale language models
- FAQs: Frequently asked questions
- 2022 ACOG-PMOP Guideline: Management of Postmenopausal Osteoporosis: 2022 ACOG Clinical Practice Guideline No. 2
- FRE: Flesch Reading Ease
- NHANES: National Health and Nutrition Examination Survey
References
1. Walker MD, Shane E. Postmenopausal osteoporosis. N Engl J Med. 2023;389(21):1979–91. https://doi.org/10.1056/NEJMcp2307353.
2. Porter JL, Varacallo M. Osteoporosis. In: StatPearls. Treasure Island (FL): StatPearls Publishing; 2024.
3. Ramchand SK, Leder BZ. Sequential therapy for the long-term treatment of postmenopausal osteoporosis. J Clin Endocrinol Metab. 2024;109(2):303–11. https://doi.org/10.1210/clinem/dgad496.
4. Reid IR. A broader strategy for osteoporosis interventions. Nat Rev Endocrinol. 2020;16(6):333–9. https://doi.org/10.1038/s41574-020-0339-7.
5. Zhang X, Wang Z, Zhang D, Ye D, Zhou Y, Qin J, Zhang Y. The prevalence and treatment rate trends of osteoporosis in postmenopausal women. PLoS ONE. 2023;18(9):e0290289. https://doi.org/10.1371/journal.pone.0290289.
6. Management of postmenopausal osteoporosis: ACOG clinical practice guideline no. 2. Obstet Gynecol. 2022;139(4):698–717. https://doi.org/10.1097/aog.0000000000004730.
7. Eastell R, Rosen CJ, Black DM, Cheung AM, Murad MH, Shoback D. Pharmacological management of osteoporosis in postmenopausal women: an Endocrine Society clinical practice guideline. J Clin Endocrinol Metab. 2019;104(5):1595–622. https://doi.org/10.1210/jc.2019-00221.
8. LeBoff MS, Greenspan SL, Insogna KL, Lewiecki EM, Saag KG, Singer AJ, Siris ES. The clinician's guide to prevention and treatment of osteoporosis. Osteoporos Int. 2022;33(10):2049–102. https://doi.org/10.1007/s00198-021-05900-y.
9. Omiye JA, Gui H, Rezaei SJ, Zou J, Daneshjou R. Large language models in medicine: the potentials and pitfalls: a narrative review. Ann Intern Med. 2024;177(2):210–20. https://doi.org/10.7326/m23-2772.
10. Clusmann J, Kolbinger FR, Muti HS, Carrero ZI, Eckardt JN, Laleh NG, Löffler CML, Schwarzkopf SC, Unger M, Veldhuizen GP, et al. The future landscape of large language models in medicine. Commun Med (Lond). 2023;3(1):141. https://doi.org/10.1038/s43856-023-00370-1.
11. Shah NH, Entwistle D, Pfeffer MA. Creation and adoption of large language models in medicine. JAMA. 2023;330(9):866–9. https://doi.org/10.1001/jama.2023.14217.
12. Ellaway RH, Tolsgaard M. Artificial scholarship: LLMs in health professions education research. Adv Health Sci Educ Theory Pract. 2023;28(3):659–64. https://doi.org/10.1007/s10459-023-10257-4.
13. Delsoz M, Madadi Y, Munir WM, Tamm B, Mehravaran S, Soleimani M, Djalilian A, Yousefi S. Performance of ChatGPT in diagnosis of corneal eye diseases. medRxiv. 2023. https://doi.org/10.1101/2023.08.25.23294635.
14. Potapenko I, Malmqvist L, Subhi Y, Hamann S. Artificial intelligence-based ChatGPT responses for patient questions on optic disc drusen. Ophthalmol Ther. 2023;12(6):3109–19. https://doi.org/10.1007/s40123-023-00800-2.
15. Grünebaum A, Chervenak J, Pollet SL, Katz A, Chervenak FA. The exciting potential for ChatGPT in obstetrics and gynecology. Am J Obstet Gynecol. 2023;228(6):696–705. https://doi.org/10.1016/j.ajog.2023.03.009.
16. Wang L, Chen X, Deng X, Wen H, You M, Liu W, Li Q, Li J. Prompt engineering in consistency and reliability with the evidence-based guideline for LLMs. NPJ Digit Med. 2024;7(1):41. https://doi.org/10.1038/s41746-024-01029-4.
17. Bellinger JR, De La Chapa JS, Kwak MW, Ramos GA, Morrison D, Kesser BW. BPPV information on Google versus AI (ChatGPT). Otolaryngol Head Neck Surg. 2024;170(6):1504–11. https://doi.org/10.1002/ohn.506.
18. Carlà MM, Gambini G, Baldascino A, Boselli F, Giannuzzi F, Margollicci F, Rizzo S. Large language models as assistance for glaucoma surgical cases: a ChatGPT vs. Google Gemini comparison. Graefes Arch Clin Exp Ophthalmol. 2024;262(9):2945–59. https://doi.org/10.1007/s00417-024-06470-5.
19. Lee Y, Shin T, Tessier L, Javidan A, Jung J, Hong D, Strong AT, McKechnie T, Malone S, Jin D, et al. Harnessing artificial intelligence in bariatric surgery: comparative analysis of ChatGPT-4, Bing, and Bard in generating clinician-level bariatric surgery recommendations. Surg Obes Relat Dis. 2024;20(7):603–8. https://doi.org/10.1016/j.soard.2024.03.011.
20. Clynes MA, Harvey NC, Curtis EM, Fuggle NR, Dennison EM, Cooper C. The epidemiology of osteoporosis. Br Med Bull. 2020;133(1):105–17. https://doi.org/10.1093/bmb/ldaa005.
21. O'Neill TW, Felsenberg D, Varlow J, Cooper C, Kanis JA, Silman AJ. The prevalence of vertebral deformity in European men and women: the European Vertebral Osteoporosis Study. J Bone Miner Res. 1996;11(7):1010–8. https://doi.org/10.1002/jbmr.5650110719.
22. Johnston CB, Dagar M. Osteoporosis in older adults. Med Clin North Am. 2020;104(5):873–84. https://doi.org/10.1016/j.mcna.2020.06.004.
23. Cummings SR, Black DM, Rubin SM. Lifetime risks of hip, Colles', or vertebral fracture and coronary heart disease among white postmenopausal women. Arch Intern Med. 1989;149(11):2445–8.
24. Cummings SR, Melton LJ. Epidemiology and outcomes of osteoporotic fractures. Lancet. 2002;359(9319):1761–7. https://doi.org/10.1016/s0140-6736(02)08657-9.
25. Bhatnagar A, Kekatpure AL. Postmenopausal osteoporosis: a literature review. Cureus. 2022;14(9):e29367. https://doi.org/10.7759/cureus.29367.
26. Jeong HG, Kim MK, Lim HJ, Kim SK. Up-to-date knowledge on osteoporosis treatment selection in postmenopausal women. J Menopausal Med. 2022;28(3):85–91. https://doi.org/10.6118/jmm.22007.
27. Daungsupawong H, Wiwanitkit V. LLMs in radiology through prompt engineering: comment. Rofo. 2024. https://doi.org/10.1055/a-2295-3839.
28. Haltaufderheide J, Ranisch R. The ethics of ChatGPT in medicine and healthcare: a systematic review on large language models (LLMs). NPJ Digit Med. 2024;7(1):183. https://doi.org/10.1038/s41746-024-01157-x.
29. Benítez TM, Xu Y, Boudreau JD, Kow AWC, Bello F, Van Phuoc L, Wang X, Sun X, Leung GK, Lan Y, et al. Harnessing the potential of large language models in medical education: promise and pitfalls. J Am Med Inform Assoc. 2024;31(3):776–83. https://doi.org/10.1093/jamia/ocad252.
30. Balel Y, Zogo A, Yıldız S, Tanyeri H. Can ChatGPT-4o provide new systematic review ideas to oral and maxillofacial surgeons? J Stomatol Oral Maxillofac Surg. 2024;125(5S2):101979. https://doi.org/10.1016/j.jormas.2024.101979.
31. Chen Z, Chambara N, Wu C, Lo X, Liu SYW, Gunda ST, Han X, Qu J, Chen F, Ying MTC. Assessing the feasibility of ChatGPT-4o and Claude 3-Opus in thyroid nodule classification based on ultrasound images. Endocrine. 2024. https://doi.org/10.1007/s12020-024-04066-x.
32. Is EE, Menekseoglu AK. Comparative performance of artificial intelligence models in rheumatology board-level questions: evaluating Google Gemini and ChatGPT-4o. Clin Rheumatol. 2024;43(11):3507–13. https://doi.org/10.1007/s10067-024-07154-5.
33. Lim ZW, Pushpanathan K, Yew SME, Lai Y, Sun CH, Lam JSH, Chen DZ, Goh JHL, Tan MCJ, Sheng B, et al. Benchmarking large language models' performances for myopia care: a comparative analysis of ChatGPT-3.5, ChatGPT-4.0, and Google Bard. EBioMedicine. 2023;95:104770. https://doi.org/10.1016/j.ebiom.2023.104770.
Acknowledgements
Not applicable.
Funding
This work was supported by the Key Program of the Natural Science Foundation of Tianjin (award number S24YBL069).
Author information
Authors and Affiliations
Contributions
RL and JL designed the study. ZS provided the funding. RL contributed to the data collection. RL and JL wrote the manuscript. JY and JL provided resources and participated in the data analysis. JY, RL and ZS independently rated the responses. RL and JL performed the data validation and edited the manuscript. ZS and HY confirmed the authenticity of all the raw data. All authors have read and approved the final manuscript.
Corresponding authors
Ethics declarations
Ethics approval
Not applicable.
Consent for publication
Not applicable.
Human ethics and consent to participate declarations
Not applicable.
The name of the approval committee or the internal review board (IRB)
Not applicable.
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Electronic supplementary material
Below is the link to the electronic supplementary material.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Liu, R., Liu, J., Yang, J. et al. Comparative analysis of ChatGPT-4o mini, ChatGPT-4o and Gemini Advanced in the treatment of postmenopausal osteoporosis. BMC Musculoskelet Disord 26, 369 (2025). https://doi.org/10.1186/s12891-025-08601-3
DOI: https://doi.org/10.1186/s12891-025-08601-3