Differential item functioning analysis on the Geriatric Depression Scale-15: An iterative hybrid ordinal logistic regression

The elderly population has grown substantially worldwide, so depression, a common problem in late life, may become one of the economic, social, and health challenges of the 21st century. Because clinical diagnosis of depression is costly, effective screening questionnaires such as the 15-item Geriatric Depression Scale (GDS-15) are needed. However, the measurement invariance of the GDS-15 in the general population is still unknown. In our study, 1473 participants from all of Iran's ethnic groups were asked to answer the GDS-15 and to report demographic factors: human settlement, employment, disease, marital status, age, gender, homebound status, financial status, and ethnicity. The lordif package in R 3.1.3 was then used to detect items with differential item functioning (DIF), that is, items that behave unevenly across demographic factors. Our findings reveal that women, homebound patients, poorer people, and people with a non-Persian mother tongue score classic psychological symptoms higher than people with the same depression level in other groups. Therefore, psychologists should remove or replace these items before using this questionnaire to screen for geriatric depression.


Introduction
In late life, depression is one of the reversible disorders that increase healthcare expenses and decrease quality of life [1]. In 1999, a systematic review reported a depression prevalence of only 13.5% among elderly people aged 55 and older, but depressive symptoms have become more prevalent over time [2,3]. Since population dynamics are among the most important determinants of a society's health care needs, the growth of the elderly population can turn depression into one of the economic, social, and health challenges of the 21st century [4,5]. To prevent this problem, we should investigate methods for prompt recognition of the illness. The clinical interview is time-consuming and costly, whereas screening and diagnostic instruments are available to overcome these difficulties [6].
When assessing geriatric depression, screening instruments should comprise simple and easily understood items that suit this population. Furthermore, psychological symptoms should carry greater weight than somatic ones because they have adequate power to discriminate depressed from nondepressed elderly people [7]. The 15-item Geriatric Depression Scale (GDS-15), which has 92% sensitivity and 89% specificity, is one of the most popular instruments that meet these expectations [7,8]. It has also shown acceptable validity and reliability in clinical practice and research [9-11].
Before group differences on a questionnaire are assessed, however, the test items should be fair and appropriate for assessing knowledge in a specific area across different groups of examinees [12]. Under this condition, differences in performance between groups reflect true differences in ability level and are not due to items that behave differently for subjects from different groups. Differential item functioning (DIF) signals that factors related to group membership affect the probability of a response and thus threaten fair assessment [13,14]. DIF analysis is therefore an important part of assessing validity, especially for psychological instruments [15,16]. Kim et al. used Item Response Theory (IRT) to detect DIF items in the Beck Depression Inventory (BDI) [17]. They showed that persons with low and high depression answered cognitive and somatic symptom items differently. Two other studies revealed that the GDS-15 had no significant DIF among workers and health care patients for age, level of education, sex, and race [18,19].

Statistical analysis
In the ordinal logistic regression (OLR) model, the logit of observing each dichotomous item response is related to two explanatory variables, the observed ability ($\theta$), a continuous variable, and a categorical group variable ($G$), as follows:

$$\operatorname{logit} P(u = 1) = \beta_0 + \beta_1 \theta + \beta_2 G + \beta_3 (\theta \times G)$$

Under this formulation, an item shows uniform DIF if $\beta_2 \neq 0$ and $\beta_3 = 0$, and non-uniform DIF if $\beta_3 \neq 0$. These hypotheses can be tested using the $G^2$ likelihood ratio statistic [20]. To assess the size of DIF, the effect size measures $\Delta\beta_1$ and $\Delta R_1$ were estimated for uniform DIF items, and $\Delta R_2$ was estimated for non-uniform DIF items [21,22]. In this article, $\Delta\beta_1$, $\Delta R_1$, and $\Delta R_2$ were considered large when they exceeded 0.01, 0.07, and 0.07, respectively [23,24].
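The nested-model tests described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the lordif implementation: the data are simulated (one item with uniform DIF only), the ability is a simulated value rather than an observed or IRT-derived score, and the helper names `solve`, `fit_logistic`, and `chi2_sf_1df` are hypothetical. The relative change in the ability slope is shown as one plausible reading of $\Delta\beta_1$.

```python
import math
import random


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def fit_logistic(X, y, max_iter=50, tol=1e-8):
    """Maximum-likelihood logistic regression by Newton-Raphson.

    Returns (coefficients, log-likelihood).
    """
    p = len(X[0])
    beta = [0.0] * p
    for _ in range(max_iter):
        grad = [0.0] * p
        hess = [[0.0] * p for _ in range(p)]
        for xi, yi in zip(X, y):
            eta = sum(b * v for b, v in zip(beta, xi))
            mu = 1.0 / (1.0 + math.exp(-eta))
            for i in range(p):
                grad[i] += (yi - mu) * xi[i]
                for j in range(p):
                    hess[i][j] += mu * (1.0 - mu) * xi[i] * xi[j]
        step = solve(hess, grad)
        beta = [b + s for b, s in zip(beta, step)]
        if max(abs(s) for s in step) < tol:
            break
    loglik = 0.0
    for xi, yi in zip(X, y):
        eta = sum(b * v for b, v in zip(beta, xi))
        loglik += yi * eta - math.log1p(math.exp(eta))
    return beta, loglik


def chi2_sf_1df(x):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(max(x, 0.0) / 2.0))


# Simulated data (an assumption for illustration): beta_2 = 0.8, beta_3 = 0.
rng = random.Random(0)
theta, group, resp = [], [], []
for _ in range(2000):
    t = rng.gauss(0.0, 1.0)
    g = rng.randint(0, 1)
    prob = 1.0 / (1.0 + math.exp(-(-0.5 + 1.2 * t + 0.8 * g)))
    theta.append(t)
    group.append(g)
    resp.append(1 if rng.random() < prob else 0)

X1 = [[1.0, t] for t in theta]                           # model 1: ability only
X2 = [[1.0, t, g] for t, g in zip(theta, group)]         # model 2: + group
X3 = [[1.0, t, g, t * g] for t, g in zip(theta, group)]  # model 3: + interaction
(b1, ll1), (b2, ll2), (b3, ll3) = (fit_logistic(X, resp) for X in (X1, X2, X3))

g2_uniform = 2.0 * (ll2 - ll1)     # likelihood ratio test of beta_2 = 0
g2_nonuniform = 2.0 * (ll3 - ll2)  # likelihood ratio test of beta_3 = 0
delta_beta1 = abs((b2[1] - b1[1]) / b1[1])  # relative change in the ability slope

print("G2 uniform:", g2_uniform, "p =", chi2_sf_1df(g2_uniform))
print("G2 non-uniform:", g2_nonuniform, "p =", chi2_sf_1df(g2_nonuniform))
print("relative change in beta_1:", delta_beta1)
```

Each $G^2$ value compares two nested models that differ by one parameter, so it is referred to a chi-square distribution with one degree of freedom.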
If the test contains biased items, the ability criterion used to investigate DIF in the OLR method is itself biased [25,26]. Substituting the latent variable of an IRT model for the observed ability in the OLR model removes this potential limitation of the ability measure in DIF detection. To this end, the lordif package in R 3.1.3 first detected DIF items by the OLR method. It then replaced the ability measure in the OLR model with IRT-derived ability estimates, in which the DIF items were treated separately for each group category, and detected DIF items by the OLR method again. The last two steps were repeated until the same DIF items were flagged in two consecutive iterations [27]. Finally, the effect of removing uniform DIF items on the group differences was also assessed.
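The control flow of this iterative procedure can be sketched as follows. This shows only the stopping rule, not the lordif code: the two callables `detect_dif` and `reestimate_ability` are hypothetical placeholders for the OLR detection step and the IRT ability re-estimation step, and the toy stubs and item numbers below are invented for illustration and are not results of this study.

```python
from typing import Callable, FrozenSet, Hashable, Iterable, Tuple


def iterate_hybrid_dif(
    initial_ability: Hashable,
    detect_dif: Callable[[Hashable], Iterable[int]],
    reestimate_ability: Callable[[FrozenSet[int]], Hashable],
    max_iter: int = 10,
) -> Tuple[FrozenSet[int], int]:
    """Repeat detection and ability re-estimation until the flagged set is stable.

    Returns the final set of flagged items and the number of re-estimation rounds.
    """
    flagged = frozenset(detect_dif(initial_ability))  # first OLR pass
    for iteration in range(1, max_iter + 1):
        ability = reestimate_ability(flagged)         # group-specific handling of flagged items
        new_flagged = frozenset(detect_dif(ability))  # OLR pass with the updated ability
        if new_flagged == flagged:                    # same items in two consecutive passes
            return new_flagged, iteration
        flagged = new_flagged
    return flagged, max_iter


# Toy stubs (hypothetical): the "ability" is just a label that records
# which items were treated as group-specific when it was estimated.
TOY_DETECTION = {
    "sum_score": {4, 10, 12},
    ("irt", frozenset({4, 10, 12})): {4, 10},
    ("irt", frozenset({4, 10})): {4, 10},
}


def toy_detect(ability):
    return TOY_DETECTION[ability]


def toy_reestimate(flagged_items):
    return ("irt", flagged_items)


flagged, n_iter = iterate_hybrid_dif("sum_score", toy_detect, toy_reestimate)
print(sorted(flagged), n_iter)
```

A cap on the number of iterations (`max_iter`, also an assumption here) guards against a flagged set that keeps oscillating instead of settling.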

Measures
In 1983, a team of clinicians and psychiatry researchers selected 100 of the most efficient items that would not alarm patients or make them overly defensive. The items also incorporated cognitive complaints specific to the elderly and had a yes/no format to make the self-rating scale simpler for patients. Then, 47 elderly participants in California, both depressed and non-depressed, were asked to answer the questions, and the items that correlated best with depressive symptoms formed the 30-item Geriatric Depression Scale (GDS) [7]. In 1986, the most practical items formed the GDS short form with 15 items [28]. Except for items 1, 5, 7, 11, and 13, which are scored negatively, positive answers indicate depression. The depression score suggests a normal state (range 0-4) or mild (range 5-8), moderate (range 9-11), or severe (range 12-15) depression. To ensure the accuracy of the translation, independent translators translated all scale items into Persian and back into English in 2005. Then, 204 elderly people were asked to answer the questionnaire according to how they had felt over the past week, and the test-retest reliability of the scale was assessed. The Persian version had acceptable reliability and validity (test-retest reliability = 0.58, Cronbach's α = 0.9, r split-[...]
[...] Persian mother tongue. This paper also intended to consider the impact of this factor on a person's perception of the Persian version of the GDS-15. We therefore randomly selected 800 people aged 60 and older in each of the Persian and non-Persian mother tongue groups. In selecting people whose mother tongue was not Persian, we tried to reflect the proportions of Iran's ethnic groups [31]. From April to October 2017, information was gathered in 16 different cities and counties (Isfahan, Nain, Tehran, Birjand, Dihook, Shahekord, Mashhad, Tabriz, Fereidan county, Sanandaj, Kermanshah, Lordegan, Khoramabad, Zirkooh county, Ahwaz, Agh-Ghala). Participants had not experienced chronic sorrow during the past month and gave informed consent.
Incomplete questionnaires were removed from the study (116 for Persian mother tongue, 11 for non-Persian mother tongue). Finally, the lordif and psych packages were used to analyze the data of 684 Persians, 297 Turks, 96 Lurs, 100 Kurds, 98 Baluchis, 98 Arabs, and 100 Turkmens.
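The GDS-15 scoring rule described in this section (items 1, 5, 7, 11, and 13 scored negatively; cut-offs at 0-4, 5-8, 9-11, and 12-15) can be sketched as a short function. The function names and the input format, a mapping from item number to a yes/no answer, are assumptions made for illustration.

```python
REVERSE_SCORED = {1, 5, 7, 11, 13}  # items for which "no" is the depressive answer


def gds15_score(answers):
    """Return the GDS-15 total score.

    answers maps each item number (1 to 15) to True for "yes" or False for "no".
    """
    if set(answers) != set(range(1, 16)):
        raise ValueError("expected answers for items 1 to 15")
    score = 0
    for item, said_yes in answers.items():
        depressive = (not said_yes) if item in REVERSE_SCORED else said_yes
        score += 1 if depressive else 0
    return score


def gds15_category(score):
    """Map a total score to the severity bands given in the text."""
    if not 0 <= score <= 15:
        raise ValueError("score must be between 0 and 15")
    if score <= 4:
        return "normal"
    if score <= 8:
        return "mild"
    if score <= 11:
        return "moderate"
    return "severe"


all_yes = {item: True for item in range(1, 16)}
print(gds15_score(all_yes), gds15_category(gds15_score(all_yes)))
```

Answering "yes" to every item gives a score of 10, because the five negatively scored items do not count toward depression in that case.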

Results
Item responses were available from 1473 participants (684 with a Persian mother tongue and 789 with a non-Persian mother tongue). The mean ages of the analysis population were 69.22 and 69.40 years for males and females, respectively. Most of the respondents lived in cities (53.6%), had a chronic medical illness (70.8%), and were married (76.3%). About 16.2% received home health care and 20.2% were currently employed; by financial status, 37.5% were poor, 21% were making ends meet, and 41.5% were rich. As Table 1 shows, Cronbach's alpha and coefficient omega were in an acceptable range, but the Standardized Root Mean Squared Residual (SRMR) was the only Confirmatory Factor Analysis (CFA) index within the range recommended by Hu and Bentler [32,33]. For the non-Persian mother tongue group, even the SRMR was out of range.
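For readers unfamiliar with the reliability coefficient reported in Table 1, the standard formula for Cronbach's alpha can be sketched as follows. The study itself used the psych package in R; the function name and the toy response matrix below are assumptions for illustration and are unrelated to the study data.

```python
from statistics import variance


def cronbach_alpha(rows):
    """Cronbach's alpha for a list of respondents, each a list of item scores."""
    k = len(rows[0])
    item_variances = [variance([row[i] for row in rows]) for i in range(k)]
    total_variance = variance([sum(row) for row in rows])
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


# Toy data: 4 respondents, 3 dichotomous items.
toy = [[1, 1, 1], [0, 0, 0], [1, 1, 0], [0, 0, 1]]
alpha = cronbach_alpha(toy)
print(round(alpha, 3))
```

In this toy example each item has a sample variance of 1/3 and the total score has a variance of 5/3, so alpha is 1.5 × (1 − 0.6) = 0.6.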

DIF analysis
In Tables 2 and 3, the DIF analysis is reported across human settlement, disease, employment, marital status, age, gender, homebound status, financial status, and ethnicity. The chi-square statistics flag 10, 5, 8, 4, 3, 11, 11, 14, and 5 DIF items across these factors, respectively. However, none of the items had large non-uniform DIF ($\Delta R_2 \le 0.0189$). Among uniform DIF items, $\Delta\beta_1$ is often large, but only item 4 for financial status shows a large $\Delta R_1$ ($\Delta R_1 = 0.0957$). The CvBL was non-significant in only 13 of 60 uniform DIF items (items 1, 8, 9, and 15 for human settlement groups, item 4 for people with and without chronic illness, items 2, 4, and 8 for employment groups, item 6 for gender groups, items 2 and 15 for homebound groups, item 9 for financial status groups, and item 15 for ethnic groups).
Because there was more than one uniform DIF item for every study factor, item score functions are helpful for investigating additive or cancel-out effects. For this purpose, dashed lines are used for the city habitat, chronic illness, single, old-old, and yes categories in Fig. 1. For human settlement, four uniform DIF items run in the opposite direction to the other three, so the uniform DIF items can cancel out the effect at the domain level. Items 4 and 10 also cancel out item 3 across the presence or absence of chronic medical illness. For employment status, the large effect sizes of items 1 and 11 cancel out not only the non-significant uniform DIF in items 2, 4, and 8 but also the significant uniform DIF in items 9 and 15. For marital status, items 5 and 11 cancel each other out, but items 9 and 15 go in one direction. The two uniform DIF items for age groups also go in one direction. However, if we consider both effect size measures, $\Delta\beta_1$ and $\Delta R_1$, the additive effects are not meaningful for marital status and age groups.
In Figs. 2 and 3, item score functions are shown for gender, homebound, financial status, and ethnic groups. For gender, homebound, and financial status groups, all uniform DIF items go in one direction. Across ethnic groups, only item 8 runs in the opposite direction to items 1, 9, and 15, so it cannot cancel them out. The observed additive effect causes women, homebound patients, poor people, and people with a non-Persian mother tongue to score their depression higher. Table 4 assesses the effect of removing uniform DIF items on the group differences for the studied factors. As the table shows, only the gender and financial status group differences change from significant (p < 0.001) to non-significant (p = 0.17 and p = 0.035, respectively) after the uniform DIF items are removed. For ethnicity and employment status, as for the other study factors, removing DIF items does not change the group difference (p < 0.01). According to all the presented findings, uniform DIF is most noticeable in items 5 and 15 for gender groups, items 4, 8, and 14 for homebound groups, items 1, 3, 4, 5, 7, 8, 13, 14, and 15 for financial status groups, and item 1 for ethnic groups.

Discussion
To our knowledge, the interpretation of GDS item scores across different group memberships has been compared in four studies, among them studies of Chinese pneumoconiosis workers, homebound patients in New York, and a large longitudinal study of older Italians [18,34,35]. Across levels of education and age groups, Rasch models showed DIF in items 3, 4, 9, and 10 for silicosis workers in Hong Kong. The Persian version of the GDS-15, however, had no DIF for age, like the American version. Since Tang et al. studied a limited population, their findings cannot be generalized to the general population, unlike the findings of this study. The study of health care patients in Westchester has a similar limitation. Hence, in contrast to women receiving health care in America, Iranian women rated their depression higher than men in items 5 and 15. White Americans and other homebound patients also showed DIF in item 10, while we found clinically significant DIF in item 1 between the Persian and non-Persian mother tongue groups.
Living in underdeveloped regions may explain why most of the non-Persian population is unsatisfied with life in Iran [36]. Furthermore, old women experience low spirits and a sense of inadequacy more than men of the same age, because lower education, widowhood, and low income are more common among older women than older men [37]. They therefore score the items "Are you in good spirits most of the time?" and "Do you think that most people are better off than you are?" higher than men.
Similar reasons may exist for the DIF items in homebound and financial status groups. Rich people believe that they, not other people or events, create their future [38]. They are therefore in control of their lives and feel more satisfied, lively, excited, happy, strong, and full of energy about themselves and their situation [39]. As a result, rich people rate themselves lower on items 1, 3, 4, 5, 7, 8, 13, 14, and 15 of the GDS-15. Likewise, social isolation and financial worries about medication may be two important reasons why homebound elderly people rate themselves as bored and helpless and even feel that their situation is worse than that of others [40-42].
Our study results are more reliable than those of previous studies for two reasons. First, as mentioned above, the data in previous studies were drawn from limited populations (pneumoconiosis or homebound patients), so the generalizability of their findings is in doubt, whereas our study selected samples from all over Iran [14,18,19]. Second, the stringent assumptions of Rasch models might lead to erroneous conclusions, in contrast to OLR models. Logistic models, for their part, show unacceptable power when the distribution of the ability parameter is asymmetrical, even with a sample size of 600 [43]. Using the iterative hybrid ordinal logistic regression/item response theory model for DIF detection allowed us to overcome these difficulties [26]. Finally, despite the strengths of this study, further studies are recommended to detect DIF between cultures and to replace or remove the observed DIF items according to psychologists' judgment.
Overall, the GDS-15 is useful for detecting depression across human settlement, disease, employment status, marital status, and age groups. The tool might be misleading, however, when depression is compared between men and women, home health care patients and others, people with Persian and non-Persian mother tongues, and especially people with different levels of income. Confirmation studies are therefore warranted on the use of the shortened version of the GDS in different ethnic, gender, and homebound groups, and most of the items need to be replaced for comparisons across income levels.