How to measure precision in classical test theory framework?

What methods or approaches exist or might be developed to measure precision of measurement in a classical test theory framework? Standards for Educational and Psychological Testing (American Psychological Association, 2014) talks about precision of measurement in several places, but does not seem to provide any references.

Also, to what extent is precision of measurement in IRT theory similar to or different from absolute agreement in score (as opposed to inter-rater reliability or test reliability as measured by alpha) in CTT?

'Precision' in classical test theory

Most accounts of classical test theory do not have a notion of precision as such, but occasionally, reliability may be called precision instead. The relationship is probably most concisely illustrated with the standard dartboards. This is also explained on the Wikipedia Item Response Theory page, but as you can see, in CTT, precision is to reliability what accuracy is to validity.

(Wikipedia Reliability article juxtaposed with a Tufts University guide.)

Origin of 'precision' in classical test theory

Cronbach (1951) suggested Coombs (1950) as the origin of the reliability/precision confusion.

Coombs (6) offers the somewhat more satisfactory name "coefficient of precision" for this index which reports the absolute minimum error to be found if the same instrument is applied twice independently to the same subject. A coefficient of stability can be obtained by making the two observations with any desired interval between. A rigorous definition of the coefficient of precision, then, is that it is the limit of the coefficient of stability, as the time between testings becomes infinitesimal.

I'm not entirely sure if I'm interpreting the secondary question right, but IRT precision is a measure of precision under IRT, and ICC is a measure of reliability under CTT. The main difference is that CTT expresses reliability as a single value, while IRT expresses precision for different values of the underlying trait. This isn't specific to absolute agreement, though, so maybe I'm misunderstanding.



In this post, we will explore how measurement error arising from imprecise parameter estimation can be corrected for. Specifically, we will explore the case where our goal is to estimate the correlation between a self-report and a behavioral measure, a common situation throughout the social and behavioral sciences.

For example, as someone who studies impulsivity and externalizing psychopathology, I am often interested in whether self-reports of trait impulsivity (e.g., the Barratt Impulsiveness Scale) correlate with performance on tasks designed to measure impulsive behavior (e.g., the Balloon Analogue Risk Task). In these cases, it is common for researchers to compute summed or averaged scores on the self-report measure (e.g., summing item responses and dividing by the number of items) and to use summary statistics of behavioral performance (e.g., percent risky choices) as each subject's behavioral measure. The resulting estimates are then entered into a secondary statistical model to make inferences (e.g., the correlation between self-reported trait and behavioral measures). Importantly, this two-stage approach to inference assumes that both summary measures contain no measurement error, or alternatively that we have estimated them with perfect precision, which is a very strong assumption that is surely not met in practice.

Here, we will explore how such assumptions can bias our statistical inferences on individual differences. As we will show, this bias arises because these two-stage approaches ignore important sources of measurement error. We will begin with an exploration of traditional methods developed within the context of classical test theory, and we will then transition to the use of more contemporary generative models. Throughout, we will explore relationships between classical and generative approaches, which are actually more similar than they are different in many ways.
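One traditional tool from classical test theory for this kind of bias is Spearman's correction for attenuation, which divides the observed correlation by the square root of the product of the two reliabilities. Whether the post goes on to use exactly this formula is an assumption; the function name and the numbers below are hypothetical and only illustrate the idea.

```python
import math


def disattenuate(r_observed, rel_x, rel_y):
    """Spearman's correction for attenuation.

    r_observed: correlation between the two observed summary scores
    rel_x, rel_y: reliabilities of the two measures, each in (0, 1]
    """
    if not (0.0 < rel_x <= 1.0 and 0.0 < rel_y <= 1.0):
        raise ValueError("reliabilities must lie in (0, 1]")
    return r_observed / math.sqrt(rel_x * rel_y)


# Hypothetical values: a reliable self-report scale (0.80) and a noisier
# behavioral summary score (0.50) with an observed correlation of 0.20.
corrected = disattenuate(0.20, rel_x=0.80, rel_y=0.50)
print(round(corrected, 3))
```

With perfect reliabilities the correction leaves the correlation unchanged; the lower the reliabilities, the more the observed correlation understates the corrected one.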



For this study, we analyzed 32 high-stakes medical end-of-term exams from three Swiss medical schools conducted in 2016. Our sample covered exams ranging from the first to the fifth year of study. End-of-term exams cover the entire content taught in that term and are used to decide whether a candidate is allowed to pass the term and to continue her or his studies. All included exams were constructed according to the blueprints of the programs and terms, which are all based on the Swiss Catalogue of Learning Objectives [18, 19], and met high-quality standards, e.g. careful item review and revision according to the standards set by Haladyna, Downing [20] and Case and Swanson [21].

The mean number of examinees per exam was 264 (SD = 83, min = 146, max = 378). All exams were multiple-choice exams comprising single-best-answer (Type A) items and multiple true-false (MTF) items. The mean number of items per exam was 103 (SD = 428, min = 59, max = 150). On average, 30.60% of the items were MTF items (SD = 8.00%, min = 18.97%, max = 53.33%). Type A items included five answer options, and MTF items included four answer options. Type A items were scored with a full point when answered correctly; otherwise, examinees received no points. MTF items were scored using a partial credit scoring algorithm [22, 23]. For these items, examinees received half a point if more than half of the true/false ratings of an item were marked correctly and one point if all were marked correctly. Otherwise, they received no points for the item. Items eliminated in post-hoc review were excluded from analyses (1.5 items per exam on average). Item difficulty covered the whole range, from easy to difficult items (min = 0.018, mean = 0.69, max = 1).
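The scoring rules described above can be sketched in a few lines of Python. The function names are hypothetical, and the sketch assumes four true/false ratings per MTF item, as stated in the text.

```python
def score_type_a(is_correct):
    """Type A item: one point if answered correctly, otherwise none."""
    return 1.0 if is_correct else 0.0


def score_mtf(n_correct, n_ratings=4):
    """MTF item with partial credit.

    One point if all ratings are correct, half a point if more than half
    (but not all) are correct, otherwise no points.
    """
    if not 0 <= n_correct <= n_ratings:
        raise ValueError("n_correct must be between 0 and n_ratings")
    if n_correct == n_ratings:
        return 1.0
    if n_correct > n_ratings / 2:
        return 0.5
    return 0.0


# Points for 0, 1, 2, 3 and 4 correct ratings on a four-option MTF item.
mtf_points = [score_mtf(n) for n in range(5)]
```

With four ratings, exactly two correct ratings is not "more than half", so only three correct ratings earn the half point.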

The standard setting of all exams was content-based [24]. Cut scores ranged from 47.5% to 70% of the maximum points, with a mean at 56.6% (SD = 4.7%).

Calculation of conditional reliability

We calculated conditional reliabilities for every exam in both IRT and CTT [12]. In both theories, conditional reliability is a standardization of the cSEM at the score variance (σ_X²). Conditional reliability is defined as:
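Written out, and assuming cSEM(x) denotes the conditional standard error of measurement at score x (notation introduced here for illustration), a form consistent with this description is:

```latex
\mathrm{Rel}(x) = 1 - \frac{\mathrm{cSEM}(x)^{2}}{\sigma_X^{2}}
```

Under this form, a score level whose squared cSEM equals the score variance has a conditional reliability of zero, and a score level measured without error has a conditional reliability of one.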

To calculate the cSEM in CTT, we used the binomial error model [7, 8]. According to this model, the cSEM is defined as follows:

where X is the score of an exam and k is the number of items.
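Under the binomial error model with dichotomously scored items, the cSEM is usually written as cSEM(X) = sqrt(X(k - X) / (k - 1)). The Python sketch below assumes that form and combines it with the standardization at the score variance described earlier. The function names and the example values (k = 100, X = 60, score variance of 144) are hypothetical, and the partial credit scoring of MTF items is not modelled.

```python
import math


def csem_binomial(x, k):
    """Conditional SEM at raw score x on a k-item test (binomial error model)."""
    if k < 2 or not 0 <= x <= k:
        raise ValueError("need k >= 2 and 0 <= x <= k")
    return math.sqrt(x * (k - x) / (k - 1))


def conditional_reliability(csem, score_variance):
    """Standardize the squared cSEM at the observed score variance."""
    if score_variance <= 0:
        raise ValueError("score_variance must be positive")
    return 1.0 - csem ** 2 / score_variance


# Hypothetical exam: 100 items, cut score at 60 points, score SD of 12 points.
csem_at_cut = csem_binomial(60, 100)
rel_at_cut = conditional_reliability(csem_at_cut, score_variance=12 ** 2)
print(round(csem_at_cut, 2), round(rel_at_cut, 3))
```

In this model the cSEM is largest for scores near k/2 and shrinks to zero at scores of 0 and k, so a cut score near the middle of the scale falls where CTT precision is lowest.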

In IRT, the squared cSEM is the inverse of the test information function (Is) [12]. The cSEM is calculated as follows:

To calculate conditional reliability in IRT, we used a one-parameter logistic (1-PL) IRT model for partial credit scoring. In this model, every score on the theta scale corresponds to only one test score on the sum score “scale”. This correspondence is useful for judging the differences between the two approaches. For estimating theta scores, we used the weighted likelihood estimator [25].

Local independence is a prerequisite for applying a 1-PL model. For testing local independence, we used the Q3 statistic [26, 27]. Mean Q3 value was 0.06 (min = 0.05, max = 0.07), indicating that the data are locally independent.

The 1-PL model showed an acceptable fit for the data. The mean SRMR (standardized root mean square residual) was 0.06 (min = 0.05, max = 0.08), and the mean SRMSR (standardized root mean square root of squared residual) was 0.08 (min = 0.06, max = 0.14). We also calculated Infit and Outfit for the items in the included exams. On average, 4% (min = 0.00%, max = 16.67%) of the items in an exam did not fit with regard to the Infit. Regarding the Outfit, on average 12.57% (min = 0.00%, max = 40%) of the items did not fit. Items that did not fit with regard to the Outfit were mostly easy items with low discrimination indices (Tab. 1).

Conditional reliability at the cut score as well as maximum and average conditional reliability were calculated. As an index of global reliability, the Cronbach’s alpha (in CTT) and separation index (in IRT) were calculated for each exam.

Influencing variables

The three influencing variables mentioned above were included in our analyses in order to test whether they relate to differences in conditional reliability at the cut score between exams: (1) range of examinees’ performance, (2) year of study, (3) number of items. As an index for the range of examinees’ performance, we used the difference between the maximum and minimum score in an exam. To enable comparison between the exams, examinees’ performance was calculated in percent for all analyses. As we used anonymized data, we were not able to include examinee-specific factors.

Control variables

Exams included in this study are from three different medical schools and contain both Type A and MTF items. The amount of MTF items may influence the test information and thereby conditional reliability. Therefore, we included both medical schools and the percentage of MTF items in the exams as control variables in the regression analyses.

Statistical analyses

To compare the conditional reliability at the cut score in IRT and CTT and to analyze influencing factors, we used analyses of variance (ANOVA) as well as regression analyses. As indices of effect size, we report partial eta squared (η²) and standardized beta. The level of significance was set at p < 0.05. All analyses were conducted using R (version 3.2.0) [28]. To estimate the 1-PL IRT model, we used the R package “TAM” [29], and for graphics, we used the R package “ggplot2” [30].

American Educational Research Association, American Psychological Association, and National Council on Measurement in Education. (2014). Standards for Educational and Psychological Testing. American Educational Research Association.


We would like to acknowledge the valuable, thoughtful comments and careful edits from our colleague and coworker Barbara Gandek, from the Department of Quantitative Health Sciences, University of Massachusetts Medical School, Worcester, MA, USA, as well as the contribution from Felix Fischer, from the Medical Clinic for Psychosomatic Medicine, Charité, Universitätsmedizin Berlin, who shared his work for Figure 1 with us for this paper. This work was supported by the German Research Society (Deutsche Forschungsgemeinschaft, DFG RO 2258/2-1, PI Rose) and an NIMH grant (R01MH082953, PI Rose). It was also supported by the University of Massachusetts Medical School, Worcester, MA, USA, and the Charité, Universitätsmedizin Berlin, Germany.

Oxford Handbook of Personality Assessment. Oxford University Press, 2012.

Research output : Chapter in Book/Report/Conference proceeding › Chapter

Title: Test Theory and Personality Measurement

Author: Chernyshenko, Oleksandr S.

Abstract: This article reviews traditional approaches for the psychometric analysis of responses to personality inventories, including classical test theory item analysis, exploratory factor analysis, and item response theory. These methods, which can be called "dominance" models, work well for items assessing moderately positive or negative trait levels, but are unable to describe adequately items representing intermediate (or average) trait levels. This necessitates a shift to an alternative family of psychometric models, known as ideal point models, which stipulate that the likelihood of endorsement increases as respondents' trait levels get closer to an item's location. The article describes an ideal point model for personality measures using single statements as items, reanalyzes data to show how the change of modeling framework improves fit, and discusses the pairwise preference format for use in personality assessment. It also considers two illustrative ideal point models for unidimensional and multidimensional pairwise preferences and shows that, after correcting for unreliability, correlations of personality traits assessed with single statements, unidimensional pairs, and multidimensional pairs are very close to unity.

Lesson Five: Classical Test Theory (CTT)

Please read the article below

Classical Test Theory (CTT) has a history of more than 80 years. Its name comes from the contrast with "modern test theory" (i.e., Item Response Theory).

The main statement of CTT is:

X (the raw score) = T (a true component) + E (a random error)

For example, suppose a student took three tests, all measuring 5th grade Math knowledge and all with a maximum score of 100. The student scored 88, 90, and 94 on the three tests. We can say the student is doing well, because the scores are all high and close to each other, so the student's performance is stable. Somewhere between 88 and 94 there might be a point, maybe 92, that reflects the student's true Math ability. The scores still vary, however, perhaps because the student felt well on one test day and sick on another.

As another example, a student took the GRE three times, and his quantitative scores were 155, 159, and 157. Each score reflects the student's ability plus some random error, which could come from skipping breakfast, being sick, something good happening before the test, or a breakup shortly before the test.
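The statement X = T + E can be illustrated with a small simulation. The sketch below is only an illustration: it assumes a hypothetical true score of 92, normally distributed error with a standard deviation of 3 points, and a fixed random seed. None of these values come from the lesson.

```python
import random
import statistics


def simulate_observed_scores(true_score, error_sd, n_tests, rng):
    """Draw observed scores X = T + E, with E a random error centred on zero."""
    return [true_score + rng.gauss(0.0, error_sd) for _ in range(n_tests)]


rng = random.Random(42)  # fixed seed so the sketch is reproducible
scores = simulate_observed_scores(92.0, 3.0, 10000, rng)

# Over many repeated tests the random errors average out, so the mean
# observed score approaches the true score T.
mean_score = statistics.mean(scores)
spread = statistics.stdev(scores)

# With only three observed scores, as in the GRE example, the mean is a
# rough estimate of the true score.
gre_estimate = statistics.mean([155, 159, 157])
```

The more the individual scores scatter around their mean, the larger the error component E and the less precisely any single score reflects T.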



About 1400 participants were recruited from four universities in Jiangxi Province, China. Before the survey, participants were informed that their personal information would be kept confidential and that the test would take about 20 minutes. Participants volunteered to take part in the survey. After excluding invalid data with a large number of missing responses, 1278 participants remained. The mean age was 20.06 years (SD = 1.57, ranging from 18 to 29 years). Table 1 contains the detailed demographic information. This study was approved by the Research Center of Mental Health, Jiangxi Normal University and the Ethics Committee of the Department of Psychology of Jiangxi Normal University. Written informed consent was obtained from all of the participants in accordance with the Declaration of Helsinki.

Table 1. Demographic characteristics (N = 1278)


The initial item bank was determined by referring to previous studies and consisted of 117 items from seven widely used shyness scales: the Revised Shyness Scale (RCBS; Cheek & Buss, 1981), the Social Avoidance and Distress Scale (SAD; Watson & Friend, 1969), the Brief Fear of Negative Evaluation Scale (BFNS; Leary, 1983a), the Interaction Anxiousness Scale (IAS; Leary, 1983b), the Shyness Scale (SS; Su & Wu, 2008), the McCroskey Shyness Scale (MSS; McCroskey & Richmond, 1982), and the Shyness Syndrome Inventory (SSI; Melchior & Cheek, 1990). Because different shyness scales may contain very similar or even identical topics, items with the same topics were removed to avoid overlap in the item bank. Based on previous studies, items from the seven selected scales could be classified into nine domains (Cheek & Buss, 1981; Leary, 1983a, 1983b; McCroskey & Richmond, 1982; Melchior & Cheek, 1990; Su & Wu, 2008; Watson & Friend, 1969): shyness, social avoidance, social distress, cognitive component of shyness, somatic component of shyness, emotional component of shyness, behavioral component of shyness, fear of negative evaluation, and interaction anxiousness. Table 2 contains detailed information about these scales.

Table 2. Sources and proportions of items

Note: RCBS, Revised Shyness Scale; SAD, Social Avoidance and Distress Scale; SSI, Shyness Syndrome Inventory; BFNS, Brief Fear of Negative Evaluation Scale; IAS, Interaction Anxiousness Scale; MSS, McCroskey Shyness Scale; SS, Shyness Scale.

All of the seven chosen scales are self-reported scales. The RCBS contains 13 items with a 5-point Likert-type scale (Very uncharacteristic or untrue, strongly disagree to Very characteristic or true, strongly agree). The SAD contains 19 items and each item has two levels (yes and no). The BFNS and the IAS both contain 12 items with a 5-point Likert-type scale (Not at all characteristic to Extremely characteristic). The SS contains 36 items with a 5-point Likert-type scale (Not at all characteristic to Extremely characteristic). The SSI contains 13 items with a 5-point Likert-type scale (Very uncharacteristic or untrue, strongly disagree to Very characteristic or true, strongly agree). The MSS contains 12 items with a 5-point Likert-type scale (Strongly disagree to Strongly agree).

All scales except the MSS and the SSI have a Chinese version. The RCBS was revised into Chinese for college students (Xiang, Ren, Zhou, & Liu, 2018). The results demonstrated that the Cronbach’s alpha and retest reliability of the Chinese version of the RCBS were .88 and .58, respectively. As for validity, the Chinese version of the RCBS had a close association with the Social Interaction Anxiety Scale (r = .77, p < .01). Peng, Fan, and Li (2003) modified the SAD in China, and the results showed that the Cronbach’s alpha and retest reliability of the Chinese version of the SAD were .85 and .76, respectively, and the subscale reliabilities of the Chinese version of the SAD were .77 and .73, respectively. Regarding its validity, the Chinese version of the SAD had a significant correlation with the IAS (r = .67, p < .01). The Chinese versions of the BFNS and IAS were developed by Wang, Wang, and Ma (1999). The Chinese version of the BFNS had a Cronbach’s alpha of .90 and a retest reliability of .75. Regarding validity, the Chinese version had a close correlation with the SAD (r = .51, p < .01). The Chinese version of the IAS had a Cronbach’s alpha of .87 and a retest reliability of .80. Regarding validity, the Chinese version of the IAS had a close correlation with the RCBS (r = .60, p < .01). The SS is a Chinese scale developed by Su and Wu (2008), with a Cronbach’s alpha of .95 and a retest reliability of .90. Their findings indicated that the subscale reliabilities of the Chinese version of the SS ranged from .80 to .87. Regarding validity, the SS had a close correlation with the RCBS (r = .88, p < .01).

Melchior and Cheek (1990) reported that in the SSI revision sample of 326 college students, the alpha internal consistency coefficient was .94; it had a 45-day retest reliability of .91 for a sample of 31 college students, with a correlation of .96 with the RCBS. The MSS had a Cronbach’s alpha of .90 and had a significant correlation of .01 with the RCBS (McCroskey & Richmond, 1982). The MSS and the SSI were translated into Chinese. The translations of the MSS and the SSI were performed by six researchers with extensive experience in translating self-report measures. Three of them performed a forward translation of the items, and the other researchers performed an independent review of these translations. Following this, where opinions on a translation differed, the six researchers and a professor of psychology discussed and revised it. Revisions and seminars were repeated until consistent results were obtained. The confirmatory factor analysis (CFA) showed that the Chinese version of the MSS had the same structure as the original MSS (Tucker-Lewis index [TLI] = 0.89, comparative fit index [CFI] = 0.91, root mean square error of approximation [RMSEA] = 0.07, standardized root mean square residual [SRMR] = 0.06). The alpha coefficient for the Chinese version of the MSS was .80, and it had a close association with the RCBS (r = .42, p < .01). For the Chinese version of the SSI, the error terms of item 10 and item 8, and of item 2 and item 1, were allowed to correlate because their content is very similar; with this specification, the SSI had the same structure as the original SSI, with TLI = 0.91, CFI = 0.93, RMSEA = 0.05, and SRMSR = 0.04. The alpha coefficient for the Chinese version of the SSI was .70, and the SSI had a close association with the RCBS (r = .73, p < .01). These results indicated that the Chinese versions of the MSS and SSI have acceptable reliability and validity.

To validate the proposed CAT-Shyness, the Shyness Questionnaire (Shy-Q; Bortnik et al., 2002) was chosen as the external criterion scale. It is commonly used to diagnose shyness symptoms in a clinical setting. It is considered that the average score of participants is more than 3.5 (Henderson, Gilbert, & Zimbardo, 2014). There are 35 items in the scale, which are divided into four dimensions: self-blame, seeking approval, fear of rejection, and self-restriction of expression. The scale is a 5-point Likert-type scale, with 1 = Not at all characteristic and 5 = Extremely characteristic. In this study, the Cronbach’s alpha was .88.

Construction of the CAT-Shyness Item Bank

For construction of the CAT-Shyness item bank, statistical analyses based on IRT were sequentially carried out, including the IRT analyses of unidimensionality, local independence, item fit, item discrimination, and DIF.


Within the framework of IRT, the unidimensionality assumption was checked first. Given the clear correlation shown between the different personality traits (Muñiz, Suárez-Álvarez, Pedrosa, Fonseca-Pedrero, & García-Cueto, 2014), a unidimensional hypothesis for the battery was established. A robust maximum likelihood estimation method was used in the exploratory factor analysis (EFA).

In EFA, the unidimensional hypothesis is supported when the first factor explains at least 20% of the total variance (Reckase, 1979) and the ratio of the variance explained by the first factor to that explained by the second factor is more than 4 (Reeve et al., 2007).
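The two criteria above can be checked with a few lines of code. The following is a minimal sketch, not the authors' procedure: it approximates the variance explained by each factor with the eigenvalues of the item correlation matrix (a principal-components shortcut, not the robust maximum likelihood EFA described in the text). The function name `unidimensionality_check` and the simulated data are hypothetical.

```python
import numpy as np

def unidimensionality_check(responses, min_first_share=0.20, min_ratio=4.0):
    """Return (first-factor variance share, first/second ratio, criteria met?)."""
    corr = np.corrcoef(responses, rowvar=False)          # item-by-item correlations
    eigvals = np.sort(np.linalg.eigvalsh(corr))[::-1]    # eigenvalues, largest first
    share = eigvals / eigvals.sum()                      # proportion of variance per component
    ratio = share[0] / share[1]
    ok = bool(share[0] >= min_first_share and ratio > min_ratio)
    return float(share[0]), float(ratio), ok

# Hypothetical data: 500 respondents, 10 items driven by one common trait.
rng = np.random.default_rng(0)
trait = rng.normal(size=(500, 1))
responses = trait + rng.normal(scale=1.0, size=(500, 10))
first_share, ratio, ok = unidimensionality_check(responses)
```

With real questionnaire data, a dedicated factor-analysis routine would give variance shares closer to those reported by an EFA; the eigenvalue shortcut is only meant to make the two thresholds concrete.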

To confirm acceptable unidimensionality of the dataset, we first ran an EFA and eliminated items with factor loadings below 0.30 (Nunnally, 1978) on the first factor, and then reran the EFA to investigate the unidimensionality of the item pool.

Parameter estimation

Based on the 1,278 response records, item parameters were estimated with the expectation-maximization (EM) algorithm in IRTPRO 2.1.

Model selection

In IRT, choosing an appropriate model is a precondition for accurate analysis results. In this study, the commonly used Akaike information criterion (AIC), Bayesian information criterion (BIC), and −2 log-likelihood (−2LL) were used to determine which model fit best. The smaller these test-fit indices are, the better the model fit (Posada & Crandall, 2001).
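As a reminder of how the three indices relate, AIC and BIC are both −2LL plus a penalty for the number of estimated parameters. The sketch below uses the standard definitions; the function names and the numeric values are hypothetical and are not results from the study.

```python
import math

def information_criteria(neg2ll, n_params, n_obs):
    """Return (AIC, BIC) computed from the -2 log-likelihood."""
    aic = neg2ll + 2 * n_params                 # penalty of 2 per parameter
    bic = neg2ll + n_params * math.log(n_obs)   # penalty grows with sample size
    return aic, bic

def best_model(fits, n_obs):
    """fits maps model name -> (neg2ll, n_params); the smallest BIC wins."""
    return min(fits, key=lambda name: information_criteria(*fits[name], n_obs)[1])

# Hypothetical values for illustration only.
fits = {"GRM": (61250.0, 300), "GPCM": (61410.0, 300)}
chosen = best_model(fits, n_obs=1278)
```

When two candidate models have the same number of parameters, the penalties cancel and all three indices rank the models identically.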

Under the IRT framework, IRT models can be divided into two main categories: the difference models (or cumulative logit models) and the divided-by-total models (or adjacent logit models; Tu, Zheng, Cai, Gao, & Wang, 2017). The graded response model (GRM; Samejima, 1969) is a typical difference model, and the generalized partial credit model (GPCM; Muraki, 1992) is a representative divided-by-total model. The GPCM extends the partial credit model (PCM; Masters, 1982) by adding a discrimination parameter. The GRM has the same number of item parameters as the GPCM and, like it, models ordered response categories. A review of previous studies showed that these two models are commonly used polytomous scoring models, both in IRT generally and in CAT (e.g., Paap, Kroeze, Terwee, Palen, & Veldkamp, 2017). Therefore, whichever of the GRM and the GPCM had the smaller test-fit indices was selected for further analysis.
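The difference between the two model families is easiest to see in how the category probabilities are built. The sketch below uses the usual textbook parameterizations (one discrimination `a` and a vector of thresholds or step parameters `b` per item); the function names are illustrative and are not taken from mirt or IRTPRO.

```python
import numpy as np

def grm_probs(theta, a, b):
    """Graded response model (difference model): P(X = k | theta), k = 0..len(b)."""
    b = np.asarray(b, dtype=float)
    # Cumulative curves P(X >= k), padded with 1 (lowest) and 0 (beyond highest).
    p_star = 1.0 / (1.0 + np.exp(-a * (theta - b)))
    cum = np.concatenate(([1.0], p_star, [0.0]))
    return cum[:-1] - cum[1:]                  # differences of adjacent cumulative curves

def gpcm_probs(theta, a, b):
    """Generalized partial credit model (divided-by-total): P(X = k | theta)."""
    b = np.asarray(b, dtype=float)
    # Exponent for category k is the running sum of a * (theta - b_v) for v <= k.
    z = np.concatenate(([0.0], np.cumsum(a * (theta - b))))
    ez = np.exp(z - z.max())                   # subtract the max for numerical stability
    return ez / ez.sum()                       # each term divided by the total

p_grm = grm_probs(0.0, 1.5, [-1.5, -0.5, 0.5, 1.5])
p_gpcm = gpcm_probs(0.0, 1.5, [-1.5, -0.5, 0.5, 1.5])
```

Both functions return one probability per response category (five categories for four thresholds), and each vector sums to 1.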

Local independence

Local independence is also a necessary assumption of IRT models. It means that, when controlling for trait level, the response to any item is unrelated to the response to any other item (Embretson & Reise, 2000). In other words, no other underlying factors explain the response behavior. Yen’s Q3 statistic (Yen, 1993) was used to test local independence, and Q3 values higher than 0.36 were taken to indicate local dependence (Flens, Smits, Carlier, van Hemert, & de Beurs, 2016). Therefore, items with a Q3 larger than 0.36 were removed from the item pool.
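Q3 is the correlation, across respondents, between the residuals of two items after the model-expected scores have been subtracted. A minimal sketch follows, assuming the expected scores have already been computed from a fitted model; here they are simply supplied as an array, and the data are simulated for illustration.

```python
import numpy as np

def q3_matrix(observed, expected):
    """Yen's Q3: correlations between item residuals (observed minus expected)."""
    resid = np.asarray(observed, dtype=float) - np.asarray(expected, dtype=float)
    return np.corrcoef(resid, rowvar=False)

def flag_dependent_pairs(q3, cutoff=0.36):
    """Return item pairs (i, j), i < j, whose Q3 exceeds the cutoff."""
    n = q3.shape[0]
    return [(i, j) for i in range(n) for j in range(i + 1, n) if q3[i, j] > cutoff]

# Hypothetical residuals: items 0 and 1 share extra variance, items 2 and 3 do not.
rng = np.random.default_rng(1)
resid = rng.normal(size=(1000, 4))
resid[:, 1] = 0.8 * resid[:, 0] + 0.6 * resid[:, 1]
expected = np.full((1000, 4), 2.0)
observed = expected + resid
flagged = flag_dependent_pairs(q3_matrix(observed, expected))
```

The sketch flags on the signed value, as the text describes; some applications use the absolute value instead.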

Item fit

The item-fit test was used to determine whether each item fitted the IRT model, and it was performed using the S-χ² statistic (Orlando & Thissen, 2003). Items whose S-χ² p values were less than .01 were eliminated from the original item bank (Flens et al., 2016).

Item discrimination parameters

In the GRM and the GPCM, which are both two-parameter models, the relation between the trait level and the item response is determined by two kinds of parameters: the discrimination parameter (a), which gives information about the discriminative ability of an item, and the item threshold parameters (b), which indicate the location or difficulty of an item. Following the criterion of Fliege, Becker, Walter, Bjorner, and Rose (2005), we deleted items with low discrimination (< .7).

Differential item functioning

DIF was analyzed to identify item bias for variables such as gender or region, in order to build unbiased item banks. DIF analyses were conducted using the polytomous logistic regression method (Swaminathan & Rogers, 1990) via the package lordif (Choi, Gibbons, & Crane, 2011). The change in McFadden’s pseudo R² was used to evaluate effect size, and the hypothesis of no DIF was rejected when the R² change was ≥ .2 (Flens et al., 2016); such items were removed from the final analysis. We evaluated DIF for region (rural, city) and gender (male, female) groups.

The IRT analyses were all done in the R package mirt (Version 1.24; Chalmers, 2012). The analyses of unidimensionality, local independence, item discrimination, item fit, and differential item functioning were repeated until all remaining items of the CAT-Shyness satisfied the above rules.

CAT-Shyness Simulated Study

After the final item bank was established, the CAT simulation was carried out. Based on the item parameters of the real CAT-Shyness item bank, the performance of the CAT-Shyness at different shyness levels was simulated to test the feasibility and soundness of the procedure and its related algorithms. The simulated shyness trait levels ranged from −3.5 to 3.5 in intervals of 0.25. At each point, 100 subjects were simulated, for a total of 2,900 subjects. All analyses were done in R (Version 3.4.1) using the catR package in RStudio (Magis & Raiche, 2011).
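The simulation design can be written down directly: a grid from −3.5 to 3.5 in steps of 0.25 has 29 points, and 100 simulees per point gives the 2,900 simulated subjects. The sketch below only builds the true trait values (the study itself used catR in R); the variable names are illustrative.

```python
import numpy as np

# 29 equally spaced trait levels: -3.5, -3.25, ..., 3.25, 3.5
theta_grid = np.linspace(-3.5, 3.5, 29)

# 100 simulated subjects at each trait level
true_theta = np.repeat(theta_grid, 100)
```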

Starting point, scoring algorithm, item selection algorithm, and stopping rule

The first step was to determine the starting point. In a CAT, item selection depends on the participant’s responses to the items already administered. At the start, however, no prior information about the participant is available. Therefore, a simple and effective method is to randomly select the first item from the final item bank (Magis & Barrada, 2017).

The second step used a scoring algorithm to estimate the latent trait score of the simulated subjects. The expected a posteriori (EAP) method was used to estimate the person parameters, for three reasons. First, this method effectively uses the information provided by the entire posterior distribution, and the EAP algorithm is highly stable. Second, it does not need iteration, so the calculation is simpler. The simplicity and stability of EAP make it a widely used method in CAT simulations (e.g., Bulut & Kan, 2012; Chen, Hou, & Dodd, 1998). Third, the accuracy of EAP estimates is higher than that of maximum likelihood estimates (MLE; e.g., Sorrel, Barrada, de la Torre, & Abad, 2020).
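To illustrate why EAP needs no iteration, the sketch below computes the posterior mean and posterior standard deviation on a fixed grid of quadrature nodes. It assumes a standard normal prior and graded response model items; the grid, the function names, and the item parameters are assumptions made for the example and are not taken from catR.

```python
import numpy as np

def grm_probs(theta, a, b):
    """P(X = k | theta) under the graded response model; theta may be an array."""
    theta = np.atleast_1d(theta)[:, None]
    p_star = 1.0 / (1.0 + np.exp(-a * (theta - np.asarray(b, dtype=float))))
    ones = np.ones((len(theta), 1))
    cum = np.hstack([ones, p_star, 0.0 * ones])
    return cum[:, :-1] - cum[:, 1:]            # shape: (n_theta, n_categories)

def eap_estimate(responses, items, nodes=np.linspace(-4, 4, 81)):
    """Return (EAP estimate, posterior SD) for the items answered so far."""
    posterior = np.exp(-0.5 * nodes ** 2)      # unnormalized standard normal prior
    for x, (a, b) in zip(responses, items):
        posterior = posterior * grm_probs(nodes, a, b)[:, x]   # multiply in each likelihood
    posterior = posterior / posterior.sum()
    theta_hat = float(np.sum(nodes * posterior))
    se = float(np.sqrt(np.sum((nodes - theta_hat) ** 2 * posterior)))
    return theta_hat, se

# Hypothetical 4-category items; categories are coded 0..3.
items = [(1.5, [-1.0, 0.0, 1.0])] * 5
high_theta, high_se = eap_estimate([3, 3, 3, 3, 3], items)
low_theta, low_se = eap_estimate([0, 0, 0, 0, 0], items)
```

The estimate is a single weighted average over the nodes, so it is always finite, even for all-lowest or all-highest response patterns where the MLE does not exist.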

The third step was to determine the item selection algorithm. The maximum Fisher information criterion (van der Linden, 1998) is the most widely used item selection algorithm in CAT programs. It aims to improve measurement accuracy, but it is also likely to lead to uneven exposure of items in the item bank and to reduce test security (Barrada, Olea, Ponsoda, & Abad, 2009). However, because Likert-type items have no correct answers and participants simply report how they usually respond, item exposure is much less of a threat to test security here. Therefore, the maximum Fisher information criterion was chosen as the item selection algorithm.
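Maximum Fisher information selection picks, among the items not yet administered, the one with the largest information at the current trait estimate. The following sketch assumes graded response model items and uses the standard information formula for polytomous items (the sum over categories of the squared derivative of the category probability divided by that probability); the function names and the three-item bank are hypothetical.

```python
import numpy as np

def grm_item_information(theta, a, b):
    """Fisher information of one graded response model item at theta."""
    p_star = 1.0 / (1.0 + np.exp(-a * (theta - np.asarray(b, dtype=float))))
    cum = np.concatenate(([1.0], p_star, [0.0]))   # cumulative curves with padding
    probs = cum[:-1] - cum[1:]                     # category probabilities
    slope = a * cum * (1.0 - cum)                  # derivative of each cumulative curve
    d_probs = slope[:-1] - slope[1:]               # derivative of each category probability
    return float(np.sum(d_probs ** 2 / probs))

def select_next_item(theta_hat, bank, administered):
    """Index of the unused item with maximum information at theta_hat."""
    best, best_info = None, -np.inf
    for idx, (a, b) in enumerate(bank):
        if idx in administered:
            continue
        info = grm_item_information(theta_hat, a, b)
        if info > best_info:
            best, best_info = idx, info
    return best

# Hypothetical bank: (discrimination, thresholds) per item.
bank = [(0.8, [-1.0, 0.0, 1.0]), (2.0, [-1.0, 0.0, 1.0]), (2.0, [2.0, 3.0, 4.0])]
first_pick = select_next_item(0.0, bank, administered=set())
```

Items with high discrimination and thresholds near the current estimate are chosen first, which is the source of the uneven exposure mentioned above.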

Finally, the stopping rules were based on the standard error (SE) of measurement. That is, the CAT stops once the SE of a participant’s trait estimate reaches the predefined value; this is also called the variable-length termination rule.

The relationship between the SE and the Fisher information can be defined as

$SE(\theta) = \frac{1}{\sqrt{\sum_{j=1}^{n} I_j(\theta)}}$

where n is the number of items the participant has answered. In this study, several stopping rules with different SEs were performed, including SE ≤ .50, SE ≤ .45, SE ≤ .40, SE ≤ .35, SE ≤ .30, SE ≤ .25 and SE ≤ .20.
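Because the SE is the reciprocal square root of the accumulated information, each stopping rule corresponds to a required amount of test information: SE ≤ .50 needs a total information of at least 4, and SE ≤ .20 needs at least 25. The sketch below makes this concrete; the function names are hypothetical, and the constant item information of 1.0 is chosen only to keep the arithmetic visible.

```python
import math

def standard_error(item_infos):
    """SE of the trait estimate from the information of the answered items."""
    return 1.0 / math.sqrt(sum(item_infos))

def items_needed(item_infos, se_target):
    """Number of items (in the given order) until SE <= se_target, or None."""
    total = 0.0
    for n, info in enumerate(item_infos, start=1):
        total += info
        if 1.0 / math.sqrt(total) <= se_target:
            return n
    return None                                 # bank exhausted before reaching the target

# Hypothetical case: every administered item contributes information 1.0.
infos = [1.0] * 30
n_for_50 = items_needed(infos, 0.50)
n_for_20 = items_needed(infos, 0.20)
```

In this toy case, tightening the rule from .50 to .20 raises the required test length from 4 to 25 items, which shows how quickly stricter SE thresholds lengthen the test.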

Properties of the item pool

To explore how well the trait levels of the simulated subjects were recovered under different stopping rules, the bias, mean absolute deviation (MAD), root mean square error (RMSE), and correlation coefficient between the subjects’ true shyness trait and the shyness trait estimated by the CAT-Shyness were investigated to determine the effectiveness of the CAT-Shyness algorithms.
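These recovery indices are simple functions of the estimation error (estimated minus true trait level). The sketch below uses the usual definitions; the function name and the four example values are hypothetical.

```python
import numpy as np

def recovery_stats(true_theta, est_theta):
    """Bias, MAD, RMSE, and correlation between true and estimated trait levels."""
    true_theta = np.asarray(true_theta, dtype=float)
    est_theta = np.asarray(est_theta, dtype=float)
    err = est_theta - true_theta
    return {
        "bias": float(err.mean()),                    # signed average error
        "mad": float(np.abs(err).mean()),             # mean absolute deviation
        "rmse": float(np.sqrt((err ** 2).mean())),    # root mean square error
        "corr": float(np.corrcoef(true_theta, est_theta)[0, 1]),
    }

stats = recovery_stats([-1.0, 0.0, 1.0, 2.0], [-0.5, 0.0, 0.5, 2.5])
```

For these four values the errors are 0.5, 0, −0.5, and 0.5, giving a bias of 0.125, a MAD of 0.375, and an RMSE of about 0.433.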

The exposure rate (ER) index was used to measure the security of the item pool: ERj = ƒj/N, where ERj is the exposure rate of item j, ƒj is the number of times that item j is selected, and N is the number of examinees. The smaller the ERj, the less often item j is exposed. The chi-squared statistic is used to reflect the overall exposure of the item bank as

$\chi^2 = \sum_{j=1}^{M} \frac{\left[ ER_j - E(ER_j) \right]^2}{E(ER_j)}$

where E(ERj) = L/M is the expected exposure rate of item j, L represents the test length, and M is the number of items in the item pool (Chang & Ying, 1999). The chi-squared index reflects the difference between the observed item exposure rate and the expected exposure rate. The smaller the chi-squared index, the safer the item pool.
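The exposure rate and the chi-squared index can be computed from a log of which items each examinee received. The sketch below is an illustration under one stated assumption: for a variable-length CAT, L is taken to be the mean test length across examinees. The function name and the toy administration logs are hypothetical.

```python
import numpy as np

def exposure_stats(administered, n_items):
    """Exposure rates ER_j and the chi-squared index for an item pool.

    administered: one list of item indices per examinee (no repeats within a list).
    """
    counts = np.zeros(n_items)
    for items in administered:
        counts[list(items)] += 1                   # f_j: times item j was selected
    n_examinees = len(administered)
    er = counts / n_examinees                      # ER_j = f_j / N
    mean_length = sum(len(items) for items in administered) / n_examinees   # L (assumed: mean length)
    expected = mean_length / n_items               # E(ER_j) = L / M
    chi2 = float(np.sum((er - expected) ** 2 / expected))
    return er, chi2

# Hypothetical logs for 4 examinees and a 4-item pool, 2 items each.
er_even, chi2_even = exposure_stats([[0, 1], [2, 3], [0, 2], [1, 3]], n_items=4)
er_skew, chi2_skew = exposure_stats([[0, 1], [0, 1], [0, 1], [0, 1]], n_items=4)
```

Perfectly even use of the pool gives a chi-squared index of 0, while always administering the same two items gives 2.0 in this example.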

The CAT-Shyness real study

In this part, we used the real participants’ data that had already been collected and used in the development of the item pool. The CAT program was again set to stop when the SE(θ) of measurement reached .50, .45, .40, .35, .30, .25, or .20. The parameter estimation method and item selection algorithm were the same as described above.

Characteristics of the CAT-Shyness

To investigate the characteristics of the CAT-Shyness, several statistics were calculated: the number of items used (mean and standard deviation), the mean standard error of the theta estimates, the marginal reliability, and Pearson’s correlation between the theta estimated by the CAT-Shyness and the theta estimated from the entire item bank. The marginal reliability is the mean reliability for all levels of theta (Smits, Cuijpers, & van Straten, 2011). The ER index was also calculated to measure the security of the item pool.
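One common way to turn standard errors into a reliability figure, assumed here rather than taken from the source, is to treat theta as having unit variance, so that the reliability at a given trait level is 1 − SE², and then to average over respondents. The function name below is hypothetical.

```python
import numpy as np

def marginal_reliability(se, trait_variance=1.0):
    """Mean of 1 - SE^2 / var(theta) across respondents (unit variance assumed by default)."""
    se = np.asarray(se, dtype=float)
    return float(np.mean(1.0 - se ** 2 / trait_variance))

# Reliability implied if every respondent stops exactly at the SE threshold.
thresholds = [0.50, 0.45, 0.40, 0.35, 0.30, 0.25, 0.20]
implied = [marginal_reliability([t]) for t in thresholds]
```

Under this assumption, the stopping rules from SE ≤ .50 to SE ≤ .20 correspond to reliabilities of .75, .7975, .84, .8775, .91, .9375, and .96, which links the IRT notion of precision back to the classical reliability coefficient.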

In addition, the number of selected items under several stopping rules was plotted as a function of the final theta estimation and test information curve. The test information shows the measurement precision of the CAT-Shyness: the larger the value, the smaller the error of the theta estimation.

Convergent-related validity of the CAT-Shyness

Convergent-related validity refers to how closely the new scale is related to other variables and other measures of the same construct (Paul, 2017). To further investigate the convergent-related validity of the CAT-Shyness, the Shy-Q (Bortnik et al., 2002), which is widely used in diagnosing shyness, was selected as the criterion scale. Pearson’s correlation between the theta estimated by the CAT-Shyness and the Shy-Q score was calculated to address the convergent-related validity of the CAT-Shyness.

Predictive utility (sensitivity and specificity) of the CAT-Shyness

The area under the receiver operating characteristic (ROC) curve (AUC) was used as an additional criterion to investigate the predictive utility (sensitivity and specificity; Smits et al., 2011) of the CAT-Shyness. A larger AUC indicates a better diagnostic effect (Kraemer & Kupfer, 2006). We used the Shy-Q (Bortnik et al., 2002) as the classification variable for shyness, and the theta estimated by the CAT-Shyness was used as the continuous variable to plot the ROC curve under each stopping rule. The meaning of the AUC sizes is shown in Table 3 (Forkmann et al., 2013).

Table 3. AUC indicator size description

Note: AUC, area under curve.

The critical value was determined by maximizing the Youden Index (YI = sensitivity + specificity − 1; Schisterman, Perkins, Liu, & Bondell, 2005). Sensitivity is the probability that a person with the condition is correctly identified, and specificity is the probability that a person without the condition tests negative. The larger the two values, the better the diagnostic performance.
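The AUC and the Youden-based cutoff can both be computed directly from the theta estimates and a binary criterion. The sketch below is a plain NumPy illustration: the AUC is computed as the probability that a randomly chosen positive case scores higher than a randomly chosen negative case (ties count one half), which equals the area under the ROC curve. The function names and the six example cases are hypothetical.

```python
import numpy as np

def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    diff = scores[labels][:, None] - scores[~labels][None, :]
    return float((diff > 0).mean() + 0.5 * (diff == 0).mean())

def youden_cutoff(scores, labels):
    """Cutoff (score >= cutoff means positive) that maximizes sensitivity + specificity - 1."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    best = None
    for t in np.unique(scores):
        sens = float((scores[labels] >= t).mean())    # true positive rate at cutoff t
        spec = float((scores[~labels] < t).mean())    # true negative rate at cutoff t
        yi = sens + spec - 1.0
        if best is None or yi > best[0]:
            best = (yi, float(t), sens, spec)
    return best[1], best[2], best[3]

# Hypothetical theta estimates and criterion classification (1 = shy on the criterion).
theta_hat = [-1.0, -0.5, 0.2, 0.4, 0.9, 1.5]
shy = [0, 0, 0, 1, 1, 1]
area = auc(theta_hat, shy)
cutoff, sens, spec = youden_cutoff(theta_hat, shy)
```

In this perfectly separated toy example the AUC is 1.0 and the selected cutoff is 0.4, with sensitivity and specificity both equal to 1.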

Parametric model measurement: reframing traditional measurement ideas in neuropsychological practice and research

Objective: Neuropsychology is an applied measurement field with its psychometric work primarily built upon classical test theory (CTT). We describe a series of psychometric models to supplement the use of CTT in neuropsychological research and test development.

Method: We introduce increasingly complex psychometric models as measurement algebras, which include model parameters that represent abilities and item properties. Within this framework of parametric model measurement (PMM), neuropsychological assessment involves the estimation of model parameters with ability parameter values assuming the role of test 'scores'. Moreover, the traditional notion of measurement error is replaced by the notion of parameter estimation error, and the definition of reliability becomes linked to notions of item and test information. The more complex PMM approaches incorporate into the assessment of neuropsychological performance formal parametric models of behavior validated in the experimental psychology literature, along with item parameters. These PMM approaches endorse the use of experimental manipulations of model parameters to assess a test's construct representation. Strengths and weaknesses of these models are evaluated by their implications for measurement error conditional upon ability level, sensitivity to sample characteristics, computational challenges to parameter estimation, and construct validity.

Conclusion: A family of parametric psychometric models can be used to assess latent processes of interest to neuropsychologists. By modeling latent abilities at the item level, psychometric studies in neuropsychology can investigate construct validity and measurement precision within a single framework and contribute to a unification of statistical methods within the framework of generalized latent variable modeling.

Keywords: Neuropsychology; construct validity; measurement precision; parametric models; psychometric theory.


Pasquale Anselmi * , Daiana Colledani and Egidio Robusto
  • Department of Philosophy, Sociology, Education and Applied Psychology, University of Padua, Padua, Italy

Three measures of internal consistency are considered: Kuder-Richardson Formula 20 (KR20), Cronbach’s alpha (α), and person separation reliability (R). KR20 and α are common measures in classical test theory, whereas R was developed in modern test theory and, more precisely, in Rasch measurement. These three measures specify the observed variance as the sum of true variance and error variance. However, they differ in the way in which these quantities are obtained. KR20 uses the error variance of an “average” respondent from the sample, which overestimates the error variance of respondents with high or low scores. Conversely, R uses the actual average error variance of the sample. KR20 and α use respondents’ test scores in calculating the observed variance. This is potentially misleading because test scores are not linear representations of the underlying variable, whereas calculation of variance requires linearity. By contrast, if the data fit the Rasch model, the measures estimated for each respondent are on a linear scale and are thus numerically suitable for calculating the observed variance. Given these differences, R is expected to be a better index of internal consistency than KR20 and α. The present work compares the three measures on simulated data sets with dichotomous and polytomous items. It is shown that all the estimates of internal consistency decrease as the skewness of the score distribution increases, with R decreasing to a larger extent. Thus, R is more conservative than KR20 and α, and prevents test users from believing a test has better measurement characteristics than it actually has. In addition, it is shown that Rasch-based infit and outfit person statistics can be used for handling data sets with random responses. Two options are described. The first one implies computing a more conservative estimate of internal consistency. The second one implies detecting individuals with random responses. When there are a few individuals with a consistent number of random responses, infit and outfit allow for correctly detecting almost all of them. Once these individuals are removed, a “cleaned” data set is obtained that can be used for computing a less biased estimate of internal consistency.
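For readers who want to see the two classical indices side by side, the sketch below computes KR20 and Cronbach’s α from a respondents-by-items score matrix using the standard formulas. The function names and the five-respondent data set are hypothetical; person separation reliability is not included because it requires Rasch person measures and their standard errors.

```python
import numpy as np

def cronbach_alpha(scores):
    """Cronbach's alpha from a respondents-by-items score matrix."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]
    item_var = scores.var(axis=0, ddof=1).sum()        # sum of item variances
    total_var = scores.sum(axis=1).var(ddof=1)         # variance of total scores
    return float(k / (k - 1) * (1.0 - item_var / total_var))

def kr20(scores):
    """KR20 for dichotomous (0/1) items."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]
    p = scores.mean(axis=0)                            # proportion endorsing each item
    total_var = scores.sum(axis=1).var(ddof=0)         # population variance, to match p * (1 - p)
    return float(k / (k - 1) * (1.0 - np.sum(p * (1.0 - p)) / total_var))

# Hypothetical 0/1 responses of five respondents to four items.
data = [[1, 1, 1, 1],
        [1, 1, 1, 0],
        [1, 1, 0, 0],
        [1, 0, 0, 0],
        [0, 0, 0, 0]]
alpha_value = cronbach_alpha(data)
kr20_value = kr20(data)
```

For dichotomous items the two functions return the same value (0.8 for this data set), since p(1 − p) is the variance of a 0/1 item and the variance convention cancels in the ratio.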


