Faking on self-report personality tests is common and a serious drawback of such tests. Many
approaches have been tried to counteract this source of error; see, e.g.,
two recent papers in the Journal of Applied Psychology (Bangerter, Roulin, & König,
2012; Fan et al., 2012).
The UPP
test (Sjöberg, 2010/2012) is a
self-report personality test and as such it is vulnerable to faking in
high-stakes testing situations. However, this test uses a simple but powerful
methodology for correcting test scores for faking. It measures separately two
social desirability (SD) dimensions, one overt (similar to the classical
Crowne-Marlowe scale (Crowne & Marlowe,
1960)) and one covert. The covert scale uses items similar to conventional
personality items but selected for their strong correlation with the overt
scale. The two scales are highly correlated and give similar results when used
to correct test scales for faking.
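The item selection behind the covert scale can be illustrated with a minimal sketch. It assumes a candidate pool of conventional personality items and keeps those whose correlation with the overt SD scale reaches a cutoff. The function name `select_covert_items`, the cutoff `min_r = 0.3`, and the simulated data are all hypothetical; the actual selection rule used for the UPP test is not described here.

```python
import numpy as np

def select_covert_items(items, overt, min_r=0.3):
    """Return indices of items correlating at least min_r with the overt SD scale.

    items : (n_respondents, n_items) array of item responses
    overt : (n_respondents,) array of overt SD scale scores
    min_r : hypothetical cutoff, not taken from the UPP documentation
    """
    items = np.asarray(items, dtype=float)
    overt = np.asarray(overt, dtype=float)
    r = np.array([np.corrcoef(items[:, j], overt)[0, 1]
                  for j in range(items.shape[1])])
    return np.flatnonzero(r >= min_r), r

# Simulated data: items 0-2 are partly driven by SD, items 3-5 are not.
rng = np.random.default_rng(1)
n = 1000
overt = rng.normal(size=n)
sd_items = 0.7 * overt[:, None] + rng.normal(size=(n, 3))
neutral_items = rng.normal(size=(n, 3))
items = np.column_stack([sd_items, neutral_items])

selected, r = select_covert_items(items, overt)
covert = items[:, selected].sum(axis=1)  # covert scale = sum of selected items
print(selected)
```

In this simulation the covert scale built from the selected items correlates highly with the overt scale, which mirrors the relationship between the two scales described above.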
The
correction procedure uses regression models where each test scale in turn is
the dependent variable and the SD scales are independent variables. It is
necessary to fit a new model for each test scale because the different scales
are related to SD in different ways, correlations varying widely. The corrected
test scales are the residuals in these regression models.
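As a minimal sketch of this residualization, the following example regresses one test scale on two SD scales by ordinary least squares and keeps the residuals. The function name `correct_for_sd`, the variable names, and the simulated data are illustrative assumptions, not the UPP implementation.

```python
import numpy as np

def correct_for_sd(scale, sd_scales):
    """Residuals of a test scale regressed on the SD scales (with intercept)."""
    scale = np.asarray(scale, dtype=float)
    sd = np.asarray(sd_scales, dtype=float)
    if sd.ndim == 1:
        sd = sd[:, None]
    X = np.column_stack([np.ones(len(scale)), sd])   # intercept + SD scales
    coef, *_ = np.linalg.lstsq(X, scale, rcond=None)  # OLS fit
    return scale - X @ coef                           # corrected scale

# Simulated data: both SD scales inflate the emotional stability score.
rng = np.random.default_rng(0)
n = 500
overt = rng.normal(size=n)
covert = 0.8 * overt + 0.6 * rng.normal(size=n)
emotional_stability = 0.5 * overt + 0.3 * covert + rng.normal(size=n)

corrected = correct_for_sd(emotional_stability,
                           np.column_stack([overt, covert]))
r_overt = np.corrcoef(corrected, overt)[0, 1]
r_covert = np.corrcoef(corrected, covert)[0, 1]
print(r_overt, r_covert)  # both zero up to floating-point error
```

A separate call to `correct_for_sd` would be made for each test scale, since each scale has its own regression weights on the SD scales.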
By construction, this procedure
gives corrected test scales that correlate zero with SD. So far, so good, but
does it also work? In other words, can it be validated on empirical data? One
way to validate it is to study groups tested under different levels of
involvement, from incumbents, for whom test results have no consequences, to
applicants, for whom the consequences are very important. In a recent
study of applicants to the officers' training program in the Swedish Army, I
had a chance to study this question, using the UPP test and its SD scales. (Previous
studies had given similar results.) Data were available for five groups:
A. Norm
B. Incumbents
C. Applicants (low consequences of test results)
D. Applicants (moderate consequences)
E. Applicants (high-stakes testing)
I expected SD scale values to increase in the order A to E. I also expected test
scales that are sensitive to SD, such as emotional stability, to show the
same rank order. Finally,
I expected the group differences in emotional stability to vanish if the test
data were corrected for faking using the two SD scales (and a multiple regression
model). For the results, see Figs. 1 and 2 below, and Table 1.
Table 1. Mean values of emotional stability (standardized scales), uncorrected and corrected data, effect size and one-way ANOVA of group differences.

| Group | Before correction | Corrected for SD |
|---|---|---|
| A. Norm | -0.25 | -0.05 |
| B. Incumbents | 0.05 | 0.07 |
| C. Applicants (low consequences of test results) | 0.43 | 0.28 |
| D. Applicants (moderate consequences) | 0.56 | 0.06 |
| E. Applicants (high-stakes testing) | 0.73 | 0.11 |
| Effect size (eta squared) | 0.147 | 0.006 |
| One-way ANOVA | F(4,1638) = 70.693, p < 0.0005 | F(4,1828) = 2.763, p = 0.026 |
Note that the effect size (eta squared) decreased from 0.147 to 0.006, that is, to less than 5 % of its value before correction.
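The effect sizes in Table 1 can be cross-checked against the reported F statistics. For a one-way ANOVA, eta squared equals df_between * F / (df_between * F + df_within). The short sketch below applies this standard identity to the values in the table; the function name `eta_squared_from_f` is illustrative.

```python
def eta_squared_from_f(f_value, df_between, df_within):
    """Eta squared implied by a one-way ANOVA F statistic."""
    ss_ratio = f_value * df_between          # SS_between / MS_within
    return ss_ratio / (ss_ratio + df_within)

# F statistics and degrees of freedom as reported in Table 1.
eta_before = eta_squared_from_f(70.693, 4, 1638)
eta_after = eta_squared_from_f(2.763, 4, 1828)

print(round(eta_before, 3), round(eta_after, 3))  # 0.147 0.006
print(round(eta_after / eta_before, 3))           # 0.041
```

The ratio of about 0.04 means that the corrected effect size is roughly 4 % of the uncorrected one, in line with the figures given above.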
In other work on leader effectiveness, using 360-degree feedback as the criterion, I found that the validities of the test scales increased after correction for SD by the same method (Sjöberg, Bergman, Lornudd, & Sandahl, 2011); see Fig. 3.
In
conclusion, a simple method for correction for faking has been found to
successfully remove about 95 % of the variance due to SD in test responses, and
such a method increased the validity of the test scores against an external
criterion.
It is often argued that SD scales really measure "personality", such as need for approval, and not a tendency to distort responses. However, the present results strongly refute this view. It is very plausible that different levels of consequences of testing should lead to different levels of motivation for impression management, but unlikely that they should result in different levels of some personality dimension such as need for approval.
References
Bangerter, A., Roulin, N., & König, C. J. (2012). Personnel selection as a signaling game. Journal of Applied Psychology, 97, 719-738. doi:10.1037/a0026078
Crowne, D. P., & Marlowe, D. (1960). A new scale of social desirability independent of psychopathology. Journal of Consulting Psychology, 24, 349-354.
Fan, J., Gao, D., Carroll, S. A., Lopez, F. J., Tian, T. S., & Meng, H. (2012). Testing the efficacy of a new procedure for reducing faking on personality tests within selection contexts. Journal of Applied Psychology, 97, 866-880. doi:10.1037/a0026655
Sjöberg, L. (2010/2012). A third generation personality test (SSE/EFI Working Paper Series in Business Administration No. 2010:3). Stockholm: Stockholm School of Economics.
Sjöberg, L., Bergman, D., Lornudd, C., & Sandahl, C. (2011). Sambandet mellan ett personlighetstest och 360-graders bedömningar av chefer i hälso- och sjukvården (Relationship between a personality test and 360-degree judgments of health care managers). Stockholm: Karolinska Institute, Institutionen för lärande, informatik, management och etik (LIME).