Sidor

Friday, December 21, 2012

Norms - a crucial issue in testing



Test norms need to be specific to the user and the test context. This should be obvious, yet it is often ignored, perhaps because of the expense involved. What happens if norms are not specific?

1. A very important aspect is faking. Faking is abundant among job applicants. If norms are collected from incumbents or, even worse, from the population at large, test scores can be grossly misleading. The reason is that many applicants fake, so the distribution of their test scores is shifted towards a higher mean than for incumbents, who fake very little or not at all. As a consequence, test scores for applicants will be systematically overestimated. On a stanine scale, the error could easily be 2 or 3 steps. This problem can be greatly mitigated by a correction procedure based on one or several scales measuring the tendency to respond in a socially desirable manner. In our data, about 95 % of the effect is eliminated this way. Note, however, that the correction model must be scale specific, since scales are usually not equally vulnerable to distortion.

2. Test scores may be strongly dependent on the organizational context. In some contexts, independence is not a desired trait, and people will on average have low scores on it. Another example is perseverance in the face of failure: if failure is rarely obvious, test takers will report low perseverance. For reasons such as these, norms need to be specific to the organization.

Constructing specific norms is not excessively demanding, given modern IT, and a sample of about 300, or in some cases even 120, is sufficient. The first step is to realize the importance of specific norms, of norms corrected for impression management if they are based on incumbents or the population at large, and the fact that the sample can be fairly small. In our practice we work with such norms, but many Swedish test providers seem unaware of the issue, and of the fact that the problems can be solved with relatively modest resources.
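Constructing such norms is mostly a matter of collecting the sample; the computation itself is trivial. The following is a minimal sketch, not the procedure of any particular test provider: the cut-points are the conventional stanine percentiles, and the norm data are simulated rather than real scores.

```python
import numpy as np

def stanine_cuts(norm_sample):
    """Stanine cut-points from a norm sample of raw scores.

    Stanines split the distribution at the 4, 11, 23, 40, 60, 77, 89
    and 96 percentiles, giving nine bands with mean 5 and SD about 2.
    """
    return np.percentile(norm_sample, [4, 11, 23, 40, 60, 77, 89, 96])

def to_stanine(score, cuts):
    """Map a raw score to its stanine (1-9) given the cut-points."""
    return int(np.searchsorted(cuts, score)) + 1

# Simulated norm group of n = 300, as discussed in the text
rng = np.random.default_rng(0)
sample = rng.normal(50, 10, size=300)
cuts = stanine_cuts(sample)
print(to_stanine(float(np.median(sample)), cuts))  # -> 5, the middle stanine
```

A context-specific norm is obtained simply by drawing the sample from the relevant organization or applicant pool rather than from the population at large.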

Wednesday, August 22, 2012

Successfully dealing with faking on a self-report personality test



Faking on self-report personality tests is common and a serious drawback of such tests. Many approaches have been tried to counteract this source of error; see e.g. recent papers in the Journal of Applied Psychology (Bangerter, Roulin, & König, 2012; Fan et al., 2012).

The UPP test (Sjöberg, 2010/2012) is a self-report personality test and as such it is vulnerable to faking in high-stakes testing situations. However, this test uses a simple but powerful methodology for correcting test scores for faking. It measures separately two social desirability (SD) dimensions, one overt (similar to the classical Crowne-Marlowe scale (Crowne & Marlowe, 1960)) and one covert. The covert scale uses items similar to conventional personality items but selected for their strong correlation with the overt scale. The two scales are highly correlated and give similar results when used to correct test scales for faking.

The correction procedure uses regression models where each test scale in turn is the dependent variable and the SD scales are independent variables. It is necessary to fit a new model for each test scale because the different scales are related to SD in different ways, correlations varying widely. The corrected test scales are the residuals in these regression models.
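The procedure can be illustrated in a few lines. This is a minimal sketch with simulated data, not the actual UPP implementation: each test scale is regressed on the two SD scales and replaced by its residual.

```python
import numpy as np

def sd_corrected(scale, sd_overt, sd_covert):
    """Return a test scale corrected for social desirability (SD).

    Fits scale = b0 + b1*sd_overt + b2*sd_covert by least squares;
    the residuals, by construction, correlate zero with both SD scales.
    """
    X = np.column_stack([np.ones(len(scale)), sd_overt, sd_covert])
    beta, *_ = np.linalg.lstsq(X, scale, rcond=None)
    return scale - X @ beta

# Simulated data: a test scale contaminated by two correlated SD scales
rng = np.random.default_rng(1)
n = 500
sd1 = rng.normal(size=n)
sd2 = 0.8 * sd1 + 0.6 * rng.normal(size=n)   # overt and covert SD correlate
raw = 0.5 * sd1 + 0.3 * sd2 + rng.normal(size=n)
corrected = sd_corrected(raw, sd1, sd2)
print(abs(np.corrcoef(corrected, sd1)[0, 1]) < 1e-8)  # True: residuals are SD-free
```

Note that a separate model is fitted per scale, which is exactly why the coefficients can differ from scale to scale, as the text requires.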

This procedure gives corrected test scales which correlate zero with SD. So far, so good, but does it also work? In other words, can it be validated on empirical data? One way to validate it is to study groups tested under different levels of involvement, from incumbents, for whom test results have no consequences, to applicants, for whom the consequences are very important. In a recent study of applicants to the officers' training program in the Swedish Army, I had a chance to study this question, using the UPP test and its SD scales. (Previous studies had given similar results.) Data were available for 5 groups:

A. Norm
B. Incumbents
C. Applicants (low consequences of test results)
D. Applicants (moderate consequences)
E. Applicants (high-stakes testing)

I expected increasing SD scale values in the order A - E. I also expected SD-sensitive test scales, such as emotional stability, to show the same rank order. Finally, I expected the group differences in emotional stability to vanish if the test data were corrected for faking using the two SD scales (and a multiple regression model). For the results, see Figs. 1 and 2 below, and Table 1.


Fig. 1. Means of SD scales


Fig. 2. Means of emotional stability before and after SD correction



Table 1. Mean values of emotional stability (standardized scales), uncorrected and corrected data, effect size, and one-way ANOVA of group differences.

Group                                               Before correction                Corrected for SD
A. Norm                                                  -0.25                            -0.05
B. Incumbents                                             0.05                             0.07
C. Applicants (low consequences of test results)          0.43                             0.28
D. Applicants (moderate consequences)                     0.56                             0.06
E. Applicants (high-stakes testing)                       0.73                             0.11
Effect size (eta2)                                        0.147                            0.006
One-way ANOVA                                 F(4,1638) = 70.693, p < 0.0005    F(4,1828) = 2.763, p = 0.026

Note that the effect size decreased to about 5 % of its original value.
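For reference, eta2 is simply the between-group share of the total variance. A minimal sketch with made-up numbers (not the study data):

```python
def eta_squared(groups):
    """Eta-squared for a one-way design: SS_between / SS_total."""
    scores = [x for g in groups for x in g]
    grand = sum(scores) / len(scores)
    ss_total = sum((x - grand) ** 2 for x in scores)
    ss_between = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups)
    return ss_between / ss_total

# Three small hypothetical groups with shifted means
print(round(eta_squared([[1, 2, 3], [2, 3, 4], [4, 5, 6]]), 3))  # -> 0.7
```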

In other work on leader effectiveness, using 360-degree feedback as the criterion, I found that the validities of the test scales increased after correction for SD by the same method (Sjöberg, Bergman, Lornudd, & Sandahl, 2011), see Fig. 3.

Fig. 3. Validities of uncorrected and corrected personality scales


In conclusion, a simple method of correction for faking has been found to remove about 95 % of the variance due to SD in test responses, and the same method increased the validity of the test scores against an external criterion.

It is often argued that SD scales really measure "personality", such as need for approval, and not a tendency to distort responses. However, the present results strongly refute this view. It is very plausible that different levels of consequences of testing should lead to different levels of motivation for impression management, but unlikely that they should result in different levels of some personality dimension such as need for approval.

References

Bangerter, A., Roulin, N., & König, C. J. (2012). Personnel selection as a signaling game. [doi:10.1037/a0026078]. Journal of Applied Psychology, 97, 719-738.
Crowne, D. P., & Marlowe, D. (1960). A new scale of social desirability independent of psychopathology. Journal of Consulting and Clinical Psychology, 24, 349-354.
Fan, J., Gao, D., Carroll, S. A., Lopez, F. J., Tian, T. S., & Meng, H. (2012). Testing the efficacy of a new procedure for reducing faking on personality tests within selection contexts. [doi:10.1037/a0026655]. Journal of Applied Psychology, 97, 866-880.
Sjöberg, L. (2010/2012). A third generation personality test (SSE/EFI Working Paper Series in Business Administration No. 2010:3). Stockholm: Stockholm School of Economics.
Sjöberg, L., Bergman, D., Lornudd, C., & Sandahl, C. (2011). Sambandet mellan ett personlighetstest och 360-graders bedömningar av chefer i hälso- och sjukvården. (Relationship between a personality test and 360 degrees judgments of health care managers). Stockholm: Karolinska Institute, Institutionen för lärande, informatik, management och etik (LIME).

Tuesday, August 14, 2012

Validity of integrity tests


Traditionally the view has been that integrity tests (actually honesty tests) have very high validity, based on an early meta-analysis (Ones, Viswesvaran, & Schmidt, 1993). Skeptical comments have pointed out that many of the studies in this meta-analysis came directly from reports by test vendors. Yet the high validity of integrity tests has become an established truth, and a basis for an entire industry producing integrity tests, building on Schmidt and Hunter (1998), who wrote that the g factor plus integrity is the best basis for prediction of work performance. This is probably wrong.

A recent, updated meta-analysis clearly shows that the validities of integrity tests are no higher than 0.2, perhaps as low as 0.1 (Van Iddekinge, Roth, Raymark, & Odle-Dusseau, 2012a, 2012b), even when they are corrected for measurement error in the criteria and range restriction in the test. The earlier estimates were at the level of 0.4, i.e. higher than for standard personality tests. It appears now that the skeptics have been right: the high validities come from test providers' own data; independent research does not confirm them. A rather high validity can be obtained against self-ratings of counterproductive behavior at work, but this is not very interesting.
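For readers who want the mechanics: the two corrections mentioned are standard psychometric formulas - disattenuation for criterion unreliability, followed by the Case II formula for direct range restriction. The sketch below uses hypothetical input values, not figures from the meta-analysis.

```python
import math

def correct_validity(r_obs, criterion_reliability, u_ratio):
    """Apply the two standard psychometric corrections, in order:

    1. disattenuation for measurement error in the criterion,
       r1 = r_obs / sqrt(r_yy)
    2. Thorndike Case II correction for direct range restriction,
       r2 = r1*U / sqrt(1 + r1**2 * (U**2 - 1)),
       where U = SD(applicant population) / SD(restricted sample).
    """
    r1 = r_obs / math.sqrt(criterion_reliability)
    return r1 * u_ratio / math.sqrt(1 + r1 ** 2 * (u_ratio ** 2 - 1))

# Hypothetical numbers: observed r = 0.25, criterion reliability 0.52,
# unrestricted/restricted SD ratio 1.5
print(round(correct_validity(0.25, 0.52, 1.5), 2))  # -> 0.48
```

As the example shows, these corrections can roughly double an observed validity, which is why corrected and uncorrected estimates must never be compared directly.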

This is an example of how early meta-analyses can result in errors. Van Iddekinge et al. have published a very ambitious project, and the result is clear: integrity tests seem to have little practical value. And then we have not even discussed that such tests can easily be faked.

References

Ones, D. S., Viswesvaran, C., & Schmidt, F. L. (1993). Comprehensive meta-analysis of integrity test validities: findings and implications for personnel selection and theories of job performance. Journal of Applied Psychology Monograph, 78, 679-703.

Schmidt, F. L. & Hunter, J. E. (1998). The validity and utility of selection methods in personnel psychology: Practical and theoretical implications of 85 years of research findings. Psychological Bulletin, 124, 262-274.

Van Iddekinge, C. H., Roth, P. L., Raymark, P. H., & Odle-Dusseau, H. N. (2012a). The criterion-related validity of integrity tests: An updated meta-analysis. [doi:10.1037/a0021196]. Journal of Applied Psychology, 97(3), 499-530.

Van Iddekinge, C. H., Roth, P. L., Raymark, P. H., & Odle-Dusseau, H. N. (2012b). The critical role of the research question, inclusion criteria, and transparency in meta-analyses of integrity test research: A reply to Harris et al. (2012) and Ones, Viswesvaran, and Schmidt (2012). [doi:10.1037/a0026551]. Journal of Applied Psychology, 97(3), 543-549.

Friday, August 10, 2012

Optimal combination of personality and intelligence


Personality and intelligence are both related to job performance, but how should they be weighted for optimal results? The most straightforward approach is a linear combination, and indeed there is little evidence for other types of models. Once this is decided, the final question is what weights should be given to the two types of information in order to maximize predictive efficiency. It is well known that they tend to be uncorrelated; hence the crucial question is how valid each is in relation to job performance criteria. Intelligence, or GMA (the g factor), correlates around 0.6 with job performance (Schmidt & Hunter, 1998). "Personality" is a less stringent term and could mean many things. However, I shall take personality as referring to an optimal index of subscales, and such indices have been found to correlate around 0.55 with job performance (de Colli, 2011; Sjöberg, 2010; Sjöberg, Bergman, Lornudd, & Sandahl, 2011), after correction for measurement errors in criteria and range restriction in the independent variable (Schmidt, Shaffer, & Oh, 2008). Hence intelligence and personality, in this sense, are about equally efficient as predictors, and an evidence-based strategy is to treat them that way, with equal weights.
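In practice the equal-weight strategy amounts to standardizing each predictor and summing. A minimal sketch with invented applicant data:

```python
import statistics

def equal_weight_composite(gma, personality):
    """Unit-weighted sum of z-scores for two (roughly uncorrelated,
    roughly equally valid) predictors."""
    def z(xs):
        m, s = statistics.fmean(xs), statistics.stdev(xs)
        return [(x - m) / s for x in xs]
    return [a + b for a, b in zip(z(gma), z(personality))]

# Four hypothetical applicants: GMA and a personality index on different metrics
gma = [95, 110, 120, 105]
pers = [3.1, 4.0, 2.8, 3.9]
composite = equal_weight_composite(gma, pers)
print(composite.index(max(composite)))  # -> 1: applicant 2 ranks first overall
```

Standardizing first matters: without it, whichever predictor happens to have the larger raw-score variance would silently dominate the sum.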

It should be noted that the usual Big Five dimensions are much weaker predictors of job performance, as shown in a number of meta-analyses (Barrick, Mount, & Judge, 2001). To get an efficient personality predictor it is necessary to form an index based on focused and narrow scales (Bergner, Neubauer, & Kreuzthaler, 2010; Christiansen & Robie, 2011; Sjöberg, 2010/2012). Big Five personality tests are not sufficient for optimal prediction of job performance.


References

Barrick, M. R., Mount, M. K., & Judge, T. A. (2001). Personality and performance at the beginning of the new millennium: What do we know and where do we go next? International Journal of Selection and Assessment, 9, 9-30.
Bergner, S., Neubauer, A. C., & Kreuzthaler, A. (2010). Broad and narrow personality traits for predicting managerial success. [doi:10.1080/13594320902819728]. European Journal of Work and Organizational Psychology, 19, 177-199.
Christiansen, N. D., & Robie, C. (2011). Further consideration of the use of narrow trait scales. [doi:10.1037/a0023069]. Canadian Journal of Behavioural Science/Revue canadienne des sciences du comportement, 43, 183-194.
de Colli, D. (2011). Ett nytt svenskt arbetspsykologiskt test och arbetsprestation inom polisen – samtidig validitet. (A new Swedish work psychological test and job performance in the police: concurrent validity). Mälardalens högskola, Akademin för hållbar samhälls- och teknikutveckling.
Schmidt, F. L., & Hunter, J. E. (1998). The validity and utility of selection methods in personnel psychology: Practical and theoretical implications of 85 years of research findings. Psychological Bulletin, 124, 262-274.
Schmidt, F. L., Shaffer, J. A., & Oh, I.-S. (2008). Increased accuracy for range restriction corrections: Implications for the role of personality and general mental ability in job and training performance. Personnel Psychology, 61, 827-868.
Sjöberg, L. (2010). UPP-testet och kundservice: Kriteriestudie. (The UPP test and customer service: A criterion study). Forskningsrapport 2010:6. Stockholm: Psykologisk Metod AB.
Sjöberg, L. (2010/2012). A third generation personality test (SSE/EFI Working Paper Series in Business Administration No. 2010:3). Stockholm: Stockholm School of Economics.
Sjöberg, L., Bergman, D., Lornudd, C., & Sandahl, C. (2011). Sambandet mellan ett personlighetstest och 360-graders bedömningar av chefer i hälso- och sjukvården. (Relationship between a personality test and 360-degree judgments of health care managers). Stockholm: Karolinska Institutet, Institutionen för lärande, informatik, management och etik (LIME).

Saturday, July 21, 2012

Job interest and performance: a revised view



Is job interest of any importance to job performance? It seems very likely that it should be, but as pointed out by Nye et al. (Nye, Su, Rounds, & Drasgow, 2012), "interest measures are generally ignored in the employee selection literature" (p. 384). Part of the reason seems to be that previous meta-analytic work reported a very low correlation between interest and performance, only about 0.1 (Hunter & Hunter, 1984). However, Nye et al. criticized the often cited meta-analysis published by Hunter and Hunter and conducted a very extensive new analysis of the relation between interest and performance. They came to a different conclusion: for studies where the interest scales matched the character of the jobs, the estimated correlation was 0.36, after correction for measurement errors and indirect range restriction. They concluded that interest should be considered in selection contexts.

This is not the only example showing that earlier meta-analyses of the effectiveness of predictors of job performance may be quite misleading. A recent publication on integrity tests by Van Iddekinge et al. (2012) showed that earlier meta-analytic work (Ones et al., 1993), cited by Hunter and Hunter, grossly overestimated the validity of integrity tests.

The recent Nye et al. work is undoubtedly very important. However, even stronger results can probably be obtained with specific interest measures. Vocational interest does not measure interest in a specific job, but in a class of jobs. In the UPP test, we routinely measure interest in the specific job under consideration, either in selection or in various types of follow-up. As an example, data from a study of employees in customer service in a finance company (Sjöberg, 2010) were re-analyzed. The correlation between job (not vocational) interest and supervisor-rated performance on core job tasks was 0.55, after correction for measurement error and indirect range restriction. The specific interest measure is proximal to job performance, while vocational interest is distal and hence should be expected to have a lower correlation.

What creates interest (Sjöberg, 2006)? For a given task content, optimal challenge may be the answer. Interests are also probably somewhat elastic, i.e. you may develop a new interest under favorable circumstances (support, optimal challenge). Maybe one should try to measure not only interest but also the potential for developing interest. In a selection situation, it must be expected that interest scores are contaminated with impression management, and there is a need to correct for that factor. Alternatively, indirect measurement can be attempted, such as knowledge of facts: people who are strongly interested inform themselves about a job or area of study, and hence know more. I tried this idea in the selection of applicants to the Stockholm School of Economics, with some success.

References


Hunter, J. E., & Hunter, R. F. (1984). Validity and utility of alternative predictors of job performance. Psychological Bulletin, 96, 72-98.
Nye, C. D., Su, R., Rounds, J., & Drasgow, F. (2012). Vocational interests and performance: A quantitative summary of over 60 years of research. Perspectives on Psychological Science, 7(4), 384-403.
Ones, D. S., Viswesvaran, C., & Schmidt, F. L. (1993). Comprehensive meta-analysis of integrity test validities: findings and implications for personnel selection and theories of job performance. Journal of Applied Psychology Monograph, 78, 679-703.
Van Iddekinge, C. H., Roth, P. L., Raymark, P. H., & Odle-Dusseau, H. N. (2012). The criterion-related validity of integrity tests: An updated meta-analysis. [doi:10.1037/a0021196]. Journal of Applied Psychology, 97(3), 499-530.
Sjöberg, L. (2006). What makes something interesting? (Review of the book, Exploring the psychology of interest by Paul J. Silvia). PsycCRITIQUES, 51 (46, Article 4), No Pagination Specified.
Sjöberg, L. (2010). UPP-testet och kundservice: Kriteriestudie. (The UPP test and customer service: A criterion study). Forskningsrapport 2010:6. Stockholm: Psykologisk Metod AB.
Sjöberg, L. (2010/2012). A third generation personality test (SSE/EFI Working Paper Series in Business Administration No. 2010:3). Stockholm: Stockholm School of Economics.

Thursday, July 19, 2012

Dealing with test complexity



People have a limited ability to make complex judgments without the support of computers and explicit decision rules. This fact has been well known for many years; an often cited classic is a paper by Miller [12]. Expert judgments of many kinds, including the assessment of job applicants, have confirmed this general principle [3; 8]. There are some interesting exceptions in special cases, when experts get fast and clear feedback based on valid theory [9]. These conditions are rarely present in the assessment of job applicants.

It is common for judges to come to different conclusions when the information they use is complex and extensive - a frequent situation. Furthermore, assessments tend to vary over time. Alongside these limitations in our judgment capacity, we have a tendency to fall prey to an illusion: the more information we get, the more confident we become - but beyond a modest limit, judgments get worse as information increases. See Fig. 1.


Figure 1. Decision quality as a function of amount of information.

Most personality tests give a complicated picture of a person. This seems reasonable, since everyone "knows" that people are complicated. Popular tests provide results for 30-40 dimensions. Such an abundance of information is likely popular because of the information illusion discussed above: more information makes us more confident. Research has shown, however, that explicit rules for combining information give better results. Such a rule can simply be based on the decision maker's own systematic strategy, so-called bootstrapping [7], or on explicitly judged importance weights. The use of weights is an effective way of answering the question: "How do I interpret this test result?" The alternative approach is to use a holistic evaluation based on the pattern of results. Holism has traditionally had a strong position in the interpretation of test results, but it cannot be justified on empirical and scientific grounds [14].
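Bootstrapping in Goldberg's sense [7] means regressing the judge's own holistic ratings on the cues the judge saw, then applying the fitted model instead of the judge. A simulated sketch (the cue structure and noise level are invented for illustration):

```python
import numpy as np

def bootstrap_judge(cues, holistic_ratings):
    """Fit the judge's implicit linear policy; returns [intercept, weights...]."""
    X = np.column_stack([np.ones(len(cues)), cues])
    beta, *_ = np.linalg.lstsq(X, holistic_ratings, rcond=None)
    return beta

# A simulated judge who weights two cues 0.7 / 0.3 but adds random noise
rng = np.random.default_rng(2)
cues = rng.normal(size=(200, 2))
ratings = 0.7 * cues[:, 0] + 0.3 * cues[:, 1] + 0.5 * rng.normal(size=200)
beta = bootstrap_judge(cues, ratings)
print(np.round(beta[1:], 1))  # recovered weights, close to the true 0.7 and 0.3
```

The model strips out the judge's random inconsistency while keeping his or her systematic policy, which is why it typically outperforms the judge it was built from.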

Subjective interpretation typically results in narrative texts which may be very credible, due to a number of psychological factors. Such factors have been discussed as enabling "cold reading", i.e. credible inferences about a person which lack factual basis [13]. Historical examples show how the credibility of the Rorschach test was established by "wizards" who could seemingly produce surprisingly correct statements about a person on the basis of responses to that test [18], in spite of the fact that this test, as well as other projective techniques, has been found to lack validity [6; 10]. I give two examples of research that illustrate how illusory credibility may be established.
The Forer effect. Flattering texts, full of statements that are generally true and that assert "both A and its opposite B", are perceived as very accurate. Forer showed this in a classic study long ago [5]; the results have been replicated many times [4; 16].

Forer gave a group of students a "test" which he said would reveal their personalities. After some time he returned with narrative texts said to be based on the responses to the test. Each student got his or her own text, but the texts were all identical. The students were asked to judge how well the texts described their personalities. About 90 % said that the texts fitted very well. Here is what they got (a typical astrological-style reading):

"You have a need for other people to like and admire you, and yet you tend to be critical of yourself. While you have some personality weaknesses you are generally able to compensate for them. You have considerable unused capacity that you have not turned to your advantage. Disciplined and self-controlled on the outside, you tend to be worrisome and insecure on the inside. At times you have serious doubts as to whether you have made the right decision or done the right thing. You prefer a certain amount of change and variety and become dissatisfied when hemmed in by restrictions and limitations. You also pride yourself as an independent thinker; and do not accept others' statements without satisfactory proof. But you have found it unwise to be too frank in revealing yourself to others. At times you are extroverted, affable, and sociable, while at other times you are introverted, wary, and reserved. Some of your aspirations tend to be rather unrealistic. "

MBTI and PPA excel in using statements of this type, and they provide popular reading for those who have taken the tests. They are perceived to be almost perfectly accurate and to give self-insights, but they simply flatter [15] and/or confirm already existing self-beliefs. Once credibility is established, the tester can give important advice about selection, team composition and personal development. No research exists which shows such advice to be useful, but since the test report is so persuasive, the advice is probably also believed.

The "draw-a-man" effect. The draw-a-man test is credible to many users although it has no demonstrated validity [17]. This is because of common-sense thinking about what various aspects of a drawing could mean. For example: large muscles mean problems with male self-image, large eyes imply paranoid tendencies, etc. In addition, there is selective memory for the cases that supported these speculations; the others are forgotten or explained away [1; 2].

The UPP test deals with complexity by means of aggregate variables, which are linear composites of selected subscales. Extensive research over a period of 50 years has shown that this approach is superior to subjective integration of information [8; 11]. For a review of work on UPP, click here.

References

[1]. Chapman, L. J., & Chapman, J. P. (1967). Genesis of popular but erroneous psychodiagnostic observations. Journal of Abnormal Psychology, 73, 193-204.

[2]. Chapman, L. J., & Chapman, J. P. (1969). Illusory correlation as an obstacle to the use of valid psychodiagnostic signs. Journal of Abnormal Psychology, 74, 271-280.

[3]. Dawes, R. M., Faust, D., & Meehl, P. E. (1989). Clinical versus actuarial judgment. Science, 243, 1668-1674.

[4]. Dickson, D. H., & Kelly, I. W. (1985). The 'Barnum effect' in personality assessment: A review of the literature. Psychological Reports, 57, 367-382.

[5]. Forer, B. R. (1949). The fallacy of personal validation: a classroom demonstration of gullibility. Journal of Abnormal & Social Psychology, 44, 118-123.

[6]. Garb, H. N., Lilienfeld, S. O., & Wood, J. M. (2004). Projective techniques and behavioral assessment. In S. N. Haynes & E. M. Heiby (Eds.), Comprehensive handbook of psychological assessment, Vol. 3: Behavioral assessment (pp. 453-469). Hoboken, NJ, US: John Wiley & Sons Inc.

[7]. Goldberg, L. R. (1970). Man versus model of man: A rationale plus some evidence for a method of improving clinical inferences. Psychological Bulletin, 73, 422-432.

[8]. Grove, W. M., & Meehl, P. E. (1996). Comparative efficiency of informal (subjective, impressionistic) and formal (mechanical, algorithmic) prediction procedures: The clinical-statistical controversy. Psychology, Public Policy, and Law, 2, 293-323.

[9]. Kahneman, D., & Klein, G. (2009). Conditions for intuitive expertise: A failure to disagree. [doi:10.1037/a0016755]. American Psychologist, 64, 515-526.

[10]. Lilienfeld, S. O., Wood, J. M., & Garb, H. N. (2000). The scientific status of projective techniques. Psychological Science in the Public Interest, 1, 27-66.

[11]. Meehl, P. E. (1954). Clinical versus statistical prediction: A theoretical analysis and a review of the evidence. Minneapolis: University of Minnesota Press.

[12]. Miller, G. A. (1956). The magical number seven, plus or minus two: Some limits on our capacity for processing information. Psychological Review, 63, 81-97.

[13]. Rowland, I. (2005). The full facts book of cold reading, 4th edition. London: Full Facts Books.

[14]. Ruscio, J. (2002). The emptiness of holism. Skeptical Inquirer, 26, 46-50.

[15]. Thiriart, P. (1991). Acceptance of personality test results. Skeptical Inquirer, 15, 166-172.

[16]. Trankell, A. (1961). Magi och förnuft i människobedömning. Stockholm: Bonnier.

[17]. Willcock, E., Imuta, K., & Hayne, H. (2011). Children’s human figure drawings do not measure intellectual ability. [doi:10.1016/j.jecp.2011.04.013]. Journal of Experimental Child Psychology, 110, 444-452.

[18]. Wood, J. M., Nezworski, M. T., Lilienfeld, S. O., & Garb, H. N. (2003). What's wrong with the Rorschach?: Science confronts the controversial inkblot test. San Francisco, CA, US: Jossey-Bass.


Tuesday, June 26, 2012

Are more "significant" results published than researchers actually found?

It is a common suspicion that non-significant results go unpublished, and that the scientific literature therefore gives a misleading picture of how strong relationships actually are ("the file drawer problem"). Such a bias would distort meta-analyses, which normally are based only on published work. In a recent article, however, Dalton et al. (2012) examined both a large number of published works and many unpublished doctoral dissertations. The results are striking: there are about as many significant relationships in both cases. On the basis of this extensive study, one may conclude that the file drawer problem does not exist, or at least is considerably less serious than hitherto believed.

Reference

Dalton, D. R., Aguinis, H., Dalton, C. M., Bosco, F. A., & Pierce, C. A. (2012). Revisiting the file drawer problem in meta-analysis: Assessment of published and nonpublished correlation matrices. Personnel Psychology, 65(2), 221-249.

Thursday, June 21, 2012

The validity of integrity tests


Traditionally, integrity tests (actually tests of honesty) have been considered to have very high validity, on the basis of an early meta-analysis (Ones, Viswesvaran, & Schmidt, 1993). Skeptical comments have pointed out that a large share of the studies this analysis was based on were unpublished and came directly from reports by the test vendors. Nevertheless, this has become an established truth, and a foundation for an entire industry producing integrity testing, building on Schmidt and Hunter (1998), who wrote that the g factor plus integrity is the best basis for predicting work performance. This is probably wrong.

A recent, updated meta-analysis clearly shows that the validities of integrity tests are no higher than 0.2, perhaps as low as 0.1 (Van Iddekinge, Roth, Raymark, & Odle-Dusseau, 2012a, 2012b), even when they are corrected for measurement error in the criteria and restricted range in the tests. The earlier estimates were at the level of 0.4, i.e. higher than for standard personality tests. It seems that the skeptics were right: the high validities come from the test providers' own data, and independent research does not confirm them. A rather high validity can be obtained against self-ratings of counterproductive work behavior, but this is of little interest. With ratings by others as the criterion, validities are around 0.1. Schmidt and Hunter estimated the validity at 0.41, which now appears strongly misleading.

This is an example of how early meta-analyses can lead us astray. Van Iddekinge et al. have carried out an enormously ambitious piece of work, and the result is clear: integrity tests seem to have no appreciable practical value. And we have not even discussed the fact that such tests, like all tests, can be faked.

References

Ones, D. S., Viswesvaran, C., & Schmidt, F. L. (1993). Comprehensive meta-analysis of integrity test validities: findings and implications for personnel selection and theories of job performance. Journal of Applied Psychology Monograph, 78, 679-703.
Schmidt, F. L., & Hunter, J. E. (1998). The validity and utility of selection methods in personnel psychology: Practical and theoretical implications of 85 years of research findings. Psychological Bulletin, 124, 262-274.
Van Iddekinge, C. H., Roth, P. L., Raymark, P. H., & Odle-Dusseau, H. N. (2012a). The criterion-related validity of integrity tests: An updated meta-analysis. [doi:10.1037/a0021196]. Journal of Applied Psychology, 97(3), 499-530.
Van Iddekinge, C. H., Roth, P. L., Raymark, P. H., & Odle-Dusseau, H. N. (2012b). The critical role of the research question, inclusion criteria, and transparency in meta-analyses of integrity test research: A reply to Harris et al. (2012) and Ones, Viswesvaran, and Schmidt (2012). [doi:10.1037/a0026551]. Journal of Applied Psychology, 97(3), 543-549.

Sunday, June 3, 2012

Cognitive ability tests and academic performance at the Stockholm School of Economics

Cognitive ability tests often have good predictive value for academic performance. One example can be found in a report I wrote on commission from the Stockholm School of Economics quite a long time ago, but the results probably still hold; see the report here. The report reflects the different perspectives of the 1980s and of today. It is striking how important the g factor was in these data, and how a simple combination of 7 tests with equal weights was comparable to a multiple regression model.

Reference

Sjöberg, L. (2012). Begåvningstest vid urval av sökande till Handelshögskolan i Stockholm. (Cognitive ability tests in the selection of applicants to the Stockholm School of Economics). Stockholm: Psykologisk Metod AB.

Need for control led to dismissal

In Dagens Industri of June 1, 2012, I read that "Leadership style brought down Södra's CEO". The chairman is quoted: "There has been a great need for control."

This is an interesting case, because the UPP test is one of the few personality tests (the only one?) that measures precisely need for control, with a validated and normed scale. (Plus a number of other relevant traits, all corrected for faking good. See this article in Dagens Industri, here.) A large number of management candidates have been tested under high-stakes conditions and constitute a suitable norm group, but other norms can also be used. Read more about the test here.

Personality is to a large extent decisive for whether senior executives succeed or fail; see an excellent review by Hogan et al. here.

Not everyone has the personality required for leadership that must function both socially and economically. Tests can provide valuable information about a candidate, even though they never give a definitive answer. We do know, however, that some tests have hardly any relationship at all with job performance; this applies, among others, to the popular Big Five tests, see a review here.

Monday, May 28, 2012

Open access: Textbook on research methodology

More and more information is becoming freely available online. Here is an ambitious textbook on research methodology that has just been published:

Click here.

Information about impression management can improve the quality of test data


Scales for measuring impression management ("faking good") have often been questioned. Here is a study that used such scales to decide on retesting, which was found to yield more accurate results.


Self-report personality questionnaires often contain validity scales designed to flag individuals who intentionally distort their responses toward a more favorable characterization of themselves. Yet, there are no clear directives on how scores on these scales should be used by administrators when making high-stakes decisions about respondents. Two studies were conducted to investigate whether administrator-initiated retesting of flagged individuals represents a viable response to managing intentional distortion on personality questionnaires. We explored the effectiveness of retesting by considering whether retest responses are more accurate representations of a flagged individual's personality characteristics. A comparison of retest scores to a baseline measure of personality indicated that such scores were more accurate. Retesting should only work as a strategy for dealing with intentional distortion when individuals choose to respond more accurately the second time. Thus, we further explored the emotional reaction to being asked to retest as one possible explanation of why individuals who engage in intentional distortion respond more accurately upon retest.


Reference
Ellingson, J. E., Heggestad, E. D., & Makarius, E. E. (2012). Personality retesting for managing intentional distortion. Journal of Personality and Social Psychology, 102, 1063-1076. doi:10.1037/a0027327

Friday, April 13, 2012

Leadership and the dark side of personality

Harms et al. (2011) have published a very ambitious attempt to validate the HDS test against leadership in a military setting. It is one of very few studies that have tried to validate the HDS. See the excellent literature review in the article.

Officer cadets were followed through 4 years of training, during which their leadership ability was rated each year. HDS data were collected and related (correlations) to leadership and to the development of leadership.

The HDS is a widely used test, also in Sweden, for measuring the dark side, i.e., clinical or subclinical syndromes such as passive aggression. What is measured is thus not full-blown clinical problems but tendencies toward them, which can vary in strength. These are definitely dimensions beyond the Big Five, but unfortunately Harms et al. did not collect Big Five data, nor did they attempt to study impression management, which is of course important in the HDS, as in all self-report tests.

All 11 HDS scales should be negatively related to leadership ability, if the test succeeds in measuring what it is intended to measure. What were the results?

Individual scales had weak relationships with leadership. Seven of the 11 scales had no relationship at all with the criteria. In the remaining 4 cases there was a tendency toward negative relationships, but for some scales the correlations turned out positive. Perhaps such results were due to the negative effects appearing only in the longer term, the authors suggest, but that remains mere speculation as long as no such data exist. There may still be something to this suspicion, since a meta-analysis found clear relationships between narcissism, Machiavellianism, and counterproductive behavior among non-leaders (O'Boyle et al., 2011). The pattern of relationships between subclinical tendencies and leadership, however, provides no support for using the HDS to measure the dark side as intended. The table below, based on their published results, gives the details.


Correlations between HDS scales and leadership in the U.S. Army, Harms et al. (2011); 12 criteria with similar trends
Clinical concept (DSM-IV) | Content | Mean correlation
Borderline | Moody and socially inconsistent | -0.03
Paranoid | Skeptical, suspicious | -0.18
Avoidant | Negative toward change | 0.08
Schizoid | Withdrawn, does not understand others | -0.03
Passive-aggressive | Self-directed, ignores others' demands | -0.04
Narcissistic | Extreme self-regard, intolerant of criticism | 0.06
Antisocial | Enjoys taking risks and testing limits | -0.08
Histrionic | Dramatic, attention-seeking | 0.06
Schizotypal | Unusual thinking, sometimes creative | -0.23
Obsessive-compulsive | Very meticulous and critical | 0.16
Dependent | Eager to please, dependent on others' approval | 0.16


Harms et al. did manage to show fairly high relationships between leadership and the HDS in multiple regression analyses, about 15% explained variance, corresponding to a multiple correlation of about 0.4. This is interesting but of rather little use to the practitioner. The analysis derives weights that can be negative or positive, and any prediction of leadership one is forced to make must be based on an ad hoc model that does not agree with theory and is only partly psychologically intelligible. The weights depend in a complicated way on the entire pattern of correlations. The field is opened to speculation whose predictive value is unknown and doubtful. We are faced with the question of the validity of subjective judgments based on test data or other quantitative data, and such judgments have been well studied in many fields, with weak results (Dawes, Faust, & Meehl, 1989; Grove & Meehl, 1996).
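The step from the explained-variance figure to the multiple correlation is just a square root, which is easy to verify:

```python
import math

# Explained variance reported in the multiple regression analyses.
r_squared = 0.15

# The multiple correlation R is the square root of R-squared.
multiple_r = math.sqrt(r_squared)
print(round(multiple_r, 2))  # about 0.39, i.e., "around 0.4"
```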


The HDS is a creative pioneering effort but appears to have come into widespread practical use far too early. More empirical research and theoretical analysis of the dark side of personality in relation to leadership and other work-relevant dimensions is needed. As things stand, its practical value seems weak.

References


Dawes, R. M., Faust, D., & Meehl, P. E. (1989). Clinical versus actuarial judgment. Science, 243, 1668-1674.

Grove, W. M., & Meehl, P. E. (1996). Comparative efficiency of informal (subjective, impressionistic) and formal (mechanical, algorithmic) prediction procedures: The clinical-statistical controversy. Psychology, Public Policy, and Law, 2, 293-323.


Harms, P. D., Spain, S. M., & Hannah, S. T. (2011). Leader development and the dark side of personality. The Leadership Quarterly, 22(3), 495-509. doi:10.1016/j.leaqua.2011.04.007


O'Boyle Jr, E. H., Forsyth, D. R., Banks, G. C., & McDaniel, M. A. (2011). A meta-analysis of the dark triad and work behavior: A social exchange perspective. Journal of Applied Psychology. Advance online publication. doi:10.1037/a0025679

Thursday, April 5, 2012

Work samples

It is a widespread belief that work samples are the best predictor of job performance. See, however, a critical article by Roth et al., who write in their abstract:


"Work sample tests have been used in applied psychology for decades as important predictors of job performance, and they have been suggested to be among the most valid predictors of job performance. As we examined classic work sample literature, we found the narrative review by Asher and Sciarrino (1974) to be plagued by many methodological problems. Further, it is possible that data used in this study may have influenced the results (e.g., r = .54) reported by Hunter and Hunter in their seminal work in 1984. After integrating all of the relevant data, we found an observed mean correlation between work sample tests and measures of job performance of .26. This value increased to .33 when measures of job performance (e.g., supervisory ratings) were corrected for attenuation. Our results suggest that the level of the validity for work sample tests may not be as large as previously thought (i.e., approximately one third less than previously thought). Further, our work also summarizes the relationship of work sample exams to measures of general cognitive ability. We found that work sample tests were associated with an observed correlation of .32 with tests of general cognitive ability."


Work samples thus appear, on average, to have validity at the same rather modest level as common personality tests, around 0.3. Perhaps they can add something beyond such tests, but a better choice is probably a more effective test, which does exist. Work samples also take time and cost far more than tests.
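The step from .26 to .33 in the abstract is the standard correction for attenuation: the observed correlation is divided by the square root of the criterion's reliability. A minimal sketch (the reliability value of about .62 is back-calculated from the two reported figures, not taken from the article):

```python
import math

def disattenuate(r_observed: float, criterion_reliability: float) -> float:
    """Correct an observed validity coefficient for unreliability in
    the criterion measure (e.g., supervisory ratings)."""
    return r_observed / math.sqrt(criterion_reliability)

# Reproduces the abstract's figures: .26 observed, about .33 corrected.
print(round(disattenuate(0.26, 0.62), 2))
```

Note that the correction is applied only for criterion unreliability here; correcting for predictor unreliability as well would inflate the estimate beyond what any operational test could achieve.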




Reference


Roth, P. L., Bobko, P., & McFarland, L. A. (2005). A meta-analysis of work sample test validity: Updating and integrating some classic literature. Personnel Psychology, 58(4), 1009-1037. doi:10.1111/j.1744-6570.2005.00714.x

Thursday, March 8, 2012

Comparison of applicants and admitted candidates

This report describes analyses of data from UPP test administrations with applicants to, and candidates admitted to, officer training at the Swedish Defence University (Försvarshögskolan). Applicants were compared with admitted candidates with respect to impression management. The difference in impression management was very large (the applicants engaged in far more of it), and so was the difference in test dimensions such as emotional stability and positive attitude. Compared with norm data, both applicants and admitted candidates scored higher on impression management. After correction of the test data for impression management, no significant or otherwise noteworthy differences remained. These results support the correction model for impression management used in the UPP test. It also turned out that the corrected test scores were not skewed with many values piled up in the upper part of the scale, as was the case with uncorrected data, and that the rank ordering of applicants differed strongly depending on whether the test data were corrected or not. Read the report here.
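A correction of this kind is commonly implemented by partialling the impression-management (social desirability, SD) scale out of each trait scale. The following is a minimal sketch of that general idea, with entirely simulated numbers; it is not the UPP test's own proprietary model:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 300

# Simulated applicant data: the observed trait score mixes the true
# trait with a faking component that also drives the SD scale.
true_trait = rng.normal(0.0, 1.0, n)
faking = rng.normal(0.0, 1.0, n)
sd_scale = 0.8 * faking + rng.normal(0.0, 0.4, n)
observed = true_trait + 0.7 * faking

# Scale-specific correction: regress the observed score on the SD
# scale and keep the residual, re-centered on the original mean.
slope, _ = np.polyfit(sd_scale, observed, 1)
corrected = observed - slope * (sd_scale - sd_scale.mean())

# The corrected score tracks the true trait more closely than the
# raw observed score does, and is uncorrelated with the SD scale.
r_obs = np.corrcoef(observed, true_trait)[0, 1]
r_corr = np.corrcoef(corrected, true_trait)[0, 1]
print(round(r_obs, 2), round(r_corr, 2))
```

Because the regression slope is estimated per scale, this automatically respects the point that scales differ in how vulnerable they are to distortion.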

Reference

Sjöberg, L. (2012). Skönmålning på ett personlighetstest bland sökande och antagna till officersutbildning. Rapport 2012:1. Stockholm: Psykologisk Metod AB.



Sunday, February 26, 2012

The UPP test at the Swedish Defence University

A study has been carried out with applicants to, and participants in, officer training, in which the UPP test was compared with the CTI (Commander Trait Inventory). Like the UPP, the CTI is a self-report test, but otherwise the content of the two tests is very different. The CTI is largely based on Jung's type theory and also contains a large element of psychopathy measurement. It is an interesting and creative attempt to approach the problem of measuring personality for officer training. Unfortunately one can say, albeit with hindsight, that the chance of success was small, for at least these three reasons:

1. Jung's theory has never been shown to be useful for predicting behavior in working life. Very extensive experience with the Myers-Briggs test supports this claim (Sjöberg, 2005).
2. Psychopathy is the part of the "dark triad" that has been shown to have the least predictive value in working life. Better results have been obtained with narcissism, another part of the "dark triad", which is included in the UPP test (O'Boyle et al., 2011).
3. In applications for jobs or training there is a large risk of impression management. The CTI does not handle that risk, but the UPP test does, with considerable success (Sjöberg, 2009, 2011): more than 90% of the effects of impression management are eliminated.

In the present study, the CTI and the UPP were compared against so-called proxy criteria, i.e., dimensions that can on good grounds be expected to be related to job performance (Sjöberg, Bäccman, & Gustavsson, 2012). The UPP turned out to be roughly 3 times more effective than the CTI. In addition, the UPP was rated very positively by those who took the test, while no such data were available for the CTI.

This research continues.

References

O'Boyle Jr, E. H., Forsyth, D. R., Banks, G. C., & McDaniel, M. A. (2011). A meta-analysis of the dark triad and work behavior: A social exchange perspective. Journal of Applied Psychology. Advance online publication. doi:10.1037/a0025679


Sjöberg, L. (2005). En kritisk diskussion av Myers-Briggs testet. (A critical discussion of the Myers-Briggs test). Organisational Theory & Practice. Scandinavian Journal of Organisational Psychology, 15, 21-28. Click here.


Sjöberg, L. (2009). UPP-testet: Korrektion för skönmålning. Forskningsrapport 2009:3. Stockholm: Psykologisk Metod AB. Click here.

Sjöberg, L. (2011). Ökad testvaliditet genom korrektion för skönmålning. Forskningsrapport 2011:2. Stockholm: Psykologisk Metod AB. Click here.

Sjöberg, L., Bäccman, C., & Gustavsson, B. (2012). Personlighetstestning vid antagning till FHS officersutbildning. ILM Serie T:39, 2011. Karlstad: Institutionen för ledarskap och Management, Försvarshögskolan. Click here.


Thursday, February 2, 2012

Rorschach and psychopathy

The abstract of a published meta-analysis reads:

Gacono and Meloy (2009) have concluded that the Rorschach Inkblot Test is a sensitive instrument with which to discriminate psychopaths from nonpsychopaths. We examined the association of psychopathy with 37 Rorschach variables in a meta-analytic review of 173 validity coefficients derived from 22 studies comprising 780 forensic participants. All studies included the Hare Psychopathy Checklist or one of its versions (Hare, 1980, 1991, 2003) and Exner's (2003) Comprehensive System for the Rorschach. Mean validity coefficients of Rorschach variables in the meta-analysis ranged from −.113 to .239, with a median validity of .070 and a mean validity of .062. Psychopathy displayed a significant and medium-sized association with the number of Aggressive Potential responses (weighted mean validity coefficient = .232) and small but significant associations with the Sum of Texture responses, Cooperative Movement = 0, the number of Personal responses, and the Egocentricity Index (weighted mean validity coefficients = .097 to .159). The remaining 32 Rorschach variables were not significantly related to psychopathy. The present findings contradict the view that the Rorschach is a clinically sensitive instrument for discriminating psychopaths from nonpsychopaths.

Reference

Wood, J. M., Lilienfeld, S. O., Nezworski, M. T., Garb, H. N., Allen, K. H., & Wildermuth, J. L. (2010). Validity of Rorschach Inkblot scores for discriminating psychopaths from nonpsychopaths in forensic populations: A meta-analysis. Psychological Assessment, 22(2), 336-349. doi:10.1037/a0018998

Wednesday, February 1, 2012

Test review: HDS

The dark side of personality: the HDS test

Most personality tests used in Swedish working life measure the normal personality. In some cases, such as the OPQ, MBTI, and PPA (Thomas), the tests have deliberately been designed to give a positive picture of every test taker. The UPP test starts from a positive view of people within the normal range but also acknowledges that there are "subclinical" variants of dimensions such as passive aggression, perfectionism, and narcissism; these are measured with dedicated scales. The HDS takes a more radical approach and measures only such aspects of the "dark side". Attempting this is important, not least in view of the current discussion of the "dark triad". In a study of HDS data collected in a high-stakes setting I found:


•    The HDS is an attempt to measure a number of different aspects of the "dark side" of personality. The 11 scales have a 2-dimensional structure. They are strongly correlated with FFM factors and impression management, with multiple correlations of about 0.40 to 0.65. If one controls for the FFM and impression management, no clear structure remains in the HDS scales.

•    The reliability of the HDS subscales was, moreover, low; up to 50% of the variance was error variance.

•    Error variance, impression management, and the "Big Five" accounted for about 75% of the variance of the HDS scales.
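The error-variance figure follows directly from classical test theory, where a reliability coefficient is the proportion of true-score variance, so the error share is simply one minus the reliability. A quick sketch:

```python
def error_variance_share(reliability: float) -> float:
    """Classical test theory: reliability = true-score variance /
    total variance, so the remainder of the variance is error."""
    return 1.0 - reliability

# A scale with a reliability of .50 is half error variance.
print(error_variance_share(0.50))
```

This is why alpha coefficients around .50, like those cited in the reviews below, translate directly into the "up to 50% error variance" statement.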


In an independent review (the Buros Institute, see http://www.unl.edu/buros/), two independent, scientifically well-qualified reviewers, who are not anonymous, level heavy criticism at the test:


–    "All but 4 of these 11 reliability coefficients are below the minimum level of .70." (Reviewer 1)

–    "The validational data relating to performance is also disappointing, both in quantity and quality." (Reviewer 1)

–    "Reliability evidence is mixed. Three-month test-retest coefficients appear acceptable; however, the research sample was limited to 60 graduate students. Alpha coefficients ranged from .50 to .78, with seven falling below .70, suggesting considerable caution with respect to the use of the scales in making decisions about individual clients." (Reviewer 2)

–    "Modest, but meaningful correlations between HDS scores and managerial behavior ratings from an unspecified mix of subordinates, peers, and supervisors are also demonstrated." (Reviewer 2)

Saville (2008) carried out a large comparative study of various commercial personality tests. He notes that on retesting with the HDS after only one week, only 8% of test takers had an unchanged profile. This fits well with other results that have shown low reliability for this test.

References
Fox, G. (2001). Review of the Hogan Development Survey. In B. S. Plake & J. C. Impara (Eds.), The fourteenth mental measurements yearbook. Lincoln, NE: Buros Institute of Mental Measurements.

Huebner, E. S. (2001). Review of the Hogan Development Survey. In B. S. Plake & J. C. Impara (Eds.), The fourteenth mental measurements yearbook. Lincoln, NE: Buros Institute of Mental Measurements.

Saville, P. (2008, January). Project Epsom: How Valid Is Your Questionnaire? Phase 1: Saville Consulting Wave®, OPQ®, Hogan Personality Inventory & Development Survey, 16PF5, NEO, Thomas International DISC, MBTI and Saville Personality Questionnaire Compared in Predicting Job Performance. Paper presented at The British Psychological Society Division of Occupational Psychology Conference: "Personality Questionnaires – Valid Inferences, False Prophecies". For the report, click here.

Tuesday, January 24, 2012

Could the UPP test be fooled?

Last autumn, readers of this blog were invited to try to fool the UPP test. Did they succeed? The answer is here.

Saturday, January 21, 2012

News on faking

O'Connell et al. (2011) have published an interesting study of faking on personality tests. In addition to a conventional social desirability (SD) scale, they measured impression management with two other methods: the CVI (based on the fact that faking inflates the correlation between otherwise unrelated scales), and a scale asking about experience with non-existent activities (a pure lie scale, in other words). The results are somewhat complex, but the SD scale worked best.

The authors are concerned about the criticism of SD scales and therefore tried to find alternatives, with limited success. That faking occurred, and was both strong and common among job applicants, was however crystal clear. Something obviously has to be done, and continued methodological development is called for. Simply denying that the problem exists is foolish, and certainly no solution.

SD scales may work against criteria of an objective type, or against ratings of how well the core tasks of the job have been performed. When it comes to socially supportive contributions at work, SD scales may correlate positively with the criterion and therefore lower the validity of personality tests if they are used for some form of correction. In jobs such as customer service, where the social element is very large, the same thing may happen.

The idea of measuring faking with a pure lie scale is interesting but requires special adaptation for each type of job. Perhaps one could use items of the type that have sometimes been used to measure non-existent attitudes.

Reference
O'Connell, M. S., Kung, M.-C., & Tristan, E. (2011). Beyond impression management: Evaluating three measures of response distortion and their relationship to job performance. International Journal of Selection and Assessment, 19(4), 340-351.

Monday, January 9, 2012

More on the OPQ and validity

The more one studies the British OPQ test, so popular in Sweden, and its validity, the more puzzled one becomes. A very ambitious compilation of a large body of data, published by Bartram (2005), suggests values of around 0.05 against various criteria. The only exception is so-called canonical correlations between sets of personality scales and sets of criterion measures, where he reaches the 0.5 level. But it is well known that this type of correlation is extremely sensitive to chance variation in the data; it is nowadays rarely used and is regarded with great skepticism (Saville, 2008). The mystery deepens: what, really, is the point of the OPQ?

References
Bartram, D. (2005). The Great Eight competencies: A criterion-centric approach to validation. Journal of Applied Psychology, 90, 1185-1203. doi:10.1037/0021-9010.90.6.1185


Saville, P. (2008, January). Project Epsom: How Valid Is Your Questionnaire? Phase 1: Saville Consulting Wave®, OPQ®, Hogan Personality Inventory & Development Survey, 16PF5, NEO, Thomas International DISC, MBTI and Saville Personality Questionnaire Compared in Predicting Job Performance. Paper presented at The British Psychological Society Division of Occupational Psychology Conference: "Personality Questionnaires – Valid Inferences, False Prophecies". Click here.