1 © AQR International 2016
MTQ48 Licensed User Training
Workshop
Module 11: Psychometric Measures
It is important that we understand exactly what a psychometric measure is.
Designed properly and used carefully, psychometric measures are extremely valuable to the
trained user.
They can help us to understand people better and, importantly, to make better
predictions about them – their potential performance, their behaviour, their
wellbeing, their impact on others, etc.
Note that we use the term “better predictions” and not “perfect predictions”.
Psychometric tests are statistical tools. They deal in probabilities, not certainties. When you have
completed a test and an output such as a score is produced, what we can say is that “people with
this score typically show a particular set of characteristics”. What this means is that, statistically, we
have a particular degree of confidence that x% of people who achieve this score exhibit the set of
characteristics associated with that score.
That doesn’t mean that every single person who achieves that score will be identical in behaviour.
Many, if not most, will behave in that way. It does mean that there should be a process to feed back and verify
results before assigning a value to them. We’ll look at feedback a little later. The results are
therefore often open to interpretation. The test site does generate a reasonably comprehensive
expert report but it still needs checking.
A simple test is to identify what you might know about the individual from other sources –
interviews and discussions, looking at the work they have carried out, comments from others, and
then compare it with the picture emerging from the use of the
test. If all the pictures from all of the sources are consistent
then you can be increasingly confident that you have a
reasonably accurate picture of the individual and you are
beginning to understand them to a good degree. If there are
inconsistencies, then it is important to probe and examine
these differences.
There are a lot of instruments available. Few actually possess good
reliability and reasonable predictive power. Ultimately the
construction of a good quality test takes a great deal of time and thought to ensure that it does
what it claims to do. Psychometric tests are critically dependent upon good design.
This requires a great deal of testing and re-testing. Developing a high quality test is a little like
drilling for oil. You know it’s probably there but you might have to drill a few holes to find it. And
sometimes you still can’t find it.
If using a psychometric test, it is important to know some fundamentals about it:
What kind of test is it? This affects the kind of outputs you can expect.
If it’s a statistical measure, is it technically reliable and valid?
Features which are essential include:
The items must be transparent – their purpose must be clear to the person answering the
question. Otherwise there is no way that you can interpret the answer.
The items must be clear, in the same way, to all the people who answer the question.
Otherwise there is no way in which you can compare results.
The items must also be relevant to the trait or quality being measured.
Increasingly users are concerned about adverse impact. Does the test
discriminate against a particular group and if so is this fair? This is often a
very difficult area to evidence one way or the other.
Psychometric tests help users (managers, coaches, trainers) to make much better decisions about
people, but they are not infallible tools. Personality tests measure habitual behaviour: what our
default response is in given situations. There are two distinct categories of tests: ipsative tests and
normative tests.
Ipsative Tests
Ipsative tests ask individuals what they believe about
themselves on given scales. But this is done against their own idea of what
each scale might represent.
So, you might be asked to suggest where you sit on a scale which measures
introversion at one end and extraversion at the other. The way that this is
done is often to make the individual choose between two options. For
example, “Would you rather read a book, go out with friends, or do both?” One
problem here is that everyone might have a different view about what
introversion and extraversion mean.
One consequence is that ipsative tests can rarely be used in recruitment and selection situations
because they can’t compare people reliably if they all have different views.
They are useful in coaching, counselling and personal development work because they are effective
at identifying what the individual thinks of themselves. They are a good starting point for a
purposeful discussion which can begin with “So you think you are…Let’s look at this a bit more
closely. What might this mean for your behaviour?”
A good example of a popular and very useful ipsative measure is MBTI – the Myers-Briggs Type
Indicator.
Normative Tests
Normative tests, on the other hand, are constructed differently.
Items are designed to assess the individual’s response to a specific
activity or situation. Responses are often gathered across a range of
options. Test designers commonly use something called a Likert
scale which has a range of options available for an answer. A widely
used scale is a 5 point scale which offers options from strongly agree,
through agree, neither agree nor disagree and disagree, to
strongly disagree. Some use 6 and 7 point scales but the principle is the
same.
Because each item is usually very specific as to meaning and interpretation and we can establish a
norm, it is possible to compare responses between different individuals. So, in addition to being
able to start a purposeful discussion, one can also differentiate between individuals. This gives
normative measures a different kind of potency and they are often used in recruitment and
selection applications.
The norm is established by testing a large number of individuals
who are representative of the population to which the tests will
be or might be applied. The sample size is selected according to a
formula. The sample should also be what is called a stratified
sample. That is, it should represent the types of people who are
to be found in the population, for example by gender, by age, by
ethnic group, etc.
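As an illustration of what a stratified sample means in practice, the short Python sketch below splits a planned sample across strata in proportion to each stratum's share of the population. It is only a sketch: the function name, the age bands and their shares are invented for the example, and it does not reproduce the sample size formula mentioned above or AQR's own norming procedure.

```python
def proportional_allocation(population_shares, sample_size):
    """Split sample_size across strata in proportion to population share.

    Uses the largest-remainder method so the counts add up exactly
    to sample_size even when the shares do not divide evenly.
    """
    raw = {name: share * sample_size for name, share in population_shares.items()}
    counts = {name: int(value) for name, value in raw.items()}
    shortfall = sample_size - sum(counts.values())
    # Give any remaining places to the strata with the largest fractional parts.
    by_remainder = sorted(raw, key=lambda name: raw[name] - counts[name], reverse=True)
    for name in by_remainder[:shortfall]:
        counts[name] += 1
    return counts


# Hypothetical age-band shares, invented purely for illustration.
age_shares = {"18-29": 0.25, "30-44": 0.35, "45-65": 0.40}
allocation = proportional_allocation(age_shares, 1000)
print(allocation)
```

The same idea extends to other strata such as gender or ethnic group, provided reliable population shares are available for them.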
Patterns of scores are then allocated to sten scores which will typically be
associated with a level of the characteristic being demonstrated by the
individual. Sten scores are explained later in the workshop but they generally run from 1 to 10.
With some characteristics, 1 might represent a low score and 10 a high score. For most personality
traits they simply represent two poles of a bipolar spectrum.
The norm enables the user to assess whether someone is higher or lower than the norm and to
what extent. That will provide guidance as to what might be the typical habitual behaviour of
someone at a particular score.
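To make the comparison against a norm concrete, the sketch below converts a raw score to a sten using the conventional definition of a sten scale (mean 5.5, standard deviation 2, limited to the range 1 to 10). The norm mean of 100 and standard deviation of 15 are hypothetical values chosen for the example. The actual scoring keys for any given measure are set out in its technical manual and may differ from this simple conversion.

```python
import math


def raw_to_sten(raw_score, norm_mean, norm_sd):
    """Convert a raw score to a sten using the conventional sten definition.

    Assumes the norm group is roughly normally distributed.
    """
    z = (raw_score - norm_mean) / norm_sd   # distance from the norm mean in SD units
    sten = math.floor(2 * z + 5.5 + 0.5)    # round to the nearest whole sten
    return max(1, min(10, sten))            # stens run from 1 to 10


# Hypothetical norm: mean 100, standard deviation 15.
for raw in (60, 92, 105, 160):
    print(raw, raw_to_sten(raw, norm_mean=100, norm_sd=15))
```

A score slightly above the norm mean lands in sten 6, one slightly below it in sten 5 or lower, and very extreme scores are capped at sten 1 or sten 10.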
If it is planned then to use the measure with different populations (such as another country or a
different age group) it is good practice to carry out an equivalency study to confirm that the norms
and distributions are still representative of the population which is now being examined. Good test
publishers will develop global norms and specific norms.
MTQ48, ILM72, Carrus and The Prevue Assessment are examples of normative measures. Other
popular measures include 15fq, 16pf, OPQ and NEO.
Sten Scales
There are two sten scales in popular use. The first is based on the normal distribution curve. This reflects the fact
that a lot of qualities found in nature, such as height, shoe size, hand size, etc., are normally
distributed.
This is represented by the so-called Bell Curve, because it is shaped like a bell.
The area under the curve represents the percentage of the population which exhibits this
score/quality.
Sten one represents approximately 2.5% of the population who would show the
significant or extreme characteristics of this end of the population. If we were looking at
height this might be the range of heights that the shortest 2.5% of the population
measure. If it is personality and the scale is introversion – extraversion, then this might
represent the 2.5% of the population who are most introverted.
Sten two represents approximately the next 4.5% and sten three represents the next 9%.
Stens 4, 5, 6 and 7 represent the next 15%, 19%, 19% and 15% of the population, in total around
68% of the population.
Technically statisticians talk about this representing one standard deviation
from the mean. In lay language, this means that this represents a large
group of people who are more similar than different. If we were looking at the height of males in
the UK, this might represent males from a height of 1.65 meters to a height of 1.80 meters. If we
looked at a group of these people casually we would probably judge them all to be of normal or
average height.
If we were looking at height then Sten 10 might represent the range of heights of the tallest 2.5% of
the population. If it is personality and the scale is introversion – extraversion, then this might
represent the 2.5% of the population who are most extraverted.
The normal distribution curve is commonly used where we are looking at individuals who are drawn
from any point in a whole population. MTQ48 (in all its forms), Carrus, and Prevue use the normal
distribution curve. As do 15fq, 16pf, OPQ and NEO.
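The percentages quoted for each sten on the normal distribution curve can be checked with a few lines of Python. The sketch below assumes the conventional sten bands, each half a standard deviation wide with the boundary between Stens 5 and 6 at the mean, and uses the standard library's NormalDist. The figures it produces are slightly more precise than the rounded values used in this workbook.

```python
from statistics import NormalDist


def sten_band_percentages():
    """Percentage of a normal population falling in each sten, 1 to 10."""
    std_normal = NormalDist()  # mean 0, standard deviation 1
    # Upper z-score boundaries of Stens 1 to 9; Sten 10 has no upper limit.
    cuts = [-2.0 + 0.5 * i for i in range(9)]
    cdf = [0.0] + [std_normal.cdf(z) for z in cuts] + [1.0]
    return [round(100 * (cdf[i + 1] - cdf[i]), 1) for i in range(10)]


percentages = sten_band_percentages()
for sten, pct in enumerate(percentages, start=1):
    print(f"Sten {sten}: {pct}%")
print("Stens 4 to 7 combined:", round(sum(percentages[3:7]), 1))
```

Stens 4 to 7 together cover the region within one standard deviation of the mean, which is where the figure of around 68% comes from.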
Sometimes the population in question and the samples are drawn from selected parts of the
population and you cannot guarantee that the shape of the curve for this population is normally
distributed.
In this case the practice is to use a Sten scoring system which simply breaks up the range of
responses into ten equal bands. So Sten 1 represents 10% of the population in question, Sten 2 the
next 10% and so on. The ILM72, because it is looking at people who are in leadership positions and
there is no reason that this group should be normally distributed, uses this Sten scoring scale. Information
about which sten scale applies to which measure can be found in the technical manual for the
measure in question.
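The equal-tenths approach can be sketched as follows. The function places a score in a sten according to the proportion of a reference group scoring below it, so that each sten holds roughly 10% of that group. The reference scores here are invented, and the way ties are treated (counting only scores strictly below) is an assumption made for the example rather than a description of how the ILM72 is actually scored.

```python
def decile_sten(score, reference_scores):
    """Sten from 1 to 10 based on which tenth of the reference group the score falls in."""
    below = sum(1 for s in reference_scores if s < score)
    # Integer arithmetic: 0-9% below gives Sten 1, 10-19% gives Sten 2, and so on.
    sten = (below * 10) // len(reference_scores) + 1
    return min(10, sten)


# Hypothetical reference group: one person at each score from 1 to 100.
reference = list(range(1, 101))
for score in (5, 50, 51, 200):
    print(score, decile_sten(score, reference))
```

Unlike the normal curve scale, this method makes no assumption about the shape of the distribution; it only ranks people within the reference group.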
On the normal distribution sten scale, Stens 8, 9 and 10 behave like Stens 3, 2 and 1: they represent the next 9%,
4.5% and 2.5% of the population after Sten 7.
Reliability
Reliability is simply the most important test for a psychometric measure. If a test is not reliable it
won’t work. A reliable measure is one which you can complete today and again in, say, four weeks’
time and get the same or very similar results.
Assuming nothing significant has happened to you in the meantime and there is no reason that you
should have changed, you should get, within reason, the same score on the second occasion as you
did on the first. If you do get a different response then the test is either faulty in some way or it is
picking up something that is changing but for which it was not designed. That is, it’s unreliable. If it
is reliable, users can rely on the information generated to help them understand better the
individual with whom they are working.
There is a technical calculation which provides a measure of reliability. The formula for calculating
reliability can be found in the technical manual. The output is a number on a scale from 0 to 1.0.
1.00 is a perfect score – that can never be achieved. There is always a little bit of natural variance.
The British Psychological Society and the US Department of Labor both provide guidance as to what
is an acceptable score for a measure to be regarded as a good measure. That score is 0.70 or
greater. Obviously the higher the score, the more reliable is the measure.
Reliability scores for AQR measures can be found in the technical manual for the measure. All AQR
measures exceed a reliability score of 0.70.
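One common way of putting a number on the test and re-test idea described above is to correlate the scores from the two sittings. The sketch below does this with a Pearson correlation on invented scores for six people. This is an assumption made for illustration: the technical manual may use a different reliability statistic, and a real study would need a far larger sample.

```python
import math


def pearson_r(first, second):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(first)
    mean_a = sum(first) / n
    mean_b = sum(second) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(first, second))
    var_a = sum((a - mean_a) ** 2 for a in first)
    var_b = sum((b - mean_b) ** 2 for b in second)
    return cov / math.sqrt(var_a * var_b)


# Hypothetical scores for the same six people, four weeks apart.
week_0 = [12, 18, 25, 31, 36, 40]
week_4 = [14, 17, 27, 30, 35, 41]

r = pearson_r(week_0, week_4)
meets_threshold = r >= 0.70  # the guidance figure quoted above
print(round(r, 2), meets_threshold)
```

If people keep roughly the same position relative to each other on both occasions, the coefficient is close to 1.0; if their positions shuffle around, it falls towards 0.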
Validity
This simply represents how effective the measure is at measuring what it is supposed to measure.
There are a number of types of validity:
Face Validity
This is a judgement about the content and presentation of the measure. If the person completing
the measure doesn’t feel that the measure is a serious measure, either because of its content – the
questions appear odd, or its appearance, then they may not respond carefully or properly.
Content Validity
Content validity is a technical question about the items (that is, the questions) used in the
questionnaire. Do they look like questions which are relevant to the purpose of the questionnaire,
and do they look like questions that will generate relevant information?
Construct Validity
Construct validity addresses a fundamental question: “does the method measure the claimed
attribute?”
There is no objective formula to assess content or construct validity. Assessment is generally by
peer review from experts.
Concurrent or Predictive Validity
Concurrent or predictive validity goes to the heart of the matter. Can the tests make predictions
about the individual in terms of the structure and content of the measure upon which we can rely?
Like reliability, concurrent validity can be assessed through a technical calculation. The process and
formula for calculating concurrent validity can be found in the technical manual.
The British Psychological Society and the US Department of Labor both
provide guidance as to what is an acceptable score for a measure to be
regarded as a good measure. For concurrent validity, that score is 0.20 or greater. Obviously the
higher the score, the stronger the measure’s predictive power. However, scores of 0.40 or greater are extremely
rare.
Concurrent validity scores for AQR measures can be found in the technical manual for the measure.
Where the data is available all AQR measures exceed a score of 0.20.
Psychometric & Test Administration
When it comes to test administration and test use, there
are some basic requirements for good practice. These are
designed to support getting the best possible co-operation
from a candidate, which normally means that
the data captured is more reliable. They also ensure that the
process is reasonable, efficient and effective from the
perspective of the user and the candidate.
The first and possibly most important consideration is to ensure that it is appropriate to use the
test for the purpose at hand. The test has been designed to assess a specific set of qualities. It can
only be useful in terms of those qualities.
When inviting a candidate to complete the measure it is equally important to explain to the
candidate:
What the test is
What is its purpose
How the information will be used
How the candidate will get feedback on results
Encourage the candidate to respond honestly. Most personality type
measures are designed to be completed with a candidate’s first instinctive
responses.
It is useful to identify the benefits for the candidate in completing a high quality psychometric
measure. It provides them with a reliable insight into aspects of their make-up. They can reflect on
that. It can provide them with an insight into how others might see them. In all instances good
quality output is valuable to anyone. It’s in everyone’s interest to be as straight as possible.
Candidates sometimes overthink their response to a question, and select a poor option for a
response. Often they are suspicious about the motives of the test user and seek to manipulate their
responses. Hence the value of being open, transparent and encouraging during test administration.
It is very good practice to provide feedback to everyone who
completes a psychometric measure. Best practice is to do this orally
and provide the individual with the opportunity to ask questions
and discuss the meaning of the output. This can be done face to
face or over the phone or through a VOIP system. Realistically this
is not always possible. Individuals may be a long way away from the
user and may not be readily accessible to the user. Sometimes the volume
of use is simply too great for everyone to receive detailed feedback. Good
practice would include providing the individual with written feedback and
an opportunity (directly or indirectly) to ask questions about the results.
All AQR tests generate a feedback report which summarises the candidate’s
results, what they mean, what might be potential implications and
sometimes what might be appropriate development actions if there is a
need for such. These are designed to be capable of being sent to a
candidate with or without direct support. This provides a practical solution
to the challenge of providing feedback in most circumstances.
There are a number of issues which commonly arise when people first look
at psychometric measures:
1. Faking
Do people fake responses to the questionnaires? Some do, but most don’t.
Faking is an issue which applies to all psychometric measures – especially good measures.
One of the features of a high quality measure, as we have seen, is that the purpose of the items
used (i.e. the questions) must be clear to all those who complete the questionnaire. Otherwise,
how do we know that everyone has completed the questionnaire with the same understanding of
the questionnaire? If that is not the case we can’t compare results and we can’t discriminate
between responses. And that’s often the whole purpose of a questionnaire – to assess and explain
differences.
If the purpose is clear, then there will be those who are tempted to respond in a socially desirable
way. That is, they will anticipate what the user is looking for and may adjust responses to try to
deliver this expectation. Mostly this doesn’t work for a whole variety of reasons. However people
still do it. Some will “fake good” – they will try to overstate their position. Others will “fake bad”
and try to understate their position. It is thought that faking is more likely to occur in a recruitment
setting than in a personal or organisational development setting. It’s not too difficult to see why
that might be.
Good and careful test design will seek to take faking into account as far as it is able to, bearing in mind
that a psychometric measure is a statistical instrument.
Some measures purport to deal with this by creating a faking or social
desirability scale which attempts to measure whether someone is
producing skewed responses and to what extent they are doing this.
Generally the view is that, for the most part, faking scales do not work.
The best way of improving the integrity of response is to practise good, open
and honest test administration. The more that individuals understand the
purpose of a questionnaire and the importance of good quality information, the more likely they
are to respond truthfully and honestly.
2. Norm Group
People will also ask about the norm group. That is a good question because
in a normative measure your scores are being compared to a norm. So the
selection of the norm is important. All AQR measures are based on a global
norm which is checked (through an equivalency study) each time the test is
to be used in a new territory or sector.
The core norm group is generally drawn from a working age population, between the ages of 18
and 65. The norm groups are created from non-self-selecting samples. In other words the norm
samples are people who generally cannot opt out of completing the questionnaire. Using samples
of volunteers can mean that the sample may not be a typical sample at all, and that creates issues
for interpreting the results.
However, AQR is also creating norm groups for its own products for specific groupings: cultural,
ethnic, age, etc. Many are available as a feature on AQR sites.
3. Cultural Sensitivity
The third area where questions are frequently asked is the area of cultural sensitivity and fairness.
The most common question is “are there differences between male and female responses?”
Globally many are interested to know if there are geographic or cultural differences. And there are.
Information about these differences can be found in the technical manuals for each product. All
those who complete the Licensed User Training program are automatically updated with
information as it emerges.
Tel: 0044 0 1244 572050
Fax: 044 0 124 572051
Email: [email protected]
Website: www.aqrinternational.co.uk