
© AQR International 2016

MTQ48 Licensed User Training

Workshop

Module 11: Psychometric Measures


It is important that we understand exactly what a psychometric measure is.

Designed properly and used carefully, psychometric measures are extremely valuable to the

trained user.

They can help us to understand people better and, importantly, to make better

predictions about them – their potential performance, their behaviour, their

wellbeing, their impact on others, etc.

Note that we use the term “better predictions” and not perfect predictions.

Psychometric tests are statistical tools. They deal in probabilities not certainties. When you have

completed a test and an output such as a score is achieved what we can say is that “people with

this score typically show a particular set of characteristics”. What this means is that statistically we

have a particular degree of confidence that x% of people who achieve this score exhibit a set of

characteristics associated with that score.

That doesn’t mean that every single person who achieves that score will be identical in behaviour.

Many, if not most, will be. It does mean that there should be a process to feed back and verify

results before assigning a value to them. We’ll look at feedback a little later. The results are

therefore often open to interpretation. The test site does generate a reasonably comprehensive

expert report but it still needs checking.

A simple test is to identify what you might know about the individual from other sources –

interviews and discussions, looking at the work they have carried out, comments from others, and

then compare it with the picture emerging from the use of the

test. If all the pictures from all of the sources are consistent

then you can be increasingly confident that you have a

reasonably accurate picture of the individual and you are

beginning to understand them to a good degree. If there are

inconsistencies, then it is important to probe and examine

these differences.


There are a lot of instruments available. Few actually possess good

reliability and reasonable predictive power. Ultimately the

construction of a good quality test takes a great deal of time and thought to ensure that it does

what it claims to do. Psychometric tests are critically dependent upon good design.

This requires a great deal of testing and re-testing. Developing a high quality test is a little like

drilling for oil. You know it’s probably there but you might have to drill a few holes to find it. And

sometimes you still can’t find it.

If using a psychometric test, it is important to know some fundamentals about tests:

What kind of test is it? This impacts on the kind of outputs you can expect.

If it’s a statistical measure, is it technically reliable and valid?

Features which are essential include:

The items must be transparent – their purpose must be clear to the person answering the

question. Otherwise there is no way that you can interpret the answer.

The items must be clear, in the same way, to all the people who answer the question.

Otherwise there is no way in which you can compare results.

The item must also be relevant to the trait or quality being measured.


Increasingly users are concerned about adverse impact. Does the test

discriminate against a particular group and if so is this fair? This is often a

very difficult area to evidence one way or the other.

Psychometric tests help users (managers, coaches, trainers) to make much better decisions about

people; they are not infallible tools. Personality tests measure habitual behaviour – what is our

default response in given situations. There are two distinct categories of tests: Ipsative tests and

normative tests.

Ipsative Tests

Ipsative tests are tests which ask individuals what their beliefs are about

themselves on given scales. But this is done against their own idea of what

that scale might represent.

So, you might be asked to suggest where you sit on a scale which measures

introversion at one end and extraversion at the other. The way that this is

done is often to make the individual choose between two options. For

example, would you rather read a book or go out with friends or do both? One

problem here is that everyone might have a different view about what

introversion and extraversion mean.

One consequence is that ipsative tests can rarely be used in recruitment and selection situations

because they can’t compare people reliably if they all have different views.

They are useful in coaching, counselling and personal development work because they are effective

at identifying what the individual thinks of themselves. They are a good starting point for a

purposeful discussion which can begin with “So you think you are…Let’s look at this a bit more

closely. What might this mean for your behaviour?”

A good example of a popular and very useful ipsative measure is MBTI – the Myers-Briggs Type

Indicator.


Normative Tests

Normative tests on the other hand are constructed differently.

Items are designed to assess the individual’s response to a specific

activity or situation. Responses are often gathered across a range of

options. Test designers commonly use something called a Likert

scale which has a range of options available for an answer. A widely

used scale is a 5 point scale which offers options from strongly agree,

through agree, neither agree nor disagree and disagree, to

strongly disagree. Some use 6 or 7 point scales but the principle is the

same.
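
As an illustration of how answers on a 5 point Likert scale become numbers that can be added up and compared, here is a minimal sketch. The numeric coding (strongly agree = 5 down to strongly disagree = 1) and the function name are assumptions made for this example; they are not taken from any AQR measure.

```python
# Hypothetical numeric coding for a 5 point Likert scale.
# The direction (strongly agree = 5) is an assumption for illustration.
LIKERT_5 = {
    "strongly agree": 5,
    "agree": 4,
    "neither agree nor disagree": 3,
    "disagree": 2,
    "strongly disagree": 1,
}


def code_responses(responses):
    """Convert a list of Likert labels into their numeric codes."""
    return [LIKERT_5[label.strip().lower()] for label in responses]


answers = ["Agree", "Strongly agree", "Neither agree nor disagree"]
codes = code_responses(answers)
print(codes)       # numeric code for each item
print(sum(codes))  # raw score across the items
```

A 6 or 7 point scale would only need a longer lookup table; the principle is the same.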

Because each item is usually very specific as to meaning and interpretation and we can establish a

norm, it is possible to compare responses between different individuals. So, in addition to being

able to start a purposeful discussion, one can also differentiate between individuals. This gives

normative measures a different kind of potency and they are often used in recruitment and

selection applications.

The norm is established by testing a large number of individuals

who are representative of the population to which the tests will

be or might be applied. The sample size is selected according to a

formula. The sample should also be what is called a stratified

sample. That is, it should represent the types of people who are

to be found in the population, for example, by gender, by age, by

ethnic group, etc.
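
To make the idea of a stratified sample concrete, the sketch below splits a planned sample across groups in proportion to their size in the population. The function name, the age bands and the population figures are all hypothetical, and proportional allocation is only one possible approach; the source does not say which formula is used for AQR norm groups.

```python
def allocate_sample(population_counts, sample_size):
    """Split sample_size across strata in proportion to population_counts.

    Uses integer arithmetic and hands any leftover places to the strata
    with the largest remainders, so the allocations always sum to sample_size.
    """
    total = sum(population_counts.values())
    quotas = {}
    remainders = {}
    for stratum, count in population_counts.items():
        quotas[stratum] = count * sample_size // total
        remainders[stratum] = count * sample_size % total
    leftover = sample_size - sum(quotas.values())
    for stratum in sorted(remainders, key=remainders.get, reverse=True)[:leftover]:
        quotas[stratum] += 1
    return quotas


# Hypothetical population broken down by age band.
population = {"18-29": 2500, "30-49": 4500, "50-65": 3000}
print(allocate_sample(population, 200))
```

In practice a sample is usually stratified on several characteristics at once (gender, age, ethnic group), which would mean one count per combination of characteristics.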


Patterns of scores are then allocated to sten scores which will typically be

associated with a level of the characteristic being demonstrated by the

individual. Sten scores are explained later in the workshop but they generally run from 1 to 10.

With some characteristics, 1 might represent a low score and 10 a high score. For most personality

traits they simply represent two poles of a bipolar spectrum.

The norm enables the user to assess whether someone is higher or lower than the norm and to

what extent. That will provide guidance as to what might be the typical habitual behaviour of

someone at a particular score.
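
For measures scored against the normal distribution curve, "higher or lower than the norm and to what extent" can be sketched as follows. This assumes the conventional sten definition (mean 5.5, standard deviation 2); the norm mean and standard deviation used here are hypothetical, and a publisher's own norm tables may allocate scores differently.

```python
import math


def raw_to_sten(raw_score, norm_mean, norm_sd):
    """Convert a raw score to a sten (1-10) using the norm group's mean and SD.

    Assumes the conventional sten definition: mean 5.5, standard deviation 2,
    with band edges every half a standard deviation. A score exactly on an
    edge is placed in the higher band (an assumption of this sketch).
    """
    z = (raw_score - norm_mean) / norm_sd
    sten = math.floor(2 * z + 6)
    return max(1, min(10, sten))


# Hypothetical norm group: mean raw score 50, standard deviation 10.
for raw in (25, 44, 50, 63, 80):
    print(raw, raw_to_sten(raw, norm_mean=50, norm_sd=10))
```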

If it is planned then to use the measure with different populations (such as another country or a

different age group) it is good practice to carry out an equivalency study to confirm that the norms

and distributions are still representative of the population which is now being examined. Good test

publishers will develop global norms and specific norms.

MTQ48, ILM72, Carrus and The Prevue Assessment are examples of normative measures. Other

popular measures include 15fq, 16pf, OPQ and NEO.


Sten Scales

There are two scales in popular use. The first is the normal distribution curve. This reflects the fact

that a lot of qualities in nature, such as height, shoe size, hand size, etc., are normally

distributed.

This is represented by the so-called Bell Curve because it is shaped like a bell.

The area under the curve represents the percentage of the population which exhibits this

score/quality.

Sten one represents approximately 2.5% of the population who would show the

significant or extreme characteristics of this end of the population. If we were looking at

height this might be the range of heights that the shortest 2.5% of the population

measure. If it is personality and the scale is introversion – extraversion, then this might

represent the 2.5% of the population who are most introverted.

Sten two represents approximately the next 4.5% and sten three represents the next 9%.

Sten 4, 5, 6 and 7 represent the next 15, 19, 19 and 15% of the population. In total around

68% of the population.


Technically, statisticians describe this group as lying within one standard deviation

of the mean. In lay language, this represents a large

group of people who are more similar than different. If we were looking at the height of males in

the UK, this might represent males from a height of 1.65 meters to a height of 1.80 meters. If we

looked at a group of these people casually we would probably judge them all to be of normal or

average height.

If we were looking at height then Sten 10 might represent the range of heights of the tallest 2.5% of

the population. If it is personality and the scale is introversion – extraversion, then this might

represent the 2.5% of the population who are most extraverted.

The normal distribution curve is commonly used where we are looking at individuals who are drawn

from any point in a whole population. MTQ48 (in all its forms), Carrus and Prevue use the normal

distribution curve, as do 15fq, 16pf, OPQ and NEO.
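
The approximate percentages quoted for each sten on the normal distribution curve can be checked with a short calculation. The sketch below assumes the conventional sten bands, each half a standard deviation wide with the two tails open-ended; the function names are made up for this example.

```python
import math


def normal_cdf(z):
    """Cumulative probability of the standard normal distribution at z."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def sten_band_percentages():
    """Percentage of a normally distributed population falling in each sten."""
    # Band edges sit every half standard deviation from -2 to +2;
    # stens 1 and 10 take the open-ended tails.
    edges = [-math.inf] + [-2.0 + 0.5 * i for i in range(9)] + [math.inf]
    return {
        sten: 100.0 * (normal_cdf(edges[sten]) - normal_cdf(edges[sten - 1]))
        for sten in range(1, 11)
    }


for sten, pct in sten_band_percentages().items():
    print(f"Sten {sten:2d}: {pct:4.1f}%")
```

The printed values are slightly different from the rounded figures used in this module (for example a little under 2.5% for stens 1 and 10), but stens 4 to 7 together still come to roughly 68%.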

Sometimes the population in question and the samples are drawn from selected parts of the

population and you cannot guarantee that the shape of the curve for this population is normally

distributed.

In this case the practice is to use a Sten scoring system which simply breaks up the range of

responses into ten equal bands. So Sten 1 represents 10% of the population in question, Sten 2 the

next 10% and so on. The ILM72, because it is looking at people who are in leadership positions and

there is no reason that this should be normally distributed, uses this Sten scoring scale. Information

about which sten scale applies to which measure can be found in the technical manual for the

measure in question.
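
The equal-tenths approach can be sketched as a ranking against the norm sample rather than a calculation from its mean and standard deviation. The norm sample and function name below are hypothetical, and the handling of tied scores is an assumption; the technical manual for the measure is the authority on the actual method.

```python
from bisect import bisect_left


def decile_sten(raw_score, norm_scores):
    """Sten based on ten equal-sized bands of the norm sample.

    The sten is set by the share of the norm sample scoring below
    raw_score: the lowest tenth maps to 1, the highest tenth to 10.
    How ties and band edges are handled is an assumption of this sketch.
    """
    ordered = sorted(norm_scores)
    below = bisect_left(ordered, raw_score)  # count of norm scores < raw_score
    sten = below * 10 // len(ordered) + 1
    return min(10, sten)


# Hypothetical norm sample of 20 raw scores.
norm_sample = list(range(1, 21))
print([decile_sten(score, norm_sample) for score in (1, 2, 3, 10, 20, 25)])
```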

On the normal distribution curve, Stens 8, 9 and 10 behave like Stens 1, 2 and 3 in reverse. They

represent the next 9%, 4.5% and 2.5% of the population after Sten 7.


Reliability

Reliability is simply the most important test of a psychometric measure. If a test is not reliable it

won’t work. A reliable measure is one which you can complete today and again in say four weeks’

time and get the same or very similar results.

Assuming nothing significant has happened to you in the meantime and there is no reason that you

should have changed, you should get, within reason, the same score on the second occasion as you

did on the first. If you do get a different response then the test is either faulty in some way or it is

picking up something that is changing but for which it was not designed. That is, it’s unreliable. If it

is reliable, users can rely on the information generated to help them understand better the

individual with whom they are working.

There is a technical calculation which provides a measure of reliability. The formula for calculating

reliability can be found in the technical manual. The output is a number on a scale from 0 to 1.0.

1.00 is a perfect score – that can never be achieved. There is always a little bit of natural variance.

The British Psychological Society and the US Department of Labor both provide guidance as to what

is an acceptable score for a measure to be regarded as a good measure. That score is 0.70 or

greater. Obviously the higher the score, the more reliable is the measure.
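
One common way to put a number on "complete it today and again in four weeks" is to correlate the two sets of scores. The sketch below uses a Pearson correlation on made-up sten scores. This is an assumption for illustration only: the formula actually used for a given measure is the one in its technical manual and may be a different statistic.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical sten scores for six people, four weeks apart.
first_sitting = [3, 5, 6, 7, 8, 4]
second_sitting = [4, 5, 6, 6, 8, 4]

r = pearson_r(first_sitting, second_sitting)
print(f"test-retest correlation: {r:.2f}")
print("meets 0.70 guideline:", r >= 0.70)
```

A real reliability study would use a far larger group than six people; the small lists here only show the mechanics.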

Reliability scores for AQR measures can be found in the technical manual for the measure. All AQR

measures exceed a reliability score of 0.70.


Validity

This simply represents how effective the measure is at measuring what it is supposed to measure.

There are a number of types of validity:

Face Validity

This is a judgement about the content and presentation of the measure. If the person completing

the measure doesn’t feel that the measure is a serious measure, either because of its content – the

questions appear odd, or its appearance, then they may not respond carefully or properly.

Content Validity

Content validity is a technical question about the items (that is the questions) used in the

questionnaire. Do they look like questions which are relevant to the purpose of the questionnaire,

and do they look like questions that will generate relevant information.

Construct Validity

Construct validity addresses a fundamental question “does the method measure the claimed

attribute”.

There is no objective formula to assess content or construct validity. Assessment is generally by

peer review from experts.

Concurrent or Predictive Validity

Concurrent or predictive validity goes to the heart of the matter. Can the tests make predictions

about the individual in terms of the structure and content of the measure upon which we can rely?

Like reliability, concurrent validity can be assessed through a technical calculation. The process and

formula for calculating concurrent validity can be found in the technical manual.


The British Psychological Society and the US Department of Labor both

provide guidance as to what is an acceptable score for a measure to be

regarded as a good measure. For concurrent validity, that score is 0.20 or greater. Obviously the

higher the score, the more valid is the measure. However, scores of 0.40 or greater are extremely

rare.

Concurrent validity scores for AQR measures can be found in the technical manual for the measure.

Where the data is available all AQR measures exceed a score of 0.20.
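
To put the 0.20 and 0.40 figures in context, and assuming the validity score is a correlation between test scores and some external criterion (the source does not name the statistic), squaring it gives the share of variation in the criterion that the test accounts for. The function name is made up for this example.

```python
def variance_explained(validity_coefficient):
    """Share of criterion variance accounted for by a correlation of this size."""
    return validity_coefficient ** 2


for r in (0.20, 0.40):
    print(f"r = {r:.2f} -> {variance_explained(r):.0%} of variance explained")
```

This is one way to see why psychometric tests support better predictions rather than perfect ones: even a strong validity score leaves most of the variation between people to be explained by other sources of information.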

Psychometric & Test Administration

When it comes to test administration and test use, there

are some basic requirements for good practice. These are

designed to support getting the best possible co-operation

from a candidate, which normally means that

the data captured is more reliable. It also ensures that the

process is reasonable, efficient and effective from the

perspective of the user and the candidate.

The first and possibly most important consideration is to ensure that it is appropriate to use the

test for the purpose at hand. The test has been designed to assess a specific set of qualities. It can

only be useful in terms of those qualities.

When inviting a candidate to complete the measure it is equally important to explain to the

candidate:

What the test is

What its purpose is

How the information will be used

How the candidate will get feedback on results


Encourage the candidate to respond honestly. Most personality type

measures are designed to be completed with a candidate’s first instinctive

responses.

It is useful to identify the benefits for the candidate in completing a high quality psychometric

measure. It provides them with a reliable insight into aspects of their make-up. They can reflect on

that. It can provide them with an insight into how others might see them. In all instances good

quality output is valuable to anyone. It’s in everyone’s interest to be as straight as possible.

Candidates sometimes overthink their response to a question, and select a poor option for a

response. Often they are suspicious about the motives of the test user and seek to manipulate their

responses. Hence the value of being open, transparent and encouraging during test administration.

It is very good practice to provide feedback to everyone who

completes a psychometric measure. Best practice is to do this orally

and provide the individual with the opportunity to ask questions

and discuss the meaning of the output. This can be done face to

face or over the phone or through a VOIP system. Realistically this

is not always possible. Individuals may be a long way away from the

user and may not be readily accessible to the user. Sometimes the volume

of use is simply too great for everyone to receive detailed feedback. Good

practice would include providing the individual with written feedback and

an opportunity (directly or indirectly) to ask questions about the results.

All AQR tests generate a feedback report which summarises the candidate’s

results, what they mean, what might be potential implications and

sometimes what might be appropriate development actions if there is a

need for such. These are designed to be capable of being sent to a

candidate with or without direct support. This provides a practical solution

to the challenge of providing feedback in most circumstances.


There are a number of issues which commonly arise when people first look

at psychometric measures:

1. Faking

Do people fake responses to the questionnaires? Some do, but most don’t.

Faking is an issue which applies to all psychometric measures – especially good measures.

One of the features of a high quality measure, as we have seen, is that the purpose of the items

used (i.e. the questions) must be clear to all those who complete the questionnaire. Otherwise,

how do we know that everyone has completed the questionnaire with the same understanding of

the questionnaire? If that is not the case we can’t compare results and we can’t discriminate

between responses. And that’s often the whole purpose of a questionnaire – to assess and explain

differences.

If the purpose is clear, then there will be those who are tempted to respond in a socially desirable

way. That is, they will anticipate what the user is looking for and may adjust responses to try to

deliver this expectation. Mostly this doesn’t work for a whole variety of reasons. However people

still do it. Some will “fake good” – they will try to overstate their position. Others will “fake bad”

and try to understate their position. It is thought that faking is more likely to occur in a recruitment

setting than in a personal or organisational development setting. It’s not too difficult to see why

that might be.

Good and careful test design will seek to take faking into account as far as it is able to, bearing in mind

that a psychometric is a statistical instrument.

Some measures purport to deal with this by creating a faking or social

desirability scale which attempts to measure whether someone is

producing skewed responses and to what extent they are doing this.

Generally the view is that, for the most part, faking scales do not work.


The best way of improving the integrity of response is to practise good, open

and honest test administration. The more that individuals understand the

purpose of a questionnaire and the importance of good quality information, the more likely they

are to respond truthfully and honestly.

2. Norm Group

People will also ask about the norm group. That is a good question because

in a normative measure your scores are being compared to a norm. So the

selection of the norm is important. All AQR measures are based on a global

norm which is checked (through an equivalency study) each time the test is

to be used in a new territory or sector.

The core norm groups are generally drawn from a working age population – between the ages of 18

and 65. The norm groups are created from non-self-selecting samples. In other words the norm

samples are people who generally cannot opt out of completing the questionnaire. Using samples

of volunteers can mean that the sample may not be a typical sample at all, and that creates issues

for interpreting the results.

However, AQR is also creating norm groups for its own products for specific groupings: cultural,

ethnic, age, etc. Many are available as a feature on AQR sites.

3. Cultural Sensitivity

The third area where questions are frequently asked is the area of cultural sensitivity and fairness.

The most common question is “are there differences between male and female responses?”

Globally many are interested to know if there are geographic or cultural differences. And there are.

Information about these differences can be found in the technical manuals for each product. All

those who complete the Licensed User Training program are automatically updated with

information as it emerges.




Tel: 0044 0 1244 572050

Fax: 044 0 124 572051

Email: [email protected]

Website: www.aqrinternational.co.uk
