In: Statistics and Probability
What type of validity evidence is most closely related to content sampling error?
Validity evidence may be documented at both the item and total test levels. Here the focus is only on documentation of validity evidence at the total test level. At this level, the Standards for Educational and Psychological Testing (American Educational Research Association [AERA], American Psychological Association [APA], & National Council on Measurement in Education [NCME]) call for evidence regarding content coverage, response processes, internal structure, and relations to other variables. Examples of each source of validity evidence are provided using illustrations from large-scale assessment programs' technical documentation. A discussion of each of these sources follows.
Evidence of Content Coverage
In part, evidence of content coverage is based on judgments about "the adequacy with which the test content represents the content domain." As a whole, the test comprises sets of items that sample student performance on the intended domains. The expectation is that the items cover the full range of intended domains and that there is a sufficient number of items so that scores credibly represent student knowledge and skills in those areas. Without a sufficient number of items, validity is potentially threatened because the construct may be underrepresented.
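As a rough illustration of this kind of documentation, the Python sketch below checks whether each intended domain on a test blueprint is represented by enough items to avoid obvious underrepresentation. The domain names, item tags, and minimum counts are all invented for the example.

```python
from collections import Counter

# Hypothetical blueprint: each intended content domain and the minimum
# number of items judged sufficient to represent it.
blueprint = {"number_sense": 8, "algebra": 10, "geometry": 8, "statistics": 6}

# Hypothetical test form: the domain each item was written to measure.
item_domains = (
    ["number_sense"] * 9 + ["algebra"] * 12 + ["geometry"] * 5 + ["statistics"] * 7
)

counts = Counter(item_domains)

# Flag domains with too few items (possible construct underrepresentation)
# and items tagged to domains outside the blueprint (construct-irrelevant content).
underrepresented = {d: counts[d] for d, minimum in blueprint.items() if counts[d] < minimum}
off_blueprint = {d: n for d, n in counts.items() if d not in blueprint}

print("Items per domain:", dict(counts))
print("Underrepresented domains:", underrepresented)
print("Off-blueprint domains:", off_blueprint)
```

A report like this would not establish validity by itself, but it makes the "sufficient number of items per intended domain" judgment easy to document.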
Validity tells you how accurately a method measures something.
If a method measures what it claims to measure, and the results
closely correspond to real-world values, then it can be considered
valid. There are four main types of validity:
Construct validity: Does the test measure the concept that it’s
intended to measure?
Content validity: Is the test fully representative of what it aims
to measure?
Face validity: Does the content of the test appear to be suitable
to its aims?
Criterion validity: Do the results correspond to a different test
of the same thing?
1) Construct validity
Construct validity evaluates whether a measurement tool really
represents the thing we are interested in measuring. It’s central
to establishing the overall validity of a method.
What is a construct?
A construct refers to a concept or characteristic that can’t be
directly observed, but can be measured by observing other
indicators that are associated with it.
Constructs can be characteristics of individuals, such as
intelligence, obesity, job satisfaction, or depression; they can
also be broader concepts applied to organizations or social groups,
such as gender equality, corporate social responsibility, or
freedom of speech.
Example
There is no objective, observable entity called “depression” that
we can measure directly. But based on existing psychological
research and theory, we can measure depression based on a
collection of symptoms and indicators, such as low self-confidence
and low energy levels.
What is construct validity?
Construct validity is about ensuring that the method of measurement
matches the construct you want to measure. If you develop a
questionnaire to diagnose depression, you need to know: does the
questionnaire really measure the construct of depression? Or is it
actually measuring the respondent’s mood, self-esteem, or some
other construct?
To achieve construct validity, you have to ensure that your indicators and measurements are carefully developed based on relevant existing knowledge. The questionnaire must include only relevant questions that measure known indicators of depression.
The other types of validity described below can all be considered
as forms of evidence for construct validity.
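One standard way to gather such evidence is to compare convergent and discriminant correlations: scores on the new questionnaire should correlate strongly with an established measure of the same construct and only weakly with a measure of an unrelated construct. The Python sketch below uses simulated scores purely to show the shape of that check; the measures and numbers are not from any real study.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data: total scores for 50 respondents on a new depression
# questionnaire, an established depression scale, and an unrelated
# measure (e.g., a vocabulary test).
established = rng.normal(20, 5, 50)
new_questionnaire = established + rng.normal(0, 2, 50)   # should track the construct
unrelated = rng.normal(30, 6, 50)                        # should not

# Convergent evidence: strong correlation with the established depression scale.
convergent_r = np.corrcoef(new_questionnaire, established)[0, 1]
# Discriminant evidence: weak correlation with the unrelated measure.
discriminant_r = np.corrcoef(new_questionnaire, unrelated)[0, 1]

print(f"Convergent r = {convergent_r:.2f} (want high)")
print(f"Discriminant r = {discriminant_r:.2f} (want near zero)")
```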
2) Content validity
Content validity assesses whether a test is representative of all
aspects of the construct.
To produce valid results, the content of a test, survey or
measurement method must cover all relevant parts of the subject it
aims to measure. If some aspects are missing from the measurement
(or if irrelevant aspects are included), the validity is
threatened.
Example
A mathematics teacher develops an end-of-semester algebra test for
her class. The test should cover every form of algebra that was
taught in the class. If some types of algebra are left out, then
the results may not be an accurate indication of students’
understanding of the subject. Similarly, if she includes questions
that are not related to algebra, the results are no longer a valid
measure of algebra knowledge.
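To make that check concrete, the short Python sketch below compares the topics taught in the course with the topics the test items actually cover. The topic names and the item-to-topic tagging are invented for illustration.

```python
# Invented course syllabus and test content for illustration.
taught_topics = {"linear_equations", "inequalities", "polynomials",
                 "factoring", "quadratic_equations"}

# Topic tagged to each item on the end-of-semester test.
tested_topics = {"linear_equations", "inequalities", "polynomials",
                 "geometry_proofs"}  # note: an off-topic item slipped in

missing = taught_topics - tested_topics      # taught but never tested -> coverage gap
irrelevant = tested_topics - taught_topics   # tested but never taught -> construct-irrelevant

print("Topics taught but not tested:", missing)
print("Topics tested but not taught:", irrelevant)
```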
3) Face validity
Face validity considers how suitable the content of a test seems to
be on the surface. It’s similar to content validity, but face
validity is a more informal and subjective assessment.
Example
You create a survey to measure the regularity of people’s dietary
habits. You review the survey items, which ask questions about
every meal of the day and snacks eaten in between for every day of
the week. On its surface, the survey seems like a good
representation of what you want to test, so you consider it to have
high face validity.
As face validity is a subjective measure, it’s often considered the
weakest form of validity. However, it can be useful in the initial
stages of developing a method.
4) Criterion validity
Criterion validity evaluates how closely the results of your test
correspond to the results of a different test.
What is a criterion?
The criterion is an external measurement of the same thing. It is
usually an established or widely-used test that is already
considered valid.
What is criterion validity?
To evaluate criterion validity, you calculate the correlation
between the results of your measurement and the results of the
criterion measurement. If there is a high correlation, this gives a
good indication that your test is measuring what it intends to
measure.
Example
A university professor creates a new test to measure applicants’
English writing ability. To assess how well the test really does
measure students’ writing ability, she finds an existing test that
is considered a valid measurement of English writing ability, and
compares the results when the same group of students take both
tests. If the outcomes are very similar, the new test has high criterion validity.
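Assuming the two score lists are lined up by student, that comparison is simply a correlation. The Python sketch below uses made-up scores for ten students to show the calculation.

```python
import numpy as np

# Made-up scores for the same ten students on the new writing test
# and on the established (criterion) writing test.
new_test = np.array([72, 85, 60, 90, 78, 66, 88, 74, 81, 69])
criterion_test = np.array([70, 88, 63, 92, 75, 64, 90, 71, 84, 72])

# Pearson correlation between the two sets of scores; values close to 1
# are taken as evidence of criterion validity.
r = np.corrcoef(new_test, criterion_test)[0, 1]
print(f"Correlation with the criterion measure: r = {r:.2f}")
```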