tl;dr AI assessment tools don't get to bypass a century of occupational psychology just because they're technologically sophisticated. The same standards — validity, fairness, transparency — apply, and in some respects apply more rigorously. If a vendor can't clearly explain their scenario design process, their scoring model, their adverse impact data, and what candidates are told, that's a signal to probe harder. The science matters as much as the software.

Why the science behind AI assessment matters as much as the technology

Organisations that once had to choose between assessment rigour and hiring at scale, between the depth of an assessment centre and the logistics of running one for five hundred candidates, now have tools that could, in principle, offer both. Simulation-based AI assessment, done properly, can put candidates inside a realistic version of the role they're applying for, generate rich behavioural evidence, and do it in less than thirty minutes on a mobile phone. That is a powerful advantage for TA teams, who no longer have to trade accuracy for scale, and can better protect the integrity of their screening from cheating attempts. After all, as candidates have increasing access to sophisticated AI tools that can assi

But there is a version of the same story that should give occupational psychologists some pause. The AI assessment market has grown faster than the scientific frameworks needed to evaluate it. Tools are being sold, purchased, and deployed at volume on the basis of technology demonstrations and vendor assurances, without the same scrutiny that the field has always applied to conventional psychometric instruments. And the organisations buying these tools, most of whose TA leaders are not IO psychologists, have limited ways of knowing the difference between a rigorously validated AI assessment and a compelling piece of software that hasn't been through anything like the same process.

I'm not writing this as a sceptic of AI in assessment, but as someone who thinks the technology has a lot of potential, which will only be realised if the scientific and ethical standards applied to it are as high as the ambition behind it. The questions I lay out in this article are not reasons to avoid AI-led assessment. They are the questions everyone buying into the technology should ask to make sure they get the value and benefits it promises, not just the hype.

The foundations of ethical assessment design

To understand what's new about the ethical challenges in AI-led assessment, it helps to understand what the field already knows about ethical assessment design more broadly.

Occupational psychology has spent the better part of a century developing a framework for asking whether an assessment is fit for purpose. That framework rests on three obligations:

Validity: does the tool measure what it claims to measure, and does that demonstrably predict job performance?

Fairness: does the tool perform consistently across different groups of candidates, and where differential performance exists, can it be justified by factors proven to be critical for the job?

Transparency: can candidates, hiring managers, and (should it come to it) employment tribunals understand what the tool is measuring, how it works, and how decisions based on it were reached?

These obligations emerged from decades of evidence about what happens when assessment processes are improperly developed or implemented. Cognitive ability tests that showed significant differential performance across racial groups, prompting decades of research into whether that performance reflected differences in job-relevant capability or differences in educational access and test familiarity. Personality questionnaires marketed as predictors of job performance that, when subjected to rigorous meta-analysis, showed weaker predictive validity than their vendors claimed. Interview processes that felt objective to the people conducting them but produced outcomes shaped as much by appearance and accent as by any job-relevant competency.

The field learned from those challenges. The standards that exist (validation requirements, adverse impact obligations, transparency norms) and the best practice that surrounds them were put in place to mitigate harmful outcomes. AI-led assessments don't get to bypass that history by virtue of being technologically sophisticated. If anything, they make the application of those standards more demanding.

When the scenario is the assessment

With a conventional psychometric tool, the instrument and the construct it measures are conceptually distinct. You define what you want to measure, whether that is problem-solving, service orientation, or resilience under pressure, and you design questions and scenarios that elicit the underlying behaviours and traits. The link between the constructs assessed and the items assessing them is explicit, documentable, and examinable. A trained reviewer can interrogate it, a validation study can test it, and a candidate can be given a coherent explanation of what they were assessed on.

In AI-led simulation, the scenario is the assessment. It is not a passive questionnaire that just collects responses for measurement. It can actively shape what behaviours are elicited, what skills are required to navigate the situation, what kind of performance is recognisable as competent, and what kind isn't. The workplace it depicts, the characters involved, the nature of the challenge presented, the language and register in which it's written are all design decisions, and all of them carry implicit assumptions about what good performance looks like.

Consider a customer-facing scenario set in a contact centre. The customer is escalating a complaint. The candidate needs to manage the situation. Straightforward enough as a design brief, but whose model of effective de-escalation is being operationalised? A candidate who responds with warm, empathetic directness and moves steadily towards a resolution might score well. A candidate who takes a more formal, tighter-to-the-script approach, equally effective in practice and perhaps reflecting a different professional background or cultural communication style, may not register as strongly against the implicit template the scenario was built around. That difference doesn't live just in the scoring model, but also in the scenario design, embedded before anyone wrote a line of code. And if it hasn't been explicitly examined, it becomes an invisible criterion that shapes who succeeds and who doesn't.

The same risk applies to language complexity. If a scenario requires candidates to comprehend a rapidly escalating written conversation and respond fluently in real time, it is measuring something. But is that verbal comprehension, relationship-building skill, empathy, or something else entirely, depending on the context? The key question is whether what underpins the candidate's responses is actually the job-relevant competency the assessment claims to evaluate, or a compound of that competency plus reading speed, written fluency, and familiarity with the communication norms of a particular workplace culture. Those are different things, and conflating them produces a tool that looks valid because it predicts an outcome, but that is not measuring the skills supposed to explain the prediction. Yet it is that explanation against which high-stakes decisions are ultimately justified.

This is why IO science must be present at the design stage, not reviewing a finished product, but shaping the brief from the outset. The competency framework needs to drive the scenario concept. The scenario needs to be reviewed, by people with expertise in assessment design, before it becomes an AI-generated experience. The definition of what competent performance looks like within that scenario needs to be made explicit, documented, and tested against the current research, the judgement of subject matter experts and actual job incumbents. None of this is optional if the goal is a scientifically defensible assessment. And none of it is primarily a technology problem. It's a design problem, which means it requires the kinds of expertise that sit in IO psychology and job analysis, not in software engineering.

Organisations evaluating AI assessment vendors should be asking: who designed your scenarios, and what was the process? What competency framework underpins them? What review stages did they go through before launch? If the answers are vague, or if the answers focus on the technology more than the science, buyers should take it as a signal to probe further.

The scoring problem

Scenario design is at least partially visible: candidates go through the experience, hiring managers can review it, and it's possible to form an intuitive view of whether it seems appropriate and well-constructed. Scoring, on the other hand, is invisible and is precisely where clarity is most needed.

In a conventional assessment, the scoring model is explicit by design. A trained assessor (or a validated algorithm) applies a set of criteria that can be inspected and explained. Disagreements between raters can be examined and resolved. The logic of the score is, in principle, auditable. You can sit in front of a hiring manager, a candidate, or a regulatory body and explain precisely how a score was calculated.

In a generative AI assessment, being able to explain scores is just as important, but considerably harder. When a candidate types a nuanced response to a simulated complaint, what exactly is the AI evaluating? Is it the accuracy of the proposed resolution? The empathy in the tone? The structure of the communication? The speed of the response? If it's a combination of these things, what are the relative weights, and how were those weights established? And critically: how does the model handle a response that is unconventional but correct, that achieves the right outcome through an approach the model wasn’t trained to understand?

The answers to these questions are not always comfortable. Large language models (LLMs) learn from patterns in training data, and those patterns contain assumptions about what good professional communication sounds like, about what problem-solving looks like when written down, about the register and vocabulary associated with competence in a particular role. When those assumptions have been carefully examined and validated against actual job performance, they are a strength. When they haven't, they can introduce systematic bias that is difficult to detect.

It’s therefore critically important to feed the models with quality training data. Creating a scoring model using the easiest to find research participants will result in candidates being compared to sub-standard data, not unlike when a conventional psychometric assessment uses an unrepresentative norm group. The mantra of “garbage in, garbage out” very much applies to this situation and great care needs to be taken to ensure the training data reflects the needs of the final assessment in terms of audience and purpose.

A 2024 University of Washington study of CV screening found that three widely used large language models showed significant racial, gender, and intersectional bias in candidate rankings. Not once, across more than three million comparisons, was a black male-associated name ranked above a white male-associated name. It is safe to assume that the developers of those models did not set out to build discriminatory tools; the bias was in the patterns the models had learned, and nobody had looked carefully enough at the outputs before the tools were in use. The mechanism in assessment scoring is different, but the principle is the same: if you don't examine your model's outputs systematically and critically, you will not find the problems until someone else does.

It could be argued that CV screening by humans was not free from bias either. However, applying AI at scale multiplies the reach of that bias: it exposes far more candidates to a flawed scoring model and adversely affects more individuals than traditional methods would have. At its worst, introducing poorly designed and unvalidated AI scales up systemic bias more than it scales up valid and fair prediction. What is worse, many vendors will claim their AI is “bias free”, which is both meaningless and highly deceptive. There will always be some form of bias in the system; what matters most is how it is identified, controlled for, and monitored.

What responsible AI scoring requires, then, is not a different standard from the one the field has always applied. It's the same standard, applied with more rigour and more specificity. Assessment tools that use generative AI to produce a realistic testing scenario need to come with documentation that demonstrates construct alignment and inter-rater reliability, and should compare AI scoring against trained human assessors on a meaningful sample of responses. Vendors should be able to show that scoring patterns have been examined across demographic groups, and be transparent about what the model is and isn't sensitive to. Candidates who request it should be able to receive a clear and understandable explanation of what the score represents and how it was reached.
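As an illustration of what comparing AI scoring against trained human assessors can look like, here is a minimal sketch using Cohen's kappa, a standard chance-corrected agreement statistic. The rating labels and the data are hypothetical, and the choice of kappa is an assumption made for this example; it is not a method any particular vendor is known to use.

```python
from collections import Counter


def cohens_kappa(ratings_a, ratings_b):
    """Chance-corrected agreement between two raters who each assign
    one categorical rating to the same set of responses."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("Both raters must score the same non-empty set of responses.")
    n = len(ratings_a)

    # Proportion of responses on which the two raters agree exactly.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n

    # Agreement expected by chance, from each rater's category frequencies.
    counts_a = Counter(ratings_a)
    counts_b = Counter(ratings_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)

    if expected == 1.0:
        return 1.0  # both raters used one identical category throughout
    return (observed - expected) / (1 - expected)


# Hypothetical ratings of the same eight candidate responses.
human = ["meets", "exceeds", "below", "meets", "meets", "below", "exceeds", "meets"]
ai = ["meets", "exceeds", "meets", "meets", "below", "below", "exceeds", "meets"]

kappa = cohens_kappa(human, ai)
print(f"Cohen's kappa (human vs AI): {kappa:.2f}")
```

Plain kappa treats every disagreement as equally serious. For ordered rating scales a weighted variant is usually more appropriate, and a real study would need a far larger sample of responses, ideally broken down by demographic group.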

Adverse impact: more complex, not less important

One of the most robust findings in the history of assessment research is that different tools produce different patterns of differential performance across demographic groups. This is not a flaw unique to AI; it is a characteristic of virtually every assessment method that has been studied at sufficient scale. Cognitive ability tests, structured interviews, situational judgement tests, assessment centre exercises: all of them show differential performance across some groups on some dimensions. The field has spent decades developing methods for understanding those differences, separating construct-relevant from construct-irrelevant variance, and making informed decisions about acceptable trade-offs between predictive validity and adverse impact.
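One common way to quantify differential performance between two groups is the standardised mean difference (Cohen's d). The sketch below is illustrative only: the scores and group labels are made up, and the statistic describes the size of a gap without saying whether the gap is construct-relevant.

```python
from math import sqrt
from statistics import mean, variance


def standardized_mean_difference(scores_a, scores_b):
    """Cohen's d: difference in group means divided by the pooled
    standard deviation of the two groups."""
    n_a, n_b = len(scores_a), len(scores_b)
    if n_a < 2 or n_b < 2:
        raise ValueError("Each group needs at least two scores.")

    # Pooled variance weights each group's sample variance by its degrees of freedom.
    pooled_var = (
        (n_a - 1) * variance(scores_a) + (n_b - 1) * variance(scores_b)
    ) / (n_a + n_b - 2)
    if pooled_var == 0:
        raise ValueError("Scores have no variance; d is undefined.")

    return (mean(scores_a) - mean(scores_b)) / sqrt(pooled_var)


# Hypothetical assessment scores for two candidate groups.
group_a = [62, 70, 75, 68, 80, 73]
group_b = [58, 66, 71, 64, 76, 69]

d = standardized_mean_difference(group_a, group_b)
print(f"Standardised mean difference: {d:.2f}")
```

With samples this small the estimate is very unstable, which is one reason the sample sizes behind any differential performance claim matter.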

AI-led simulations introduce new potential sources of construct-irrelevant variance that either aren’t present or are less pronounced in traditional assessments.

Written language fluency is the most straightforward example. Almost any scenario-based assessment that requires candidates to express themselves in written form is, to some degree, measuring written communication ability alongside whatever job-relevant competency it nominally targets. For a role where written communication is central to job performance, that's appropriate. For a role where verbal communication and interpersonal judgement are the key competencies, and the assessment measures them through a written simulation, it introduces a source of variance that correlates strongly with educational background and first-language status in ways that have nothing to do with job performance.

Contextual familiarity is another. A simulation set in a contact centre for a telecoms company will feel more familiar to candidates who have worked in similar environments. That familiarity advantage is real and measurable, but there's an important distinction between a candidate who performs well because they have relevant prior experience that really predicts job success, and one who just recognises the context because they've been exposed to it before. Those are different things, and separating them requires careful analysis that will look different for every role and scenario.

Comfort with the medium, or digital fluency, is also worth considering. Some candidates will be at ease typing responses in a chat-based interface under time pressure. Others won't, and that might not have anything to do with how well they'd do the job, and just be a reflection of how much that candidate has been exposed to that type of technology. A well-designed assessment should minimise the degree to which the format itself drives outcomes, through interface design, flexible response options, and accessibility features.

The good news is that the assessment industry already knows how to address these issues. The methods are well-established:

Testing on a representative and diverse pilot group, large enough to reveal whether any candidate group is being systematically disadvantaged. Needless to say - before the tool goes live, not after.

Fixing problems when the data shows them, rather than continuing to expose more candidates to flawed evaluation methods at the risk of disproportionately disadvantaging minority groups.

Monitoring the process in real time, setting reasonable and justifiable pass or fail criteria, and checking that the new applicant group is performing as expected compared to the reference group.

Continuing that analysis post-deployment. Candidate populations shift, scenarios age, and a tool that looked fair at launch can develop blind spots over time.

Considering ongoing monitoring as an essential step in the process to ensure the tool is still doing what you built it to do.
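A pilot or monitoring check along these lines can be sketched as a comparison of pass rates between groups. The example below uses the "four-fifths" rule of thumb, a widely cited screening heuristic that is assumed here for illustration and is not drawn from this article; the group names and counts are hypothetical.

```python
def selection_rates(outcomes):
    """outcomes maps a group label to (number passed, number assessed)."""
    rates = {}
    for group, (passed, total) in outcomes.items():
        if total <= 0:
            raise ValueError(f"Group {group!r} has no assessed candidates.")
        rates[group] = passed / total
    return rates


def impact_ratios(outcomes):
    """Each group's pass rate divided by the highest group's pass rate."""
    rates = selection_rates(outcomes)
    top_rate = max(rates.values())
    if top_rate == 0:
        raise ValueError("No group has any passes; ratios are undefined.")
    return {group: rate / top_rate for group, rate in rates.items()}


# Hypothetical pilot results: (passed, assessed) per group.
pilot = {
    "group_a": (60, 100),
    "group_b": (42, 100),
    "group_c": (27, 50),
}

THRESHOLD = 0.8  # four-fifths heuristic, an assumption for this sketch

ratios = impact_ratios(pilot)
flagged = sorted(group for group, ratio in ratios.items() if ratio < THRESHOLD)

for group, ratio in sorted(ratios.items()):
    print(f"{group}: impact ratio {ratio:.2f}")
print("Flagged for review:", flagged)
```

A ratio below the threshold is a prompt to investigate, not a verdict. With small groups the ratio swings widely, so it is normally read alongside significance tests and the sample sizes involved.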

Organisations can’t compromise on adverse impact analysis for AI assessment tools. The risk is both reputational and regulatory. The EU AI Act's full compliance requirements come into force in August 2026, mandating ongoing bias testing for AI systems used in employment decisions. Employment discrimination law applies to AI-generated outcomes, regardless of whether the discrimination was intentional. And organisations whose hiring decisions are later shown to have disadvantaged certain groups, in ways that a proper adverse impact analysis would have identified in advance, will have trouble re-establishing their employer brand.

What candidates are owed

The EU AI Act classifies AI systems used in recruitment as high-risk, and mandates documented risk assessments, bias testing, human oversight, and the requirement to provide candidates with meaningful explanations of how AI has influenced decisions affecting them. GDPR Article 22 already provides rights in relation to solely automated decisions, including the right to request human review. In the USA, New York City's Local Law 144 requires annual bias audits for automated employment decision tools, with public reporting of results. California finalised regulations in late 2025 clarifying how anti-discrimination law applies to AI hiring tools. The direction of the legislation is clear.

But I want to make a case for candidate transparency that goes beyond compliance, because I think the compliance framing undersells it.

AI-led simulation, at its best, offers candidates something valuable that conventional assessment can rarely provide: a realistic experience of the role before they accept it. A well-designed simulation lets candidates understand what the job involves, the nature of the interactions, the pace, the kind of decisions they'll face, in a way that a job description and a competency framework never quite manage. That's not just good for organisations trying to reduce attrition, but also for candidates making career decisions.

The value of realistic job simulations, however, depends on candidates understanding what they're engaging with. A simulation that feels like a realistic work experience, but doesn't make clear what's being measured or how it will be used, is not offering candidates a realistic job preview. It's just putting them through an opaque evaluation process that purports to be something else. The immersive aspect that makes AI simulation effective is also what makes transparency obligations more important.

In practical terms, this means:

Clear pre-assessment communication about what competencies are being evaluated and why they matter for the role.

A transparent explanation of how AI is being used and what role human reviewers play.

Access to score information in a form that is clear and easily understandable.

A simple and accessible process for candidates to raise concerns or request review (not a policy document with a generic email address, but a functioning channel with people on the other end of it).

None of this is technically complex, but it needs to be actively prioritised for these tools to work as intended.

How do I evaluate AI assessment vendors?

Everything discussed above, the scenario design, the scoring model, the adverse impact analysis, the candidate experience, ultimately comes down to one practical question: how do you know, when you're evaluating a tool, whether any of this has been done properly? Here are some questions that organisations looking into AI simulation assessments should ask:

On scenario design: Who built the scenarios, against what competency framework, and through what review process? What job analysis was done to demonstrate role relevance? What expertise was involved in that process? How are scenarios updated as roles, contexts, and candidate populations evolve?

On construct validity: What does this tool claim to measure, and what is the evidence that it measures it?

On AI methodology: What model was selected and how was it trained? What effort went into getting representative and diverse training data? Was the AI technology chosen before the assessment content or because of the assessment content? How is model drift being accounted for?

On scoring explainability: Can you explain what the AI scoring model is evaluating and how it was developed? Can you relate this back to a clear job analysis to demonstrate and justify role relevance? Can you explain a candidate's score to them in plain language? What human review process exists, and what does it involve?

On adverse impact: What does the differential performance data show, across which groups, and over what sample sizes? What sources of construct-irrelevant variance have been identified and addressed? What ongoing monitoring is in place, and what is the process when adverse impact is detected?

On candidate experience: What information do candidates receive before, during, and after the assessment? How transparent are you about the use of AI and the candidates’ rights as data subjects? What access do they have to their results? What process exists for raising concerns and ensuring a human review?

Any vendor who has done their homework properly should be able to answer these questions clearly and cite their research and data as evidence. Although you may want to consider checking that what they shared wasn’t produced by AI ;)

The standard worth holding

AI-led assessments are at an interesting inflection point. In some cases, the technology has outpaced the frameworks that are supposed to underpin it, and the market has, to some degree, run ahead of the science. That's not unusual in early-stage technology adoption: it happened with structured interviewing, with psychometric testing, with competency-based assessment. In each case, the field eventually caught up by tightening the standards, and the tools that had been built with rigour proved their worth, while those that hadn't were gradually abandoned.

The difference with AI is opacity. When a traditional assessment produces biased outcomes, the scoring model usually is (or should be!) explicit enough that you can find the problem, understand it, and fix it. When an AI assessment does the same, the source of the bias can be buried in model weights and training data patterns that are difficult to interrogate.

The organisations that will get the most from AI-led assessment, such as better quality of hire, stronger candidate experiences, lower turnover, and more defensible processes, are the ones that treat the scientific and ethical questions with the seriousness they deserve. Not because regulation requires it, though regulation is coming and will require it, but because scientific rigour produces better outcomes. An assessment rooted in science that has been properly validated is, to put it simply, a better tool. It’s always better to buy the substance than the shine, the meat rather than the bones, the book rather than the cover, the wine rather than the bottle (or whatever your chosen metaphor is).

The potential of AI in assessment is real, but so is the responsibility of the people building and buying it to ask the hard questions before the first candidate clicks on the invitation link.
