
AI screening software: a buyer's guide for TA leaders

Most AI screening tools sell speed. The ones worth buying produce evidence. Here are the seven questions every TA leader should ask before signing a contract.

Ahmed Admin · April 30, 2026 · 11 min read

Who this guide is for

If you're a senior TA leader at a company hiring at any real volume, anywhere from 200 to 600+ hires a year, and you're being asked to evaluate AI tools for the top of your funnel, this is for you.

You've probably already had three or four demos. They've all sounded broadly similar. They all promise to save your recruiters time. They all have a slick dashboard. None of them have given you a clean answer to the question your CFO will ask in the contract review meeting: what exactly are we buying, and how do we know it works?

This guide is the answer to that question.

First, the category problem

The phrase "AI screening software" is doing too much work. It currently covers at least four genuinely different categories of tool, and most buyers don't realise they're comparing apples to forklifts.

Before you evaluate anyone, get clear on which category you're actually buying.

Category 1 — AI resume screeners

Parse CVs faster. Rank applicants by keyword match, semantic similarity, or ML scoring. Filter before a human sees the list.

What they're good at: Volume triage. Reducing 10,000 applicants to 1,000 in minutes.

What they're not: Evidence. They're making the same claim-based decision a human recruiter would make from the same CV, just faster. The signal underneath is unchanged.

Buy this if: You have a true logistics filtering problem (right to work, location, salary band, mandatory credentials) and you trust your CV-to-quality correlation. Most companies' CV-to-quality correlation is weaker than they think.

Category 2 — One-way video interview platforms

Candidates record themselves answering pre-set questions. Recordings are watched (or auto-scored) by recruiters.

What they're good at: Asynchronous scale. Letting one recruiter "meet" 200 candidates without 200 calls.

What they're not: Conversations. They produce performances under uniquely uncomfortable conditions, and they punish neurodivergent candidates, ESL candidates, and candidates without quiet home setups. The output is rarely a defensible evaluation. We've written more on why this format is dying here.

Buy this if: Your hiring managers genuinely watch the videos and your candidate experience metrics aren't already sliding. If either of those isn't true, you're paying for theatre.

Category 3 — Conversational AI interviewing platforms

An AI agent runs a real two-way conversation with every applicant. Output: recording, transcript, structured evaluation, scored shortlist.

What they're good at: Producing evidence at the volume of your inbound funnel. Every candidate gets the same structured first round. Every decision is auditable six months later.

What they're not: Plug-and-play. They require you to define focus areas per role and to trust the agent to probe in real time. The setup takes more thought than uploading questions to a one-way video tool.

Buy this if: You want to replace or upgrade your first-round interview itself, not just the filter before it. Merra is in this category.

Category 4 — Skills assessments and work-sample tests

Coding tests, sales role-plays, written exercises, situational judgement tests.

What they're good at: Measuring specific job-relevant skills with high construct validity. Decades of academic research backs this category.

What they're not: Scalable for general communication, motivation, or fit signals. Also: candidates increasingly resent long unpaid assessments.

Buy this if: The role is highly skill-bounded (engineering, sales SDR motions, customer-support response handling) and you already use these in later rounds. They're not a first-round filter; they're a deeper-round confirmation.

Most TA leaders end up running a stack: a resume triage layer (Category 1), then either a conversational interview (Category 3) or a one-way video (Category 2), then skills assessments (Category 4) for shortlisted candidates. The question this guide answers is which middle layer to buy.

The seven questions to ask every vendor

These are the questions that separate the tools doing real work from the tools selling theatre. Run every demo through this list.

1. What does a single candidate's evidence pack look like?

Ask for a real, anonymised example. Not a screenshot of a dashboard. The actual artefact a hiring manager would open.

What you want to see for each candidate:

  • A recording or transcript of an actual interaction (not just metadata).

  • A structured evaluation tied to the focus areas of this specific role, not a generic rubric.

  • A score that you can trace back to evidence, not a black-box number.

  • A decision summary written in language a human can read and defend.

If the vendor can't show you this within five minutes of you asking, the tool isn't producing evidence. It's producing rankings.

Why this matters: Six months after a hire, when someone asks "why did we advance this person and not that one?", the evidence pack is what you pull. If it doesn't exist, you have a compliance and legal-defensibility problem, not just a hiring problem.

2. How do you handle reliability and score consistency?

Ask this exact question: if you ran the same candidate's interview through your system 10 times, how much would the score vary?

Most vendors will not have a clean answer. The good ones will. The honest ones will give you a confidence interval.

The variance you should expect from a well-built system is on the order of ±1–2 points out of 100, with a 95% confidence interval. That is lower than the variance you'd see between two human interviewers scoring the same call. If a vendor claims zero variance, they're either lying or they're hashing inputs to outputs in a way that breaks the moment a candidate phrases something differently.

Why this matters: Score reliability is the difference between a tool you can defend to your legal team and a tool you can't. We've published our own reliability case study (10 runs on the same transcript, ±1.2 points at 95% CI) here. Every serious vendor should have something similar. If they don't, ask them to produce it before you sign.
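
If you want to sanity-check a reliability claim yourself, the arithmetic is not complicated. Below is a minimal sketch, in Python, of how you'd compute the spread and 95% confidence interval across 10 repeated runs of the same transcript. The scores are made up for illustration; substitute the numbers a vendor actually gives you.

```python
import math
import statistics

# Ten scores from re-running the SAME candidate transcript through the
# vendor's evaluator. These values are illustrative, not real data.
scores = [81.2, 82.0, 81.5, 80.9, 81.8, 82.3, 81.1, 81.6, 80.7, 81.4]

mean = statistics.mean(scores)
stdev = statistics.stdev(scores)       # sample standard deviation
sem = stdev / math.sqrt(len(scores))   # standard error of the mean

# 95% confidence interval using the t-distribution critical value
# for n = 10 runs (9 degrees of freedom): t ≈ 2.262.
margin = 2.262 * sem

print(f"mean score across runs: {mean:.1f}")
print(f"spread (max - min):     {max(scores) - min(scores):.1f} points")
print(f"95% CI on the mean:     ±{margin:.2f} points")
```

A vendor that has done this work can hand you the equivalent table on request, not just the headline number.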

3. How is bias measured, monitored, and mitigated?

There is no AI hiring tool that is bias-free. The honest vendors will tell you that. The dishonest ones will claim it.

What you want to hear:

  • Specific demographic groups they monitor against (gender, race/ethnicity where legally collected, age, disability, ESL status).

  • Adverse impact ratios published or available on request (there's a worked example of the calculation just after this list).

  • An explicit statement of which biases they've found and what they did about them.

  • A method for re-auditing as the underlying models change.
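
To make the adverse impact ratio concrete: the standard benchmark in US hiring is the four-fifths rule, which compares each group's selection rate against the highest-selected group's rate. Here is a minimal sketch of that calculation in Python; the group labels and pass-through counts are entirely made up, and a real audit would use your own funnel data.

```python
# Pass-through counts at the screening stage, by group.
# All figures are illustrative, not real data.
groups = {
    "group_a": {"applied": 1200, "advanced": 420},
    "group_b": {"applied": 800,  "advanced": 240},
    "group_c": {"applied": 300,  "advanced": 78},
}

# Selection rate per group, then each group's rate relative to the
# highest-selected group: the adverse impact ratio.
rates = {g: v["advanced"] / v["applied"] for g, v in groups.items()}
highest = max(rates.values())

for group, rate in rates.items():
    impact_ratio = rate / highest
    flag = "REVIEW" if impact_ratio < 0.8 else "ok"  # four-fifths rule threshold
    print(f"{group}: selection rate {rate:.1%}, impact ratio {impact_ratio:.2f} ({flag})")
```

Any vendor doing real bias monitoring runs some version of this continuously, not once at contract time.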

What you don't want to hear:

  • "Our AI is unbiased because it doesn't see protected characteristics." (False — protected characteristics leak through every linguistic and acoustic feature. Anyone telling you this hasn't done the work.)

  • "We're compliant with [vague regulation]." Ask which clause and how they verified it.

  • "Our customers haven't reported any issues." That's not bias monitoring. That's the absence of bias monitoring.

Why this matters: NYC Local Law 144, the EU AI Act, and similar regulations are turning bias auditing from an ethical nice-to-have into a legal requirement. The vendor's answer here tells you whether they've taken this seriously or whether you'll be doing the audit work yourself after the contract is signed.

4. How do candidates experience this?

Ask to go through the candidate flow yourself. End to end. Not a sales-narrated walkthrough — a real one.

What you're feeling for:

  • How long does it take from invite to start? (Anything over five minutes is a drop-off risk.)

  • Is the experience conversational or extractive? Does it feel like a chat, an interview, or an interrogation?

  • What happens to candidates who don't advance? Do they get feedback, silence, or a generic rejection?

  • How does the format treat neurodivergent candidates, ESL candidates, candidates with poor connectivity, candidates on mobile?

The candidate experience is no longer a soft metric. Glassdoor, Reddit, and TikTok turn bad candidate experiences into employer-brand damage in 48 hours. If the tool feels uncomfortable to you when you go through it, it will feel worse to a candidate who actually wants the job.

Why this matters: The companies winning at high-volume hiring right now are the ones whose candidate experience compounds. Every applicant who enjoyed your process tells two friends. Every applicant who felt humiliated by your one-way video tells fifty.

5. What's the integration shape with your ATS, calendar, and downstream tools?

Get specific. Not "yes we integrate." Ask:

  • Does it support the specific ATS you use (Workday, Greenhouse, Lever, Ashby, SmartRecruiters, etc.) at the workflow level — auto-trigger interview invites on stage move, write evaluation back to the candidate record, push score to required fields — or just at the data-export level?

  • Does it sync with the recruiter's calendar for scheduled rounds, or is it fully async?

  • Where does the evidence pack live? In the vendor's system, in your ATS, or both?

  • What happens when a candidate withdraws mid-process? When a role is cancelled? When you switch ATSs in 18 months?

Most AI screening tools claim integration. The depth of that integration is where the workflow either breaks or compounds.

Why this matters: A tool that lives in a separate tab is a tool your recruiters will stop opening within six weeks. A tool that writes its output back into your ATS at the right point in the workflow becomes part of how your team works.
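
To make "workflow level" concrete, here's a rough sketch of the round trip you're looking for: the ATS fires an event when a candidate moves into the screening stage, the tool triggers the interview automatically, and the evaluation is written back to the candidate record rather than living in a separate dashboard. Every class, function, and field name below is hypothetical; none of it corresponds to a real ATS or vendor API.

```python
# Hypothetical stand-ins for the two systems involved. A real integration
# would call the ATS's and the screening vendor's actual APIs.
class ScreeningTool:
    def create_interview(self, candidate_id: str, role_id: str) -> dict:
        return {"id": "int_001", "candidate_id": candidate_id, "role_id": role_id}

    def get_result(self, interview_id: str) -> dict:
        return {"score": 82, "summary": "Strong on ownership; probe pricing depth.",
                "evidence_url": "https://example.com/evidence/int_001"}

class ATS:
    def update_candidate(self, candidate_id: str, fields: dict) -> None:
        print(f"writing to candidate {candidate_id}: {list(fields)}")

screening_tool, ats = ScreeningTool(), ATS()

def on_stage_change(event: dict) -> None:
    """Handle an ATS webhook fired when a candidate changes stage."""
    if event["new_stage"] != "ai_screening":
        return  # only act when the candidate enters the screening stage

    # 1. Auto-trigger the interview invite, no recruiter action required.
    interview = screening_tool.create_interview(event["candidate_id"], event["role_id"])

    # 2. When the interview completes, push the evaluation back to the ATS
    #    record so the evidence pack lives where the team already works.
    result = screening_tool.get_result(interview["id"])
    ats.update_candidate(event["candidate_id"], {
        "screening_score": result["score"],
        "evaluation_summary": result["summary"],
        "evidence_pack_url": result["evidence_url"],
    })

on_stage_change({"candidate_id": "cand_42", "role_id": "req_7", "new_stage": "ai_screening"})
```

If a vendor can only show you a CSV export, that's data-export-level integration, not workflow-level.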

6. What does pricing look like at your actual volume?

Ask for the price at three volumes: your current one, double your current one, and your projected volume in 18 months.

Watch for:

  • Per-interview pricing. Aligns the vendor's incentive with your volume, but it punishes you for screening more candidates, and screening more candidates is usually the right call.

  • Per-role / per-requisition pricing. Cleaner economics, especially if you run high-applicant-volume reqs.

  • Per-seat pricing. Old-software-era thinking. Be wary if it's the only model offered for an AI tool — it usually means the vendor hasn't figured out their unit economics yet.

  • Hidden setup fees, premium support tiers, integration costs. Get the all-in number, not the headline number.

Why this matters: AI screening pricing is still finding its level across the market. The same shape of tool can be priced 3–4x differently between vendors. The cheapest is rarely the best, but the most expensive isn't automatically the best either. Get clean numbers, then judge.
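
A quick way to pressure-test the models against each other is to run the arithmetic at your own numbers. The sketch below uses entirely hypothetical prices and volumes; swap in the quotes you're actually given and your own funnel figures.

```python
# Hypothetical prices and volumes, for illustration only.
requisitions_per_year = 120
applicants_per_requisition = 150     # everyone screened, not a pre-filtered shortlist

price_per_interview = 8.00           # per-interview model (USD, hypothetical)
price_per_requisition = 600.00       # per-role model (USD, hypothetical)

interviews = requisitions_per_year * applicants_per_requisition

per_interview_total = interviews * price_per_interview
per_requisition_total = requisitions_per_year * price_per_requisition

print(f"interviews per year:    {interviews:,}")
print(f"per-interview model:    ${per_interview_total:,.0f} / year")
print(f"per-requisition model:  ${per_requisition_total:,.0f} / year")

# Note the incentive: under per-interview pricing, doubling the applicants
# you screen doubles the bill; under per-role pricing the bill stays flat.
```

Run the same numbers at double your volume and at your 18-month projection before you compare headline prices.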

7. How does the vendor improve the system over time?

This is the question almost no buyer asks, and it's the one that separates tools you'll renew from tools you'll quietly let lapse.

  • How often do they update the underlying interviewer / evaluator?

  • When they update it, does that change historical scores? (It shouldn't — versioning matters here.)

  • Do they share what they're changing with customers, or do you just notice your scores drifting one quarter?

  • Do they have a customer feedback loop on calibration, or are you just consuming whatever they ship?

Why this matters: AI screening tools are not static products. The model behind them improves (or regresses) every few months. A vendor without a clear versioning, change-log, and calibration process is a vendor whose product will be different in 12 months from the one you bought, and you won't know in which direction.

The shape of a good buying decision

After you've run every shortlisted vendor through those seven questions, the pattern usually becomes clear. The good tools answer all seven cleanly. The mediocre tools answer four or five and hand-wave the rest. The bad tools try to redirect you to the dashboard demo every time you ask one of them.

The bigger pattern, though, is what the answers tell you about the vendor's worldview.

The vendors who build for speed will optimise their answers around throughput, time-saved, recruiter-hours-back. Those are real metrics, but they are not the metric.

The vendors who build for evidence will optimise their answers around defensibility, reliability, audit trail, candidate experience, and how the hiring manager's decision quality changes. That's the metric.

Speed is a side-effect of evidence done well. Evidence is rarely a side-effect of speed.

If I were buying AI screening software today as a TA leader at a 1,000–10,000-employee company, I would only seriously evaluate vendors in Category 3 (conversational AI interviewing) and I would weight question 1 (what does a single candidate's evidence pack look like?) at roughly 40% of my decision.

Everything else flows from that.

A short summary of what to look for

  • Evidence per candidate, not just speed. Recording, transcript, structured evaluation, decision summary. All four, not three.

  • Reliability you can defend. Known and bounded variance, published or producible on request.

  • Bias monitoring that's specific. Demographic groups, adverse impact ratios, mitigation methods.

  • Candidate experience you've felt yourself. Walk the flow end-to-end before signing.

  • Workflow-level ATS integration. Not just data export.

  • Pricing at your real volume. Three volumes, all-in, clean numbers.

  • A vendor that improves with versioning. Not one that quietly drifts.

Where Merra fits

Merra is in Category 3 — conversational AI interviewing. Every applicant gets a structured 10–15 minute video conversation with the interviewer agent, configured per role. The output for every candidate is the recording, the transcript, a scored evaluation against the focus areas, and a decision summary. The hiring team logs in to a ranked shortlist with the full evidence pack on each candidate.

We believe the right way to think about AI screening software is evidence-based first-round screening. We've laid out the worldview behind that in Why we killed the one-way video interview at Merra and built something else.

We've published our own reliability case study (10 runs on the same transcript, ±1.2 points at 95% CI). We're transparent about what bias monitoring we do and where the limits are. We integrate at the workflow level with the major ATSs. Our pricing is per-role, not per-seat.

Not because we're trying to win this guide. Because those are the right answers, regardless of which vendor you end up choosing.

Run a pilot on one role

The fastest way to evaluate any AI screening tool is to run it on a real role with real candidates, against your existing first-round process, and compare the evidence packs.

We'll set up Merra for a single requisition, run every applicant through a structured 10–15 minute video interview, and hand you the ranked shortlist with the recording, transcript, and scored evaluation for every candidate.

If the evidence pack is better than what your current first round produces, you'll know in a week.

Run a pilot on one role and see the evidence pack Merra gives your team.

Start a pilot →

Tags: #ai screening software · #buyer's guide · #evidence-based screening · #vendor evaluation · #ai interviewing · #ta leadership · #hiring tech stack

Ready to hire faster with AI?

See how Merra helps teams screen candidates in minutes, not weeks.

Request a Demo