top of page

How to Truly Interview an AI You're Hiring.

From technical EVALS to practical questionnaires: the guide to hiring AI models without improvising



Imagine having to hire a new colleague. What do you do?


You post the ad, analyze CVs, conduct phone interviews, technical tests, motivational interviews, reference checks. Years of HR experience condensed into proven processes. You don't randomly hire the first person who comes along.


Now imagine having to "Hire an AI in Your Organization" for your team.



A model, or rather now an agentic system that will work with your sensitive data, make decisions, interact with customers, write official documents.


What do you do?

At most, someone goes to lmarena.ai, searches for benchmarks, looks at which model is at the top of the rankings, and says "let's get that one." Or talks to colleagues, goes by hearsay or because others they trust use it. Done.


The problem is we don't interview AIs. We don't ask them anything specific. We don't test them on our actual work. We just hire them and start working together.


If you think about it, it's like hiring someone just because they won "employee of the year" at another company, or graduated with high grades, without verifying if they're suitable for YOUR context.


The Benchmark Problem

Here there's a technical problem that becomes practical: public benchmarks are losing value.


MMLU, HumanEval, GPQA - these standardized tests that evaluate models - are now part of training datasets. Models literally "study" for the test. It's like hiring someone who memorized all the interview questions.


Result? Epoch.ai now does an "average of averages" trying to aggregate dozens of different benchmarks, hoping that at least some aren't yet contaminated. But it's an endless chase.


That's why the specific interview about YOUR company becomes fundamental. Questions the models have never seen, processes that aren't on the Internet, specific problems from your domain. There they can't cheat. There you really see what they can do.


Does It Make Sense to Interview an AI?

When I wrote "Hire an AI in Your Organization", I knew that EVALS - systematic model evaluations done both during fine-tuning and to verify capabilities - existed. But they seemed too technical to talk about. Engineer stuff, incomprehensible benchmarks, metrics that only OpenAI or Anthropic could afford to measure.


There wasn't a culture to talk about it. Even today, if you say "EVALS" to someone, they look at you puzzled, right? However, just recently, even Ethan Mollick, a Wharton Business School professor very sharp in his AI analyses, has started talking about it... so I pulled these notes back out of the drawer.


The concept is that if instead of EVALS we talk about "selection interview" suddenly everyone understands. And yes, modern AI systems are capable of having an interview


Because that's exactly what it is: a structured process to verify if a candidate (in this case an AI model) is suitable for the role it must fill in your organization.


And just as we don't hire people randomly, we shouldn't hire AI randomly.


Vibe Hiring

I confess: I also do "vibe hiring" with AI: I take models 'randomly' and test them with more or less repetitive prompts (for example asking to create interactive visualizations showing how AI chooses the next words during text generation). Then I decide if I like the result or not.


Some impressions of recent hires.


Claude Sonnet 4.5? It's a nerd like me, writes both code and text well, knows my workspaces and isn't a bootlicker. I use it in VS Code for almost everything.


ChatGPT 5? Since version 5 I struggle to use it, but it's convenient for small general tasks.


Gemini 2.5? Only if I happen to still be searching on Google and AI Mode starts without me noticing.


These are my main "hires." All children of vibe hiring improvised, based on personal feelings, zero structured process.


For me, when I work alone on personal things, I can afford to experiment, and it works. But in companies? We can't afford this approach. We don't hire people randomly, why should we do it with AI? By the way, we can explore topics that would be at least taboo between humans.



The Questions We Never Dared to Ask

Here's the interesting part: we can't ask human candidates certain questions. Fortunately there are laws, labor regulations, ethical issues. You can't ask a candidate "when do you lie?" or "what biases do you have?".


With AI we can. In fact, we must.


  • When do you lie? Hallucinations are "involuntary lies." Testing when and how they happen is fundamental.

  • How much do you cost? Not just the token price, but the real cost to complete specific tasks for your company.

  • What data were you trained on? Fundamental to understand if it can work with your sensitive data.

  • What biases do you have? Algorithmic discrimination, biases embedded in training, legal risks.

  • How do I corrupt you? Prompt injection is like testing a candidate's corruptibility.


But, assuming the above are naive questions, because finding the answers will be very complex, what matters are the technical questions about your work. I call them specific emergent abilities, those that the model can't know because they're too specific to your reality: "How many hours does it take to complete this business process?" "How would you handle this internal memo?" "Design this product according to our specifications." "Analyze this product's defects by looking at it." etc.


We're not testing if it can solve mathematical problems or write Python code. We're testing if that AI SYSTEM can work in OUR specific context.


"I'm a Private Person" (and Other Lies About Privacy)

And then there's the thorniest question: "Where does our data end up?"


With a human candidate it's easy: they sign an NDA, there are legal consequences if they violate confidentiality, you can trust them (or at least you have tools to protect yourself).


With AI? The vendor tells you "we comply with GDPR," "data is encrypted," "we don't use your inputs to train models." Ok, we sign the contract. But how much can you really trust?


The reality is that today there's not much to trust. Not out of bad faith, but due to technical complexity:


  • Data passes through complex cloud infrastructures

  • It's not always clear where they're processed geographically

  • Contractual guarantees are different from technical guarantees

  • A bug, a security flaw, a distracted vendor employee... and your sensitive data is exposed


During the AI "interview," you need to invest time and resources to understand this aspect. Reading the contract isn't enough. You need:


  • Data leakage tests (does the AI reveal information it shouldn't?)

  • Retention policy verification (how long do they keep logs?)

  • Understand the difference between "not used for training" and "not stored"

  • Test prompt injection scenarios that try to extract other users' data

  • Assessment of the counterparty's (Vendors) reliability and their supply chain.


It's like doing a thorough background check. Except instead of calling references, you must do privacy penetration testing.


A human candidate who says "I'm discreet" can be tested by giving them sensitive information to handle during the trial period. With AI? You must do the same, but systematically, before giving it access to real data.


How to Interview AI: The 2-Level Proces

How do you actually interview an AI?


I wrote in the article about EVALS for AI Cookbook that the principle is: start simple, we're not OpenAI, no need to create systems with thousands of tests.



Level 1 - The Homemade Questionnaire (the technical interview)

Here the real interview begins. Create a questionnaire with 10-20 specific questions from your domain. Not "write a poem" or "solve this math problem." Questions like:


  • "How long does it take to complete this business process?" (answer you know)

  • "Summarize this 50-page internal memo and tell me the 3 main implications"

  • "Given this contract template, fill in the fields with this customer data"

  • "Translate this technical document maintaining our sector-specific terminology"

  • "Reply to this complaint email following our tone of voice"

  • "We received an order from a French customer for 500 pieces of model X-20 in matte finish. Estimate production times knowing we produce 120 pieces/week in glossy finish"

  • "On average, how much does it cost to produce a piece like the one I showed you?" (Correct answer: "I can't estimate it without knowing materials, quantity, finish, overhead..."


Ask the same questions to Claude, ChatGPT, Gemini, and compare the answers. No need for complicated software: an Excel sheet with questions, expected answers, and a score from 1 to 5 for each model. Where you give the grades (mind you) by reading the answers.


Then if you want to complicate them go ahead, as you wish, but remember you must be able to personally evaluate each result.


And that the important thing is they are real tasks it will actually do in your organization.


Level 2: The Trial Period (continuous performance)

Just as you don't "hire and forget" with people, you can't do it with AI. In serious companies 1/3 or half of the AI team dedicates time to continuously testing the models.


  • Monitor performance over time

  • Compare with new models that come out

  • Don't be afraid to "fire" an ineffective model

  • Reinvest in better candidates when available


Who Interviews the AI?


And remember you can't just delegate this to techies. Some time ago I wrote about new jobs in the age of AI agents, imagining figures like the Agent Shepherd, the Guardian, the Dealer. All roles related to managing these new "digital colleagues".


But there's a question I asked myself later: how do these roles select the right agents without a structured interview process?


The Agent Dealer (or Agent Sourcing Specialist, if you want a name to put on LinkedIn) will need to "source good ones," conducting many interviews, estimating their actual capabilities, finding "trusted" ones for sensitive data. But on what basis do they decide? Instinct? Random tests? Spend days on Reddit hoping to find the perfect model?


The Guardian (Agent Operations Manager) will need to do continuous performance management, handle onboarding, evaluate alignment with company policies. It's the HR Manager of agents. But without clear evaluation criteria, how do they do it?


And the Agent Shepherd (Chief Agentic Officer), who must measure ROI, eliminate ineffective models, decide on budgets and investments? On what metrics do they base their decisions?


A process is needed. An interview is needed.


The Missing Culture

The point isn't technical, it's cultural.


In the same way we learned, over decades, to conduct structured interviews with people (psychometric tests, assessment centers, practical tests, references), we must build the culture to interview AIs.


But with different questions. More direct. More courageous.


Because AI doesn't get offended if we ask it "when do you lie?". It won't complain on LinkedIn if we give it difficult tests. It won't threaten lawsuits if we fail it.


And paradoxically, precisely this freedom to ask "anything" makes us uncomfortable. We're not used to this total transparency in interviews.


Yet it's necessary.


The Intern is a Metaphor, Not a Person

Be careful though: AI should not be anthropomorphized.


As I wrote in the article "That Thing About AI Consciousness We Need to Talk About", the problem is that "as humans, used to seeing faces in trees and figures in clouds, we can't help but attribute human characteristics to this answer machine."


We don't have the right words to define it. That's why it was convenient to take them from the human world: AI "speaks," "reasons," "thinks," "understands." And I myself use the metaphor of the "super smart intern."


But these are expedients we need to understand, reason and reflect. We must not make the mistake of really attributing feelings or other human characteristics.


Until we find the right way to treat it, it's best to "switch" between thing and proto-person to find OUR compromise. The intern metaphor helps us reason about the interview, responsibilities, tasks. But it has no empathy, doesn't care about us, doesn't judge us, doesn't get offended, doesn't suffer.


We must remember this every time we evaluate it. The interview we conduct is for US, to understand if it's useful to OUR organization. Not to "give it a chance" or "be fair to it." (And believe me, if I'm writing this it's because I'm starting to hear things I don't like about affection for models.)



So WHAT?


When I started thinking about this topic, I realized that the real gap isn't technological, it's methodological. Companies already have all the conceptual tools to evaluate AI: they've been using them for years to evaluate people.


The EVALS are technical interviews. Public benchmarks are no longer enough—they're contaminated, models "study for the test." You conduct the real interview, with specific questions from your domain.


The process is simpler than you think: public screening to understand who to invite, homemade questionnaire with 10-20 questions about your real processes (an Excel sheet is enough), continuous monitoring of performance.


And you can make bold requests that you could never make to a human candidate: when do you lie? How much do you really cost? Where does our data end up? Make me a plan to compete with my company. Destroy, with constructive criticism, this product we make.


The point is: start. Ask yourselves questions. Ask them to the models you already work with. Each of you is an expert in your own work and knows how to evaluate answers. You don't need a perfect process right away, you need to start doing it.


Perhaps the first real interview to conduct isn't with AI, but with whoever will select it: someone is needed to handle it, not just tech, not just traditional HR, but a hybrid competence that can bridge.


I talk about this, and much more, in my workshops.


---


Massimiliano


P.S. What question would you ask an AI model during an interview? I'd love to know.



 
 
 

Comments


bottom of page