Why this role is hard to hire for
The titles "AI engineer" and "AI systems architect" did not exist in their current form three years ago. The job boards are full of candidates with strong ML credentials: people who can train models, tune hyperparameters, and interpret benchmark results. That is a different skill set from what most companies actually need when they are trying to build production AI systems.
The confusion costs companies months. They hire an ML researcher who cannot ship. Or they hire a software engineer who treats LLMs as black boxes and cannot diagnose when they fail. Or they hire a vendor relationship manager who puts together demo pipelines that collapse at scale.
Here is what to look for instead.
The skill that matters most: systems thinking
A good AI systems architect thinks about AI features the same way a senior software engineer thinks about distributed systems: as a collection of components with interfaces, failure modes, latency budgets, and operational requirements.
The question to ask in an interview: describe a production AI system you built that failed, and walk me through how you diagnosed and fixed it.
A strong answer will discuss specific failure modes (retrieval quality degradation, prompt injection, latency spikes on long contexts, evaluation score drift), the instrumentation that made the failure visible, and the architectural changes that fixed the root cause.
A weak answer will be vague about failure modes, will attribute failures to model quality rather than system design, and will describe fixes in terms of "tried different prompts" rather than structural interventions.
Systems thinkers design for failure from the start. They build evaluation harnesses before shipping. They instrument things that most engineers would not think to instrument. They have opinions about failure isolation and graceful degradation.
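As a rough illustration of what "designing for failure" can look like in code, here is a minimal sketch of one pipeline stage with a latency budget, a fallback path, and basic instrumentation. Every name in it (Stage, StageResult, flaky_retriever, keyword_search) is hypothetical, the events list stands in for a real metrics sink, and the budget is only recorded after the call returns rather than enforced with a timeout.

```python
import time
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class StageResult:
    value: str
    degraded: bool  # True when the fallback path produced the value


@dataclass
class Stage:
    """One pipeline component with its own latency budget and fallback (hypothetical)."""
    name: str
    primary: Callable[[str], str]
    fallback: Callable[[str], str]
    budget_s: float
    events: list = field(default_factory=list)  # stand-in for a metrics sink

    def run(self, query: str) -> StageResult:
        start = time.perf_counter()
        try:
            value = self.primary(query)
        except Exception as exc:
            # Isolate the failure: record it and serve a degraded answer
            # instead of letting the whole request fail.
            self.events.append((self.name, "fallback", type(exc).__name__))
            return StageResult(self.fallback(query), degraded=True)
        # Record whether the call stayed within its latency budget.
        over_budget = (time.perf_counter() - start) > self.budget_s
        self.events.append((self.name, "ok", over_budget))
        return StageResult(value, degraded=False)


def flaky_retriever(query: str) -> str:
    # Simulates a dependency outage.
    raise TimeoutError("vector store unavailable")


def keyword_search(query: str) -> str:
    # Cheaper, lower-quality path used when the primary fails.
    return f"keyword results for: {query}"


stage = Stage("retrieval", flaky_retriever, keyword_search, budget_s=0.2)
result = stage.run("reset password")
```

The point a strong candidate tends to make is visible here: the degraded flag and the event log exist before anything has gone wrong, so a failure shows up in the data rather than in a user complaint.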
What to test in a technical interview
Architecture design. Give a realistic system design problem — a RAG system for a support knowledge base, an agent that automates a multi-step data pipeline, a real-time moderation system. Ask them to design it. Listen for: how they handle failure modes, how they think about evaluation, what trade-offs they make between latency and quality, whether they ask clarifying questions about the constraints.
Production debugging. Describe a system that is behaving badly in production: answer quality has degraded over the past two weeks and users are complaining. Ask what they would investigate and what data they would look at. Strong candidates will ask about the evaluation pipeline, check for data drift, look at the distribution of input types, and investigate prompt and model changes in the deployment history. Weak candidates will go straight to tuning the prompt.
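One way to picture the investigation a strong candidate describes: slice logged evaluation scores by deployment version and by input type to see where the regression actually lives. The sketch below assumes each logged request carries a prompt_version, an input_type, and a score field; those field names and the sample numbers are invented for illustration.

```python
from collections import defaultdict
from statistics import mean


def mean_score_by(records, *keys):
    """Average the 'score' field, grouped by the given record keys."""
    groups = defaultdict(list)
    for record in records:
        groups[tuple(record[k] for k in keys)].append(record["score"])
    return {group: mean(scores) for group, scores in groups.items()}


# Hypothetical evaluation log: one row per scored production request.
records = [
    {"prompt_version": "v1", "input_type": "billing", "score": 0.9},
    {"prompt_version": "v1", "input_type": "login", "score": 0.8},
    {"prompt_version": "v2", "input_type": "billing", "score": 0.9},
    {"prompt_version": "v2", "input_type": "login", "score": 0.4},
]

by_version = mean_score_by(records, "prompt_version")
by_version_and_type = mean_score_by(records, "prompt_version", "input_type")

# The slice with the lowest average is the first place to look.
worst_slice = min(by_version_and_type, key=by_version_and_type.get)
```

In this toy data the overall drop between versions is driven entirely by one input type, which points at a specific change rather than a general decline in model quality.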
Model selection. Ask why they would choose one model over another for a specific use case. The answer should not be "it scored highest on the benchmark" — it should be about the specific capabilities the use case requires, the latency constraints, the context window requirements, and the cost structure.
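The reasoning above can be sketched as a filter-then-rank step: hard constraints first, then quality measured on the actual task. The ModelOption fields, the model names, and all the numbers below are made up for illustration, and task_eval_score is assumed to come from your own task-specific evaluation set rather than a public benchmark.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelOption:
    name: str
    p95_latency_ms: int
    context_tokens: int
    usd_per_1k_requests: float
    task_eval_score: float  # measured on an in-house eval set (assumption)


def choose_model(options, max_latency_ms, min_context_tokens, max_usd_per_1k):
    """Drop options that violate a hard constraint, then pick the best on-task score."""
    feasible = [
        o for o in options
        if o.p95_latency_ms <= max_latency_ms
        and o.context_tokens >= min_context_tokens
        and o.usd_per_1k_requests <= max_usd_per_1k
    ]
    if not feasible:
        return None  # No model fits; the constraints need revisiting.
    return max(feasible, key=lambda o: o.task_eval_score)


# Hypothetical candidates with invented numbers.
options = [
    ModelOption("large", 2400, 200_000, 30.0, 0.91),
    ModelOption("medium", 700, 128_000, 6.0, 0.86),
    ModelOption("small", 250, 16_000, 1.0, 0.78),
]

choice = choose_model(
    options, max_latency_ms=1000, min_context_tokens=32_000, max_usd_per_1k=10.0
)
```

Here the highest-scoring model loses on latency and cost, and the cheapest one loses on context window, which is the kind of trade-off the candidate's answer should make explicit.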
Build vs buy decisions. Ask when they would use LangChain vs build custom orchestration. Strong candidates have clear opinions based on use case requirements, not familiarity. They know what LangChain is good at (rapid prototyping, standard patterns) and what it is bad at (production performance, custom failure handling).
The red flags
Demo-only thinking. The candidate can describe impressive demos but cannot explain how the system would behave under production load, with adversarial inputs, or when a component fails. Demos work in controlled conditions. Production systems have to keep working outside them.
Model obsession. The candidate's mental model of AI system quality is primarily about choosing the right model. Model selection matters, but it is one input into system quality, not the primary lever. Systems with mediocre models and excellent retrieval, evaluation, and infrastructure often outperform systems with state-of-the-art models and poor engineering.
Evaluation avoidance. Ask what metrics they use to measure system quality in production. If the answer is vague ("we monitor user feedback") or benchmark-focused ("the model scores well on MMLU"), that is a problem. Production AI systems require rigorous, automated evaluation. Candidates who have not built evaluation infrastructure have usually not shipped serious production systems.
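For a sense of scale, automated evaluation does not have to start large. The following is a minimal sketch of a regression-style check that could run before each deployment: a fixed set of cases, a pass criterion, and a threshold that gates the release. The run_eval function, the must_contain criterion, and the toy_system stand-in are all assumptions for illustration; real harnesses typically use richer scoring than substring checks.

```python
from typing import Callable


def run_eval(system: Callable[[str], str], cases, threshold: float):
    """Run every case through the system and compare the pass rate to a threshold."""
    if not cases:
        raise ValueError("evaluation set is empty")
    failures = []
    for case in cases:
        answer = system(case["question"]).lower()
        # A case passes only if every required phrase appears in the answer.
        if not all(phrase.lower() in answer for phrase in case["must_contain"]):
            failures.append(case["question"])
    pass_rate = 1 - len(failures) / len(cases)
    return {
        "pass_rate": pass_rate,
        "passed": pass_rate >= threshold,
        "failures": failures,
    }


def toy_system(question: str) -> str:
    # Stand-in for the real AI system under test.
    canned = {
        "How do I reset my password?": "Use the reset link on the login page.",
        "Can I get a refund?": "Refunds are available within 30 days.",
        "How do I export data?": "Go to Settings and choose Export.",
    }
    return canned.get(question, "I am not sure.")


cases = [
    {"question": "How do I reset my password?", "must_contain": ["reset link"]},
    {"question": "Can I get a refund?", "must_contain": ["30 days"]},
    {"question": "How do I export data?", "must_contain": ["export"]},
    {"question": "How do I delete my account?", "must_contain": ["delete"]},
]

report = run_eval(toy_system, cases, threshold=0.9)
```

A candidate who has built this kind of infrastructure can usually say what their equivalent of cases and threshold were, and what happened when a release failed the check.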
Framework dependency. The candidate can only describe AI systems in terms of the frameworks they have used. Strong architects understand what the frameworks are doing and can reason about systems without them. Ask them to describe the retrieval step of a RAG pipeline at the level of API calls, not framework abstractions.
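For reference, the retrieval step without any framework reduces to roughly three operations: embed the query, score it against the stored documents, and pass the top results into the prompt. The sketch below is a toy version under stated assumptions: embed uses word counts as a stand-in for a real embedding model call, documents are re-embedded on every query instead of being indexed ahead of time, and all function names are hypothetical.

```python
import math
from collections import Counter


def embed(text):
    # Stand-in for an embedding API call: a bag-of-words count vector.
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query, documents, k=2):
    """Return up to k documents most similar to the query, best first."""
    q = embed(query)
    scored = [(cosine(q, embed(doc)), doc) for doc in documents]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for score, doc in scored[:k] if score > 0]


def build_prompt(query, passages):
    # The retrieved passages become context for the generation call.
    context = "\n".join(f"- {p}" for p in passages)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"


documents = [
    "how to reset your password",
    "update billing address and payment method",
    "reset two factor authentication codes",
]

passages = retrieve("reset password", documents, k=2)
prompt = build_prompt("reset password", passages)
```

A candidate who can describe the pipeline at this level can also say where it breaks: what happens when no document scores above zero, how k interacts with the context window, and where the latency goes.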
The questions they should ask you
Strong candidates ask about the things that will make or break the engagement: data quality, evaluation methodology, and production constraints.
- What data do you have, and what does its quality look like?
- How do you currently measure whether the AI system is working?
- What does success look like in six months, and how will you measure it?
- What is the latency budget for this feature?
- Who owns the system after the architect hands it off?
Candidates who only ask about technology choices ("what cloud provider do you use?", "are you committed to a specific model vendor?") are thinking about the wrong layer.
The engagement structure that works
The best AI systems architects I know work in one of two modes:
Defined build engagements. A specific system with a clear scope: design, implement, evaluate, document, hand off. These work when the problem is well-understood and the organization has the engineering capacity to maintain what gets built.
Embedded leadership. Ongoing involvement in the AI roadmap, architecture decisions, and team capabilities. These work when AI is core to the product and needs continuous ownership, not just an initial build.
What does not work well: treating an AI architect like a consultant who produces reports and recommendations but does not own implementation. The implementation details are where most of the value is. The architecture is only as good as its execution.
What this actually costs
Good AI systems architects are expensive relative to software engineers and cheap relative to the value they create or protect. The companies that spend a year building an AI system that does not work, or that ships and silently degrades in production, spend far more than the cost of someone who would have done it right.
The question is not whether to invest in AI architecture. It is whether to invest at the planning stage or at the remediation stage. Remediation is almost always more expensive.