Prompt Engineer (Model Behavior & Evaluation)
Yuna
South Africa
Yuna’s mission is to radically transform how mental health support is accessed and delivered. We provide immediate, private, 24/7 support through empathetic conversational AI—closing gaps created by long wait times, high costs, and limited access to care. Every role at Yuna directly shapes the experience users have in their most vulnerable moments.
This is a foundational hire for our engineering team. As our first model behavior engineer, you will be responsible for how our AI conversations behave in the real world. This role goes beyond writing prompts. You will define what “good” looks like, design and run evaluations, diagnose failures across multi-agent systems, and continuously improve the warmth, safety, usefulness, and alignment of Yuna’s conversations at scale.
You will work at the intersection of product, clinical psychology, and engineering, owning conversational behavior across prompts, context, routing, memory, model choice, and evaluation.
What you’ll do
- Own conversational behavior across a multi-agent, multi-model conversational system. Beyond prompting, this will require working with context architecture, agent routing, memory, and model selection
- Collaborate with clinicians and mental health experts to design and operate evaluation frameworks for conversational quality, empathy, usefulness, and alignment
- Diagnose failures by analyzing real conversation traces, agent routing, memory usage, and system context
- Define your own priorities by using data, stakeholder feedback, and conversational logs to identify the highest-impact improvements and act on them
- Build the continuous improvement loop, ensuring each failure leads to a systemic improvement rather than a one-off patch
Who you are
You are a problem solver at heart: a “jack of all trades” who can define an ambiguous problem, form hypotheses, and ship solutions on your own. You don’t wait for permission; you act first and pivot when necessary. You are also deeply comfortable with AI and practiced at directing it to solve specific problems.
Required
- Experience owning model or agent behavior in a live production environment
- Experience designing and operating evaluation systems (qualitative and quantitative)
- Ability to define what “good” means and defend it with evidence
- Comfort working hands-on with tools like LangGraph, LangSmith, or comparable tracing and evaluation frameworks
- High data literacy: able to reason from logs, traces, metrics, and analytics to isolate root causes
- Clear communicator who can collaborate across product, clinical, and engineering teams in ambiguous problem spaces
Nice to have
- Working knowledge of Python: able to read, debug, and collaborate on agent logic and evaluation code
- Familiarity with alignment concepts (e.g. human values, safety tradeoffs, refusal behavior)
- Proficiency with SQL, Amplitude, and Excel
- Background in psychology, linguistics, neuroscience, education, or adjacent fields
Location: Remote (work from anywhere, with a minimum overlap of working hours from 8am–12pm PST)
Employment Type: Full-time
What We Offer
- Competitive salary (based on experience) + equity options
- Remote-first culture with flexibility
- A fast-growing, talented, and empathetic team dedicated to transforming mental health care
- An opportunity to use AI for good
- A high level of ownership and the opportunity to make a measurable impact, building cutting-edge AI systems that improve lives every day