Harness Engineering: User Experience vs Safety Compliance — A Direction Mainstream Roadmaps Have Collectively Skipped

Abstract: In early 2026, “Harness Engineering” surfaced almost simultaneously at OpenAI, Anthropic, and the Chinese startup Nextie, and the open-source community converged on the “12 Primitives” as a shared taxonomy. This essay argues that essentially all mainstream investment in Harness has concentrated on a single dimension (user experience, performance, efficiency), while the dimension that actually determines market access in Safety-Critical domains (autonomous driving, medical AI, financial risk control) has been collectively skipped: safety compliance. By constructing a two-way mapping between the 12 Harness Primitives and SOTIF (ISO 21448), this essay identifies 12 concrete research directions, offered as a starting point for standardization bodies, corporate R&D, third-party institutions, and academic labs to jointly fill in this commons. A ~3000-word Chinese short form is available on the author’s WeChat channel.

1. Two Signals Surfacing At The Same Time

Over the past few months, the term “Harness Engineering” has come up constantly in two very different circles.

1.1 Dense Signals From The AI Engineering Community

In February 2026, OpenAI’s official blog post Harness Engineering: Leveraging Codex in an Agent-First World turned what had been an implicit community consensus into a formally named concept. The following month, Anthropic shipped the Managed Agents architecture, whose technical documentation treats the “Agent Harness” as a first-class engineering object. In April, the Chinese startup Nextie closed two rounds within a single month, with Qi Lu and Kai-Fu Lee jointly backing a company that was only four months old — its core narrative being “collective intelligence + Harness.” The same month, the Chinese tech publication Synced ran a feature titled The New Frontier: Harness, Backed Heavily By Lee and Lu, pushing the term fully into the industry mainstream.

In parallel, the GitHub repository awesome-harness-engineering consolidated the implicit community consensus into a formal taxonomy of 12 Primitives: Agent Loop, Planning, Context Delivery, Tool Design, Skills/MCP, Permissions, Memory, Task Runners, Verification, Observability, Debugging, and HITL (Human-in-the-Loop).

1.2 Quieter But Equally Clear Signals From The Automotive Industry

Leading AD companies are independently reinforcing long-term investment in model-output interpretability and transparency, positioning it as a core strategic direction for the next 5–10 years. Internal documents at several OEMs’ technical teams have begun to discuss “Harness” as a concept.

When these two streams of signals arrived together, the first instinct was to align them — isn’t Sense-Plan-Act just Agent Loop? Isn’t ODD checking just Permissions? Isn’t the simulation pipeline just Verification?

But after the alignment, something more valuable emerged.

Both sides are talking about Harness Engineering, but they are not talking about the same Harness. More importantly, neither of the two mainstream trajectories has meaningful intersection with what I have worked on for years — AD safety under SOTIF / ISO 21448.

That is not coincidence. That is structure.

2. User Experience vs Safety Compliance: The Core Distinction

2.1 The Shared Pattern Across Three Mainstream Roadmaps

Laying a week’s worth of material side by side reveals a pattern shared by the three mainstream tracks:

Track	Representative Players	Core Requirement	Optimization Goal
Consumer collective intelligence	Nextie / Tuanzi / Xiaobing Island	Token efficiency + long-horizon tasks + multi-agent coordination	User experience
Enterprise Agent platforms	Anthropic Managed Agents / OpenAI Codex Agent-First	Code productivity + long-task reliability	Engineering efficiency
Automotive model interpretability	Leading AD companies	Model-output transparency + user trust	Product experience

All three face the same underlying problem:

How do we make an intelligent system — powerful, but not fully predictable — operate reliably in open environments?

And all three have chosen the same optimization direction: they are investing in the user experience / performance / efficiency dimension of Harness. Making models stronger, faster, more stable, more transparent — that is their main battleground.

This is the inevitable shape of commercial competition. Whoever delivers better UX, whoever serves stronger models, wins.

2.2 Safety Compliance: The Side Everyone Skipped

But Harness has another dimension that the mainstream has collectively skipped — safety compliance.

When an intelligent system needs to win market approval, pass regulatory review, or be held accountable in a judicial proceeding, “strong experience” is no longer enough. It also needs to be “compliance-stable.”

This dimension matters little in consumer-AI contexts (where the cost of failure is a shallow conversation) or enterprise-Agent contexts (where the cost of failure is buggy code). But in Safety-Critical domains like autonomous driving, medical AI, and financial risk control, it is the second bottom line, the one that decides whether the product can go to market at all:

ISO 21448 (SOTIF) requires “every identified hazard must be reproducible”
ISO 26262 requires “an argumentation chain commensurate with the ASIL level”
Regulators require “every on-road decision must be traceable”

Once these requirements are applied to AI Agents, every one of the 12 primitives surfaces a new class of compliance-engineering problems.

2.3 Why The Mainstream Won’t Self-Invest In Safety Compliance

Why won’t mainstream players self-invest in the compliance commons? Two structural reasons:

First, compliance is not a commercial differentiator. If “end-to-end decision traceability” becomes the industry minimum, no one will choose your car or your Agent because of it. It’s a ticket, not a selling point.

Second, compliance is an industry commons. If a single OEM assigns five engineers to “formal methods for end-to-end decision traceability,” the outcome is a shared industry standard that every competitor can use without having paid for it. Purely commercial entities therefore have little incentive to do this kind of work.

Commons need stewards. Standardization bodies, corporate R&D (forward-looking prototyping groups), third-party institutions, academic labs and universities — these are the natural stewards. Their work is what isn’t urgent today but will be indispensable tomorrow.

In one line: Harness is not a single-dimensional engineering problem. It has a user-experience facet and a safety-compliance facet. The former has been saturated by capital and big tech; the latter is decisive in Safety-Critical contexts, and it is almost empty.

Figure: Harness dual dimensions (user experience vs safety compliance).

3. 12 Primitives × SOTIF Full Mapping

3.1 Overview Table

#	Harness Primitive	SOTIF (ISO 21448) Clause	Related ISO 26262 Practice	Existing AD Industry Counterpart	Research Gap (Compliance Facet)
1	Agent Loop	Clause 4 / 6	Cyclic execution + WCET	Sense-Plan-Act architecture	Iteration-level safety analysis
2	Planning & Decomposition	Clause 5	Hierarchical FSM + FMEA	Behavior Tree, FSM	Plan-level rollback semantics
3	Context Delivery & Compaction	Clause 6 / 8	ASIL B+ sensor validity	Multi-sensor fusion, Scene Graph	Safety-constrained adaptive compression
4	Tool Design	Clause 8	Actuator FMEA + fail-operational	Drive-by-wire, CAN	AI-friendly actuator error standard
5	Skills & MCP	Clause 4	Software unit integration	AUTOSAR Adaptive Function Bus	Cross-OEM interop (“Automotive MCP”)
6	Permissions & Authorization	Clause 4 / 8	Mode mgmt + FFI	ODD checking, Function Arbiter	Two-stage authorization (fast gate + CoT)
7	Memory & State	Clause 12	Persistent fault memory	World Model, EDR	Cross-trip long-term memory safety
8	Task Runners & Orchestration	Clause 4	Real-time scheduler	Mission Planner, Route Manager	State preservation across reboots
9	Verification & CI Integration	Clause 9 / 10	MIL/SIL/PIL/HIL	Simulation + Scenario Library + CI	Automatic test-invalidity detection
10	Observability & Tracing	Clause 12	Diagnostic Event Manager, DTC	Data uplink, EDR, V2X logs	End-to-end decision traceability
11	Debugging & DX	Clause 6	FuSa Toolchain	Scenario replay tools	End-to-end reproducibility bounds
12	Human-in-the-Loop	Clause 5 / 8	Controllability (C0–C3)	TOR, Remote Operator	Supervisor-capability compliance model

Figure: 12 Primitives overview.

3.2 Primitive-By-Primitive Analysis

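Each primitive is analyzed along its UX facet, its compliance facet, and its key research gap, with existing AD practice noted where a direct counterpart exists. The standards mapping for each primitive is given in the overview table in Section 3.1.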

Primitive 1 — Agent Loop

UX facet: Mainstream Agent frameworks (LangGraph graph-state, OpenAI’s Item/Turn/Thread protocol, Anthropic Managed Agents loop runtime) compete on “each loop iteration is faster, cheaper in tokens, and stays on-task.” Nextie’s collective-intelligence work operates in the same layer for concurrent throughput.

Compliance facet: Iteration-level safety analysis is a formal gap. Current SOTIF analyses tend to treat the entire loop as a single functional unit, but AI Agent practice has shown that mismatched loop patterns (persistent vs one-shot) materially shift error rates and token consumption. Each iteration boundary hides a trigger condition — a layer of granularity SOTIF has not yet formalized.

Existing AD practice: Sense-Plan-Act (Apollo, Autoware) is an instantiation of Agent Loop under a different name.

Key research gap: iteration-level safety analysis methodology.
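
One way to picture iteration-level analysis is a loop in which every iteration boundary is checked and recorded on its own, rather than the loop being judged as a single unit. The sketch below is illustrative only: `Iteration`, `check_iteration`, and the 2-second time-gap threshold are hypothetical names and values, not taken from ISO 21448 or from any Agent framework.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Iteration:
    index: int
    speed_mps: float  # planned speed for this iteration
    gap_m: float      # sensed distance to the lead vehicle


def check_iteration(it: Iteration, min_gap_s: float = 2.0) -> List[str]:
    """Return the violations found at this single iteration boundary."""
    violations = []
    if it.speed_mps < 0:
        violations.append("negative speed")
    # Time gap = distance / speed; skipped when the vehicle is stationary.
    if it.speed_mps > 0 and it.gap_m / it.speed_mps < min_gap_s:
        violations.append("time gap below minimum")
    return violations


def run_loop(iterations: List[Iteration]) -> Dict[int, List[str]]:
    """Check every iteration and keep only the ones with violations."""
    report = {}
    for it in iterations:
        found = check_iteration(it)
        if found:
            report[it.index] = found
    return report


trace = [Iteration(0, 20.0, 60.0), Iteration(1, 25.0, 40.0), Iteration(2, 0.0, 5.0)]
report = run_loop(trace)
```

Here only iteration 1 is flagged (40 m at 25 m/s is a 1.6 s gap), which is the kind of per-iteration evidence a whole-loop analysis would average away.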

Primitive 2 — Planning & Task Decomposition

UX facet: The community is converging around three paradigms — LATS (MCTS variant), Plan-and-Execute, Microsoft TaskWeaver — competing on decomposition accuracy and token efficiency. Production-grade driving uses Behavior Trees and FSMs for decision smoothness.

Compliance facet: Plan-level rollback semantics. When a mid-plan step is found infeasible, how do we roll back safely, and how is the rollback chain covered by the safety argument? Neither ISO 21448 nor 26262 currently formalizes this.

Key research gap: plan-level rollback safety semantics.
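
A minimal sketch of what plan-level rollback could look like, assuming each plan step carries an explicit compensating action (a pattern borrowed from saga-style transaction handling, not something either standard prescribes). The step names and the `state` dictionary are hypothetical.

```python
from typing import Callable, List, Tuple

# A step is (name, action, compensation). The action returns True on success.
Step = Tuple[str, Callable[[], bool], Callable[[], None]]


def execute_plan(steps: List[Step]) -> Tuple[bool, List[str]]:
    """Run steps in order; on the first failure, undo completed steps in reverse.

    The returned log is the rollback chain a safety argument would have to cover.
    """
    log: List[str] = []
    done: List[Step] = []
    for step in steps:
        name, action, _ = step
        if action():
            log.append(f"done:{name}")
            done.append(step)
            continue
        log.append(f"failed:{name}")
        for prev_name, _, compensate in reversed(done):
            compensate()
            log.append(f"rolled_back:{prev_name}")
        return False, log
    return True, log


state = {"indicator": False, "lane_offset_m": 0.0}


def indicator_on() -> bool:
    state["indicator"] = True
    return True


def indicator_off() -> None:
    state["indicator"] = False


def start_drift() -> bool:
    state["lane_offset_m"] = 0.5
    return True


def recenter() -> None:
    state["lane_offset_m"] = 0.0


def cross_line() -> bool:
    return False  # simulated: the target gap closed, so the step is infeasible


plan = [
    ("indicate", indicator_on, indicator_off),
    ("drift", start_drift, recenter),
    ("cross", cross_line, lambda: None),
]
ok, log = execute_plan(plan)
```

The open question in the text is precisely what this sketch glosses over: whether each compensation is itself safe to execute in the state where the failure occurred.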

Primitive 3 — Context Delivery & Compaction

UX facet: Anthropic’s Compaction API, LLMLingua-2, Active Context Compression — the race is about packing more useful information within tighter budgets. Automotive counterparts include multi-sensor fusion, Scene Graph, and SD/HD map loading strategies.

Compliance facet: How can compaction operations preserve SOTIF’s perception-completeness and safety-margin guarantees? Which context fragments are “safety-essential” and must not be compacted? This is the intersection of UX-dimension compression and ASIL-B+ data-validity requirements.

Key research gap: safety-constrained adaptive context compression.
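
To make the “must not be compacted” idea concrete, here is a small sketch of budgeted compaction with pinned fragments. It assumes each context fragment already carries a `safety_essential` flag and a relevance score; how those are assigned is the actual research problem, and all names here are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Fragment:
    name: str
    tokens: int
    relevance: float        # higher means more useful to the planner
    safety_essential: bool  # pinned: must never be dropped


def compact(fragments: List[Fragment], budget: int) -> List[Fragment]:
    """Keep every safety-essential fragment, then fill the rest by relevance."""
    pinned = [f for f in fragments if f.safety_essential]
    used = sum(f.tokens for f in pinned)
    if used > budget:
        # Refuse to compact below the safety floor instead of dropping pinned data.
        raise ValueError("budget smaller than safety-essential context")
    kept = list(pinned)
    optional = sorted(
        (f for f in fragments if not f.safety_essential),
        key=lambda f: f.relevance,
        reverse=True,
    )
    for f in optional:
        if used + f.tokens <= budget:
            kept.append(f)
            used += f.tokens
    return kept


ctx = [
    Fragment("pedestrian_track", 40, 0.3, True),
    Fragment("hd_map_tile", 80, 0.9, False),
    Fragment("billboard_text", 50, 0.1, False),
    Fragment("lead_vehicle", 30, 0.8, True),
]
kept_names = [f.name for f in compact(ctx, budget=150)]
```

Note that the pedestrian track survives despite its low relevance score, and that an impossible budget raises an error rather than silently dropping pinned context.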

Primitive 4 — Tool Design

UX facet: AI tool-UX guidelines call for clear naming, strict schemas, and error messages that explain why a call failed and how to fix it. Libraries such as Outlines and instructor turn these guidelines into runtime-enforced constraints. Automotive counterparts: drive-by-wire interfaces, CAN (ISO 11898), AUTOSAR port definitions.

Compliance facet: AI’s three tool-design axioms — clear errors, idempotency, observability — have not yet entered the ISO 26262 normative scope. In particular, “whether an actuator’s error message is sufficient for an AI caller to understand” is a standards gap.

Key research gap: AI-caller-friendly actuator error information standard.
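
As an illustration of what an “AI-caller-friendly” actuator error might carry, the sketch below bundles a stable code, a cause, retry guidance, and a suggested next action. The field names and the example fault are assumptions made for this sketch; no existing standard defines them.

```python
from dataclasses import dataclass
from enum import Enum


class Retry(Enum):
    SAFE = "safe"            # command is idempotent; the caller may retry
    UNSAFE = "unsafe"        # a retry could apply the command twice
    POINTLESS = "pointless"  # the fault is persistent


@dataclass(frozen=True)
class ActuatorError:
    actuator: str
    code: str        # stable, machine-readable identifier
    cause: str       # why the command was rejected
    retry: Retry     # idempotency guidance for the caller
    suggestion: str  # what the caller can do instead


def describe(err: ActuatorError) -> str:
    """Render one line that an AI caller (or a human) can act on."""
    return (
        f"[{err.actuator}:{err.code}] {err.cause}; "
        f"retry={err.retry.value}; next: {err.suggestion}"
    )


err = ActuatorError(
    actuator="brake",
    code="PRESSURE_LOW",
    cause="hydraulic pressure below commanded threshold",
    retry=Retry.POINTLESS,
    suggestion="request degraded mode with reduced max speed",
)
message = describe(err)
```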

Primitive 5 — Skills & MCP

UX facet: The Model Context Protocol (MCP), Microsoft’s Playwright MCP (using the accessibility tree instead of screenshots — an order of magnitude in token savings), Composio’s 250+ SaaS integrations, and the A2A Agent-to-Agent protocol are all competing on skill-ecosystem unification and efficiency. AUTOSAR Adaptive does “functions-as-services” on the vehicle side.

Compliance facet: A cross-OEM “Automotive MCP” does not exist. Every OEM maintains a proprietary skill interface. This is an opening where Chinese standardization bodies could plausibly lead an international specification.

Key research gap: cross-OEM interoperability protocol for automotive skills.

Primitive 6 — Permissions & Authorization

UX facet: Claude Agent SDK’s five-layer evaluation (hooks → deny → mode → allow → canUseTool); Microsoft’s Authorization Fabric with PEP/PDP; the two-stage classifier (fast gate + CoT on flags only) that replaces “ask every time” approval fatigue.

Compliance facet: How to apply the fast-gate + CoT pattern to ODD boundary checking, formalize it under SOTIF’s ODD-compliance requirements, and prove it remains defensible under ASIL decomposition — this is an open research topic.

Key research gap: two-stage authorization for ODD boundary detection (see Section 4 for the full leverage analysis).

Primitive 7 — Memory & State

UX facet: Letta/MemGPT’s three-tier memory model, mem0’s universal memory layer, Zep’s auto-summary + entity extraction + semantic search — the race is “the agent understands you better the more you use it.” Vehicle counterparts: World Model, Event Data Recorder, OBD-II.

Compliance facet: Cross-trip long-term memory introduces a behavioral-consistency-vs-safety tension. If the system remembers “this driver chose to change lanes at this intersection last time,” should it reuse that preference? How do we ensure memory persistence does not violate SOTIF’s “safety behavior predictability” axiom?

Key research gap: behavioral-consistency safety constraints for cross-trip memory.
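
A deliberately simplified sketch of the tension: a remembered preference is reused only when it falls inside the currently valid safety envelope, and the source of the value is reported so the choice stays traceable. A following-gap preference stands in for the lane-change example, and `Preference`, `Envelope`, and the numbers are hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Preference:
    driver_id: str
    following_gap_s: float  # value remembered from earlier trips


@dataclass
class Envelope:
    min_gap_s: float
    max_gap_s: float


def apply_preference(
    pref: Preference, env: Envelope, default_gap_s: float
) -> Tuple[float, str]:
    """Return (gap, source); memory is used only inside the safety envelope."""
    if env.min_gap_s <= pref.following_gap_s <= env.max_gap_s:
        return pref.following_gap_s, "memory"
    return default_gap_s, "default"


wet_road = Envelope(min_gap_s=2.5, max_gap_s=4.0)
chosen = apply_preference(Preference("driver-1", 1.8), wet_road, default_gap_s=3.0)
```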

Primitive 8 — Task Runners & Orchestration

UX facet: Meta’s REA supports 6-hour hibernation checkpoints and multi-workday task continuity — a benchmark for B2B Agent orchestration. Vehicle counterparts: Mission Planner, Route Manager, AUTOSAR OS Task.

Compliance facet: State preservation across OTA reboots and fault recovery. How do we ensure a multi-hour trip interrupted by a system restart does not break the safety argumentation chain?

Key research gap: long-task state preservation under reboot.
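
The following sketch shows one possible shape for a reboot-safe checkpoint: the saved state carries a timestamp and a SHA-256 digest, and restoration refuses anything corrupted or too old. The state fields and the `max_age_s` policy are assumptions for illustration, not a description of any production system.

```python
import hashlib
import json
from typing import Optional


def make_checkpoint(state: dict, saved_at_s: float) -> str:
    """Serialize state with a timestamp and an integrity digest."""
    payload = json.dumps({"state": state, "saved_at_s": saved_at_s}, sort_keys=True)
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    return json.dumps({"payload": payload, "sha256": digest})


def restore_checkpoint(blob: str, now_s: float, max_age_s: float) -> Optional[dict]:
    """Return the saved state, or None if it is corrupted or too old to trust."""
    record = json.loads(blob)
    payload = record["payload"]
    if hashlib.sha256(payload.encode("utf-8")).hexdigest() != record["sha256"]:
        return None
    data = json.loads(payload)
    if now_s - data["saved_at_s"] > max_age_s:
        return None
    return data["state"]


blob = make_checkpoint({"route_leg": 3, "odd_status": "inside"}, saved_at_s=100.0)
fresh = restore_checkpoint(blob, now_s=105.0, max_age_s=30.0)
stale = restore_checkpoint(blob, now_s=500.0, max_age_s=30.0)
```

Returning `None` forces the caller to rebuild state from live sensing, which is the conservative choice when the argumentation chain cannot be shown to be intact.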

Primitive 9 — Verification & CI Integration

UX facet: The AI coding community has converged on a practice heuristic — “if an AI-written test passes buggy code, the test itself is judged invalid and must be redesigned.” This loop is already instrumented into CI tooling. Vehicle counterparts: the simulation + scenario library + CI regression chain.

Compliance facet: Migrate this test-invalidity detection heuristic into SOTIF simulation validation. When a production bad-case fails to reproduce in simulation, automatically distinguish “insufficient fidelity” from “insufficient scenario coverage” and force an improvement.

Key research gap: automatic test-invalidity detection for AD simulation (see Section 4).
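
The triage step described above can be sketched as a coverage lookup: if the scenario library already sweeps the conditions of the field failure and simulation still passed, fidelity is the suspect; otherwise coverage is. `Scenario`, the parameter names, and the ranges are hypothetical, and a real library would need far richer scenario descriptions.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Range = Tuple[float, float]


@dataclass
class Scenario:
    name: str
    params: Dict[str, Range]  # parameter -> (min, max) swept in simulation


def covers(s: Scenario, case: Dict[str, float]) -> bool:
    """True if every parameter of the field case lies inside the swept ranges."""
    return all(
        k in s.params and s.params[k][0] <= v <= s.params[k][1]
        for k, v in case.items()
    )


def triage(case: Dict[str, float], library: List[Scenario]) -> str:
    """Classify a field failure that did not reproduce in simulation."""
    hits = [s.name for s in library if covers(s, case)]
    if hits:
        # The library already sweeps these conditions, yet simulation passed.
        return "insufficient fidelity: " + ", ".join(hits)
    return "insufficient scenario coverage"


library = [
    Scenario("tunnel_exit_glare", {"sun_elevation_deg": (5, 30), "speed_kph": (40, 100)})
]
covered = triage({"sun_elevation_deg": 12, "speed_kph": 80}, library)
uncovered = triage({"sun_elevation_deg": 2, "speed_kph": 80}, library)
```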

Primitive 10 — Observability & Tracing

UX facet: LangSmith, OpenTelemetry for LLMs, full-path Agent decision tracing — these are the AI engineering standard. Vehicle counterparts: data uplink, EDR, V2X logs, UDS diagnostics.

Compliance facet: End-to-end decision traceability. Attention maps and saliency maps are only local explanations. They do not satisfy judicial or regulatory requirements for “reason-of-decision” traceability. This is one of the most decisive compliance gaps of the end-to-end era — it requires interdisciplinary (law + engineering) research.

Key research gap: judicial-grade decision traceability for end-to-end models.

Primitive 11 — Debugging & DX

UX facet: Interactive debugging, readable error messages, tight feedback loops — the AI-coding DX race is fierce (Cursor, Windsurf, Replit Agent). Vehicle counterparts: simulation replay, scenario reconstruction.

Compliance facet: Reproducibility bounds for end-to-end models. When the same input produces different outputs across runs, how do we reproduce the “scene of the problem” so that the SOTIF Safety Case requirement that “every identified hazard must be reproducible” can still be met? Measured against this requirement, end-to-end models with run-to-run output variation cannot currently demonstrate compliance.

Key research gap: quantitative characterization of end-to-end reproducibility boundaries.
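
A first step toward a quantitative characterization is simply to measure the spread of outputs for a fixed input. The sketch below does this for a stand-in model; `noisy_steering` and its noise range are invented for illustration, and a real study would need a distance measure suited to trajectories rather than a single scalar.

```python
import random
import statistics
from typing import Callable, Dict, List


def reproducibility_bound(
    model: Callable[[float, random.Random], float],
    x: float,
    runs: int,
    seed: int,
) -> Dict[str, float]:
    """Run the same input repeatedly and summarize the output spread."""
    rng = random.Random(seed)
    outputs: List[float] = [model(x, rng) for _ in range(runs)]
    return {
        "max_deviation": max(outputs) - min(outputs),
        "stdev": statistics.pstdev(outputs),
    }


def noisy_steering(x: float, rng: random.Random) -> float:
    """Stand-in for a nondeterministic end-to-end model."""
    return 0.1 * x + rng.uniform(-0.01, 0.01)


bound = reproducibility_bound(noisy_steering, x=1.0, runs=200, seed=0)
```

With the noise limited to plus or minus 0.01, `max_deviation` cannot exceed 0.02; a compliance argument would then ask whether that bound is small enough for the hazard in question.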

Primitive 12 — Human-in-the-Loop

UX facet: Agent approval flows, interrupt mechanisms, human-oversight modes — the AI-side race is about “maximal user control with minimal user interruption.” Vehicle counterparts: Take-Over Request (TOR), Remote Operators, safety-driver programs.

Compliance facet: A supervisor-capability compliance model. ISO 26262’s controllability grading (C0–C3) only answers “can the driver control the vehicle.” It does not answer “does this driver have the capacity to supervise this AI.” Tang Daosheng’s observation — that stronger AI raises, not lowers, the demand placed on its human supervisor — translates directly into an open compliance gap. Section 4 develops this leverage point in detail.

Key research gap: supervisor-capability compliance model, including a C4-level extension for L3+ supervisory duty.

4. Three Highest-Leverage Points

Among the 12 research gaps, three stand out as simultaneously high-impact, methodologically well-scoped, and tightly anchored to existing standards — they are the natural first-wave priorities.

4.1 Test Invalidity ↔ SOTIF Clause 6

A practice heuristic that the AI coding community has converged on over the past year:

If an AI-written test “passes” buggy code, the test itself is judged invalid and must be redesigned.

This heuristic is, in essence, the engineering implementation of ISO 21448 Clause 6 — “continuous identification of trigger conditions.” Clause 6’s core logic: every in-field unintended failure should be treated as a new trigger condition that reflexively drives verification-strategy and test-coverage updates.

This principle has been in the SOTIF standard for some time. But at the engineering layer, the industry has stayed at “manually review failed cases” — there has been no automated mechanism to judge “do the existing tests actually cover this failure.”

The AI coding community went from the same starting point to a fully automated toolchain in roughly a year. The AD industry can adopt it directly.
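
In its simplest form the heuristic is a check of each test against known-buggy variants, much as mutation testing does. The sketch below is an assumption about how such a check could be wired, not a description of any specific CI tool; the stopping-distance functions are toy examples.

```python
from typing import Callable, Dict, List

Impl = Callable[[float, float], float]


def reference_stop_distance(speed_mps: float, decel_mps2: float) -> float:
    """Stopping distance v^2 / (2a)."""
    return speed_mps ** 2 / (2 * decel_mps2)


def buggy_stop_distance(speed_mps: float, decel_mps2: float) -> float:
    """Known-bad variant: the factor of 2 is missing."""
    return speed_mps ** 2 / decel_mps2


def weak_test(fn: Impl) -> bool:
    return fn(0.0, 5.0) == 0.0  # true for both variants


def strong_test(fn: Impl) -> bool:
    return abs(fn(20.0, 5.0) - 40.0) < 1e-9


def invalid_tests(
    tests: Dict[str, Callable[[Impl], bool]], buggy: List[Impl]
) -> List[str]:
    """A test is flagged invalid if it passes on any known-buggy implementation."""
    return [name for name, test in tests.items() if any(test(b) for b in buggy)]


tests = {"weak": weak_test, "strong": strong_test}
flagged = invalid_tests(tests, [buggy_stop_distance])
```

In the AD setting the “buggy variant” would be a recorded field failure, and the “test” a simulation scenario that should have caught it.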

Output forms: top-tier conferences (ICSE, ASE, ICRA) + open-source tooling + GB / CSAE standard proposal.

4.2 Two-Stage Authorization ↔ ODD Checking

A central design question in AI Agent circles over the past year has been “approval fatigue.” The community’s answer is a two-stage classifier:

Stage 1 (fast gate): a lightweight classifier dispatches the vast majority of clear cases in one shot.
Stage 2 (CoT reasoning): full reasoning triggered only when Stage 1 flags ambiguity.

This drops authorization compute cost by an order of magnitude.

Now consider ODD checking in production AD systems: every frame runs the full ODD check, whether the vehicle is on a straight, sunlit highway or approaching a backlit tunnel entrance. The compute cost is identical.

Porting the AI-community two-stage pattern directly:

Stage 1: a lightweight model makes a coarse call — “inside ODD,” “approaching boundary,” “outside ODD.”
Stage 2: only when “approaching boundary” is returned does the full ODD detection logic fire.
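
The two stages above can be sketched as follows. The thresholds, the `Frame` fields, and the stand-in `full_check` are all hypothetical; the one property the sketch tries to respect is that the fast gate is conservative, so every frame it calls “inside” or “outside” would receive the same verdict from the full check.

```python
from dataclasses import dataclass
from enum import Enum


class Odd(Enum):
    INSIDE = "inside"
    APPROACHING = "approaching_boundary"
    OUTSIDE = "outside"


@dataclass
class Frame:
    visibility_m: float
    rain_level: int  # 0 means no rain; the scale is illustrative


calls = {"full": 0}  # counts how often the expensive path runs


def fast_gate(f: Frame) -> Odd:
    """Stage 1: cheap thresholds with a deliberately wide 'approaching' band."""
    if f.visibility_m < 50 or f.rain_level >= 3:
        return Odd.OUTSIDE
    if f.visibility_m > 300 and f.rain_level == 0:
        return Odd.INSIDE
    return Odd.APPROACHING


def full_check(f: Frame) -> bool:
    """Stage 2: stand-in for the expensive ODD logic; True means inside ODD."""
    calls["full"] += 1
    return f.visibility_m >= 100 and f.rain_level <= 1


def inside_odd(f: Frame) -> bool:
    coarse = fast_gate(f)
    if coarse is Odd.INSIDE:
        return True
    if coarse is Odd.OUTSIDE:
        return False
    return full_check(f)  # only ambiguous frames pay the full cost


frames = [Frame(800, 0), Frame(900, 0), Frame(150, 1), Frame(30, 0)]
results = [inside_odd(f) for f in frames]
```

Of the four frames, only the third reaches `full_check`. The savings depend entirely on how rarely real traffic lands in the “approaching” band.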

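Expected payoff: a large (order-of-magnitude) reduction in ODD-check compute cost. Safety is preserved only if Stage 1 never labels a boundary case as clearly “inside ODD,” which is exactly the property that has to be formalized and demonstrated.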

Output forms: engineering paper + production-grade solution + cross-OEM cross-validation.

4.3 Constraint As Guidance ↔ SOTIF Clause 8’s “Modification Of The Intended Functionality”

In April 2026, Tang Daosheng published a widely circulated essay, AI Formally Enters The Harness Era, giving Harness a crisp three-way metaphor:

The large model is the engine. The Harness is the wiring harness. The user is the driver.

The metaphor itself is not new — the Ford-era automotive industry was literally “engine first, harness next, driver third.” But Tang adds a deeper definition:

Constraint is not suppression of intelligence; it is the guidance of intelligence.

This sentence has a strict counterpart in automotive safety engineering: Clause 8 of ISO 21448 (SOTIF), on “modification of the intended functionality.”

Clause 8’s “modifications” include:

Disabling certain functions (e.g., disabling ACC under a high-risk ODD)
Restricting activation conditions (e.g., activation only under specific weather and road types)
Degrading performance (e.g., lower max speed, increased following distance)
Using the driver as a risk-reduction measure (e.g., issuing a TOR at boundary conditions)

From an AI-engineering viewpoint, these actions can look like “weakening” the model. But in SOTIF’s original intent, they are never weakening — they are mechanisms to guide the model’s behavior into safe space.

In other words, every Clause 8 mechanism is, at its heart, a Harness mechanism:

ODD boundary checking → Primitive 6 (Permissions)
Function arbiter → Primitive 2 (Planning) + Primitive 6 (Permissions)
Degradation strategies → Primitive 12 (HITL)
TOR and remote operators → Primitive 12 (HITL)
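
Written as data, the mapping above can be inverted to show which primitives currently have no Clause 8 mechanism attached. This is only a bookkeeping sketch: the dictionary restates the four lines above, and the variable names are made up for the example.

```python
from collections import defaultdict
from typing import Dict, List

# Mechanism -> primitive numbers, exactly as listed in the text.
MECHANISM_TO_PRIMITIVES: Dict[str, List[int]] = {
    "ODD boundary checking": [6],
    "Function arbiter": [2, 6],
    "Degradation strategies": [12],
    "TOR and remote operators": [12],
}


def invert(mapping: Dict[str, List[int]]) -> Dict[int, List[str]]:
    """Group mechanisms by primitive so that uncovered primitives stand out."""
    by_primitive = defaultdict(list)
    for mechanism, primitives in mapping.items():
        for p in primitives:
            by_primitive[p].append(mechanism)
    return dict(by_primitive)


by_primitive = invert(MECHANISM_TO_PRIMITIVES)
uncovered = [p for p in range(1, 13) if p not in by_primitive]
```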

Re-expressing Clause 8’s modification semantics through the 12 Primitives yields two very concrete research outputs.

First, a “Clause 8 modification type × Harness primitive” mapping. Drawing this out exposes several Clause 8 gaps:

“Model confidence as a modification trigger” — standard AI-Agent practice, not yet written into Clause 8.
“In-task functional degradation” — Meta REA’s checkpoint paradigm has no Clause 8 counterpart.
“Degradation semantics under multi-Agent collaboration failure” — relevant to V2X + multi-vehicle coordination, Clause 8 is silent.

Second, a supervisor-capability model. Tang writes in the same essay:

The stronger AI becomes, the more — not less — it demands of the humans around it. A person capable of safely supervising an autonomous-driving system needs to understand driving more deeply than an ordinary driver does.

Translated to AD compliance language:

An L3 driver’s supervision duty is harder than an L2 driver’s, not easier.
An L4 remote supervisor must understand system boundaries, know when an ODD fails, and know when to take over.
Controllability grades C0–C3 are, fundamentally, about “how much responsibility can the human, as part of the safety system, actually carry?”

Current ISO 26262 controllability grading addresses only “can the driver control the vehicle” — not “does this driver have the capacity to supervise this AI.” The supervisor-capability model is a completely empty research direction.

Output forms: academic papers on Clause 8 modification semantics + production-grade supervisor-capability evaluation + a standardization proposal for L3+ controllability (e.g., a C4 extension).

5. Two Directions That Need Someone To Step In

Charting the 12-Primitives landscape surfaces two directions that no commercial player will treat as its main battleground, but that someone must take on. They are the natural domain of standardization bodies, corporate forward-looking R&D teams, third-party institutions, and research labs / universities.

5.1 The Automotive MCP (Primitive 5)

MCP is the open protocol Anthropic proposed in late 2024 to give AI Agents a unified way to call tools and data sources. In just over a year, it has become the de facto standard of the AI-tooling ecosystem.

The automotive industry has exactly the same need, but no single OEM will champion it — “deepening my own interfaces” makes commercial sense, “opening cross-OEM interoperability” does not.

That is precisely the territory of non-commercial stewards. MCP took off because Anthropic open-sourced the protocol and Google followed with A2A — academic-style openness plus standardization-body momentum, not any single company’s market strategy. An Automotive MCP needs the same kind of engine.

The Chinese window is concrete:

AUTOSAR Adaptive has China-localized extension interfaces.
C-V2X has a China-led base.
The GB / CSAE standardization system has genuine incentive to drive cross-OEM specifications.

A first draft in 2026–2027 and an ISO-level submission in 2028 is a realistic trajectory.

5.2 A Systematic Methodology For Safety-Compliance Harness

The larger direction: turn “12 Primitives × SOTIF” compliance mapping into an independent methodology.

It does not compete with any OEM’s commercial strategy — because UX is the OEM main battleground and compliance is the industry commons. Its audiences are:

Top-level topic selection in major projects and forward-looking programs
Drafting new clauses in GB / ISO
New research directions in university curricula
The PhD-topic pool for junior researchers
Incubation directions for corporate forward-looking teams
Evaluation frameworks for third-party certification bodies

This is not a “whoever moves first owns it” situation. Anyone who steps into the commons and contributes a brick is contributing, regardless of order. Subsequent contributors may extend or overturn prior work; what matters is that the direction is right, and that participation is itself valuable.

Within 12–18 months, comparable work will likely emerge from overseas labs — CMU CyLab, MIT CSAIL, TU Munich’s automotive-engineering chair, among others. This is not and should not be a “Chinese-first-means-Chinese-exclusive” exercise. It should be, from the outset, a multi-stakeholder public process.

6. The 12 Research Gaps

#	Research Gap	Likely Output Form	Priority
1	Iteration-level safety analysis for end-to-end loops	arXiv paper + tool prototype	Medium
2	Rollback safety semantics for multi-step planning	journal paper + simulation case	Medium
3	Safety-constrained adaptive perception compression	top-tier paper + production use	High
4	AI-friendly actuator-error information standard	standard proposal	Medium
5	Automotive MCP interoperability protocol	national-standard proposal	High
6	Formalization of two-stage ODD checking	engineering paper + production deployment	High
7	Safety constraints on cross-trip long-term memory	interdisciplinary paper (HCI + Safety)	Medium
8	Long-task checkpoint mechanism	engineering paper + industry whitepaper	Medium
9	Automatic test-invalidity detection for AD simulation	top-tier paper + open-source tool	High
10	Judicial-grade decision traceability for end-to-end models	law + engineering interdisciplinary paper	High
11	Reproducibility bounds for end-to-end models	foundational research paper	Medium
12	Supervisor-capability compliance model (incl. C4 extension)	engineering paper + standardization proposal	High

Four of the six “High” items (5, 6, 9, and 12) correspond directly to the leverage points and directions in Sections 4 and 5; the remaining two (3 and 10) are rated High on their own merits. Each can be pursued independently as a focused research program.

7. A Few Small Open Projects On The Compliance Side

Speaking about “safety-compliance Harness” without doing anything convinces nobody. Over the past month, working in personal time with heavy assistance from AI tooling, I have done a few small open projects that turn out — in hindsight — to sit squarely on the compliance side of the 12 Primitives. These projects were not originally conceived with the “Harness” coordinate system; but once that coordinate system is laid out, they map to it cleanly.

A few caveats up front:

These projects were done in personal (non-working) time.
They relied heavily on AI tooling (primarily the Claude ecosystem).
They are incomplete and small-sample — this is personal exploration.
But they are tightly connected to standards and regulations, and I genuinely hope that these attempts can, in some form, be seen by standardization bodies, forward-looking corporate teams, or third-party institutions — and perhaps make their way into standards and regulations themselves.

7.1 ROAM — Open Database Of L4 Robotaxi Remote-Operation Incidents

ROAM (RoboTaxi Operations Anomaly Management) is an open-source L4 Robotaxi remote-operation incident database + scenario taxonomy + reference architecture, published in early April 2026.

Current coverage:

16 structured public-incident records (Waymo / Cruise / Baidu Apollo / Pony.ai, 2023–2026)
Scenario taxonomy v1.0 (6 categories A–F, 27 sub-scenarios) + severity S0–S4 + urgency U0–U3
A three-tier reference architecture (AI autonomous 70% → AI + human confirmation 25% → remote teleop 5%)
8 KPI definitions
Chinese and English bilingual literature surveys (each 4000+ words, 42 references with clickable URLs)
13 international-standard mappings + Chinese-standard alignment analysis
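
For readers who think in data structures, a record under this taxonomy might look like the sketch below. It is not ROAM's actual schema: the class, field names, and the placeholder operator are hypothetical, and only the value ranges (categories A to F, S0 to S4, U0 to U3) come from the list above.

```python
from dataclasses import dataclass

CATEGORIES = set("ABCDEF")


@dataclass(frozen=True)
class Incident:
    operator: str
    year: int
    category: str  # taxonomy category, "A" to "F"
    severity: int  # 0..4, written S0..S4
    urgency: int   # 0..3, written U0..U3

    def __post_init__(self) -> None:
        if self.category not in CATEGORIES:
            raise ValueError(f"unknown category {self.category!r}")
        if not 0 <= self.severity <= 4:
            raise ValueError("severity must be S0..S4")
        if not 0 <= self.urgency <= 3:
            raise ValueError("urgency must be U0..U3")

    def label(self) -> str:
        """Compact coordinate in the taxonomy, e.g. 'C/S2/U1'."""
        return f"{self.category}/S{self.severity}/U{self.urgency}"


# Placeholder values only; this is not a real incident record.
example = Incident("ExampleOperator", 2025, "C", 2, 1)
label = example.label()
```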

ROAM’s placement in the 12-Primitives framework:

Primitive 10 (Observability / decision traceability): structuring and publishing incident-scene records at the remote-operations layer. Co-locating every Robotaxi’s incidents in one scenario-taxonomy coordinate system is itself an industry-level traceability experiment.
Primitive 12 (HITL): the conditions under which a remote operator intervenes, and the takeover duration and action type after intervention, are the baseline data for HITL compliance.

7.2 OpenODC — Open Platform For AD Operational Design Conditions

OpenODC (Open Platform for Automated Driving Operational Design Conditions) is based on GB/T 45312-2025 Intelligent and Connected Vehicles — Operational Design Conditions for Automated Driving Systems, a Chinese national standard in whose drafting I participated.

Core deliverables:

Complete GB/T 45312-2025 element hierarchy JSON Schema (144 elements / 7 categories)
Machine-readable quantized grading (12-level wind, 4-level rain, 4-level visibility, snow, illumination)
/gallery sample library + /view document detail (4 views: developer / test / regulator / consumer)
/editor in-browser ODC editor
/compare multi-document diff (2–4 side-by-side with consistency annotation)
GitHub PR template + CI auto-validation
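
The idea behind a multi-document diff can be sketched in a few lines. The element names and levels below are invented for illustration and do not reproduce the GB/T 45312-2025 hierarchy or the OpenODC schema; the point is only that explicit declarations make differences, including silences, mechanically detectable.

```python
from typing import Dict, List, Optional, Tuple

Diff = Tuple[str, Optional[int], Optional[int]]


def compare(a: Dict[str, int], b: Dict[str, int]) -> List[Diff]:
    """List elements where two declarations differ or where one is silent."""
    diffs: List[Diff] = []
    for key in sorted(set(a) | set(b)):
        va, vb = a.get(key), b.get(key)
        if va != vb:
            diffs.append((key, va, vb))
    return diffs


# Hypothetical declarations: element name -> declared level.
oem_a = {"wind_level_max": 6, "rain_level_max": 2, "visibility_level_min": 2}
oem_b = {"wind_level_max": 6, "rain_level_max": 1}
diffs = compare(oem_a, oem_b)
```

A `None` on one side marks an element that one declaration leaves unstated, which is often the more interesting finding.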

OpenODC in the 12-Primitives framework:

Primitive 6 (Permissions / ODD): turning ODD dimensions from implicit convention into explicit schema, so that industry ODD declarations become comparable and traceable. “Permissions” as a primitive requires, first and foremost, that the permission boundary be written down legibly.
Primitive 11 (Debugging / DX): making different OEMs’ ODC declarations comparable is the precondition for reproducibility. If ODD boundaries are stated differently by each OEM, cross-OEM hazard reproduction is out of reach.

7.3 AD Standards Tracker — Global AD Regulation & Standards Monitor

AD Standards Tracker is a structured tracker of global AD standards whose MVP launched in mid-April 2026.

Current coverage:

7 jurisdictions (International / China / US / EU / UK / Japan / Korea)
~79 structured standard records
YAML data layer + Next.js frontend + Vercel deployment
GitHub Actions automated crawler framework (not yet wired to production data sources)

AD Standards Tracker in the 12-Primitives framework:

Primitive 4 (Tool Design): making the “global AD standards” ecosystem a structured, callable, comparable resource is a tool-design exercise at the meta level.
Primitive 10 (Observability): compliance itself needs observability. Whether a given OEM is compliant in a given jurisdiction requires that the global standards evolution be trackable, subscribable, and citable.

7.4 Two More Threads In Flight

awesome-harness-engineering PR #1 submits the English cross-reference table of “12 Primitives × ISO 21448” as a patch to the upstream repository. It is the first internationally visible signal for compliance-dimension Harness in the open-source community.

An English arXiv position paper is in draft, with the working title:

Compliance-Oriented Harness Engineering: A Framework for AI-First Safety-Critical Systems

Submission target: Q2 2026.

7.5 An Honest Note

Let me restate the disclaimer from the opening of this section: the efforts above were all built in personal time as small explorations. None of them rises to the level of “engineering validation” in any serious sense. They are more like a few small lamps lit on this methodology map, enough to give the “safety-compliance dimension of Harness” a visible shape in concrete engineering.

Making this direction real will require far more participants. If these small projects can encourage even a few peers to, under a banner of openness and transparency, steer their own work in a similar direction, they will have vastly exceeded their original brief.

8. An Invitation To Three Audiences

If you are an OEM algorithm or engineering lead, the 12 Primitives can serve as a self-audit checklist. You are already investing in the UX side as part of your commercial strategy; the compliance-side gaps are where we can start a conversation.

If you are involved in standards drafting, the Automotive MCP and the safety-compliance Harness specification are open windows. At the GB, CSAE, or ISO level, I am happy to see this framework used as drafting input.

If you are a graduate student, a young faculty member, or a member of a corporate forward-looking team, any one of the 12 research gaps is worth a focused dive into the underlying mechanism and engineering trajectory. Whether you move early or late, stepping in is itself part of advancing this commons.

References And Linked Work

Chinese deep-dive version (this article): https://blog.autozyx.com/posts/harness-compliance-dimension/
Author’s earlier piece in the same series: Harness Engineering’s Cross-Domain Application In Intelligent Driving
ROAM project: https://roam.autozyx.com | GitHub: https://github.com/AutoZYX/ROAM
OpenODC platform: https://openodc.autozyx.com | GitHub: https://github.com/AutoZYX/OpenODC
AD Standards Tracker: https://standards.autozyx.com
Author’s other open projects: Co4Pilot · ADSafetyPilot · Good Ideas 2.0
awesome-harness-engineering PR #1: ai-boost/awesome-harness-engineering#1

Key Industry Events Referenced

Feb 2026: OpenAI, Harness Engineering: Leveraging Codex in an Agent-First World
Mar 2026: Anthropic, Managed Agents architecture launch
Apr 2026: Nextie raises two rounds in one month, backed by Qi Lu and Kai-Fu Lee
Apr 2026: Synced, The New Frontier: Harness, Backed Heavily By Lee and Lu
Apr 2026: Tang Daosheng, AI Formally Enters The Harness Era

Key Standards Referenced

ISO 21448:2022 — Road vehicles — Safety of the intended functionality
ISO 26262:2018 — Road vehicles — Functional safety
ISO 11898 — Road vehicles — Controller area network (CAN)
GB/T 45312-2025 — Intelligent and Connected Vehicles — Operational Design Conditions for Automated Driving Systems
AUTOSAR Adaptive Platform Specification

About the author: Yuxin Zhang, Associate Professor at the School of Automotive Engineering, Jilin University; Director of the AD Safety Joint Lab; Head of Functional Safety at Zhuoyu Technology; Founder of DRIVEResearch; Visiting Scholar at Durham University, UK (2025–2026). Research interests: SOTIF (ISO 21448), Functional Safety (ISO 26262), scenario-driven testing and evaluation.

Contact: yuxinzhang@jlu.edu.cn

Citation: Zhang, Yuxin. Harness Engineering: User Experience vs Safety Compliance — A Direction Mainstream Roadmaps Have Collectively Skipped. 2026-04-19. https://blog.autozyx.com/posts/harness-compliance-dimension/

License: CC BY 4.0
