Abstract: In Q1 2026, “Harness Engineering” surfaced almost simultaneously at OpenAI, Anthropic, and the Chinese startup Nextie, and the “12 Primitives” converged in the open-source community as a shared taxonomy. This essay argues that essentially all mainstream investment in Harness has concentrated in a single dimension — user experience, performance, efficiency — while the dimension that actually determines market access in Safety-Critical domains (autonomous driving, medical AI, financial risk control) has been collectively skipped: safety compliance. By constructing a two-way mapping between the 12 Harness Primitives and SOTIF (ISO 21448), this essay identifies 12 concrete research directions, offered as a starting point for standardization bodies, corporate R&D, third-party institutions, and academic labs to jointly fill in this commons. A ~3000-word Chinese short form is available on the author’s WeChat channel.
1. Two Signals Surfacing At The Same Time
Over the past two months, the term “Harness Engineering” has come up constantly in two very different circles.
1.1 Dense Signals From The AI Engineering Community
In February 2026, OpenAI’s official blog post Harness Engineering: Leveraging Codex in an Agent-First World turned what had been an implicit community consensus into a formally named concept. The following month, Anthropic shipped the Managed Agents architecture, whose technical documentation treats the “Agent Harness” as a first-class engineering object. In April, the Chinese startup Nextie closed two rounds within a single month, with Qi Lu and Kai-Fu Lee jointly backing a company that was only four months old — its core narrative being “collective intelligence + Harness.” The same month, the Chinese tech publication Synced ran a feature titled The New Frontier: Harness, Backed Heavily By Lee and Lu, pushing the term fully into the industry mainstream.
In parallel, the GitHub repository awesome-harness-engineering consolidated the implicit community consensus into a formal taxonomy of 12 Primitives: Agent Loop, Planning, Context Delivery, Tool Design, Skills/MCP, Permissions, Memory, Task Runners, Verification, Observability, Debugging, and HITL (Human-in-the-Loop).
1.2 Quieter But Equally Clear Signals From The Automotive Industry
The automotive industry’s signals have been quieter but equally clear. Leading AD companies are independently reinforcing long-term investment in model-output interpretability and transparency, positioning it as a core strategic direction for the next 5–10 years. Internal documents at several OEM technical teams have begun discussing “Harness” as a concept.
When these two streams of signals arrived together, the first instinct was to align them — isn’t Sense-Plan-Act just Agent Loop? Isn’t ODD checking just Permissions? Isn’t the simulation pipeline just Verification?
But after the alignment, something more valuable emerged.
Both sides are talking about Harness Engineering, but they are not talking about the same Harness. More importantly, neither of the two mainstream trajectories has meaningful intersection with what I have worked on for years — AD safety under SOTIF / ISO 21448.
That is not coincidence. That is structure.
2. User Experience vs Safety Compliance: The Core Distinction
2.1 The Shared Pattern Across Three Mainstream Roadmaps
Laying out a week’s worth of materials side by side reveals a commonality across the three mainstream tracks:
| Track | Representative Players | Core Requirement | Optimization Goal |
|---|---|---|---|
| Consumer collective intelligence | Nextie / Tuanzi / Xiaobing Island | Token efficiency + long-horizon tasks + multi-agent coordination | User experience |
| Enterprise Agent platforms | Anthropic Managed Agents / OpenAI Codex Agent-First | Code productivity + long-task reliability | Engineering efficiency |
| Automotive model interpretability | Leading AD companies | Model-output transparency + user trust | Product experience |
All three face the same underlying problem:
How do we make an intelligent system — powerful, but not fully predictable — operate reliably in open environments?
And all three have chosen the same optimization direction: they are investing in the user experience / performance / efficiency dimension of Harness. Making models stronger, faster, more stable, more transparent — that is their main battleground.
This is the inevitable shape of commercial competition. Whoever delivers better UX, whoever serves stronger models, wins.
2.2 Safety Compliance: The Side Everyone Skipped
But Harness has another dimension that the mainstream has collectively skipped — safety compliance.
When an intelligent system needs to win market approval, pass regulatory review, or be held accountable in a judicial proceeding, “strong experience” is no longer enough. It also needs to be “compliance-stable.”
This dimension is not particularly important in consumer-AI contexts (where the failure cost is shallow conversation) or enterprise-Agent contexts (where the failure cost is buggy code). But in Safety-Critical domains like autonomous driving, medical AI, and financial risk control, it is the second bottom line — the one that decides whether the product can go to market at all:
- ISO 21448 (SOTIF) requires “every identified hazard must be reproducible”
- ISO 26262 requires “an argumentation chain commensurate with the ASIL level”
- Regulators require “every on-road decision must be traceable”
Once these requirements are applied to AI Agents, every one of the 12 primitives surfaces a new class of compliance-engineering problems.
2.3 Why The Mainstream Won’t Self-Invest In Safety Compliance
Why won’t mainstream players self-invest in the compliance commons? Two structural reasons:
First, compliance is not a commercial differentiator. If “end-to-end decision traceability” becomes the industry minimum, no one will choose your car or your Agent because of it. It’s a ticket, not a selling point.
Second, compliance is an industry commons. If a single OEM invests five engineers in “formal methods for end-to-end decision traceability,” the outcome is a shared industry standard that benefits every competitor as much as the OEM that paid for it. Purely commercial entities have little incentive to fund this kind of work alone.
Commons need stewards. Standardization bodies, corporate R&D (forward-looking prototyping groups), third-party institutions, academic labs and universities — these are the natural stewards. Their work is what isn’t urgent today but will be indispensable tomorrow.
In one line: Harness is not a single-dimensional engineering problem. It has a user-experience facet and a safety-compliance facet. The former has been saturated by capital and big tech; the latter is decisive in Safety-Critical contexts, and it is almost empty.

Figure: Harness Dual Dimensions
3. 12 Primitives × SOTIF Full Mapping
3.1 Overview Table
| # | Harness Primitive | SOTIF (ISO 21448) Clause | Related ISO 26262 Practice | Existing AD Industry Counterpart | Research Gap (Compliance Facet) |
|---|---|---|---|---|---|
| 1 | Agent Loop | Clause 4 / 6 | Cyclic execution + WCET | Sense-Plan-Act architecture | Iteration-level safety analysis |
| 2 | Planning & Decomposition | Clause 5 | Hierarchical FSM + FMEA | Behavior Tree, FSM | Plan-level rollback semantics |
| 3 | Context Delivery & Compaction | Clause 6 / 8 | ASIL B+ sensor validity | Multi-sensor fusion, Scene Graph | Safety-constrained adaptive compression |
| 4 | Tool Design | Clause 8 | Actuator FMEA + fail-operational | Drive-by-wire, CAN | AI-friendly actuator error standard |
| 5 | Skills & MCP | Clause 4 | Software unit integration | AUTOSAR Adaptive Function Bus | Cross-OEM interop (“Automotive MCP”) |
| 6 | Permissions & Authorization | Clause 4 / 8 | Mode mgmt + FFI | ODD checking, Function Arbiter | Two-stage authorization (fast gate + CoT) |
| 7 | Memory & State | Clause 12 | Persistent fault memory | World Model, EDR | Cross-trip long-term memory safety |
| 8 | Task Runners & Orchestration | Clause 4 | Real-time scheduler | Mission Planner, Route Manager | State preservation across reboots |
| 9 | Verification & CI Integration | Clause 9 / 10 | MIL/SIL/PIL/HIL | Simulation + Scenario Library + CI | Automatic test-invalidity detection |
| 10 | Observability & Tracing | Clause 12 | Diagnostic Event Manager, DTC | Data uplink, EDR, V2X logs | End-to-end decision traceability |
| 11 | Debugging & DX | Clause 6 | FuSa Toolchain | Scenario replay tools | End-to-end reproducibility bounds |
| 12 | Human-in-the-Loop | Clause 5 / 8 | Controllability (C0–C3) | TOR, Remote Operator | Supervisor-capability compliance model |

Figure: 12 Primitives Overview
3.2 Primitive-By-Primitive Analysis
Each primitive is analyzed along its UX facet (including existing industry practice) and its compliance facet, and closes with the key research gap. The standards mapping for each primitive is given in the overview table in Section 3.1.
Primitive 1 — Agent Loop
UX facet: Mainstream Agent frameworks (LangGraph graph-state, OpenAI’s Item/Turn/Thread protocol, Anthropic Managed Agents loop runtime) compete on “each loop iteration is faster, cheaper in tokens, and stays on-task.” Nextie’s collective-intelligence work operates in the same layer for concurrent throughput.
Compliance facet: Iteration-level safety analysis is a formal gap. Current SOTIF analyses tend to treat the entire loop as a single functional unit, but AI Agent practice has shown that mismatched loop patterns (persistent vs one-shot) materially shift error rates and token consumption. Each iteration boundary hides a trigger condition — a layer of granularity SOTIF has not yet formalized.
Existing AD practice: Sense-Plan-Act (Apollo, Autoware) is an instantiation of Agent Loop under a different name.
Key research gap: iteration-level safety analysis methodology.
Primitive 2 — Planning & Task Decomposition
UX facet: The community is converging around three paradigms — LATS (MCTS variant), Plan-and-Execute, Microsoft TaskWeaver — competing on decomposition accuracy and token efficiency. Production-grade driving uses Behavior Trees and FSMs for decision smoothness.
Compliance facet: Plan-level rollback semantics. When a mid-plan step is found infeasible, how do we roll back safely, and how is the rollback chain covered by the safety argument? Neither ISO 21448 nor 26262 currently formalizes this.
Key research gap: plan-level rollback safety semantics.
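One way to make the rollback chain explicit is a saga-style executor: each plan step carries a compensating action, and every execute or rollback event is appended to a log that a safety argument could later reference. The sketch below is illustrative only. `PlanStep`, `execute_plan`, and the lane-change step names are hypothetical and are not drawn from ISO 21448, ISO 26262, or any AD stack.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class PlanStep:
    name: str
    do: Callable[[], bool]      # returns False if the step is infeasible
    undo: Callable[[], None]    # compensating action for a completed step


def execute_plan(steps: List[PlanStep]) -> Tuple[bool, List[str]]:
    """Run steps in order; on failure, undo completed steps in reverse order.

    Returns (success, log). The log records every execute/rollback event so
    the rollback chain itself can be audited.
    """
    log: List[str] = []
    done: List[PlanStep] = []
    for step in steps:
        if step.do():
            log.append(f"executed:{step.name}")
            done.append(step)
        else:
            log.append(f"infeasible:{step.name}")
            for prev in reversed(done):
                prev.undo()
                log.append(f"rolled_back:{prev.name}")
            return False, log
    return True, log


# Hypothetical lane-change plan whose second step turns out to be infeasible.
state = {"indicator": False}


def _indicator_on() -> bool:
    state["indicator"] = True
    return True


def _indicator_off() -> None:
    state["indicator"] = False


plan = [
    PlanStep("signal_left", _indicator_on, _indicator_off),
    PlanStep("merge_left", lambda: False, lambda: None),  # gap closed: infeasible
]
ok, trace = execute_plan(plan)
```

Here the log ends with a `rolled_back:signal_left` entry, which is the kind of artifact a plan-level safety argument would need to cover. Whether a compensating action is itself safe to execute is exactly the open question and is not addressed by this sketch.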
Primitive 3 — Context Delivery & Compaction
UX facet: Anthropic’s Compaction API, LLMLingua-2, Active Context Compression — the race is about packing more useful information within tighter budgets. Automotive counterparts include multi-sensor fusion, Scene Graph, and SD/HD map loading strategies.
Compliance facet: How can compaction operations preserve SOTIF’s perception-completeness and safety-margin guarantees? Which context fragments are “safety-essential” and must not be compacted? This is the intersection of UX-dimension compression and ASIL-B+ data-validity requirements.
Key research gap: safety-constrained adaptive context compression.
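A minimal sketch of what “safety-essential fragments must not be compacted” could look like in code: safety-essential fragments are always kept, optional ones are added greedily by relevance, and the function refuses to proceed if the essential set alone exceeds the budget. The `Fragment` fields, the greedy policy, and the scene labels are assumptions made for illustration; how a fragment earns the `safety_essential` flag is the actual research problem and is not solved here.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Fragment:
    label: str
    cost: int               # size in budget units (tokens, bytes, ...)
    relevance: float        # higher = more useful to the planner
    safety_essential: bool  # must never be dropped


def compact(fragments: List[Fragment], budget: int) -> List[Fragment]:
    """Keep all safety-essential fragments, then fill greedily by relevance.

    Raises ValueError if the safety-essential set alone exceeds the budget,
    so the caller must degrade explicitly rather than drop context silently.
    """
    essential = [f for f in fragments if f.safety_essential]
    used = sum(f.cost for f in essential)
    if used > budget:
        raise ValueError("budget cannot hold the safety-essential context")
    kept = list(essential)
    optional = sorted(
        (f for f in fragments if not f.safety_essential),
        key=lambda f: f.relevance,
        reverse=True,
    )
    for f in optional:
        if used + f.cost <= budget:
            kept.append(f)
            used += f.cost
    return kept


scene = [
    Fragment("pedestrian_near_crosswalk", cost=40, relevance=0.6, safety_essential=True),
    Fragment("billboard_text", cost=30, relevance=0.1, safety_essential=False),
    Fragment("lead_vehicle_track", cost=30, relevance=0.9, safety_essential=False),
    Fragment("parked_cars_far_side", cost=30, relevance=0.4, safety_essential=False),
]
kept_labels = [f.label for f in compact(scene, budget=100)]
```

Raising an error instead of truncating is a deliberate choice in this sketch: an over-budget safety-essential set is treated as a condition the system must handle visibly, not as something the compactor may resolve on its own.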
Primitive 4 — Tool Design
UX facet: AI tool-UX guidelines (clear naming, strict schemas, error messages explaining why/how). Outlines and instructor turn these into runtime-enforced constraints. Automotive counterparts: drive-by-wire interfaces, CAN (ISO 11898), AUTOSAR port definitions.
Compliance facet: AI’s three tool-design axioms — clear errors, idempotency, observability — have not yet entered the ISO 26262 normative scope. In particular, “whether an actuator’s error message is sufficient for an AI caller to understand” is a standards gap.
Key research gap: AI-caller-friendly actuator error information standard.
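To make the three axioms concrete, the toy actuator below returns a structured error that states why a request failed and what the caller may do next, treats a replayed command id as a no-op (idempotency), and logs every request (observability). All names, fields, and the deceleration limit are hypothetical; nothing here reflects an existing drive-by-wire interface or an ISO 26262 requirement.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass(frozen=True)
class ActuatorError:
    code: str          # stable machine-readable identifier
    why: str           # cause, in terms the caller can reason about
    how: str           # what the caller may do next
    retry_safe: bool   # True if re-sending the same command is harmless


@dataclass
class BrakeActuator:
    """Toy actuator: idempotent by command id, logs every request."""
    max_decel: float = 8.0                                # m/s^2, illustrative limit
    applied: Dict[str, float] = field(default_factory=dict)
    log: List[str] = field(default_factory=list)

    def request(self, command_id: str, decel: float) -> Optional[ActuatorError]:
        self.log.append(f"{command_id}:{decel}")          # observability
        if command_id in self.applied:                    # idempotency: replay is a no-op
            return None
        if decel < 0 or decel > self.max_decel:
            return ActuatorError(
                code="DECEL_OUT_OF_RANGE",
                why=f"requested {decel} m/s^2, limit is 0..{self.max_decel}",
                how="clamp the request to the limit or escalate to fallback",
                retry_safe=False,
            )
        self.applied[command_id] = decel
        return None


brake = BrakeActuator()
first = brake.request("cmd-1", 3.0)
replay = brake.request("cmd-1", 3.0)   # same id: accepted, not applied twice
bad = brake.request("cmd-2", 12.0)     # out of range: structured error
```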
Primitive 5 — Skills & MCP
UX facet: The Model Context Protocol (MCP), Microsoft’s Playwright MCP (using the accessibility tree instead of screenshots — an order of magnitude in token savings), Composio’s 250+ SaaS integrations, and the A2A Agent-to-Agent protocol are all competing on skill-ecosystem unification and efficiency. AUTOSAR Adaptive does “functions-as-services” on the vehicle side.
Compliance facet: A cross-OEM “Automotive MCP” does not exist. Every OEM maintains a proprietary skill interface. This is an opening where Chinese standardization bodies could plausibly lead an international specification.
Key research gap: cross-OEM interoperability protocol for automotive skills.
Primitive 6 — Permissions & Authorization
UX facet: Claude Agent SDK’s five-layer evaluation (hooks → deny → mode → allow → canUseTool); Microsoft’s Authorization Fabric with PEP/PDP; the two-stage classifier (fast gate + CoT on flags only) that replaces “ask every time” approval fatigue.
Compliance facet: How to apply the fast-gate + CoT pattern to ODD boundary checking, formalize it under SOTIF’s ODD-compliance requirements, and prove it remains defensible under ASIL decomposition — this is an open research topic.
Key research gap: two-stage authorization for ODD boundary detection (see Section 4 for the full leverage analysis).
Primitive 7 — Memory & State
UX facet: Letta/MemGPT’s three-tier memory model, mem0’s universal memory layer, Zep’s auto-summary + entity extraction + semantic search — the race is “the agent understands you better the more you use it.” Vehicle counterparts: World Model, Event Data Recorder, OBD-II.
Compliance facet: Cross-trip long-term memory introduces a behavioral-consistency-vs-safety tension. If the system remembers “this driver chose to change lanes at this intersection last time,” should it reuse that preference? How do we ensure memory persistence does not violate SOTIF’s “safety behavior predictability” axiom?
Key research gap: behavioral-consistency safety constraints for cross-trip memory.
Primitive 8 — Task Runners & Orchestration
UX facet: Meta’s REA supports 6-hour hibernation checkpoints and multi-workday task continuity — a benchmark for B2B Agent orchestration. Vehicle counterparts: Mission Planner, Route Manager, AUTOSAR OS Task.
Compliance facet: State preservation across OTA reboots and fault recovery. How do we ensure a multi-hour trip interrupted by a system restart does not break the safety argumentation chain?
Key research gap: long-task state preservation under reboot.
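As a rough illustration of state preservation across a restart, the sketch below writes a checkpoint atomically together with a SHA-256 digest and refuses to restore a checkpoint whose digest does not match. The trip-state fields and file name are invented for the example, and a digest only detects corruption; it says nothing about whether the restored state is still valid for the current driving situation.

```python
import hashlib
import json
import os
import tempfile
from typing import Any, Dict


def _digest(payload: Dict[str, Any]) -> str:
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def save_checkpoint(path: str, payload: Dict[str, Any]) -> None:
    """Write payload plus digest atomically (temp file, then rename)."""
    record = {"payload": payload, "sha256": _digest(payload)}
    tmp = path + ".tmp"
    with open(tmp, "w", encoding="utf-8") as fh:
        json.dump(record, fh)
        fh.flush()
        os.fsync(fh.fileno())
    os.replace(tmp, path)


def load_checkpoint(path: str) -> Dict[str, Any]:
    """Return the payload, or raise ValueError if the digest does not match."""
    with open(path, "r", encoding="utf-8") as fh:
        record = json.load(fh)
    if _digest(record["payload"]) != record["sha256"]:
        raise ValueError("checkpoint integrity check failed")
    return record["payload"]


# Hypothetical trip state; field names are illustrative only.
trip_state = {"route_leg": 3, "odd_status": "inside", "pending_tor": False}
ckpt_path = os.path.join(tempfile.mkdtemp(), "trip.ckpt.json")
save_checkpoint(ckpt_path, trip_state)
restored = load_checkpoint(ckpt_path)
```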
Primitive 9 — Verification & CI Integration
UX facet: The AI coding community has converged on a practice heuristic — “if an AI-written test passes buggy code, the test itself is judged invalid and must be redesigned.” This loop is already instrumented into CI tooling. Vehicle counterparts: the simulation + scenario library + CI regression chain.
Compliance facet: Migrate this test-invalidity detection heuristic into SOTIF simulation validation. When a production bad-case fails to reproduce in simulation, automatically distinguish “insufficient fidelity” from “insufficient scenario coverage” and force an improvement.
Key research gap: automatic test-invalidity detection for AD simulation (see Section 4).
Primitive 10 — Observability & Tracing
UX facet: LangSmith, OpenTelemetry for LLMs, full-path Agent decision tracing — these are the AI engineering standard. Vehicle counterparts: data uplink, EDR, V2X logs, UDS diagnostics.
Compliance facet: End-to-end decision traceability. Attention maps and saliency maps are only local explanations. They do not satisfy judicial or regulatory requirements for “reason-of-decision” traceability. This is one of the most decisive compliance gaps of the end-to-end era — it requires interdisciplinary (law + engineering) research.
Key research gap: judicial-grade decision traceability for end-to-end models.
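One small ingredient of traceability that can be sketched today is tamper evidence: a hash-chained decision log in which each record commits to the previous one, so later edits are detectable. This is a sketch under stated assumptions. The record fields (`input_id`, `model`, `decision`) are hypothetical, and a tamper-evident log records what was decided, not why, so it does not close the reason-of-decision gap described above.

```python
import hashlib
import json
from typing import Any, Dict, List

GENESIS = "0" * 64


def _hash(prev_hash: str, entry: Dict[str, Any]) -> str:
    body = json.dumps(entry, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def append(trace: List[Dict[str, Any]], entry: Dict[str, Any]) -> None:
    """Append an entry that commits to the hash of the previous record."""
    prev = trace[-1]["hash"] if trace else GENESIS
    trace.append({"entry": entry, "prev": prev, "hash": _hash(prev, entry)})


def verify(trace: List[Dict[str, Any]]) -> bool:
    """Recompute the chain; any edited or reordered record breaks it."""
    prev = GENESIS
    for rec in trace:
        if rec["prev"] != prev or rec["hash"] != _hash(prev, rec["entry"]):
            return False
        prev = rec["hash"]
    return True


trace: List[Dict[str, Any]] = []
append(trace, {"t": 0, "input_id": "frame-0001", "model": "planner-v1", "decision": "keep_lane"})
append(trace, {"t": 1, "input_id": "frame-0002", "model": "planner-v1", "decision": "slow_down"})
intact = verify(trace)
trace[0]["entry"]["decision"] = "change_lane"   # simulate after-the-fact editing
tampered_ok = verify(trace)
```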
Primitive 11 — Debugging & DX
UX facet: Interactive debugging, readable error messages, tight feedback loops — the AI-coding DX race is fierce (Cursor, Windsurf, Replit Agent). Vehicle counterparts: simulation replay, scenario reconstruction.
Compliance facet: Reproducibility bounds for end-to-end models. When the same input produces different outputs across runs, how do we reproduce the “scene of the problem” such that SOTIF’s Safety Case requirement — “every identified hazard must be reproducible” — can still be met? In the face of this requirement, end-to-end models are currently non-compliant.
Key research gap: quantitative characterization of end-to-end reproducibility boundaries.
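A first step toward a quantitative characterization could be as simple as re-running the same input under different seeds and reporting two numbers: the spread of the continuous output and the share of runs that agree on the discrete decision. The sketch below uses a toy stand-in for a nondeterministic model; `toy_policy`, the noise level, and the braking threshold are all assumptions chosen for illustration.

```python
import random
from collections import Counter
from typing import Callable, Dict, List


def toy_policy(obstacle_distance_m: float, rng: random.Random) -> float:
    """Stand-in for a nondeterministic model: target deceleration with small noise."""
    base = max(0.0, (30.0 - obstacle_distance_m) / 10.0)
    return base + rng.gauss(0.0, 0.05)


def reproducibility_report(
    policy: Callable[[float, random.Random], float],
    x: float,
    seeds: List[int],
    brake_threshold: float = 1.0,
) -> Dict[str, float]:
    """Run the policy once per seed on the same input and summarize divergence."""
    outputs = [policy(x, random.Random(s)) for s in seeds]
    decisions = Counter(o >= brake_threshold for o in outputs)
    return {
        # range of the continuous output across runs
        "spread": max(outputs) - min(outputs),
        # share of runs that agree with the majority discrete decision
        "agreement": decisions.most_common(1)[0][1] / len(outputs),
    }


far = reproducibility_report(toy_policy, x=5.0, seeds=list(range(50)))
near_threshold = reproducibility_report(toy_policy, x=20.0, seeds=list(range(50)))
```

In this toy setup the input far from the threshold yields full agreement on the braking decision, while an input sitting right at the threshold can flip between runs even though the continuous spread is similar. That distinction, small numeric spread versus unstable discrete decisions, is one candidate axis for defining reproducibility bounds.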
Primitive 12 — Human-in-the-Loop
UX facet: Agent approval flows, interrupt mechanisms, human-oversight modes — the AI-side race is about “maximal user control with minimal user interruption.” Vehicle counterparts: Take-Over Request (TOR), Remote Operators, safety-driver programs.
Compliance facet: A supervisor-capability compliance model. ISO 26262’s controllability grading (C0–C3) only answers “can the driver control the vehicle.” It does not answer “does this driver have the capacity to supervise this AI.” Tang Daosheng’s observation — that stronger AI raises, not lowers, the demand placed on its human supervisor — translates directly into an open compliance gap. Section 4 develops this leverage point in detail.
Key research gap: supervisor-capability compliance model, including a C4-level extension for L3+ supervisory duty.
4. Three Highest-Leverage Points
Among the 12 research gaps, three stand out as simultaneously high-impact, methodologically well-scoped, and tightly anchored to existing standards — they are the natural first-wave priorities.
4.1 Test Invalidity ↔ SOTIF Clause 6
A practice heuristic that the AI coding community has converged on over the past year:
If an AI-written test “passes” buggy code, the test itself is judged invalid and must be redesigned.
This heuristic is, in essence, the engineering implementation of ISO 21448 Clause 6 — “continuous identification of trigger conditions.” Clause 6’s core logic: every in-field unintended failure should be treated as a new trigger condition that reflexively drives verification-strategy and test-coverage updates.
This principle has been in the SOTIF standard for some time. But at the engineering layer, the industry has stayed at “manually review failed cases” — there has been no automated mechanism to judge “do the existing tests actually cover this failure.”
The AI coding community went from the same starting point to a fully automated toolchain in roughly a year. The AD industry can adopt it directly.
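The heuristic can be stated in a few lines: a test suite that accepts a known-bad implementation is itself invalid. The sketch below applies it to a toy time-gap function, with the buggy variant standing in for a failure reconstructed from the field. Function names, the bug, and the test cases are hypothetical, and a real AD pipeline would replace these unit tests with simulation scenarios.

```python
from typing import Callable, List

Impl = Callable[[float, float], float]   # (speed_mps, gap_m) -> time gap in s
Test = Callable[[Impl], bool]            # True if the implementation passes


def time_gap_ok(speed_mps: float, gap_m: float) -> float:
    """Reference: time gap to the lead vehicle; infinite when stationary."""
    return float("inf") if speed_mps == 0 else gap_m / speed_mps


def time_gap_buggy(speed_mps: float, gap_m: float) -> float:
    """Known-bad variant reconstructed from a field failure: wrong at low speed."""
    return gap_m / max(speed_mps, 5.0)


def suite_verdict(tests: List[Test], good: Impl, known_bad: Impl) -> str:
    if not all(t(good) for t in tests):
        return "suite_rejects_reference"
    if all(t(known_bad) for t in tests):
        return "suite_invalid"   # passes buggy code: coverage gap
    return "suite_valid"


weak_suite: List[Test] = [lambda f: f(20.0, 40.0) == 2.0]
strong_suite: List[Test] = weak_suite + [lambda f: f(2.0, 4.0) == 2.0]

weak_verdict = suite_verdict(weak_suite, time_gap_ok, time_gap_buggy)
strong_verdict = suite_verdict(strong_suite, time_gap_ok, time_gap_buggy)
```

The weak suite only exercises highway speed, where the buggy variant happens to be correct, so it is flagged invalid; adding a low-speed case makes the suite discriminate. Mapped to SOTIF terms, the low-speed case plays the role of a newly identified trigger condition.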
Output forms: top-tier conferences (ICSE, ASE, ICRA) + open-source tooling + GB / CSAE standard proposal.
4.2 Two-Stage Authorization ↔ ODD Checking
A central design question in AI Agent circles over the past year has been “approval fatigue.” The community’s answer is a two-stage classifier:
- Stage 1 (fast gate): a lightweight classifier dispatches the vast majority of clear cases in one shot.
- Stage 2 (CoT reasoning): full reasoning triggered only when Stage 1 flags ambiguity.
This drops authorization compute cost by an order of magnitude.
Now consider ODD checking in production AD systems: every frame runs the full ODD check, whether the vehicle is on a sunlit straight stretch of highway or approaching a backlit tunnel entrance. The compute cost is identical.
Porting the AI-community two-stage pattern directly:
- Stage 1: a lightweight model makes a coarse call — “inside ODD,” “approaching boundary,” “outside ODD.”
- Stage 2: only when “approaching boundary” is returned does the full ODD detection logic fire.
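Expected payoff: a large (potentially order-of-magnitude) reduction in ODD-check compute cost. Safety is preserved only if Stage 1 is conservative, meaning that any frame it cannot confidently place inside the ODD is escalated to Stage 2 or treated as outside.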
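The two stages can be sketched as follows. The fast gate uses a guard band around the ODD limits, so it answers “inside” only when the frame is clearly inside, and everything else is escalated to the full check or rejected. The `Frame` fields, the limits, and the margin are illustrative assumptions, and `full_check` is a stand-in for whatever the production ODD logic actually is.

```python
from dataclasses import dataclass
from enum import Enum


class OddStatus(Enum):
    INSIDE = "inside"
    NEAR_BOUNDARY = "approaching_boundary"
    OUTSIDE = "outside"


@dataclass(frozen=True)
class Frame:
    visibility_m: float
    rain_level: int      # 0 = none; higher = heavier (illustrative scale)
    in_geofence: bool


# Illustrative ODD limits and guard band; real values come from the ODD declaration.
MIN_VISIBILITY_M = 100.0
MAX_RAIN_LEVEL = 2
VISIBILITY_MARGIN_M = 100.0


def fast_gate(f: Frame) -> OddStatus:
    """Stage 1: cheap and conservative. Anything not clearly inside escalates or exits."""
    if not f.in_geofence:
        return OddStatus.OUTSIDE
    clearly_inside = (
        f.visibility_m >= MIN_VISIBILITY_M + VISIBILITY_MARGIN_M
        and f.rain_level < MAX_RAIN_LEVEL
    )
    return OddStatus.INSIDE if clearly_inside else OddStatus.NEAR_BOUNDARY


def full_check(f: Frame) -> bool:
    """Stage 2: stand-in for the full ODD logic (here just the exact limits)."""
    return (
        f.in_geofence
        and f.visibility_m >= MIN_VISIBILITY_M
        and f.rain_level <= MAX_RAIN_LEVEL
    )


def within_odd(f: Frame, stats: dict) -> bool:
    status = fast_gate(f)
    if status is OddStatus.NEAR_BOUNDARY:
        stats["full_checks"] = stats.get("full_checks", 0) + 1
        return full_check(f)
    return status is OddStatus.INSIDE


stats: dict = {}
frames = [
    Frame(800.0, 0, True),    # clear highway: decided by the fast gate
    Frame(150.0, 1, True),    # tunnel approach: escalated, still inside
    Frame(60.0, 1, True),     # dense fog: escalated, outside
    Frame(800.0, 0, False),   # left the geofence: fast gate says outside
]
results = [within_odd(f, stats) for f in frames]
```

The property that would need to be proven for a real system is visible even in this toy: whenever `fast_gate` returns `INSIDE`, `full_check` must also return `True`. Here that holds by construction because the gate's thresholds are strictly tighter than the limits; with a learned Stage 1 classifier it becomes an empirical claim that the safety argument has to support.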
Output forms: engineering paper + production-grade solution + cross-OEM cross-validation.
4.3 Constraint As Guidance ↔ SOTIF Clause 8’s “Modification Of The Intended Functionality”
In early 2026, Tang Daosheng published a widely-circulated essay AI Formally Enters The Harness Era, giving Harness a crisp three-way metaphor:
The large model is the engine. The Harness is the wiring harness. The user is the driver.
The metaphor itself is not new — the Ford-era automotive industry was literally “engine first, harness next, driver third.” But Tang adds a deeper definition:
Constraint is not suppression of intelligence; it is the guidance of intelligence.
This sentence has a strict counterpart in automotive safety engineering: ISO 21448 SOTIF’s Clause 8 on “modification of the intended functionality.”
Clause 8’s “modifications” include:
- Disabling certain functions (e.g., disabling ACC under a high-risk ODD)
- Restricting activation conditions (e.g., activation only under specific weather and road types)
- Degrading performance (e.g., lower max speed, increased following distance)
- Using the driver as a risk-reduction measure (e.g., issuing a TOR at boundary conditions)
From an AI-engineering viewpoint, these actions can look like “weakening” the model. But in SOTIF’s original intent, they are never weakening — they are mechanisms to guide the model’s behavior into safe space.
In other words, every Clause 8 mechanism is, at its heart, a Harness mechanism:
- ODD boundary checking → Primitive 6 (Permissions)
- Function arbiter → Primitive 2 (Planning) + Primitive 6 (Permissions)
- Degradation strategies → Primitive 12 (HITL)
- TOR and remote operators → Primitive 12 (HITL)
Re-expressing Clause 8’s modification semantics through the 12 Primitives yields two very concrete research outputs.
First, a “Clause 8 modification type × Harness primitive” mapping. Drawing this out exposes several Clause 8 gaps:
- “Model confidence as a modification trigger” — standard AI-Agent practice, not yet written into Clause 8.
- “In-task functional degradation” — Meta REA’s checkpoint paradigm has no Clause 8 counterpart.
- “Degradation semantics under multi-Agent collaboration failure” — relevant to V2X + multi-vehicle coordination, Clause 8 is silent.
Second, a supervisor-capability model. Tang writes in the same essay:
The stronger AI becomes, the more — not less — it demands of the humans around it. A person capable of safely supervising an autonomous-driving system needs to understand driving more deeply than an ordinary driver does.
Translated to AD compliance language:
- An L3 driver’s supervision duty is harder than an L2 driver’s, not easier.
- An L4 remote supervisor must understand system boundaries, know when an ODD fails, and know when to take over.
- Controllability grades C0–C3 are, fundamentally, about “how much responsibility can the human, as part of the safety system, actually carry?”
Current ISO 26262 controllability grading addresses only “can the driver control the vehicle” — not “does this driver have the capacity to supervise this AI.” The supervisor-capability model is a completely empty research direction.
Output forms: academic papers on Clause 8 modification semantics + production-grade supervisor-capability evaluation + a standardization proposal for L3+ controllability (e.g., a C4 extension).
5. Two Directions That Need Someone To Step In
Charting the 12-Primitives landscape surfaces two directions that “no commercial main battlefield will address, but someone must do.” They are the natural domain of standardization bodies, corporate forward-looking R&D teams, third-party institutions, and research labs / universities.
5.1 The Automotive MCP (Primitive 5)
MCP is the open protocol Anthropic proposed in late 2024 to give AI Agents a unified way to call tools and data sources. In just over a year, it has become the de facto standard of the AI-tooling ecosystem.
The automotive industry has exactly the same need, but no single OEM will champion it — “deepening my own interfaces” makes commercial sense, “opening cross-OEM interoperability” does not.
That is precisely the territory of non-commercial stewards. MCP took off because Anthropic open-sourced the protocol and Google followed with A2A — academic-style openness plus standardization-body momentum, not any single company’s market strategy. An Automotive MCP needs the same kind of engine.
The Chinese window is concrete:
- AUTOSAR Adaptive has China-localized extension interfaces.
- C-V2X has a China-led base.
- The GB / CSAE standardization system has genuine incentive to drive cross-OEM specifications.
A first draft in 2026–2027 and an ISO-level submission in 2028 is a realistic trajectory.
5.2 A Systematic Methodology For Safety-Compliance Harness
The larger direction: turn “12 Primitives × SOTIF” compliance mapping into an independent methodology.
It does not compete with any OEM’s commercial strategy — because UX is the OEM main battleground and compliance is the industry commons. Its audiences are:
- Top-level topic selection in major projects and forward-looking programs
- Drafting new clauses in GB / ISO
- New research directions in university curricula
- The PhD-topic pool for junior researchers
- Incubation directions for corporate forward-looking teams
- Evaluation frameworks for third-party certification bodies
This is not a “whoever moves first owns it” situation. Anyone who steps into the commons and contributes a brick is contributing, regardless of order. Subsequent contributors may extend or overturn prior work; what matters is that the direction is right, and that participation is itself valuable.
Within 12–18 months, comparable work will likely emerge from overseas labs — CMU CyLab, MIT CSAIL, TU Munich’s automotive-engineering chair, among others. This is not and should not be a “Chinese-first-means-Chinese-exclusive” exercise. It should be, from the outset, a multi-stakeholder public process.
6. The 12 Research Gaps
| # | Research Gap | Likely Output Form | Priority |
|---|---|---|---|
| 1 | Iteration-level safety analysis for end-to-end loops | arXiv paper + tool prototype | Medium |
| 2 | Rollback safety semantics for multi-step planning | journal paper + simulation case | Medium |
| 3 | Safety-constrained adaptive perception compression | top-tier paper + production use | High |
| 4 | AI-friendly actuator-error information standard | standard proposal | Medium |
| 5 | Automotive MCP interoperability protocol | national-standard proposal | High |
| 6 | Formalization of two-stage ODD checking | engineering paper + production deployment | High |
| 7 | Safety constraints on cross-trip long-term memory | interdisciplinary paper (HCI + Safety) | Medium |
| 8 | Long-task checkpoint mechanism | engineering paper + industry whitepaper | Medium |
| 9 | Automatic test-invalidity detection for AD simulation | top-tier paper + open-source tool | High |
| 10 | Judicial-grade decision traceability for end-to-end models | law + engineering interdisciplinary paper | High |
| 11 | Reproducibility bounds for end-to-end models | foundational research paper | Medium |
| 12 | Supervisor-capability compliance model (incl. C4 extension) | engineering paper + standardization proposal | High |
Of the six “High” items, four (5, 6, 9, and 12) correspond directly to the leverage points and directions in Sections 4 and 5; items 3 and 10 are analyzed under their primitives in Section 3.2. Each can be pursued independently as a focused research program.
7. Related Open Explorations
Speaking about “safety-compliance Harness” without doing anything convinces nobody. Over the past month, working in personal time with heavy assistance from AI tooling, I have done a few small open projects that turn out — in hindsight — to sit squarely on the compliance side of the 12 Primitives. These projects were not originally conceived with the “Harness” coordinate system; but once that coordinate system is laid out, they map to it cleanly.
A few caveats up front:
- These projects were done in personal (non-working) time.
- They relied heavily on AI tooling (primarily the Claude ecosystem).
- They are incomplete and small-sample — this is personal exploration.
- But they are tightly connected to standards and regulations, and I genuinely hope that these attempts can, in some form, be seen by standardization bodies, forward-looking corporate teams, or third-party institutions — and perhaps make their way into standards and regulations themselves.
7.1 ROAM — Open Database Of L4 Robotaxi Remote-Operation Incidents
ROAM (RoboTaxi Operations Anomaly Management) is an open-source L4 Robotaxi remote-operation incident database + scenario taxonomy + reference architecture, published in early April 2026.
Current coverage:
- 16 structured public-incident records (Waymo / Cruise / Baidu Apollo / Pony.ai, 2023–2026)
- Scenario taxonomy v1.0 (6 categories A–F, 27 sub-scenarios) + severity S0–S4 + urgency U0–U3
- A three-tier reference architecture (AI autonomous 70% → AI + human confirmation 25% → remote teleop 5%)
- 8 KPI definitions
- Chinese and English bilingual literature surveys (each 4000+ words, 42 references with clickable URLs)
- 13 international-standard mappings + Chinese-standard alignment analysis
ROAM’s placement in the 12-Primitives framework:
- Primitive 10 (Observability / decision traceability): structuring and publishing incident-scene records at the remote-operations layer. Co-locating every Robotaxi’s incidents in one scenario-taxonomy coordinate system is itself an industry-level traceability experiment.
- Primitive 12 (HITL): the conditions under which a remote operator intervenes, and the takeover duration and action type after intervention, are the baseline data for HITL compliance.
7.2 OpenODC — Open Platform For AD Operational Design Conditions
OpenODC (Open Platform for Automated Driving Operational Design Conditions) is based on GB/T 45312-2025 Intelligent and Connected Vehicles — Operational Design Conditions for Automated Driving Systems, a Chinese national standard in whose drafting I participated.
Core deliverables:
- Complete GB/T 45312-2025 element hierarchy JSON Schema (144 elements / 7 categories)
- Machine-readable quantized grading (12-level wind, 4-level rain, 4-level visibility, snow, illumination)
- /gallery sample library + /view document detail (4 views: developer / test / regulator / consumer)
- /editor in-browser ODC editor
- /compare multi-document diff (2–4 side-by-side with consistency annotation)
- GitHub PR template + CI auto-validation
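To illustrate why a machine-readable schema makes declarations comparable, the sketch below flattens two ODC declarations into dotted paths and lists every element on which they differ. The element keys and level values are hypothetical placeholders. They do not reproduce the GB/T 45312-2025 element hierarchy or the actual OpenODC schema, and this is not the platform's diff implementation.

```python
from typing import Any, Dict, List, Tuple

OdcDoc = Dict[str, Any]


def flatten(doc: OdcDoc, prefix: str = "") -> Dict[str, Any]:
    """Flatten nested dicts into dotted paths: {'weather': {'rain': 2}} -> {'weather.rain': 2}."""
    flat: Dict[str, Any] = {}
    for key, value in doc.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


def diff_odc(a: OdcDoc, b: OdcDoc) -> List[Tuple[str, Any, Any]]:
    """Return (path, value_in_a, value_in_b) for each differing element; None = not declared."""
    fa, fb = flatten(a), flatten(b)
    return [
        (path, fa.get(path), fb.get(path))
        for path in sorted(set(fa) | set(fb))
        if fa.get(path) != fb.get(path)
    ]


# Two invented declarations; keys and values are placeholders only.
oem_a = {
    "weather": {"max_wind_level": 6, "max_rain_level": 2},
    "road": {"type": "highway"},
}
oem_b = {
    "weather": {"max_wind_level": 6, "max_rain_level": 1},
    "road": {"type": "highway"},
    "illumination": {"night": False},
}
differences = diff_odc(oem_a, oem_b)
```

The result separates two kinds of inconsistency that matter for cross-OEM hazard reproduction: an element declared by both parties with different values, and an element declared by only one of them.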
OpenODC in the 12-Primitives framework:
- Primitive 6 (Permissions / ODD): turning ODD dimensions from implicit convention into explicit schema, so that industry ODD declarations become comparable and traceable. “Permissions” as a primitive requires, first and foremost, that the permission boundary be written down legibly.
- Primitive 11 (Debugging / DX): making different OEMs’ ODC declarations comparable is the precondition for reproducibility. If ODD boundaries are stated differently by each OEM, cross-OEM hazard reproduction is out of reach.
7.3 AD Standards Tracker — Global AD Regulation & Standards Monitor
AD Standards Tracker is a structured tracker of global AD standards, MVP launched mid-April 2026.
Current coverage:
- 7 jurisdictions (International / China / US / EU / UK / Japan / Korea)
- ~79 structured standard records
- YAML data layer + Next.js frontend + Vercel deployment
- GitHub Actions automated crawler framework (not yet wired to production data sources)
AD Standards Tracker in the 12-Primitives framework:
- Primitive 4 (Tool Design): making the “global AD standards” ecosystem a structured, callable, comparable resource — a meta-tool tool-design exercise.
- Primitive 10 (Observability): compliance itself needs observability. Whether a given OEM is compliant in a given jurisdiction requires that the global standards evolution be trackable, subscribable, and citable.
7.4 Two More Threads In Flight
awesome-harness-engineering PR #1 — submits the English cross-reference table of “12 Primitives × ISO 21448” as a patch to the upstream repository. First internationally-visible signal for compliance-dimension Harness in the open-source community.
An English arXiv position paper is in draft, with the working title:
Compliance-Oriented Harness Engineering: A Framework for AI-First Safety-Critical Systems
Submission target: Q2 2026.
7.5 An Honest Note
Let me restate the disclaimer from the opening of this section: the four efforts above were all built in personal time as small explorations. None of them rises to the level of “engineering validation” in any serious sense. They are more like a few small lamps lit on this methodology map — enough to give the “safety-compliance dimension of Harness” a visible shape in concrete engineering.
Making this direction real will require far more participants. If these small projects can encourage even a few peers to, under a banner of openness and transparency, steer their own work in a similar direction, they will have vastly exceeded their original brief.
8. An Invitation To Three Audiences
If you are an OEM algorithm or engineering lead — the 12 Primitives can serve as a self-audit checklist. You are already investing on the UX side of the commercial strategy; the compliance-side gaps are what we can talk about.
If you are involved in standards drafting — the Automotive MCP and the safety-compliance Harness specification are open windows. At GB, CSAE, or ISO levels, I am happy to see this framework used as drafting input.
If you are a graduate student, a young faculty member, or a member of a corporate forward-looking team — any one of the 12 research gaps is worth a focused dive into the underlying mechanism and engineering trajectory. Whether you move first or later, the act of stepping in is itself part of advancing this commons.
References And Linked Work
Related Writing And Projects
- Chinese deep-dive version (this article): https://blog.autozyx.com/posts/harness-compliance-dimension/
- Author’s earlier piece in the same series: Harness Engineering’s Cross-Domain Application In Intelligent Driving
- ROAM project: https://roam.autozyx.com | GitHub: https://github.com/AutoZYX/ROAM
- OpenODC platform: https://openodc.autozyx.com | GitHub: https://github.com/AutoZYX/OpenODC
- AD Standards Tracker: https://standards.autozyx.com
- Author’s other open projects: Co4Pilot · ADSafetyPilot · Good Ideas 2.0
- awesome-harness-engineering PR #1: ai-boost/awesome-harness-engineering#1
Key Industry Events Referenced
- Feb 2026: OpenAI, Harness Engineering: Leveraging Codex in an Agent-First World
- Mar 2026: Anthropic, Managed Agents architecture launch
- Apr 2026: Nextie raises two rounds in one month, backed by Qi Lu and Kai-Fu Lee
- Apr 2026: Synced, The New Frontier: Harness, Backed Heavily By Lee and Lu
- Apr 2026: Tang Daosheng, AI Formally Enters The Harness Era
Key Standards Referenced
- ISO 21448:2022 — Road vehicles — Safety of the intended functionality
- ISO 26262:2018 — Road vehicles — Functional safety
- ISO 11898 — Road vehicles — Controller area network (CAN)
- GB/T 45312-2025 — Intelligent and Connected Vehicles — Operational Design Conditions for Automated Driving Systems
- AUTOSAR Adaptive Platform Specification
About the author: Yuxin Zhang, Associate Professor at the School of Automotive Engineering, Jilin University; Director of the AD Safety Joint Lab; Head of Functional Safety at Zhuoyu Technology; Founder of DRIVEResearch; Visiting Scholar at Durham University, UK (2025–2026). Research interests: SOTIF (ISO 21448), Functional Safety (ISO 26262), scenario-driven testing and evaluation.
Contact: yuxinzhang@jlu.edu.cn
Citation: Zhang, Yuxin. Harness Engineering: User Experience vs Safety Compliance — A Direction Mainstream Roadmaps Have Collectively Skipped. 2026-04-19. https://blog.autozyx.com/posts/harness-compliance-dimension/
License: CC BY 4.0