We challenged xAI's Grok to a public AI benchmark battle

by Vladislava Karim

Last week something unexpected happened.

We publicly challenged Grok, xAI's chatbot, to a benchmark competition on Twitter. And Grok accepted.

The rules were simple:

• Same public datasets

• Our engine: zero-shot (no training data)

• Grok: supervised ML with full cross-validation (both protocols sketched below)
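
For the curious, here's roughly what that contrast looks like in code. This is an illustrative sketch only: the random stand-in data, the RandomForest baseline, and the fixed-threshold rule are assumptions for the example, not our engine or Grok's actual pipeline.

```python
# Illustrative contrast between the two protocols (placeholder data and
# models, not the actual AQEA engine or Grok's exact setup).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# X: feature windows, y: fault labels -- stand-ins for a dataset such
# as CWRU bearing vibration features.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 16))
y = rng.integers(0, 2, size=1000)

# Protocol 1 (Grok): supervised learning with full cross-validation.
# The model trains on labeled fault examples in every fold.
clf = RandomForestClassifier(n_estimators=200, random_state=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(clf, X, y, cv=cv, scoring="accuracy")
print(f"supervised 5-fold CV accuracy: {scores.mean():.3f}")

# Protocol 2 (ours): zero-shot. No training on the dataset at all --
# predictions come from a fixed rule applied directly to each sample.
# Toy stand-in: flag samples whose energy exceeds a preset bound.
threshold = 20.0  # fixed a priori, never tuned on the data
y_pred = (np.sum(X**2, axis=1) > threshold).astype(int)
print(f"zero-shot flagged {y_pred.sum()} of {len(y_pred)} samples")
```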

The datasets:

• CWRU — industrial bearing fault detection

• UCI HAR — human activity recognition

• ALFA UAV — drone motor fault detection (robotics)

What happened:

Grok ran supervised baselines. We ran zero-shot.

Results Grok confirmed publicly:

• CWRU: 92.2% accuracy, 98.3% recall

• UCI HAR: 95.4% accuracy

• ALFA UAV: 100% sensitivity, all 14 motor faults detected in 675 ms (metric definitions sketched below)
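
To unpack those numbers: accuracy counts all correct predictions, while recall (also called sensitivity) counts only the share of real faults caught. 100% sensitivity means zero missed faults, 14 of 14. A tiny worked example with hypothetical counts (not the benchmarks' raw confusion matrices):

```python
# Metric definitions behind the numbers above. The counts are made up
# to show the formulas; they are not the benchmark's actual results.
tp = 14   # true faults detected (e.g. all 14 ALFA motor faults)
fn = 0    # true faults missed
fp = 3    # healthy samples flagged as faulty
tn = 83   # healthy samples correctly passed

accuracy = (tp + tn) / (tp + tn + fp + fn)
recall = tp / (tp + fn)  # a.k.a. sensitivity: share of real faults caught

print(f"accuracy: {accuracy:.1%}, sensitivity: {recall:.1%}")
# Sensitivity is 100% whenever fn == 0: no fault goes undetected,
# even if a few false alarms (fp) pull accuracy below 100%.
```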

Grok's exact words: "pure AQEA math shines for Optimus multi-modal"

Then it got interesting — Grok suggested our approach could be relevant for:

• Tesla Optimus (humanoid robot fault detection)

• Starlink (satellite anomaly monitoring)

• Tesla batteries (predictive maintenance)

We asked for Tesla sensor data to run a POC. Still waiting. 😄

The point isn't that zero-shot beats supervised. It doesn't on raw accuracy.

The point is: you can deploy on day one without collecting failure data.

For robotics, medical devices, new equipment — that changes everything.
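
Here's what "deploy on day one" can look like, sketched under assumptions. This is a generic statistical detector, not the AQEA engine: the only calibration input is a short window of normal operation, and no failure examples are ever collected or labeled.

```python
# Generic day-one fault monitor (illustrative, not the AQEA engine):
# calibrate on a short window of normal operation, then flag samples
# that drift too far from that baseline. No failure data required.
import numpy as np

def calibrate(normal_window: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Per-channel mean and std from healthy-operation samples only."""
    return normal_window.mean(axis=0), normal_window.std(axis=0) + 1e-9

def is_faulty(sample: np.ndarray, mu: np.ndarray, sigma: np.ndarray,
              z_max: float = 4.0) -> bool:
    """Flag a sample if any channel deviates more than z_max sigmas."""
    return bool(np.any(np.abs((sample - mu) / sigma) > z_max))

# Day one: a few minutes of normal sensor readings is all we collect.
rng = np.random.default_rng(1)
baseline = rng.normal(0.0, 1.0, size=(600, 8))   # 8 sensor channels
mu, sigma = calibrate(baseline)

healthy = rng.normal(0.0, 1.0, size=8)
faulty = healthy.copy()
faulty[3] += 8.0                                  # injected anomaly
print(is_faulty(healthy, mu, sigma), is_faulty(faulty, mu, sigma))
```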

What would you test if you had zero-shot fault detection?

Full Twitter thread: [link]

Product: https://nextx.ch/cronos

Demo: https://engine.aqea.ai/ui
