OSWorld: Desktop AI Agents Succeed on 12% of Tasks Where Humans Succeed on 72%
OSWorld (NeurIPS 2024) benchmarks multimodal AI agents on 369 real desktop tasks across Ubuntu, Windows, and macOS. It finds a gap of about 60 percentage points between the best model (12.24%) and human performance (72.36%), with 75% of failures traced to visuomotor grounding errors rather than reasoning failures.
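
To make the headline arithmetic concrete, here is a minimal sketch that derives the gap from the two success rates quoted above. The function and variable names are illustrative assumptions, not part of OSWorld's tooling; only the two percentages come from the text.

```python
# Success rates quoted in the summary above, in percent.
BEST_MODEL_SUCCESS = 12.24
HUMAN_SUCCESS = 72.36


def percentage_point_gap(human: float, model: float) -> float:
    """Absolute difference between two rates, in percentage points."""
    return human - model


def relative_ratio(human: float, model: float) -> float:
    """How many times higher the human rate is than the model rate."""
    return human / model


gap = percentage_point_gap(HUMAN_SUCCESS, BEST_MODEL_SUCCESS)
ratio = relative_ratio(HUMAN_SUCCESS, BEST_MODEL_SUCCESS)

print(f"Gap: {gap:.2f} percentage points")
print(f"Humans succeed about {ratio:.1f}x as often as the best model")
```

The gap is given in percentage points (an absolute difference) rather than as a relative percentage; the ratio gives the relative view of the same two numbers.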
