WebArena: The 812-Task Benchmark That Measures What Web Agents Actually Can and Cannot Do
GPT-4 completes only 14.41% of WebArena's 812 realistic web tasks, while humans reach 78.24%. The dominant failure mode is false infeasibility, a conservative refusal to act on tasks that are in fact feasible, with direct implications for any agent operating Fava or other finance web UIs.
