METR, which runs the benchmark measuring how well models can complete long-duration tasks, found that Claude Mythos Preview ...
Got a confidential news tip? We want to hear from you. Sign up for free newsletters and get more CNBC delivered to your inbox Get this delivered to your inbox, and ...
Some results have been hidden because they may be inaccessible to you
Show inaccessible results