The non-profit Center for AI Safety (CAIS) and Scale AI, a company that provides data labeling and AI development services, have published a challenging new benchmark for frontier AI systems.
The benchmark, called “Humanity's Last Exam,” comprises thousands of crowdsourced questions on topics such as mathematics, the humanities, and the natural sciences. To make the evaluation harder, the questions come in several formats, including some that contain diagrams and images.
In a preliminary study, not a single publicly available flagship AI system managed to score higher than 10% on Humanity's Last Exam.
CAIS and Scale AI plan to open the benchmark to the research community so that researchers can examine variations of it and evaluate new AI models.