Opened Feb 10, 2025 by Fredric Shackelford@fredricshackel

If there's Intelligent Life out There


Optimizing LLMs to be proficient at specific tests backfires on Meta, Stability.


Hugging Face has released its second LLM leaderboard to rank the best language models it has tested. The new leaderboard seeks to be a more challenging, uniform standard for testing open large language model (LLM) performance across a variety of tasks. Alibaba's Qwen models appear dominant in the leaderboard's inaugural rankings, taking three spots in the top 10.

"Pumped to announce the brand new open LLM leaderboard. We burned 300 H100 to re-run new evaluations like MMLU-pro for all major open LLMs! Some learning: - Qwen 72B is the king and Chinese open models are dominating overall - Previous evaluations have become too easy for recent ..." (June 26, 2024)

Hugging Face's second leaderboard tests language models across four tasks: knowledge testing, reasoning on extremely long contexts, complex math abilities, and instruction following. Six benchmarks are used to test these qualities, with tests including solving 1,000-word murder mysteries, explaining PhD-level questions in layman's terms, and, most daunting of all, high-school math equations. A full breakdown of the benchmarks used can be found on Hugging Face's blog.

The frontrunner of the new leaderboard is Qwen, Alibaba's LLM, which takes 1st, 3rd, and 10th place with its handful of variants. Also showing up are Llama3-70B, Meta's LLM, and a handful of smaller open-source projects that managed to outperform the pack. Notably absent is any sign of ChatGPT; Hugging Face's leaderboard does not test closed-source models, to ensure reproducibility of results.

Tests to qualify on the leaderboard are run exclusively on Hugging Face's own computers, which, according to CEO Clem Delangue's Twitter, are powered by 300 Nvidia H100 GPUs. Because of Hugging Face's open-source and collaborative nature, anyone is free to submit new models for testing and admission to the leaderboard, with a new voting system prioritizing popular new entries for testing. The leaderboard can be filtered to show only a highlighted array of significant models, avoiding a confusing glut of small LLMs.

As a pillar of the LLM space, Hugging Face has become a trusted source for LLM learning and community collaboration. After its first leaderboard was released last year as a means to compare and reproduce testing results from several established LLMs, the board quickly took off in popularity. Getting high ranks on the board became the goal of many developers, small and large, and as models have become generally stronger, 'smarter,' and optimized for the specific tests of the first leaderboard, its results have become less and less meaningful, hence the creation of a second version.

Some LLMs, including newer variants of Meta's Llama, severely underperformed in the new leaderboard compared to their high scores in the first. This stemmed from a trend of over-training LLMs only on the first leaderboard's benchmarks, which led to regression in real-world performance. This regression, thanks to hyperspecific and self-referential data, follows a trend of AI performance growing worse over time, proving once again, as Google's AI answers have shown, that LLM performance is only as good as its training data and that true artificial "intelligence" is still many, many years away.


Dallin Grimm is a contributing writer for Tom's Hardware. He has been building and breaking computers since 2017, serving as the resident youngster at Tom's. From APUs to RGB, Dallin covers all the latest tech news.


bit_user: "LLM performance is only as good as its training data and that true artificial 'intelligence' is still many, many years away." First, this statement discounts the role of network architecture.

The definition of "intelligence" cannot be whether something processes information exactly like humans do, otherwise the search for extraterrestrial intelligence would be entirely futile. If there's intelligent life out there, it probably doesn't think quite like we do. Machines that act and behave intelligently also needn't necessarily do so, either.

jp7189: I don't love the click-bait China vs. the world title. The fact is Qwen is open source, open weights and can be run anywhere. It can be (and already has been) fine-tuned to add/remove bias. I applaud Hugging Face's work to create standardized tests for LLMs, and for putting the focus on open source, open weights first.

jp7189: bit_user said: First, this statement discounts the role of network architecture.

Second, intelligence isn't a binary thing - it's more like a spectrum. There are various classes of cognitive tasks and capabilities you might be familiar with, if you study child development or animal intelligence.

The definition of "intelligence" cannot be whether something processes information exactly like humans do, otherwise the search for extraterrestrial intelligence would be entirely futile. If there's intelligent life out there, it probably doesn't think quite like we do. Machines that act and behave intelligently also needn't necessarily do so, either.

We're creating tools to help humans, therefore I would argue LLMs are more useful if we grade them by human intelligence standards.


Tom's Hardware is part of Future US Inc, an international media group and leading digital publisher.

Reference: fredricshackel/wisclic#1