If there's Intelligent Life out There

Optimizing LLMs to be excellent at specific tests backfires on Meta, Stability.


Hugging Face has released its second LLM leaderboard to rank the best language models it has tested. The new leaderboard seeks to be a more challenging uniform standard for testing open large language model (LLM) performance across a variety of tasks. Alibaba's Qwen models appear dominant in the leaderboard's inaugural rankings, taking three spots in the top 10.

Pumped to announce the brand new open . We burned 300 H100 to re-run new evaluations like MMLU-pro for all major open LLMs! Some learning:
- Qwen 72B is the king and Chinese open models are dominating overall
- Previous evaluations have become too easy for recent ...
June 26, 2024

Hugging Face's second leaderboard tests language models across four tasks: knowledge testing, reasoning on extremely long contexts, complex math abilities, and instruction following. Six benchmarks are used to test these qualities, with tests including solving 1,000-word murder mysteries, explaining PhD-level questions in layman's terms, and most daunting of all: high-school math equations. A full breakdown of the benchmarks used can be found on Hugging Face's blog.

The frontrunner of the new leaderboard is Qwen, Alibaba's LLM, which takes 1st, 3rd, and 10th place with its handful of variants. Also showing up are Llama3-70B, Meta's LLM, and a handful of smaller open-source projects that managed to outperform the pack. Notably absent is any sign of ChatGPT; Hugging Face's leaderboard does not test closed-source models, to ensure reproducibility of results.

Tests to qualify on the leaderboard are run exclusively on Hugging Face's own computers, which, according to CEO Clem Delangue's Twitter, are powered by 300 Nvidia H100 GPUs. Because of Hugging Face's open-source and collaborative nature, anyone is free to submit new models for testing and admission on the leaderboard, with a new voting system prioritizing popular new entries for testing. The leaderboard can be filtered to show only a highlighted array of significant models to avoid a confusing glut of small LLMs.

As a pillar of the LLM space, Hugging Face has become a trusted source for LLM learning and community collaboration. After its first leaderboard was released in 2023 as a means to compare and reproduce testing results from several established LLMs, the board quickly took off in popularity. Getting high ranks on the board became the goal of many developers, small and large, and as models have become generally stronger, 'smarter,' and optimized for the specific tests of the first leaderboard, its results have become less and less meaningful, hence the creation of a second variant.

Some LLMs, including newer variants of Meta's Llama, severely underperformed in the new leaderboard compared to their high marks in the first. This came from a trend of over-training LLMs only on the first leaderboard's benchmarks, leading to a regression in real-world performance. This regression, thanks to hyperspecific and self-referential data, follows a trend of AI performance growing worse over time, proving once again, as Google's AI answers have shown, that LLM performance is only as good as its training data and that true artificial "intelligence" is still many, many years away.


Dallin Grimm is a contributing writer for Tom's Hardware. He has been building and breaking computers since 2017, serving as the resident youngster at Tom's. From APUs to RGB, Dallin covers all the latest tech news.


bit_user, quoting the article: "LLM performance is only as good as its training data and that true artificial 'intelligence' is still many, many years away."

First, this statement discounts the role of network architecture.

The definition of "intelligence" cannot be whether something processes information exactly like humans do, otherwise the search for extraterrestrial intelligence would be entirely futile. If there's intelligent life out there, it probably doesn't think quite like we do. Machines that act and behave intelligently also needn't necessarily do so, either.

jp7189: I don't love the click-bait China vs. the world title. The fact is qwen is open source, open weights and can be run anywhere. It can be (and has already been) fine tuned to add/remove bias. I applaud hugging face's work to create standardized tests for LLMs, and for putting the focus on open source, open weights first.

jp7189, replying to bit_user, who said:

"First, this statement discounts the role of network architecture.

"Second, intelligence isn't a binary thing - it's more like a spectrum. There are different classes of cognitive tasks and abilities you might be familiar with, if you study child development or animal intelligence.

"The definition of 'intelligence' cannot be whether something processes information exactly like humans do, otherwise the search for extraterrestrial intelligence would be entirely futile. If there's intelligent life out there, it probably doesn't think quite like we do. Machines that act and behave intelligently also needn't necessarily do so, either."

We're creating tools to help humans, therefore I would argue LLMs are more useful if we grade them by human intelligence standards.


Tom's Hardware is part of Future US Inc, an international media group and leading digital publisher.


© Future US, Inc. Full 7th Floor, 130 West 42nd Street, New York, NY 10036.
