DeepSeek: the Chinese AI Model That's a Tech Breakthrough and a Security Risk

DeepSeek: at this stage, the only takeaway is that open-source models surpass proprietary ones. Everything else is problematic and I don't buy the public numbers.

DeepSink was built on top of open-source Meta projects (PyTorch, Llama) and ClosedAI is now in danger because its valuation is outrageous.

To my knowledge, no public documentation links DeepSeek directly to a specific "Test Time Scaling" technique, but that's highly probable, so allow me to simplify.

Test Time Scaling is used in machine learning to scale the model's performance at test time rather than during training.

That means fewer GPU hours and less powerful chips.

In other words, lower computational and hardware costs.
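To make the idea concrete, here is a minimal sketch of one common flavor of test-time scaling: sample several answers at inference time and keep the most frequent one (majority voting). Everything here is an assumption for illustration. `noisy_solver` is a hypothetical stand-in for a sampled model answer, and nothing in the public record says DeepSeek uses this exact scheme.

```python
import random
from collections import Counter

def noisy_solver(question, rng, p_correct=0.6):
    """Hypothetical stand-in for ONE sampled model answer.

    The 'question' is a list of integers to add up. The solver is right
    with probability p_correct and otherwise off by a small amount.
    """
    correct = sum(question)
    if rng.random() < p_correct:
        return correct
    return correct + rng.choice([-2, -1, 1, 2])

def majority_vote(answers):
    """Return the most frequent answer among the samples."""
    return Counter(answers).most_common(1)[0][0]

def answer_with_test_time_scaling(question, n_samples, seed=0):
    """Spend more compute at test time: draw n_samples answers, then vote."""
    rng = random.Random(seed)
    samples = [noisy_solver(question, rng) for _ in range(n_samples)]
    return majority_vote(samples)

if __name__ == "__main__":
    print(answer_with_test_time_scaling([2, 3], n_samples=15))
```

The trained model is left untouched; only the number of samples drawn at inference changes, which is the sense in which performance is scaled "at test time".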

That's why Nvidia lost nearly $600 billion in market cap, the biggest one-day loss in U.S. history!

Many people and institutions who shorted American AI stocks became extremely rich in a few hours because investors now project we will need less powerful AI chips ...

Nvidia short-sellers just made a single-day profit of $6.56 billion according to research from S3 Partners. That is nothing compared to the market cap, but I'm looking at the single-day amount. More than $6 billion in less than 12 hours is a lot in my book. And that's just for Nvidia. Short sellers of chipmaker Broadcom made more than $2 billion in profits in a few hours (the US stock market operates from 9:30 AM to 4:00 PM EST).

The Nvidia Short Interest Over Time data shows we had the second highest level in January 2025 at $39B, but this is outdated because the last record date was Jan 15, 2025. We have to wait for the latest data!

A tweet I saw 13 hours after publishing my article! Perfect summary.

Distilled language models

Small language models are trained on a smaller scale. What makes them different isn't just the capabilities, it is how they have been built. A distilled language model is a smaller, more efficient model created by transferring the knowledge from a larger, more complex model like the future ChatGPT 5.

Imagine we have a teacher model (GPT5), which is a large language model: a deep neural network trained on a lot of data. It is highly resource-intensive, which is a problem when computational power is limited or when you need speed.

The knowledge from this teacher model is then "distilled" into a student model. The student model is simpler and has fewer parameters/layers, which makes it lighter: less memory usage and lower computational demands.

During distillation, the student model is trained not only on the raw data but also on the outputs or the "soft targets" (probabilities for each class rather than hard labels) produced by the teacher model.

With distillation, the student model learns from both the original data and the detailed predictions (the "soft targets") made by the teacher model.

In other words, the student model doesn't just learn from "soft targets" but also from the same training data used for the teacher, with the guidance of the teacher's outputs. That's how knowledge transfer is optimized: dual learning from data and from the teacher's predictions!

Ultimately, the student mimics the teacher's decision-making process ... all while using much less computational power!
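The "dual learning" described above can be written as a single loss with two terms. This is a minimal sketch in plain Python, not anyone's production recipe: the function names, the `temperature` of 2.0 and the mixing weight `alpha` are assumptions chosen for illustration, and the usual temperature-squared rescaling of the soft term is left out to keep it short.

```python
import math

def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities; a higher temperature gives softer targets."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(target_probs, predicted_probs):
    """Cross-entropy between a target distribution and a predicted one."""
    return -sum(t * math.log(p)
                for t, p in zip(target_probs, predicted_probs) if t > 0)

def distillation_loss(student_logits, teacher_logits, hard_label,
                      temperature=2.0, alpha=0.5):
    """Blend 'learn from the teacher' with 'learn from the data'."""
    # Soft term: match the teacher's softened probabilities (the soft targets).
    soft_targets = softmax(teacher_logits, temperature)
    student_soft = softmax(student_logits, temperature)
    soft_loss = cross_entropy(soft_targets, student_soft)
    # Hard term: ordinary cross-entropy against the true label.
    one_hot = [1.0 if i == hard_label else 0.0
               for i in range(len(student_logits))]
    hard_loss = cross_entropy(one_hot, softmax(student_logits))
    return alpha * soft_loss + (1 - alpha) * hard_loss

if __name__ == "__main__":
    teacher = [4.0, 1.0, 0.0]
    print(distillation_loss([4.0, 1.0, 0.0], teacher, hard_label=0))
    print(distillation_loss([0.0, 1.0, 4.0], teacher, hard_label=0))
```

A student whose scores agree with both the teacher and the label gets a lower loss than one that disagrees, which is what pushes the student toward the teacher's behavior during training.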

But here's the twist as I understand it: DeepSeek didn't just extract content from a single large language model like ChatGPT 4. It relied on many large language models, including open-source ones like Meta's Llama.

So now we are distilling not one LLM but multiple LLMs. That was one of the "genius" ideas: mixing different architectures and datasets to create a seriously adaptable and robust small language model!
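One simple way to picture "distilling multiple LLMs" is to average the soft targets of several teachers before training the student on them. The sketch below is purely illustrative and is not DeepSeek's documented method: `ensemble_soft_targets`, the equal default weights and the temperature are all assumptions.

```python
import math

def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities; a higher temperature gives softer targets."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def ensemble_soft_targets(teacher_logits_list, weights=None, temperature=2.0):
    """Weighted average of each teacher's soft targets (hypothetical helper)."""
    n_teachers = len(teacher_logits_list)
    if weights is None:
        weights = [1.0 / n_teachers] * n_teachers  # assume equal trust in every teacher
    per_teacher = [softmax(logits, temperature) for logits in teacher_logits_list]
    n_classes = len(per_teacher[0])
    return [sum(w * probs[c] for w, probs in zip(weights, per_teacher))
            for c in range(n_classes)]

if __name__ == "__main__":
    teachers = [[2.0, 0.0], [0.0, 2.0]]  # two teachers that disagree
    print(ensemble_soft_targets(teachers))
    print(ensemble_soft_targets(teachers, weights=[1.0, 0.0]))
```

The blended distribution can then stand in for a single teacher's soft targets in an ordinary distillation loss.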

DeepSeek: Less supervision

Another essential innovation: less human supervision/guidance.

The question is: how far can models go with less human-labeled data?

R1-Zero learned "reasoning" capabilities through trial and error. It evolves, and it has unique "reasoning behaviors" which can lead to noise, endless repetition, and language mixing.

R1-Zero was experimental: there was no initial guidance from labeled data.

DeepSeek-R1 is different: it used a structured training pipeline that includes both supervised fine-tuning and reinforcement learning (RL). It started with initial fine-tuning, followed by RL to refine and enhance its reasoning capabilities.

The end result? Less noise and no language mixing, unlike R1-Zero.

R1 uses human-like reasoning patterns first and it then advances through RL. The innovation here is less human-labeled data + RL to both guide and refine the model's performance.
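How can RL run with little human-labeled data? One option is a rule-based reward that a program can compute, with no human grader in the loop. The sketch below is a toy version under stated assumptions: the `<think>...</think>` layout, the function names and the equal weighting of the two rewards are illustrative, not taken from DeepSeek's code.

```python
import re

# Assumed output layout: reasoning inside <think>...</think>, then the final answer.
THINK_PATTERN = re.compile(r"^<think>(.+?)</think>\s*(.+)$", re.DOTALL)

def format_reward(completion):
    """1.0 if the completion follows the assumed reasoning-then-answer layout."""
    return 1.0 if THINK_PATTERN.match(completion.strip()) else 0.0

def accuracy_reward(completion, reference_answer):
    """1.0 if the final answer matches a checkable reference (e.g. a math result)."""
    match = THINK_PATTERN.match(completion.strip())
    if not match:
        return 0.0
    final_answer = match.group(2).strip()
    return 1.0 if final_answer == reference_answer.strip() else 0.0

def total_reward(completion, reference_answer):
    """Signal an RL loop could maximize; no human preference labels involved."""
    return accuracy_reward(completion, reference_answer) + format_reward(completion)

if __name__ == "__main__":
    print(total_reward("<think>2 + 2 is 4</think> 4", "4"))
    print(total_reward("<think>just guessing</think> 5", "4"))
    print(total_reward("4", "4"))
```

Rewards like these only need a reference answer that can be checked automatically, which is why they fit tasks such as math or code better than open-ended writing.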

My question is: did DeepSeek really solve the problem, knowing they extracted a lot of data from the datasets of LLMs, which all learned from human supervision? In other words, is the traditional dependency really broken when they rely on previously trained models?

Let me show you a live real-world screenshot shared by Alexandre Blanc today. It shows training data extracted from other models (here, ChatGPT) that have learned from human supervision ... I am not convinced yet that the traditional dependency is broken. It is "easy" to not need massive amounts of high-quality reasoning data for training when taking shortcuts ...

To be balanced and show the research, I've uploaded the DeepSeek R1 Paper (downloadable PDF, 22 pages).

My concerns regarding DeepSink?

Both the web and mobile apps collect your IP, keystroke patterns, and device details, and everything is stored on servers in China.

Keystroke pattern analysis is a behavioral biometric method used to identify and authenticate individuals based on their unique typing patterns.
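To show why typing patterns can identify a person, here is a minimal sketch of keystroke dynamics using two classic features: dwell time (how long a key is held) and flight time (the gap between releasing one key and pressing the next). The event format, the function names and the 30 ms threshold are assumptions for illustration; real systems use far richer features and statistical models.

```python
from statistics import mean

def keystroke_features(events):
    """events: list of (key, press_ms, release_ms) in typing order, at least two."""
    dwell = [release - press for _, press, release in events]
    flight = [events[i + 1][1] - events[i][2] for i in range(len(events) - 1)]
    return {"mean_dwell": mean(dwell), "mean_flight": mean(flight)}

def profile_distance(features, profile):
    """Sum of absolute differences between a sample and a stored profile."""
    return sum(abs(features[name] - profile[name]) for name in features)

def matches_profile(events, profile, threshold_ms=30.0):
    """True if the typing sample is close enough to the stored profile."""
    return profile_distance(keystroke_features(events), profile) <= threshold_ms

if __name__ == "__main__":
    sample = [("h", 0, 80), ("i", 120, 200)]
    print(keystroke_features(sample))
    print(matches_profile(sample, {"mean_dwell": 85, "mean_flight": 50}))
```

Even this crude comparison separates a fast, light typist from a slow, heavy one, which is why collected keystroke timing counts as biometric data rather than harmless telemetry.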

I can hear the "But 0p3n s0urc3 ...!" remarks.

Yes, open source is great, but this reasoning is limited because it does not consider human psychology.

Regular users will never run models locally.

Most will simply want quick answers.

Technically unsophisticated users will use the web and mobile versions.

Millions have already downloaded the mobile app on their phone.

DeepSeek's models have a real edge and that's why we see ultra-fast user adoption. For now, they are superior to Google's Gemini or OpenAI's ChatGPT in many ways. R1 scores high on objective benchmarks, no doubt about that.

I suggest searching for anything sensitive that does not align with the Party's propaganda on the web or mobile app, and the output will speak for itself ...

China vs America

Screenshots by T. Cassel. Freedom of speech is beautiful. I could share terrible examples of propaganda and censorship but I won't. Just do your own research. I'll end with DeepSeek's privacy policy, which you can read on their website. This is a simple screenshot, nothing more.

Rest assured, your code, ideas and conversations will never be archived! As for the real investments behind DeepSeek, we have no idea if they are in the hundreds of millions or in the billions. We just know the $5.6M amount the media has been pushing left and right is misinformation!

