After computing power, high-quality corpus datasets set the upper limit of large-model capability.
From February 21st to 23rd, Shanghai will host the 2025 Global Developer Conference (GDC). According to the Shanghai Economic and Information Commission, 100 domestic and foreign developer communities, including Hugging Face, Microsoft Developer Community, CSDN, Alibaba's ModelScope (Moda) community, the Linux Foundation, ARPA Foundation, and Huawei Community, will take part. The conference focuses on core technologies such as large models, computing power, corpora, tools, and software platforms. Participating developers span hardware development, cloud computing, big data, the Internet of Things, AI, robotics, blockchain, and the metaverse.
Shanghai Cupass Technology Co., Ltd. is one of the companies taking part in the conference. Cupass is a data platform company specializing in artificial intelligence corpora, established in accordance with the requirements of the Shanghai Municipal Party Committee and Municipal Government. The company positions itself as a specialized, function-oriented corpus service and operations platform, dedicated to providing low-cost, high-quality corpus data services to foundation models, vertical models, and small and medium-sized innovators and startups.
"Our entire team has been working non-stop since the fourth day of the Lunar New Year, researching and following up on DeepSeek's innovations." Cupass CEO Huang Haiqing told Jiemian News that the emergence of DeepSeek has left the entire AI industry both excited and anxious. The anxiety centers on why existing large models, despite so much investment, have not achieved the same results as DeepSeek.
He believes that DeepSeek's success lies not only in original algorithmic innovation but also in its use of high-quality corpus data, which can greatly reduce the computing power and data required, offering China's large model industry a way to "overtake on the curve". Huang Haiqing said that, given how large models are currently developing, high-quality corpus datasets will determine the upper limit of large-model capability, and a supply of high-quality corpora can greatly reduce training costs for large model companies.
He said that Cupass has fully launched the construction of industry corpora in fields including embodied intelligence, finance, manufacturing, education, medical care, entertainment, and urban governance. Version 1.0 of its corpus operations platform is already in operation, and the company is accelerating development of a 2.0 platform that extends from real-world data to simulation and data synthesis. The company has so far connected with more than 50 corpus ecosystem partners and lowers the cost of large models by providing them with high-quality, effective datasets.
According to Huang Haiqing, the scaling law still holds, but the pace of improvement has slowed. He believes that, beyond large language models, applications of multimodal large models will begin to take off, and that ToB (enterprise) and ToG (government) business models will become the main direction for large model companies. Many foundation model companies are now shifting toward industry verticals, and he expects that fewer than ten foundation model companies will survive in the Chinese market.
As for specific industries, he believes the financial, education, medical, and industrial sectors have been the first to embrace large models. Large models are also being actively applied in key areas such as autonomous driving, embodied intelligence, and AI for science. Over time, the transportation and retail industries will adopt large models as well. Demand for vertical industry corpora is growing accordingly, in both volume and quality. Reasoning models additionally require the reasoning process to be constructed on top of the original data, which places new requirements on corpus production.
On the collection and production of corpus data, Huang Haiqing also suggested that copyright law should keep pace with the times, with updates to how the scope of fair use is defined for corpus data used to train AI and large models.
"This is not about changing the past rules, but about adding and updating. I think this is a more suitable and workable path," Huang Haiqing said. "Previous copyright laws were designed for humans, not for artificial intelligence, large models, and corpus data. When large models are trained on corpus data, it may not be appropriate to judge machine learning by the standards of the past. Moreover, this issue has already affected the corpus procurement costs and legal risks of large model companies."
He made several recommendations. First, accelerate the clarification of fair use rules for large-model corpus data and promote the applicability of "text and data mining" to pre-training. Second, advance the fair use of data for machine learning in China, balancing the rights of copyright owners against the needs of technological development and resolving the difficulty of obtaining authorization. Third, the government should introduce incentive policies that support corpus data companies in strengthening research and development of automated toolchain platforms, including AI-driven automated cleaning and labeling toolchains, to reduce corpus data costs. Fourth, accelerate legal research on the scope of protection for AI-generated works and formulate clear rules on their ownership and liability.
Huang Haiqing also said that AI will come to dominate data annotation and cleaning, and that data annotation will shift from a labor-intensive industry to a knowledge- and technology-intensive one.
(Source: Jiemian News)
Source: East Money
Author: Jiemian News