He Shuibing: Storage and Computing Technology Forum for AI Systems | CNCC Experts Talk

2024-10-25

Abstract: Special guest in this issue: He Shuibing, CCF Distinguished Member; Researcher and Doctoral Supervisor, Zhejiang University. Author: He Shuibing, Chair of the CNCC2023 technical forum "Storage and Computing Technology for AI Systems".


Special guest in this issue:

He Shuibing, CCF Distinguished Member; Researcher and Doctoral Supervisor, Zhejiang University

Author: He Shuibing, Chair of the CNCC2023 technical forum "Storage and Computing Technology for AI Systems"

1. AI technology empowers all walks of life

Figure 1: AI technology empowers all walks of life

With socioeconomic development and rising technological levels, artificial intelligence (AI) technology has entered every aspect of people's lives, such as natural language processing, computer vision, brain-computer interfaces, and recommendation systems (Figure 1). With the emergence of phenomenal new AI applications such as ChatGPT and Stable Diffusion, AI has entered the era of large models. Innovative results combining vertical-domain applications with large models have sprung up like bamboo shoots after rain, and large models are showing unprecedented power in every corner of social life.

2. Challenges facing AI systems

The rapid development of AI technology is inseparable from the strong support of computer hardware platforms and software systems (collectively, AI systems). An efficient AI system can meet the computing power demands of AI tasks, improve application efficiency, reduce user costs, and enable breakthroughs in AI technology. Among the three pillars of AI, namely data, algorithms, and computing power, computing power is the foundation on which the entire field develops. However, with the explosive growth of model parameter counts and dataset sizes, today's AI systems face challenges in computing power, storage, networking, and reliability.

- AI systems face enormous computing power demands: Today's AI models are deep and complex, and they require huge amounts of compute. According to OpenAI's data, from 2012 to 2020 the compute consumed by AI training doubled every 3.4 months, a more than 300,000-fold increase over eight years [1] (see Figure 2). For the multimodal large model GPT-4, launched in March 2023, training compute is estimated to have reached an astonishing 2.15 × 10²⁵ FLOPs [2]. These ever-growing compute demands confront AI computing centers with unprecedented challenges.

Figure 2: Since 2012, computing power demand has increased by more than 300,000 times
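A quick back-of-envelope check of the growth figures quoted above, as a minimal Python sketch. The constants come from the text; note that the exact time window over which the 300,000× growth accrued varies by source, so this is illustrative arithmetic only:

```python
import math

# Illustrative check: how many 3.4-month doublings produce ~300,000x growth?
DOUBLING_PERIOD_MONTHS = 3.4   # doubling period quoted from OpenAI's analysis
GROWTH_FACTOR = 300_000        # ~300,000x increase in training compute

doublings_needed = math.log2(GROWTH_FACTOR)           # log base 2 of the growth
months_needed = doublings_needed * DOUBLING_PERIOD_MONTHS

print(f"{doublings_needed:.1f} doublings ≈ {months_needed / 12:.1f} years")
```

Roughly 18 doublings, or on the order of five years at that rate, which conveys how steep a 3.4-month doubling curve is.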

- AI systems face huge storage challenges: AI training, as exemplified by large models, involves many parameters and large input datasets. For example, GPT-4 has 1.8 trillion parameters and requires 13 trillion tokens as its input dataset [2]. AI systems therefore need efficient storage systems to store and read these large volumes of data. However, the data volume of AI models keeps rising year by year [3] (see Figure 3), while the performance of storage hardware grows far more slowly than GPU compute, so storage access is increasingly becoming the performance bottleneck of AI applications. For example, in a deep-learning climate forecasting workload at Oak Ridge National Laboratory, the distributed file system could provide only 1% of the ideal bandwidth (1.16 TB/s), and the total time spent on I/O became the performance bottleneck.

Figure 3: AI model growth trend
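To make the scale concrete, here is a rough footprint estimate using the figures quoted above (1.8T parameters, 13T tokens). The bytes-per-value choices are assumptions for illustration, not reported numbers:

```python
# Rough storage footprint of a GPT-4-scale workload.
PARAMS = 1.8e12          # parameter count quoted in the text
TOKENS = 13e12           # training tokens quoted in the text
BYTES_PER_PARAM = 2      # assumption: fp16/bf16 weights
BYTES_PER_TOKEN = 2      # assumption: ~2 bytes of raw text per token

weights_tb = PARAMS * BYTES_PER_PARAM / 1e12
dataset_tb = TOKENS * BYTES_PER_TOKEN / 1e12
print(f"weights ≈ {weights_tb:.1f} TB, dataset ≈ {dataset_tb:.1f} TB")
```

Even before optimizer states, activations, and checkpoints are counted, the working set is tens of terabytes, which must be streamed at high bandwidth during training.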

- AI systems have high network transmission requirements: Since a single compute node cannot satisfy large-scale AI computing demands, AI centers often connect multiple GPU servers over a network for distributed machine learning. In distributed machine learning, data must be communicated among machines. If network transmission is slow or unstable, the computational efficiency of the whole GPU system drops sharply. As shown in Figure 4, poor network transmission can cut model training efficiency in half, a great waste of precious hardware resources [4].

Figure 4: The training performance of the network communication restriction model
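A simple cost model illustrates why network speed dominates: in data-parallel training, a ring all-reduce moves about 2·(N−1)/N times the gradient size per worker each step. The model size and link bandwidths below are assumptions for illustration:

```python
# Per-step gradient synchronization time under a ring all-reduce cost model.
def allreduce_seconds(model_bytes: float, workers: int, bw_bytes_per_s: float) -> float:
    # Ring all-reduce: each worker sends/receives ~2*(N-1)/N of the gradients.
    volume = 2 * (workers - 1) / workers * model_bytes
    return volume / bw_bytes_per_s

grad_bytes = 7e9 * 2  # assumption: a 7B-parameter model with fp16 gradients
fast = allreduce_seconds(grad_bytes, 64, 200e9 / 8)   # assume 200 Gb/s links
slow = allreduce_seconds(grad_bytes, 64, 10e9 / 8)    # assume 10 Gb/s links
print(f"per-step sync: {fast:.2f}s on fast links vs {slow:.2f}s on slow links")
```

With 20× less bandwidth, each synchronization takes 20× longer, so a slow interconnect can easily leave expensive GPUs idle for most of every step.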

- AI systems have strong reliability requirements: Because AI applications involve many devices and long run times, they often face high error rates. For example, the OPT-175B model was trained on 992 A100 GPUs and suffered more than 110 failures during its two-month training run [5]. Similar phenomena appeared during the training of the BLOOM model [6]. Frequent failures waste hardware resources and increase the cost of execution, so efficient failure recovery mechanisms are needed to keep training running stably and continuously.
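One standard recovery mechanism is periodic checkpointing, and Young's classic approximation gives a near-optimal checkpoint interval of sqrt(2 · checkpoint_cost · MTBF). The sketch below derives a rough MTBF from the OPT-175B figures quoted above; the checkpoint cost is an assumption:

```python
import math

# Young's approximation for the checkpoint interval that minimizes lost work.
failures = 110
train_hours = 2 * 30 * 24            # ~2 months of training, in hours
mtbf_hours = train_hours / failures  # mean time between failures (~13 h)

checkpoint_cost_hours = 0.1          # assumption: ~6 minutes to write a checkpoint
interval_hours = math.sqrt(2 * checkpoint_cost_hours * mtbf_hours)
print(f"MTBF ≈ {mtbf_hours:.1f} h, checkpoint every ≈ {interval_hours:.1f} h")
```

At a failure every ~13 hours, checkpointing only once a day would lose hours of work per crash, which is why large training runs checkpoint far more frequently.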

[1] Mehonic A, Kenyon A J. Brain-inspired computing needs a master plan. Nature, 2022, 604(7905): 255-260.
[2] Wang G, Qin H, Jacobs S A, et al. ZeRO++: Extremely efficient collective communication for giant model training. arXiv preprint arXiv:2306.10209, 2023.
[5] Zhang S, Roller S, et al. OPT: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068, 2022.
[6]

3. New AI storage and computing technology

To address these challenges, we urgently need new AI storage and computing technologies that upgrade existing AI systems on multiple fronts, including storage, computation, and networking. By system architecture, the work falls into two lines: one optimizes AI systems under the classic von Neumann architecture, and the other explores AI systems built on new computing-in-memory architectures. Figure 5 compares the two.

Figure 5: Traditional von Neumann architecture vs. the new computing-in-memory architecture

(1) Optimizing the classic architecture

To meet unprecedented compute and storage demands, existing AI centers generally adopt a distributed architecture (Figure 6) that pools the capabilities of many processors, accelerators, or storage devices for large-scale machine learning [7]. Research on AI-oriented distributed computing, distributed storage, and new storage technologies has flourished, focusing on building efficient underlying compute scheduling systems, storage systems, and memory systems tailored to the unique computation and access patterns of AI models. Efficient network communication technology is also a hot topic in current AI systems research.

Figure 6: Classic distributed architecture

a. Distributed computing

AI centers generally deploy many GPU servers to meet the huge compute demands of AI models. Each server carries several GPUs that accelerate AI computation, and a whole cluster may contain thousands of GPUs. For example, to train GPT-4, OpenAI reportedly used 25,000 A100s to build a large-scale distributed GPU cluster. However, hardware utilization in AI systems remains low: GPU utilization is typically below 30%, which leads to huge resource waste and high compute costs. Existing AI systems therefore urgently need efficient software and hardware techniques to further improve the efficiency of distributed machine learning.
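The cost of low utilization can be made concrete with the public estimates discussed above (training compute on the order of 2.15 × 10²⁵ FLOPs, 25,000 A100s). The peak-FLOPS figure and utilization levels below are illustrative assumptions:

```python
# Estimated wall-clock time for a GPT-4-scale run at different GPU utilizations.
TOTAL_FLOPS = 2.15e25    # publicly estimated total training compute
GPUS = 25_000            # quoted A100 count
PEAK_PER_GPU = 312e12    # A100 bf16 tensor-core peak, FLOPS (datasheet figure)

def train_days(utilization: float) -> float:
    # Achieved throughput = cluster peak * utilization fraction.
    seconds = TOTAL_FLOPS / (GPUS * PEAK_PER_GPU * utilization)
    return seconds / 86_400

print(f"at 30% util: {train_days(0.30):.0f} days; at 50% util: {train_days(0.50):.0f} days")
```

Under these assumptions, raising utilization from 30% to 50% would shave more than a month off the run, which is why system-level efficiency work pays for itself.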

b. Distributed storage

To meet the growing bandwidth demands of AI models, AI centers often place model datasets on shared distributed storage. For example, in Microsoft's data centers, 97.3% of training tasks store and read data from the Azure distributed cloud storage system. However, as the data volume of AI applications keeps increasing, the I/O bandwidth that current distributed storage systems provide is still limited. Developing more efficient storage acceleration techniques, such as data prefetching and caching methods that exploit the access patterns of AI data, has therefore become a research hotspot for AI storage systems. Some new storage architectures and devices have also attracted significant attention.
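The prefetching idea mentioned above can be sketched in a few lines: a background thread reads upcoming batches into a bounded buffer while the trainer consumes earlier ones, overlapping I/O with compute. Everything here (the `load_batch` stand-in, the buffer size) is illustrative:

```python
import queue
import threading
import time

def load_batch(i: int) -> str:
    time.sleep(0.01)            # pretend this is a slow read from shared storage
    return f"batch-{i}"

def prefetcher(n_batches: int, buf: "queue.Queue") -> None:
    # Background thread: keep the buffer full while the trainer computes.
    for i in range(n_batches):
        buf.put(load_batch(i))  # blocks when the buffer is full
    buf.put(None)               # sentinel: no more data

buf: "queue.Queue" = queue.Queue(maxsize=4)   # bounded prefetch buffer
threading.Thread(target=prefetcher, args=(8, buf), daemon=True).start()

seen = []
while (batch := buf.get()) is not None:
    seen.append(batch)          # the "training step" consumes the batch
print(f"consumed {len(seen)} prefetched batches")
```

Production data loaders (e.g., in deep learning frameworks) apply the same pattern with worker pools and pinned memory; the bounded queue is what keeps memory use predictable.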

c. New storage technology

Some AI applications have strict latency and throughput requirements for training and inference, and disk-based storage systems struggle to satisfy their extreme bandwidth demands. Emerging non-volatile memory (NVM) offers high bandwidth and low latency together with the persistence of external storage, providing new ideas for designing efficient AI storage systems. However, NVM has its own device characteristics, so how to use it efficiently, perceiving device properties, reducing software overhead, and optimizing specifically for AI applications at the system and user software layers, has become a frontier research direction for new AI storage technology.
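NVM's defining property is byte-addressable persistence: data is updated with plain loads and stores rather than block I/O. Memory-mapping a file gives a rough software analogue of that access model; a real NVM deployment would use DAX mounts or a library such as PMDK, so this is only an illustration:

```python
import mmap
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "tensor.bin")
with open(path, "wb") as f:
    f.write(b"\x00" * 4096)               # reserve one page of backing store

with open(path, "r+b") as f:
    mm = mmap.mmap(f.fileno(), 4096)      # map the file into the address space
    mm[0:5] = b"hello"                     # store: plain byte writes, no write() syscall
    mm.flush()                             # persist, analogous to a cache-line flush
    mm.close()

with open(path, "rb") as f:               # reopen: the data survived
    data = f.read(5)
print(data)
```

The software-overhead question raised above is visible even here: the `flush` ordering point is what guarantees durability, and minimizing such flushes is a core concern of NVM-aware system design.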

d. Network acceleration technology

Improving network communication efficiency is an effective way to accelerate distributed machine learning as a whole. Communication acceleration based on intelligent network devices has received wide attention in the AI systems field. These methods offload computation and processing into the network by integrating new intelligent network devices, such as FPGA-based smart NICs and programmable switches, thereby reducing the traffic volume and network latency of AI computation and accelerating the training or inference of the whole AI model.
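On the software side, a complementary way to shrink traffic is gradient compression. The sketch below shows the simplest form, scaling fp32 gradients into int8 before transmission; it is purely illustrative (real systems such as ZeRO++ [2] use more sophisticated block-wise schemes):

```python
# Naive symmetric int8 quantization of a gradient vector: 1 byte/value on
# the wire instead of 4, at the cost of bounded rounding error.
def quantize(grads: list) -> tuple:
    scale = max(abs(g) for g in grads) / 127 or 1.0   # map the max magnitude to 127
    return [round(g / scale) for g in grads], scale

def dequantize(q: list, scale: float) -> list:
    return [v * scale for v in q]

grads = [0.6, -1.0, 0.25, 0.9]
q, scale = quantize(grads)                 # ints in [-127, 127] plus one scale
restored = dequantize(q, scale)
print(q, [round(r, 3) for r in restored])
```

The receiver reconstructs values to within half a quantization step, so the 4× traffic reduction costs only a small, controllable amount of gradient precision.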

(2) Designing new architectures

The traditional computing paradigm based on the von Neumann architecture, which separates storage from computation, cannot bridge the ever-widening gap between storage and compute performance, and constantly runs into the "memory wall" and "power wall". For this reason, non-von Neumann architectures, represented by the new computing-in-memory architecture, have been proposed. By introducing hardware such as computing-in-memory chips and brain-inspired chips, the new architecture fuses the storage and computation modules, effectively avoiding the memory wall and power wall caused by repeatedly shuttling data between storage and compute units, and thereby improving efficiency.

a. Computing-in-memory chips

Emerging computing-in-memory chips, such as those based on memristors, integrate data storage and computation in the same module. By computing in situ (see Figure 7), they greatly reduce data access latency and energy consumption, and can effectively meet the storage and compute needs of future large-scale AI applications. However, the technology is still in its infancy, and there is a long way to go before industrial deployment. How to innovate across circuits, architectures, and algorithms, open up the key links from underlying hardware to top-level applications, and design efficient computing-in-memory systems for AI remains an open problem.

Figure 7: Computing-in-memory chip
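The in-situ computation idea can be sketched numerically: in a memristor crossbar, conductances G encode the weights, input voltages V encode the activations, and by Ohm's and Kirchhoff's laws the column currents directly yield I = Gᵀ·V, so the matrix-vector multiply happens where the data is stored. A real device adds noise, conductance quantization, and ADC costs, all omitted in this toy model:

```python
# Toy model of an analog crossbar matrix-vector multiply.
def crossbar_mvm(G: list, V: list) -> list:
    # Current on each output column = sum over rows of G[r][c] * V[r].
    cols = len(G[0])
    return [sum(G[r][c] * V[r] for r in range(len(G))) for c in range(cols)]

G = [[0.1, 0.2],
     [0.3, 0.4]]          # conductances (weights), in arbitrary units
V = [1.0, 2.0]            # input voltages (activations)
out = crossbar_mvm(G, V)  # column currents (outputs)
print(out)
```

No weight ever moves to a separate compute unit, which is exactly how a crossbar sidesteps the memory-wall traffic described above.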

b. Brain-inspired computing chips

As an important branch of computing-in-memory, neuromorphic devices can compute and store at the same time, realizing a deep fusion of storage and computation, as shown in Figure 8. Building on this fusion, neuromorphic brain-inspired chips use spiking neural networks for training and inference, handling complex tasks such as mathematical problem solving and image recognition more efficiently, and thus hold high research value. Active exploration is under way both in China and abroad.

Figure 8: Spiking neural networks in brain-inspired computing
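The basic unit of the spiking neural networks mentioned above is the leaky integrate-and-fire (LIF) neuron: membrane potential accumulates weighted input, decays ("leaks") each step, and the neuron emits a spike and resets when a threshold is crossed. All constants below are illustrative:

```python
# Minimal leaky integrate-and-fire (LIF) neuron simulation.
def lif(inputs: list, leak: float = 0.9, threshold: float = 1.0) -> list:
    v, spikes = 0.0, []
    for x in inputs:
        v = v * leak + x          # leaky integration of input current
        if v >= threshold:        # fire on crossing the threshold...
            spikes.append(1)
            v = 0.0               # ...then reset the membrane potential
        else:
            spikes.append(0)
    return spikes

print(lif([0.4, 0.4, 0.4, 0.0, 0.8, 0.5]))
```

Note that state (the membrane potential) and computation live in the same unit, which is why neuromorphic hardware is naturally a form of computing-in-memory.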

4. Technical forums are an important means of advancing the field

As AI model parameter counts and compute scales grow explosively, the storage and computing technologies that drive AI model training and inference must also keep advancing. What new problems and challenges will ever-more-capable AI models bring to computing and storage systems? Where will new AI storage and computing technology go next?

Please follow the forum "New Trends in Storage and Computing Technology for AI Systems" organized at the CNCC conference. The forum invites outstanding scholars and leaders of top enterprises with representative recent results to share their work and discuss the key elements of new AI storage and computing technology, covering new applications, computing frameworks, computing architectures, cloud infrastructure, and the latest progress in the field. It will provide an excellent platform for academic exchange. Welcome to join us, and let us promote the development and progress of AI systems together!

CNCC Participation Registration

Forum name: Storage and Computing Technology for AI Systems

Time: October 26

Forum chair: He Shuibing, CCF Distinguished Member; Researcher and Doctoral Supervisor, Zhejiang University

Co-chair: Wang Yanfeng, Technical Expert, Huawei Cloud AI System Innovation Lab

