
NetEase GrowthEase and Dianhun (电魂) at GDC: Building a Game Ecosystem Together with AI

2023-03-27

Background

NetEase GrowthEase provided Dianhun's BarbarQ2 (《野蛮人2》) with a complete, customized AI game bot solution covering four areas: integration, bot training, bot deployment, and bot iteration. Specifically: integration: gameplay-experience and integration-method design; bot training: SAR (state-action-reward) scheme design, model network design, and large-scale distributed training; bot deployment: multi-difficulty AI bot interfaces, multi-style AI bot interfaces, an in-match dynamic difficulty adjustment interface, and private-deployment support; bot iteration: when a major version update lands, additional model training or model structure adjustments to keep the bots' performance up to requirements.

From March 20 to 24, 2023, the 35th Game Developers Conference (GDC) was held in San Francisco. NetEase GrowthEase appeared at GDC together with Dianhun and, drawing on reinforcement learning theory and the concrete use of AI bots in games, shared how NetEase GrowthEase's game AI solution has been put into practice in Dianhun's games, discussing with game developers and industry professionals how AI technology can help the games industry grow.

In a real game environment, the behavior of human players is fairly complex. Traditional ways of building game bots (such as finite state machines, FSM, or behavior trees) find it hard to reproduce the full range of real players' behavior, so traditional bots end up with fixed patterns and a narrow repertoire of actions: players can tell at a glance that they are fake. Moreover, as the number of battles grows, players quickly learn a traditional bot's behavioral "routines", and skilled players can beat it with ease, which greatly dilutes the sense of accomplishment from winning. To solve this problem we trained our game bots with reinforcement learning, and we call them AI bots.

Reinforcement learning is a special and important model-training method in artificial intelligence. Unlike conventional supervised or unsupervised learning, it does not need a large amount of training data; it needs a real environment. The model to be trained (usually called the agent) keeps interacting with that environment, and the environment returns feedback on each interaction; the agent learns from this feedback, adjusts how it interacts, and thus adapts better and better to the environment. Concretely, the agent observes the environment's current state, decides on an action to execute in the environment, and after the action is executed receives a reward fed back by the environment; by looping through this process, the agent learns an interaction policy that fits the environment well. Because the agent's learning process resembles the way humans learn, reinforcement learning is also regarded as one of the more feasible paths to general artificial intelligence, which gives it a relatively important position in the field. Our AI bots are trained with reinforcement learning, so they substantially outperform traditional bots in both human-likeness and strength.

To reduce the cost of training AI bots and improve training efficiency, we built our own distributed reinforcement-learning training framework, RLEase. Its core consists of two modules, Worker and Learner. A Worker's main job is to interact with the game environment, collect the states, actions, rewards, and other data produced by that interaction, periodically send the data to the Learner, and periodically sync the latest model from the Learner; the Learner's main job is to train the model on the data collected by the Workers. In practice there are many Workers and a single Learner. The rest of the framework provides auxiliary functions for a training job: the Stat module tracks basic statistics during training, such as the training loss and the win rates between models; the Model Manager module handles model management, including model storage/retrieval and model scheduling, where scheduling picks suitable opponent models for the current model based on inter-model win rates, so that self-play stays diverse.
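
To make the Worker/Learner data flow concrete, here is a minimal Python sketch. RLEase itself is not public, so the `env`, `model`, and queue interfaces below are assumptions for illustration only.

```python
# Minimal sketch of a Worker/Learner split, assuming a generic `env`
# (reset/step) and `model` (act/update/get_weights/set_weights) interface.
from multiprocessing import Queue

class Worker:
    """Interacts with the game environment and ships trajectories to the Learner."""
    def __init__(self, env, model, out_queue: Queue, weight_queue: Queue):
        self.env, self.model = env, model
        self.out_queue, self.weight_queue = out_queue, weight_queue

    def run_episode(self):
        state, done, trajectory = self.env.reset(), False, []
        while not done:
            action = self.model.act(state)
            next_state, reward, done = self.env.step(action)
            trajectory.append((state, action, reward))
            state = next_state
        self.out_queue.put(trajectory)            # send collected data to the Learner
        while not self.weight_queue.empty():      # pull the newest model if available
            self.model.set_weights(self.weight_queue.get())

class Learner:
    """Trains the model on trajectories collected by many Workers."""
    def __init__(self, model, in_queue: Queue, weight_queues, batch_size: int = 64):
        self.model, self.in_queue = model, in_queue
        self.weight_queues, self.batch_size = weight_queues, batch_size

    def train_step(self):
        batch = [self.in_queue.get() for _ in range(self.batch_size)]
        self.model.update(batch)                  # one gradient update on the batch
        for q in self.weight_queues:              # broadcast fresh weights to Workers
            q.put(self.model.get_weights())
```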

We designed the following network structure for training the BBQ2 AI bot model. Both the input and the output of the network are multi-head. The game-environment state fed to the network includes the character state, team state, opponent state, Boss state, and map state: the character state carries multi-dimensional information such as the character's position, health, skills, and buffs; the team state includes each teammate's character state plus the team's score; the opponent state includes each visible opponent's character state plus the enemy team's score; and the map state mainly represents the items, obstacles, mushrooms, and other basics around the character. Because the reinforcement-learning algorithm we use follows the Actor-Critic paradigm, the network's output has two parts, an action and a value: the action is what the AI bot will actually execute in the environment, while the value estimates how good that action is. The value output is used only during training; when the model is deployed online, the value head is discarded and only the action output is kept.
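
A minimal PyTorch sketch of such a multi-head actor-critic network might look like the following; the feature dimensions and layer sizes are made-up placeholders, the action-head sizes follow the action-space description given later in the article, and only the overall layout (separate input encoders, action heads, and a value head that is dropped at deployment) reflects the description above.

```python
import torch
import torch.nn as nn

class MultiHeadActorCritic(nn.Module):
    """Illustrative multi-head actor-critic; all feature sizes are placeholders."""
    def __init__(self, self_dim=64, team_dim=128, enemy_dim=128, boss_dim=32,
                 map_dim=256, hidden=256, n_action_types=11, n_directions=8, n_targets=10):
        super().__init__()
        # One encoder per input head (character / team / opponents / boss / map).
        self.self_enc  = nn.Sequential(nn.Linear(self_dim, hidden), nn.ReLU())
        self.team_enc  = nn.Sequential(nn.Linear(team_dim, hidden), nn.ReLU())
        self.enemy_enc = nn.Sequential(nn.Linear(enemy_dim, hidden), nn.ReLU())
        self.boss_enc  = nn.Sequential(nn.Linear(boss_dim, hidden), nn.ReLU())
        self.map_enc   = nn.Sequential(nn.Linear(map_dim, hidden), nn.ReLU())
        self.trunk = nn.Sequential(nn.Linear(hidden * 5, hidden), nn.ReLU())
        # Actor heads: one logit vector per action component.
        self.action_type_head = nn.Linear(hidden, n_action_types)
        self.direction_head   = nn.Linear(hidden, n_directions)
        self.target_head      = nn.Linear(hidden, n_targets)
        # Critic head: scalar state value, used only during training.
        self.value_head = nn.Linear(hidden, 1)

    def forward(self, obs):
        feats = torch.cat([
            self.self_enc(obs["self"]), self.team_enc(obs["team"]),
            self.enemy_enc(obs["enemy"]), self.boss_enc(obs["boss"]),
            self.map_enc(obs["map"]),
        ], dim=-1)
        h = self.trunk(feats)
        return {
            "action_type": self.action_type_head(h),
            "direction":   self.direction_head(h),
            "target":      self.target_head(h),
            "value":       self.value_head(h),   # dropped at deployment time
        }
```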

For cost reasons, we faced a number of constraints while developing the AI bots: the whole project could use no more than 3,000 CPU cores; because the game's launch test was approaching, we had only three short weeks for training; and because the game kept iterating, we also had to retrain and update the model frequently.

In most cases, a game developer who wants to integrate an external AI service runs into plenty of problems, starting with the integration itself. For game developers, building and using AI bots usually means having to learn reinforcement learning, which is a heavy burden for most people; AI engineers, in turn, usually do not understand game development logic, which makes it hard for them to train AI bots that perform well.

To solve this problem, we designed and built AIBridge. As the name suggests, AIBridge is the bridge between game developers and AI engineers. Its main job is to hide the complex logic on the AI service side: game developers only need to call the APIs it exposes to use the AI bots, and they never need to see the AI bots' internal logic.

AIBridge has two main classes, AIGamePlay and AIAgent. AIGamePlay maintains a set of sessions with the game logic, while AIAgent corresponds to a specific character in the game. AIGamePlay exposes three interface methods to game developers: GameStart, tick, and GameEnd. The developer calls GameStart after the game begins to establish a session between the game and the AI service, calls tick as needed to let each AIAgent decide and execute an action, and calls GameEnd before the game finishes to close the session. This approach greatly lowers the cost of connecting a game to the AI service.
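
A hypothetical server-side usage of these three calls could look like this; the class and method names (AIGamePlay, GameStart, tick, GameEnd) come from the description above, while the module name, parameters, and surrounding game loop are invented for illustration.

```python
from aibridge import AIGamePlay   # assumed module name, for illustration only

def run_match(game, ai_character_ids):
    session = AIGamePlay()
    # 1. Open a session with the AI service when the match starts.
    session.GameStart(match_id=game.match_id, ai_roles=ai_character_ids)

    while not game.finished():
        observations = {cid: game.observe(cid) for cid in ai_character_ids}
        # 2. Each frame (or every few frames), ask the AI agents for their actions.
        actions = session.tick(observations)
        for cid, action in actions.items():
            game.apply_action(cid, action)
        game.step()

    # 3. Close the session before the match is torn down.
    session.GameEnd()
```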

Second, because game logic is complex and the codebase is huge, a game inevitably contains many unknown bugs, and some of them are fatal to model training. During training we hit a bug where a character's movement speed jumped to its maximum after releasing its ultimate; no other character could ever catch up and kill it. The state data produced in that situation was simply wrong, and the AI bots' behavior turned strange because they had learned from it.

To deal with this, we run a series of real-time statistical analyses on the data produced during training and judge whether training is healthy by checking whether the distribution of each metric stays within a reasonable range, which in turn tells us whether an unknown bug has crept into the training process.
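
A minimal sketch of this kind of range check might look like the following; the metric names and thresholds are invented, and in practice the acceptable ranges would come from healthy historical training runs.

```python
# Illustrative range checks on per-iteration training statistics.
REASONABLE_RANGES = {
    "policy_loss":       (0.0, 50.0),
    "episode_length":    (30.0, 1200.0),   # decision steps per match
    "mean_move_speed":   (0.0, 12.0),      # a speed far above normal hints at a game bug
    "kills_per_episode": (0.0, 20.0),
}

def check_training_stats(stats: dict[str, float]) -> list[str]:
    """Return a warning for every metric that falls outside its expected range."""
    warnings = []
    for name, value in stats.items():
        low, high = REASONABLE_RANGES.get(name, (float("-inf"), float("inf")))
        if not (low <= value <= high):
            warnings.append(f"{name}={value:.2f} outside expected range [{low}, {high}]")
    return warnings

# Example: an abnormal movement-speed statistic would be flagged for replay inspection.
print(check_training_stats({"mean_move_speed": 37.5, "policy_loss": 2.1}))
```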

We also inspect the game replay files produced by training jobs suspected of containing bugs to pin down the problem.

Abstracting a complex game system is the biggest challenge in developing AI bots. By the time our AI bots went live, BBQ2 already had 6 heroes and 3 maps; each match is a melee between 3 teams, duplicate heroes are allowed, and heroes can pick up and use items that spawn randomly during a match. On top of that, BBQ2 has complex out-of-match progression systems such as Star Power and equipment. All of this has to be abstracted into the state space the model takes as input. And as the game keeps updating there will be more heroes, items, and equipment talents, so we also have to consider how the model adapts to new content.

Besides the input information, we also have to design the AI bots' action space so that they can attack and use items as fluidly as real players. Getting these designs right matters a great deal for how the AI bots end up performing.

To tame the huge state space, we applied some very effective feature engineering. For example, to better capture the mushrooms around a character, we draw a circle centered on the character, split it into 8 equal sectors, and normalize the mushroom count in each sector; we also limit each AI bot's perception range, keeping only the obstacles, items, and similar information within a certain radius as its state input. During training, each AI bot perceives the actual positions of the other visible characters, which experiments showed speeds up model training considerably. To cope with state-space changes caused by game updates, we replaced all the one-hot encodings of item and skill states with embeddings.
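
For the sector feature specifically, a minimal sketch could look like this; the 8-sector split and normalization follow the description above, while the perception radius and the normalization cap are assumed values.

```python
import math

def mushroom_sector_features(character_pos, mushrooms, radius=15.0,
                             n_sectors=8, max_per_sector=10):
    """Count mushrooms in 8 equal sectors of a circle centred on the character,
    then normalise each count. Radius and normalisation cap are illustrative."""
    counts = [0] * n_sectors
    cx, cy = character_pos
    for mx, my in mushrooms:
        dx, dy = mx - cx, my - cy
        if dx * dx + dy * dy > radius * radius:
            continue                              # outside the bot's perception range
        angle = math.atan2(dy, dx) % (2 * math.pi)
        sector = int(angle / (2 * math.pi / n_sectors))
        counts[sector] += 1
    return [min(c, max_per_sector) / max_per_sector for c in counts]

# Example: two mushrooms to the east, one to the north of the character.
print(mushroom_sector_features((0, 0), [(3, 0), (5, 1), (0, 4)]))
```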

For the action space, we abstracted the in-game actions into 11 different types; a hero's movement direction is abstracted into 8 options, and a skill can target any of 10 units (9 heroes plus 1 boss). As a result, the agent has 990 possibilities at every decision step.

To further reduce the training difficulty caused by the game's complexity, we designed a curriculum-learning training scheme. Early in a training job we use a fixed hero line-up and fixed equipment; once the AI bots reach a relatively high level we introduce more heroes and more equipment; and by the late stage of training all heroes and equipment are fully randomized. This approach lowered the training difficulty dramatically, and the final model performs very well even with heroes and equipment fully randomized.
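
A toy version of such a curriculum schedule might look like this; the stage definitions and win-rate thresholds are invented, and only the idea of moving from a fixed line-up toward full randomization follows the text.

```python
# Illustrative curriculum schedule; thresholds and pools are made up.
CURRICULUM = [
    {"name": "stage 1", "heroes": "fixed line-up", "equipment": "fixed set"},
    {"name": "stage 2", "heroes": "random from a small pool", "equipment": "random from a small pool"},
    {"name": "stage 3", "heroes": "fully random", "equipment": "fully random"},
]

def current_stage(win_rate_vs_benchmark, thresholds=(0.0, 0.6, 0.75)):
    """Advance to the next stage once the bot beats a benchmark opponent often enough."""
    stage = 0
    for i, t in enumerate(thresholds):
        if win_rate_vs_benchmark >= t:
            stage = i
    return CURRICULUM[stage]

print(current_stage(0.68)["name"])   # -> "stage 2"
```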

Reward shaping also speeds up model training. Drawing on past design experience, we did careful reward shaping on the BBQ2 project to make sure the AI bots behave the way real players would. For example, we added rewards that encourage the other characters to kill the demon king, while the character who becomes the demon king takes a heavier penalty for damage received; as a result the AI bots learn to focus fire on the demon king, and the demon king learns to dodge the other characters' attacks.
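
A minimal sketch of this kind of shaped reward might look like the following; the event names and weights are invented, and only the direction of each term (reward damage to the demon king, punish the demon king for taking damage) follows the example above.

```python
# Illustrative reward shaping for the demon-king mechanic; real weights need tuning.
REWARD_WEIGHTS = {
    "damage_to_demon_king":  0.02,   # encourage everyone to focus the demon king
    "kill_demon_king":       1.0,
    "damage_taken_as_demon": -0.03,  # the demon king is punished more for taking hits
    "death":                 -0.5,
    "score_gained":          0.01,
}

def shaped_reward(events: dict[str, float], is_demon_king: bool) -> float:
    reward = 0.0
    reward += REWARD_WEIGHTS["damage_to_demon_king"] * events.get("damage_to_demon_king", 0.0)
    reward += REWARD_WEIGHTS["kill_demon_king"] * events.get("kill_demon_king", 0.0)
    reward += REWARD_WEIGHTS["death"] * events.get("death", 0.0)
    reward += REWARD_WEIGHTS["score_gained"] * events.get("score_gained", 0.0)
    if is_demon_king:
        reward += REWARD_WEIGHTS["damage_taken_as_demon"] * events.get("damage_taken", 0.0)
    return reward
```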

Making bots feel like real players is why we built AI bots in the first place. To that end we designed AI bots with different play styles and different difficulty levels.

For the different styles we leaned mainly on reward shaping. We implemented three styles: aggressive, auxiliary (support), and cautious. For the aggressive style we started from the base bot model and moderately increased the damage and kill rewards while reducing the penalties for taking damage and dying; the auxiliary style increases the assist and healing rewards and reduces the damage reward; the cautious style increases the scoring and survival rewards, increases the damage-taken penalty, and reduces the damage and kill rewards.
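
To illustrate, the three styles could be expressed as per-style adjustments on top of a shared base reward table, as in this sketch; the numbers are invented and only the sign of each adjustment follows the description above.

```python
# Illustrative per-style adjustments on top of a base reward table.
BASE_WEIGHTS = {"damage": 0.02, "kill": 1.0, "assist": 0.3, "heal": 0.2,
                "score": 0.01, "survive": 0.05, "damage_taken": -0.02, "death": -0.5}

STYLE_DELTAS = {
    # Aggressive: bigger damage/kill rewards, weaker penalties for taking damage and dying.
    "aggressive": {"damage": +0.01, "kill": +0.5, "damage_taken": +0.01, "death": +0.2},
    # Auxiliary: bigger assist/heal rewards, smaller damage reward.
    "auxiliary":  {"assist": +0.3, "heal": +0.3, "damage": -0.01},
    # Cautious: bigger score/survival rewards, harsher damage-taken penalty,
    # smaller damage/kill rewards.
    "cautious":   {"score": +0.01, "survive": +0.05, "damage_taken": -0.02,
                   "damage": -0.01, "kill": -0.5},
}

def style_weights(style: str) -> dict[str, float]:
    weights = dict(BASE_WEIGHTS)
    for key, delta in STYLE_DELTAS.get(style, {}).items():
        weights[key] += delta
    return weights
```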

Here we use GIFs to show what the different styles look like in play. The first GIF shows a cautious-style AI bot: when its health is low it tends to keep away from opponents and poke them with ranged items. The second shows an aggressive-style bot: even at low health it pushes in to attack and finish off enemies. The last is the auxiliary-style bot: when a teammate's health is low it tends to cover and heal them. We also collected some basic statistics for the different styles and plotted them as a radar chart with the axes Assist, Death, Kill, SR (Score Rate), TFT (Team Fight Tendency), EDK (Effective Devil Kill), DK (Devil Kill), TD (Team Fight Damage), and AT (Attack Tendency). The chart shows that the cautious style prefers team fights (high TFT) while its death and kill counts are both low; the aggressive style has a strong appetite for attack (high AT), with high kill and death counts; and the auxiliary style is the most distinct, with high assist count, score rate, and team-fight tendency.

For the different difficulty levels, we weaken the strongest model step by step with fake states: we train a single strongest model, then inject different fake states into it one by one, ending up with 6 difficulty levels.

A fake state gives the AI bot a wrong perception of the game, which leads it to wrong decisions. In this example, the purple character's equipment and health are actually worse than the green character's, but we deliberately alter the state fed to the purple character so that it believes its equipment and health are better than its opponent's. The purple character, which should be running away, therefore tends to attack, and since the real situation is the opposite, it gets killed.
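
A minimal sketch of such a fake-state perturbation might look like this; which observation fields are faked, and by how much, is invented here. The point is only that the policy receives a deliberately wrong view of the game.

```python
# Illustrative "fake state" perturbation used to weaken a trained bot.
def apply_fake_state(observation: dict, difficulty: int) -> dict:
    """difficulty 0 = strongest (no perturbation); higher values distort more fields."""
    obs = dict(observation)
    if difficulty >= 1:
        # Make opponents look weaker than they really are.
        obs["opponent_hp_ratio"] = min(1.0, obs["opponent_hp_ratio"] * 0.7)
    if difficulty >= 2:
        # Make the bot overestimate its own equipment level.
        obs["self_equipment_level"] = obs["self_equipment_level"] + 1
    if difficulty >= 3:
        # Hide far-away items, shrinking the bot's effective perception.
        obs["visible_items"] = [i for i in obs["visible_items"] if i["distance"] < 5.0]
    return obs
```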

That covers most of the core content of the slides. Let's relax for a moment and watch a video we recorded of a human player fighting our AI bots.

Some numbers: in BBQ2's Singapore test we created more than 500,000 AI bots and played more than 100,000 matches; real players' average waiting time dropped by 64%, and the number of matches started increased 3.5 times.

Having gone through this project, we can say that bringing reinforcement-learning-trained AI bots into competitive games really does improve the player experience. We ran into plenty of challenges along the way, but we also found matching solutions: the AIBridge middleware, training-data monitoring, replay analysis, feature engineering, reward shaping, curriculum learning, and so on.

Once again, special thanks to our partner NetEase GrowthEase. If you are considering bringing reinforcement-learning bots into your game, choosing the right service provider goes a long way toward containing risk and maximizing returns. Our partner NetEase GrowthEase provided us with a full AI bot solution, covering integration, training, deployment, and iteration, and helped us apply AI bots in our game successfully.

Original English script:

1

I am delighted to be here at GDC to give this speech to all our partners in the gaming industry. The main topic is the challenges we faced and the solutions we came up with when applying reinforcement learning game bots in our game, BarbarQ2. The game is expected to launch officially in the second half of the year. We have completed online testing, and the performance of the AI game bots throughout the test was excellent, far beyond our expectations. I would like to give special thanks to our partner NetEase GrowthEase, who provided us with a complete AI game bot solution, helped us integrate the service, and also implemented the training and deployment of the AI game bots.

2

Here is the outline of this talk. First, I will introduce our game, then give a brief overview of reinforcement learning. Next, I will focus on the challenges we encountered in applying reinforcement learning and our solutions. Finally, some conclusions.

3

Here I would like to give a brief introduction to the game, BarbarQ2, and to its core mode, Mushroom Melee, which is where the AI game bots are applied. Mushroom Melee is a 3v3v3 mode: a player keeps collecting mushrooms, either by killing other players or directly from the ground, to continuously upgrade equipment and score. A player can become the demon king by accumulating score or by killing the current demon king, and the team that holds the demon king at the end of the match wins. Here is a video showing the process from the start of a match to becoming the demon king.

4

I want to talk about our goal in applying AI game bots: we want very human-like AI game bots that can be matched with or against human players, giving them a much better experience while also reducing matchmaking waiting time.

5

In a real game environment, the behaviors of human players are quite complex. Traditional ways of implementing game bots (such as finite state machines, FSM, or behavior trees) find it hard to imitate all the behaviors of human players, which makes a traditional bot's action pattern fixed and narrow: players can tell it is fake at a glance. Moreover, as the number of battles increases, players easily learn a traditional bot's behavioral rules, and high-end players can beat it very easily, which greatly reduces the sense of accomplishment even when players win. To solve this problem, we use reinforcement learning to train our game bots, which we call AI game bots.

Reinforcement learning is a special and important model-training method in the field of artificial intelligence. Unlike conventional supervised or unsupervised learning, it does not require a large amount of labeled training data; it requires a real environment. The model to be trained (usually called the agent) continuously interacts with that environment, and the environment gives corresponding feedback; guided by this feedback, the agent adjusts how it interacts and thus adapts better to the environment. Concretely, the agent observes the current state of the environment, outputs an action to be executed in the environment, and, after the action is executed, receives a reward from the environment. By repeating this cycle, the agent learns an interaction strategy that better adapts to the environment. Because the agent's learning process is similar to the way humans learn, reinforcement learning is also considered a comparatively feasible path to general artificial intelligence, so it holds a relatively high status in the field. Our AI game bots adopt reinforcement learning, so they perform much better than traditional game bots in terms of both human-likeness and strength.

6

In order to reduce the cost of training AI game bots and improve training efficiency, we developed a distributed reinforcement-learning training framework called RLEase. The core of the framework consists of two modules, Worker and Learner. A Worker's main function is to interact with the game environment, collect the states, actions, rewards, and other data generated by the interaction, regularly send these data to the Learner, and regularly synchronize the model from the Learner side; the Learner's main function is to use the data collected by the Workers for model training. During actual training there are multiple Workers and only one Learner. The other parts of the framework provide auxiliary functions for training tasks: the Stat module tracks basic statistics during training, such as the training loss and the winning rate between models; the Model Manager module handles model management, including model storage/retrieval and model scheduling, where scheduling selects suitable opponent models for the current model according to the winning rates between models, so as to achieve diverse self-play.

7

We designed the following network structure for training the BBQ2 AI game bot model. Both the input and the output of the network are multi-head. The game-environment states fed into the network include the character states, team states, opponent states, boss states, and map states. The character states carry multi-dimensional information such as character position, health, skills, and buffs. The team states include the character state of each teammate and the team's points. The opponent states include each visible opponent's character state and the enemy team's points, while the map states represent basic information such as the items, obstacles, and mushrooms around the character. Since the reinforcement learning algorithm we applied follows the Actor-Critic paradigm, the output of the network contains two parts, action and value: the action is what the AI game bot should perform in the environment, and the value is used to evaluate how good that action is. The value output is only used during training; when the model is deployed online, the value output is discarded and only the action output is kept.

8

Considering the cost, we had many restrictions in the process of developing the AI game bots. The total computing resources used by the entire project were kept under 3,000 CPU cores; because the game's online test was imminent, we had only three weeks to develop the AI game bots; and we also needed to update the model frequently to keep up with the game's fast update pace.

9

In most cases, game developers have to learn reinforcement learning if they want to apply AI game bots in their games, which turns out to be a heavy burden for them; AI engineers, on the other hand, usually do not understand video game development, which makes it hard for them to train AI game bots that perform well.

10

In order to solve this problem, we developed a middleware called AIBridge. As the name suggests, AIBridge builds a bridge between game developers and AI engineers. Its main purpose is to encapsulate the complex logic of the AI server: game developers only need to call its exposed APIs to use the AI game bots, and the bots' internal logic is completely hidden from them.

11

There are two main classes in AIBridge: AIGamePlay and AIAgent. AIGamePlay maintains a set of sessions with the game logic, and AIAgent corresponds to a specific character in the game. AIGamePlay provides game developers with three interfaces: GameStart, tick, and GameEnd. Developers call GameStart after the game starts to establish a session between the game and the AI service, then call tick as needed to let the AIAgents decide on actions, and call GameEnd before the game ends to close the session. This makes connecting a game to the AI game bot service very easy.

12

The second challenge: because of a video game's complex logic and huge amount of code, the game inevitably contains many unknown bugs, and some of them are fatal to model training. For example, during actual training we encountered a bug in which a character's movement speed reached its maximum after releasing its ultimate, so no other character could ever catch up and kill it. The states generated under this condition were simply wrong, and the AI game bots behaved weirdly because they had been trained on this wrong data.

13

To solve this problem, we ran a series of real-time statistical analyses on the training data and judged whether training was healthy by checking whether the distribution of each indicator stayed within a reasonable range, which told us whether the game contained some unknown bug.

14

At the same time, we also reviewed the replay videos generated by training tasks suspected of containing bugs to analyze the problem.

15

Abstracting complex game systems is our biggest challenge in AI game bot development. As of the launch of our AI game bots, BBQ2 included 6 heroes and 3 maps. Each match is a battle between 3 teams, duplicate heroes are allowed, and heroes can pick up and use items that appear randomly. In addition, BBQ2 also includes complex out-of-match progression systems such as star power and equipment. We need to abstract all of this into the state space that is fed to the model. As the game keeps updating there will be more heroes, items, and equipment, so we also need to take care of the model's adaptation to these changes.

16

In addition to the input information, we also need to design the action space of the AI game bots so that they can act as smoothly as human players. Getting these designs right is crucial to the AI game bots' performance.

17

To solve the problem of the state space being too large, we implemented some very effective feature engineering. For instance, to better capture the mushrooms around a character, we created a circle centered on the character, divided it into 8 equal sectors, normalized the number of mushrooms in each sector, and limited the perception range of each AI game bot. Each character perceived the actual locations of the other visible characters during training; experiments showed this was very effective in improving training speed. In order to handle the state-space changes caused by game updates, we replaced all the one-hot encodings used to represent item or skill states with their embedding forms.

18

We abstracted the actions in the game into 11 different types; the movement direction was abstracted into 8 options, and there are 10 skill-release targets: 9 heroes and one boss. Therefore, the agent has 990 choices every time it needs to make an action.

19

To reduce the training difficulty caused by the overly complex game, we designed a curriculum learning method. In the early stage of training, we use a fixed line-up of heroes and fixed equipment. After the AI game bots reach a relatively high level, we introduce more heroes and more equipment, until, in the final stage, all heroes and equipment are fully randomized. With this method we greatly reduced the difficulty of training, and the model achieves very good results even when heroes and equipment are all given randomly.

20

Reward shaping can also speed up model training. Based on past experience, we carried out very detailed reward shaping on the BBQ2 project to ensure that the AI game bots behave like human players. For example, we designed rewards that encourage the other characters to kill the demon king, while the character who becomes the demon king receives a heavier penalty for taking damage; as a result, the AI game bots learn to gang up on the demon king, and the demon king tries its best to avoid the other characters' attacks.

21

Making game bots behave more like human players is our original intention in developing AI game bots. To accomplish this, we designed AI game bots with different styles and difficulties.

22

To achieve diverse styles, we focused on reward shaping. We implemented three different styles: aggressive, auxiliary, and cautious. For the aggressive style, starting from the basic bot model, we moderately increased the rewards for damage and kills and reduced the penalties for taking damage and dying. The auxiliary style adds rewards for assists and healing and reduces the damage reward. The cautious style adds rewards for scoring and survival, increases the penalty for taking damage, and reduces the rewards for damage and kills.

23

Here we use GIFs to show the behavior of the different styles. The left GIF is the cautious style: when its health is low, it tends to stay away from opponents and poke them with ranged items. The middle one is the aggressive style: even at low health, it attacks and kills enemies actively. The right one is the auxiliary style: it tends to cover and heal teammates when their health is low. We collected some basic data for the different styles and displayed them as a radar chart with the axes Assist, Death, Kill, SR (Score Rate), TFT (Team Fight Tendency), EDK (Effective Devil Kill), DK (Devil Kill), TD (Team Fight Damage), and AT (Attack Tendency). The figure shows that the cautious style prefers team fights (high TFT) and has relatively low death and kill counts; the aggressive style has a strong desire to attack (high AT) and has higher kill and death counts; and the auxiliary style is the most distinct, with very high assist count, score rate, and team-fight tendency.

24

To achieve diverse difficulty levels, we used fake states to gradually weaken the model from its strongest level. Specifically, we trained the strongest model and then added different fake states to it one by one, resulting in 6 difficulty levels.

25

The so-called fake state gives the AI game bot a wrong perception, which leads to wrong decisions. In this example, the purple character's equipment and health are actually worse than the green character's, but we deliberately change the state fed to the purple character so that it mistakenly believes its equipment and health are better than its opponent's. The purple character, which should be running away, therefore tends to attack the opponent actively. But the real situation is the opposite, so the purple character gets killed.

26

Let's relax and watch a video of a human player fighting our AI game bots.

27

To conclude: in the BBQ2 Singapore test, we created over 500,000 AI game bots and conducted more than 100,000 matches. The average waiting time for real players was reduced by 64%, and the number of matches started increased by 3.5 times.

28

Through the development of this project, we have seen that introducing RL-based AI game bots into games can indeed enhance players' gaming experience. Throughout the implementation of the project we encountered numerous challenges, but we also came up with corresponding solutions, such as the AIBridge middleware, training stats monitoring, replay analysis, feature engineering, reward shaping, curriculum learning, and so on.

Here, we would like to express our special thanks to our business partner, NetEase GrowthEase. If you are considering applying reinforcement-learning game bots to your game, choosing the right service provider can maximize risk avoidance and revenue lift. Our partner, NetEase GrowthEase, has provided us with a complete AI game bot solution, including service integration, model training, model deployment, and iteration, which has helped us successfully apply RL-based AI game bots in our game.
