Mimicking Electronic Gaming Machine Player Behavior Using Reinforcement Learning

This study examines how Reinforcement Learning (RL) can be used to model player behavior on Electronic Gaming Machines (EGMs) found in venues such as casinos. The gaming industry is keen to understand the different types of player behavior and to create virtual players that mimic them. To this end, we grouped playing styles with K-means clustering, determined termination states for one of the resulting playing behaviors, and trained RL models to mimic that behavior. We implemented the Proximal Policy Optimization (PPO) and Actor Critic using Kronecker-Factored Trust Region (ACKTR) models, with the agents rewarded based on their proximity to the termination states. Our findings suggest that the ACKTR model performed better than the PPO model, with the generated playing behavior showing a high level of statistical similarity to real-world player behavior within the selected cluster.


Introduction
Electronic Gaming Machines (EGMs), gambling machines installed in locations such as casinos, bars, and hotels, have become increasingly popular, attracting the attention not only of the gaming industry but also of the government and researchers. The availability of gambling activities has increased due to internet gaming [1]. According to the 2018 Canadian Community Health Survey (CCHS), nearly two-thirds (64.5%) of Canadians aged 15 or older (18.9 million) reported gambling at least once in the previous year [2]. These games can be highly addictive because there is a chance of winning a large amount of money in a short time. In some situations, people continue to spend time and money on gambling even though it harms them mentally and financially, a disorder known as problem gambling [3,4]. The government is working to tackle this problem by promoting responsible gambling resources, while researchers have conducted studies [5][6][7][8] to identify gamblers at risk of problem gambling. Most of the work in this field has focused on limiting problem gambling. However, generating virtual players that mimic real-world players can help industry and researchers run experiments and conduct detailed behavioral analyses to understand behavioral patterns in depth without identifying any individual. Such a virtual player could also be used to anticipate the future behavior of a real player in a particular situation.
The major challenge in working with EGM data is that they are anonymous: the machines do not keep any identifying information about players in their logs, and they do not distinguish between different players' sessions. In this research, we use sessions that were sessionized based on balance and the pauses taken while playing [9]. However, these sessions do not contain any information about the experience of the player.
We developed agents using Reinforcement Learning (RL) to produce such sessions and thereby mimic the behavior of real-world EGM players. To mimic one particular type of player behavior, we first have to separate the different types of behavior. We used an unsupervised learning algorithm, K-means clustering, to group sessions with similar player behavior, with similarity measured by the Euclidean distance. To replicate a particular behavior, we define termination states using selected features and values from one cluster. These termination states are used to calculate the reward of the RL models: the agent receives higher rewards as it approaches the termination states. The closeness of the agent's current state to the termination states is also measured by the Euclidean distance. We trained two RL algorithms, Proximal Policy Optimization (PPO) and Actor Critic using Kronecker-Factored Trust Region (ACKTR), to generate sessions with a particular behavior.

Contributions
The four most important contributions of this paper are as follows: (1) We developed a methodology for mimicking the behavior of real-world players using reinforcement learning. (2) We demonstrated the effectiveness of the proposed approach in reducing game development time and costs by using virtual players for testing and validation. (3) We contributed to the advancement of artificial intelligence and machine learning in the context of mimicking game players. (4) We provided a valuable tool for game developers to improve the player experience and create more engaging games.
The remainder of the paper is organized as follows. Section 2 reviews the background on EGMs and related work. Section 3 discusses the methodology of this study. Section 4 details the experimental setup. Section 5 discusses the results of the Reinforcement Learning models and compares them with real-world players' sessions. Section 6 provides a conclusion and future directions.

Background and Related Work
Electronic Gaming Machines (EGMs) [10] are a common type of gambling machine found in casinos, clubs, and other public areas where people congregate for recreation. Although these devices use sophisticated technology and are in fact computers, many of them still have reels that purport to spin and are evocative of earlier gambling machines. A random number generator is the basis of every EGM. When a button or touch screen is pressed, the computer retrieves the numbers generated at that moment and transforms them into a display on the screen. The numbers are interpreted through a reel map (the quantity and arrangement of symbols on each virtual reel) and a pay table (the payouts for any combination of symbols appearing on a line). For instance, if the random process generates three cherries, the pay table maps that outcome to a payout of, say, two credits. These machines are stateless and do not keep track of most of the play data. Loyalty cards [11] are a major update that certain venues have implemented to track customer information in the casino. They produce well-formed data that takes into account playing sessions, games played, and money spent. With loyalty cards, the sessionizing task is no longer needed and a history of user play data is also available. However, loyalty cards are not required and are not even used by the majority of venues [12], so typical data processing is still in use.
Latifi [9] effectively sessionized user datasets using EGM logs, which contain game data and metadata but no user ID. Two assumptions were made. The first is that players use only one machine and that sessions start with a cash-in and end either with a cash-out or with the player using up all credits and leaving the machine. The second takes into account a minimum cash balance and an idle time threshold: if the time gap exceeds the idle time threshold and the credit is less than the minimum, the session is terminated. Multiple cash-outs within a session are allowed. The experimental results show that an idle time of a few minutes and a minimum balance of around a couple of dollars yield realistic player session durations. The present study uses data sessionized with this method, which makes this work directly relevant.
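The second assumption can be illustrated with a minimal sketch. This is not the implementation from [9]: the event format, the function name `sessionize`, and the threshold values (180 seconds and 2 dollars, chosen only to match "a few minutes" and "a couple of dollars") are assumptions made for illustration.

```python
from typing import List, Tuple

# Assumed thresholds; the source gives only approximate magnitudes.
IDLE_THRESHOLD_S = 180.0
MIN_BALANCE = 2.0

# Hypothetical event format: (timestamp in seconds, credit balance after the event).
Event = Tuple[float, float]


def sessionize(events: List[Event],
               idle_threshold_s: float = IDLE_THRESHOLD_S,
               min_balance: float = MIN_BALANCE) -> List[List[Event]]:
    """Split one machine's time-ordered events into sessions.

    A session ends when the pause before the next event exceeds the idle
    threshold AND the balance left on the machine is below the minimum.
    """
    sessions: List[List[Event]] = []
    current: List[Event] = []
    for event in events:
        if current:
            prev_time, prev_balance = current[-1]
            gap = event[0] - prev_time
            if gap > idle_threshold_s and prev_balance < min_balance:
                sessions.append(current)
                current = []
        current.append(event)
    if current:
        sessions.append(current)
    return sessions


# Toy log: a long pause with almost no credit splits the log,
# while a long pause with 40 dollars still on the machine does not.
log = [(0, 20.0), (30, 10.0), (60, 0.5), (600, 50.0), (630, 40.0), (1200, 35.0)]
sessions = sessionize(log)
print([len(s) for s in sessions])
```

Requiring both conditions keeps a player who pauses with credit still on the machine inside the same session.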
Studies have been conducted on detecting gamblers' personas and predicting at-risk problem gambling using unsupervised learning techniques such as clustering [5][6][7][8]. Problem gambling is characterized by excessive gambling behavior despite negative consequences, and often co-occurs with other negative habits such as substance abuse and food disorders [3,13,14]. Adami et al. [5] proposed indicators based on wager volatility and number of games played to identify a group of medium-risk players that were not recognized by Braverman and Shaffer [8].
Mosquera and Keselj [6] identified game-play types based on session start and end times using k-means clustering on EGM data (win or no win to cash out). They used ANOVA and Tukey's HSD to compare clusters. To refine the results, Latifi [9] advised using DBSCAN before k-means. The researcher also used a Multivariate Convolutional LSTM neural network to classify playstyle quickly, although performance did not improve significantly when more than 40 transactions were analyzed.
Deep neural network advancements have had a significant impact on the field of Reinforcement Learning (RL), allowing complicated decision-making problems with high-dimensional state and action spaces to be solved [15]. Deep Reinforcement Learning (DRL) has been used in a variety of fields, including robotics [16,17] and video games [18][19][20][21]. By studying the relationship between dopaminergic neuron activity and reward prediction errors, researchers in psychology and neuroscience have found evidence that the brain uses RL algorithms [22].
Wu and Izawa [23] studied the effect of regret on motivation in problem gambling by including it in reinforcement learning. They defined regret as the gap between the maximum reward and the current reward. Their regret reinforcement learning algorithm demonstrated behavior comparable to that of addicted gamblers by selecting high-risk, high-reward options. Inverse Reinforcement Learning (IRL) [24] models human behavior by extracting reward functions from observations of optimal behavior [25][26][27][28][29][30]. However, IRL requires a well-defined environment and behavior trajectories, which can be challenging to produce for EGMs.

Methodology
In this section, we discuss how we detected playstyles using clustering. We then use the data from one of the playstyle clusters in our Reinforcement Learning (RL) model to mimic the behavior of players in that cluster. We also discuss how we identified the termination states, where the players within this cluster end their session after deciding that playing further is not worth it. Finally, the sessions generated by the RL models are compared with the session data of the selected cluster. Figure 1 illustrates the flow of tasks performed for generating player behavior.
The data that we used come from a less sophisticated EGM game. This game does not have any bonus rounds or secondary game rounds, which makes the implementation of RL simpler. There are around 27,027 sessions in the dataset. Each session includes the player's transactions, containing information such as cash put into or removed from the machine, bets played, and cash won for each bet. Because EGM data contain a fairly limited set of information, many characteristics were derived from the sessions' initial features. Some of these features are listed and explained in Table 1.

Feature Selection and Data Transformation
To detect the playstyles of the players, we used only a few features in the clustering algorithm, similar to those used by Mosquera and Keselj [6]. To address the skewness of the data, we applied the Box-Cox transformation. Q-Q plot analysis was used to verify that the data were approximately normally distributed after the transformation. The transformed data were then normalized using z-score normalization.
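The following is a minimal sketch of this preprocessing, using only the Python standard library. How the authors chose the Box-Cox parameter is not stated; here it is picked by a coarse grid search over the usual profile log-likelihood, and the grid, function names, and toy data are all assumptions.

```python
import math
import statistics
from typing import List, Optional


def box_cox(values: List[float], lam: float) -> List[float]:
    """Box-Cox transform; all values must be strictly positive."""
    if abs(lam) < 1e-12:
        return [math.log(v) for v in values]
    return [(v ** lam - 1.0) / lam for v in values]


def box_cox_log_likelihood(values: List[float], lam: float) -> float:
    """Profile log-likelihood of lambda under a normal model (up to a constant)."""
    transformed = box_cox(values, lam)
    variance = statistics.pvariance(transformed)
    log_sum = sum(math.log(v) for v in values)
    return -0.5 * len(values) * math.log(variance) + (lam - 1.0) * log_sum


def fit_lambda(values: List[float], grid: Optional[List[float]] = None) -> float:
    """Pick the lambda with the highest log-likelihood on an assumed grid [-2, 2]."""
    grid = grid or [i / 10.0 for i in range(-20, 21)]
    return max(grid, key=lambda lam: box_cox_log_likelihood(values, lam))


def z_score(values: List[float]) -> List[float]:
    """Center to zero mean and scale to unit (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


# Toy right-skewed feature values (illustrative, not from the dataset).
skewed = [1.0, 2.0, 2.0, 3.0, 3.0, 4.0, 5.0, 8.0, 13.0, 40.0]
lam = fit_lambda(skewed)
normalized = z_score(box_cox(skewed, lam))
print(lam)
```

For right-skewed data the fitted parameter falls below 1, which compresses the long right tail before normalization.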

Clustering
We used the K-means clustering method with random initialization to identify different types of gambling behavior, or playstyles. Clustering algorithms, in general, divide data into k groups or clusters by analyzing cases in a data set; cases that appear similar to others are grouped together [4]. A dissimilarity function is used to define these clusters. There are numerous methods for clustering data, with k-means clustering being one of the most widely used.
The number of clusters is the only major hyperparameter that needs to be tuned for this algorithm, and it can be determined using the Elbow method. The algorithm is linear in complexity and scales well to big data. The dataset was clustered with values of k from 5 to 20 to identify a stable and suitable solution for k. The dissimilarity of data objects was calculated as the Euclidean distance between pairs of data objects on the normalized dataset. To mimic one of the playstyles, we selected one cluster, representing a group of players with similar playstyles, from among all the clusters.
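A minimal sketch of K-means with random initialization and Euclidean distance is given below, together with the within-cluster sum of squares (inertia) that the Elbow method inspects. It is written from scratch for illustration and is not the authors' code; the toy points and the small range of k are assumptions (the study scanned k from 5 to 20).

```python
import math
import random
from typing import List, Sequence, Tuple

Point = Sequence[float]


def euclidean(a: Point, b: Point) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def k_means(points: List[Point], k: int, seed: int = 0,
            max_iter: int = 100) -> Tuple[List[List[float]], List[int], float]:
    """Return (centroids, labels, inertia) for Lloyd's algorithm."""
    rng = random.Random(seed)
    # Random initialization: pick k distinct data points as starting centroids.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [min(range(k), key=lambda c: euclidean(p, centroids[c]))
                      for p in points]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
        converged = new_labels == labels
        labels = new_labels
        if converged:
            break
    inertia = sum(euclidean(p, centroids[lab]) ** 2
                  for p, lab in zip(points, labels))
    return centroids, labels, inertia


# Toy data: two well-separated groups of four points each.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11)]

# Elbow scan: inertia drops sharply up to the natural number of clusters.
inertias = {k: k_means(points, k)[2] for k in range(1, 5)}
_, labels2, inertia2 = k_means(points, 2)
print(inertias)
```

In the elbow plot, the chosen k is the point after which adding clusters yields only a small further reduction in inertia.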

Identifying Termination States
We want to find the termination states, where the player decides that continuing the session is not worth it, and train an RL model based on these states. The endpoints are the 25th, 50th, and 75th percentiles because they represent the majority of the distribution of the cluster and show the general playing style of the players within it.
The distance to a termination state is calculated by averaging the session features of 100 players and then computing the Euclidean distance between the state after each transaction and the termination values; i.e., the distance is tracked as the player progresses through the game. We chose the termination values of win percentage, loss percentage, loss disguised as a win, PW ratio, and the illusion of control (see description in Table 1) for training the RL model because their distance curves decrease smoothly, indicating a common playing style among players in this cluster.
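The sketch below illustrates one possible reading of this procedure: three termination states are built from the per-feature 25th, 50th, and 75th percentiles of a cluster, and the Euclidean distance from a running session state to the nearest of them is tracked. Treating the three percentile vectors as separate states and using the nearest one are assumptions, as are the feature names and toy values.

```python
import math
import statistics
from typing import Dict, List, Sequence


def termination_states(cluster: Dict[str, List[float]]) -> List[List[float]]:
    """Build three states from the 25th, 50th and 75th percentiles of each feature."""
    quartiles = [statistics.quantiles(values, n=4, method="inclusive")
                 for values in cluster.values()]
    return [[q[i] for q in quartiles] for i in range(3)]


def distance(state: Sequence[float], target: Sequence[float]) -> float:
    """Euclidean distance between a session state and a termination state."""
    return math.sqrt(sum((s - t) ** 2 for s, t in zip(state, target)))


def nearest_termination_distance(state: Sequence[float],
                                 states: List[List[float]]) -> float:
    return min(distance(state, t) for t in states)


# Toy cluster with two features per session (illustrative values only).
cluster = {
    "win_pct": [10.0, 20.0, 30.0, 40.0, 50.0],
    "loss_pct": [50.0, 60.0, 70.0, 80.0, 90.0],
}
states = termination_states(cluster)

# Running (win_pct, loss_pct) of one hypothetical session after each transaction.
trajectory = [[45.0, 50.0], [38.0, 62.0], [31.0, 69.0]]
distances = [nearest_termination_distance(s, states) for s in trajectory]
print(states, distances)
```

A smoothly decreasing sequence of distances, as in this toy trajectory, is the pattern the text describes for the selected features.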

Reinforcement Learning Algorithms
In this section, we will briefly describe the Reinforcement Learning algorithms used in this study to mimic real-world player behavior.

Proximal Policy Optimization
Model-free policy search techniques, such as policy gradient approaches, are helpful for updating the policy [31], but they are sensitive to the step size, and finding the right step size for the update is difficult.
To address this problem, Trust Region Policy Optimization (TRPO) [32] applies a trust region constraint to the objective function that bounds the KL divergence between the existing and new policies, ensuring that the new policy is not too far from the old one. Theoretically, this is supported by showing that improving the policy within the trust region results in a guaranteed monotonic improvement in performance. However, TRPO is computationally inefficient for large-scale tasks and is difficult to scale to sophisticated network architectures [33]. By using a clipping technique instead of imposing the hard constraint, Proximal Policy Optimization (PPO) [34] greatly decreases complexity and is able to employ a first-order optimizer, such as gradient descent, to optimize the objective function, which is defined as:

$$L^{CLIP}(\theta) = \hat{\mathbb{E}}_t\left[\min\left(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t\right)\right]$$

where $\theta$ is the policy parameter, $\hat{\mathbb{E}}_t$ is the expectation over time steps $t$, $r_t(\theta)$ is the probability ratio between the new and old policies, $\hat{A}_t$ is the estimated advantage at time $t$, and $\epsilon$ is a hyperparameter. By removing the incentive for moving the policy away from the previous one when the probability ratio between them is outside the clipping range, this objective function eliminates the KL constraint of TRPO while retaining the effect of a trust region update. PPO performs well overall across a broad range of tasks and is relatively easy to implement and tune while preserving the stability and reliability of TRPO.
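As a small numerical illustration of the clipping described above, the sketch below evaluates the standard PPO clipped surrogate, min(r·A, clip(r, 1−ε, 1+ε)·A), for individual samples and averages it over a batch. The value ε = 0.2 and the sample ratios and advantages are assumptions chosen for illustration; the source does not report the value used.

```python
from typing import List


def clipped_surrogate(ratio: float, advantage: float, epsilon: float = 0.2) -> float:
    """Per-sample clipped surrogate term of the PPO objective."""
    clipped_ratio = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
    return min(ratio * advantage, clipped_ratio * advantage)


def ppo_objective(ratios: List[float], advantages: List[float],
                  epsilon: float = 0.2) -> float:
    """Empirical mean of the clipped surrogate over a batch of samples."""
    terms = [clipped_surrogate(r, a, epsilon) for r, a in zip(ratios, advantages)]
    return sum(terms) / len(terms)


# Positive advantage: gains from pushing the ratio above 1 + epsilon are cut off.
print(clipped_surrogate(1.5, 1.0))
# Negative advantage: the unclipped, more pessimistic term is kept.
print(clipped_surrogate(1.5, -1.0))
# Negative advantage with a small ratio: the clipped term is the lower one.
print(clipped_surrogate(0.5, -1.0))
```

Taking the minimum of the clipped and unclipped terms makes the objective a pessimistic bound, so the policy gains nothing from moving the ratio outside the clipping range in the direction that would otherwise increase the objective.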

Actor Critic using Kronecker-Factored Trust Region
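Actor Critic using Kronecker-Factored Trust Region (ACKTR) [35] combines actor-critic methods, in which the actor performs an action while the critic estimates the value function, with distributed Kronecker factorization [36] and trust region optimization [32]. It creates a scalable approximation of the natural gradient using Kronecker-Factored Approximate Curvature (K-FAC) [36,37]. K-FAC uses a Kronecker-factored approximation to the Fisher matrix to perform efficient approximate natural gradient updates. It approximates the block $F_l$ corresponding to layer $l$ as $\hat{F}_l$ by calculating:

$$F_l = \mathbb{E}\left[\operatorname{vec}\{\nabla_W L\}\operatorname{vec}\{\nabla_W L\}^{\top}\right] = \mathbb{E}\left[aa^{\top} \otimes \nabla_s L\,(\nabla_s L)^{\top}\right] \approx \mathbb{E}\left[aa^{\top}\right] \otimes \mathbb{E}\left[\nabla_s L\,(\nabla_s L)^{\top}\right] := A \otimes S := \hat{F}_l$$

where $W$ is the weight matrix of layer $l$, $a$ is the input activation vector of the layer, $s = Wa$ is the pre-activation vector, and $L$ is the objective. This approximation can be understood as assuming that the second-order statistics of the activations and of the backpropagated derivatives are uncorrelated. The Fisher metric for RL objectives is defined as:

$$F = \mathbb{E}_{p(\tau)}\left[\nabla_\theta \log \pi(a_t \mid s_t)\,\left(\nabla_\theta \log \pi(a_t \mid s_t)\right)^{\top}\right]$$

where $p(\tau)$ is the distribution of trajectories, given by:

$$p(\tau) = p(s_0)\prod_{t=0}^{T} \pi(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t)$$

Here $a_t$ and $s_t$ denote the action and state at time $t$. The Fisher matrix, approximated using K-FAC, is used to update both the actor and the critic. ACKTR then applies the trust region formulation of K-FAC [38] to update the policy distribution. ACKTR supports both discrete and continuous action spaces when learning the model's action probability distribution from an observation: it returns the probability mass for discrete action spaces and the probability density for continuous action spaces [39].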
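The practical benefit of the Kronecker factorization can be checked on a toy example. If a layer's Fisher block is approximated as A ⊗ S, then multiplying the vectorized gradient by the inverse of that block is equivalent to computing S⁻¹ G A⁻¹, which requires inverting only the two small factors. The sketch below verifies this identity numerically in plain Python; the 2×2 matrices are arbitrary assumed values, and the helper names are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def inverse_2x2(m: Matrix) -> Matrix:
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def kron(a: Matrix, b: Matrix) -> Matrix:
    """Kronecker product: block (i, j) of the result is a[i][j] * b."""
    return [[a[i][j] * b[k][l]
             for j in range(len(a[0])) for l in range(len(b[0]))]
            for i in range(len(a)) for k in range(len(b))]


def vec(m: Matrix) -> List[float]:
    """Stack the columns of m into one vector (column-major order)."""
    return [m[r][c] for c in range(len(m[0])) for r in range(len(m))]


# Assumed toy factors: A stands for E[a a^T], S for the second moment of the
# backpropagated derivatives, G for the gradient with respect to the weights.
A = [[2.0, 1.0], [1.0, 3.0]]
S = [[4.0, 1.0], [1.0, 2.0]]
G = [[1.0, 2.0], [3.0, 4.0]]

# Approximate natural-gradient step using only the two small inverses.
step = matmul(matmul(inverse_2x2(S), G), inverse_2x2(A))

# Check: applying the full block A (x) S to vec(step) recovers vec(G).
fisher_block = kron(A, S)
recovered = [sum(row[i] * v for i, v in enumerate(vec(step)))
             for row in fisher_block]
print(recovered, vec(G))
```

For a layer with many inputs and outputs, the two factors are far smaller than the full Fisher block, which is what makes the approximate natural gradient tractable.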

Experimental Setup
For this study, we used PPO2, the GPU-enabled implementation by OpenAI, which uses vectorized environments for multiprocessing, whereas PPO1 uses MPI [40]. Both models, PPO2 and ACKTR, were trained for around 1 million iterations. The agent can take two main types of action: it can either make a wager or cash out. When wagering, the agent also has to decide the amount of money to wager. As in the real EGM game, the agent can only bet 2, 4, 8, 10, 15, 20, 25, 30, 35, 40, 45, 50, 60, 70, 80, 90, or 100 dollars. If the agent decides to cash out, it cashes out all of the money in the machine, because the majority of the players in the original cluster cashed out only once.
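One way to encode these choices is a single discrete action space with 18 actions: one cash-out action plus the 17 permitted wager amounts. The sketch below shows such an encoding; the index layout, the names `decode_action` and `is_valid`, and the validity rule are assumptions, since the source does not describe how the actions were encoded.

```python
from typing import Optional, Tuple

# The 17 wager amounts (in dollars) permitted by the game, as listed in the text.
WAGER_AMOUNTS = (2, 4, 8, 10, 15, 20, 25, 30, 35, 40, 45, 50, 60, 70, 80, 90, 100)

# Assumed layout: index 0 is cash out, indices 1..17 select a wager amount.
CASH_OUT = 0
NUM_ACTIONS = 1 + len(WAGER_AMOUNTS)


def decode_action(action: int) -> Tuple[str, Optional[int]]:
    """Map a discrete action index to ('cash_out', None) or ('wager', amount)."""
    if not 0 <= action < NUM_ACTIONS:
        raise ValueError(f"action index out of range: {action}")
    if action == CASH_OUT:
        return ("cash_out", None)
    return ("wager", WAGER_AMOUNTS[action - 1])


def is_valid(action: int, credit: float) -> bool:
    """A wager is valid only if it does not exceed the current credit."""
    kind, amount = decode_action(action)
    return kind == "cash_out" or amount <= credit


print(NUM_ACTIONS, decode_action(0), decode_action(17), is_valid(17, 50.0))
```

Folding the wager amount into the action index keeps the action space discrete, which both PPO2 and ACKTR support.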

State/Observation Space
The agent looks at the observation space in order to take an action. Because the environment has no cash-in action, the agent initially receives a random amount of credit in the machine to start with. The agent considers the amounts wagered and received in the previous transaction, the current credit in the machine, the win percentage, the loss percentage, the loss disguised as a win, the payout-to-wager (PW) ratio, and the illusion of control. The agent uses these features to predict which actions will result in a higher reward and acts accordingly.
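A minimal sketch of how such an observation vector could be assembled from the running session history follows. The outcome definitions are assumptions: a loss is a payout of zero, a loss disguised as a win is a positive payout smaller than the wager, and a win is a payout of at least the wager. The illusion-of-control feature is omitted because its definition is not given in this section.

```python
from typing import List, Tuple

# Hypothetical history format: one (wager, payout) pair per transaction.
History = List[Tuple[float, float]]


def build_observation(credit: float, history: History) -> List[float]:
    """Return [last wager, last payout, credit, win %, loss %, LDW %, PW ratio]."""
    if not history:
        return [0.0, 0.0, credit, 0.0, 0.0, 0.0, 0.0]
    n = len(history)
    wins = sum(1 for wager, payout in history if payout >= wager)
    losses = sum(1 for _, payout in history if payout == 0)
    ldws = sum(1 for wager, payout in history if 0 < payout < wager)
    total_wager = sum(wager for wager, _ in history)
    total_payout = sum(payout for _, payout in history)
    last_wager, last_payout = history[-1]
    return [
        float(last_wager),
        float(last_payout),
        float(credit),
        100.0 * wins / n,
        100.0 * losses / n,
        100.0 * ldws / n,
        total_payout / total_wager,  # payout-to-wager (PW) ratio
    ]


# Toy session: a loss, a loss disguised as a win, a win, then another loss.
history = [(2.0, 0.0), (2.0, 1.0), (2.0, 4.0), (4.0, 0.0)]
obs = build_observation(credit=10.0, history=history)
print(obs)
```

Because every entry is a running statistic, the same function can be called after each transaction to produce the next observation.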

Reward
The agent's goal is to maximize the total reward of the game episode. For each step in which the agent makes an invalid move (for example, wagering more than the available credit), it is penalized by 15. If the Euclidean distance between the termination state and the current state decreases, the agent receives a reward of 1. If the Euclidean distance between the current state and the termination state is less than 2, the agent receives a reward of 5. If the agent cashes out from the machine, it receives a reward of 15.
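These rules can be written as a small step-reward function. The sketch below assumes that the positive rewards are additive when several conditions hold in the same step and that an invalid move yields only the penalty; the source does not state how the components combine, so both choices are assumptions.

```python
INVALID_PENALTY = -15.0   # invalid move, e.g. wagering more than the credit
CLOSER_REWARD = 1.0       # distance to the termination state decreased
NEAR_REWARD = 5.0         # distance to the termination state is below 2
CASH_OUT_REWARD = 15.0    # agent cashed out
NEAR_THRESHOLD = 2.0


def step_reward(prev_distance: float, curr_distance: float,
                invalid_move: bool, cashed_out: bool) -> float:
    """Reward for one environment step, given distances before and after it."""
    if invalid_move:
        return INVALID_PENALTY
    reward = 0.0
    if curr_distance < prev_distance:
        reward += CLOSER_REWARD
    if curr_distance < NEAR_THRESHOLD:
        reward += NEAR_REWARD
    if cashed_out:
        reward += CASH_OUT_REWARD
    return reward


# Moving closer and landing within the threshold earns both bonuses.
print(step_reward(5.0, 1.5, invalid_move=False, cashed_out=False))
# An invalid wager is penalized regardless of the distances.
print(step_reward(5.0, 4.0, invalid_move=True, cashed_out=False))
```

The small per-step reward for moving closer provides a dense signal, while the larger bonuses mark the states and the action that end a session in the selected playstyle.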

Results and Discussion
For clustering the different playstyles, the optimal value of k was found to be 9, which was then verified using silhouette analysis (Figure 2). We chose cluster number 4, which contains 4483 sessions, to mimic the playstyle of the players in that group. This cluster represents intense gamblers, with a mean of 0.26 bets per second, who are unconcerned about losing a lot of money, as evidenced by an 82% loss. We generated around 1000 sessions of agents playing the game with both models, trained using the termination states of the chosen cluster. These session data were then used to compute the win percentage, loss percentage, loss disguised as a win, and PW ratio (see description in Table 1). To evaluate the models' performance, we compared statistical measures of these features, namely the minimum, 25th percentile, median, 75th percentile, and maximum, between the generated sessions and the real player cluster to see whether the agents' playstyle matched the real players' playstyle in the selected cluster.
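The comparison of statistical measures can be sketched as a five-number summary computed for the same feature in the real and the generated sessions. The function names and the toy values below are assumptions for illustration only; they are not results from the study.

```python
import statistics
from typing import List, Tuple

Summary = Tuple[float, float, float, float, float]


def five_number_summary(values: List[float]) -> Summary:
    """Return (minimum, 25th percentile, median, 75th percentile, maximum)."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return (min(values), q1, median, q3, max(values))


def summary_gap(real: List[float], generated: List[float]) -> Summary:
    """Absolute difference between the two summaries, statistic by statistic."""
    real_summary = five_number_summary(real)
    generated_summary = five_number_summary(generated)
    return tuple(abs(r - g) for r, g in zip(real_summary, generated_summary))


# Toy feature values: the generated sample matches except for its maximum.
real_feature = [1.0, 2.0, 3.0, 4.0, 5.0]
generated_feature = [1.0, 2.0, 3.0, 4.0, 9.0]
gap = summary_gap(real_feature, generated_feature)
print(five_number_summary(real_feature), gap)
```

Quartiles are robust to a few extreme sessions, whereas the minimum and maximum are determined by single sessions and are therefore the statistics most sensitive to noise.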
Figure 3 shows that both ACKTR and PPO2 generated sessions that are very similar to those of real players. The 25th percentile, median, and 75th percentile values of both models are nearly identical to those of the original cluster for all the selected features. Neither model matched the maximum values well, which we attribute to noise in the data. Figure 4 depicts the distribution comparison: the distribution of sessions generated by both models is similar to that of the original cluster. ACKTR reproduced the wager features, such as the number of unique wagers, the average wager, and the illusion of control, more accurately because its agent learned to change the wager, whereas the PPO2 agent played the entire session with only one wager value. The ACKTR agent changed the wager value depending on the losses and wins in order to gain the maximum reward from the session. PPO2 was better than ACKTR at playing for a longer time: PPO2 produced sessions with around 200 wagers, while ACKTR produced sessions with a maximum of 100 wagers.

Conclusion
This study mimics player behavior in order to evaluate reinforcement learning models as a feasible substitute for real-world players. K-means gave an excellent separation of the player behaviors. We selected the cluster of intense gamblers and used its behavioral attributes to define the termination states for the RL models. The reward for the agents was calculated based on the Euclidean distance between the current state and the termination states. Both models, PPO2 and ACKTR, succeeded in producing behavior similar to that of the selected cluster. Although PPO2 was able to produce longer sessions with more wagers, ACKTR slightly outperformed PPO2 because the ACKTR agent was able to change wager values during the session, which is an important attribute for concluding that the agent mimicked the player behavior.

Limitations
The RL algorithms employed to mimic player behavior in this study were applied to only one playstyle and one game; hence, the results may not generalize to other playstyles and games. Since the RL models have not been tested in production, they may be subject to unforeseen limitations.

Future Work
For future work, a different clustering algorithm could be used to obtain a better distinction between player behaviors. More features could be included in the termination states, which might make the agent behave more like a real player because it would have more behavioral attributes to take into account. It would be interesting to see how the agent behaves when the reward function is changed from the Euclidean distance to some other distance. Currently, the model does not have a cash-in action, as it starts with a random amount of credit in the machine. This action could be added so that it is performed by the agent, and the agent could also be modeled to perform intermediate cash-ins and cash-outs based on the net loss of the session. Finally, it would be interesting to change this EGM environment to simulate a video game, such as an Atari game.

Table 1.
Features: Explanation
Total cash in: The total amount of money inserted by the player in the machine during a play session
Average primary wager: The average primary wager in a play session
Session length: Elapsed time from the start of a session to the end of the session
Total cash out: Total cash out in a session
Starting cash in: Total amount of cash in at the start of the session, before the player starts playing
Loss Percentage: Total number of losses divided by total number of wagers
Intensity: Intensity (wagers/minute) in a session
Cash out to cash in ratio: The ratio of cash out to cash in
Number of cash in: Number of times a player inserted money in a machine
Payout to Wager (PW) ratio: Total payout divided by total wager in a session

Table 2.
Table 2 shows the selected features for clustering and their statistical measures. It is clear from Table 2 that the data are skewed to the right, indicating a non-normal sample distribution.
Columns: Features, Statistics Measures