A Hyperscale Data Center (HDC) is essentially a massive building filled with thousands of servers, racks, and networking equipment, often covering the area of several football fields. Meta operates many of these on its campuses and uses a number of designs for them, including H-shaped and I-shaped layouts.
Problem
When designing a new HDC, Meta needed to be sure it would operate without delays or bottlenecks across all connected functions. One important element to consider was how to improve the workflows associated with the racks in the building.
There are three types of workflows associated with the racks, but the focus of this case study is on “receiving”. This workflow occurs when a data center is coming online and racks are being brought into that data center.
The teams that support these workflows are highly skilled, so the workflows, and how the work is staged and planned, should be optimized to make the best use of their time.
However, Meta did not have a good process for visualizing and simulating the operational constraints in HDC designs, and therefore could not understand bottlenecks or throughput capabilities.
Solution
To better understand the requirements of a new HDC, Meta decided to implement a modeling approach. The first step was creating a 3D visualization on top of an agent-based model to validate rack-flow data and accelerate the teams’ learning about their workflows and resources. This let them see the entire workflow within the space where it would happen, all in advance, prior to construction of the HDC.
The 3D visualization provided insight into the parameters needed to set up the discrete-event simulation model, as well as how to optimize them.
For the discrete-event simulation, a number of assumptions were necessary (a minimal code sketch of such a model follows this list):
- The HDC is an object with multiple parameters in the AnyLogic platform and the throughput of a variety of HDC types can be predicted.
- The process to model is rack receiving with multiple steps spread amongst different teams (receiving and positioning – team 1, energizing – team 2, cabling – team 3, provisioning – automated).
- The simulation covers one week, and throughput is measured as a percentage of the total.
- The number of resources in each team is configurable, and utilization is set between 60% and 80%.
- Shifts start at 8am and overtime is permitted.
- Team 1’s process includes unloading the truck, unpacking and paperwork, the dock queue, data hall runs, elevator capacity, and delivery to the 1st or 2nd floor (50/50 chance).
- Processing times for team 2 and team 3 follow a normal distribution.
- Provisioning has two steps, switch and server, each with a success rate of 80% (meaning that 20% of the time rework is needed).
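To make these assumptions concrete, here is a minimal sketch of such a receiving model written with the SimPy library. It is not Meta’s AnyLogic model: the rack count, team sizes, and processing-time values are illustrative assumptions; only the structure (three teams, a 50/50 floor split, an 80% first-pass provisioning rate, a one-week horizon) follows the list above.

```python
"""Minimal sketch of the rack-receiving workflow as a discrete-event
simulation (SimPy). Not the AnyLogic model from the case study: rack
counts, team sizes, and durations are illustrative assumptions."""
import random
import simpy

WEEK_HOURS = 7 * 24      # one-week simulation horizon, in hours
RACKS_PLANNED = 100      # assumed number of racks to be received

def rack_flow(env, name, team1, team2, team3, done):
    # Team 1: receiving and positioning (unload, unpack, paperwork, dock queue)
    with team1.request() as req:
        yield req
        yield env.timeout(random.uniform(1.0, 3.0))        # assumed hours
    # 50/50 chance the rack goes to the 1st or 2nd floor (elevator run)
    if random.random() < 0.5:
        yield env.timeout(0.25)                            # assumed elevator time
    # Team 2: energizing, normally distributed duration
    with team2.request() as req:
        yield req
        yield env.timeout(max(0.1, random.gauss(2.0, 0.5)))
    # Team 3: cabling, normally distributed duration
    with team3.request() as req:
        yield req
        yield env.timeout(max(0.1, random.gauss(4.0, 1.0)))
    # Provisioning (automated): switch and server steps, 80% first-pass success
    for _step in ("switch", "server"):
        while True:
            yield env.timeout(1.0)                         # assumed step duration
            if random.random() < 0.8:                      # 20% of the time: rework
                break
    done.append(name)

def run_week(team1_size=2, team2_size=2, team3_size=3, seed=1):
    random.seed(seed)
    env = simpy.Environment()
    team1 = simpy.Resource(env, capacity=team1_size)
    team2 = simpy.Resource(env, capacity=team2_size)
    team3 = simpy.Resource(env, capacity=team3_size)
    done = []
    for i in range(RACKS_PLANNED):
        env.process(rack_flow(env, f"rack-{i}", team1, team2, team3, done))
    env.run(until=WEEK_HOURS)
    return len(done) / RACKS_PLANNED   # throughput as a share of the total

print(f"throughput: {run_week():.0%}")
```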
One of the many great features of AnyLogic is the ability to develop a UI for the model, change its parameters, and see the results. Meta created a UI for their receiving model in which each user could change the parameters for each team, e.g., adjust the number of people, add overtime, change the unloading time, and so on.
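As a rough stand-in for that kind of parameter UI, the sketch above can be driven from a small set of named parameter presets; the team sizes below are purely illustrative, and the snippet assumes run_week() from the earlier sketch is in scope.

```python
# Illustrative parameter presets a user might try (the AnyLogic UI exposes
# similar knobs: people per team, overtime, unloading time, etc.).
# Assumes run_week() from the earlier sketch is in scope.
presets = {
    "baseline": dict(team1_size=2, team2_size=2, team3_size=3),
    "extra cabling staff": dict(team1_size=2, team2_size=2, team3_size=5),
}
for label, params in presets.items():
    print(f"{label}: throughput {run_week(**params):.0%}")
```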
Results
Discrete-event simulation
Meta first ran the discrete-event simulation model with the regular parameters. The target throughput was 100%, but they achieved only 40%, with an average duration of 3.7 days. Bottlenecks were identified in the cabling and positioning steps of the racking process.
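For illustration, one way to see where bottlenecks sit in the sketch model above is to sample each team’s queue length over time; the team with the longest queues is the constraint. This is only a stand-in for the statistics a full simulation tool collects, and it reuses rack_flow, RACKS_PLANNED, and WEEK_HOURS from the earlier sketch.

```python
def monitor(env, teams, samples, every=1.0):
    # Sample each team's queue length once per simulated hour.
    while True:
        for name, res in teams.items():
            samples[name].append(len(res.queue))
        yield env.timeout(every)

def run_week_with_stats(team1_size=2, team2_size=2, team3_size=3, seed=1):
    random.seed(seed)
    env = simpy.Environment()
    teams = {
        "receiving/positioning": simpy.Resource(env, capacity=team1_size),
        "energizing": simpy.Resource(env, capacity=team2_size),
        "cabling": simpy.Resource(env, capacity=team3_size),
    }
    done, samples = [], {name: [] for name in teams}
    for i in range(RACKS_PLANNED):
        env.process(rack_flow(env, f"rack-{i}", *teams.values(), done))
    env.process(monitor(env, teams, samples))
    env.run(until=WEEK_HOURS)
    for name, qs in samples.items():
        print(f"{name}: average queue length {sum(qs) / len(qs):.1f} racks")
    return len(done) / RACKS_PLANNED
```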
Optimization experiment
To resolve these issues, it was necessary to identify the optimal value of each parameter. This was done with an optimization experiment whose objective was to maximize throughput.
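AnyLogic provides a built-in optimization experiment for this. As a crude stand-in, and assuming the run_week() sketch from earlier, an exhaustive search over team sizes shows the same idea: pick the parameter combination that maximizes weekly throughput.

```python
import itertools

def optimize_team_sizes(max_per_team=6, replications=3):
    """Brute-force stand-in for an optimization experiment: search the team
    sizes that maximize weekly throughput of the sketch model."""
    best_tp, best_sizes = 0.0, None
    for sizes in itertools.product(range(1, max_per_team + 1), repeat=3):
        # Average a few replications to smooth out simulation randomness.
        tp = sum(run_week(*sizes, seed=s) for s in range(replications)) / replications
        if tp > best_tp:
            best_tp, best_sizes = tp, sizes
    return best_tp, best_sizes

throughput, sizes = optimize_team_sizes()
print(f"best throughput {throughput:.0%} with team sizes {sizes}")
```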
The optimization results can be seen in the tables below. Based on these results, the team ran the model again and achieved 92% throughput with an average duration of 2.2 days, a reduction of 1.5 days from the initial model. As a result, more racks could be received per week.
In addition, there were no bottlenecks, and the only reason throughput was not 100% was that the simulation horizon was set to one week and provisioning (an automated process) could not be completed within that window.
However, having an optimized model does not tell the whole story because, in the real world, there is an element of uncertainty.
Monte Carlo Experiment
Meta understood this and decided to use a Monte Carlo experiment, a stochastic method that runs the model many times with random samples of its inputs. Instead of a single scenario, the result is a distribution of outcomes.
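A minimal version of that idea, again assuming the run_week() sketch from earlier: repeat the weekly simulation with a different random seed each time (so the stochastic processing times are resampled) and bucket the resulting throughput values into a distribution. The run count and bin width here are illustrative.

```python
from collections import Counter

def monte_carlo(runs=1_000):
    # Repeat the weekly simulation with different random seeds and bucket
    # the resulting throughput into 10% bins to approximate its distribution.
    bins = Counter()
    for seed in range(runs):
        tp = run_week(team1_size=2, team2_size=2, team3_size=3, seed=seed)
        bins[round(tp, 1)] += 1
    for tp in sorted(bins):
        print(f"throughput {tp:.0%}: probability {bins[tp] / runs:.1%}")

monte_carlo()
```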
Meta ran the model 10,000 times, and the results can be seen in the illustration below. The X-axis is the throughput, and the Y-axis is the probability of that throughput. The chart shows that 40% of the time the throughput will be 90%. Other outcomes appear as well, such as a 20% chance of the throughput being 30%. What these results show is that 90% throughput is not guaranteed, but it is the most probable outcome of the model.
Next steps
- Add the rack redistributing and refreshing processes to the model.
- Add more details to energizing, cabling, and provisioning steps.
- Perform a sensitivity analysis on the model to understand the best values for its parameters.
After adding these elements to the model, the team can take the final step: extending the simulation horizon to one year and analyzing the results.
The case study was presented by Peter Lopez, Mohammad Shariatmadari, Marcin Starzyk, and Lakhwinder Singh, of Meta, at the AnyLogic Conference 2022.
The slides are available as a PDF.