The use of AI and machine learning in running a data center is gaining popularity.

Early fault detection and rectification, enabling a data center to run with zero downtime, is becoming increasingly critical for businesses to remain relevant in the big-data world on which we have grown to depend.

The average cost per minute of unplanned data center downtime is $9,000, up a staggering 61 percent from $5,600 per minute in 2010, according to a benchmark study done by the Ponemon Institute in 2015. This result is derived from the benchmark analysis of 63 US-based data centers from 16 industries. E-commerce and financial services are the two largest industry segments representing 15 percent and 13 percent of the benchmark sample, respectively [1].

Unplanned data center outages can be very costly for businesses in today’s digital world. An outage causes not only direct financial losses, such as recovery costs, lost productivity and lost revenue, but also consequential business disruptions, including reputational damage, customer churn and lost business opportunities.

A tier four data center is often described as fault-tolerant: it must have two parallel power and cooling systems with no single point of failure (a configuration known as 2N). Building or co-locating in a tier four center is hardly cost-effective for most companies, and the jump from tier three to tier four provides only a marginal gain in availability. However, infrastructure isn’t the only factor that plays into availability, and all data center owners/operators can see an improvement in uptime by going a step beyond fault tolerance and practicing “fault avoidance.” Without fault avoidance, tier numbers mean little in terms of availability.

In 2018, the Uptime Institute addressed the concept of fault avoidance at their Executive Symposium in a session titled “Beyond Fault Tolerance.” Rather than reacting to a problem once it has occurred, fault avoidance focuses on preventing those problems from occurring in the first place. Many complications that lead to data center downtime can be prevented with equipment and systems monitoring, formalized staff training, thorough procedures, and regular maintenance [2].

To practice fault avoidance, operators should adhere to the recommended guidelines proposed by Original Equipment Manufacturers (OEMs) and ensure equipment is properly maintained. Instead of reactively repairing equipment, careful operators use preventative and predictive maintenance to prevent incidents and downtime. Predictive maintenance involves monitoring equipment and interpreting the data to understand when a machine or component is likely to fail, while preventative maintenance involves less monitoring and planned maintenance at regular intervals. Vigilant operators use a combination of both, as some components cannot be monitored, and following a maintenance schedule does not guarantee that nothing will fail.
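As a minimal sketch of how the two strategies can be combined in software, a maintenance planner might check both a calendar trigger (preventative) and a condition trigger (predictive). The interval, alarm threshold and sensor values below are illustrative assumptions, not OEM figures:

```python
from datetime import date, timedelta

# Hypothetical thresholds -- real values come from OEM guidelines.
PREVENTIVE_INTERVAL = timedelta(days=90)   # fixed-interval servicing
VIBRATION_LIMIT_MM_S = 7.1                 # condition-based alarm level

def needs_maintenance(last_service, today, vibration_mm_s=None):
    """Combine preventative (calendar) and predictive (condition) triggers."""
    if vibration_mm_s is not None and vibration_mm_s > VIBRATION_LIMIT_MM_S:
        return "predictive: vibration above alarm level"
    if today - last_service >= PREVENTIVE_INTERVAL:
        return "preventive: scheduled interval elapsed"
    return None  # no action needed yet

print(needs_maintenance(date(2023, 1, 1), date(2023, 2, 1), vibration_mm_s=9.0))
```

A component with no sensor coverage falls back to the calendar rule alone, which mirrors why vigilant operators keep both strategies in play.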

But predictive maintenance will only reach its full potential if the data center is fully digitalized, with an effective platform that delivers end-to-end visibility across all stages of its lifecycle - including the center’s design, construction, commissioning, operation, service provision and performance assurance.

The uptime and operational efficiency of a data center depend on a wide range of tightly controlled environmental factors, including temperature, humidity, airflow, light, sound, door position and power, to name just a few. With data being produced by all kinds of instrumentation systems - electrical power management systems (EPMS), data center infrastructure management systems (DCIMs), branch circuit monitoring systems (BCMs), environmental monitoring systems, building management system (BMS) sensors, the BMS itself, and more - facility operators can suffer ‘snow blindness’ from the information overload. Without an intelligent platform that collates historical patterns and real-time data from these systems, analyzes them to generate insights, and makes intelligent predictions on which operators can act, this overload can drive up the already high incident rate triggered by human error.
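To illustrate the idea of collating historical patterns with a live reading, a minimal anomaly check might compare the current sensor value against its historical baseline. The readings and the z-score limit below are illustrative assumptions:

```python
import statistics

# Hypothetical historical readings from an environmental monitoring system (deg C).
history = [22.1, 22.3, 21.9, 22.0, 22.2, 22.4, 22.1, 22.0]

def is_anomalous(reading, baseline, z_limit=3.0):
    """Flag a live reading that deviates strongly from the historical pattern."""
    mean = statistics.mean(baseline)
    stdev = statistics.stdev(baseline)
    return abs(reading - mean) > z_limit * stdev

print(is_anomalous(26.5, history))  # a sudden hot-spot reading
```

A production platform would of course fuse many streams (EPMS, BMS, DCIM) and use richer models, but the principle, baseline plus deviation, is the same.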

The scope of Data Center Infrastructure Management (DCIM) is ever expanding to respond faster and more broadly. It has gone beyond human factors and hardware and advanced into AI and machine learning to incorporate intelligence.

A Gartner report predicts that by 2020, more than 30 percent of data centers that fail to implement AI and Machine Learning will cease to be operationally and economically viable. In other words, data centers will have to embrace these technologies if they want to continue operating in the future [3].

Across the globe, data science and AI are influencing the design and development of modern data centers. With the amount of data surging every day, traditional data centers will eventually slow down and operate inefficiently. By utilizing AI in ingenious ways, data center operators can drive efficiencies up and costs down. A fitting example of this is the tier-two automated control system implemented at Google to cool its data centers autonomously. The system makes all the cooling-plant tweaks on its own, continuously and in real time, thus contributing to the plant’s energy savings [4].

Data centers are designed to support business requirements at a particular moment in time, but in reality business goals change, technology evolves, and new regulatory and compliance frameworks are introduced over time at a seemingly ever-increasing rate. This means that although the mechanical and engineering architecture of the data center may not change much during its lifecycle, the IT configuration within it keeps changing to adapt to the evolving business environment. Consequently, data center operators are put in a position to make alterations to a live environment, often without the ability to accurately predict how the facility will react. This challenge poses a serious risk; the wrong decision could inhibit business processes or, in a worst-case scenario, even lead to failure.

However, most current applications of machine learning in data centers are at the initial data-processing stage, showing operators what exactly is going on in their facilities. Some DCIM systems can use tools like computational fluid dynamics on the thermal side to make certain kinds of forecasts, but DCIM systems offer little in the way of data-driven decision-making capabilities, and in particular none that allow operators to predict, visualize and quantify the impact of a change in the data center prior to implementation.

Data center Digital Twins with AI and machine learning empower you to make informed decisions on changes prior to implementation and to autonomously adjust your data center to its optimal operating condition at all times.

Most modern data centers incorporated some kind of engineering model at the design stage and extended its use into construction, and some even into operation. By the operation stage, these data centers very likely already have a Digital Twin, a 3-D virtual replica of the data center covering the entire facility’s infrastructure and IT equipment, in place for operators to deploy as a management tool. The Digital Twin, integrated with management tools such as DCIM to capture live data on the current state of the actual data center, can then be used to simulate, predict, visualize and quantify the impact of any change prior to implementation, mitigating the risk of disruption or failure. It can also be used to test “what-if” scenarios: What if a cooling unit fails? What if the power supply fails? What if the IT equipment overheats in a certain section? Operators can run “what-if” scenarios on any element of the design to test layouts and analyze the implications of different power, loading and cooling scenarios.
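As a highly simplified illustration of one such “what-if” check (a real digital twin would use the full 3-D model and CFD simulation; the capacities and load below are invented):

```python
# Hypothetical N+1 cooling plant: three CRAC units serving one IT load.
COOLING_UNITS_KW = [300, 300, 300]   # installed unit capacities
IT_LOAD_KW = 520                     # current IT heat load

def survives_failure(failed_unit):
    """What-if: does the remaining cooling capacity still cover the IT load?"""
    remaining = sum(c for i, c in enumerate(COOLING_UNITS_KW) if i != failed_unit)
    return remaining >= IT_LOAD_KW

for i in range(len(COOLING_UNITS_KW)):
    print(f"Unit {i} fails -> load still covered: {survives_failure(i)}")
```

The value of the twin is running this kind of question against the live, captured state of the facility rather than against design-day assumptions.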

With real-time data available from all the major components in the ecosystem and a machine learning algorithm in place, the data center could learn the optimum temperatures at different times of day and at different IT utilization levels and automatically adjust the cooling systems accordingly. Furthermore, it would keep collecting data and continually refine the algorithm to make it more effective over time.
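A minimal sketch of such a learning loop might keep an online average of good setpoints keyed by hour of day and IT utilization band; the setpoints and band width below are illustrative assumptions, not real operating data:

```python
from collections import defaultdict

# (hour-of-day, utilization band) -> [sum of observed good setpoints, count]
sums = defaultdict(lambda: [0.0, 0])

def record(hour, utilization, good_setpoint_c):
    """Learn from an observed setpoint that kept conditions within limits."""
    band = int(utilization * 10)       # 10% utilization bands
    entry = sums[(hour, band)]
    entry[0] += good_setpoint_c
    entry[1] += 1

def suggest(hour, utilization, default_c=22.0):
    """Suggest a setpoint; the estimate refines as more data arrives."""
    band = int(utilization * 10)
    total, n = sums[(hour, band)]
    return total / n if n else default_c

record(14, 0.75, 21.0)
record(14, 0.78, 21.4)
print(suggest(14, 0.76))
```

The incremental update is the “continually refine” step: every new observation shifts the stored average, so the suggestion tracks seasonal and workload drift over time.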

A predictive maintenance system such as a CMMS (Computerized Maintenance Management System) can be integrated with an AI-powered self-driving robot that employs non-intrusive testing techniques (thermodynamics, acoustics, vibration analysis, infrared analysis, etc.) to patrol and scan critical systems and detect anomalies and failure patterns. Such a system can help the data center accurately predict whether a failure is likely within the next n steps, or how much time is left before the next failure, for critical equipment and systems, and provide early warnings. These warnings enable efficient maintenance, with the advantages of controlling repair costs, avoiding warranty costs for failure recovery, reducing unplanned downtime, and eliminating the causes of failure.
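The “how much time is left before the next failure” question can be sketched, under strong simplifying assumptions, by extrapolating a degradation signal toward an alarm limit with a least-squares trend. The readings and limit below are invented for illustration:

```python
def steps_to_failure(readings, limit):
    """Fit a least-squares slope to recent readings and extrapolate to the limit.

    Returns the estimated number of future steps before `limit` is crossed,
    or None if no upward degradation trend is detected.
    """
    n = len(readings)
    xs = range(n)
    x_mean = sum(xs) / n
    y_mean = sum(readings) / n
    denom = sum((x - x_mean) ** 2 for x in xs)
    slope = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, readings)) / denom
    if slope <= 0:
        return None                        # signal is stable or improving
    return (limit - readings[-1]) / slope  # steps until the limit is crossed

vibration = [2.0, 2.2, 2.5, 2.7, 3.0]      # mm/s, trending upward
print(steps_to_failure(vibration, limit=7.1))
```

Real remaining-useful-life models use far richer features and learned failure patterns, but the early-warning principle, trend now, alarm later, is the same.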

At Exyte, “Virtual Design & Construction” is a common practice. We design a data center using BIM to build up the 3D model, often at “LOD500” level of detail, which can then be used to visualize any design clashes among the engineering disciplines and, with various simulation tools, to analyze process flow, energy usage, equipment placement and more for further value engineering prior to construction, ensuring the design is optimized for constructability, efficiency, operation and maintenance. Based on the project requirements, the model can be developed further into 4D, 5D, 6D and 7D, adding schedule, cost, testing & commissioning, and operation & management information respectively, to deliver to the client a viable digital twin of the data center ready for AI and machine learning deployment. Furthermore, Exyte’s cloud-based project delivery ecosystem seamlessly integrates the model with EHS, project control, quality management, procurement management, testing & commissioning and document control systems, ensuring that all processes are fully digitalized and that project requirements and design quality are sustained throughout the whole project lifecycle.

Exyte Virtual Design & Construction 

Contact for this topic:

Walter Wong
DTC BU Director
Exyte China