
What Makes Power Outages Last?

by Andre Gala-Garza (asgala@umich.edu)

Introduction

Power outages consistently harm the health of communities. The severity of a particular outage is often proportional to how long it lasts, since a longer outage increases the risk of food spoiling and water becoming contaminated. One useful way to alleviate these setbacks is to estimate how long an outage might last, based on information known at the time it begins. I aim to accomplish this goal in this investigation by analyzing a dataset of major power outages in the continental U.S. from January 2000 to July 2016.

First, I will clean the dataset and conduct exploratory data analysis to obtain a basic understanding of the information the dataset provides. Next, I will explore the following motivating research question: What characteristics lead to power outages with longer durations? Specifically, I plan to use my exploratory analysis to reveal key relationships between the duration of a power outage and other features of the dataset, then create a model that predicts the duration of a specific outage given features that are known when it begins. This type of prediction would help households and businesses make preparations proportional to an outage's expected length, and it is especially important in high-stakes settings such as hospitals, where the backup power for critical medical devices must be managed carefully.

The data used for this investigation was accessed from Purdue University’s Laboratory for Advancing Sustainable Critical Infrastructure. A data dictionary is available at the ScienceDirect article “Data on major power outage events in the continental U.S.”, under Table 1. Variable descriptions.

These are the features of the data which I will focus on in my investigation:

| Column | Description |
| --- | --- |
| YEAR | Indicates the year when the outage event occurred |
| MONTH | Indicates the month when the outage event occurred |
| POSTAL.CODE | Represents the postal code of the U.S. state |
| NERC.REGION | The North American Electric Reliability Corporation (NERC) region involved in the outage event |
| CLIMATE.REGION | U.S. climate region as specified by the National Centers for Environmental Information (nine climatically consistent regions) |
| ANOMALY.LEVEL | Oceanic Niño Index (ONI), referring to cold (La Niña) and warm (El Niño) episodes by season, calculated as a 3-month running mean |
| CLIMATE.CATEGORY | Climate episode corresponding to the year, categorized as "Warm", "Cold", or "Normal" based on the Oceanic Niño Index (ONI) |
| CAUSE.CATEGORY | Category of the event causing the major power outage |
| OUTAGE.DURATION | Duration of the outage event (in minutes) |
| DEMAND.LOSS.MW | Amount of peak demand lost during the outage event (in megawatts) |
| CUSTOMERS.AFFECTED | Number of customers affected by the power outage event |
| TOTAL.PRICE | Average monthly electricity price in the U.S. state (cents/kilowatt-hour) |
| TOTAL.CUSTOMERS | Annual number of total customers served in the U.S. state |
| OUTAGE.START.DATE, OUTAGE.START.TIME | Day and time of the year when the outage event started, as reported by the corresponding utility in the region |
| OUTAGE.RESTORATION.DATE, OUTAGE.RESTORATION.TIME | Day and time of the year when power was restored to all customers, as reported by the corresponding utility in the region |

Data Cleaning and Exploratory Data Analysis

Data Cleaning

The original data was saved as a Microsoft Excel spreadsheet. I removed the first five rows, which were blank aside from a general description of the data, and the seventh row, which only described the units of some of the features. The sixth row became the header row. I also removed the first column, which was not part of the data itself.

There were a total of 1,534 power outage entries in the original dataset, which I loaded into a DataFrame using the Python library pandas. The DataFrame had 56 unique columns (features). To clean the data, I then performed a series of operations on this original DataFrame.
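The spreadsheet-trimming steps described above can be sketched as follows. The file name "outage.xlsx" is an assumption based on the original Purdue spreadsheet; the helper below only encodes the row/column drops named in the text.

```python
import pandas as pd

def clean_outage_sheet(raw: pd.DataFrame) -> pd.DataFrame:
    """Drop the units row (original row 7) and the non-data first column,
    as described above, then reset the index."""
    cleaned = raw.drop(index=raw.index[0])          # units row
    cleaned = cleaned.drop(columns=raw.columns[0])  # non-data first column
    return cleaned.reset_index(drop=True)

# With the real spreadsheet, one would first skip the five blank/description
# rows so that row 6 becomes the header, e.g.:
#   raw = pd.read_excel("outage.xlsx", skiprows=5)
# (the file name "outage.xlsx" is an assumption).
```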

Univariate Analysis

For my analysis, I begin by exploring some univariate relationships within individual columns of the dataset. First, I plot a bar graph of the number of power outages for each type of major cause of an outage:

Next, I plot a histogram of the distribution of outages by the total number of customers that were affected:

Interestingly, the distribution appears roughly normal up to the bin starting at 6 million affected customers; beyond that point, the normality assumption can no longer reasonably hold.

I also plot the distribution of outages by the El Niño/La Niña anomaly level that was reported during the outage:

Bivariate Analysis

I continue with bivariate analyses to compare relationships between two variables. First, I draw a line plot of the average outage duration for each month in the dataset. Red lines correspond to odd-numbered years, while blue lines correspond to even-numbered years.

I then group the outages by the state in which they occurred to produce a choropleth of the average outage duration by state:

This plot reveals that Wisconsin has the highest average outage duration, with New York and West Virginia following behind. Notably, many states with high outage durations on average are in either the Midwest or Northeast region of the United States.

Next, I make several scatter plots that compare outage duration with other variables in the dataset. First, I plot outage duration against two discrete features: the month in which an outage occurred and the major cause of an outage, respectively.

Next, I plot outage duration against two continuous features: the amount of peak demand (in Megawatts) lost during an outage and the El Niño/La Niña anomaly Level, respectively.

Based on aggregate information from the grouped tables seen below, I was also interested in the relationship between a state's climate region and an outage's duration. The following series of box plots shows this relationship, revealing that East North Central and Northeast states tend to have longer outage durations on average, while West North Central states have much shorter ones.

Grouping and Aggregates

I constructed a table of the outages in the dataset grouped by CLIMATE.REGION and applied the mean() aggregation to find the average severity metrics for each region. Here are all of the rows in the grouped table:

| CLIMATE.REGION | OUTAGE_DURATION_MEAN | CUSTOMERS_AFFECTED_MEAN | DEMAND_LOSS_MW_MEAN |
| --- | --- | --- | --- |
| Central | 2882.21 | 144424 | 595.682 |
| East North Central | 5391.4 | 149816 | 633.902 |
| Northeast | 3330.52 | 177848 | 1013.4 |
| Northwest | 1536.36 | 136768 | 336.036 |
| South | 2872.45 | 212031 | 475.686 |
| Southeast | 2247.66 | 202705 | 865.318 |
| Southwest | 1621.41 | 66121.1 | 903.25 |
| West | 1636.31 | 217946 | 711.566 |
| West North Central | 796.071 | 66242.4 | 251.333 |
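Grouped tables like the one above come from a pandas groupby-mean. Here is a sketch using a toy stand-in DataFrame (the values below are illustrative, chosen so the Central row reproduces the 2882.21 average from the table):

```python
import pandas as pd

# Toy stand-in for the cleaned outage DataFrame (values illustrative only).
df = pd.DataFrame({
    "CLIMATE.REGION": ["Central", "Central", "Northeast", "Northeast"],
    "OUTAGE.DURATION": [3000.0, 2764.42, 3000.0, 3661.04],
    "CUSTOMERS.AFFECTED": [150000.0, 138848.0, 170000.0, 185696.0],
    "DEMAND.LOSS.MW": [600.0, 591.364, 1000.0, 1026.8],
})

# Group by climate region and average each severity metric, mirroring
# the structure of the grouped table above.
severity_by_region = (
    df.groupby("CLIMATE.REGION")[
        ["OUTAGE.DURATION", "CUSTOMERS.AFFECTED", "DEMAND.LOSS.MW"]
    ].mean()
)
```

The same pattern with `"NERC.REGION"` as the grouping key produces the NERC table below.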

I also followed a similar procedure for the column NERC.REGION, to find the average severity metrics for each North American Electric Reliability Corporation (NERC) region. Here are all of the rows in the grouped table:

| NERC.REGION | OUTAGE_DURATION_MEAN | CUSTOMERS_AFFECTED_MEAN | DEMAND_LOSS_MW_MEAN |
| --- | --- | --- | --- |
| ECAR | 5603.31 | 260624 | 1394.48 |
| FRCC | 4271.12 | 385068 | 1108 |
| FRCC, SERC | 372 | nan | nan |
| MRO | 3001.81 | 107524 | 281.429 |
| NPCC | 3578.65 | 164358 | 1695.22 |
| RFC | 3767.79 | 157057 | 429.434 |
| SERC | 1765.29 | 117203 | 608.125 |
| SPP | 2783.57 | 231177 | 192.118 |
| TRE | 2988.24 | 253805 | 739.093 |
| WECC | 1578.27 | 182706 | 664.74 |

Imputations

In the column TOTAL.PRICE, which indicates the average monthly electricity price in the U.S. state corresponding to a certain outage, a total of 22 values were missing (NaN). I considered these values to be missing completely at random (MCAR), because there were so few of them and they were scattered throughout the original dataset. Since the distribution of the column appeared relatively normal, I felt that the observed values would be representative of the dataset as a whole. Therefore, I chose mean imputation, replacing the missing values with the mean TOTAL.PRICE across all outages.

This is what the distribution of TOTAL.PRICE looked like before imputation:

As there were only 22 missing values, the distribution appeared relatively unchanged after imputation:
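Mean imputation of this kind is a one-liner in pandas; a minimal sketch with illustrative values:

```python
import numpy as np
import pandas as pd

# Toy series standing in for TOTAL.PRICE (cents/kWh); values illustrative.
total_price = pd.Series([9.0, 10.0, np.nan, 11.0, np.nan])

# Mean imputation: fill each missing value with the mean of the
# observed values (pandas skips NaN when computing .mean()).
imputed = total_price.fillna(total_price.mean())
```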

Framing a Prediction Problem

My prediction problem is to predict the duration of a power outage; therefore, the response variable is OUTAGE.DURATION. This is a regression problem, since the model will predict the continuous quantity of how long a power outage lasts. I chose this response variable because when a household has a power outage, having an estimate of the time taken before recovery is crucial for making decisions such as whether to use a backup generator or continue storing perishable food.

The time of prediction would be immediately at the start of a power outage. Therefore, at prediction time, the following variables from the original dataset would be known: OUTAGE.START.DATE, OUTAGE.START.TIME, YEAR, MONTH, POSTAL.CODE, NERC.REGION, CLIMATE.REGION, ANOMALY.LEVEL, CAUSE.CATEGORY, CAUSE.CATEGORY.DETAIL, TOTAL.PRICE, TOTAL.CUSTOMERS.

Clearly, variables such as OUTAGE.RESTORATION.DATE and OUTAGE.RESTORATION.TIME are outcome variables not available at the start of an outage.

Regional economic and land-use characteristics from the original dataset, such as PC.REALGSP.STATE or POPULATION, are aggregated at a regional or state level, and thus do not provide specific information about the infrastructure or condition at the exact location of an outage. Therefore, I chose to exclude these variables from my model.

For my regression model, I will use the performance metric of root mean squared error (RMSE):

$$\text{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2}$$

I chose this metric because, like the mean squared error (MSE), it penalizes larger errors more than smaller ones due to the squaring term. This makes it a better metric than mean absolute error (MAE) here, because large prediction errors can have critical consequences: significantly underestimating a long outage, for example, may lead a household to keep food in the refrigerator only for it to spoil. However, I also wanted to balance this sensitivity to large errors with immediate interpretability. Since RMSE is expressed in the same units as the response variable (in this case, minutes of outage duration), it is ideal for my model.

I will additionally use the R^2 metric:

$$R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$$

This helps compare the predictive power of my baseline model with my final model, which adds more features to use for predictions.
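Both metrics are available in scikit-learn; a small sketch with illustrative values:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Toy actual vs. predicted outage durations in minutes (illustrative).
y_true = np.array([3000.0, 1200.0, 4500.0, 800.0])
y_pred = np.array([2800.0, 1500.0, 4000.0, 900.0])

rmse = np.sqrt(mean_squared_error(y_true, y_pred))  # same units as the target
r2 = r2_score(y_true, y_pred)                       # fraction of variance explained
```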

Baseline Model

My baseline model predicts the duration of an outage using features that are straightforward, immediately available, and broadly applicable across all outages.

The features I used in the baseline model are as follows: MONTH, ANOMALY.LEVEL, POSTAL.CODE, CAUSE.CATEGORY, CLIMATE.REGION.

In this model, the numerical features (MONTH, ANOMALY.LEVEL) are quantile-transformed using 500 quantiles, a value guaranteed to be below the 1,534 samples in the dataset. Meanwhile, categorical features (POSTAL.CODE, CAUSE.CATEGORY) are encoded using OneHotEncoder to convert nominal categories into dummy variables.
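A sketch of how such a baseline pipeline might be assembled in scikit-learn. The use of `QuantileTransformer` for the 500-quantile step and the choice of `RandomForestRegressor` as the estimator are assumptions for illustration; the source specifies only the quantile count and the one-hot encoding.

```python
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, QuantileTransformer

numeric = ["MONTH", "ANOMALY.LEVEL"]
categorical = ["POSTAL.CODE", "CAUSE.CATEGORY", "CLIMATE.REGION"]

# Quantile-transform numerical features (500 quantiles, per the text) and
# one-hot encode categorical features; unseen categories are ignored at
# prediction time rather than raising an error.
preprocess = ColumnTransformer([
    ("num", QuantileTransformer(n_quantiles=500), numeric),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
])

baseline = Pipeline([
    ("preprocess", preprocess),
    ("regressor", RandomForestRegressor(random_state=42)),
])
```

Calling `baseline.fit(X, y)` with a DataFrame containing the columns above trains the whole pipeline end to end.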

The RMSE of the baseline model is 6426.33, and its R^2 value is 0.133345. Here is a line plot comparing the baseline model’s predictions for outage durations and the actual durations:

The line for the baseline model’s predictions (in orange) deviates significantly from the actual durations (in blue). Therefore, we can conclude that the baseline model will generalize rather poorly to unseen data.

Final Model

My final model includes additional features that offer finer granularity to improve predictive accuracy.

The features I used in the final model are as follows: MONTH, ANOMALY.LEVEL, POSTAL.CODE, CAUSE.CATEGORY, CLIMATE.REGION, NERC.REGION, TOTAL.PRICE, TOTAL.CUSTOMERS.

I added NERC.REGION because the grouped tables above show that average outage duration varies widely across NERC regions, and TOTAL.PRICE and TOTAL.CUSTOMERS because they describe state-level electricity market conditions that are known at prediction time.

The modeling algorithm I chose was RandomForestRegressor. This was well suited to my purposes, since a RandomForestRegressor predicts a continuous target variable by averaging the outputs of multiple decision trees. It handles both numerical and categorical variables and captures non-linear relationships and interactions between features well. Finally, its averaging across multiple trees reduces the risk of overfitting compared to an individual decision tree.

Within my sklearn Pipeline, I also engineer two new features from the existing columns.

GridSearchCV performs an exhaustive search over a specified parameter grid to optimize RandomForestRegressor hyperparameters, such as the number of trees, the maximum tree depth, and the minimum number of samples required per split and per leaf.

These were the hyperparameters that ended up performing the best: n_estimators = 100, max_depth = None, min_samples_split = 5, min_samples_leaf = 2.
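A sketch of such a grid search is below. The winning hyperparameter values come from the text above, while the alternative grid values and the RMSE-based scoring choice are assumptions for illustration.

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# The values n_estimators=100, max_depth=None, min_samples_split=5, and
# min_samples_leaf=2 are the winners reported in the text; the other
# candidates are assumed alternatives for illustration.
param_grid = {
    "n_estimators": [100, 200],
    "max_depth": [None, 10],
    "min_samples_split": [2, 5],
    "min_samples_leaf": [1, 2],
}

search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid,
    scoring="neg_root_mean_squared_error",  # optimize RMSE directly
    cv=5,
)
```

When the estimator sits inside a Pipeline, each grid key would be prefixed with the step name, e.g. `"regressor__n_estimators"`.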

The RMSE of the final model evaluated to 6307.16, and its R^2 value is 0.165188. This is an improvement from the baseline model’s RMSE and R^2 of 6426.33 and 0.133345, respectively, since RMSE should be minimized and R^2 should be maximized.

Here is a line plot comparing the final model’s predictions for outage durations and the actual durations:

This line plot compares both models against each other, along with the actual durations:

From the above graph, we can see that the final model’s prediction for outage durations (in green) matches the actual durations (in blue) quite closely, when taking into account the large fluctuations made by the outage duration over time. Therefore, we can conclude that the final model will generalize relatively well to unseen data.