# 1 Summary/Abstract

Rising inflation and the worsening effects of climate change are dramatically increasing the financial impacts of disasters. Within the United States, the Federal Emergency Management Agency (FEMA) is responsible for providing federal resources to support states and territories during the response phase as well as for tracking the financial costs. All tiers of government would benefit from knowing which disaster declaration characteristics change the amount of financial support from FEMA. This analysis utilizes machine learning methods to predict the funds requested by states or territories from FEMA, as well as the funds eventually obligated by FEMA to states or territories, for a disaster declaration. In particular, three models are considered: decision tree, bagged tree, and random forest. Cross-validation and bootstrap re-sampling structures are utilized, depending on the model, and the fits of the models are evaluated using the root mean square error (RMSE) statistic. The bagged tree model employing bootstrap re-sampling best fit both the requested and obligated fund models; however, future studies should consider other machine learning models as well as further refinement of the tuning parameters for the ones considered in this analysis. The prediction capabilities created by such models have the potential to dramatically improve comprehension of the key factors affecting financial disaster relief support.

# 2 Introduction

On November 23, 1988, United States (US) President Ronald Reagan signed the Robert T. Stafford Disaster Relief and Emergency Assistance Act into law, which formalized the disaster declaration process and specified the response responsibilities for each tier of government (Moss, Schellhamer, & Berman, 2009). Under this law, when a disaster occurs, the affected area (local or state representatives) conducts a damage assessment to estimate the amount of support required from the federal government. That assessment goes through the state government to the Federal Emergency Management Agency (FEMA), then to the Department of Homeland Security (DHS), and finally to the US President, who ultimately decides whether the affected region should receive an emergency declaration, a major disaster declaration, or neither (McCarthy, 2009). FEMA then deploys resources from across the federal government to support the disaster region, the volume and type of which depend upon the declaration type. However, the federal government does not necessarily incur all of the cost of these resources. Under most circumstances, this pay structure more accurately reflects a reimbursement model; that is, the states or territories submit requests to FEMA for reimbursement after the expenses have been incurred. There is then a reconciliation process by which the total funds a state or territory receives often differ from the original request. This process, explained in greater detail below, involves numerous parameters that are not always obvious to non-FEMA representatives. As such, this analysis endeavors to discern the key predictors for receiving federal funding relief under a disaster declaration.

## 2.1 Disaster Declaration Process

When a disaster occurs, the onus for requesting assistance begins at the local government. The local officials must have enacted their own emergency operations plans and demonstrate the need for more resources than the local entities can provide. Once this threshold is crossed, the municipal or county leader submits a formal request to the state governor for assistance by requesting a state disaster declaration (McCarthy, 2009). If the governor approves such a request, state resources are mobilized to assist the affected area, but the local government is still the primary leading entity for the emergency management response. If the state has fully enacted its emergency response plans and further assistance is still required, the governor requests the state emergency management agency to conduct a state-level Preliminary Damage Assessment (PDA). These assessments aim to estimate the impact of the disaster on individuals, infrastructure, and government agencies (Moss et al., 2009). As federal resources are extraordinarily expensive, state governors utilize PDAs to ensure federal assistance is only requested when absolutely necessary. If the governor does decide that federal resources are required, a formal request is submitted by the governor's office through the regional FEMA office. This request includes the PDA, barring a few exceptional circumstances for catastrophic events, and a list of requested federal resources, including the anticipated duration and cost of each resource (Moss et al., 2009). The regional FEMA office ensures the request is complete before sending it to a special declaration processing unit in FEMA headquarters, which further adds to the request by incorporating detailed statistics about the region and its own assessment of needs and costs (McCarthy, 2009). The declaration processing unit then submits the request to the FEMA Administrator, who must make a recommendation to the President.
The request and the recommendation are then submitted to the President, who ultimately decides whether or not to grant the declaration request. The President is not bound by the FEMA Administrator’s recommendation, and, in the event the President denies the request, the governor is permitted to submit an appeal based on further damage assessments (Moss et al., 2009). If the request is approved, FEMA becomes the primary agency to coordinate resource allocation and support the local response. The approval from the President also specifies which programs are authorized as well as the federal/state cost-sharing ratio (FEMA, 2021).

## 2.2 Newly Available Data

The federal government funding process was further refined as part of the Sandy Recovery Improvement Act of 2013 as well as the Disaster Recovery Reform Act of 2018 (FEMA, 2021). These laws require FEMA to publicly provide information on the federal government resources paid for by state and federal funds for each disaster declaration. As a result, the OpenFEMA Data Sets website now exists and provides information summarizing disaster declarations and mission assignments. While these acts have improved the accessible information, little has been done to draw conclusions from these data for predicting the financial impacts of disasters. Due to the federal legislative process, these datasets have been publicly available for less than two years, during which the preponderance of the emergency management community was dedicated to managing the COVID-19 pandemic. To date, there is one published paper using predictive modeling to examine disaster declarations; however, it is published only in Spanish, and the present authors do not have a reliable means of translation (Araujo Pérez, 2019). Based on an approximate translation, its authors used logistic regression to predict whether or not a presidential declaration would be awarded in the state of Maryland, using Maryland data from 2003 to 2017 (Araujo Pérez, 2019). Given this newly available federal data, there are numerous potential predictive models to be considered.

It would greatly benefit the federal government to be able to predict the amount of funding required for a disaster based on parameters known at the onset of the incident, as it would allow for improved funds flow and budgeting practices. Moreover, it would benefit states and territories to know which parameters increase the amount of federal government support in the aftermath of a disaster. Modeling approaches that include machine learning algorithms have been demonstrated to have extensive prediction capability for large datasets. As such, this analysis will examine the use of machine learning models in predicting FEMA expenses for a federally-declared disaster. In particular, a decision tree, bagged tree, and random forest model will be used to predict requested and obligated FEMA funds.

# 3 Materials and Methods

## 3.1 Data and Processing

The data were obtained from the OpenFEMA Data Sets website (OpenFEMA Data Sets FEMA.gov, n.d.) as well as the US Department of Agriculture Economic Research Service website (USDA ERS - County-level Data Sets, n.d.). The mission assignments data were cleaned to represent the requested and obligated FEMA funds for each disaster declaration for each state or territory. Summary variables were created for the total number of federal agencies involved in resource assistance, the amount of funding requested by the state or territory from FEMA, the amount of funding allocated to the state or territory from FEMA, and the average FEMA cost share for the resources provided. Similarly, the disaster declaration summaries were also summarized to represent the unique disaster declarations for each state, bounded to the same time frame as the mission assignments data. Feature engineering was performed to create variables for incident duration, response duration, month of declaration, and year of the incident. The population data from the US Department of Agriculture Economic Research Service were incorporated into the disaster summary data after creating a variable for state. The mission assignment data and declaration data were then joined to create a comprehensive dataset that was used for analysis. At this point in the analysis, Google developer data (Developer, n.d.) was added to specify the coordinates of each state, and the outcomes of interest were subset to include only circumstances in which the federal government provided funding (i.e., only positive requested or obligated amounts).
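
The feature engineering step can be illustrated as follows. Note that the actual analysis was carried out in R; this is a Python sketch, and the field names below are hypothetical, not the actual OpenFEMA column names.

```python
from datetime import date

# Hypothetical record mirroring the date fields in the joined data;
# the keys are illustrative, not actual OpenFEMA column names.
declaration = {
    "incident_begin": date(2017, 8, 25),
    "incident_end": date(2017, 9, 15),
    "declaration_date": date(2017, 8, 25),
    "close_out_date": date(2019, 3, 1),
}

def engineer_features(rec):
    """Derive the duration and calendar variables described in the text."""
    return {
        "incident_duration_days": (rec["incident_end"] - rec["incident_begin"]).days,
        "response_duration_days": (rec["close_out_date"] - rec["declaration_date"]).days,
        "declaration_month": rec["declaration_date"].month,
        "incident_year": rec["incident_begin"].year,
    }
```

Deriving durations from date pairs, rather than using the raw dates directly, gives the models numeric predictors that are comparable across declarations.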

Prior to fitting the models, the data were subset based on the outcome of interest, and the least frequent disaster types were combined into an "Other" category due to the limited number of observations in those categories. The outcomes of interest were also log-transformed to better represent the distribution of disaster relief funding. At this point, three outliers were removed from the analysis, as each had a negative log-transformed outcome value.
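
The transform and outlier screen amount to the following (a minimal Python sketch with invented dollar amounts; the analysis itself was done in R): any observation whose outcome falls below $1 has a negative natural log and is dropped.

```python
import math

# Illustrative requested amounts in dollars (invented values).
# Amounts below $1 produce a negative natural log.
requested = [2.5e6, 4.0e7, 0.5, 1.2e5, 0.8, 0.25]

log_requested = [math.log(x) for x in requested]

# Keep only observations with a non-negative log-transformed outcome,
# mirroring the outlier-removal rule described in the text.
kept = [v for v in log_requested if v >= 0]
```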

## 3.2 Parameters Considered

Ultimately, the following parameters were considered in the analysis:

• state, along with its longitude/latitude, population, number of counties, and the FEMA Region to which the state belonged;

• declaration type;

• month and year of the first date of the event;

• duration of the incident and duration of the response efforts;

• whether the Individual/Household, Public Assistance, and Hazard Mitigation Programs were awarded to any county in the state, as well as the percent of counties receiving each program;

• number of federal agencies involved in the response; and

• the average FEMA cost share for the resources provided to the state under the declaration.

## 3.3 Model Development

Four models were considered for the prediction analysis, and each was fit for both outcomes of interest (i.e., requested and obligated funds). The data were split such that 66.7% were used for training the models and the remaining 33.3% were used for testing the final chosen model. Spatial cross-validation was used for all models, using a five-fold resampling structure. The tidymodels framework utilized in the analysis does not support a repeated spatial cross-validation resampling structure, as is standard in typical cross-validation.
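
The split and fold mechanics can be sketched as follows (a plain five-fold partition, in Python for illustration; the actual analysis used spatial cross-validation in tidymodels, which additionally assigns geographically close observations to the same fold).

```python
import random

random.seed(42)

# Hypothetical number of observations, for illustration only.
n = 300
indices = list(range(n))
random.shuffle(indices)

# 66.7% / 33.3% train-test split, as described in the text.
cut = round(n * 2 / 3)
train, test = indices[:cut], indices[cut:]

# Five disjoint folds over the training set; each fold serves once as
# the held-out assessment set during cross-validation.
folds = [train[i::5] for i in range(5)]
```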

## 3.4 Model Definition

There are numerous machine learning models that could be applied to the research question, but this analysis focuses primarily on two different types of models: tree-based and regularization-based. The tree-based models were chosen as the primary users of the prediction algorithms are non-statisticians, and such models are among the easiest machine learning models to interpret and understand (James, Witten, Hastie, & Tibshirani, 2021). However, to robustly consider the machine learning methodology, two regularization-based methods were also included in the analysis. Broadly, regularization allows variables to be included in a model with less weight than other variables by reducing the value of the variable’s coefficient (Hastie, Tibshirani, & Friedman, 2016).

• Decision Tree (DT): Also known as a classification and regression tree (CART), the DT examines each predictor and splits the data at the predictor value that yields the greatest improvement in model performance. The tree continues to grow until a threshold criterion is met, such as a minimum number of observations in each leaf of the tree. A DT model is often the most intuitive machine learning algorithm, but it also typically underperforms other models (Boehmke & Greenwell, 2019).

• Random Forest (RF): A slightly more sophisticated model, an RF model aims to reduce variance by building a tree for each re-sampling of the existing data. However, instead of considering all possible predictors at each decision tree split, the split is chosen from a random sample of the predictors. This de-correlates the trees by preventing a small set of strong predictors from dominating every split. The predictions of the individual trees are then averaged to form the final model. RF models are also more difficult to interpret than standard decision tree models (Breiman, 2001).

• Least Absolute Shrinkage and Selection Operator (LASSO): LASSO balances goodness of fit against a penalty on the coefficients, calculated from their absolute values. This methodology allows coefficients to shrink exactly to zero and thus be dropped from the model. The primary disadvantage of the LASSO model is that, among a group of highly correlated variables, it tends to select just one and ignore the rest entirely (Roth, 2004).

• Elastic Net (EN): Of particular use when predictor variables are correlated, an EN model extends a LASSO model by incorporating a mixture parameter in addition to the overall weight given to the penalty. The mixture parameter determines how the penalty is distributed between the LASSO (absolute value) component and a second, ridge-style regularization component calculated by squaring the coefficient magnitudes; under the ridge component, all coefficients are shrunk by the same factor (Zou & Hastie, 2005).
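
The two regularization penalties above can be written compactly. With $\lambda$ the overall penalty weight and $\alpha$ the elastic net mixture parameter (the exact scaling of the quadratic term varies across software implementations):

$$\text{LASSO:} \quad \min_{\beta} \; \sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Big)^2 + \lambda \sum_{j=1}^{p} |\beta_j|$$

$$\text{Elastic Net:} \quad \min_{\beta} \; \sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Big)^2 + \lambda \Big[\alpha \sum_{j=1}^{p} |\beta_j| + (1-\alpha)\sum_{j=1}^{p} \beta_j^2\Big]$$

Setting $\alpha = 1$ recovers the LASSO penalty, while $\alpha = 0$ yields a pure ridge penalty.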

## 3.5 Model Performance Evaluation

The standard metric for evaluating regression models is the root mean square error (RMSE), which is the square root of the mean of the squared errors. It is formally defined as $RMSE = \sqrt{\frac {1}{n} \sum_{i = 1}^{n}{(S_i - O_i)^2}}$, where $O_i$ are the observed values, $S_i$ are the predicted values of the outcome, and $n$ is the number of observations in the analysis. RMSE is scale-dependent, but the lower the RMSE, the better the model performance.
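
For reference, a direct implementation of this formula (a minimal Python sketch):

```python
import math

def rmse(observed, predicted):
    """Root mean square error between paired observed and predicted values."""
    n = len(observed)
    return math.sqrt(sum((s - o) ** 2 for o, s in zip(observed, predicted)) / n)
```

Because RMSE squares each error before averaging, large misses are penalized disproportionately, which is appropriate here given the heavy right tail of disaster funding amounts.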

## 3.6 Software

This analysis was conducted in R 4.1, using RStudio 1.4 on a Windows Server operating system (Team, 2021). The following R packages were utilized: here (Müller, 2020), tidyverse (Wickham et al., 2019), tidymodels (Kuhn & Wickham, 2020), skimr (Waring et al., 2021), spatialsample (Silge, 2021), broom.mixed (Bolker & Robinson, 2021), rpart.plot (Milborrow, 2021), vip (Greenwell & Boehmke, 2020), doParallel (Corporation & Weston, 2020), ranger (Wright & Ziegler, 2017), viridis (Garnier et al., 2021), maps (Becker, Brownrigg, Minka, & Deckmyn, 2021), table1 (Rich, 2021), cowplot (Wilke, 2020), scales (Wickham & Seidel, 2020), gtsummary (Sjoberg, Whiting, Curry, Lavery, & Larmarange, 2021), car (Fox & Weisberg, 2019), ggplot2 (Wickham, 2016), and summarytools (Comtois, 2021). All processing and analysis code can be found in the supplementary materials.

# 4 Results

## 4.1 Outcomes of Interest

A total of 3,266 disaster declarations were included in this analysis, ranging from February 2012 to November 2021. On average, $40,600,000 (SD = $255,000,000) was requested in FEMA funding, and $40,700,000 (SD = $256,000,000) was obligated in FEMA funding. A total of 51 states and territories were represented in the analysis, and, on average, the FEMA cost share was 96.4% (SD = 5.73%). Figure 4.1 shows the nearly identical density distributions of the two outcomes of interest, which is further confirmed by the correlation plot between the two shown in Figure 4.2.