Code and data files required to reproduce this analysis are available on GitHub
ARCHIVE
folder.
Reproducing this project requires R, RStudio, and Microsoft Word. Files should be run in the following order.
In the code > processing_code folder:
In the code > analysis_code folder:
In the code > processing_code folder:
In the products folder:
Figure 3.1 shows a histogram of microplastic concentration observations. The minimum concentration is 16.67 particles/L and the maximum is 1193.33 particles/L. The mean concentration is 104.39 particles/L, and the median is 66.67 particles/L.
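The summary statistics above can be reproduced along the lines of the following minimal Python sketch. The project itself is run in R, so this is only an illustration of the arithmetic; the `concentrations` list and the `summarize` helper are hypothetical and do not contain the study's observations.

```python
import statistics

# Hypothetical concentrations in particles/L (NOT the study data).
concentrations = [16.67, 33.33, 50.0, 66.67, 83.33, 150.0, 400.0]


def summarize(values):
    """Return the minimum, maximum, mean, and median of a list of values."""
    return {
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
    }


summary = summarize(concentrations)
print(summary)
```

A mean well above the median, as in the reported values (104.39 vs 66.67 particles/L), is consistent with a right-skewed distribution in which a few high observations pull the mean upward.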
Microplastic concentrations remained in similar ranges throughout the study period. Figure 3.2 shows a boxplot of concentrations by sample date.
There is some seasonal variation in concentration at each individual site. Figure 3.3 shows a plot of concentrations at each site.
Microplastic levels are similar across the watersheds within the Upper Oconee, although some watersheds varied more than others. Figure 3.4 shows the microplastic concentrations by watershed.
Figure 3.5 shows a line graph of the mean watershed microplastic concentrations at each seasonal sampling date.
Population, land cover/use, and bacteria levels are hypothesized predictors of microplastic concentration. Figure 3.6 shows the relationship between microplastic concentration and population, and Figure 3.7 shows the relationship between microplastic concentration and bacteria levels (CFU/100 mL).
Figure 3.8 and Figure 3.9 show correlation matrices for the hypothesized predictors and for the different categories of land use.
Preliminary modeling shows no strong relationship between microplastic concentration and population level. Figure 3.10 shows a linear model fit.
Figure 3.11 shows a linear model of microplastic concentration vs CFU (both variables log-transformed).
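A log-log fit of this kind can be sketched in Python as follows. This is an illustration under stated assumptions, not the project's R code: the `cfu` and `particles` values are hypothetical, and `fit_line` is a hand-written ordinary least squares helper introduced only for this example.

```python
import math


def fit_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical paired observations (NOT the study data).
cfu = [10.0, 100.0, 1000.0, 10000.0]    # bacteria, CFU/100 mL
particles = [20.0, 40.0, 80.0, 160.0]   # microplastics, particles/L

# Log-transform both variables before fitting; values must be positive.
log_cfu = [math.log(v) for v in cfu]
log_particles = [math.log(v) for v in particles]

intercept, slope = fit_line(log_cfu, log_particles)
print(f"slope = {slope:.3f}, intercept = {intercept:.3f}")
```

On the log-log scale, the slope is read as the approximate percent change in concentration associated with a one percent change in CFU. Zero values would need separate handling before the transform, since the logarithm of zero is undefined.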
Figure 3.12 shows a linear model of particles/L vs turbidity.
Table 3.1 summarizes a linear model fit predicting particles/L from six numeric predictors and watershed (a categorical predictor, shown as one row per watershed).
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | -85.4155522 | 200.1731751 | -0.4267083 | 0.6708307 |
visual_score | 3.1501966 | 3.5503662 | 0.8872878 | 0.3777987 |
turbidity.ntu | 8.1849262 | 5.1982368 | 1.5745582 | 0.1196244 |
temperature.c | -0.9153529 | 3.7770358 | -0.2423469 | 0.8091818 |
e.coli.cfu | -0.0269844 | 0.0352879 | -0.7646913 | 0.4468871 |
population | -0.0019187 | 0.0108574 | -0.1767177 | 0.8602129 |
dist | 0.0033504 | 0.0079782 | 0.4199472 | 0.6757408 |
watershedBear Creek | -96.7602059 | 210.3266462 | -0.4600473 | 0.6468312 |
watershedBrooklyn Creek | -4.7845357 | 140.7570975 | -0.0339914 | 0.9729755 |
watershedCalls Creek | -17.2263069 | 97.7382122 | -0.1762495 | 0.8605794 |
watershedHunnicutt Creek | 67.1138455 | 115.7405074 | 0.5798648 | 0.5637672 |
watershedMcNutt Creek | -5.3292443 | 87.4009356 | -0.0609747 | 0.9515438 |
watershedMiddle Oconee River | -49.3718448 | 114.0909789 | -0.4327410 | 0.6664615 |
watershedNorth Oconee River | 104.2539248 | 116.1557777 | 0.8975354 | 0.3723442 |
watershedOconee River | -0.7654564 | 137.9000000 | -0.0055508 | 0.9955861 |
watershedSandy Creek | -60.1850641 | 203.0550068 | -0.2963978 | 0.7677565 |
watershedTanyard Creek | 77.6196199 | 158.0304559 | 0.4911687 | 0.6247607 |
watershedTrail Creek | 13.0256205 | 132.2887015 | 0.0984636 | 0.9218304 |
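In Table 3.1, each value in the `statistic` column matches the estimate divided by its standard error. The Python sketch below checks this for three rows copied from the table. The project itself uses R; the `rows` dictionary and the `t_statistic` helper are illustrative names, not part of the project code.

```python
# (estimate, std.error, reported statistic) copied from Table 3.1.
rows = {
    "visual_score": (3.1501966, 3.5503662, 0.8872878),
    "turbidity.ntu": (8.1849262, 5.1982368, 1.5745582),
    "e.coli.cfu": (-0.0269844, 0.0352879, -0.7646913),
}


def t_statistic(estimate, std_error):
    """Test statistic for a coefficient: estimate / standard error."""
    return estimate / std_error


for term, (estimate, std_error, reported) in rows.items():
    computed = t_statistic(estimate, std_error)
    print(f"{term}: computed {computed:.4f}, reported {reported:.4f}")
```

Every p-value in the table is above 0.05 (the smallest is 0.1196, for `turbidity.ntu`), so none of the coefficients is distinguishable from zero at that conventional threshold.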
Beyond the basic linear model, we applied additional methods to improve model performance: LASSO regularization, plus decision trees and random forests for model comparison. Figure 3.13 shows the predictions, outcomes, and residuals for each type of model.
Of the three models, LASSO is the best option for this dataset. However, its RMSE differs only minimally from that of the null model, so even though LASSO outperforms the other methods, it still does not predict microplastic concentration well.
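The comparison against a null model can be illustrated with a short Python sketch. It assumes the null model predicts the mean of the observed outcome for every sample, which is a common definition but is not confirmed here; all values are hypothetical and are not the project's results. How the project resamples or splits its data is not stated in this section, so the sketch evaluates both models on the same values for simplicity.

```python
import math
import statistics


def rmse(observed, predicted):
    """Root mean squared error between observed and predicted values."""
    squared_errors = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical observed and model-predicted concentrations (NOT study results).
observed = [20.0, 60.0, 70.0, 150.0, 200.0]
model_predictions = [70.0, 90.0, 95.0, 110.0, 130.0]

# Assumed null model: predict the mean observed outcome for every sample.
null_predictions = [statistics.mean(observed)] * len(observed)

rmse_model = rmse(observed, model_predictions)
rmse_null = rmse(observed, null_predictions)
print(f"model RMSE = {rmse_model:.2f}, null RMSE = {rmse_null:.2f}")
```

A model is only useful to the extent that its RMSE falls clearly below the null RMSE; when the two are close, the predictors add little beyond the overall mean.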
Figure 3.14 shows variable importance in the final selected LASSO model. None of the hypothesized predictors appears as an important variable in this model.