Predicting pipe failures in Oslo city's water distribution network
Abstract
Water distribution pipe failures cause significant disruptions to service and often damage the pipes' surroundings through, for example, flooding and erosion. However, as water distribution pipes are commonly buried underground, assessing their condition and risk of failure in order to prevent failures is challenging. In this thesis, different statistical, machine learning and hybrid models for predicting the risk of water distribution pipe failure have been applied to the water distribution pipes between 75 and 1200 mm in diameter in the city of Oslo, using data made available for this thesis by the City of Oslo Agency for Water and Wastewater Works. The data has been preprocessed both under the assumption that pipes are repaired as new and without this assumption. Additionally, feature scaling was applied to the data for the methods that use gradient descent. Missing numerical values were assigned the median, and missing categorical values were assigned to an "uncategorised" category.
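The imputation and scaling steps described above can be illustrated with a minimal sketch. It is not the thesis implementation: the column names `diameter_mm` and `material`, the helper `impute_and_scale` and the toy records are hypothetical, and standardisation (zero mean, unit variance) is assumed as the form of feature scaling, which the abstract does not specify.

```python
from statistics import mean, median, pstdev


def impute_and_scale(rows, numeric_cols, categorical_cols):
    """Fill missing values, then standardise the numeric columns.

    Missing entries are represented by None. Numeric gaps receive the
    column median, categorical gaps receive the label "uncategorised".
    """
    out = [dict(r) for r in rows]  # work on copies, leave the input untouched
    for col in numeric_cols:
        observed = [r[col] for r in out if r[col] is not None]
        fill = median(observed)
        for r in out:
            if r[col] is None:
                r[col] = fill
        # Assumed scaling: z-score standardisation of the imputed column.
        mu = mean(r[col] for r in out)
        sigma = pstdev([r[col] for r in out]) or 1.0  # guard constant columns
        for r in out:
            r[col] = (r[col] - mu) / sigma
    for col in categorical_cols:
        for r in out:
            if r[col] is None:
                r[col] = "uncategorised"
    return out


# Hypothetical toy records; the second pipe has both values missing.
pipes = [
    {"diameter_mm": 100.0, "material": "cast iron"},
    {"diameter_mm": None, "material": None},
    {"diameter_mm": 300.0, "material": "PVC"},
    {"diameter_mm": 200.0, "material": "cast iron"},
]
clean = impute_and_scale(pipes, ["diameter_mm"], ["material"])
```

In a train/test setting, the median, mean and standard deviation would normally be estimated on the training split only and then reused for the test split; the sketch omits this for brevity.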
Random forests, LightGBM and artificial neural networks (ANN) have been trained to classify previous failures. Although the results were highly comparable, the ANN slightly outperformed the other classification methods. The classification results improved when the repaired-as-new assumption was dropped and the installation year was kept as a feature together with the duration until failure or censoring. The ANN results in particular improved from an AUC of 0.964 and an accuracy of 0.908 when assuming repaired as new, to an AUC of 0.997 and an accuracy of 0.994 when calculating all durations from the installation year while still including the installation year as a feature.
Random survival forests (RSF) were applied in addition to the hybrid statistical and machine learning survival models DeepHit, DeepSurv, CoxCC and CoxTime. While DeepSurv has been applied in only a handful of studies within this area of research, most of the other hybrid models have not, and they are seldom applied in engineering. The random survival forest model achieved a C-index-IPCW of 0.786 and a mean cumulative/dynamic AUC of 0.796. While the random survival forests achieved at best an integrated Brier score (IBS) of 0.126, the hybrid models achieved IBS values between 0.002 and 0.009, time-dependent C-index values of 0.989 to 0.999, a C-index-IPCW of around 0.996 and a mean cumulative/dynamic AUC as high as 0.999. The hybrid survival models also outperformed the RSF results reported in previous research, as well as the results of the few studies applying DeepSurv to water distribution pipes. The mean cumulative/dynamic AUC and the classification AUC can be used to compare the classification and survival methods, showing that the hybrid models achieved similar results while also modelling the time until failure.
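To give some intuition for the concordance-based metrics reported above, the following is a minimal sketch of Harrell's C-index, the basic form of the concordance index. It is a simpler variant than the IPCW and time-dependent versions used in the thesis, and the function name `harrell_c_index` and the toy data are hypothetical. The index is the fraction of comparable pipe pairs in which the pipe that failed earlier was also assigned the higher predicted risk.

```python
def harrell_c_index(durations, events, risk_scores):
    """Harrell's concordance index for right-censored data.

    durations   -- time until failure or censoring for each pipe
    events      -- True/1 if the failure was observed, False/0 if censored
    risk_scores -- model output, higher meaning earlier expected failure
    """
    concordant = 0.0
    comparable = 0
    n = len(durations)
    for i in range(n):
        if not events[i]:
            continue  # a censored pipe cannot be the earlier failure of a pair
        for j in range(n):
            if durations[i] < durations[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5  # tied predictions count as half
    return concordant / comparable


# Hypothetical toy data: four pipes, two observed failures, two censored.
durations = [5, 10, 15, 20]
events = [1, 0, 1, 0]
risk_scores = [0.9, 0.4, 0.05, 0.1]
c_index = harrell_c_index(durations, events, risk_scores)
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking. The IPCW variant additionally reweights pairs to correct for the censoring distribution, which this sketch does not do.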