HPCC Systems 7.2.16: Blocked Workunits After a Disk Failure
Until now, I had only dealt with Thor slave failures that caused blocked workunits, usually triggered by my own broken C++ plugin code. The fix was to stop the master with systemctl stop hpccsystems-platform.target, kill any remaining thorslave processes on the slaves, and start the master again.
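The restart procedure can be sketched as a small script. This is only an illustration: the hostnames in SLAVES are placeholders, passwordless SSH to the slaves is assumed, and DRY_RUN=echo (the default here) prints each command instead of executing it.

```shell
#!/bin/sh
# Hypothetical sketch of restarting the cluster after a Thor slave failure.
# DRY_RUN=echo (default) prints the commands; unset DRY_RUN to execute them.
DRY_RUN="${DRY_RUN-echo}"
SLAVES="thorslave01 thorslave02"   # assumed slave hostnames

restart_thor() {
    # 1. Stop the whole platform on the master node.
    $DRY_RUN systemctl stop hpccsystems-platform.target

    # 2. Kill any thorslave processes left behind on each slave.
    for host in $SLAVES; do
        $DRY_RUN ssh "$host" "pkill -f thorslave"
    done

    # 3. Start the platform on the master again.
    $DRY_RUN systemctl start hpccsystems-platform.target
}

restart_thor
```

Keeping the dry-run default makes it safe to review exactly which commands would hit which node before running the script for real.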
This time, the blocked workunits were caused by a failed hard disk on a slave. The disk suddenly switched to read-only mode and evidently left the DFS in an inconsistent state: even after the disk was repaired, the DFS service could no longer be started on that machine, and spraying any file was blocked due to the missing DFS process. Furthermore, I found unneeded DFS data left over from earlier cluster configurations, which wasted disk space. So I decided to wipe all variable data of HPCC Systems except for the configuration and the code.

I stopped the cluster with systemctl stop hpccsystems-platform.target on every node and killed the remaining processes (found via ps aux | grep hpcc). Then I deleted all variable files on each node with rm -rf /var/lib/HPCCSystems /var/log/HPCCSystems. After starting the master, I had to recreate the two folders /var/lib/HPCCSystems/mydropzone and /var/lib/HPCCSystems/thorslaves manually; otherwise, the landing zone would not work and the Thor slaves would not start (see the errors in /var/log/HPCCSystems/mythor/thorslaves-launch.debug). After this procedure, the system started again in a fresh, consistent state.
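The wipe-and-recreate steps can be sketched per node. Again a hedged illustration, not a definitive tool: DRY_RUN=echo (the default) only prints the commands, pkill -u hpcc stands in for the manual ps aux | grep hpcc cleanup, and the service user hpcc plus the chown step are assumptions matching a default installation.

```shell
#!/bin/sh
# Hypothetical sketch of wiping all variable HPCC Systems data on one node.
# DRY_RUN=echo (default) prints the commands; unset DRY_RUN only on a cluster
# you really intend to reset.
DRY_RUN="${DRY_RUN-echo}"

wipe_node() {
    # Stop the platform, then kill anything still running as the service user
    # (assumed to be 'hpcc'; the original post used ps aux | grep hpcc).
    $DRY_RUN systemctl stop hpccsystems-platform.target
    $DRY_RUN pkill -u hpcc

    # Delete all variable data and logs; configuration and binaries remain.
    $DRY_RUN rm -rf /var/lib/HPCCSystems /var/log/HPCCSystems
}

# On the master only: recreate the folders the landing zone and the Thor
# slave launcher expect, then start the platform again.
prepare_master() {
    $DRY_RUN mkdir -p /var/lib/HPCCSystems/mydropzone /var/lib/HPCCSystems/thorslaves
    $DRY_RUN chown -R hpcc:hpcc /var/lib/HPCCSystems   # assumed service user
    $DRY_RUN systemctl start hpccsystems-platform.target
}

wipe_node
prepare_master
```

Running wipe_node on every node and prepare_master on the master mirrors the order of the steps above; without the two mkdir targets, spraying and the Thor slave launch fail as described.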