Greetings,
Within the realm of what is practical, and prudent, I will attempt to explain what I believe caused the server crash on Labor Day (yesterday).
I cannot rule out some manner of Network Chicanery on the part of the D00rKn0bs; however, I seriously doubt that was the case. Even attempting to rule it out would require far more work than I could reasonably expend, with no assurance of success. At some point going forward, I may well be interested in setting up a 'Honey Pot' (http://www.honeypots.net/) to 'instrument' the Miscreants in their game, and then we can publish the results.
In any event, I believe the disk that served the site is about to fail. It is evident that a region of the disk's magnetic surface coating has begun to fail, and these things are unpredictable: it could last for another year (unlikely), or fail utterly tomorrow. I could restore it to reliable operation by recreating the file system (saving ALL the data, then copying it back onto the newly created file system). As part of creating the new file system, the operating system would map any unreadable areas of the disk as unusable; no writes would ever again be attempted against that region, and therefore no further errors would be raised. However, once a disk starts to go, it is likely to keep getting worse, the process would have to be repeated, and so on, until the whole thing just died. Therefore, the best thing to do was to move the sites to other infrastructure.
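For the technically curious, that bad-block mapping is essentially a front-to-back read of the device, recording every block the drive cannot return; this is what badblocks(8) does, and what mke2fs -c invokes when creating an ext file system. Below is a minimal Python sketch of such a scan; it is purely illustrative (NOT what I actually ran), the device path /dev/sdX is a placeholder, and reading a raw device requires root:

    import os

    def find_unreadable_regions(dev_path, block_size=4096):
        """Read a block device front to back, recording the byte offsets of
        blocks the drive cannot return (a failing region raises EIO)."""
        bad_offsets = []
        fd = os.open(dev_path, os.O_RDONLY)
        try:
            size = os.lseek(fd, 0, os.SEEK_END)  # device size in bytes
            for offset in range(0, size, block_size):
                os.lseek(fd, offset, os.SEEK_SET)
                try:
                    os.read(fd, block_size)
                except OSError:
                    bad_offsets.append(offset)  # the drive could not return this block
        finally:
            os.close(fd)
        return bad_offsets

    if __name__ == "__main__":
        for off in find_unreadable_regions("/dev/sdX"):  # placeholder device
            print(f"unreadable block at byte offset {off}")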
The upshot is that I had hoped to do this anyway, for a number of reasons, and it is now accomplished. The new infrastructure makes it MUCH easier, and faster, to recover from such a failure in the future. Even with this new infrastructure a failure would still mean an outage, but one I could almost always respond to remotely, without traveling to the datacenter and without swapping out hardware (unless the failure were truly catastrophic), thanks to the quite resilient nature of the sites' newly provisioned hardware.
My apologies for the outage, but these things do happen unless one spends a SIGNIFICANT amount of MONEY, and TIME (more money), to build a load-balanced, High Availability Web Server Farm. If this were a static site, I would have already done so. High Availability gets significantly more complex when there is a database behind the site, as the database must be synchronized across the pool of servers in real time. Such a thing is certainly doable, and I have engineered many such sites; however, it increases the complexity of the site's infrastructure to the point of being impractical without large resources.
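To make the synchronization point concrete, here is a tiny, purely illustrative Python sketch (every name in it is hypothetical, and it has nothing to do with our actual setup). A static site can be served by any number of identical, uncoordinated nodes; a database-backed site must push every write to every node before a load-balanced read can be trusted, and that fan-out is where the complexity and cost come from:

    import random

    class StaticNode:
        """Static files: every node is an identical copy; no coordination needed."""
        def __init__(self, files):
            self.files = dict(files)
        def get(self, path):
            return self.files.get(path)

    class DbNode:
        """Database-backed: each node holds mutable state the others must see."""
        def __init__(self):
            self.rows = {}
        def write(self, key, value):
            self.rows[key] = value
        def read(self, key):
            return self.rows.get(key)

    def replicated_write(pool, key, value):
        # Every write must reach every node before it is acknowledged;
        # otherwise a load-balanced read may return stale data.
        for node in pool:
            node.write(key, value)

    static_pool = [StaticNode({"/index.html": "<h1>Hi</h1>"}) for _ in range(3)]
    assert random.choice(static_pool).get("/index.html") == "<h1>Hi</h1>"

    db_pool = [DbNode() for _ in range(3)]
    replicated_write(db_pool, "user:1", "alice")
    # The load balancer may now route the read to ANY node and still be correct:
    assert random.choice(db_pool).read("user:1") == "alice"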
In any case, we are back up, and I believe ultimately in better shape than before…
Regards,
–Azti