Q: RAID can help protect me against data loss. But how can I also ensure that the system is up as long as possible, and not prone to breakdown? Ideally, I want a system that is up 24 hours a day, 7 days a week, 365 days a year.
A: High availability is difficult and expensive. The more fault-tolerant you try to make a system, the harder and more expensive it gets. The following hints, tips, ideas and unsubstantiated rumors may help you with this quest.
IDE disks can fail in such a way that the failed disk on an IDE ribbon can also prevent the good disk on the same ribbon from responding, thus making it look as if two disks have failed. Since RAID does not protect against two-disk failures, one should either put only one disk on an IDE cable, or if there are two disks, they should belong to different RAID sets.
SCSI disks can fail in such a way that the failed disk on a SCSI chain can prevent any device on the chain from being accessed. The failure mode involves a short of the common (shared) device ready pin; since this pin is shared, no arbitration can occur until the short is removed. Thus, no two disks on the same SCSI chain should belong to the same RAID array.
Similar remarks apply to the disk controllers. Don't load up the channels on one controller; use multiple controllers.
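The placement rule in the hints above (no two members of one array on the same IDE ribbon, SCSI chain, or controller) can be checked mechanically. The following is only a minimal sketch: the function name, the layout dictionary, and the bus labels such as "ide0" are hypothetical and chosen for illustration; they are not read from any real system or tool.

```python
from collections import defaultdict


def find_shared_bus_conflicts(disks):
    """Return {array: {bus: [disk, ...]}} for each bus carrying 2+ members of one array.

    `disks` maps a disk name to a (bus, array) pair. `bus` is any hashable
    label for the shared cable, chain, or controller (hypothetical naming).
    """
    members = defaultdict(lambda: defaultdict(list))
    for disk, (bus, array) in sorted(disks.items()):
        members[array][bus].append(disk)

    conflicts = {}
    for array, buses in members.items():
        shared = {bus: names for bus, names in buses.items() if len(names) > 1}
        if shared:
            conflicts[array] = shared
    return conflicts


# Hypothetical layout, for illustration only.
layout = {
    "hda": ("ide0", "md0"),
    "hdb": ("ide0", "md0"),  # same ribbon as hda and same array: a conflict
    "hdc": ("ide1", "md1"),
    "hdd": ("ide1", "md2"),  # same ribbon as hdc, but a different array: fine
}

print(find_shared_bus_conflicts(layout))
# prints {'md0': {'ide0': ['hda', 'hdb']}}
```

Using the controller name as the bus label instead of the cable name applies the same check one level up.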
Don't use the same brand or model number for all of the disks. It is not uncommon for severe electrical storms to take out two or more disks. (Yes, we all use surge suppressors, but these are not perfect either). Heat & poor ventilation of the disk enclosure are other disk killers. Cheap disks often run hot. Using different brands of disk & controller decreases the likelihood that whatever took out one disk (heat, physical shock, vibration, electrical surge) will also damage the others on the same date.
To guard against controller or CPU failure, it should be possible to build a SCSI disk enclosure that is "twin-tailed": i.e. is connected to two computers. One computer will mount the file-systems read-write, while the second computer will mount them read-only, and act as a hot spare. When the hot-spare is able to determine that the master has failed (e.g. through a watchdog), it will cut the power to the master (to make sure that it's really off), and then fsck & remount read-write. If anyone gets this working, let me know.
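The order of operations on the hot spare matters: confirm the failure, cut power to the master, and only then fsck and remount read-write. The sketch below captures just that ordering. Everything in it is an assumption made for illustration: the function name, the four caller-supplied hooks, and the rule of requiring several consecutive missed watchdog probes are not from any working implementation.

```python
def take_over(master_alive, power_off_master, fsck, remount_rw, checks=3):
    """Run the hot-spare takeover sequence and return the steps performed.

    All four callables are hypothetical hooks supplied by the caller:
      master_alive()     -> bool, one watchdog probe of the master
      power_off_master() -> None, cuts power to the master
      fsck()             -> bool, True if the file systems checked out
      remount_rw()       -> None, remounts the file systems read-write
    `checks` (an assumption) is how many consecutive probes must fail.
    """
    steps = []

    # Any successful probe means the master is still up: do nothing.
    for _ in range(checks):
        if master_alive():
            return steps
    steps.append("master declared dead")

    # Cut power first, so the master cannot write while we take over.
    power_off_master()
    steps.append("power cut")

    # Never remount read-write on a file system that failed its check.
    if not fsck():
        steps.append("fsck failed")
        return steps
    steps.append("fsck ok")

    remount_rw()
    steps.append("remounted read-write")
    return steps
```

Passing the hooks in as arguments keeps the sequencing separate from the hardware-specific parts (the watchdog and the power switch), which are the pieces nobody has reported getting to work.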
Always use a UPS, and perform clean shutdowns. Although an unclean shutdown may not damage the disks, it forces a ckraid run, and ckraid is painfully slow on even small-ish arrays. You want to avoid running ckraid as much as possible. Or you can hack on the kernel and get the hot-reconstruction code debugged ...
SCSI cables are well known to be very temperamental creatures, and prone to cause all sorts of problems. Use the highest quality cabling that you can find for sale. Use e.g. bubble-wrap to make sure that ribbon cables do not get too close to one another and cross-talk. Rigorously observe cable-length restrictions.
Take a look at SSA (Serial Storage Architecture). Although it is rather expensive, it is rumored to be less prone to the failure modes that SCSI exhibits.