NAS failsafe and disaster recovery options

hamishmb
Posts: 1891
Joined: 16/05/2017, 16:41

NAS failsafe and disaster recovery options

Post by hamishmb »

Here follows a quick summary of a discussion we originally started over email.

Main concerns:
  • Loss of data logging facility.
  • Loss of historic data.
Potential ways to back up the database:
  • Partition each SD Card with the OS on one partition and the software and results on the other. Mark the OS partition read-only so that an SD Card failure does not prevent the Pi operating. However, the results and logs on the other partition could be lost; this shouldn't be a problem assuming that the NAS is still working.
  • Just use the built-in RAID feature.
    • I need to find a way of notifying us when a drive fails - otherwise this isn't very useful.
  • Run a script/similar on the NAS box regularly to backup to USB storage.
  • Still store results in files, but do it to separate USB storage independent of the NAS box.
  • Use USB storage attached to 1 or more of the Pis for storing backup readings.
    • Awkward price-wise, cos only 1 USB port on the Pis.
  • Using the WAL (Write-Ahead Log) to stream to a failover database.
    • Good idea, but potentially complicated.
    • Also requires the NAS box database to support this, but it might not cos it's old and I haven't been able to cross-compile a newer database system.
    • Would also require another database server (?).
    • Would allow the system to continue running normally if the NAS box goes down?
Notes for WAL and my other thoughts:

I've never even heard of this before, so I'll need to look into it and get back to you.

We need to be careful here - there's definitely the potential for making things over-complicated and for feature-creep. Having said that, there's ample time to sort this out, especially seeing as I have the NAS box with me for the foreseeable future. It's also much easier to get all of this stuff in place now, rather than when the NAS box is installed at WMT.

Hamish
Last edited by hamishmb on 29/03/2020, 12:38, edited 1 time in total.
Hamish
wmtprojectsforum
Administrator
Posts: 73
Joined: 16/05/2017, 16:24

Re: NAS failsafe and disaster recovery options

Post by wmtprojectsforum »

Hamish,

I've added the dual partition solution to the list.
TerryJC
Posts: 2616
Joined: 16/05/2017, 17:17

Re: NAS failsafe and disaster recovery options

Post by TerryJC »

hamishmb wrote: 29/03/2020, 12:06
  • Just use the built-in RAID feature.
    • I need to find a way of notifying us when a drive fails - otherwise this isn't very useful.
Don't forget the mantra:
RAID is not a backup solution.
Terry
hamishmb
Posts: 1891
Joined: 16/05/2017, 16:41

Re: NAS failsafe and disaster recovery options

Post by hamishmb »

Yep :)

But it is useful if a drive fails of course, as long as we have a way of knowing a drive has failed (besides the lights on the box).
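
A rough sketch of how the "knowing a drive has failed" part might be checked from a script. It assumes the NAS box uses Linux software RAID (md) and exposes /proc/mdstat, which hasn't been confirmed in this thread; the function name and the sample text are made up for illustration.

```python
import re


def degraded_arrays(mdstat_text):
    """Return names of md arrays whose status shows a missing member.

    Assumes the /proc/mdstat layout used by Linux software RAID, where a
    healthy two-disk mirror shows "[UU]" and a failed member shows as "_".
    """
    degraded = []
    current = None
    for line in mdstat_text.splitlines():
        header = re.match(r"^(md\d+)\s*:", line)
        if header:
            current = header.group(1)
            continue
        status = re.search(r"\[([U_]+)\]", line)
        if current and status and "_" in status.group(1):
            degraded.append(current)
    return degraded


# Hypothetical sample of what /proc/mdstat might contain after a failure.
SAMPLE = """Personalities : [raid1]
md0 : active raid1 sda1[0] sdb1[1](F)
      976630336 blocks [2/1] [U_]

md1 : active raid1 sda2[0] sdb2[1]
      1048512 blocks [2/2] [UU]

unused devices: <none>
"""

if __name__ == "__main__":
    print(degraded_arrays(SAMPLE))
```

On the real box the text would come from reading /proc/mdstat, and a non-empty result could trigger whatever notification we settle on (email, a flag in the database, etc.).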
Hamish
TerryJC
Posts: 2616
Joined: 16/05/2017, 17:17

Re: NAS failsafe and disaster recovery options

Post by TerryJC »

I don't deny that we want to know if a drive has failed, but we shouldn't use RAID as the backup.
Terry
crumeniferus
Posts: 9
Joined: 04/12/2019, 18:23

A little summary of Write-Ahead Logs and Database Engines

Post by crumeniferus »

I'd like to offer this description which may help with getting a picture of how a Write-Ahead Log can be used to increase availability of a database.

The Write-Ahead Log (WAL) is not an obscure name so you can probably get a feel for its general nature without much explanation. I'll go straight to what this means with respect to database engines.

For a database engine, the write-ahead log (WAL) is a journal of all transactions, complete and partial, that have been or need to be committed to the database. As far as the engine is concerned, the WAL is the most reliable source of data. The tables, constraints, etc. are all simply part of communicating with the outside world. Writing to the WAL is considerably faster than updating the structured representation. Usually, in the event of system failure, there will be only a handful of transactions which fail to make it to the WAL, thus minimising loss of data. When the system restarts, transactions that were not completely written into the WAL will most likely be dropped, or maybe marked as unusable and kept for analysis. Any structured data that was being updated from complete WAL entries but not yet fully committed will be dropped. Transactions that were complete in the WAL but not committed to the structure can be replayed. In this way, only a small data loss should occur and the remaining data will have no loss of integrity.
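
To make the description above a bit more concrete, here is a toy sketch in Python. It is not how any real database engine is implemented; the class name, the JSON-lines log format, and the key names are all invented for illustration. It only shows the ordering (log first, structure second) and the replay-on-restart behaviour, including dropping a partially written entry.

```python
import json
import os
import tempfile


class TinyWalStore:
    """Toy key-value store illustrating the write-ahead idea; not a real engine."""

    def __init__(self, wal_path):
        self.wal_path = wal_path
        self.table = {}  # stands in for the "structured representation"

    def put(self, key, value):
        # 1. Append the transaction to the log and force it to disk first.
        with open(self.wal_path, "a", encoding="utf-8") as wal:
            wal.write(json.dumps({"key": key, "value": value}) + "\n")
            wal.flush()
            os.fsync(wal.fileno())
        # 2. Only then update the structured data.
        self.table[key] = value

    def recover(self):
        """Rebuild the table by replaying complete log entries.

        Returns the number of incomplete entries that were dropped.
        """
        self.table = {}
        dropped = 0
        if not os.path.exists(self.wal_path):
            return dropped
        with open(self.wal_path, encoding="utf-8") as wal:
            for line in wal:
                try:
                    entry = json.loads(line)
                except json.JSONDecodeError:
                    dropped += 1  # partially written entry: drop it
                    continue
                self.table[entry["key"]] = entry["value"]
        return dropped


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "wal.log")
    store = TinyWalStore(path)
    store.put("level_a", 410)
    restarted = TinyWalStore(path)
    print(restarted.recover(), restarted.table)
```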

Database replication, in the context of database engines, usually describes a process specific to the database engine rather than as a synonym for copying files via the operating system. This will typically require both the replica and master databases to be running the same engine.

One way of achieving replication is to stream the WAL from the master out over TCP/IP to the replica. The engine of the replica reads the transactions in the WAL and applies them to its structured representation in just the same way as the master does.
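
The replica side of that can be pictured with another small sketch. Again this is only an illustration: the (sequence number, key, value) entry format is made up, there is no networking, and real engines do this quite differently internally. The point is that the replica applies entries strictly in order and remembers how far it has got, so resent entries are harmless and a gap makes it wait.

```python
def apply_stream(replica_table, last_applied, entries):
    """Apply log entries newer than last_applied, in order.

    Each entry is a (sequence_number, key, value) tuple, a made-up format
    standing in for whatever the real engine sends. Returns the sequence
    number of the last entry applied.
    """
    for seq, key, value in sorted(entries, key=lambda e: e[0]):
        if seq <= last_applied:
            continue  # already applied, e.g. resent after a reconnect
        if seq != last_applied + 1:
            break  # gap in the stream: wait for the missing entry
        replica_table[key] = value
        last_applied = seq
    return last_applied


if __name__ == "__main__":
    replica = {}
    position = apply_stream(replica, 0, [(1, "a", 1), (2, "b", 2)])
    print(position, replica)
```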

Once the concept of a WAL is established, as well as the possibility of streaming or storing it then replaying it on a different instance, a few different configurations become possible. These configurations address different balances of fault tolerance, load balancing, and storage capacity. Some configurations can ease the very knotty problem where two copies are separated and diverge (office with travelling salesmen and no internet access, for example), resulting in a potentially very complicated data merge. The configurations available depend on the capabilities of the particular database engine.

Here, my knowledge of the specifics is limited to PostgreSQL so, to reduce noise in the thread, I'll leave the description as it is for now.
hamishmb
Posts: 1891
Joined: 16/05/2017, 16:41

Re: NAS failsafe and disaster recovery options

Post by hamishmb »

Okay, sounds interesting. I'll look into whether we can do this or not.
Hamish
hamishmb
Posts: 1891
Joined: 16/05/2017, 16:41

Re: NAS failsafe and disaster recovery options

Post by hamishmb »

Okay, so here are my thoughts on the ideas so far:

Multiple partitions on the SD card

Having the system partition read-only is a good idea I think. This whole idea seems quite good to me, but there are some issues:
  • It makes setting up the base image more complicated and error prone.
  • While data corruption won't stop the Pi from booting, if the software is unable to write readings files we may get some strange behaviour (though this should be handled correctly I think).
  • We'll still slowly wear out SD cards - higher maintenance costs and volunteer time than other solutions.
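
On the point above about the software being unable to write readings files, one way it could be handled is sketched below. This is not taken from the existing framework code; the function name, the pending list, and the file names are hypothetical. The idea is simply to hold readings in memory when the write fails and flush them on the next successful write.

```python
import os
import tempfile


def save_reading(line, path, pending, max_pending=1000):
    """Append a reading to path; on failure keep it in the pending list.

    Returns True if the reading (and any earlier pending ones) was written.
    Oldest pending readings are discarded once max_pending is exceeded.
    """
    pending.append(line)
    if len(pending) > max_pending:
        del pending[0]
    try:
        with open(path, "a", encoding="utf-8") as readings_file:
            readings_file.write("\n".join(pending) + "\n")
    except OSError:
        # e.g. a corrupted or read-only partition: keep readings in memory
        return False
    pending.clear()
    return True


if __name__ == "__main__":
    folder = tempfile.mkdtemp()
    queue = []
    print(save_reading("reading 1", os.path.join(folder, "readings.csv"), queue))
```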
Built-in RAID

Not a backup solution, as Terry said. Does at least provide some redundancy against hardware failure though.

Regularly backup to USB storage using the USB port on the NAS

Seems like a good idea, but stops us from communicating with a UPS through the USB port if we wanted to (the software in the box can automatically shut it down).

Also only provides one backup, in the same location as the NAS box - if, say, it was stolen or there was a fire/flood, the backup would probably go with it.
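
For what it's worth, the regular backup itself could be quite a small script. The sketch below assumes Python is available on the NAS box and that the USB storage is mounted somewhere writable, neither of which has been checked; the function name and file naming scheme are invented. It copies a file under a timestamped name and keeps only the newest few copies so the USB storage doesn't fill up.

```python
import os
import shutil
import tempfile


def backup_with_rotation(source, dest_dir, stamp, keep=7):
    """Copy source into dest_dir as "<stamp>_<name>" and keep the newest `keep`.

    `stamp` should sort chronologically as text (e.g. "20200401T0200"),
    and `keep` must be at least 1.
    """
    os.makedirs(dest_dir, exist_ok=True)
    base = os.path.basename(source)
    target = os.path.join(dest_dir, stamp + "_" + base)
    shutil.copy2(source, target)
    backups = sorted(n for n in os.listdir(dest_dir) if n.endswith("_" + base))
    for old in backups[:-keep]:
        os.remove(os.path.join(dest_dir, old))
    return target


if __name__ == "__main__":
    work = tempfile.mkdtemp()
    src = os.path.join(work, "dump.sql")
    with open(src, "w", encoding="utf-8") as f:
        f.write("-- example\n")
    print(backup_with_rotation(src, os.path.join(work, "usb"), "20200401T0200"))
```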

Backup through USB storage connected to 1 or more of the Pis

Would work, but we would need USB hubs. Also more devices, and if these are flash based they will wear out like the SD cards.

This would give us backups in multiple locations though.

WAL

Judging by https://www.digitalocean.com/community/ ... n-in-mysql, this seems to involve editing database configuration files before starting. On the NAS box, these are all on a RAM disk and are reset at boot...

I'm not saying it's impossible to do it, but the solution may end up being complex. I wonder whether dumping the database to a file, which would be a lot simpler, would suffice. It would mean that we don't have a warm-standby database to swap to, of course.
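
A minimal sketch of the "dump to a file" option, assuming the database is MySQL (as the link above suggests) and that mysqldump and Python are both present on the NAS box, which hasn't been verified. The database and user names in the example are placeholders, and credentials are assumed to come from an option file rather than the command line.

```python
import gzip
import shutil
import subprocess


def build_dump_command(database, user):
    """Build a mysqldump command line for one database.

    --single-transaction asks for a consistent snapshot without locking
    tables, which only applies to transactional tables such as InnoDB.
    """
    return ["mysqldump", "--single-transaction", "--user=" + user, database]


def dump_to_gzip(database, user, out_path):
    """Run mysqldump and write its output gzip-compressed to out_path."""
    cmd = build_dump_command(database, user)
    with gzip.open(out_path, "wb") as out:
        with subprocess.Popen(cmd, stdout=subprocess.PIPE) as proc:
            shutil.copyfileobj(proc.stdout, out)
    if proc.returncode != 0:
        raise RuntimeError("mysqldump exited with status %d" % proc.returncode)


if __name__ == "__main__":
    # Placeholder names; nothing is run here, only the command is shown.
    print(" ".join(build_dump_command("example_db", "backup_user")))
```

The resulting file could then be copied to USB storage or elsewhere. As noted, this gives a point-in-time copy only, not a standby database.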

Summary

I'm really not sure what to do, there are too many options and too many trade-offs :lol: . Perhaps budget will be the decision-making factor. I don't think we have any budget left at the moment, but also I don't imagine the model town will make much money this year compared to previous ones because of COVID-19. Thoughts?
Hamish
hamishmb
Posts: 1891
Joined: 16/05/2017, 16:41

Re: NAS failsafe and disaster recovery options

Post by hamishmb »

New idea:

How about this: in addition to one or more of the previous ideas, if the NAS box is connected to the internet (and I think we were talking about doing this), we could have it compress and upload backups to this webserver periodically, so we're closer to employing the 3-2-1 backup strategy (3 backups of all important data, on at least 2 different local mediums, and 1 remote backup).
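
As a quick way of checking any combination of the ideas in this thread against 3-2-1, here is a tiny sketch. It uses the rule as it is commonly stated (at least 3 copies, on at least 2 different kinds of media, with at least 1 offsite); the function name and the medium labels are just for illustration.

```python
def satisfies_3_2_1(copies):
    """Check a list of (medium, is_offsite) tuples, one per copy of the data."""
    media = {medium for medium, _ in copies}
    has_offsite = any(is_offsite for _, is_offsite in copies)
    return len(copies) >= 3 and len(media) >= 2 and has_offsite


if __name__ == "__main__":
    # NAS drives + USB backup on the NAS, nothing offsite.
    print(satisfies_3_2_1([("nas_disk", False), ("usb", False)]))
    # The same, plus an upload to the webserver.
    print(satisfies_3_2_1([("nas_disk", False), ("usb", False), ("webserver", True)]))
```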
Hamish
TerryJC
Posts: 2616
Joined: 16/05/2017, 17:17

Re: NAS failsafe and disaster recovery options

Post by TerryJC »

Herewith my response to one or two of the items in your previous two posts:
Multiple partitions on the SD card
It makes setting up the base image more complicated and error prone.
Setting up shouldn't be too much harder than it is currently. In the Installation Spec I detail a process whereby a Base Image is created with the OS and common software packages installed. It would be one more step to partition the SD Card used to create that Base Image and then copy the Software Framework to the new partition instead of the OS partition.
We'll still slowly wear out SD cards - higher maintenance costs and volunteer time than other solutions.
True, but that was always going to be the case, and I suspect disk activity would be significantly reduced because the OS will now be writing its logfiles to a RAM disk instead of the SD Card. Writes to the second partition would be reduced to once per measure cycle.
WAL
I think that if we were setting up a critical data centre, then this would be by far the best solution. However, we are not and in the extreme our main goal is to keep the system running until a member of staff can intervene in the event of an outage. Having a backup solution isn't that critical in fact because the worst that would happen is that we would have a gap in historical data. Hardly the end of the world :)

Also this is very much a Network Administrator task; not just the capture of the stream but also in the restoration of the data after a disaster. The WMT Staff and volunteers do not have many Network Administrators amongst them, so its reliability would rely on 'one of us' to deal with it. When I initially set out to automate various aspects of the WMT (when I did the Model Railway Lighting), I decided that anything done should give later volunteers a fighting chance of picking up where I left off (should I fall under the proverbial bus). Now that the team is much larger I don't see any reason to change that policy. We are all specialists in some way (although we do have more software expertise than previously), but if an individual moved on, having a complex backup solution could make things that much more difficult.

We could mitigate the previous issue by creating additional software to automate the process of course.
Offsite Storage
This is probably the best solution and could be used to store (compressed) log and Results files, a full NAS database backup or the WAL. However, there are some points worth making:
  1. It is more complex than a simple backup file, so all the arguments put forward about the WAL solution would apply.
  2. The WMT Management are reluctant to increase the traffic on the Town's Internet connection due to cost.
  3. An even more skilled Network Administrator would be needed to ensure that the security of the River System Network and the Office Network wasn't compromised.
All through my engineering career I have tried to follow the 'Keep it Simple Stupid' dogma. On that basis the partitioned SD Card and direct NAS Box backups are at the simple end with WAL and Internet backups at the other. The simple end increases cost a bit and the other end increases complexity a lot.

I'm still all for KISS.
Terry