How to handle database failure

A forum for discussion on the software for the WMT River Control System
PatrickW
Posts: 146
Joined: 25/11/2019, 13:34

How to handle database failure

Post by PatrickW »

Hamish and I ended up having a text message discussion about handling database failure in the software (and particularly in the control logic).

We decided to move the discussion here. Here is a (very) lightly abridged transcript of the discussion so far.

[We had been talking about the Stage Pi control logic.]
hamishmb wrote:07/09/2020, 13:18Only question I think I have is what if the database goes down while a valve is open? But I think other parts of the framework may need improvements to deal with that.
PatrickW wrote:07/09/2020, 13:44I didn't think of DB failure handling. Is it even possible to close a valve if the DB is down? I suppose the valve would have to do that itself after a timeout. I think my intention was for the logic to 'hang' in its current state if it can't get valid sensor readings. [...] If it can get sensor readings but cannot control a necessary device, then it can do no more than monitor the situation.
hamishmb wrote:07/09/2020, 13:55No actually, the valve would have to do that itself. It's just making sure anything local like a pump connected directly to the pi would be turned off, so yeah you're right, never mind.
hamishmb wrote:07/09/2020, 13:55Terry was saying that remembering the last few readings/state could help, but for G6 I don't think it'd change anything.
PatrickW wrote:07/09/2020, 16:41Yes, if it is relying on a cache of old readings, it will just decide to stay in the current state until they change, the effect of which is the same as what it already does. I suppose previous readings might let it detect when a water level isn't changing the way it should be, but only if it takes into account the actions of the other algorithms.
PatrickW wrote:07/09/2020, 16:45But previous readings are already available when the database is working.
PatrickW wrote:07/09/2020, 16:50Should get_latest_reading be able to fall back to using local data if the database is down? That would be enough to keep sump pi running.
PatrickW wrote:07/09/2020, 16:56To keep that clean, I would have a separate local get_latest_reading function/method and then another get_latest_reading that tries the database one or falls back to the local one.
hamishmb wrote:07/09/2020, 17:17That'd be a good idea, something to go in logiccoretools? Then it can be seamless.
hamishmb wrote:07/09/2020, 17:17Best to wait until after the merge I think - that's kinda a separate work item, and we don't really need to delay things to do that.
hamishmb wrote:08/09/2020, 23:34Just realised that your code is using the database to fetch readings from the probes connected directly to Stage Pi. This will work but be a bit inefficient and depends on the NAS box when it need not. That's okay for here, but for sites with pumps and other devices it'd be best if we could still control those based on local readings if the NAS box/network goes down for some reason
PatrickW wrote:08/09/2020, 23:45I think I asked about this when I began to switch it [the Stage Pi logic] over to logiccoretools. I understood that everything was going via the database, but perhaps I misunderstood. What is the intended method for obtaining local readings? In practice, Stage Pi can't really do anything without readings from Wendy Butts Pi or control of V12, so loss of NAS will always grind it to a halt.
hamishmb wrote:09/09/2020, 12:42The intended way is to use the readings dictionary passed to the control logic function. This will require making the main loop a bit less stupid than it is right now. I'm sorry because this is my fault for being unclear, that is indeed what you asked, but I didn't realise you were asking about local readings too.
hamishmb wrote:09/09/2020, 12:43Again, we can do that after merging, because your code is, as you say, fully functional.
PatrickW wrote:09/09/2020, 12:50Oh, I already removed all the readings dictionary code from Stage Pi's logic, because I thought the readings dictionary was going away. I don't particularly like having two separate ways of getting readings. It seems to me as though logiccoretools should do that automatically, but we can cross that bridge later on.
PatrickW wrote:09/09/2020, 12:51Readings dict will make unit testing more complex.
hamishmb wrote:09/09/2020, 12:53Yeah, point. How about this:
hamishmb wrote:09/09/2020, 12:55This is seamless and means existing code need not change. Again, we can do that later on
hamishmb wrote:09/09/2020, 12:55logiccoretools can automatically stash the last few readings for each probe, and if the call to DatabaseConnection fails, it can just return the newest one it had instead (or if you asked for multiple readings, it can do its best to do what you asked)
hamishmb wrote:09/09/2020, 12:55Does that sound better?
(Patrick apparently doesn't read the suggestion properly.)
PatrickW wrote:09/09/2020, 13:51Why can't it just get the actual readings if they are local readings? It could just go 'Oh, DB is down. What does the readings dict say then?' This kind of thing was the rationale behind using logiccoretools instead of directly using DatabaseConnection: you can slip in a different way of doing things without changing either DatabaseConnection or the control logic. (N.B. This seems like a good point to mention that I was thinking the functions in logiccoretools would be empty variables (= None), and then run_standalone could assign the appropriate function to each variable, and unit tests would assign a different, fake set. So, if we just want DB access it would assign each of DatabaseConnection's methods to the variables. If we want a version that tries the DB but falls back on local readings, that's a separate set of functions, somewhere in Tools, that can be assigned to the logiccoretools variables.)
hamishmb wrote:09/09/2020, 13:54That's pretty much what I was suggesting (readings wise). What's the advantage of doing this with run_standalone()? It seems overcomplicated and unnecessary for me. We can override them anyway in the unit tests so it doesn't obstruct testing (I've done this in many places in my tests). Does it make more sense to have it fall back automatically? Why would we ever want it not to fall back?
hamishmb wrote:09/09/2020, 13:54Also, should we discuss this on the forum so Terry and co can see the messages?
hamishmb wrote:09/09/2020, 13:55That way we have a better record of why we picked this design and why we discarded other ideas
PatrickW wrote:09/09/2020, 13:59Yes, I would prefer to discuss this on the forum for that reason and because texting is making my little finger numb!
(Unable to suppress a sudden thought, Patrick then continues texting...)
PatrickW wrote:09/09/2020, 14:06I think having a default behaviour for the functions is likely to encourage coding against the default behaviour rather than against the abstract interface.
PatrickW wrote:09/09/2020, 14:09There could be a logiccoretools.setup_functions_db(), etc. to quickly assign a particular set of functions without making other parts of the code remember which ones go together.
hamishmb wrote:09/09/2020, 14:10If we make the fallback behave exactly the same to the knowledge of anything using the function, it shouldn't matter? Let's discuss on the forum anyway :)
Hamish asked me to summarise, but I thought I may as well just post a transcript to avoid misrepresenting what was said.
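For clarity, the fallback idea discussed in the transcript could be sketched like this. All the names here (`store_local_reading`, `DatabaseUnavailableError`, and so on) are hypothetical, not the real logiccoretools/DatabaseConnection API; the database query is stubbed out to always fail, as if the NAS box were down:

```python
class DatabaseUnavailableError(Exception):
    """Stands in for whatever error the real database layer raises when down."""

def _get_latest_reading_from_db(probe_id):
    # Placeholder for the real database query; here it always fails,
    # simulating an unreachable NAS box.
    raise DatabaseUnavailableError(probe_id)

_local_cache = {}  # probe id -> most recent reading seen locally

def store_local_reading(probe_id, reading):
    """Record a reading obtained locally, so it can serve as a fallback."""
    _local_cache[probe_id] = reading

def get_latest_reading(probe_id):
    """Try the database first; fall back to the local cache if it is down."""
    try:
        return _get_latest_reading_from_db(probe_id)
    except DatabaseUnavailableError:
        return _local_cache.get(probe_id)
```

The point of wrapping it this way is the seamlessness Hamish mentions: callers keep calling one `get_latest_reading` and never need to know whether the reading came from the database or the local cache.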
Last edited by PatrickW on 12/09/2020, 18:10, edited 5 times in total.
PatrickW
Posts: 146
Joined: 25/11/2019, 13:34

Re: How to handle database failure

Post by PatrickW »

I am willing to concede the finer points around whether logiccoretools contains function bodies versus empty variables that have functions assigned to them by other code. At the end of the day, it will work either way and it is not worth getting caught up on a disagreement about coding style.

My preference to define it as a "pure" interface, detached from any particular implementation, is difficult to justify except in terms of "because it is just neater that way", and clearly not everyone thinks it is actually neater!
(My stance on this is probably rooted in something akin to the Open-Closed Principle, expressed in my thinking as a strong preference to interchange modules rather than modify existing ones, and thus to design modules in a way that minimises the need for modification.)

(One other thing I noticed is that some of the functions in logiccoretools [store_tick and store_reading] are not [as far as I know] intended to be used by logic, so what we actually have is a something-else-tools, but let's not get into that. It's good enough. :D )

So, does this mean we have reached a decision to implement a fallback to local readings within the functions that are defined in logiccoretools? Does that fully resolve the question of how to handle database failure, or are there other facets to consider?
hamishmb
Posts: 1891
Joined: 16/05/2017, 16:41

Re: How to handle database failure

Post by hamishmb »

I think I'd prefer a pure interface too, but in Python this seems rather clunky, with few, if any, of the guarantees offered in, say, Java.

store_tick is used in the NAS logic to be fair, but you're right about the other one :) Didn't think of that, ah well.

I believe so, in both cases, but I have no doubt that there's something I haven't thought of. It's probably okay for now, but any ideas Terry?
Hamish
TerryJC
Posts: 2616
Joined: 16/05/2017, 17:17

Re: How to handle database failure

Post by TerryJC »

I don't feel that my knowledge of the capabilities of Python qualifies me to comment on the detailed implementation of any solution, but I am confident that I can comment on Policy. First, a history lesson.

When the NAS Box became available, my idea was that it would simply be a more robust place for the readings files, with the SD Cards being retained for the backups. In other words, the idea of using the built-in database didn't arrive until later. As time passed, other more powerful reasons for using a database surfaced, in particular the 'zoning' of device control. Now that we have one fully working implementation of this (the Lady Hanham SAC) and a working but not yet deployed implementation (the Stage SAC), we are (rightly) getting worried about database failure.

So the primary reason for having the database was to allow any SAC to access the level (and ultimately flow) data that's needed to carry out its control functions. Another important function was graphing, so that anyone could call up a chart showing the water levels over any given period, from days through to years. That secondary function is very much a 'nice to have' though; gaps in the record won't be the end of the world.

The important thing to safeguard is the time between the database going down and it subsequently being restored. This could be anything from hours to days, so the solution needs to be robust and if possible, simple. The important thing is that at any given time the data is held in two places; the NAS Box and the various Pis distributed around the site, so how about this for an idea:
  1. Each SAC keeps a separate file containing the data for the levels that it is interested in (eg Wendy Butts and Sump for Sump Pi, Wendy Butts and Stage Butts for Stage Pi). These devices are fetching this data anyway, so it is a simple matter to write it to a local file for fall-back purposes only. In normal operation these files would be retained for minutes, rather than hours, ie, they are overwritten each time a new dataset is obtained from the NAS Box.
  2. In the event of a database failure each SAC would revert to its fall-back file initially while it switches to fail-over mode. In this mode, the old sockets code is used to read the required data from the relevant SACs and Gate Valves and operation continues fairly seamlessly.
  3. When the database error has been resolved, operation reverts to normal.
I appreciate that there will be a bit of work involved in this because, apart from the Wendy Butts, all the SACs will have to run similar code to Sump Pi to allow them to obtain data from the other Pis. Apart from that, the solution largely re-uses pre-existing code in different locations.
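As a minimal sketch of step 1 (the file name and JSON format are my assumptions, not a decided design), each SAC could overwrite its fall-back file atomically each time a fresh dataset arrives from the NAS Box:

```python
import json
import os

def write_fallback_file(path, readings):
    """Overwrite the fall-back file with the latest dataset from the NAS Box.

    Writing to a temporary file and then renaming means a crash mid-write
    cannot leave a half-written file behind.
    """
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(readings, f)
    os.replace(tmp, path)  # atomic replacement on POSIX filesystems

def read_fallback_file(path):
    """Return the last stored dataset, or None if no fall-back file exists."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return None
```

Because the file is overwritten on every fetch rather than appended to, it stays tiny, matching Terry's point that the data only needs to be retained for minutes.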

Any obvious pitfalls?
Terry
PatrickW
Posts: 146
Joined: 25/11/2019, 13:34

Re: How to handle database failure

Post by PatrickW »

The main requirement I draw from Terry's policy comment is:
When the database goes down, we may end up with a gap in the data in the database, but the other functions of the system should continue with minimal disruption.

The rest of Terry's comment seems to be a logical conclusion from that requirement.

Having had my head stuck in the stateful Stage Pi control logic for a while, I didn't immediately see the usefulness of a local file containing sensor readings, but of course it is useful when the logic is stateless. It is a valid suggestion.

In its current state, I don't think the sockets code is 'ready' to be used as a fallback option, but it's certainly not impossible.

As far as I know, we can't just re-instate the old sockets code as it was. The old, sockets method of distributing sensor readings and device control requests was not updated to support multiple SACs, because we knew we were going to use the database instead. Since then, the way sockets are used has evolved into a more centralised arrangement which uses the NAS as a message broker. I think this was done to avoid the complexity of having everything connect to everything else. (Complexity that partly motivated the move to using the database to distribute sensor readings, if I recall correctly.)

(To be clear, I think the actual, underlying sockets code functions in pretty much the same way it always has, but typically when we talk about 'sockets' in this project, and certainly when we talk about using sockets to distribute sensor readings, we are referring to the overall communication strategy in which sockets are employed, rather than just to the sockets themselves, and that has changed.)

It seems obvious to me that we would adapt the new sockets/messaging arrangement to serve as a fall-back for sensor readings and device control, rather than adapting the older code. But, it's not much of a fall-back if, like the database, it depends on the NAS being up and running (to act as message broker). So, the obvious solution is to assign the message broker role to a different node than the one that runs the database, so that two nodes need to go down before the SACs will grind to a halt.

I think Hamish knows more about the nitty gritty of this than I do, though, so he may have better ideas and/or corrections to my misconceptions.

Beyond that, I find myself mentally exploring a few different approaches that would eliminate the message broker as a single point of failure while maintaining compatibility with that general (message-based) approach, but I will refrain from writing an essay on those approaches unless we actually want to go down that road.
TerryJC
Posts: 2616
Joined: 16/05/2017, 17:17

Re: How to handle database failure

Post by TerryJC »

PatrickW wrote: 12/09/2020, 15:26In its current state, I don't think the sockets code is 'ready' to be used as a fallback option, but it's certainly not impossible.
I never quite said that :-) What I actually said was:
I appreciate that there will be a bit of work involved in this because, apart from the Wendy Butts, all the SACs will have to run similar code to Sump Pi to allow them to obtain data from the other Pis.
I realise that there will be some rewriting to do. However:
PatrickW wrote: 12/09/2020, 15:26It seems obvious to me that we would adapt the new sockets/messaging arrangement to serve as a fall-back for sensor readings and device control, rather than adapting the older code. But, it's not much of a fall-back if, like the database, it depends on the NAS being up and running (to act as message broker). So, the obvious solution is to assign the message broker role to a different node than the one that runs the database, so that two nodes need to go down before the SACs will grind to a halt.
As mentioned earlier, I'm not good on detailed implementation, but the old sockets code has been working well in recent months, so there may be a way to simply re-use it but with different targets for the data sources.

I'll leave that for you and Hamish to discuss.
Terry
PatrickW
Posts: 146
Joined: 25/11/2019, 13:34

Re: How to handle database failure

Post by PatrickW »

TerryJC wrote: 12/09/2020, 15:47
PatrickW wrote: 12/09/2020, 15:26In its current state, I don't think the sockets code is 'ready' to be used as a fallback option, but it's certainly not impossible.
I never quite said that :-) What I actually said was:
I appreciate that there will be a bit of work involved in this because, apart from the Wendy Butts, all the SACs will have to run similar code to Sump Pi to allow them to obtain data from the other Pis.
Ah, but I never quite said that you said it! My sentence probably needed an "as Terry said" or "I would agree" somewhere in it. :)

The rest of what I said, where I went into implementation details, was probably aimed more at Hamish than Terry. I think Hamish probably has a clearer idea than both Terry and me about how this does and/or should work.
hamishmb
Posts: 1891
Joined: 16/05/2017, 16:41

Re: How to handle database failure

Post by hamishmb »

Okay, so here are my main takeaways from the above:

1) We need a proper fallback option if the database goes down. We knew this already I think.

2) We could use the Sockets class (our abstraction) for this fallback.

It was previously my understanding that we would no longer use the sockets way of transferring readings, and only use them for time-sensitive communications (like the system ticks). If this was the grand plan then I'm perhaps slightly annoyed at being confused, but I admit the idea makes sense. I maybe wouldn't have removed so much of the old sockets logic, but it's still in the old pre-merge (use-database branch) commits, so no matter.

The old sockets control logic will probably not be fit for purpose here, but the sockets themselves are well tested, and the associated SocketsMonitors (previously used to monitor a remote probe over a socket) will work fine. I think we want a central message broker if we are to take this approach, because otherwise we are essentially losing one of the main reasons we added the NAS box - the database enables us to make communication simpler. The code for this broker - message forwarding basically - has already pre-emptively been written and tested, and integrated into the system software (but not yet used).

I agree that this broker needs to run on a system other than the NAS box, otherwise it won't help much to mitigate against NAS box failure. I'm much more comfortable with having two points of failure. Testing that this fallback works effectively may be a bit of a problem. Perhaps I should modify test mode so probe values and float switches go up and down predictably rather than sitting at 0/empty?

3) Files could be part of this solution.

Agreed, it makes sense. This could be managed transparently by logiccoretools as well. In fact, the whole fallback could be managed by logiccoretools. All of the methods there have their own docstrings, separate from the DatabaseConnection equivalents, so we can easily add to and modify the behaviour without documentation troubles.
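The stashing idea from earlier in the thread (logiccoretools keeping the last few readings per probe and doing its best when asked for several) could look something like this; the class and method names are hypothetical:

```python
from collections import deque

class ReadingStash:
    """Keeps the last few readings per probe as a transparent fallback cache."""

    def __init__(self, depth=5):
        self._depth = depth
        self._stash = {}  # probe id -> deque of recent readings

    def add(self, probe_id, reading):
        """Stash a reading; the oldest is discarded once depth is exceeded."""
        self._stash.setdefault(
            probe_id, deque(maxlen=self._depth)
        ).append(reading)

    def latest(self, probe_id, count=1):
        """Return up to `count` of the newest readings, best effort.

        If fewer than `count` readings are stashed, return what we have,
        matching the 'do its best to do what you asked' behaviour.
        """
        readings = list(self._stash.get(probe_id, ()))
        return readings[-count:]
```

When a DatabaseConnection call fails, the fallback path would answer from this stash instead, so callers see degraded freshness rather than an error.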

4) Terry's comment:
The important thing to safeguard is the time between the database going down and it subsequently being restored. This could be anything from hours to days, so the solution needs to be robust and if possible, simple. The important thing is that at any given time the data is held in two places; the NAS Box and the various Pis distributed around the site, so how about this for an idea:

1) Each SAC keeps a separate file containing the data for the levels that it is interested in (eg Wendy Butts and Sump for Sump Pi, Wendy Butts and Stage Butts for Stage Pi). These devices are fetching this data anyway, so it is a simple matter to write it to a local file for fall-back purposes only. In normal operation these files would be retained for minutes, rather than hours, ie, they are overwritten each time a new dataset is obtained from the NAS Box.

2) In the event of a database failure each SAC would revert to its fall-back file initially while it switches to fail-over mode. In this mode, the old sockets code is used to read the required data from the relevant SACs and Gate Valves and operation continues fairly seamlessly.

3) When the database error has been resolved, operation reverts to normal.
Makes a lot of sense.

5) Terry said:
Now we have one fully working implementation of this [device control] (the Lady Hanham SAC) and working but not yet deployed implementation (the Stage SAC), we are (rightly) getting worried about database failure.
The Hanham SAC does not currently control anything using the database. Are you confusing this with the Wendy Butts?

6) This is worth doing but will take significant time to create, test, and deploy. Does anyone have a spare USB stick we could plug in for a temporary automatic backup to USB? I have some old 128MB ones that are free, but I don't trust them. Also, they would fill up very quickly.

Backup thread is here: viewtopic.php?f=36&t=217&start=20

7) As above with the time: I'll be starting up university again on the 3rd of October. This is a staggered start, with the second module (the Project) beginning in February, so I will have time to work on this (hopefully). However, I will also need to focus on my Project planning and the exam revision for the first module. How much we can get done will depend on how much you two can do as well :)

I'm obviously keen to get this implemented, but I do have limited time, so we need to prioritise carefully what order to do things in. I think merging, testing, and deploying Patrick's code is the first priority here, perhaps shared with setting up a simple USB database backup.

Apologies for the long post, and if I've forgotten anything.
Hamish
TerryJC
Posts: 2616
Joined: 16/05/2017, 17:17

Re: How to handle database failure

Post by TerryJC »

hamishmb wrote: 12/09/2020, 19:57It was previously my understanding that we would no longer use the sockets way of transferring readings, and only use them for time-sensitive communications (like the system ticks). If this was the grand plan then I'm perhaps slightly annoyed at being confused, but I admit the idea makes sense. I maybe wouldn't have removed so much of the old sockets logic, but it's still in the old pre-merge (use-database branch) commits, so no matter.
The grand plan was to effectively do away with sockets for all but the system tick. It's only just occurred to me as a result of this discussion that we could utilise the sockets code to solve this relatively newly identified problem.
hamishmb wrote: 12/09/2020, 19:575) Terry said:The Hanham SAC does not currently control anything using the database. Are you confusing this with the Wendy Butts?
No. Just a Senior Moment. :?
hamishmb wrote: 12/09/2020, 19:576) This is worth doing but will take significant time to create, test, and deploy. Does anyone have a spare USB stick we could plug in for a temporary automatic backup to USB? I have some old 128MB ones that are free, but I don't trust them. Also they would fill up very quickly.
I have a fairly big one somewhere that could be utilised on a temporary basis. If we want a permanent backup for the NAS Box, we should buy something (after discussion with the team and management).
hamishmb wrote: 12/09/2020, 19:577) As above with the time: I'll be starting up university again on the 3rd of October. This is a staggered start, with the second module (the Project) beginning in February, so I will have time to work on this (hopefully). However, I will also need to focus on my Project planning and the exam revision for the first module. How much we can get done will depend on how much you two can do as well :)

I'm obviously keen to get this implemented, but I do have limited time, so we need to prioritise carefully what order to do things in. I think merging, testing, and deploying Patrick's code is the first priority here, perhaps shared with setting up a simple USB database backup.
I agree. If we have the temporary backup in place then all this can be done as time allows.
Terry
hamishmb
Posts: 1891
Joined: 16/05/2017, 16:41

Re: How to handle database failure

Post by hamishmb »

TerryJC wrote: 13/09/2020, 6:27 The grand plan was to effectively do away with sockets for all but the system tick. It's only just occurred to me as a result of this discussion that we could utilise the sockets code to solve this relatively newly identified problem.
Seems fair, I'm glad it wasn't just me who hadn't noticed this problem.
TerryJC wrote: 13/09/2020, 6:27
hamishmb wrote: 12/09/2020, 19:576) This is worth doing but will take significant time to create, test, and deploy. Does anyone have a spare USB stick we could plug in for a temporary automatic backup to USB? I have some old 128MB ones that are free, but I don't trust them. Also they would fill up very quickly.
I have a fairly big one somewhere that could be utilised on a temporary basis. If we want a permanent backup for the NAS Box, we should buy something (after discussion with the team and management).
Yes, that'd be great for temporary use. If you can get it ready soon, I'll be happy to pick it up and plug it in. I'm just wary of the COVID crisis intensifying, so I'm keen to get all the physically-at-WMT stuff out of the way in case it suddenly gets much worse than it already is.
Hamish