Docs To Do:
- Check that using package.py to prepare installation/update tarballs is in the installation spec and/or user guide
- Check that there are no references to master and client/remote pis in any of the current documentation - hangover from before this was a distributed system.
- Change references to the system ID to site ID - more clear.
- What are the DNS names for the pis and where are they documented?
- Use "tear down" to refer to shutting down the river system software.
- This is to differentiate tearing down the river system software from shutting down Raspberry Pi OS.
In-Code Documentation To Do:
- Tests: Add test numbers in docstrings for all tests that are missing them.
- Make language consistent in code and docs:
- system id -> site id - Change DB field names, all else is done.
- Change in written documents too - ask Terry to do this.
- Asked:
- Remove reference to socketsmonitors in the code and docs (check if we might want them later first).
- Improve docstrings + docstring coverage.
- Don't bother specifying the object type in usage examples for methods.
- Use more C-style things, e.g. list<Socket>.
- Fix sphinx warnings.
- Control logic functions and controllogic.py: Refer to flowcharts and details written elsewhere in sw API docs (write them, for the logic I wrote).
Code To Do (0.12.1/0.13.0):
- Relicense as AGPL, need to read and offer download link somehow.
- Test sockets code really thoroughly with VMs including multiple disconnections and reconnections.
- Implement a basic CLI.
- Do dependency: silenceable print function.
- Will be a hybrid thing that runs after main.py, but can also be started with a separate script.
- Commands (case insensitive):
- HELP & ?
- SILENT & VERBOSE
- LEVEL G4:M0 = 400
- CONTROL SUMP:P0 = True/False
- CONTROL V4:V4 = 50
- Make sure to note to user that commands will potentially be overridden immediately by the control system at the end of the next reading interval.
- May need to remove some of Patrick's manual override code - file-based stuff maybe not needed with this improvement?
- If not, make sure to honor it in all relevant places.
- SILENT & VERBOSE will require using a custom print function. Also useful for unit tests to shut stuff up a bit.
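A minimal sketch of the silenceable print function described above, assuming a module-level flag that the SILENT and VERBOSE commands toggle. The names rprint, set_silent and handle_command are hypothetical and do not refer to existing code in the river system software.

```python
import builtins

# Hypothetical module-level flag; not part of the existing code.
_silent = False

def set_silent(silent):
    """Enable or disable output from rprint()."""
    global _silent
    _silent = bool(silent)

def rprint(*args, **kwargs):
    """Behave like print() unless output has been silenced."""
    if not _silent:
        builtins.print(*args, **kwargs)

def handle_command(command):
    """Handle the SILENT and VERBOSE commands (case insensitive)."""
    command = command.strip().upper()
    if command == "SILENT":
        set_silent(True)
    elif command == "VERBOSE":
        set_silent(False)
    else:
        raise ValueError("Unknown command: " + command)
```

Unit tests could call set_silent(True) in their setup to quieten output without patching print itself.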
- main.py:
- Try to shut down cleanly when there are unhandled errors.
- coretools.py:
- get_and_handle_new_reading():
- What to do if a fault is detected when fetching and logging and printing readings?
- MonitorLoad.run():
- Catch proper exceptions from psutil.
- controllogic.py:
- stagepi_control_logic:
- Make sure status strings are always short enough to fit in DB/edit db schema if needed (better option).
- Implement current_action status other than "None" by extending ControlStateABC with a currentAction member to be overridden by each state class.
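A rough sketch of the currentAction idea, assuming a class attribute that defaults to "None" and is overridden per state. SketchControlState stands in for ControlStateABC, whose real interface may differ. The getCurrentAction method and the example action string are illustrative only.

```python
from abc import ABC

class SketchControlState(ABC):
    """Stand-in for ControlStateABC; the real base class may differ."""

    # Default reported when a state class does not override it.
    currentAction = "None"

    def getCurrentAction(self):
        """Return the action string to store in the status record."""
        return self.currentAction

class SketchFillingState(SketchControlState):
    """Hypothetical state class overriding the default action."""

    currentAction = "Filling G4"  # Illustrative string only.
```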
- stagepilogic.py:
- StagePiReadingsParser:
- Handle sensor faults and failure to get readings in a more robust manner.
- Each g4x or g6x method should try to return a best guess result based on the sensor readings available, even if some are missing, contradictory or indicate faults. Only when none of the readings agree, or when none are available, should these methods raise ValueError.
- Implementing such a policy will require the initialiser to record the fault status of the readings, in addition to their values.
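One possible shape for the best-guess policy above, assuming each reading is recorded as a (value, fault) pair, with None for a missing value. The function name best_guess is hypothetical, and excluding faulty readings is one assumed interpretation, not a description of the existing parser.

```python
from collections import Counter

def best_guess(readings):
    """Return the most widely agreed value from (value, fault) pairs.

    A value of None means the reading is missing. Raises ValueError when no
    usable readings are available, or when no two usable readings agree.
    """
    # Assumption: readings flagged as faulty are ignored entirely.
    usable = [value for value, fault in readings
              if value is not None and not fault]

    if not usable:
        raise ValueError("No usable readings available")

    # A single surviving reading is the best guess we have.
    if len(usable) == 1:
        return usable[0]

    value, count = Counter(usable).most_common(1)[0]
    if count < 2:
        raise ValueError("No readings agree: " + repr(usable))

    return value
```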
- StagePiDeviceController:
- When the matrix pump is impl'd, add optional arguments for specifying the state of the matrix pump.
- When the matrix pump is impl'd, add matrix pump control and event logging here, as above for V12
- StagePiG4OverfilledState.controlDevices:
- When the matrix pump is implemented, instead of closing V12, we should check whether G6 is full. If it is, we should just close the valve. If it isn't full, then we should pump water the other way, from G4 to G6.
- StagePiG4FilledState.controlDevices:
- When the matrix pump is implemented, we need to request it to be turned off and release any lock this Pi holds on using the pump.
- StagePiG4VeryNearlyFilledState.controlDevices:
- When the matrix pump is implemented, we need to: (a) get the lock to control the pump before we open V12; and (b) either open the pump's valves to allow passive water flow, or start pumping downstream at a low rate.
- StagePiG4NearlyFilledState.controlDevices:
- When the matrix pump is implemented, we need to: (a) get the lock to control the pump before we open V12; and (b) either open the pump's valves to allow passive water flow, or start pumping downstream at a medium rate.
- StagePiG4FillingState.controlDevices:
- When the matrix pump is implemented, we need to: (a) get the lock to control the pump before we open V12; and (b) either open the pump's valves to allow passive water flow, or start pumping downstream at a high rate.
- StagePiG6EmptyState.controlDevices:
- When the matrix pump is implemented, we need to request it to be turned off and release any lock this Pi holds on using the pump.
- For my own interest: generate proper SLOC stats once done.
- Move all non-sitewide actions notes to this list.
- Fix site-wide updater.
- Improve error handling in noted places and retry downloads from the NAS box if needed.
- Get each pi to ACK or NACK the update through the socket to the NAS box when queried, or is that overkill?
- Or just signal the event that way too? Overkill?
- One way or another we need another way to confirm the shutdown/reboot/update actually happened, maybe sockets would be easier, and then NAS box can update status of the pi in the DB?
- Or wait for the client to close the socket after sending a final message indicating the action and waiting for NAS box ack of that message? Simpler.
- Also better for normal use as we can log why clients lost the connection, rather than just that they went offline.
- Abort the update if one or more pis doesn't seem to be updating?
- Keep update package around on NAS box with revision, so clients can check when they boot, and update themselves automatically if not a match?
- Would mean you always have to update the NAS box first, and then just rebooting the other pi's would trigger an update.
- ^ But updating just one pi would no longer be possible. Probably best in the long run.
- Would help mitigate the race condition (but try to diagnose and fix it first).
- To Check:
- Test if the river system software works somewhat with an empty database (no tables) - should do I think.
- Fix potential race condition in site-wide update logic (pis rebooted w/o updating)/diagnose issue - In Progress
- Logging added, need to do some investigation.
- Seems like sometimes pis reboot before the update is finished, but I don't know why yet.
- Make sure any old update archive is removed first?
- Find all the todos and fixmes and put them in a list (some are in Patrick's unit tests).
To think about:
- Post my workflow somewhere?
- Will include installing up-to-date pylint.
- Check if pipeline still always passes regardless of what happens.
- Eng GUI show pump states too - needs a monitor to be running, but simple enough.
- Eng GUI show more detailed status when NAS box is doing DB integrity check.
- Should DB integrity check be done as a scheduled task instead (run by rcs ideally)?
- Engineer GUI replacement notes:
- Could do:
- Add a health indicator for each pi, based on:
- Whether any issues were posted to the event log in the last day.
- The last date of communication.
- Use the "Software Status" db column for.
- High CPU load/abnormal RAM usage.
- Any sockets problems.
- Any monitor crashes.
- Unit tests:
- Use mock like Patrick did?
- pylint them? Code quality here is still important.
- The CI tool is doing this, so we should be as well.
- Make them execute faster somehow, nice to have.
- Add a global queue of errors to report in the event log to be picked up by the DB thread.
- Report an event if CPU usage or load averages are abnormally high.
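A minimal sketch of the global error queue, assuming any thread may add events and the DB thread periodically drains them into the event log. EVENT_QUEUE, report_event and drain_events are hypothetical names.

```python
import queue

# Hypothetical global queue shared by all threads; queue.Queue is thread-safe.
EVENT_QUEUE = queue.Queue()

def report_event(severity, message):
    """Queue an event for the DB thread to write to the event log."""
    EVENT_QUEUE.put((severity, message))

def drain_events():
    """Return all queued events in order, leaving the queue empty.

    Intended to be called from the DB thread on each pass of its loop.
    """
    events = []
    while True:
        try:
            events.append(EVENT_QUEUE.get_nowait())
        except queue.Empty:
            return events
```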
- NAS scripts: Write very simple scripts to go under /rcs-scripts/ to shutdown, shutdownall, reboot, rebootall, update the system?
- Would really just run one command to create a file in /tmp with touch - worth doing?
- Remove tests under Testing/Software/SystemTests.
- These features are tested more thoroughly in the unit tests now, rendering these tests pointless.
- Report CPU temperature of Pis too (see NAS box logic implementation).
- Convenience:
- Make site id commandline option case-insensitive (always capitalise the input).
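A short sketch of the case-insensitive site ID option, assuming argparse is used for option parsing. The --id option name and the parse_site_id helper are assumptions for illustration. The real option name may differ.

```python
import argparse

def parse_site_id(argv):
    """Parse the site ID from argv, always converting it to upper case."""
    parser = argparse.ArgumentParser()

    # type=str.upper makes argparse upper-case the value as it is parsed.
    parser.add_argument("-i", "--id", dest="site_id", type=str.upper,
                        required=True, help="The site ID (case insensitive)")

    return parser.parse_args(argv).site_id
```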
- Improve logging statements to use the class and method/function names more.
- Perhaps this can be done with some extra logging module configuration?
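As a sketch of the logging configuration idea: the standard logging module can insert the function or method name through the %(funcName)s format attribute, but it has no attribute for the class name. One workaround, assumed here, is to name each logger after its class. LOG_FORMAT, get_class_logger and ExampleDevice are hypothetical.

```python
import logging

# %(name)s is the logger name and %(funcName)s is the calling function/method.
LOG_FORMAT = "%(asctime)s - %(name)s.%(funcName)s() - %(levelname)s: %(message)s"

def get_class_logger(cls):
    """Return a logger named after the class's module and qualified name."""
    return logging.getLogger(cls.__module__ + "." + cls.__qualname__)

class ExampleDevice:
    """Hypothetical class, used only to demonstrate the logger naming."""

    def __init__(self):
        self.logger = get_class_logger(type(self))

    def get_reading(self):
        # No need to write the class or method name in the message itself.
        self.logger.info("Fetching reading")
        return 0
```

The format string would be applied once, for example with logging.basicConfig(format=LOG_FORMAT, level=logging.INFO).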
- Note how to run tests in docs somewhere?
- At least point to readme.md, or include it?
- Raise some more specific user-defined errors rather than just RuntimeError.
- This can be seen in some of Patrick's code, for example.
- Switch to running pipeline on bullseye before moving the pis over to that.
- Run both each time for now?
- Use pylint to tidy up unit test code too?
- If magnetic probe levels don't change at all for a very long time, report this in the event log as a potential hardware issue.
- Add new option --no-readings-files to disable writing the readings files to disk (useful for testing, esp if we go RO at some point).
- All documents: switch to Liberation Sans as it is more accessible to those with dyslexia. I personally also find it much easier to read.
- Make sure this gets into the installation spec, possibly in an annex at the end?: https://wmtprojectsforum.altervista.org/forum/viewtopic.php?p=5887#p5887
- Installation spec link to SW API docs at top.
- Analyse voltage data from ADCs to figure out why we get negative valve positions sometimes.
- Implement a watchdog soon.
- (this would auto-restart the software in the event of a crash).
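A very rough sketch of the watchdog idea, assuming the software is launched as a child process and restarted whenever it exits with a non-zero return code. The function name and the restart limit are hypothetical. A service manager could do the same job instead.

```python
import subprocess
import time

def run_with_watchdog(command, max_restarts=3, delay=1.0):
    """Run command, restarting it each time it exits with an error.

    Returns the number of restarts that were performed. Stops when the
    command exits cleanly (return code 0) or max_restarts is reached.
    """
    restarts = 0

    while True:
        return_code = subprocess.call(command)

        if return_code == 0 or restarts >= max_restarts:
            return restarts

        restarts += 1
        time.sleep(delay)  # Brief pause so a crash loop doesn't hog the CPU.
```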
- deviceobjects.py:
- General: Throw errors if setup hasn't been completed properly.
- BaseDeviceClass.set_pins():
- Check if these pins are already in use - throw error if so.
- Check if these are valid I/O pins.
- Allow specifying both input and output pins (future proofing).
- Motor.set_pwm_available():
- Do a hardware check to determine if PWM is available.
- ^ If so, check if the PWM pin is valid and not in use.
- Motor.get_reading():
- FloatSwitch.get_reading():
- HallEffectDevice.get_reading():
- Rename class to WaterWheel?
- Do fault checking?
- HallEffectProbe.get_reading():
- Do fault checking - should have enough info eg if things are outside the limits?
- Fault checking should maybe be done by the mgmt thread while taking voltage readings, to avoid inaccuracies and stalling main thread execution.
- devicemanagement.py:
- General: Throw errors if setup hasn't been completed properly.
- ManageHallEffectProbe.get_level():
- The "Possibly faulty probe - no limits passed" message fills up the log fairly quickly. It is not an indication of a fault, so remove it.
- Instead, could log if multiple levels are being shown as active.
- ManageGateValve._get_position():
- Figure out why we sometimes get -1 readings - ADS errors?
- controllogic.py:
- General: Don't try to pump water if we can't control a valve - might break valve or pump!
- wbuttspi_control_logic:
- Do we always want the bypass valve open during the day?
- wbuttspi_backup_logic:
- nas_control_logic:
- Write code to check for and free expired locks if needed (future proofing).
- valve_control_logic:
- Decide how to set reading interval rather than just defaulting to 15 seconds.
- Seems to work well - maybe just leave as is?
- generic_control_logic:
- Decide how to set reading interval rather than just defaulting to 15 seconds.
- wbuttspi_control_logic:
- Decide how to set reading interval rather than just defaulting to 15 seconds.
- coretools.py SyncTime:
- rdate often doesn't produce output - make error reporting better so we know what happened, at least include return code.
- Check if peer is up before trying to sync time - avoids long timeout periods.
- statetools.py ControlStateMachineABC.setState:
- todo: throw a more meaningful exception if there is no matching state in self.states. (currently will throw an index out of bounds exception.)
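A sketch of what the more meaningful exception might look like, assuming self.states is a list of state objects and the lookup matches on the state's class. SketchStateMachine stands in for ControlStateMachineABC, whose real implementation may differ. UnknownStateError is a hypothetical name.

```python
class UnknownStateError(ValueError):
    """Raised when a requested state is not registered (hypothetical name)."""

class SketchStateMachine:
    """Stand-in for ControlStateMachineABC; the real class may differ."""

    def __init__(self, states):
        self.states = list(states)
        self.state = None

    def setState(self, state_class):
        """Switch to the registered state that is an instance of state_class."""
        matches = [s for s in self.states if isinstance(s, state_class)]

        # Check explicitly, rather than letting matches[0] raise IndexError.
        if not matches:
            known = ", ".join(type(s).__name__ for s in self.states)
            raise UnknownStateError(
                "No state of type {} is registered (known states: {})".format(
                    state_class.__name__, known))

        self.state = matches[0]
```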
- monitortools.py:
- Do we use SocketsMonitor? Maybe remove it? Or keep for potential future use?
- Just add a note saying it was replaced with the database?
To think about:
- Rotation:
- Could also compress files.
- Useful for slow VPN connection, easy to do post-merge.
- Why are the gate valve readings sometimes negative values?
- These are ignored now, but it's still weird.
- Event log: Any site-local faults detected should go in here too.
- Water wheel (HallEffectDevice):
- RPM data for flow rate at water wheel is available (where?).
- Can't use it till we stop newts from getting stuck in there.
- ^ Mesh?
Ideas:
- Assume all butts are at 400mm by default, so any water in them will be used upon reboot, and we are more likely to get proper readings.
- Make sumppi (and maybe other pis) generate nice graphs for us when the reading files have been rotated.
- coretools.get_and_handle_new_reading():
- What to do if a fault is detected?
- Currently no code here to do anything if it does happen!