Spontaneous disconnection of the delta stage

Hey everyone,

I built a delta stage microscope. It is a really awesome device. However, after some time of use the stage stops moving, and when I reload the webpage it says “Stage positioning is disabled since no stage is connected.” When I restart the microscope, everything works normally again for some time…

I am using a Raspberry Pi 4 with a Sangaboard v0.5.1 (@filip.ayazi).

Did anyone experience similar issues? Could it be caused by motors getting hot?

Looking forward to any ideas on the issue

I get the following logs in journalctl for the openflexure-microscope-server.service:

Jul 07 17:12:22 openflexurepi python[610]: ERROR:root:Traceback (most recent call last):
Jul 07 17:12:22 openflexurepi python[610]:   File "/var/openflexure/application/openflexure-microscope-server/.venv/lib/python3.7/site-packages/labthings/actions/thread.py", line 255, in wrapped
Jul 07 17:12:22 openflexurepi python[610]:     self._return_value = f(*args, **kwargs)
Jul 07 17:12:22 openflexurepi python[610]:   File "/var/openflexure/application/openflexure-microscope-server/.venv/lib/python3.7/site-packages/flask/ctx.py", line 158, in wrapper
Jul 07 17:12:22 openflexurepi python[610]:     return f(*args, **kwargs)
Jul 07 17:12:22 openflexurepi python[610]:   File "/var/openflexure/application/openflexure-microscope-server/.venv/lib/python3.7/site-packages/webargs/core.py", line 594, in wrapper
Jul 07 17:12:22 openflexurepi python[610]:     return func(*args, **kwargs)
Jul 07 17:12:22 openflexurepi python[610]:   File "/var/openflexure/application/openflexure-microscope-server/openflexure_microscope/api/v2/views/actions/stage.py", line 47, in post
Jul 07 17:12:22 openflexurepi python[610]:     with microscope.stage.lock(timeout=1):
Jul 07 17:12:22 openflexurepi python[610]:   File "/usr/lib/python3.7/contextlib.py", line 112, in __enter__
Jul 07 17:12:22 openflexurepi python[610]:     return next(self.gen)
Jul 07 17:12:22 openflexurepi python[610]:   File "/var/openflexure/application/openflexure-microscope-server/.venv/lib/python3.7/site-packages/labthings/sync/lock.py", line 55, in __call__
Jul 07 17:12:22 openflexurepi python[610]:     result = self.acquire(timeout=timeout, blocking=blocking)
Jul 07 17:12:22 openflexurepi python[610]:   File "/var/openflexure/application/openflexure-microscope-server/.venv/lib/python3.7/site-packages/labthings/sync/lock.py", line 83, in acquire
Jul 07 17:12:22 openflexurepi python[610]:     raise LockError("ACQUIRE_ERROR", self)
Jul 07 17:12:22 openflexurepi python[610]: labthings.sync.lock.LockError: ACQUIRE_ERROR: LOCK <labthings.sync.lock.StrictLock object at 0xb3a88bf0>: Unable to acquire. Lock in use by another thread.
Jul 07 17:12:26 openflexurepi python[610]: Exception in thread Thread-888:
Jul 07 17:12:26 openflexurepi python[610]: Traceback (most recent call last):
Jul 07 17:12:26 openflexurepi python[610]:   File "/usr/lib/python3.7/threading.py", line 917, in _bootstrap_inner
Jul 07 17:12:26 openflexurepi python[610]:     self.run()
Jul 07 17:12:26 openflexurepi python[610]:   File "/var/openflexure/application/openflexure-microscope-server/.venv/lib/python3.7/site-packages/labthings/actions/thread.py", line 224, in run
Jul 07 17:12:26 openflexurepi python[610]:     self._thread_proc(self._target)(*self._args, **self._kwargs)
Jul 07 17:12:26 openflexurepi python[610]:   File "/var/openflexure/application/openflexure-microscope-server/.venv/lib/python3.7/site-packages/labthings/actions/thread.py", line 277, in wrapped
Jul 07 17:12:26 openflexurepi python[610]:     raise e
Jul 07 17:12:26 openflexurepi python[610]:   File "/var/openflexure/application/openflexure-microscope-server/.venv/lib/python3.7/site-packages/labthings/actions/thread.py", line 255, in wrapped
Jul 07 17:12:26 openflexurepi python[610]:     self._return_value = f(*args, **kwargs)
Jul 07 17:12:26 openflexurepi python[610]:   File "/var/openflexure/application/openflexure-microscope-server/.venv/lib/python3.7/site-packages/flask/ctx.py", line 158, in wrapper
Jul 07 17:12:26 openflexurepi python[610]:     return f(*args, **kwargs)
Jul 07 17:12:26 openflexurepi python[610]:   File "/var/openflexure/application/openflexure-microscope-server/.venv/lib/python3.7/site-packages/webargs/core.py", line 594, in wrapper
Jul 07 17:12:26 openflexurepi python[610]:     return func(*args, **kwargs)
Jul 07 17:12:26 openflexurepi python[610]:   File "/var/openflexure/application/openflexure-microscope-server/openflexure_microscope/api/v2/views/actions/stage.py", line 47, in post
Jul 07 17:12:26 openflexurepi python[610]:     with microscope.stage.lock(timeout=1):
Jul 07 17:12:26 openflexurepi python[610]:   File "/usr/lib/python3.7/contextlib.py", line 112, in __enter__
Jul 07 17:12:26 openflexurepi python[610]:     return next(self.gen)
Jul 07 17:12:26 openflexurepi python[610]:   File "/var/openflexure/application/openflexure-microscope-server/.venv/lib/python3.7/site-packages/labthings/sync/lock.py", line 55, in __call__
Jul 07 17:12:26 openflexurepi python[610]:     result = self.acquire(timeout=timeout, blocking=blocking)
Jul 07 17:12:26 openflexurepi python[610]:   File "/var/openflexure/application/openflexure-microscope-server/.venv/lib/python3.7/site-packages/labthings/sync/lock.py", line 83, in acquire
Jul 07 17:12:26 openflexurepi python[610]:     raise LockError("ACQUIRE_ERROR", self)
Jul 07 17:12:26 openflexurepi python[610]: labthings.sync.lock.LockError: ACQUIRE_ERROR: LOCK <labthings.sync.lock.StrictLock object at 0xb3a88bf0>: Unable to acquire. Lock in use by another thread.

I don’t have any good ideas for you, unfortunately. However, it is unlikely to be a result of the motors getting hot: the software sends signals to energise the motor coils in a sequence to move, but it has no way of knowing anything about what the motors are actually doing. That said, if the motors are getting very hot, that would mean they are drawing a lot of current, which might affect the power supply to other components. How are you powering the microscope? If you are powering the Pi via the Sangaboard 0.5, it should be able to cope. Powering it the other way, by connecting the power to the Pi, would be more likely to give problems. Either way, supply problems would show up as low-voltage warnings on the Pi desktop, and the symptoms do not sound like that.

As @WilliamW said, this does seem like a power issue. If you are connecting to the microscope remotely, you can check for undervoltage warnings in the output of dmesg.
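For example, a quick way to scan for these events could look like the sketch below. Note that the exact kernel message text varies between kernel versions, and `find_undervoltage_lines` is a made-up helper name:

```python
# Scan dmesg output for Raspberry Pi under-voltage events.
# NOTE: the exact message text varies between kernel versions;
# "Under-voltage detected!" is the common variant on recent Pi kernels.

def find_undervoltage_lines(dmesg_text):
    """Return all dmesg lines that mention an under-voltage condition."""
    keywords = ("under-voltage", "undervoltage")
    return [line for line in dmesg_text.splitlines()
            if any(k in line.lower() for k in keywords)]

# Usage on the Pi (uncomment):
# import subprocess
# out = subprocess.run(["dmesg"], capture_output=True, text=True).stdout
# for line in find_undervoltage_lines(out):
#     print(line)
```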

Thanks for your feedback :star_struck:

I was powering the microscope via the Sangaboard, but mostly used a 5 V 2.1 A (~10 W) power supply. I also used my laptop power supply (60 W), but have now figured out that at 5 V it also only provides 2 A. Using the dmesg command, I could see that the Pi was running into undervoltage. Thanks @filip.ayazi :+1:

I am now testing with a 5 V 3 A power supply and will report whether I still see disconnections of the stage or undervoltage.


Unfortunately, the problem still occurs after a while (2–15 minutes).

With the new power supply there are no more undervoltage events reported, and I also could not find the traceback from my first post again.

Also, when I restart the Pi (sudo reboot), it no longer reports the problem with the stage, but the coordinates are zeroed and the stage does not move. That’s why I think the Sangaboard somehow crashes :thinking:

I have another Sangaboard and also different motors, so I will try swapping components to rule out a simple hardware problem.

However, I am still a bit puzzled why it works so nicely in the beginning and then randomly disconnects. @filip.ayazi, are there Sangaboard logs that I could share or have a look at?

That is very strange. There are no logs produced directly by the Sangaboard. Are you using the release build of the microscope software or the Sangaboard v5-specific one? Does it detect the stage after restarting the Pi (there should be messages in ofm journal about checking the firmware version and that a stage was found)?

After the stage hangs, can you try running minicom -D /dev/ttyAMA0 (or the port where ofm journal shows it found the stage)? That will give you direct communication with the stage. You can try sending help there to see if it responds, or mr 2000 2000 200 to move the 3 axes. (minicom doesn’t buffer input; it sends whatever you type directly, so you won’t see your input, only the response. To exit, use Ctrl+A followed by Q.) Also using minicom, when the stage is working, can you try blocking_moves? It should return blocking_moves true.

If it can still communicate but the motors don’t move, it might be a hardware issue with the motors. If they try to draw more than 1 A continuously, it will trigger a resettable fuse on the board, cutting power to the overcurrent motors. That might result in the motors stopping one by one; they would then start cooling down, which is one way of checking whether this is the case without a multimeter.
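If it helps, the same probes could be scripted instead of typed into minicom. A rough sketch (assumes the pyserial package; the 115200 baud rate is a guess, and build_command/probe are made-up helper names):

```python
# Sketch: talk to the Sangaboard over serial from a script instead of minicom.
# ASSUMPTIONS: pyserial is installed; the baud rate is a guess; and
# build_command/probe are made-up helper names for illustration.

def build_command(name, *args):
    """Frame a Sangaboard serial command as a newline-terminated line."""
    return " ".join([name, *map(str, args)]) + "\n"

def probe(port, name, *args):
    """Send one command and return the first response line, stripped."""
    port.write(build_command(name, *args).encode())
    return port.readline().decode().strip()

# Usage on the Pi (uncomment; requires `pip install pyserial`):
# import serial
# with serial.Serial("/dev/ttyAMA0", 115200, timeout=2) as s:
#     print(probe(s, "blocking_moves?"))       # expect: blocking_moves true
#     print(probe(s, "mr", 2000, 2000, 200))
```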

I used the OpenFlexure Lite Raspberry Pi image. I have now also activated the Sangaboard plugin, found that a beta firmware was installed on the board, and installed the release version. But the error remains.

I tried the minicom commands:
blocking_moves? returns blocking_moves true when the stage is working

I also did some further experiments and found that the board works perfectly fine in the non-delta-stage setup. Then I switched to the delta stage in the stage geometry settings, and shortly after, the error occurred again. I thought it could be caused by all three motors moving at the same time, but after switching the setting back to the non-delta stage geometry and performing movements on all three axes at the same time, I could not reproduce the error :thinking:

Could it be that the motors need to apply higher forces in the delta stage setup and therefore need more current?

When the movement stops, the minicom commands do not return anything: neither help nor the movement command mr 2000 2000 200. So I suppose the whole board is down.

The difference between the delta and the regular stage is very strange, but if the board crashes with no response, it seems to be a firmware bug; I’ll see if I can find a way to reproduce it. In the meantime, can you please check whether just resetting the board with the RST button near the LEDs fixes it (if the button is unreachable, use the software method below instead)? The board reset is connected to the Raspberry Pi GPIO, so it can also be reset from the Pi by pulling GPIO23 low, e.g.:

raspi-gpio set 23 op
raspi-gpio set 23 dl
sleep 1
raspi-gpio set 23 dh

If that works, it might be a temporary workaround while the core issue is investigated.


First, I really appreciate your work @filip.ayazi, and thanks a lot for your invested time :+1:

For reproducing the issue: I usually ran a script driving an x-y square of 2000 motor steps, 100 times. This always led to the problem.

The software reset works fine. I needed to add an ofm restart for the stage to reappear in the UI.
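For reference, the reset-plus-restart workaround could be scripted in one go. This is only a sketch: the raspi-gpio sequence and ofm restart are taken from the posts above, and reset_sangaboard is a made-up helper name:

```python
# Sketch: pulse the Sangaboard reset line (GPIO23) and restart the OFM
# server, mirroring the raspi-gpio sequence and `ofm restart` from the
# posts above. reset_sangaboard is a made-up helper name.
import subprocess
import time

RESET_STEPS = [
    ["raspi-gpio", "set", "23", "op"],  # configure GPIO23 as an output
    ["raspi-gpio", "set", "23", "dl"],  # drive low: hold the board in reset
]
RELEASE_STEPS = [
    ["raspi-gpio", "set", "23", "dh"],  # drive high: release reset
    ["ofm", "restart"],                 # so the stage reappears in the UI
]

def reset_sangaboard(run=subprocess.check_call, hold_s=1.0):
    """Pulse the reset line, wait, then release it and restart the server."""
    for cmd in RESET_STEPS:
        run(cmd)
    time.sleep(hold_s)
    for cmd in RELEASE_STEPS:
        run(cmd)

# Usage on the Pi (uncomment):
# reset_sangaboard()
```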

I also tried running the microscope with only 1 or 2 motors attached to the Sangaboard. Again, I see the Sangaboard stop working in my test case. I hope that rules out problems with the overall power supply.

This is very strange. The error you posted above (about a “lock”) is a downstream symptom of the same thing: most likely, communication with the board hung before a query (i.e. sending a message and receiving a response) completed. I am guessing Filip is more likely to solve this than I am; it sounds like you are ruling out most of the likely causes… If you run it with no motors plugged in, do you still see the error? The board should not know whether the motors are disconnected…

If you connect the Sangaboard over USB rather than as a HAT, it would be interesting to see whether you get the same problem (Filip, I’m not sure what the best way is to achieve that?). That said, I’m not 100% sure what that would actually tell us…
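To illustrate why a hung serial query surfaces as that LockError: the stage lock is held for the duration of a move, so if the board never responds, every later request times out trying to acquire it. A minimal stand-in (a plain threading.Lock, not the actual labthings StrictLock):

```python
# Demonstrate the downstream symptom: a thread stuck mid-"query" holds
# the stage lock, so later acquisitions time out, as in the traceback.
import threading
import time

stage_lock = threading.Lock()
board_replied = threading.Event()    # never set: simulates a hung board

def hung_move():
    with stage_lock:                 # lock held for the whole "move"
        board_replied.wait(timeout=5)  # waiting for a reply that never comes

t = threading.Thread(target=hung_move, daemon=True)
t.start()
time.sleep(0.1)                      # let hung_move grab the lock

# This mirrors `with microscope.stage.lock(timeout=1)` failing:
acquired = stage_lock.acquire(timeout=0.2)
print("acquired:", acquired)         # -> acquired: False
```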

I can confirm that it also happens with no motors attached at all.

Thank you for the detailed report and your assistance debugging this! I’m trying to reproduce it, but no luck so far by directly talking to the stage with moves [2000 1320 -1234] and back. I’m trying the same while running some CPU-intensive tasks next, to see if I can trigger it that way; I’ll try going through the full OFM software after that.

With no motors it should be possible to just connect the board over USB and power the Pi directly. It should recognise the board on the USB serial connection; it would be very useful to know whether you can trigger this over USB.

I think I got a little bit closer to the problem: I started digging in the source code (:heart: open source) and managed to crash the board’s firmware.

The crash happens when I send incomplete commands, e.g. mr 0 0 (last axis missing), through the serial shell from minicom -D /dev/ttyAMA0. It seems the char pointer for the last argument is left uninitialized, so accessing it most likely causes an invalid memory access (see Blame · src/main.cpp · master · Filip Ayazi / sangaboard-firmware · GitLab).

I added a safety check to the firmware so that commands are skipped when the argument count is not exactly as expected (see my fork). I also added a safety check to the pysangaboard library: it now raises an exception if anything other than the default stopped response is returned for a move command (see my fork).

With the fixes in place, it seems that the board no longer crashes :partying_face: However, I also do not get any exception raised by the pysangaboard library (which I hoped would turn up in journalctl -u openflexure-microscope-server.service). Thus, I cannot explain why, or even whether, incomplete movement commands were sent to the board in the first place :see_no_evil: It probably needs some further testing; I’ll do an overnight run to see whether I can get the forked firmware to crash…
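The library-side check can be sketched roughly like this (illustrative only: "done." stands in for the real expected reply, and check_move_response/BoardResponseError are made-up names, not the actual pysangaboard API):

```python
# Sketch of the pysangaboard-side safety check described above: raise if a
# move command returns anything other than the expected terminating reply.
# NOTE: "done." is a placeholder for the real reply; check_move_response and
# BoardResponseError are made-up names, not the actual pysangaboard API.

class BoardResponseError(RuntimeError):
    """The board replied with something unexpected (or nothing at all)."""

def check_move_response(response, expected="done."):
    if response.strip() != expected:
        raise BoardResponseError(
            "unexpected move response {!r} (expected {!r})".format(response, expected)
        )
    return response
```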

Amazing work, thank you! That is really promising! Even if it wasn’t the core cause, a hard crash on an incomplete command is definitely a bug.

I still haven’t been able to reproduce this after hundreds of cycles through the OFM software, but I did notice that the delta stage sends float positions (this should probably be changed in DeltaStage); as a result, it sends really long commands, e.g.
mr -1086.6666666666667 -1086.6666666666667 2133.3333333333335
The values should just fit the 20-character buffer used in argument parsing, but I’ll increase it by one in case the string is longer for any reason (or maybe if there’s a \n in the string or something).

The situation where communication gets interrupted and the board doesn’t get the full command is more likely with the very long floats being sent, so that might still explain why the problem occurs with DeltaStage and not with the cartesian stage (or is at least much less likely with the cartesian stage). There is a 1 s timeout in ArduinoCore-API/api/Stream.cpp which could cause this situation, but if I’m reading it correctly that is 1 s between characters being read, which seems like a lot (it might fail once every 50 days or so when millis overflows, but I wouldn’t expect the serial connection to be this terrible regularly).
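For a sense of scale, here are the lengths involved, computed from the example command above (the 20-character buffer figure is from the post):

```python
# Show why the delta stage's float coordinates stress the firmware's
# 20-character argument buffer (buffer size taken from the post above).
ARG_BUFFER = 20

float_cmd = "mr -1086.6666666666667 -1086.6666666666667 2133.3333333333335"
int_cmd = "mr -1087 -1087 2133"

arg_lengths = [len(a) for a in float_cmd.split()[1:]]
print(arg_lengths)                   # [19, 19, 18] -- just under the buffer
print(len(float_cmd), len(int_cmd))  # 61 vs 19 characters on the wire
```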

I’ve managed to reproduce this! With some accelerated “moves”, it crashed after ~1500 moves using the same command as sent by DeltaStage.

This does indeed seem to address the core issue, where occasionally part of the serial message seems to get lost somehow (in my testing, 4 out of 20,000 moves trigger the protection). I tried expanding the FIFO and changing the argument buffer size, but it didn’t change anything. It is almost certainly related to the length of the command: using shorter commands (i.e. without a ridiculous number of decimal places) avoids this issue, which is why it only happens for the delta stage.
I’ll dig into this further, but for now a solution along the lines of what you implemented will be the best option. The firmware will need to stay backwards compatible with pySangaboard, so it will have to return a single-line response to an invalid number of arguments, probably from within stage itself, but otherwise it will be the same.

Since removing the floats gets rid of the issue, I suggest converting the movement coordinates to integers (see merge request). This is already done for single-axis movement, and I suppose fractional motor steps are not useful anyway (correct me if I am wrong).
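The conversion itself is tiny; something along these lines (a sketch of the idea, not the actual merge request code, and frame_move is a made-up name):

```python
# Round delta-stage coordinates to whole motor steps before framing the
# serial command; frame_move is a made-up name sketching the idea.
def frame_move(x, y, z):
    return "mr {} {} {}".format(round(x), round(y), round(z))

print(frame_move(-1086.6666666666667, -1086.6666666666667, 2133.3333333333335))
# -> mr -1087 -1087 2133
```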


@filip.ayazi Are you okay with the code for checking the number of arguments (see merge request)? I can add this for every command that expects a fixed number of arguments.