MicroManager - serial port failure when saving/opening configuration file

MicroManager is giving an error on the serial port when I save a configuration file or open another configuration file. The error is coming from the standard serial port manager, not from device adaptors.

Error: Failed to load hardware configuration
Line 28: Property, Core, Iniitalize, 1
Error in device “COM5”: Unknown error in the device (1)

System:-

Windows 10
MM installation as of 2021-04-20 for all DLLs
Git code from head as of this morning (08 Jun)
MMCore rebuilt for debugging and copied across to MM installation, so that I can debug from Visual C++
Running/debugging MM Studio using Netbeans

Sequence of events:-

  1. Load a configuration file with a device using the serial port.
  2. Save the configuration file, or load another configuration file with that device and serial port.
  3. In SerialPort::OpenWin32SerialPort(), the call to CreateFileA() fails with an invalid handle and GetLastError() returning 5. This produces the error message above, (Happens roughly 95% of the time on my build setup.)
  4. Load another configuration file. This time it usually works.

But…

  1. Breakpoint SerialPort::Shutdown() anywhere, and attach Visual C++ for debugging.
    1-2) As above, and hit “continue” in Visual C++ when it hits the breakpoint in SerialPort::Shutdown().
  2. Everything works first time.

This suggests to me that there’s a timing/race condition between the serial port shutting down from the initial configuration and starting back up again with the new configuration. I suspect this is related to the Boost AsioClient implementation, as well as the SerialPort::Shutdown() code which simply uses a “SleepMs(100);” call and assumes that’s long enough for the port to be shut down. I can’t see anything in there which actually checks (using the Win32 API) that the port is genuinely unused.

I’m seeing this during development of Prior PureFocus support. We also have a client in China though who’s seeing the same issue with standard Prior hardware and the existing (unchanged) Prior device adaptors. I should note that my VM is (intentionally) fairly underpowered, and I suspect our Chinese client is also not on a latest-and-greatest PC, which naturally will tend to show up timing issues more.

Has anyone else reported this before?

Thanks for the report and the detailed analysis!

The next part of the code in Shutdown does a timed_join on the thread running the asio service, which I think is supposed to indirectly wait for the Close to finish, for up to 1.1 seconds. (The thread is running the io_service's run(), which “blocks until all work has finished and there are no more handlers to be dispatched, or until the io_service has been stopped” according to the Boost 1.47.0 docs.) So one question would be whether you see the log entry “Failed to cleanly close port (thread join timed out)”.

And if yes, does increasing the timed join limit from 1000 ms help? If not, does increasing the sleep time before pPort_->Close() actually make a difference?

Thanks @MarkTsuchida. I don’t get that log entry - it seems that as far as Boost is concerned, it closed OK.

If I change the CDeviceUtils::SleepMs(100) immediately before the timed_join to be CDeviceUtils::SleepMs(1000), everything works fine. (Except that of course it takes 900ms longer to load a configuration file.) Returning it to 100 produces the fault again.

I’m reluctant to call this a solution - at best it’s a klugey workaround. It tends to reinforce my hypothesis that it’s timing-related, in which case an unlucky break on a bunch of background stuff deciding to run at the wrong time could still cause us grief. (I’m an embedded engineer, so I inherently distrust things which mostly work. Give me handshaking or give me death. :wink:

Is there some other synchronisation which the Serial Manager is missing? Or is it something which later versions of Boost have fixed? I don’t really know Boost, so I wouldn’t know where to go looking.

On a selfish level, I don’t want to spend a lot of time going down this rabbit hole, because I have a plugin to finish, and my boss will start to ask what I’m doing with my time. :wink: I’m reluctant to put forward changes to core code anyway without fully having a handle on the whole thing. Can I leave this with you, and ping me if there’s something you want me to check?

I should also mention that this is using an FTDI USB-to-RS232 chip for the serial link. I’ve seen significant comms latencies with these devices which I don’t get with a “real” serial port, although I don’t know if the same also applies to opening and closing the port. I don’t have access to a PC with a real serial port without taking my kit back to the office, so I can’t easily check whether that’s relevant. It’s still worth noting for anyone trying to reproduce it.

Sounds good. I didn’t mean to suggest increasing the sleep as a solution; I just wanted to probe the symptoms. What you observe does suggests that indeed something is not closing correctly, or not waiting for it completely, even when the thread is joined, yet it does eventually close within a second, and that’s useful to know. (And a hack such as waiting 50-100 ms after the join might possibly help if we can’t get to the bottom of the problem.)

I created a ticket for this: SerialManager fails to reopen just-"closed" port · Issue #1254 · micro-manager/micro-manager · GitHub and will try to investigate.

As for the latencies with the FTDI chip: does this document shed any light? https://www.ftdichip.com/Documents/AppNotes/AN232B-04_DataLatencyFlow.pdf (especially section 3.1)

Thanks Mark. Yes, the reasons for extra data latency on USB-to-serial chips are fairly clear. I thought I should mention it though in case there were issues on opening/closing which are specific to USB-to-serial devices, otherwise someone working with a “real” serial port might report they can’t reproduce it.

Good luck with this one - issues like this are always a pain to figure out!

This is hopefully fixed now (details at the GitHub issue).

Thanks Mark, and your writeup on your investigation makes fascinating reading. That’s a really nasty one! As much as I dislike sleeping and hoping it’s long enough, I completely agree that there’s no obvious other way out of a driver/kernel bug.

Glad you liked it. Serial ports are probably more sensitive to this kind of issue because they are the first devices to be initialized and the last to be shut down in most real config files.