Data now reliably segfaults in visgui under py3 - how best to debug?

I have a dataset (github link) with an associated recipe (github link) that runs fine on py27 builds but reliably segfaults with the py37 build. This is all on macOS. There is no revealing error message.

What may be the best strategy to debug?

Many thanks for tips.

CS

Segfaults are really hard to debug, as they tend to happen somewhere in C code, and often in an external module - e.g. sklearn, etc …

Looking at your recipe, there are a bunch of non-standard recipe modules ( localisations.FiducialTrack, localisations.FiducialApplyFromFiducial, localisations.DBSCANClustering2, localisations.ClusterTimeRange) which appear to have been added to the PYME source. This makes it very hard to say anything with certainty as we have no idea what is in those modules and/or if any of the standard modules have been altered. As a general rule it’s not good practice to hack custom modules into the core recipe modules within a source version of PYME. A much better approach is to use the plugin interface. That said, my debugging advice is as follows:

  1. identify all modules which might use C code. In terms of the core modules, those would be FindClumps and MergeClumps. Speculating that DBSCANClustering2 is a modified copy of the core DBSCANClustering, this would also call C code in sklearn.

  2. modify the recipe by removing suspect modules until the segfault goes away. Once you have identified the culprit, you have a few options:

    • If it involves a call to a library, e.g. sklearn, try upgrading or downgrading that library and/or looking at the release notes/issues to see if it is a known issue.
    • If it is not a library call, and/or you want to debug it yourself, construct a minimal test case that reproduces the error outside the recipe context (e.g. in an ipython session).
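As an illustration of such a minimal test case, something along the following lines would isolate the suspected sklearn call from the recipe machinery entirely (the column layout, eps and min_samples values here are hypothetical placeholders, not your actual recipe settings):

```python
# Minimal reproduction sketch: call the suspected sklearn code path
# directly on the same kind of data the recipe module would see.
# All parameter values below are placeholders.
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
# stand-in for x, y localisation coordinates (in nm, say)
points = rng.uniform(0, 1000, size=(5000, 2))

labels = DBSCAN(eps=30.0, min_samples=5).fit_predict(points)
print("clusters found:", labels.max() + 1)
```

If this snippet segfaults on its own, the recipe context is ruled out and you can start bisecting sklearn versions directly.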

In general, you are likely to get segfaults in C extension code in two circumstances:

a) out of bounds array indexing
b) reference counting errors (e.g. missing INCREFs or extra DECREFs)

The latter can be quite tricky to debug as they can manifest after the actual erroneous code, at a non-deterministic time when the garbage collector gets around to dealing with the array with the incorrect reference count.

Other debugging strategies include:

  • making debug builds of Python and all libraries and running through a debugger (generally prohibitively time-consuming once you include Python and all its libraries, such as numpy, etc.)
  • turning on core dumps and examining these. In the absence of debug symbols these will only tell you which shared library (.so) the segfault happened in, but that can still be a useful clue

An untested, but potentially interesting option would be to run your code through the PYME.util.fProfile profiler (or add your own profiler hook). This will record all function calls and returns to a file on disk; looking at the tail of the profile file might give a pseudo-traceback for the segfault (modulo file IO buffer flushing).
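I haven't checked the exact PYME.util.fProfile API, but the underlying idea can be sketched with a plain sys.setprofile hook (the names here are my own, not PYME's):

```python
import sys

trace_log = []  # in real use this would be a file, flushed on every write

def _hook(frame, event, arg):
    # record every Python-level call and return; the tail of this log
    # just before a segfault acts as a pseudo-traceback
    if event in ('call', 'return'):
        code = frame.f_code
        trace_log.append(f"{event} {code.co_filename}:{code.co_name}")

def suspect():
    # stand-in for the code path that eventually segfaults
    return sum(range(10))

sys.setprofile(_hook)
suspect()
sys.setprofile(None)

print(trace_log[-4:])  # tail of the trace
```

Writing to an unbuffered (or line-flushed) file rather than a list is what makes this survive a hard crash.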

See also the Python documentation: faulthandler — Dump the Python traceback.
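For reference, enabling faulthandler is a one-liner; on a segfault it dumps a Python-level traceback for each thread to stderr:

```python
import faulthandler

# Dump a Python traceback on SIGSEGV, SIGFPE, SIGABRT, SIGBUS and SIGILL.
faulthandler.enable()

# Equivalently, run with: python -X faulthandler your_script.py
# or set the environment variable PYTHONFAULTHANDLER=1

print(faulthandler.is_enabled())  # True
```

This won't point at the offending C line, but it does tell you which Python call was in flight when the crash happened.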

Hi @DavidBaddeley , many thanks for the helpful tips. I will give these a spin in order of increasing complexity; I will probably walk back recipe modules in the first instance to see if there is an obvious culprit.

Custom recipe modules

Note that all recipe modules you mention above are properly inserted via plugins. They stem from the PYMEcs.recipes namespace in PYME-extra and use the recommended plugin interface (see PYME-extra/recipes). Because the initial namespace is stripped, they appear like core modules. I chose this deliberately so that they are inserted into the respective module sections alongside the core modules, which makes them more straightforward to find in their matching section. Obviously, this could cause issues if name clashes were to occur, but I have avoided that so far.

If there are reasons to map them under a different section, it would be useful to hear them.

I’d never really considered the possibility of shadowing the core module/category names from within a plugin. I’d somehow assumed all plugin-derived recipe modules would be scoped to the plugin. I can see the benefit from an organisational perspective, but I am not sure I would recommend it. The reasons I’d avoid name shadowing are:

  • shadowing leaves your recipes open to breakage on PYME updates. We have no way of knowing what your shadowed modules are and could easily introduce clashing names in future updates (avoiding such breakages was one of the motivations for introducing scoped module names in the first place - something that shadowing defeats).
  • when you share recipes, it’s not obvious that the required code is found in a plugin rather than core PYME and users might get frustrated when things don’t run.
  • name shadowing makes debugging harder as it is no longer obvious where to look for the code which implements a specific recipe module.
  • shadowing exploits an implementation quirk in how recipe modules are currently registered (as a dictionary mapping a scoped module name to an imported python class object). I’m reluctant to commit to maintaining this implementation quirk indefinitely - it would be nice to be able to lazily load modules if and when they are needed (As it stands, we require all modules to be loaded before we do anything with a recipe and, as a result, buggy 3rd party plugin modules have the potential to bring everything down. There would also be a performance advantage in lazy loading, particularly in scenarios where we are launching subprocesses etc as workers).
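The lazy-loading idea could look roughly like the following (a self-contained sketch, not PYME's actual registry: it maps scoped names to "module:Class" strings and only imports on first access, using a stdlib class as a stand-in for a real recipe module):

```python
import importlib

class LazyModuleRegistry:
    """Map scoped recipe-module names to classes, importing lazily."""

    def __init__(self):
        # values are "python_module:ClassName" strings, not class objects
        self._specs = {}
        self._cache = {}

    def register(self, scoped_name, spec):
        self._specs[scoped_name] = spec

    def get(self, scoped_name):
        # import only when a recipe actually uses the module, so a buggy
        # third-party plugin cannot break unrelated recipes at startup
        if scoped_name not in self._cache:
            mod_name, cls_name = self._specs[scoped_name].split(':')
            mod = importlib.import_module(mod_name)
            self._cache[scoped_name] = getattr(mod, cls_name)
        return self._cache[scoped_name]

# stand-in registration: a stdlib class instead of a real recipe module
registry = LazyModuleRegistry()
registry.register('localisations.DBSCANClustering', 'collections:OrderedDict')
cls = registry.get('localisations.DBSCANClustering')
print(cls.__name__)  # OrderedDict
```

Under such a scheme registration is just a string, so nothing about a plugin runs until one of its modules is actually needed.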

Part of this is admittedly about optics - it reflects poorly on us if someone is given a recipe and it doesn’t run, but I think it also makes troubleshooting easier if module provenance is clear from the module name.

Ok, this makes sense. I’ll ponder a renaming scheme (and a utility to convert the existing yaml files, of which we have many).

With the newish filter option available in Add Module, it would be cool if you could also filter for a prefix (such as localisations, pymenf) with some simple syntax/switch, as this would make it quick to home in on plugin modules.
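For what it's worth, the matching logic itself would be simple; a hypothetical filter over registered module names (the trailing-dot switch is just one possible syntax) might be:

```python
def filter_modules(names, query):
    """Filter recipe-module names; a trailing '.' restricts the match
    to the namespace prefix rather than any substring."""
    if query.endswith('.'):
        return [n for n in names if n.startswith(query)]
    return [n for n in names if query in n]

# hypothetical registered names for illustration
names = ['localisations.DBSCANClustering',
         'localisations.FiducialTrack',
         'pymenf.SomeModule',
         'tracking.FindClumps']

print(filter_modules(names, 'localisations.'))  # prefix match only
print(filter_modules(names, 'Clu'))             # substring match
```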

As to converting existing files, you can move the module and use @register_legacy_module(old_name, old_module_name) with the old name, in addition to the normal @register_module(...) with the new name. Loading and saving a recipe should then convert all the names.
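I haven't verified the exact decorator signatures, but the pattern described above could look roughly like this self-contained sketch (the decorators and registries here are mock stand-ins, not PYME imports, and the names are invented):

```python
# Mock stand-ins illustrating the old-name -> new-name registration idea;
# in real code the decorators come from PYME's recipe machinery.
MODULE_REGISTRY = {}    # scoped name -> class, used when loading recipes
LEGACY_ALIASES = {}     # old name -> new name, used to rewrite on save

def register_module(name):
    def deco(cls):
        MODULE_REGISTRY[name] = cls
        return cls
    return deco

def register_legacy_module(old_name, new_name):
    def deco(cls):
        LEGACY_ALIASES[old_name] = new_name
        MODULE_REGISTRY[old_name] = cls  # old recipes still load
        return cls
    return deco

@register_module('pymecs.DBSCANClustering2')
@register_legacy_module('localisations.DBSCANClustering2',
                        'pymecs.DBSCANClustering2')
class DBSCANClustering2:
    pass

# a recipe using the old name still resolves to the same class ...
assert MODULE_REGISTRY['localisations.DBSCANClustering2'] is DBSCANClustering2
# ... and saving can rewrite it to the new name
print(LEGACY_ALIASES['localisations.DBSCANClustering2'])  # pymecs.DBSCANClustering2
```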

Nice! I will come up with something in due course.

Just to wrap up this topic, I have now traced the segfault to the PYME-extra localisations.DBSCANClustering2 module, which uses a multithreading approach that was initially not available in the PYME core localisations.DBSCANClustering. However, this has since been enabled as an option in localisations.DBSCANClustering and runs fine, whereas my PYME-extra version bombs.

I am not yet sure what part of the very subtle implementation difference triggers the segfault, but in any case I am switching back to the core localisations.DBSCANClustering in the yaml files exposed in our data repository.