ExportToDatabase produces erroneous object files

cellprofiler

#1

Hi,

When running ExportToDatabase, I’m occasionally getting SQL_*_object.CSV files in which some lines have the wrong number of fields.
I tried the new bugfix release, and that didn’t fix the problem.

I’m attaching a tgz file containing a README that explains the problem in more detail, the pipeline, and two datasets: one that shows the problem and one that doesn’t. I’ve also included the output I’m seeing.

In case the upload attachment fails (it’s large), the tgz file is also available at: ftp://maguro.cs.yale.edu/pub/CellProfiler/BUG111308.tgz

Thanks for your help!

Rob Bjornson


#2

Hi Rob,

In addition to Mark’s fix, I found a bug in the Relate module that was also a source of the column-number discrepancy. When calculating the means across children for each parent, images with no Parents defined were skipped entirely, resulting in a different number of measurements and, hence, a different number of output columns. Please replace your Relate module with the attached one; in my testing, this fixes the problem.
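To illustrate the failure mode, here is a small sketch (in Python, not the Relate module’s actual Matlab code; the per-image data is invented):

```python
def means_buggy(images):
    # Images with no parents are skipped entirely, so the number of
    # mean-over-children measurements varies from image to image.
    return {img: [sum(kids) / len(kids) for kids in parents.values()]
            for img, parents in images.items() if parents}

def means_fixed(images):
    # Every image contributes an entry, even an empty one, so the
    # downstream column count stays constant.
    return {img: [sum(kids) / len(kids) for kids in parents.values()]
            for img, parents in images.items()}

images = {
    "image1": {"parent1": [1.0, 3.0], "parent2": [2.0]},  # two parents
    "image2": {},                                          # no parents defined
}
print(means_buggy(images))  # image2 is missing entirely
print(means_fixed(images))  # image2 present, with an empty measurement list
```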

And if that were not enough(!), we have discovered a further issue that will affect you if you try to upload the data to a database. Some very long object column names, when truncated by CPtruncatefeaturename.m because of Matlab’s annoying 64-character limit on variable names, end up identical to other truncated names, so you will get a “Duplicate column name” error when uploading. The quick fix is to shorten the image or object names in your pipeline until each column name is unique. We are working on a more intelligent, comprehensive fix.
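Here is a sketch of the collision (the names are invented, and simple prefix truncation stands in for whatever CPtruncatefeaturename.m actually does):

```python
from collections import Counter

MATLAB_NAME_LIMIT = 64  # Matlab's limit on variable name length

def truncate(name, limit=MATLAB_NAME_LIMIT):
    # Stand-in for CPtruncatefeaturename.m: keep only the first 64 chars.
    return name[:limit]

# Two distinct column names that differ only after the 64th character:
names = [
    "MyVeryLongObjectName_Intensity_MeanIntensity_MyVeryLongImageChannelOne",
    "MyVeryLongObjectName_Intensity_MeanIntensity_MyVeryLongImageChannelTwo",
]
counts = Counter(truncate(n) for n in names)
duplicates = [n for n, c in counts.items() if c > 1]
print(duplicates)  # both names collapse to the same 64-character prefix
```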

Thanks for reporting this.
David
Relate.m (10.7 KB)


#3

Hi Rob,

The problem was a bug in which the right-most columns with no measurements were being skipped in the CSV file, even though their column headers were still written to the SETUP.SQL file. Fortunately, since you are working with the developers version, you can fix it yourself.

To fix this, change/add the following lines in CPconverttosql in the CPsubfunctions directory (line numbers are approximate, since I’m working with a newer version):

Immediately before this line (around line 449):

Add:

[quote]if size(perobjectvals,2) < length(per_object_names) + 2,
    perobjectvals(end, length(per_object_names) + 2) = 0;
end[/quote]
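In case it helps, the same padding idea in Python (variable names are borrowed from the Matlab snippet; I’m assuming the two extra columns are the per-row ID fields written ahead of the measurements):

```python
# If the values matrix is narrower than the header list plus the two ID
# columns, pad each row with zeros so every CSV row matches the column
# headers written to SETUP.SQL.
def pad_to_header_width(perobjectvals, per_object_names):
    expected = len(per_object_names) + 2  # +2: assumed to be the ID columns
    return [row + [0.0] * (expected - len(row)) for row in perobjectvals]

headers = ["Mean", "StDev", "Median"]     # 3 measurements -> 5 columns total
vals = [[1, 1, 0.5], [1, 2, 0.7]]         # but only 3 fields were written
print(pad_to_header_width(vals, headers)) # every row padded out to 5 fields
```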

Regards,
-Mark


#4

Dear David and Mark,

Thanks for your replies to my bug report. I’m glad to hear that you’ve been able to locate the issues. I have a couple of questions about your proposed solutions to these issues. A little more context is necessary, which I probably should have given you, but I wanted to keep my initial report succinct.

First, in production I’m only using the CellProfiler GUI to create batch files for CPCluster, so any fix would need to encompass CPCluster too. Although I’m using the developer version of CellProfiler (because a compiled version didn’t exist for Linux at the time), I’m using the compiled CPCluster. I don’t have a license for the Matlab compiler, so patching CPCluster and recompiling it is not a very attractive option; I’d probably wait until you do a new release.

Given that, I’m hoping I can work around this problem for the time being, rather than patching CPCluster. Can you be a bit more specific about where columns might be omitted in a given row entry? If it’s always at the end, I can pad out the rows with 0’s in a post-processing step. However, if internal values can be omitted, then my columns will get out of sync, and I’ll need a different strategy (maybe tossing short rows).

Finally, thanks for warning me about the SQL name collisions. I’m not actually loading the data into a database; rather I’m just parsing the CSV files, so it should be ok for now.

Rob


#5

Hi Robert,

Thanks for the additional info.

[quote=“robertbjornson”]
Given that, I’m hoping I can work around this problem for the time being, rather than patching CPCluster. Can you be a bit more specific about where columns might be omitted in a given row entry? If it’s always at the end, I can pad out the rows with 0’s in a post-processing step. However, if internal values can be omitted, then my columns will get out of sync, and I’ll need a different strategy (maybe tossing short rows).[/quote]

You’re right: the omitted columns are always at the end.

If the empty measurements are internal, zeros get filled in, as long as there is a non-empty measurement somewhere in a column to their right. In your case, the empty measurements ran through the last column, with nothing non-empty to their right, so no zeros were filled in.

The fix checks for this case and pads the missing columns for you. If you do the same padding as a post-processing step, you should be fine. Let us know if you have any more problems.
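If you do go the post-processing route, the padding step could look something like this (a sketch; the file names and the expected column count, which you could take from the SETUP.SQL headers, are assumptions on my part):

```python
import csv

def pad_csv_rows(in_path, out_path, n_columns, fill="0"):
    # Pad every short row out to n_columns fields. This is safe only
    # because the omitted columns are always the right-most ones;
    # internal empties are already zero-filled.
    with open(in_path, newline="") as src, open(out_path, "w", newline="") as dst:
        writer = csv.writer(dst)
        for row in csv.reader(src):
            writer.writerow(row + [fill] * (n_columns - len(row)))
```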

Regards,
-Mark