I am trying to set up an environment for CP on our cluster. After solving many problems and much mysterious behavior, our version of CP finally works in headless mode. However, when I send jobs to the scheduler, many of them finish with memory errors. I would be very happy if somebody could share their impressions of setting up CP 2.X in a cluster environment.
My current workflow looks like this:
- I am interested in HCS data analysis. We use 384-well plates, with 9 images per well and 3 channels. A primary analysis usually requires processing ~50 plates. This means ~600k images (in ~200k sets). Processing one set takes about 30 s, which means ~70 days of computation on a single machine (this is why we need a cluster).
- I have a CP pipeline which computes features and, for each site, writes the results to the database plus mask files and some additional CSVs on the hard drive (processing of each site is independent of the others and can therefore happen in parallel).
- I created a Python script which reads the directories and creates the CSV file read by LoadData in CP (it sets up the correct metadata, etc.). This file contains ~200k lines. (A stripped-down sketch of this script is below, after the list.)
- I tried many different scenarios of calling CP and have now settled on a fairly old-fashioned approach (without grouping by metadata; I pass the first and last image set to be processed). It looks something like:
[code]python /cluster/apps/cellprofiler/2.1.1/x86_64/CellProfiler/CellProfiler.py --jvm-heap-size=1g --do-not-fetch -c -r -b -i /cluster/work/scr3/sstoma/analysis/TEST07/input/ -t /cluster/work/scr3/sstoma/tmp/ --project=/cluster/work/scr3/sstoma/analysis/TEST07/input/01.cpproj -o /cluster/work/scr3/sstoma/analysis/TEST07/output/ -f 61128 -l 61343[/code]
- I submit the processes to our queue system with different -f/-l values so they are computed in parallel (a sketch of the submission loop is below).
Now I have a few questions:
- To adapt CP to our cluster we had to configure Java via the _JAVA_OPTIONS environment variable; these are the values you can see being "Picked up" in the job logs quoted below:
Why? When a Java virtual machine starts, it tries to allocate a huge chunk of memory, often exceeding physical memory; this is called memory over-commitment, and some argue it improves efficiency. On a workstation with 4 GB of physical memory this is not a problem. On a cluster, however, over-commit is often not enabled (it is explicitly disabled on the cluster I use), so a process can only allocate as much memory as is physically available. The options above let us limit Java's appetite for RAM and rein in the over-commit behavior (and they should be consistent with the --jvm-heap-size=1g that we pass to CP).
a) Do any of you do similar things? Does it work reliably? Are there any other options that can help manage the RAM consumption of headless CP?
b) My pipeline does not need ImageJ. I guess the Bio-Formats file readers are the only beneficiaries of this memory. What is the rule of thumb for setting --jvm-heap-size? My images are 3 × ~10 MB TIFFs, and the number of objects goes into the few thousands at most.
c) I often get nondeterministic errors:
[code]Version: 2014-08-07T16:22:21 02e67c8 / 20140807162221
Picked up _JAVA_OPTIONS: -XX:ParallelGCThreads=1 -Xmx8000m -Xms4000m
Error: Could not create the Java Virtual Machine.
Error: A fatal exception has occurred. Program will exit.

  from bioformats.omexml import OMEXML
  File "/cluster/apps/cellprofiler/2.1.1/x86_64/lib/python2.7/ctypes/__init__.py", line 353, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: Error occurred during initialization of VM
Could not reserve enough space for object heap/libjvm.so: cannot open shared object file: No such file or directory[/code]
I guess this is due to insufficient resources on the cluster for the job. Any ideas what I can change in the memory management to get rid of these? They happen only at the very beginning of a job, and they might be the reason I need to request this gigantic heap and 16 GB of memory per job (which is a real problem when I submit 1000 jobs).
d) When jobs finish I see the following:
[code]Resource usage summary:

    CPU time      : 8726.82 sec.
    Max Memory    : 1329 MB
    Max Swap      : 13849 MB
    Max Processes : 3
    Max Threads   : 63

The output (if any) follows:

Version: 2014-08-07T16:22:21 02e67c8 / 20140807162221
Picked up _JAVA_OPTIONS: -XX:ParallelGCThreads=1 -Xmx8000m -Xms4000m[/code]
The swap consumption scares me. Why did my jobs need ~14 GB of swap? (Again, I am processing sets of 3 images of ~10 MB each… the number of objects is ~1000.)
e) In the output above: why does a job use 63 threads? I thought that CP run in headless mode uses only one thread. Am I wrong? If not, why do 63 threads seem to be initialized? (I see lines like these in the log:)
[code]stopping worker thread 43
stopping worker thread 44
stopping worker thread 45
stopping worker thread 46
stopping worker thread 47
stopping worker thread 48
Exiting the JVM monitor thread[/code]
f) I tried creating a small CSV file for each process, stored in different subdirectories (my master CSV has 200k lines…), and using the -i parameter to specify the input dir (then I did not use the -f/-l params; all lines get processed). It works, but records in the database get overwritten (the image indices…). Is there any way to get around this problem? I also observe that parsing the 200k-line CSV takes a significant amount of time in each process. A sketch of the splitting script follows.
g) What is the advantage of using the .h5 file generated by CreateBatchFiles instead of a .cpproj? What is inside this file? My problem is that to create it I need to use the GUI to drag files into CP's file import modules, which does not work very reliably when I have 300k images in the directories. Also, I use my script to rewrite the paths from local ones to FTP (sketched below). What are the recommendations?