The Drake equation of bioimage analysis: Guessing the number of users of a bioimage analysis software package

To support a proposal, I recently tried to guess how many people use an academic bioimage analysis software package. In Fiji we cannot track how many times a plugin is run. We also do not log users, nor do we ask them to register. So we have to resort to indirect measurements and guesses. I posted a tweet (https://twitter.com/jytinevez/status/1372097783499526148?s=20) about this question, and several people answered. This short blog post summarizes their input.

Guessing the number of users via the number of citations.

If you have a paper that describes the software, and if you put a visible link to that paper in the UI or in the documentation, we can try to estimate the number of users from the number of citations.

If N_u is the number of users and N_c the number of citations, then as a first approximation we can assume a linear relationship between the two. This approximation is not expected to hold when N_c is very small (kickoff phase) or very large (stardom phase, like the Fiji paper). Otherwise we can write:

N_u = F \times C \times P \times N_c

with:

  • F is the factor accounting for bioimage analysis endeavors that do not turn into a publication (where the tool could have been cited).
  • C is the ratio of the publications that use the tool to those that actually cite it.
  • P is the number of people that use the tool per publication.

We then discussed what could be a sensible estimate for these 3 factors. Our guesses are based on what we see when collaborating with biologists. The contributors to this topic work in different institutes, in different countries and on different subjects, so we expect our guesses to vary quite a bit.

Guesses of F.

Emmanuel Reynaud suggests that F is about 4, based on how much people tinker with a piece of software before getting any further.
@haesleinhuepf estimates it to be around 10.
Judging from how many of the projects we receive in our respective facilities are dropped before turning into a publication, Guillaume Gay says about 3, and I would say about 5.

So, a range for F is 3 - 10.

Guesses of C.

How many publications use the tool for each one that cites it.

Guillaume's and Robert's guesses are around 10.
Based on the biology papers he reviews, Emmanuel is more optimistic and says it is lower, from 2 to 4, sometimes even as good as 1 or 1.5.
@petebankhead has a rather accurate estimate when it comes to QuPath, thanks to surveying the publications that use it. He found a value of about 1.3.

It is likely that the efforts made to raise awareness of scientific software, and of the need to cite the associated papers, have paid off recently, and that C is probably lower than 10. A plausible range would be 1 - 5.

Guesses of P.

Number of people that use the tool for one publication.

Based again on what we see in our respective facilities, Guillaume and I propose a guess of around 3. Probably not much more, because we observe that within a project the image analysis tasks are done by only a small part of the team; the others just want to see the results.

Range estimate.

From these guesses we can therefore write N_u = \gamma \times N_c,
with \gamma in the range 9 - 150 (from 3 \times 1 \times 3 up to 10 \times 5 \times 3), estimated from the 3 factors F, C and P.
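
To make the arithmetic explicit, here is a minimal Python sketch of this back-of-the-envelope computation. The ranges are the guesses discussed above; the function and variable names are mine and purely illustrative.

```python
# Back-of-the-envelope model: N_u = F * C * P * N_c,
# using the guessed ranges discussed above.

F_RANGE = (3, 10)  # endeavors that never turn into a publication
C_RANGE = (1, 5)   # publications using the tool per publication citing it
P_RANGE = (3, 3)   # people using the tool per publication

def gamma_range(f=F_RANGE, c=C_RANGE, p=P_RANGE):
    """Return (min, max) of gamma = F * C * P."""
    return f[0] * c[0] * p[0], f[1] * c[1] * p[1]

def users_from_citations(n_citations, gamma):
    """Linear model N_u = gamma * N_c."""
    return gamma * n_citations

lo, hi = gamma_range()
print(f"gamma is in the range {lo} - {hi}")  # 9 - 150
# A hypothetical tool with 1000 citations:
print(f"{users_from_citations(1000, lo)} to {users_from_citations(1000, hi)} users")
```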

Sebastian Munck extrapolates from return rates in marketing. If people report the usage of a tool at the same rate as they do elsewhere, this return rate should be about 5%, i.e. roughly one citation for every 20 users, which puts \gamma around 20.

Bounding the range via the number of downloads.

For core software packages like Fiji or Icy, we have access to the number of downloads. @fjug and @acrevenna suggested using it to estimate a maximum value for the number of users.
Indeed, a single user can download the software several times (I download Fiji and Icy about once every 2-5 months), and someone can download the software and never use it (as noted by Robert, e.g. the admin of an analysis room who prepares computers for a course). The number of downloads is therefore an upper bound on the number of users.

If we take Fiji as an example, last year we measured about 500k downloads and 7k citations over the same period. If we apply our linear relationship (keeping in mind that it should not really apply to Fiji), we find that an upper bound for \gamma would be 500k / 7k, roughly 70. This falls within the range we guessed above.
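
Written as code, the bound is a single division. A minimal sketch, using the rounded Fiji figures quoted above:

```python
# Every user downloads the software at least once, so N_downloads >= N_u,
# and therefore gamma = N_u / N_c <= N_downloads / N_c.

def gamma_upper_bound(n_downloads, n_citations):
    return n_downloads / n_citations

# Fiji, last year (rounded figures quoted above):
print(gamma_upper_bound(500_000, 7_000))  # ~71
```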

The case of BoneJ.

@mdoube has access to usage statistics for the BoneJ software suite. He notes that citations lag usage by several months, so we should be cautious with this kind of estimate.

The BoneJ statistics give us another example of an estimate:
Michael notes that over the period 2010-2020, there were 51k downloads and 1600 citations. This corresponds to an upper value of \gamma of about 30.

But last year BoneJ had 9k users and 250 citations, which yields a direct estimate of \gamma of around 36.
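
The same ratios written out for the BoneJ figures quoted above (a small sketch; "users" here is the usage count reported by the BoneJ statistics):

```python
# BoneJ, 2010-2020: downloads / citations gives an upper value of gamma.
print(51_000 / 1_600)  # ~32, i.e. about 30

# BoneJ, last year: users / citations gives a direct estimate of gamma.
print(9_000 / 250)     # 36
```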

Other metrics.

@StephanPreibisch warned that describing the usage of a software package with a single number is far from enough. He suggests combining several measures, for instance:

  • Number of applications for a workshop
  • Number of GitHub issues
  • Number of downloads
  • Size of the community the tool is relevant for
  • Number of forum posts
  • Update frequency
  • Number of other tools and libraries that depend on it

The latter point is particularly important for core libraries (e.g. ImgLib2) that are sometimes invisible to the user and are therefore less likely to turn usage into a citation.


Thank you so much to all the people who contributed and are cited here!
