Friday, 22 April 2011

Open Community Research: cross-institutional integrative Bioinformatics - something for Debian Med to aim for in 2012+ ?

A few days ago this blog opened with a series of observations on the multi-directional education and collaboration that comes with active or passive participation in Debian Med. My personal ambition is to find ways to further institutionalise this constructive exchange beyond packaging. It occurred to me that this may mean talking more about what we actually do with our packages.

This will lead us to discussing, optimising and specifying workflows, i.e. the graph that connects data sources with tools and their outputs with other tools, plus the optimisation of command-line arguments and the evaluation of the findings. This all sounds very natural to me, since the desire to complete a particular workflow locally is what brings most packages into the distribution today. Until recently, we just did not have a way to talk about those workflows formally, except by exchanging shell scripts. This has changed with Alan's and Hajo's continued collaboration to get command-line tools integrated with the workflow suite Taverna. It allows us to describe the inputs and outputs of our executables and presents them as regular workflow elements, right next to the (today :o) ) dominant remote web services. The site is a repository of (frequently nested) workflows, with all the typical user comments and ranking. To have that extended to all those bits one can achieve with the various tools in Debian would be highly interesting. Admittedly, knowing about the rather limited success in collecting uploads of bits as trivial as screenshots, we need a strong positive feedback loop and should not expect the community to adopt this just because it could.
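To make the "graph" notion a bit more tangible, here is a minimal sketch in Python. The step names (fetch_sequences, align, build_tree, annotate, report) are made up for illustration; they are not actual Debian Med packages or Taverna elements, and this says nothing about how Taverna represents workflows internally. It only shows that once the dependencies between steps are written down, a valid execution order follows mechanically.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical workflow: each key is a step, each value is the set of steps
# whose outputs it consumes. All names are illustrative assumptions.
WORKFLOW = {
    "fetch_sequences": set(),
    "align": {"fetch_sequences"},
    "build_tree": {"align"},
    "annotate": {"fetch_sequences"},
    "report": {"build_tree", "annotate"},
}


def execution_order(workflow):
    """Return the steps in an order that respects every dependency."""
    return list(TopologicalSorter(workflow).static_order())


if __name__ == "__main__":
    for step in execution_order(WORKFLOW):
        print(step)
```

A shell script fixes one such order by hand; a formal description keeps the graph itself, which is what makes workflows comparable and shareable.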

So, this leads us to my initial impetus: the community needs something to work on in order to develop itself and the technologies (like this blog) it has adopted. And this is where public data sets come in. We had previously discussed the integration of data with the distribution in the context of BioMaj/getData for curated protein, structure or interaction data. But when we extend that to some "weird stuff" as well, maybe something novel from the more clinical branch of Debian Med, or the joint (re-?)analysis of a genome (a virus, maybe?), then I am fairly confident that the enormous heterogeneity of our community allows us to yield something that a regular institution's Bioinformatics service unit would find difficult to match.

So, we would apply Open Source principles to biomedical (re-)research. Beyond the further development of ourselves, this certainly has many benefits, directly through our findings and indirectly through the education it brings to all those who follow the development online. Such shared research efforts could start any time, in principle. The anticipated deeper integration of Taverna with our distribution will allow specifying many smallish workflows as legitimate subgoals. Let's hope for an additional posting soon with a tutorial for Taverna's external tools. With the advent of Ensembl or gbrowse in our distribution we get a sense of some sort of "completeness" for the end users: once my genome has arrived in either, the work is perceived as done. This may be right or wrong, but just filling those web interfaces with data is already a challenging workflow. There is still quite a lot to do for all of this, and we should talk about it.


  1. Just additional info regarding workflow tools. There are really nice tools (Mobyle and Galaxy, not yet packaged) that provide a web interface to tools installed locally or on a cluster. They can be used as a desktop tool or a server tool to execute bioinformatics tools through an easy web interface.
    Galaxy (and soon Mobyle) offers the possibility to chain the tools to create a workflow.
    The idea behind them is to describe the tools in XML, with their inputs and outputs, and to generate a web form automatically. With this, it is possible to hide some complex parameters from the end user, or to add some help to the interface.
    Those tools can be of great help to biologists, hiding the complexity of the command line.
    With those tools packaged, it would be possible to ship the XML description files of tools as a package, easing the sharing of those descriptions.
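As a rough illustration of the idea in the comment above (describe a tool in XML, generate the web form from it), here is a minimal Python sketch. The XML layout, the tool name example_aligner and the attributes label, default, help and hidden are assumptions invented for this example; they are not the actual description formats of Galaxy or Mobyle.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical tool description; NOT the real Galaxy or Mobyle schema.
TOOL_XML = """
<tool name="example_aligner">
  <param name="input_file" type="file" label="Sequences (FASTA)"/>
  <param name="gap_penalty" type="number" label="Gap penalty"
         default="10" hidden="true"/>
  <param name="output_format" type="text" label="Output format"
         default="clustal" help="e.g. clustal or fasta"/>
</tool>
"""


def render_form(xml_text):
    """Turn the (assumed) XML tool description into a simple HTML form."""
    tool = ET.fromstring(xml_text)
    tool_name = escape(tool.get("name"), quote=True)
    lines = [
        f'<form action="/run/{tool_name}" method="post" '
        'enctype="multipart/form-data">'
    ]
    for param in tool.findall("param"):
        name = escape(param.get("name"), quote=True)
        default = escape(param.get("default", ""), quote=True)
        if param.get("hidden") == "true":
            # Complex parameters keep their default and stay out of sight.
            lines.append(
                f'  <input type="hidden" name="{name}" value="{default}">'
            )
            continue
        label = escape(param.get("label", param.get("name")))
        kind = escape(param.get("type", "text"), quote=True)
        lines.append(f'  <label for="{name}">{label}</label>')
        lines.append(
            f'  <input type="{kind}" id="{name}" name="{name}" '
            f'value="{default}">'
        )
        help_text = param.get("help")
        if help_text:
            # Optional inline help shown next to the field.
            lines.append(f"  <small>{escape(help_text)}</small>")
    lines.append('  <input type="submit" value="Run">')
    lines.append("</form>")
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_form(TOOL_XML))
```

Because the description is plain data, it could be shipped in a package of its own, separate from both the tool and the web front end.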

  2. Links

    Mobyle was new to me. I have not worked with Galaxy, yet.

    Taverna is not yet available for Debian - but it runs, just please use Sun's JDK for it. OpenJDK has crashed for me. It will take a while longer to get Taverna in. I just checked again: it expects other versions of the Jars than those shipped with Debian, and the differences between those versions are subtle. Sadly, we failed to get this accepted as a Debian Google Summer of Code project this year. I cannot tell the dependencies for Galaxy yet.

  3. Robert Haines may start looking at putting Taverna into Debian as part of his creation of virtual machines, especially for Amazon.

    Galaxy is very good. It is not as flexible as Taverna, but shows a more user-friendly interface. There is already a tool to call Taverna workflows as Galaxy tools, so you can get the best of both worlds.