Thursday 30 August 2018

ethics - Hugely important database replete with errors


I am about to finish my PhD in bioinformatics. I am trying not to give out too many details, so I won't be very specific.


In the field I am working in, there are two extremely important databases being used. During my MSc internship, I chose one of the two databases, decided to stick with it and never looked back. However, as my PhD nears its end, I find more and more errors in this database.


I only contacted the database team once to tell them I thought I had spotted an error and to ask them whether my reasoning was correct. They took more than 2 weeks to reply and only said they had fixed the error, without providing any other explanation. Intrigued, I looked more closely at the data and started "reverse engineering" the database: I don't know how everything is structured internally, but I am starting to get a pretty good picture of how they do it. It's a very old and big database that keeps getting bigger and richer. The scientists behind the database are constantly trying out many new things to enrich the data. They keep some changes and discard others.


Now here's the problem: these attempts at enriching the data are badly integrated. The database is never checked for internal consistency. It is highly inconsistent, and I can prove it with a systematic approach. I can point out actual problems and potential problems, with suggestions to discard dangling references.
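To give an idea of what I mean by a systematic check for dangling references, here is a minimal sketch. It assumes a hypothetical layout in which every record has an identifier and may point to other records by identifier. The function name, the data layout and the toy identifiers are all made up for illustration and say nothing about how the real database is organised.

```python
def find_dangling_references(records, references):
    """Return sorted (source_id, target_id) pairs whose target is not a known record.

    records:    dict mapping record identifier -> record content (hypothetical layout)
    references: dict mapping source identifier -> list of referenced identifiers
    """
    known_ids = set(records)
    return sorted(
        (source_id, target_id)
        for source_id, targets in references.items()
        for target_id in targets
        if target_id not in known_ids
    )


# Toy data, entirely invented for this example.
records = {"A1": "entry one", "A2": "entry two", "A3": "entry three"}
references = {"A1": ["A2", "A9"], "A2": ["A3"], "A3": ["A1", "A7"]}

dangling = find_dangling_references(records, references)
print(dangling)
```

Each reported pair is either an actual problem (the target was deleted or never existed) or a potential one (the target may live in a part of the data I cannot see), which is why I would phrase the output as suggestions rather than verdicts.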


They also license their data. Some of it is freely available and some of it can be obtained with an academic or a commercial subscription. If they sell their data they might as well sell clean and coherent data!...



Many database maintenance operations seem to be handled manually and with no regard to existing data. Their enrichment methods seem to be based on in-house scripts that the team praises in different publications and that can be used in web forms, but for which the source code is never available. The whole database seems to use a very old and messy design and I have the feeling they never had engineers in their team. The team itself and their research... well, they are not really transparent. The website only has a feedback form. One can never know what problems have already been reported and fixed. There is no changelog. They can change what they please, when they please and in any way they please.


My PhD adviser encouraged me to pursue this investigation and warned me not to contact the database team any more because we will need to figure out what to do. According to my adviser, even though I'm right, this investigation alone does not warrant a publication. In any case, it will at least make a chapter in my PhD thesis.


What I want is to get these problems fixed. What I would also like, though I don't think it is likely given the aforementioned lack of transparency, is a paper co-published with the authors behind the database, or at least a collaboration for my lab.


Side note: my PhD research is utterly uninteresting and advances a boring method of doing something nobody wants to do and that will therefore never be used. If I manage to link my name to updating and correcting this database, then my PhD will have served a purpose.


Do you have any experience and advice on how to best handle this delicate kind of issue?


EDIT: I want to thank everybody for their contributions. I will select the most helpful answer. The answers and comments I received pointed me in the right direction. I want to make the issues known once I finish writing my consistency-checking tool. How exactly I will go about making the errors known is something I still need to decide. A few final notes:



  • As one of the comments pointed out, it is only fair for the bioinformatics community depending on this database that such errors be known. Maybe making them public will lead some research groups to reevaluate the degree of this dependence.

  • I did not mean to cause a stir about computer science vs engineering.

  • My PhD is valid and defensible as it is; it's just not innovative, but that is another story. Pointing out inconsistencies in a public database is not what my PhD is about and will not be used to "save" my PhD, as it doesn't need to be saved. :-)




Answer



There are many critical scientific resources out there that have massive known flaws, but are still useful because the flaws don't prevent people from getting high value from the resources. GenBank, for example, is the predominant source of genetic information in the world and is also known to have many mislabelled sequences.


From what you have written, it is not clear whether something like this is also the case with the resource that you are dealing with. The course I would recommend you take depends on 1) the degree to which the flaws are known and 2) their degree of impact on scientific conclusions derived from the database. The key cases that I see are:




  • The flaws are well-known and researchers are able to work around them: Here, there's nothing you really have to do, and the maintainers are unlikely to be particularly responsive to your complaints, since their system is "good enough" and they likely have other priorities.




  • The flaws are well-known and difficult to work around: This seems the least likely case, since people would then have little reason to use this database rather than the alternative that you mention. If it is so, however, you should probably just finish your thesis and move on: a paper on the flaws isn't interesting if they're already well-known, and while you should report your problems to the maintainers, you'll just be one more instance of the issues they already know about.





  • The flaws are not well-known and likely to cause serious problems in most research: In this case, a publication about the flaws is likely to be of interest and worth doing. It might or might not cause the database to be fixed, but it is likely to be important for alerting researchers who use the database to the problems in their work, which creates pressure for the database to be fixed or for people to migrate elsewhere.




  • The flaws are not well-known, but not likely to have a serious impact: This is likely to be the case if you are using the database in a very different way than most others, such that your research is more strongly impacted. In this case, talking about it in your thesis seems sufficient. You should document the problems you had and the flaws you discovered, but you are unlikely to get them corrected because you are not their target market.




Notice that in all of these cases, I assume that the database is unlikely to get fixed. That is because the persistence of the problems over time and the non-professional curation that you report indicate an organization that is probably missing either the resources or the incentives to make the fixes you would like, even in collaboration, although you might turn out to be pleasantly surprised.

