Saturday, 30 November 2019

Ethics of scraping "public" data sources to obtain email addresses


I am wondering whether the following research practice is ethical.


A software engineering researcher downloads source code repositories from GitHub, a large source of publicly available open source code. The researcher searches the git commit logs to find email addresses of software developers who have committed to a project, and uses these addresses to email the developers asking them to participate in a survey. If the recipient clicks on the link to the survey, the survey contains an appropriate briefing and obtains informed consent. The researcher follows all institutional and legal requirements related to human subjects research. The researcher limits the number of emails sent to only the number of participants they think they will need to test their hypotheses. However, at least one recipient of this email is annoyed that the researcher obtained their email address in this fashion and sent them unsolicited email.



Is this an ethical research practice? In particular, what would be the relevant ethical principles or ethical framework for analyzing this question? I've read a bunch of papers and backgrounders on ethics in human subject research and in engineering research, but they seem focused on other issues. Are there accepted norms or guidelines relating to this sort of situation? Has it been considered in other fields, such as the social sciences?


A possible argument that the practice is ethical: The data source is publicly available, and the email addresses were collected from this publicly available data. Developers chose to make their software repository publicly available, and they should assume that any information contained in it is public. Developers who don't want to be contacted could have configured their git client to use a different email address. The research will benefit our understanding of the science of software development. Subjects have an opportunity to decide whether or not to participate in the survey. Participant confidentiality will be protected, and all responses will be treated anonymously. The research complies with all legal and compliance requirements. From a legal perspective, the emails are not "spam", since the unsolicited email was not sent for a commercial purpose.


A possible argument that the practice is unethical: Software developers probably would not expect someone to scrape email addresses from the git commit logs. Their email address might be contained in a publicly available data set, but some developers might consider the information private, or at least not public and free for unrestricted use. Some developers might object that it is one thing to use email addresses that are publicly listed on their GitHub profile page, but another thing to extract private email addresses that are provided as part of their git configuration. Their understanding of social norms is that the email addresses automatically inserted into the commit logs by their git client were not intended for this purpose. Some software developers might object to having an unwanted email message in their inbox or find the practice "creepy".




Please note: I am not asking about IRBs, legal requirements, or compliance. I am super-familiar with those considerations. Assume that the researcher has complied faithfully with all of those requirements that are applicable in their country. I'm not asking about that aspect. In my view, researchers have an independent obligation to conduct research in an ethical manner, and to exercise their own judgement in avoiding unethical behavior, even if it is legally permitted or approved by an IRB.




