Thursday 4 August 2016

publications - Google Scholar stopped indexing PDFs on GitHub/GitLab Pages



I self-archive my papers on a personal homepage hosted using GitHub Pages (sulir.github.io). Some time after the page was published, I had my papers properly indexed and displayed in Google Scholar. Next to each article, there was a link named [PDF] pointing to the file hosted on my site, or after clicking "All versions", I could see a link to the PDF on my website.


A few months ago, I noticed that all of these PDF links had disappeared. Furthermore, papers that were published only on my website disappeared from the search results completely.


Note that this problem is not specific to my site. I searched for other researchers' papers hosted on GitHub/GitLab and found that Google Scholar does not display PDF links for them either. I suppose it affects all files hosted on GitHub Pages and GitLab Pages (or at least those not using a custom domain).


To support this claim: I know the "site:" operator on Google Scholar is quite limited (only primary versions are returned), but anyway:
filetype:pdf site:github.io - 0 results
filetype:pdf site:gitlab.io - 0 results


Is Google actively blocking such platforms? Or is it related to the technical details of how these pages are served? I also tried contacting scholar-support@google.com, but I did not get any response.


Edit: I also checked for a robots.txt file - it is not present. Furthermore, traditional Google search can find these files, so this problem is specific to Google Scholar.
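
For anyone who wants to repeat the robots.txt check on their own site, here is a minimal sketch using Python's standard urllib.robotparser. It works on the text of a robots.txt file, so it runs offline. The function name may_fetch, the example.github.io URL, and the choice of "Googlebot" as the user agent are assumptions for illustration; I do not know which user agent Google Scholar's crawler actually matches against.

```python
from urllib.robotparser import RobotFileParser


def may_fetch(robots_txt: str, url: str, user_agent: str = "Googlebot") -> bool:
    """Return True if the given robots.txt text allows user_agent to fetch url.

    An empty string models a site that has no robots.txt at all,
    which places no restrictions on crawlers.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Hypothetical PDF location on a GitHub Pages site.
    pdf_url = "https://example.github.io/papers/paper.pdf"

    # Case 1: no robots.txt (the situation described above).
    print(may_fetch("", pdf_url))

    # Case 2: a robots.txt that blocks the papers directory for everyone.
    print(may_fetch("User-agent: *\nDisallow: /papers/", pdf_url))
```

If the first case reports that fetching is allowed, robots.txt is not what keeps the crawler away, which matches what I observed.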



Edit 2: Suppose Google Scholar has intentionally excluded the whole github.io and gitlab.io domains from indexing. What options do I have for getting my PDF files indexed?




