Tuesday 8 March 2016

publications - Getting a dump of arXiv metadata


Prompted by the discussion in this post on Meta.MathOverflow.net, I got interested in comparing the usage of tags on MathOverflow with submissions in the corresponding disciplines of arXiv. (See a similar idea for language popularity, GitHub vs StackOverflow, or the one from 2015.) Moreover, as people often use their real names on MO, it may be interesting to check the overlap of mathematicians.


The question is, how to get such data?


There is the arXiv API, but for bulk downloads of metadata they recommend the Open Archives Initiative (OAI). Yet, as far as I can see, it can query only one article at a time, and one needs to know its id. So without knowing the arXiv ids beforehand, it turns into a guessing game.



There are some plots in arXiv usage statistics, yet I don't see this exact data.


Also, one can get the total number of submissions to math from the links in Mathematics -> Article statistics by year, but that lacks the split into subdisciplines.



Answer



My main confusion was not realizing that the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) is a separate protocol, not a subset of the arXiv API.


In this case, the relevant queries are ListIdentifiers (10k items per query) and ListRecords (1k items per query). To get just identifiers we need to write:


http://export.arxiv.org/oai2?verb=ListIdentifiers&set=math&metadataPrefix=oai_dc

It results in 10k identifiers in the following form:



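    <responseDate>2015-02-16T19:28:22Z</responseDate>
    <request>http://export.arxiv.org/oai2</request>
    <ListIdentifiers>
      <header>
        <identifier>oai:arXiv.org:0704.0002</identifier>
        <datestamp>2008-12-13</datestamp>
        <setSpec>math</setSpec>
      </header>
      ...
      <header>
        <identifier>oai:arXiv.org:0712.1769</identifier>
        <datestamp>2011-06-23</datestamp>
        <setSpec>math</setSpec>
      </header>
      <resumptionToken>760571|10001</resumptionToken>
    </ListIdentifiers>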



As there are more results, to get the next batch we need to pass the resumptionToken, in this case:


http://export.arxiv.org/oai2?verb=ListIdentifiers&resumptionToken=760571|10001


and so on.
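The paging described above can be sketched in Python. This is a minimal sketch, not a tested harvester: the function names `parse_identifiers_page` and `harvest_identifiers`, the `fetch` hook and the default delay are my own assumptions; only the URL, the verb and the parameters come from the queries above.

```python
import time
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

# Namespace used by OAI-PMH 2.0 responses.
OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"
BASE_URL = "http://export.arxiv.org/oai2"


def parse_identifiers_page(xml_data):
    """Return (identifiers, resumption_token) for one ListIdentifiers page.

    The token is None when the element is missing or empty,
    which signals the last page.
    """
    root = ET.fromstring(xml_data)
    ids = [el.text for el in root.iter(OAI_NS + "identifier")]
    token_el = root.find(".//" + OAI_NS + "resumptionToken")
    token = token_el.text if token_el is not None and token_el.text else None
    return ids, token


def harvest_identifiers(set_spec="math", delay=25, fetch=None):
    """Yield identifiers page by page, following resumption tokens.

    `fetch` is a hypothetical hook (url -> response body) so the loop
    can be exercised without network access.
    """
    if fetch is None:
        def fetch(url):
            with urllib.request.urlopen(url) as resp:
                return resp.read()

    params = {"verb": "ListIdentifiers", "set": set_spec,
              "metadataPrefix": "oai_dc"}
    while True:
        url = BASE_URL + "?" + urllib.parse.urlencode(params)
        ids, token = parse_identifiers_page(fetch(url))
        yield from ids
        if token is None:
            break
        # Follow-up requests carry only the verb and the token.
        params = {"verb": "ListIdentifiers", "resumptionToken": token}
        time.sleep(delay)  # pause between requests
```

Calling `list(harvest_identifiers())` would issue real requests, one page at a time, so it is worth trying on a narrow date range first.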


Other useful parameters are from and until, e.g. as in


http://export.arxiv.org/oai2?verb=ListIdentifiers&set=math&metadataPrefix=oai_dc&from=2015-01-14&until=2015-01-14

To directly get categories (bear in mind that set=math specifies mathematics, but there are no smaller subsets), one can write:


http://export.arxiv.org/oai2?verb=ListRecords&set=math&from=2015-01-01&until=2015-01-31&metadataPrefix=arXiv

It's important to set metadataPrefix=arXiv, so that subdisciplines will be listed:




    <categories>math-ph cond-mat.other math.MP nlin.CD physics.class-ph</categories>
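To turn such records into per-subdiscipline counts, one could tally the space-separated category strings. The sketch below assumes the categories sit in elements whose local name is `categories`, and it ignores the XML namespace so it does not depend on the exact schema; `count_categories` is a hypothetical helper name.

```python
from collections import Counter
import xml.etree.ElementTree as ET


def count_categories(xml_data):
    """Count category labels across all records in one ListRecords page."""
    counts = Counter()
    for el in ET.fromstring(xml_data).iter():
        # Strip a "{namespace}" prefix, if any, to get the local tag name.
        local_name = el.tag.rsplit("}", 1)[-1]
        if local_name == "categories" and el.text:
            counts.update(el.text.split())
    return counts
```

A paper cross-listed in several categories is counted once in each, which seems the natural choice when comparing against tags.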


EDIT:


I used a delay between queries, as Nate Eldredge suggested; in my case, 25 s. Yet, while trying to get all of math (250k items, so 250 queries), it gave an error at query 70. I did continue (with an even higher delay), but somewhere around query 110 the query was no longer available.


So, the way to go is to fetch smaller chunks, e.g. by month (or, for mathematics, at most by year).
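Chunking by month amounts to generating from/until pairs and building one ListRecords URL per pair. Here is a minimal sketch, assuming inclusive day-granularity dates as in the queries above; `month_ranges` and `list_records_url` are hypothetical helper names.

```python
import calendar
import datetime
import urllib.parse

BASE_URL = "http://export.arxiv.org/oai2"


def month_ranges(start_year, start_month, end_year, end_month):
    """Yield (from, until) ISO date pairs, one per calendar month, inclusive."""
    year, month = start_year, start_month
    while (year, month) <= (end_year, end_month):
        last_day = calendar.monthrange(year, month)[1]
        yield (datetime.date(year, month, 1).isoformat(),
               datetime.date(year, month, last_day).isoformat())
        month += 1
        if month > 12:
            year, month = year + 1, 1


def list_records_url(from_date, until_date, set_spec="math"):
    """Build a ListRecords URL for one date window, with metadataPrefix=arXiv."""
    params = {"verb": "ListRecords", "set": set_spec,
              "from": from_date, "until": until_date,
              "metadataPrefix": "arXiv"}
    return BASE_URL + "?" + urllib.parse.urlencode(params)
```

Each window may still span several pages, so the resumptionToken still has to be followed within a month.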

