Prompted by a discussion in this post on Meta.MathOverflow.net, I got interested in comparing the usage of tags on MathOverflow with submissions to the corresponding arXiv disciplines. (Compare a similar idea for programming language popularity, GitHub vs StackOverflow (or one from 2015).) Moreover, as people often use their real names on MO, it may be interesting to check how much the two groups of mathematicians overlap.
The question is, how to get such data?
There is an arXiv API, but for bulk downloads of metadata they recommend the Open Archives Initiative (OAI) interface. Yet, as far as I can see, it can query only one article at a time, and one needs to know its id. So without knowing arXiv ids beforehand, it turns into a guessing game.
There are some plots in arXiv usage statistics, yet I don't see this exact data.
Also, one can get the total number of submissions to math from the links in Mathematics -> Article statistics by year, but that lacks the split into subdisciplines.
Answer
My main confusion was not realizing that the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) is a separate protocol, not a subset of the arXiv API.
In this case, the relevant queries are ListIdentifiers (10k items per query) and ListRecords (1k items per query). To get just identifiers we need to write:
http://export.arxiv.org/oai2?verb=ListIdentifiers&set=math&metadataPrefix=oai_dc
It results in 10k identifiers in the following form:
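<responseDate>2015-02-16T19:28:22Z</responseDate>
<request>http://export.arxiv.org/oai2</request>
<ListIdentifiers>
<header>
<identifier>oai:arXiv.org:0704.0002</identifier>
<datestamp>2008-12-13</datestamp>
<setSpec>math</setSpec>
</header>
...
<header>
<identifier>oai:arXiv.org:0712.1769</identifier>
<datestamp>2011-06-23</datestamp>
<setSpec>math</setSpec>
</header>
<resumptionToken>760571|10001</resumptionToken>
</ListIdentifiers>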
As there are more results, to get the next batch we need to specify resumptionToken, in this case:
http://export.arxiv.org/oai2?verb=ListIdentifiers&resumptionToken=760571|10001
and so on.
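The paging loop described above could be sketched in Python roughly as follows. This is only a minimal sketch, not a tested harvester: the function names (parse_identifiers, next_page_url, harvest_identifiers) and the injectable fetch argument are made up for illustration, and the 25 s default pause is simply the value mentioned in the EDIT below.

```python
import time
import xml.etree.ElementTree as ET
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "http://export.arxiv.org/oai2"
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}  # OAI-PMH 2.0 namespace


def parse_identifiers(xml_text):
    """Return (identifiers, resumption_token or None) for one response page."""
    root = ET.fromstring(xml_text)
    ids = [el.text for el in root.iterfind(".//oai:header/oai:identifier", OAI_NS)]
    token_el = root.find(".//oai:resumptionToken", OAI_NS)
    # An absent or empty resumptionToken element marks the last page.
    token = token_el.text if token_el is not None and token_el.text else None
    return ids, token


def next_page_url(token):
    """Build the follow-up query; only verb and resumptionToken are sent.

    urlencode percent-encodes the "|" in the token as %7C.
    """
    return BASE_URL + "?" + urlencode(
        {"verb": "ListIdentifiers", "resumptionToken": token}
    )


def harvest_identifiers(first_url, delay=25.0, fetch=None):
    """Yield identifiers page by page, pausing `delay` seconds between requests.

    `fetch` is a hypothetical hook (url -> XML text or bytes) so the loop can be
    exercised without touching the network; by default it uses urlopen.
    """
    if fetch is None:
        fetch = lambda url: urlopen(url).read()
    url = first_url
    while url:
        ids, token = parse_identifiers(fetch(url))
        yield from ids
        url = next_page_url(token) if token else None
        if url:
            time.sleep(delay)
```

Calling list(harvest_identifiers(first_url)) with the ListIdentifiers URL shown earlier would then walk through all pages, one request at a time.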
Other useful parameters are from and until, e.g. as in
http://export.arxiv.org/oai2?verb=ListIdentifiers&set=math&metadataPrefix=oai_dc&from=2015-01-14&until=2015-01-14
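For scripted use, a date-restricted query like the one above can be assembled from its parts. The helper below is a hypothetical convenience function (its name and signature are not part of any arXiv tooling); it only builds the URL and does not send the request.

```python
from datetime import date
from urllib.parse import urlencode

BASE_URL = "http://export.arxiv.org/oai2"


def list_identifiers_url(set_spec, from_date=None, until_date=None,
                         metadata_prefix="oai_dc"):
    """Build a ListIdentifiers URL, optionally limited to a date range."""
    params = {
        "verb": "ListIdentifiers",
        "set": set_spec,
        "metadataPrefix": metadata_prefix,
    }
    # `from` and `until` are both optional; dates are sent as YYYY-MM-DD.
    if from_date is not None:
        params["from"] = from_date.isoformat()
    if until_date is not None:
        params["until"] = until_date.isoformat()
    return BASE_URL + "?" + urlencode(params)


print(list_identifiers_url("math", date(2015, 1, 14), date(2015, 1, 14)))
# prints http://export.arxiv.org/oai2?verb=ListIdentifiers&set=math&metadataPrefix=oai_dc&from=2015-01-14&until=2015-01-14
```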
To directly get categories (bear in mind that set=math specifies mathematics, but there are no smaller subsets), one can write:
http://export.arxiv.org/oai2?verb=ListRecords&set=math&from=2015-01-01&until=2015-01-31&metadataPrefix=arXiv
It's important to set metadataPrefix=arXiv, so that subdisciplines will be listed:
math-ph cond-mat.other math.MP nlin.CD physics.class-ph
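Once such records are downloaded, tallying submissions per subdiscipline is a matter of splitting each category string on whitespace. The sketch below is illustrative only: count_categories is a made-up name, the sample XML is a hand-written stand-in for a real response (including its namespace URIs), and treating the first listed category as the primary one is an assumption about how arXiv orders them.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hand-written stand-in for a ListRecords response; real responses carry
# many more fields per record.
SAMPLE = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
 <ListRecords>
  <record><metadata><arXiv xmlns="http://arxiv.org/OAI/arXiv/">
   <categories>math-ph cond-mat.other math.MP nlin.CD physics.class-ph</categories>
  </arXiv></metadata></record>
  <record><metadata><arXiv xmlns="http://arxiv.org/OAI/arXiv/">
   <categories>math.AG math.NT</categories>
  </arXiv></metadata></record>
 </ListRecords>
</OAI-PMH>
"""


def count_categories(xml_text, primary_only=False):
    """Count category labels over all records in one response page."""
    counts = Counter()
    for el in ET.fromstring(xml_text.strip()).iter():
        # Compare local tag names so the code does not depend on the
        # exact namespace URI of the metadata format.
        if el.tag.rsplit("}", 1)[-1] == "categories" and el.text:
            cats = el.text.split()
            # Assumption: the first listed category is the primary one.
            counts.update(cats[:1] if primary_only else cats)
    return counts


print(count_categories(SAMPLE))
print(count_categories(SAMPLE, primary_only=True))
```

Counting all listed categories versus only the first one gives different totals for cross-listed papers, so it is worth deciding up front which of the two is being compared with MathOverflow tags.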
EDIT:
I used a delay between queries, as Nate Eldredge suggested (25 s in my case). Yet, while trying to get all of math (250k items, so 250 queries), it gave an error at query 70. I continued (with an even higher delay), but somewhere around query 110 the query was no longer available.
So the way to go is to fetch smaller chunks, e.g. by month (or, for mathematics, by year at most).
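Splitting a harvest into monthly chunks might look like the sketch below. The helper names (month_windows, list_records_url) are hypothetical, and the code only prepares the query URLs; a month with more than 1k records would still need resumptionToken paging, and a pause between requests is still advisable.

```python
import calendar
from datetime import date
from urllib.parse import urlencode

BASE_URL = "http://export.arxiv.org/oai2"


def month_windows(year):
    """Return a (first_day, last_day) pair for each month of `year`."""
    windows = []
    for month in range(1, 13):
        last_day = calendar.monthrange(year, month)[1]  # handles leap years
        windows.append((date(year, month, 1), date(year, month, last_day)))
    return windows


def list_records_url(start, end, set_spec="math"):
    """Build a ListRecords URL for one date window, with arXiv metadata."""
    params = {
        "verb": "ListRecords",
        "set": set_spec,
        "from": start.isoformat(),
        "until": end.isoformat(),
        "metadataPrefix": "arXiv",
    }
    return BASE_URL + "?" + urlencode(params)


urls = [list_records_url(start, end) for start, end in month_windows(2015)]
print(urls[0])
# prints http://export.arxiv.org/oai2?verb=ListRecords&set=math&from=2015-01-01&until=2015-01-31&metadataPrefix=arXiv
```

Each chunk is then an independent, short query sequence, so a failure partway through costs one month of work rather than the whole harvest.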