“The dblp computer science bibliography provides open bibliographic information on major computer science journals and proceedings. Originally created at the University of Trier in 1993, dblp is now operated and further developed by Schloss Dagstuhl.” -DBLP website
Fortunately, the dblp website provided a way to download their database. Unfortunately, the data was a single 2.8 GB XML file. The dblp team, being as cool as they are, also provided a Java “library” to parse the XML for data analysis, but it was too slow for our analysis to run in a reasonable amount of time.
To overcome this problem, I wrote a Python script to parse the XML and move the data into a Postgres database, which, once indexed, would be orders of magnitude faster to query than the XML. Before starting the analysis, we tested whether all the data had been parsed correctly by running queries for some citation data we already knew, and we found that a substantial number of citations were missing. The database had nearly all the publications, but their citation data, which was what this study was about, was missing. Later on, we found that the XML itself did not contain that data.
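For readers who want to try something similar, here is a minimal sketch of the streaming approach. It is not the original script: it uses the standard library's `sqlite3` as a stand-in for Postgres so it runs without a server, and the table names, the set of record tags, and the tiny `SAMPLE` document are assumptions made up for illustration. The real dblp dump also relies on character entities declared in its DTD, which the standard library parser does not resolve, so a real run would likely need a DTD-aware parser such as lxml.

```python
import io
import sqlite3
import xml.etree.ElementTree as ET

# Record tags assumed for illustration; the real dblp DTD defines more.
RECORD_TAGS = {"article", "inproceedings"}


def iter_records(source):
    """Yield (key, title, year, authors) one record at a time."""
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if event == "end" and elem.tag in RECORD_TAGS:
            title_elem = elem.find("title")
            # itertext() also collects text inside nested markup such as <i>.
            title = "".join(title_elem.itertext()) if title_elem is not None else None
            year = elem.findtext("year")
            authors = [a.text for a in elem.findall("author")]
            yield elem.get("key"), title, int(year) if year else None, authors
            root.clear()  # drop processed records so memory use stays flat


def load(source, conn):
    """Copy every record into two tables (hypothetical schema), then index."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS publication "
        "(key TEXT PRIMARY KEY, title TEXT, year INTEGER)"
    )
    conn.execute("CREATE TABLE IF NOT EXISTS authorship (pub_key TEXT, author TEXT)")
    for key, title, year, authors in iter_records(source):
        conn.execute("INSERT INTO publication VALUES (?, ?, ?)", (key, title, year))
        conn.executemany(
            "INSERT INTO authorship VALUES (?, ?)", [(key, a) for a in authors]
        )
    # Build the index after the bulk load so inserts are not slowed down by it.
    conn.execute(
        "CREATE INDEX IF NOT EXISTS idx_authorship_author ON authorship (author)"
    )
    conn.commit()


# Made-up sample in the general shape of a dblp record, not real data.
SAMPLE = b"""<dblp>
<article key="journals/x/A1"><author>Ada A.</author><author>Bob B.</author>
<title>First paper.</title><year>2001</year></article>
<inproceedings key="conf/y/B2"><author>Ada A.</author>
<title>Second paper.</title><year>2003</year></inproceedings>
</dblp>"""

conn = sqlite3.connect(":memory:")
load(io.BytesIO(SAMPLE), conn)
rows = conn.execute(
    "SELECT pub_key FROM authorship WHERE author = ? ORDER BY pub_key", ("Ada A.",)
).fetchall()
print(rows)
```

The point of `iterparse` plus `root.clear()` is that the file is never held in memory as a whole, which matters for a multi-gigabyte dump. Spot-checking a few records you already know, as in the last query, is the same kind of sanity check that exposed the missing citations.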
We set out to find other sources of data, such as Orcid and Scopus. Unfortunately, they did not include the citation data we needed. Google Scholar had the data but did not provide an API (between us, I tried scraping. Spoiler alert: that did not work). Microsoft Academic’s API had time-based call restrictions which, if adhered to, would have made the analysis take an unreasonable amount of time.
|Source|Why it didn't work|
|---|---|
|DBLP|No citation data|
|Orcid|Citation data not publicly available|
|Scopus|No citation data|
|Google Scholar|No publicly available API|
|Microsoft Academic|API call restrictions too tight for the project|
Now onto the real conclusion: there were no freely available sources of citation data that would allow a large-scale analysis. This is unfortunate, as self-citation is an unethical practice. Such an analysis could have revealed correlations between the environment, or other factors, and an increase in self-citation, which in turn would help us avoid creating an environment or atmosphere that promotes or enables it.