HACKER Q&A
📣 mnky9800n

Is there a data set for GitHub repos associated with academic papers?


Code is often included in academic publications, but I haven't seen a list of repos anywhere that is connected to DOIs from papers or to Zenodo. Does this already exist somewhere, or do I need to create it?


  👤 mks_shuffle Accepted Answer ✓
For ML/DL papers you can check https://paperswithcode.com/

👤 sargstuff
20+ tools to help you mine and analyze GitHub and Git data[0]

"Google Scholar, Microsoft Academic, Scopus, Dimensions, Web of Science, and OpenCitations’ COCI: a multidisciplinary comparison of coverage via citations" (2020)[1]

"Analyzing the GitHub Repositories of Research Papers" (2020)[2]

HN comments on "OpenAlex: The Promising Alternative to Microsoft Academic Graph"[3] have additional related links

"The Lens"[4]

Reference links to related knowledge graph sites (2022)[5]

-----

[0] : https://livablesoftware.com/tools-mine-analyze-github-git-so...

[1] : https://link.springer.com/article/10.1007/s11192-020-03690-4

[2] : "Analyzing the GitHub Repositories of Research Papers" : https://livablesoftware.com/tools-mine-analyze-github-git-so...

[3] : https://news.ycombinator.com/item?id=31271477

[4] : The Lens : https://about.lens.org/the-lens-scholarly-metarecord-strateg...

[5] : "The Microsoft Academic Knowledge Graph enhanced: Author name disambiguation, publication classification, and embeddings" : https://direct.mit.edu/qss/article/3/1/51/109628/The-Microso...

-----

post publication edit addition #1:

If you have a GitHub reference, you can try using [i] to find related papers.

[i] https://aarontay.medium.com/3-new-tools-to-try-for-literatur...


👤 ebfe1
One idea: you can play with the GitHub dataset in the ClickHouse playground.

This is just a quick SQL query that looks for the DOI pattern in any issue, pull request, or review comment on a repository:

https://play.clickhouse.com/play?user=play#c2VsZWN0IHJlcG9fb...

```
select repo_name, event_type, body
from github_events
where event_type in ('IssueCommentEvent', 'IssuesEvent', 'PullRequestEvent', 'PullRequestReviewCommentEvent')
  and match(body, '.10\.\d{4,9}\/[-\._;()\/:A-Z0-9]+.')
limit 10
```

Perhaps you can extend that :)
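
One possible way to build on the query, sketched in Python: take the (repo_name, body) rows it returns, pull out each DOI-like string, and group them by repository. The row format, the helper name `dois_by_repo`, and the sample rows are assumptions for illustration. The pattern is the same DOI pattern as in the query, but made case-insensitive here, since many DOI suffixes are lowercase and the A-Z class alone would miss them.

```python
import re
from collections import defaultdict

# Same DOI pattern as the query above, but case-insensitive (an assumption of
# this sketch; the original query matches uppercase suffixes only).
DOI_RE = re.compile(r"10\.\d{4,9}/[-._;()/:A-Z0-9]+", re.IGNORECASE)


def dois_by_repo(rows):
    """Map each repo name to the sorted DOIs found in its comment bodies.

    `rows` is assumed to be an iterable of (repo_name, body) pairs, for
    example the query result exported from ClickHouse.
    """
    found = defaultdict(set)
    for repo_name, body in rows:
        for match in DOI_RE.findall(body or ""):
            # Heuristic: trailing punctuation usually belongs to the
            # sentence, not the DOI. DOIs are case-insensitive, so
            # lowercase them to deduplicate.
            found[repo_name].add(match.rstrip(".,;:)").lower())
    return {repo: sorted(dois) for repo, dois in found.items()}


# Made-up sample rows, for illustration only.
sample_rows = [
    ("example/repo-a", "See https://doi.org/10.1007/s11192-020-03690-4 for details."),
    ("example/repo-a", "Cited as doi:10.1007/S11192-020-03690-4."),
    ("example/repo-b", "No identifier here."),
]

print(dois_by_repo(sample_rows))
```

Stripping trailing punctuation is only a heuristic: a DOI that really ends in ")" would be cut short, so treat the output as candidates to check rather than a clean mapping.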


👤 Amir6
Google the following:

site:arxiv.org filetype:pdf "github.com"
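
Once a search like this turns up papers, the repository links still have to be pulled out of the text. A minimal sketch, assuming the PDF text has already been extracted to a plain string (that step is not shown) and using a hypothetical helper name, `github_repos`. The character sets for owner and repo are an approximation, not GitHub's official rules.

```python
import re

# Matches github.com/<owner>/<repo>. The allowed characters are an
# approximation for illustration, not an official grammar.
GITHUB_RE = re.compile(
    r"github\.com/([A-Za-z0-9-]+)/([A-Za-z0-9._-]+)", re.IGNORECASE
)


def github_repos(text):
    """Return unique owner/repo slugs mentioned in `text`, in order of appearance."""
    seen = []
    for owner, repo in GITHUB_RE.findall(text):
        repo = repo.rstrip(".")  # drop a sentence-ending period
        if repo.endswith(".git"):
            repo = repo[: -len(".git")]  # drop a clone-URL suffix
        slug = f"{owner}/{repo}".lower()  # lowercase so case variants deduplicate
        if slug not in seen:
            seen.append(slug)
    return seen


# Made-up sample text, for illustration only.
sample_text = (
    "Code is available at https://github.com/example/paper-code. "
    "A mirror lives at GitHub.com/Example/paper-code.git"
)
print(github_repos(sample_text))
```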