What is the best way to get a data dump of HN?
I’m working on a little text-indexing side project, and I think the content posted to HN would be a good dataset to work with. What is the best way to get a dump of all the URLs that have been submitted to HN? Asking for ideas before firing up a crawler. Are there existing dumps? APIs?
You can find a link to HN's API in the footer of the page. Unfortunately, it's a bit awkward to work with, but it isn't rate limited.
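If you do go the API route, here's a minimal sketch of a URL dump in Python, assuming the documented Firebase endpoints (maxitem.json and item/<id>.json, one request per item). The endpoint names come from HN's API docs; the helper function names are just mine:

    import json
    import urllib.request

    API = "https://hacker-news.firebaseio.com/v0"

    def fetch_json(path):
        # GET one JSON document from the HN Firebase API.
        with urllib.request.urlopen(f"{API}/{path}", timeout=30) as resp:
            return json.load(resp)

    def iter_story_urls(start_id=1):
        # Item ids are sequential, so walking 1..maxitem visits every
        # item ever posted (stories, comments, jobs, polls). Keep only
        # stories that link out; deleted/missing ids come back as null
        # or without a "url" field.
        max_id = fetch_json("maxitem.json")
        for item_id in range(start_id, max_id + 1):
            item = fetch_json(f"item/{item_id}.json")
            if item and item.get("type") == "story" and "url" in item:
                yield item_id, item["url"]

    if __name__ == "__main__":
        # Smoke test: print the first ten submitted URLs.
        for i, (item_id, url) in enumerate(iter_story_urls()):
            print(item_id, url)
            if i >= 9:
                break

That awkwardness I mentioned is mostly this: one HTTP call per item means a full crawl over tens of millions of ids takes a long time, so you'd want to parallelize requests and checkpoint your progress. But it shows the shape of the data.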
Have you considered looking at the bottom of the page as well as the top?