- HN search results (https://hn.algolia.com/?q=information+extraction)
- Apache Tika (https://tika.apache.org/)
- Apache OpenNLP (https://opennlp.apache.org/)
- Apache UIMA (https://uima.apache.org/external-resources.html)
- GATE (https://gate.ac.uk/)
But I'm not sure whether any of these can do the job, as I haven't used them. I also know there are companies that have developed similar solutions (https://www.ontotext.com/knowledgehub/case-studies/ai-content-generation-in-scientific-communication/), possibly using GraphDB. In addition, what is the best data storage solution? In one case you extract a whole table from a publication; in another, just a single data point, and it's not worth the effort of creating a separate table for one value. What would be the right approach, software (library), workflow, and data storage solution in this case?
If getting the right answer matters to you, you need to start with a workflow system that lets you do the task manually. You will absolutely need it for two reasons: (1) correcting cases the extraction system gets wrong, and (2) creating a training/evaluation set for the extraction pipeline.
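The record format and file layout here are assumptions, not anything the tools above prescribe, but one minimal way to get both benefits at once is to log every human review as a JSON-lines record pairing the pipeline's guess with the corrected value. The same file then doubles as your evaluation set:

```python
import json

# Hypothetical record shape for manually reviewed extractions: one JSON
# object per line, holding the source document id, the field extracted,
# the pipeline's guess, and the human-corrected value.
def load_gold(path):
    """Read reviewed records from a JSON-lines file."""
    with open(path) as f:
        return [json.loads(line) for line in f]

def accuracy(records):
    """Fraction of records where the pipeline matched the human correction."""
    if not records:
        return 0.0
    hits = sum(1 for r in records if r["predicted"] == r["corrected"])
    return hits / len(records)
```

The point is less the code than the habit: every correction a reviewer makes is captured once and reused forever, both to patch the bad output and to score the next version of the pipeline.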
Once you have a well-defined task you can do manually, you can think about automating some of the extraction (80% is realistic) with rules such as regexes, or with RNN/CNN/Transformer models.
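To make the rules-first option concrete, here is a sketch of a single regex rule for pulling one kind of data point out of raw text. The field ("melting point") and the pattern are invented for illustration; a real pipeline would be a stack of rules like this, each validated against the manually built gold set:

```python
import re

# Illustrative rule: pull "melting point: 123.4 C"-style values out of
# plain text. The pattern tolerates "Melting point 123 C" and an optional
# degree sign; everything else about the format is an assumption.
MELTING_POINT = re.compile(
    r"melting\s+point[:\s]+(-?\d+(?:\.\d+)?)\s*(?:°\s*)?C",
    re.IGNORECASE,
)

def extract_melting_points(text):
    """Return every melting point (in °C) found in `text`, as floats."""
    return [float(m) for m in MELTING_POINT.findall(text)]
```

A rule this simple is exactly the kind of thing the 80% automation comes from: cheap to write, easy to test against the gold set, and easy to discard when a model outperforms it.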
My contacts in Argentina who do projects like this all the time say it takes maybe 20,000 examples to train an extraction model, and that fits my experience. What separates the people who succeed at this kind of project from those who fail is that the ones who succeed build the training set; the ones who fail exhaust themselves evaluating projects like Tika, OpenNLP, UIMA, etc.