HACKER Q&A
📣 sgt

Reverse-Generating Regexps


Long shot, but ...

Is there a way to reverse-generate regular expressions from sample data? I have a bunch of sample data that is sometimes grouped with commas in between (almost like addresses), sometimes not.

Like addresses, the entries are often somewhat similar, but there are still a lot of combinations to think of. That is why I'm wondering whether it's possible to generate valid regular expressions (one per group, catering for the numbers and words that differ) from a huge amount of sample data.


  👤 compressedgas Accepted Answer ✓
I thought you meant generate strings from a regex. But you actually want to infer a regex from a set of example strings.

I am unaffiliated with the authors of the following paper. I only think it might act as a gateway to other relevant literature if its implementation does not turn out to be useful.

> Inference of Regular Expressions for Text Extraction from Examples (Alberto Bartoli, Andrea De Lorenzo, Eric Medvet, and Fabiano Tarlao) 2016 doi:10.1109/TKDE.2016.2515587 https://www.human-competitive.org/sites/default/files/bartol...

Which has a GPLv3 implementation in Java: https://github.com/MaLeLabTs/RegexGenerator

It also happens to be on the first Google search result page for:

> infer a regex from a set of strings
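
To make "infer a regex from a set of example strings" concrete, here is a naive Python sketch. It is much simpler than, and unrelated to, the approach in the paper above: it only splits each sample into runs of digits, ASCII letters and whitespace, groups samples that share the same sequence of runs, and emits one pattern per group with length ranges. The names `tokenize` and `infer_patterns` and the sample strings are made up for illustration.

```python
import re
from collections import defaultdict

# Runs of digits, ASCII letters, or whitespace; any other character is kept as a literal.
TOKEN_RE = re.compile(
    r"(?P<digits>\d+)|(?P<letters>[A-Za-z]+)|(?P<space>\s+)|(?P<other>.)",
    re.DOTALL,
)
CLASS_PATTERN = {"digits": r"\d", "letters": r"[A-Za-z]", "space": r"\s"}


def tokenize(sample):
    """Split a sample into (kind, text) runs covering the whole string."""
    return [(m.lastgroup, m.group()) for m in TOKEN_RE.finditer(sample)]


def infer_patterns(samples):
    """Return one regex per distinct 'shape' (sequence of run kinds) in samples."""
    groups = defaultdict(list)
    for sample in samples:
        tokens = tokenize(sample)
        # The shape keeps literal characters (commas etc.) but not run contents.
        shape = tuple((kind, text if kind == "other" else None) for kind, text in tokens)
        groups[shape].append([len(text) for _, text in tokens])

    patterns = []
    for shape, length_rows in groups.items():
        parts = []
        for i, (kind, literal) in enumerate(shape):
            if kind == "other":
                parts.append(re.escape(literal))
                continue
            # Generalize run lengths to the min..max seen at this position.
            lo = min(row[i] for row in length_rows)
            hi = max(row[i] for row in length_rows)
            quantifier = "{%d}" % lo if lo == hi else "{%d,%d}" % (lo, hi)
            parts.append(CLASS_PATTERN[kind] + quantifier)
        patterns.append("".join(parts))
    return patterns


if __name__ == "__main__":
    samples = ["12 Main St, Springfield", "7 Oak Ave, Shelbyville", "PO Box 42"]
    for pattern in infer_patterns(samples):
        print(pattern)
```

With these three samples the two comma-separated lines share a shape and collapse into one pattern, while "PO Box 42" gets its own. Something this crude overfits to the lengths it has seen and knows nothing about optional parts, so treat it only as a starting point for thinking about the problem.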