You could then evaluate the results by running TeXmacs with one ported method at a time to see if it seems to be computing the same thing as the original C++ method.
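A minimal sketch of what such a per-method check could look like, as a shadow wrapper around the original call site. All names here (`upcase`, `upcase_original`, `upcase_ported`) are hypothetical stand-ins, not actual TeXmacs code, and the ported version might in practice be reached through some FFI bridge:

```cpp
// Differential-testing sketch (hypothetical names, not real TeXmacs code):
// route one method at a time through the ported implementation and compare
// its output against the original C++ code on the same input.
#include <cctype>
#include <iostream>
#include <string>

// Stand-in for the original C++ method still living in TeXmacs.
std::string upcase_original (const std::string& s) {
  std::string r = s;
  for (char& c : r)
    c = static_cast<char> (std::toupper (static_cast<unsigned char> (c)));
  return r;
}

// Stand-in for the ported implementation (e.g. called over an FFI bridge).
std::string upcase_ported (const std::string& s) {
  std::string r = s;
  for (char& c : r)
    c = static_cast<char> (std::toupper (static_cast<unsigned char> (c)));
  return r;
}

// Shadow wrapper installed at the original call site: runs both versions,
// reports any divergence, and keeps returning the trusted original result.
std::string upcase (const std::string& s) {
  std::string expected = upcase_original (s);
  std::string actual   = upcase_ported (s);
  if (expected != actual)
    std::cerr << "port mismatch on \"" << s << "\": "
              << expected << " != " << actual << "\n";
  return expected;
}

int main () {
  std::cout << upcase ("texmacs") << "\n";  // prints TEXMACS, no mismatch
  return 0;
}
```

Once a method's wrapper stops reporting mismatches across normal use, the original could be dropped and the ported version promoted.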
So this AI benchmark would serve two purposes: testing AIs on their ability to port apps in general while also porting a particular app.
Just imagine the two extremes:
- every feature is translated very well into the other language, except the main function, and hence the program does not start
- no function or feature is properly translated, but the program starts.
Which one gets the better mark? In my eyes, it would be the second one: at least it starts in the end.
So a benchmark must consist of many different comparable tasks, not of a single (OSS) program like TeXmacs. If I've understood everything properly...
This approach does not seem manageable.