I am really curious about how plagiarism checkers tell whether code or text was copied from the internet.
I understand the tokenization part, where the code is matched to measure similarity, but do these tools crawl the web live for each check, or do they run one huge crawl up front and store the results in some kind of database?
👤 palmfacehn Accepted Answer ✓
https://commoncrawl.org/ is a non-profit that offers a pre-crawled dataset of the web, so a tool does not necessarily have to run its own crawl. The specifics of individual tools probably vary. I imagine most tools would be based on academic datasets.
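To make the "crawl once, store in a db" idea concrete, here is a minimal sketch of how a pre-crawled corpus could be indexed and queried. This is an assumption about the general approach, not how any particular plagiarism tool works: the function names (`tokenize`, `shingles`, `build_index`, `find_matches`), the shingle size `k=3`, and the in-memory dict standing in for a database are all made up for illustration. Each document is split into overlapping word k-grams (shingles), each shingle is hashed, and an inverted index maps every hash to the documents that contain it.

```python
import hashlib
import re
from collections import defaultdict


def tokenize(text):
    # Lowercase and split on non-word characters (a deliberately crude tokenizer).
    return re.findall(r"\w+", text.lower())


def shingles(text, k=3):
    # Hash every run of k consecutive tokens; the set of hashes is the fingerprint.
    tokens = tokenize(text)
    return {
        hashlib.sha1(" ".join(tokens[i:i + k]).encode("utf-8")).hexdigest()[:16]
        for i in range(len(tokens) - k + 1)
    }


def build_index(corpus, k=3):
    # corpus: dict of doc_id -> text, standing in for pages from a stored crawl.
    # Returns an inverted index: shingle hash -> set of doc_ids containing it.
    index = defaultdict(set)
    for doc_id, text in corpus.items():
        for h in shingles(text, k):
            index[h].add(doc_id)
    return index


def find_matches(index, submission, k=3):
    # Returns doc_id -> fraction of the submission's shingles found in that doc.
    query = shingles(submission, k)
    if not query:
        return {}
    hits = defaultdict(int)
    for h in query:
        for doc_id in index.get(h, ()):
            hits[doc_id] += 1
    return {doc_id: n / len(query) for doc_id, n in hits.items()}


if __name__ == "__main__":
    # Hypothetical two-page "crawl".
    corpus = {
        "page-a": "The quick brown fox jumps over the lazy dog",
        "page-b": "Completely unrelated text about databases and crawlers",
    }
    index = build_index(corpus)
    print(find_matches(index, "A quick brown fox jumps over the lazy dog today"))
```

The expensive part (crawling and hashing) happens once, ahead of time; checking a submission is then just a handful of index lookups, which is why a stored crawl is more plausible than crawling the web for every check. A real system would presumably keep the index on disk and sample the hashes rather than store all of them.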