How does one implement web plagiarism detection?

Question

I am really curious about how plagiarism tools tell whether code or text was copied from the internet. I understand the tokenization part, where the code is matched to find similarity, but do these tools actually crawl the web for each check, or do they do one huge web crawl up front and store the results in some kind of database?

palmfacehn · Accepted Answer

Common Crawl (https://commoncrawl.org/) is a non-profit which offers a pre-crawled dataset, so a tool could search an existing crawl instead of crawling the web itself. The specifics of individual tools probably vary. I imagine most tools would be based on academic datasets.
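To make the "crawl once, store, then look up" option concrete, here is a minimal sketch in Python. It is not how any particular plagiarism tool works; it only illustrates one common approach, where each stored page is broken into overlapping word k-grams ("shingles"), the shingles are hashed, and the hashes go into an inverted index. The names `shingles` and `ShingleIndex` and the page ids are made up for illustration, and the in-memory dictionary stands in for whatever database would be built from a pre-crawled dataset.

```python
import hashlib
import re
from collections import defaultdict


def shingles(text, k=5):
    """Return the set of hashed k-word shingles for a piece of text."""
    words = re.findall(r"\w+", text.lower())
    grams = (" ".join(words[i:i + k]) for i in range(len(words) - k + 1))
    # Truncated SHA-1 keeps the stored fingerprints small.
    return {hashlib.sha1(g.encode("utf-8")).hexdigest()[:16] for g in grams}


class ShingleIndex:
    """Hypothetical in-memory stand-in for a database built from a crawl."""

    def __init__(self, k=5):
        self.k = k
        self.postings = defaultdict(set)  # shingle hash -> set of page ids

    def add(self, doc_id, text):
        """Index one crawled page under every shingle it contains."""
        for h in shingles(text, self.k):
            self.postings[h].add(doc_id)

    def query(self, text):
        """Return {page id: fraction of the query's shingles found on that page}."""
        query_shingles = shingles(text, self.k)
        if not query_shingles:
            return {}
        hits = defaultdict(int)
        for h in query_shingles:
            for doc_id in self.postings.get(h, ()):
                hits[doc_id] += 1
        return {doc_id: n / len(query_shingles) for doc_id, n in hits.items()}


# Toy corpus; in practice the pages would come from the stored crawl.
index = ShingleIndex(k=3)
index.add("page-a", "the quick brown fox jumps over the lazy dog")
index.add("page-b", "an entirely different sentence about web crawling")

scores = index.query("The quick brown fox jumps over a sleeping cat")
for page, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(page, round(score, 2))
```

The submitted text is never compared against every page directly. Only pages sharing at least one shingle hash show up as candidates, which is what makes a lookup against a very large stored crawl feasible. A real system would presumably add things this sketch leaves out, such as sampling the fingerprints, normalizing code tokens, and persisting the index.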