Quite apart from the expense I'll incur training this model, I'm struggling to figure out a way of finding enough definitive propaganda to begin with, given that it's such a subjective topic. Ideally I'd use a wide variety of propaganda types such as advertising, political propaganda, social media comments plugging a given product or service, and other categories of propaganda. Is anybody here on HN aware of any collections of propaganda I can easily ingest in a machine-readable form or, failing that, does anybody have ideas for how I could go about constructing one myself?
This is purely a personal project with no desire for monetisation; I'm just curious to know how much of my browsing actually consists of propaganda, given it's such a topic of discussion at the moment.
Machine learning models do a good job learning to classify texts by topic. You'd have no problem separating out texts on, say, abortion, electoral politics, quantum physics, and molecular biology. The basis for that is that different topics use different vocabulary.
If you threw a bunch of texts at your classifier, you'd probably end up conflating "topic" with "is propaganda." For instance, articles about abortion and smartphones might be intended to persuade, but articles about ham radio and Japanese animation might not be.
Machine learning models are famous for "cheating" by taking advantage of whatever features work for the training set without any knowledge that topic and "is propaganda" could or should be separated from each other.
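One way to check for that kind of cheating, sketched below in plain Python, is to look at how lopsided the labels are per topic and to evaluate on topics the model never saw in training. The tuple layout `(text, topic, is_propaganda)` and both function names are assumptions made up for illustration; they don't come from any particular library or dataset.

```python
from collections import defaultdict

def label_rate_by_topic(examples):
    """Return the fraction of examples labelled propaganda for each topic.

    `examples` is assumed to be a list of (text, topic, is_propaganda) tuples.
    Rates near 0.0 or 1.0 mean topic alone predicts the label, so a
    classifier can "cheat" by learning topic vocabulary.
    """
    counts = defaultdict(lambda: [0, 0])  # topic -> [propaganda count, total]
    for _text, topic, is_propaganda in examples:
        counts[topic][0] += int(is_propaganda)
        counts[topic][1] += 1
    return {topic: pos / total for topic, (pos, total) in counts.items()}

def split_by_topic(examples, held_out_topics):
    """Put every example from `held_out_topics` in the test set.

    If accuracy collapses on unseen topics, the model probably learned
    topic rather than "is propaganda".
    """
    train, test = [], []
    for example in examples:
        (test if example[1] in held_out_topics else train).append(example)
    return train, test
```

If `label_rate_by_topic` shows, say, every abortion article labelled propaganda and every ham radio article labelled clean, the training set can't tell you which of the two things the model picked up.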
Get more involved in the data: draw up lists of scored adjectives and see if you can get a "favorable" to "pejorative" sentiment analysis on some samples that looks right. Then anything from random inputs that scores toward either extreme is more likely to be propaganda.
Of course, that only reflects your original scoring back at you, but that doesn't mean it has no value; you just have to be careful to separate that bias back out again.
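A rough sketch of the scored-adjective idea, in Python. The word list, the scores, the `threshold` and the `min_hits` cutoff are all invented for illustration; you would replace them with your own lists and tune them against samples you've read yourself.

```python
import re

# Hypothetical hand-scored adjectives:
# +1.0 is strongly favorable, -1.0 is strongly pejorative.
ADJECTIVE_SCORES = {
    "glorious": 1.0, "heroic": 0.9, "amazing": 0.8, "revolutionary": 0.7,
    "treacherous": -1.0, "corrupt": -0.9, "disgraceful": -0.8, "failed": -0.6,
}

def slant_score(text, lexicon=ADJECTIVE_SCORES, min_hits=2):
    """Mean score of the lexicon words found in `text`.

    Returns 0.0 when fewer than `min_hits` scored words appear, so a
    single loaded adjective doesn't push a text to an extreme.
    """
    words = re.findall(r"[a-z']+", text.lower())
    hits = [lexicon[w] for w in words if w in lexicon]
    if len(hits) < min_hits:
        return 0.0
    return sum(hits) / len(hits)

def flag_extremes(texts, threshold=0.7):
    """Return the texts whose slant score lies at either extreme."""
    return [t for t in texts if abs(slant_score(t)) >= threshold]
```

Because the score is a signed mean, a text that praises one side and trashes the other averages out toward zero, so it may be worth also looking at the favorable and pejorative hits separately.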
For actual data, look at the political and issue 'zines on textfiles.org. Usenet archives from misc.activism.* and so on could be good sources too.