HACKER Q&A
📣 backend-dev-33

What's the best framework for text classification (few-shot learning)?


I am looking for software to classify documents into 10-20 categories. The documents are about half a screen to a full screen long.

There is some labeled data: about 50-80 labeled documents per category (not 500 per category), so few-shot learning might be an option.

Algorithms: it could be something like k-nearest neighbors or some ML/neural network approach (transformers? an LLM?). I don't mind which, as long as it classifies the documents correctly.

Some restrictions: it should be a "ready to use" pipeline with documentation on training the model, parameter optimization, etc. If possible, there should be a way to use the framework/library without Python (I'm not a Python developer). For example, [1] and [2] allow using a command-line interface for everything, so Python seems to be optional for these frameworks. The SetFit framework (see [3] and [4]) looks quite promising (good results with 8 labeled samples per class!), but it requires doing everything in Python.

[1] https://fasttext.cc/docs/en/supervised-tutorial.html

[2] https://neuml.github.io/txtai/pipeline/text/labels/

[3] https://github.com/huggingface/setfit

[4] https://www.philschmid.de/getting-started-setfit
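
For the command-line route in [1], the main preparation step is getting the labeled documents into fastText's training format: one example per line, with the category prefixed by `__label__`. Below is a minimal sketch of that conversion. The helper names and the sample documents are made up for illustration, and replacing spaces in category names with underscores is an assumption about how multi-word categories should be handled.

```python
import re


def to_fasttext_line(label, text):
    """Format one labeled document as a fastText supervised training line."""
    # The label must be a single token, so whitespace in it becomes "_"
    # (an illustrative choice, not something fastText does for you).
    safe_label = re.sub(r"\s+", "_", label.strip())
    # Each example must sit on one line, so newlines and tabs are collapsed.
    flat_text = re.sub(r"\s+", " ", text).strip()
    return f"__label__{safe_label} {flat_text}"


def write_fasttext_file(path, examples):
    """Write (label, text) pairs to a file, one training example per line."""
    with open(path, "w", encoding="utf-8") as handle:
        for label, text in examples:
            handle.write(to_fasttext_line(label, text) + "\n")


# Hypothetical sample documents, only to show the output shape.
examples = [
    ("billing", "Please pay the\nattached invoice."),
    ("tech support", "The server  crashed again."),
]
lines = [to_fasttext_line(label, text) for label, text in examples]
for line in lines:
    print(line)
```

A file written this way could then be passed to the training command shown in the tutorial, e.g. `fasttext supervised -input train.txt -output model`, so the only scripted part would be this one-off conversion.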


  👤 txtai Accepted Answer ✓
SetFit is a great framework for building a text classifier.

This is a pretty straightforward problem and a good fit for a standard text classifier as well.
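
To make "standard text classifier" concrete, here is a rough baseline sketch using only the Python standard library: TF-IDF vectors, cosine similarity, and a k-nearest-neighbor vote. It is not what SetFit or txtai do internally. The class name, the tokenizer, and the toy documents are assumptions for illustration, and a fine-tuned transformer would likely do better on real data.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


class KnnTextClassifier:
    """Tiny TF-IDF + cosine-similarity k-nearest-neighbor classifier."""

    def __init__(self, k=3):
        self.k = k
        self.idf = {}
        self.vectors = []  # list of (unit-length tf-idf dict, label)

    def _vectorize(self, text):
        counts = Counter(tokenize(text))
        # Words never seen during fit() are dropped.
        vec = {w: c * self.idf[w] for w, c in counts.items() if w in self.idf}
        norm = math.sqrt(sum(v * v for v in vec.values()))
        return {w: v / norm for w, v in vec.items()} if norm else {}

    def fit(self, texts, labels):
        n = len(texts)
        doc_freq = Counter()
        for text in texts:
            doc_freq.update(set(tokenize(text)))
        # Smoothed inverse document frequency.
        self.idf = {w: math.log((1 + n) / (1 + df)) + 1 for w, df in doc_freq.items()}
        self.vectors = [(self._vectorize(t), y) for t, y in zip(texts, labels)]
        return self

    def predict(self, text):
        query = self._vectorize(text)
        # Cosine similarity is a plain dot product because vectors are unit length.
        scored = [
            (sum(weight * vec.get(w, 0.0) for w, weight in query.items()), label)
            for vec, label in self.vectors
        ]
        top = sorted(scored, key=lambda s: s[0], reverse=True)[: self.k]
        # Similarity-weighted vote among the k nearest labeled documents.
        votes = Counter()
        for score, label in top:
            votes[label] += score
        return votes.most_common(1)[0][0]


# Toy data standing in for the 50-80 labeled documents per category.
train_texts = [
    "invoice payment overdue amount due",
    "please pay the attached invoice",
    "server outage database connection error",
    "error log shows the server crashed",
]
train_labels = ["billing", "billing", "incident", "incident"]

clf = KnnTextClassifier(k=3).fit(train_texts, train_labels)
print(clf.predict("the invoice amount is wrong"))  # -> billing
```

With 50-80 documents per category, a baseline like this is cheap to run and gives a number to compare against before investing in fine-tuning.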

Here is an example of fine-tuning a model with txtai: https://colab.research.google.com/github/neuml/txtai/blob/ma...