HACKER Q&A
📣 jejeyyy77

Tools for managing large image datasets?


I have a large image dataset, around 100-500K images.

Do the ML folks have any recommendations on tools they use to manage/organize/annotate/view large image sets?

Things like:
- Visually and manually deleting bad images
- Applying mass cropping/letterboxing/resizing
- Annotating
- Tagging, etc.


  👤 PaulHoule Accepted Answer ✓
I have built a bunch of these for various projects over the past 20 years from sources like Flickr, Wikimedia Commons, online image galleries, “booru”, photos I take, etc. I have also used Lightroom heavily, as well as disciplined schemes for keeping them in a directory.

I just built one last week by cutting and pasting code out of my YOShInOn RSS reader, which is based on the PAX stack: Python, ArangoDB, HTMX. Fraxinus has a bookmark manager and a focused web crawler that knows the markup of (now) a handful of sites, so it can get images, metadata, text, links, etc. Queues in Fraxinus are much simpler than in YOShInOn, and there is no A.I. or M.L. in it yet. I am planning, however, to build a rather gold-plated ‘tagging’ system that will let tags be positive, negative or indeterminate, which would let an active learning system queue judgements on tags. I'd say that it also contains a 'personal data lake', in that crawled content goes into a repository which can be rapidly reprocessed when developing the data enrichment system.
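To make the three-state tagging idea concrete, here is a minimal sketch in plain Python. It is not Fraxinus code: the names `TagState`, `TagStore` and `review_queue` are hypothetical, the store is an in-memory dict rather than ArangoDB, and the `scores` argument stands in for whatever model probabilities an active learning system might supply. It only shows how undecided (image, tag) pairs could be queued for human judgement, most uncertain first.

```python
from dataclasses import dataclass, field
from enum import Enum


class TagState(Enum):
    POSITIVE = "positive"            # a human said the tag applies
    NEGATIVE = "negative"            # a human said the tag does not apply
    INDETERMINATE = "indeterminate"  # nobody has judged this pair yet


@dataclass
class TagStore:
    # Hypothetical in-memory store: maps (image_id, tag) -> TagState.
    # A missing key is treated as INDETERMINATE.
    judgements: dict = field(default_factory=dict)

    def set(self, image_id, tag, state):
        if state is TagState.INDETERMINATE:
            # Storing "indeterminate" is the same as forgetting the judgement.
            self.judgements.pop((image_id, tag), None)
        else:
            self.judgements[(image_id, tag)] = state

    def get(self, image_id, tag):
        return self.judgements.get((image_id, tag), TagState.INDETERMINATE)


def review_queue(store, image_ids, tag, scores):
    """Return images with no judgement for `tag`, most uncertain first.

    `scores` is an assumed dict of image_id -> probability in [0, 1] that
    the tag applies; images without a score are treated as 0.5.
    """
    pending = [i for i in image_ids if store.get(i, tag) is TagState.INDETERMINATE]
    # Uncertainty sampling: the closer a score is to 0.5, the earlier it is shown.
    return sorted(pending, key=lambda i: abs(scores.get(i, 0.5) - 0.5))


store = TagStore()
store.set("a", "cat", TagState.POSITIVE)
queue = review_queue(store, ["a", "b", "c", "d"], "cat", {"b": 0.9, "c": 0.55})
print(queue)  # ['d', 'c', 'b']
```

Keeping "negative" distinct from "no judgement yet" is the point of the design: a plain tag list cannot tell "not a cat" apart from "nobody has looked", and a learner needs both kinds of label.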

I’ve collected 55k images since Saturday, and I’d expect no trouble at 10x the size; I’ve built collections like this up to about 2M.