When I asked for help on Slack, Veera told me about Roboflow. It’s an online application that’s like a more professional version of the GitHub repos I’ve been using. However, the labelling would still be done by hand, and I really wanted to figure out a way to prep a dataset that would take less time. I’m just one human being, and now I completely understand why most of the large datasets were created with cheap offshore labour through services like Amazon Mechanical Turk (even the framing of this ‘global workforce’ has some serious issues). Timnit Gebru and other authors wrote a great article on the ‘hidden’ labour of AI, uncovering some of the farces of automation:
The Exploited Labor Behind Artificial Intelligence | NOEMA
If an image takes 5 minutes to annotate (which is faster than I currently manage), that’s 12 images an hour. ADE20K, one of the popular segmentation datasets, has around 30,000 annotated images. Even if we only made a dataset half that size, those 15,000 images would take 1,250 hours to prep. That’s 52 days of annotating around the clock. Madness, I need to figure out something better.
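Here’s that back-of-the-envelope estimate as a quick Python sketch (the 5-minute figure and the dataset size are just my rough guesses):

```python
# Rough estimate of how long hand-annotating a dataset would take.
MINUTES_PER_IMAGE = 5          # optimistic guess, faster than my current pace
TARGET_IMAGES = 30_000 // 2    # half the size of ADE20K (~30k images)

images_per_hour = 60 / MINUTES_PER_IMAGE       # 12 images/hour
total_hours = TARGET_IMAGES / images_per_hour  # 1,250 hours

print(f"{total_hours:.0f} hours ≈ {total_hours / 24:.0f} days of non-stop annotation")
# -> 1250 hours ≈ 52 days of non-stop annotation
```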
Obviously doing this alone is 💀
This is when I had a conversation with my dear friend George, and he reminded me of the power of synthetic datasets. I’ve been using Blender for a couple of years now, so I was like ‘yeah, let me try doing this’.
https://www.cgtrader.com/3d-models?keywords=slide&free=1
I mainly used Blendswap and chose files that have Creative Commons licenses.
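To give a sense of what the synthetic-data side looks like, here’s a minimal sketch of a Blender (bpy) render loop that opens one of those downloaded models and renders it from a few random camera angles. The file path, object names, and camera ranges are placeholders, not my actual setup:

```python
import math
import random
import bpy

# Open a downloaded .blend file (path is a placeholder)
bpy.ops.wm.open_mainfile(filepath="/path/to/slide.blend")

scene = bpy.context.scene
scene.render.resolution_x = 640
scene.render.resolution_y = 480

cam = bpy.data.objects["Camera"]    # camera in the opened scene
target = bpy.data.objects["Slide"]  # hypothetical name of the model's object

# Keep the camera pointed at the model no matter where we move it
track = cam.constraints.new(type='TRACK_TO')
track.target = target
track.track_axis = 'TRACK_NEGATIVE_Z'
track.up_axis = 'UP_Y'
scene.camera = cam

# Render the model from a handful of random viewpoints
for i in range(20):
    angle = random.uniform(0, 2 * math.pi)
    radius = random.uniform(6.0, 10.0)
    height = random.uniform(1.5, 4.0)
    cam.location = (radius * math.cos(angle), radius * math.sin(angle), height)

    scene.render.filepath = f"//renders/slide_{i:03d}.png"
    bpy.ops.render.render(write_still=True)
```

In practice you’d also want to randomize lighting and materials and render label masks alongside the RGB images, but this is the basic loop.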