11-30, 14:10–14:40 (Europe/Amsterdam), Auditorium
Want a dataset for ML? Internet says you should use ... active learning!
It's not a bad idea. When you're creating your own training data you typically want to focus on examples that can teach a machine learning algorithm the most. That's why active learning techniques typically fetch the examples with the lowest confidence scores to annotate first. The thinking is that the algorithm stands to learn more in low confidence regions than in regions where it already seems sure of itself.
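As a rough illustration of that querying strategy, here is a minimal uncertainty-sampling sketch. The dataset, model, and pool sizes are all stand-ins for illustration, not anything from the talk itself:

```python
# A minimal sketch of uncertainty sampling: score the unlabelled pool
# and send the least confident examples to the annotator first.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Hypothetical setup: a small labelled pool and a larger unlabelled pool.
X, y = make_classification(n_samples=1000, random_state=0)
X_lab, y_lab, X_unlab = X[:100], y[:100], X[100:]

model = LogisticRegression().fit(X_lab, y_lab)

# Confidence = probability of the predicted class. Low confidence means
# the model is unsure, so those examples are prioritised for annotation.
proba = model.predict_proba(X_unlab)
confidence = proba.max(axis=1)
query_order = np.argsort(confidence)  # least confident first

# The ten examples an active learner would hand to the annotator next.
to_annotate = X_unlab[query_order[:10]]
```

Whether this beats plain random sampling is exactly the kind of question the talk puts to the test.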
Again, it's not a bad idea. But it's an approach that can be improved by rethinking some parts. Maybe it would be better for the human to understand the mistakes that the model makes and use this information to actively teach the model how to improve.
This talk is all about exploring this idea.
In particular I hope to explain:
- How active learning can perform worse than random sampling.
- How active learning can benefit from streaming methods.
- How there are many techniques to find interesting subsets to prioritise.
- How the human could be learning by actively teaching the machine.
- How you can learn a lot from looking at a single label.
These points will all be supported by running live demos on stage.
Vincent D. Warmerdam is a software developer and senior data person. He currently works at Explosion on data quality tools for developers. He's also known for creating calmcode.io as well as a bunch of open source projects. You can check out his blog over at koaning.io to learn more about those.