Podcast.__init__

An Open Source Toolchain For Natural Language Processing From Explosion AI

Summary

The state of the art in natural language processing is a constantly moving target. With the rise of deep learning, previously cutting edge techniques have given way to robust language models. Through it all, the team at Explosion AI has built a strong presence with the trifecta of SpaCy, Thinc, and Prodigy, which support fast and flexible data labeling to feed deep learning models as well as performant and scalable text processing. In this episode, founder and open source author Matthew Honnibal shares his experience growing a business around cutting edge open source libraries for the machine learning development process.

Announcements

  • Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great.
  • When you’re ready to launch your next app or want to try a project you hear about on the show, you’ll need somewhere to deploy it, so take a look at our friends over at Linode. With 200 Gbit/s private networking, node balancers, a 40 Gbit/s public network, fast object storage, and a brand new managed Kubernetes platform, all controlled by a convenient API, you’ve got everything you need to scale up. And for your tasks that need fast computation, such as training machine learning models, they’ve got dedicated CPU and GPU instances. Go to pythonpodcast.com/linode to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
  • You listen to this show to learn and stay up to date with the ways that Python is being used, including the latest in machine learning and data analysis. For even more opportunities to meet, listen, and learn from your peers, you don’t want to miss out on great conferences. And now, the events are coming to you, with no travel necessary! We have partnered with organizations such as ODSC and Data Council. Upcoming events include the Observe 20/20 virtual conference on April 6th and ODSC East, which has also gone virtual, starting April 16th. Go to pythonpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
  • Your host, as usual, is Tobias Macey, and today I’m interviewing Matthew Honnibal about the Thinc and Prodigy tools and an update on SpaCy.

Interview

  • Introductions
  • How did you get introduced to Python?
  • Can you start by giving an overview of your mission at Explosion?
  • We spoke previously about your work on SpaCy. What has changed in the past 3 1/2 years?
    • How have recent innovations in language models such as BERT and GPT-2 influenced the direction or implementation of the project?
  • When I last looked, SpaCy only supported English and German, but you have added several new languages. What are the most challenging aspects of building the additional models?
    • What would be required for supporting symbolic or right-to-left languages?
  • How has the ecosystem for language processing in Python shifted or evolved since you first introduced SpaCy?
  • Another project that you have released is Prodigy to support labelling of datasets. Can you talk through the motivation for creating it and describe the workflow for someone using it?
    • What was lacking in the other annotation tools that you have worked with that you are trying to solve for in Prodigy?
  • What are some of the most challenging or problematic aspects of labelling data sets for use in machine learning projects?
    • What is a typical scale of data that can be reasonably handled by an individual or small team working with Prodigy?
      • At what point do you find that it makes sense to use a labeling service rather than generating the labels yourself?
  • Your most recent project is Thinc for building and using deep learning models. What was the motivation for creating it and what problem does it solve in the ecosystem?
    • How does its design and usage compare to other deep learning frameworks such as PyTorch and TensorFlow?
    • How does it compare to projects such as Keras that abstract across those frameworks?
  • How do the SpaCy, Prodigy, and Thinc libraries work together?
  • What are some of the biggest challenges that you are facing in building open source tools to meet the needs of data scientists and machine learning engineers?
  • What are some of the most interesting or impressive projects that you have seen built with the tools your team is creating?
  • What do you have planned for the future of Explosion, SpaCy, Prodigy, and Thinc?

Keep In Touch

Picks

Closing Announcements

  • Thank you for listening! Don’t forget to check out our other show, the Data Engineering Podcast for the latest on modern data management.
  • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
  • If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@podcastinit.com with your story.
  • To help other people find the show, please leave a review on iTunes and tell your friends and co-workers.
  • Join the community in the new Zulip chat workspace at pythonpodcast.com/chat

Links

The intro and outro music is from Requiem for a Fish by The Freak Fandango Orchestra / CC BY-SA

Episode source