Natural language processing is more approachable than ever
Natural language processing (NLP) is everywhere lately, with OpenAI’s GPT-3 generating as much hype as we’ve ever seen from a single model.
As I’ve written about before, the flood of projects being built on GPT-3 is down not just to its computational power, but to its accessibility. The fact that it was released as a simple API means that anyone who can query an endpoint can use state-of-the-art machine learning.
Experiencing machine learning as “just another web service” has opened the eyes of many engineers, who previously experienced machine learning as an arcane, unapproachable field.
Suddenly, machine learning is something you can build things with.
And while GPT-3 is an incredible accomplishment, it’s far from the only impressive language model in the world. If you’re interested in machine learning engineering, my goal in this article is to introduce you to a number of open source language models you can use today to build software, using some of the most popular ML applications in the world as examples.
Before we start, I need to give a little context to our approach.
What made GPT-3 so accessible to engineers was that to use it, you just queried an endpoint with some text, and it sent back a response. This on-demand, web service interface is called realtime inference.
In the case of GPT-3, the API was deployed for us by the team at OpenAI. However, deploying a model as an API on our own is fairly trivial, with the right tools.
We’re going to use two main tools in these examples. First is Hugging Face’s Transformers, a library that provides a very easy-to-use interface for working with popular language models. Second is Cortex, an open source machine learning engineering platform I maintain, designed to make it as easy as possible to put models into production.
To deploy a model as an API for realtime inference, we need to do three things.
First, we need to write the API. With Cortex’s Predictor interface, our API is just a Python class with an __init__() function, which initializes our model, and a predict() function, which does the predicting. It looks something like this:
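A minimal sketch of that shape follows. The class name PythonPredictor and the "text" payload key are assumptions for illustration, and a trivial stand-in takes the place of a real model so the sketch runs without any dependencies.

```python
class PythonPredictor:
    def __init__(self, config):
        # Runs once at startup: this is where a real model would be loaded.
        # A stand-in "model" that upper-cases text keeps the sketch runnable.
        self.model = lambda text: text.upper()

    def predict(self, payload):
        # Runs on every request: payload is the parsed request body.
        return self.model(payload["text"])
```

Calling PythonPredictor({}).predict({"text": "hello"}) exercises both methods the same way a web service would: one initialization, then one call per request.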
Cortex will then use this to create and deploy a web service. Under the hood, it’s doing a bunch of things with Docker, FastAPI, Kubernetes, and various AWS services, but you don’t have to worry about the underlying infrastructure (unless you want to).
One thing Cortex needs to turn this Python API into a web service, however, is a configuration file, which we write in YAML:
Nothing too crazy. We give our API a name, tell Cortex where to find the Python API, and allocate some compute resources, in this case, one CPU. You can configure in much more depth if you’d like, but this will suffice.
Then, we run $ cortex deploy using the Cortex CLI, and that’s it. Our model is now a functioning web service, a la GPT-3:
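Querying the deployed API is then an ordinary HTTP request. The sketch below builds and sends one with Python’s standard library. The endpoint URL in the closing comment and the "text" field are hypothetical placeholders; the real URL is whatever your own deployment reports.

```python
import json
import urllib.request


def build_request(endpoint, text):
    # Package the input text as a JSON POST request.
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def query(endpoint, text):
    # Send the request and return the raw response body as a string.
    with urllib.request.urlopen(build_request(endpoint, text)) as response:
        return response.read().decode("utf-8")


# Example call against a hypothetical local endpoint:
# query("http://localhost:8888/text-generator", "machine learning is")
```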
This is the general approach we will take to deployment throughout this list, though the emphasis will be on the models themselves, the tasks they’re suited to, and the projects you can build with them.
Smart Compose is responsible for those eerily-accurate email suggestions Gmail throws out while you type:
Even though Smart Compose is the result of huge budgets and engineering teams, you can build your own version in a couple of hours.
Architecting Smart Compose
Architecturally, Smart Compose is a straightforward example of realtime inference:
- As you type, Gmail pings a web service with the text of your email chain.
- The web service feeds the text to a model, predicting the next sequence.
- The web service delivers the predicted text back to the Gmail client.
The biggest technical challenge to Smart Compose is actually latency. Predicting a probable sequence of text is a fairly routine task in ML, but delivering a prediction as fast as someone types is much harder.
To build our own Smart Compose, we’ll need to select a model, deploy it as an API, and build some kind of text editor frontend to query the API, but I’ll leave that last part to you.
Building a text prediction API
Let’s start by picking a model. We need one that is accurate enough to generate good suggestions from potentially very little input. We also, however, need one that can serve predictions quickly.
Now, latency isn’t all about the model—the resources you allocate to your API (GPU vs. CPU, for example) play a major role—but the model itself is still important.
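Since latency depends on both the model and the hardware, it helps to measure it rather than guess. Below is a rough sketch of timing any predict() callable; measure_latency is a hypothetical helper, and it captures only prediction time, not network time.

```python
import statistics
import time


def measure_latency(predict, payload, runs=20):
    # Call predict() repeatedly and record each wall-clock duration in ms.
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        predict(payload)
        durations.append((time.perf_counter() - start) * 1000)
    return {
        "median_ms": statistics.median(durations),
        "max_ms": max(durations),
    }
```

Running it once on CPU and once on GPU with the same payload gives a direct comparison for the model you picked.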
There are a bunch of models capable of doing text generation, but for this task, we’re going to use DistilGPT-2, via Hugging Face’s Transformers library.
GPT-2 is, shockingly, the predecessor to GPT-3. Until GPT-3’s release, it was widely regarded as the best model for text generation. The tradeoff with GPT-2, however, is performance. It’s really big (around 6 GB) and, even with GPUs, can be slow in generating predictions. DistilGPT-2, as the name suggests, is a distilled version of GPT-2. It retains most of GPT-2’s accuracy while running roughly twice as fast (according to Hugging Face, it can run on an iPhone 7).
We can write a prediction API for DistilGPT-2 in barely 15 lines of Python:
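A sketch of what such a predictor might look like, assuming Hugging Face’s AutoTokenizer and AutoModelForCausalLM classes, a "text" field in the request, and a cap of 20 generated tokens. The optional tokenizer and model arguments are an addition for illustration, so the class can be exercised with stand-ins instead of downloading weights.

```python
class PythonPredictor:
    def __init__(self, config, tokenizer=None, model=None):
        # Declare the device; change "cpu" to "cuda" to run on a GPU.
        self.device = "cpu"
        if tokenizer is None or model is None:
            # Imported lazily so the class can be tested without the library.
            from transformers import AutoModelForCausalLM, AutoTokenizer

            tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
            model = AutoModelForCausalLM.from_pretrained("distilgpt2")
            model = model.to(self.device)
        self.tokenizer = tokenizer
        self.model = model

    def predict(self, payload):
        # Encode the request text into token IDs on the chosen device.
        tokens = self.tokenizer.encode(payload["text"], return_tensors="pt")
        tokens = tokens.to(self.device)
        # Sample a continuation of at most 20 new tokens (assumed limit).
        output = self.model.generate(tokens, max_new_tokens=20, do_sample=True)
        # Decode the first (and only) generated sequence back into text.
        return self.tokenizer.decode(output[0], skip_special_tokens=True)
```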
Most of that should be intuitive.
We initialize our predictor in __init__(), wherein we declare our device (in this case a CPU, but you can change that to GPU), load our model into memory, and load our tokenizer. A tokenizer encodes text into tokens the model can understand, and decodes predictions into text we can understand.
The predict() function handles requests. It tokenizes our request, feeds it to the model, and returns a decoded prediction.
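To make the tokenizer’s role concrete, here is a toy word-level tokenizer. It is not the byte-pair encoding GPT-2 actually uses, and the ToyTokenizer class is purely illustrative, but the encode and decode round trip is the same idea.

```python
class ToyTokenizer:
    def __init__(self, corpus):
        # Assign an integer ID to each distinct word, in order of appearance.
        self.word_to_id = {}
        for word in corpus.split():
            self.word_to_id.setdefault(word, len(self.word_to_id))
        self.id_to_word = {i: w for w, i in self.word_to_id.items()}

    def encode(self, text):
        # Text -> list of integer token IDs (KeyError on unknown words).
        return [self.word_to_id[word] for word in text.split()]

    def decode(self, ids):
        # List of integer token IDs -> text.
        return " ".join(self.id_to_word[i] for i in ids)


tokenizer = ToyTokenizer("thanks for the update see you on monday")
ids = tokenizer.encode("see you on monday")
print(ids)                    # [4, 5, 6, 7]
print(tokenizer.decode(ids))  # see you on monday
```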
Once you deploy that API with Cortex, all you need to do is connect it to your frontend. Under the hood, that’s all Smart Compose is. A single text generator, deployed as an API. The rest is just normal web development.
Virtual assistants, from Siri to Alexa to Google Assistant, are ubiquitous. And while many actually rely on machine learning for multiple tasks (speech-to-text, voice recognition, text-to-speech, and more), they all have one core task in common: question answering.
For many, question answering is one of the more sci-fi-seeming ML tasks, because it fits our pop culture image of a robot that knows more about the world than we do. As it turns out, however, setting up a question answering model is relatively straightforward.
Architecting extractive question answering
There are a few different approaches to this task, but we’re going to focus on extractive question answering, in which a model answers questions by extracting the relevant span of text from a body of reference material (documentation, Wikipedia, etc.).
Our API will be a model trained for extractive question answering, initialized with a body of reference material. We’ll then send it inputs via the API, and return predictions.
For this example, I’ll use the Wikipedia article on machine learning.
Building an extractive question answering API
We aren’t going to be selecting a model at all this time. Instead, we’ll be using Hugging Face’s Pipeline, which allows us to download a pretrained model by specifying the task we want to accomplish, not the specific model.
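A sketch of such a predictor, assuming the reference text is supplied through a hypothetical "context" entry in config and the request carries a "question" field. The optional qa argument is an addition for illustration that lets a stand-in replace the downloaded model.

```python
class PythonPredictor:
    def __init__(self, config, qa=None):
        if qa is None:
            # Imported lazily so the class can be tested without the library.
            from transformers import pipeline

            # Name the task; the library picks a default pretrained model.
            qa = pipeline("question-answering")
        self.qa = qa
        # Reference material the answers are extracted from (assumed key).
        self.context = config["context"]

    def predict(self, payload):
        # The pipeline returns a dict holding the answer span and its score.
        result = self.qa(question=payload["question"], context=self.context)
        return result["answer"]
```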
That’s roughly 7 lines of Python to implement machine learning that, just a few years ago, you would have needed a team of researchers to develop.
Testing out the API, when I ping it with “What is machine learning?” it responds:
"the study of computer algorithms that improve automatically through experience."
Translation is an incredibly complicated task, and the fact that Google Translate is so reliable is a testament to how powerful production machine learning has become over the last decade.
And while Google Translate represents the pinnacle of machine translation in production, you can still build your own Google Translate without becoming an expert in the field.
Architecting language translation
To understand why translation is such a difficult task to model computationally, think about what constitutes a correct translation.
A phrase can be translated into any number of equivalent sentences in another language, all of which could be “correct,” but some of which would sound better to a native speaker.
These phrases wouldn’t sound better because they were more grammatically correct; they would sound better because they agreed with a wide variety of implicit rules, patterns, and trends in the language, all of which are fluid and change constantly.
The best approach to modeling this complexity is called sequence-to-sequence learning. Writing a primer on sequence-to-sequence learning is beyond the scope of this article, but if you’re interested, I’ve written an article about how it is used in both Google Translate and, oddly enough, drug development.
We need to find a sequence-to-sequence model pretrained for translations between two languages, and deploy it as an API.
Building a language translation API
For this task, we can again use Hugging Face’s Transformers pipeline to initialize a model fine-tuned for the exact language translation we need. I’ll be using an English-to-German model here.
The code is very similar to before. We just load the model through the pipeline and serve the request:
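A sketch of such a predictor, assuming the "translation_en_to_de" pipeline task and a "text" field in the request. The optional translator argument is an addition for illustration that lets a stand-in replace the downloaded model.

```python
class PythonPredictor:
    def __init__(self, config, translator=None):
        if translator is None:
            # Imported lazily so the class can be tested without the library.
            from transformers import pipeline

            # The task name selects a default English-to-German model.
            translator = pipeline("translation_en_to_de")
        self.translator = translator

    def predict(self, payload):
        # The pipeline returns one dict per input string.
        results = self.translator(payload["text"])
        return results[0]["translation_text"]
```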
Now, you can ping that API with any English text, and it will respond with a German translation. For other languages, you can simply load a different model (there are many available through Hugging Face’s library).
All of these products are developed by tech giants, because for years, they’ve been the only ones able to build them. This is no longer the case.
Every one of these examples has implemented state-of-the-art machine learning in fewer than 20 lines of Python. Any engineer can build them.
A natural objection here would be that we only used pretrained models, and that to build a “real” product, you’d need to develop a new model from scratch. This is a common line of thinking, but doesn’t square with the reality of the field.
For example, AI Dungeon is a dungeon explorer built on machine learning. The game went viral last year, quickly racking up over 1,000,000 players, and it is still one of the most popular examples of ML text generation.
While the game has recently transitioned to using GPT-3, it was originally “just” a fine-tuned GPT-2 model. The creator scraped text from a choose-your-own-adventure site, used the gpt-2-simple library to fine-tune GPT-2 with the text, and then deployed it as an API with Cortex.
You don’t need Google’s budget. You don’t need early access to the GPT-3 API. You don’t need a PhD in computer science. If you know how to write code, you can build state-of-the-art machine learning applications right now.