
Why your AI Code Completion tool needs to Fill in the Middle


Code Completion Models

Large language models have been trained on billions of bytes of data to perform exactly one task extremely well: given the preceding N characters, predict the next one. The driving force behind the AI revolution we're currently experiencing is that being able to predict the next character with high accuracy is an incredible superpower. It allows you to build chatbots like Bing and ChatGPT, copywriting assistants like Jasper, and code completion tools like Codeium and Copilot.
The models powering code completion tools know how to complete entire functions just from their signatures:
[Image: a full function body generated from just the signature]
They can see your imports and predict what task you're trying to complete:
[Image: a completion that infers the task from the file's import statements]
But there's a problem: the model only knows about the code before your cursor. What about everything that's after? The existing code there can be incredibly useful when programming, providing information about potential functions to call, coding practices to emulate, and approaches to take.
So, what's the solution? Enter Fill in the Middle (FIM). Introduced by OpenAI in a 2022 paper, FIM is an under-discussed technique that allows language models to incorporate the context that comes after the cursor during training.

How Fill-in-the-Middle works

It's quite simple: let's say we have a training example that looks like this:
The quick brown fox jumps over a lazy dog
and we want the model to learn to predict the middle text jumps over from the prefix The quick brown fox and the suffix a lazy dog. First, we make two cuts to separate these sections, introducing new tokens <PRE>, <MID>, <SUF>, and <EOM> (end of middle):
<PRE>The quick brown fox <MID>jumps over<EOM><SUF> a lazy dog
Then we simply transpose the middle and suffix:
<PRE>The quick brown fox <SUF> a lazy dog<MID>jumps over<EOM>
Now, we train exactly like we did before, predicting the following text jumps over<EOM> from the earlier text <PRE>The quick brown fox <SUF> a lazy dog<MID>. The model automatically learns the meaning of the special tokens and learns that it is expected to generate text that makes sense after the prefix but before the suffix!
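To make the transformation concrete, here is a minimal sketch in Python. It operates on characters for readability; a real pipeline works on token IDs, and the cut positions here are chosen by hand to reproduce the example above:

```python
PRE, MID, SUF, EOM = "<PRE>", "<MID>", "<SUF>", "<EOM>"

def make_fim_example(document: str, cut1: int, cut2: int) -> tuple[str, str]:
    """Split a document at two cut points and rearrange it into a FIM
    training pair: (context the model sees, text it should predict)."""
    prefix, middle, suffix = document[:cut1], document[cut1:cut2], document[cut2:]
    # Transpose the middle and suffix: the prefix and suffix become the
    # context, and the middle becomes the target, terminated by <EOM>.
    return f"{PRE}{prefix}{SUF}{suffix}{MID}", f"{middle}{EOM}"

context, target = make_fim_example("The quick brown fox jumps over a lazy dog", 20, 30)
print(context)  # <PRE>The quick brown fox <SUF> a lazy dog<MID>
print(target)   # jumps over<EOM>
```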
At inference time, if we're trying to infill a document like the following:
[Image: a document with the cursor sitting between existing code above and below it]
we can present it as
[Image: the same document rearranged as <PRE>code before the cursor<SUF>code after the cursor<MID>]
to the model and request characters until the model emits an <EOM> token, at which point it has successfully joined the prefix with the suffix.
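In code, the serving loop is a simple generate-until-stop pattern. The sketch below assumes a hypothetical generate_next_token function standing in for whatever decoding call your inference stack actually provides:

```python
def infill(prefix: str, suffix: str, max_tokens: int = 256) -> str:
    """Generate the missing middle between prefix and suffix, sampling
    tokens until the model emits <EOM> or the budget runs out."""
    prompt = f"<PRE>{prefix}<SUF>{suffix}<MID>"
    middle = []
    for _ in range(max_tokens):
        # generate_next_token is a placeholder for a real decoding call.
        token = generate_next_token(prompt + "".join(middle))
        if token == "<EOM>":
            break  # the model has joined the prefix to the suffix
        middle.append(token)
    return "".join(middle)
```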

FIM vs non-FIM models

With FIM, we can greatly improve the accuracy of code completion tools by providing context to the model that would otherwise be missing. Let's see some examples comparing two different code autocomplete tools, Codeium and Tabnine Pro.
Codeium is a free code completion product used by tens of thousands of developers around the world. Codeium's enterprise offering allows customers to self-host it in their virtual private cloud or on-premises, ensuring that no data leaves the company. Tabnine is an AI code assistant that also offers self-hosting for enterprises.
Here are the suggestions each tool produces for the same prompt. Codeium, on the left, uses a FIM model that can see the usage of the distance function below the cursor and infers that it is supposed to compute the edit distance between a and b. Tabnine Pro, on the right, did not appear to use FIM at the time of writing, and gives a weaker suggestion as a result.
[Screenshot: Codeium's suggestion]
[Screenshot: Tabnine Pro's suggestion]
In this Golang code, Codeium understands that it needs to initialize the messages channel, while Tabnine just outputs Hello World:
[Screenshot: Codeium's suggestion]
[Screenshot: Tabnine Pro's suggestion]
Codeium can even generate an accurate docstring for an already-implemented function:
[Screenshot: Codeium's suggestion]
[Screenshot: Tabnine Pro's suggestion]

Conclusion

Software engineering is rarely a linear task: programs are usually not written in one shot from start to finish. Most day-to-day programming involves adding functionality, refactoring code, and fixing bugs—all tasks that benefit greatly from context after the cursor.
It should be no surprise, then, that code completion models trained with FIM capabilities easily outperform simple left-to-right models. Indeed, when we deployed FIM for all Codeium users, we saw large increases in our acceptance rates and user satisfaction.
Off-the-shelf code completion models like Salesforce Codegen (which powers FauxPilot) have not been trained with FIM, so code completion tools that want to use FIM need to train their own models. This is harder than it may seem—there are some subtleties involved in choosing where to cut the document and in ensuring that your model's left-to-right performance does not suffer.
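As a flavor of those subtleties: the OpenAI paper applies the FIM transformation to only a fraction of training documents (the "FIM rate") and cuts at uniformly random positions, so the model also keeps its plain left-to-right ability. A minimal sketch, with the FIM_RATE value chosen for illustration rather than taken from any production setup:

```python
import random

FIM_RATE = 0.5  # illustrative fraction of documents to transform

def prepare_example(document: str) -> tuple[str, str]:
    """With probability FIM_RATE, rearrange a document into a FIM example;
    otherwise keep it as a plain left-to-right example."""
    if random.random() >= FIM_RATE:
        return "", document  # ordinary next-token prediction, unchanged
    # Two uniformly random character-level cuts; real pipelines often
    # cut in token space and tune where the cuts are allowed to land.
    cut1, cut2 = sorted(random.sample(range(len(document) + 1), 2))
    prefix, middle, suffix = document[:cut1], document[cut1:cut2], document[cut2:]
    return f"<PRE>{prefix}<SUF>{suffix}<MID>", f"{middle}<EOM>"
```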
If you'd like to try out Codeium's FIM code completion model, head over to our playground or try us out in your IDE of choice.
