The two weeks leading up to the deadline were hectic. Though officially some of the team still had desks in Building 1945, they mostly worked in Building 1965 because its micro-kitchen had a better espresso machine. “People weren't sleeping,” Gomez says. As the intern, he was in a constant debugging frenzy and also produced the visualizations and diagrams for the paper. Projects like this typically involve ablations: taking things out to see whether what remains is enough to get the job done.
“We had every possible combination of tricks and modules, and we learned which were useful and which were not. Let's rip this out. Let's replace this with that,” says Gomez. “Why is the model behaving in this counterintuitive way? Oh, it's because we forgot to do the masking properly. Does it work yet? OK, move on to the next. All of these components of the design were the output of this very fast-paced, iterative trial-and-error process.” The ablations, aided by Shazeer's implementations, produced something “minimalistic,” Jones says. “Noam is a wizard.”
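The masking Gomez mentions is presumably the causal mask in the decoder's self-attention: without it, a position can attend to the tokens it is supposed to predict, and the model behaves in exactly the counterintuitive ways he describes. Below is a minimal, hypothetical NumPy sketch of scaled dot-product attention with that mask toggled on and off, ablation-style; the function name and parameters are mine for illustration, not the team's code.

```python
import numpy as np

def scaled_dot_product_attention(q, k, v, causal=True):
    """q, k, v: arrays of shape (seq_len, d_k). Illustrative only."""
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)  # (seq_len, seq_len) similarity scores
    if causal:
        # Upper-triangular positions are the "future"; set them to -inf
        # so they receive zero weight after the softmax.
        future = np.triu(np.ones_like(scores, dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)
    # Row-wise softmax (max-subtracted for numerical stability).
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

# Tiny ablation-style check: run with and without the mask and compare.
rng = np.random.default_rng(0)
q = k = v = rng.standard_normal((4, 8))
masked = scaled_dot_product_attention(q, k, v, causal=True)
unmasked = scaled_dot_product_attention(q, k, v, causal=False)
print(np.allclose(masked, unmasked))  # False: the mask changes the output
```

Flipping one flag like this and rerunning is the shape of the trial-and-error loop Gomez describes: rip a component out, see whether the numbers still hold up, move on.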
Vaswani recalled crashing onto the couch in his office one night while the team was writing the paper. He stared at the curtain that separated the couch from the rest of the room, struck by the pattern on the fabric, which looked to him like synapses and neurons. Gomez was there, and Vaswani told him that what they were working on would go beyond machine translation. “Ultimately, as with the human brain, you need to unite all the modalities, such as speech, audio, and vision, under a single architecture,” he says. “We had a strong hunch that we were onto something more general.”
But at Google's upper echelons, the research was viewed as just another interesting AI project. I asked several of the Transformer folks whether their bosses had ever called them in for an update on the project. Not so much. But “we knew this could potentially be a pretty big deal,” Uszkoreit says. “And that made us really fixate on a sentence at the end of the paper commenting on future work.”
That sentence anticipated what would come next: applying the Transformer model to essentially all forms of human expression. “We are excited about the future of attention-based models,” they wrote. “We plan to extend the Transformer to problems involving input and output modalities other than text” and to investigate “images, audio and video.”
A few nights before the deadline, Uszkoreit realized they needed a title. Jones noted that the team had landed on a radical rejection of accepted best practices, most notably LSTMs, in favor of one technique: attention. The Beatles, Jones recalled, had named a song “All You Need Is Love.” Why not call the paper “Attention Is All You Need”?
The Beatles?
“I’m British,” Jones says. “It literally took five seconds of thought. I didn't think they would use it.”
They continued to collect experimental results right up to the deadline. “The English-French numbers came about five minutes before we submitted the paper,” Parmar says. “I was sitting in the micro-kitchen in Building 1965, getting that last number in.” With barely two minutes to spare, they sent off the paper.