Google has published a research paper on a new technology called Infini-attention. The technology can process massive amounts of data with “infinitely long contexts,” and it can also be easily inserted into other models to significantly improve their capabilities.
That last part should be of interest to anyone who follows Google's algorithms. Infini-attention is plug-and-play, which means it can be inserted relatively easily into other models, including those used by Google's core algorithm. The part about “infinitely long contexts” may have implications for how some of Google's search systems work.
The name of the research paper is: Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention
Memory is computationally expensive for LLMs
Large language models (LLMs) are limited in how much data they can process at one time, because longer inputs significantly increase computational complexity and memory usage. Infini-attention gives an LLM the ability to handle longer contexts while keeping memory and processing requirements down.
The research paper explains:
“Memory serves as a cornerstone of intelligence, as it enables efficient computations tailored to specific contexts. However, Transformers and Transformer-based LLMs have a constrained, context-dependent memory, due to the nature of the attention mechanism.
Indeed, scaling LLMs to longer sequences (i.e. 1M tokens) is challenging with the standard Transformer architectures, and serving longer and longer context models becomes costly financially.”
Elsewhere, the research paper explains:
“Current transformer models are limited in their ability to handle long sequences due to quadratic increases in computational and memory costs. Infini-attention aims to address this scalability issue.”
The researchers hypothesized that Infini-attention could be scaled to handle very long sequences using Transformers without requiring the usual increase in computational or memory resources.
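To see where that cost comes from, here is a minimal sketch (not the paper's code) of standard scaled dot-product attention: the score matrix it builds has one entry for every pair of tokens, so compute and memory grow quadratically with sequence length. The sizes used here are illustrative assumptions.

```python
# Illustrative only: why standard attention cost grows quadratically with
# sequence length. The n x n score matrix is the bottleneck the paper targets.
import numpy as np

def standard_attention(Q, K, V):
    """Vanilla scaled dot-product attention over a full sequence."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)              # shape (n, n): quadratic in n
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

n, d = 4096, 64                                # hypothetical sizes
Q = K = V = np.random.randn(n, d)
out = standard_attention(Q, K, V)
print(out.shape)                               # (4096, 64); the score matrix alone held n*n ≈ 16.8M entries
```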
Three important features
Google's Infini-attention allows Transformer-based LLMs to process long sequences without running into memory issues, and to use context from earlier in the sequence, not just data around the current point being processed. It addresses the shortcomings of the Transformer model by incorporating three features:
Features of Infini-attention:
- Compressed memory system
- Long-term linear attention
- Local masked attention
Compressed memory system
Infini-attention uses what is called a compressed memory system. As more data comes in (as part of a longer data sequence), the compressed memory system compresses some of the older information to reduce the amount of space required to store it.
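As a rough illustration of the idea, the sketch below folds each new segment's keys and values into a fixed-size associative matrix instead of storing them all, so the memory footprint stays constant no matter how long the sequence gets. The function names, segment sizes, and feature map here are assumptions for this example, not code from the paper.

```python
# Minimal sketch of a compressive memory update of the kind the paper describes:
# older segments are folded into a fixed-size associative matrix rather than kept
# as individual key/value pairs. All sizes and the sigma() choice are illustrative.
import numpy as np

def sigma(x):
    """Non-negative feature map (ELU + 1 is one common choice)."""
    return np.where(x > 0, x + 1.0, np.exp(x))

def update_memory(M, z, K_seg, V_seg):
    """Fold one segment's keys/values into the fixed-size memory matrix."""
    M = M + sigma(K_seg).T @ V_seg        # (d_key, d_value), independent of sequence length
    z = z + sigma(K_seg).sum(axis=0)      # running normalization term
    return M, z

d_key, d_value, seg_len = 64, 64, 512
M = np.zeros((d_key, d_value))
z = np.zeros(d_key)
for _ in range(4):                         # four segments of a longer stream
    K_seg = np.random.randn(seg_len, d_key)
    V_seg = np.random.randn(seg_len, d_value)
    M, z = update_memory(M, z, K_seg, V_seg)
print(M.shape, z.shape)                    # memory stays (64, 64) no matter how many segments arrive
```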
Long-term linear attention
Infini-attention also uses what is called a “long-term linear attention mechanism.” This allows the LLM to use data that appears earlier in the sequence being processed, which preserves context. In this respect it differs from standard Transformer-based LLMs.
This matters for tasks where the relevant context lives in a much larger body of data. It is like being able to discuss an entire book and all of its chapters, and to explain how the first chapter relates to another chapter near the end.
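The sketch below shows, in the same simplified setting as the compressed-memory example above, how long-range context could be read back out of that fixed-size memory with a linear-attention-style lookup whose cost grows only linearly with the number of queries. The exact formulation is again an illustrative assumption, not the paper's code.

```python
# Sketch of reading long-range context back out of the compressed memory with a
# linear-attention style lookup. M and z stand in for an already-populated memory.
import numpy as np

def sigma(x):
    return np.where(x > 0, x + 1.0, np.exp(x))

def retrieve_from_memory(M, z, Q_seg):
    """Linear-attention readout: cost grows linearly with the query length."""
    q = sigma(Q_seg)                       # (seg_len, d_key), non-negative features
    return (q @ M) / (q @ z)[:, None]      # (seg_len, d_value)

d_key, d_value, seg_len = 64, 64, 512
M = np.random.randn(d_key, d_value)        # placeholder for an accumulated memory matrix
z = np.abs(np.random.randn(d_key)) + 1.0   # placeholder normalization term (kept positive)
Q_seg = np.random.randn(seg_len, d_key)
A_mem = retrieve_from_memory(M, z, Q_seg)
print(A_mem.shape)                         # (512, 64)
```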
Local masked attention
In addition to long-term attention, Infini-attention also uses what is called local masked attention. This kind of attention processes nearby (local) parts of the input data, which is useful for responses that depend on the closer parts of the data.
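For contrast, here is a minimal sketch of local masked (causal) attention over a single segment: each position can only attend to itself and earlier positions within that segment, which is the standard Transformer behavior kept for nearby context. Segment sizes are assumptions.

```python
# Sketch of local masked (causal) attention within the current segment: each
# position may only attend to itself and earlier positions in that segment.
import numpy as np

def local_masked_attention(Q, K, V):
    """Causal attention restricted to one segment of the input."""
    n, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)
    mask = np.triu(np.ones((n, n), dtype=bool), k=1)   # True above the diagonal
    scores = np.where(mask, -np.inf, scores)           # block attention to future tokens
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

seg_len, d = 512, 64
Q = K = V = np.random.randn(seg_len, d)
A_local = local_masked_attention(Q, K, V)
print(A_local.shape)                                   # (512, 64)
```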
Combining long-term and local attention solves the problem of Transformers being limited in how much input data they can remember and use for context.
The researchers explain:
“Infini-attention incorporates compressed memory into the vanilla attention mechanism and combines both a masked local attention mechanism and a long-term linear attention mechanism into a single Transformer block.”
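The paper describes a learned gate that blends the memory readout with local attention inside each block. The snippet below is a simplified sketch of that kind of gated combination; the scalar gate and the exact mixing rule are assumptions of this illustration, not the paper's implementation.

```python
# Sketch of merging the two attention outputs inside one block via a learned gate.
import numpy as np

def combine(A_mem, A_local, beta):
    """Gated blend of long-term (memory) and local attention outputs."""
    gate = 1.0 / (1.0 + np.exp(-beta))     # sigmoid of a scalar learned during training
    return gate * A_mem + (1.0 - gate) * A_local

seg_len, d = 512, 64
A_mem = np.random.randn(seg_len, d)         # stands in for the memory readout
A_local = np.random.randn(seg_len, d)       # stands in for the local attention output
beta = 0.3                                  # illustrative value; learned in the real model
A = combine(A_mem, A_local, beta)
print(A.shape)                              # (512, 64)
```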
Experiment and test results
Infini-attention was tested against comparison models on multiple benchmarks involving long input sequences, including long-context language modeling, passkey retrieval, and book summarization. Passkey retrieval is a test in which the language model must retrieve specific data from within a very long text sequence.
List of three tests:
- Long context language modeling
- Passkey test
- Book summarization
Long context language modeling and perplexity scoring
The researchers wrote that Infini-attention outperformed the baseline models, and that increasing the training sequence length led to further improvements in perplexity score. Perplexity is a metric that measures the performance of a language model, with lower scores indicating better performance.
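For reference, perplexity is simply the exponential of the average per-token negative log-likelihood, as the toy calculation below shows with made-up probabilities.

```python
# Perplexity is the exponential of the average per-token negative log-likelihood,
# so lower values mean the model is less "surprised" by the text. Toy numbers only.
import math

token_probs = [0.25, 0.10, 0.60, 0.33]      # model's probability for each true next token
nll = [-math.log(p) for p in token_probs]
perplexity = math.exp(sum(nll) / len(nll))
print(round(perplexity, 2))                 # ≈ 3.77; a better model (higher probabilities) scores lower
```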
The researchers shared their findings as follows:
“Infini-Transformer outperforms both Transformer-XL and Memorizing Transformers baselines while maintaining 114x less memory parameters than the Memorizing Transformer model with a vector retrieval-based KV memory with length of 65K at its 9th layer. Infini-Transformer outperforms Memorizing Transformers with memory length of 65K and achieves 114x compression ratio.
We further increased the training sequence length to 100K from 32K and trained the models on the Arxiv-math dataset. 100K training further decreased the perplexity score to 2.21 and 2.20 for Linear and Linear + Delta models.”
Passkey test
In passkey testing, a random number is hidden within a long text sequence, and the task requires the model to retrieve that hidden number. The passkey is hidden near the beginning, middle, or end of the long text. The model was able to solve passkey tests with inputs up to 1 million tokens long.
“A 1B LLM naturally scales to 1M sequence length and solves the passkey retrieval task when injected with Infini-attention. Infini-Transformers solved the passkey task with up to 1M context length when fine-tuned on 5K length inputs. We report token-level retrieval accuracy for passkeys hidden in a different part (start/middle/end) of long inputs with lengths 32K to 1M.”
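To make the passkey task concrete, the sketch below builds a hypothetical test prompt: a random passkey is buried at the start, middle, or end of a long stretch of filler text, and the model is then asked to repeat it. The filler sentences and wording here are illustrative, not the exact prompts used in the paper.

```python
# Illustrative construction of a passkey-retrieval test prompt.
import random

def build_passkey_prompt(total_filler_lines=1000, position="middle"):
    """Hide a random 5-digit passkey at a chosen spot inside long filler text."""
    passkey = str(random.randint(10000, 99999))
    filler = "The grass is green. The sky is blue. The sun is yellow."
    lines = [filler] * total_filler_lines
    insert_at = {"start": 0, "middle": len(lines) // 2, "end": len(lines)}[position]
    lines.insert(insert_at, f"The pass key is {passkey}. Remember it.")
    prompt = "\n".join(lines) + "\nWhat is the pass key?"
    return prompt, passkey

prompt, passkey = build_passkey_prompt(position="middle")
print(len(prompt.split()), "words; hidden passkey:", passkey)
```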
Book summarization test
Infini-attention also performed well on the book summarization test, outperforming top benchmarks and achieving new state-of-the-art (SOTA) performance levels.
The results are explained as follows.
“Finally, we show that after continuous pre-training and task fine-tuning, the 8B model with Infini-attention reaches new SOTA results on a 500K-long book summarization task.
…We further scaled our approach by continually pre-training an 8B LLM model for 30K steps with 8K input length. We then fine-tuned on a book summarization task, BookSum (Kryściński et al., 2021), where the goal is to generate a summary of an entire book's text.
Our model outperforms the previous best results and achieves a new SOTA on BookSum by processing the entire text of the book. …There is a clear trend showing that with more text provided as input from books, our Infini-Transformers improve their summarization performance metric.”
Impact of Infini-attention on SEO
Infini-attention is a breakthrough in modeling both long-range and short-range attention, and it does so with greater efficiency than previous models without Infini-attention. It also supports “plug-and-play continual pre-training and long-context adaptation.”
This means it can be easily integrated into existing models.
Finally, the “continual pre-training and long-context adaptation” makes it especially useful in scenarios where a model needs to be continuously trained on new data. This last part is particularly interesting because it could make the technology useful for applications in the back end of Google's search systems, especially those that need to analyze long sequences of information and understand how a part near the beginning of a sequence relates to another part near its end.
Other articles have focused on the “infinitely long inputs” this model is capable of, but where it becomes relevant to SEO is how that ability to handle huge inputs while “leaving no context behind” relates to search marketing, and how some of Google's systems might change if Google were to apply Infini-attention to its core algorithm.
Read the research paper:
Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention
Featured image by Shutterstock/JHVEPhoto