Like most in tech, I've been perpetually drowning in all of the recent news regarding LLMs. While all of this hype can be tedious, it's admittedly appropriate, as the entire domain is currently in a gold rush for innovation. One week an unforeseen challenge with LLMs is pointed out, the next week five workarounds have been cobbled together, and not even a month later the problem has a definitive solution and everyone's moved on.
Of course, this isn't true for every issue that arises. In Phillip Carter's article on The Difficulties of Building with LLMs, he discusses a few challenges that don't have an objective solution. These include the lack of best practices for prompt engineering, the unavoidable risk of prompt injections, and the concern we'll specifically be delving into in this post: the limitations of the LLM context window.
You might ask, what makes me qualified to evaluate this semi-complicated topic? Aside from a small hobby project and a day or so of research, honestly not much. I only recently bit the bullet and accepted that building with LLMs won't be a specialized niche handled by a handful of overachieving engineers, but rather an essential skill that all developers will need to stash in their toolkit to survive.
But, then again, less than 18 months ago, I made the same claim regarding smart contracts, so I digress.
What's a Context Window?
The amount of input an LLM can consume, as well as the amount of output it can produce, is finite. The combined size of the two is called the LLM's context window.
More technically speaking, the context window is the total number of tokens an LLM can process at inference time. The general rule of thumb for these tokens is that 100 of them represent around 75 words.
Therefore, when interacting with a model such as GPT-3.5, which has a context window of around 4,000 tokens, your input prompt, any additional data you might want to pass it, and the model's response are all confined to roughly 3,000 words combined.
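To make this concrete, here's a minimal sketch of checking whether a prompt will fit, using the tiktoken tokenizer library; the exact 4,096-token limit and the 500-token response budget are assumptions for illustration.

```python
import tiktoken

CONTEXT_WINDOW = 4096   # approximate GPT-3.5 limit (assumption for illustration)
RESPONSE_BUDGET = 500   # tokens we want to leave free for the model's answer

def fits_in_context(prompt: str, model: str = "gpt-3.5-turbo") -> bool:
    """Return True if the prompt leaves enough room for the response."""
    enc = tiktoken.encoding_for_model(model)
    prompt_tokens = len(enc.encode(prompt))
    return prompt_tokens + RESPONSE_BUDGET <= CONTEXT_WINDOW

print(fits_in_context("Summarize the following document: ..."))  # True for a short prompt
```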
This might not seem like an enormous deal at first, as 3,000 words is more than enough for the majority of trivial consumer use cases. However, when trying to unlock the full potential of LLMs for anything that's more state-of-the-art, this limitation can become a severe handicap.
Bigger is Better, Right?
"Everything that exceeds the bounds of moderation has an unstable foundation."
Lucius Seneca
As with most things in life, there's no straightforward answer here. At the time of writing this, OpenAI offers a flavor of GPT-4 with a 32k-token context window, and another startup, Anthropic, has an LLM with a 100k-token context window. While the benefits of these inflated context windows are consistently in the spotlight, their less glamorous cons aren't discussed nearly as often.
$$$
The first issue we run into is that the memory and time costs of inference with the attention mechanism of LLMs scale quadratically with the size of their context window. In plain English, this means that going from a 32k to a 64k context window doesn't make inference 2x as expensive, but 4x as expensive.
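To see what that quadratic scaling looks like, here's a back-of-the-envelope sketch that only considers the attention mechanism and ignores every other cost in the stack; the 4k baseline is an arbitrary reference point chosen for illustration.

```python
# Rough relative cost of attention, which grows with the square of the sequence length.
def relative_attention_cost(tokens: int, baseline: int = 4_000) -> float:
    return (tokens / baseline) ** 2

for n in (4_000, 32_000, 64_000, 100_000):
    print(f"{n:>7,} tokens -> {relative_attention_cost(n):,.0f}x the 4k baseline")
# 32k comes out to 64x, 64k to 256x, and 100k to 625x the 4k baseline.
```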
Interestingly, providers don't seem to pass this quadratic cost on directly in their pricing (if I had to speculate why they do this, I'd say it's likely to increase customer loyalty by offering a less confusing pricing model at the expense of reduced profit margins).

However, this is the weaker of my two critiques against using an inflated context window.
No Silver Bullets
The second issue with increasing the context window size is that it isn't as much of a magical silver bullet for existing LLM problems as you might think. In Chelsy Ma's article on whether large context windows are a trend, she points out that "expanding the context window alone contradicts Lev Vygotsky's Zone of Proximal Development (ZPD) theory." Now what on earth does that mean?
Ma goes on to explain that "according to ZPD theory, the key to bringing learners to the next level is to identify their zone (through prompt engineering LLMs) and scaffold with tailored instructions (fine-tuning LLMs). A teacher wouldn't give a student a 100-page book (long context) and ask the student to answer any questions (text generation). Instead, the right way is to instruct (fine-tuning) the student (LLM) to build the knowledge (model skills)."
This excerpt demonstrates two main points. The first is that LLMs cannot retain information from prompt to prompt, meaning that every time you prompt, the same context data needs to be provided to and relearned by the LLM all over again.
This is absurdly inefficient.
If you've interacted with ChatGPT before, this might be counterintuitive, as you might recall that it can remember details from the conversation history. However, this is just an illusion, as behind the scenes, OpenAI is feeding the whole conversation, or perhaps even an LLM-created summary of it, into the prompt every time you send a new message to the agent.
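Here's a simplified sketch of that pattern; `call_llm` is a hypothetical stand-in for whichever chat-completion API you're actually using, not a real SDK function.

```python
from typing import Dict, List

def call_llm(messages: List[Dict[str, str]]) -> str:
    """Stand-in for a real chat-completion API call (hypothetical helper)."""
    raise NotImplementedError("wire this up to your provider's chat API")

history = [{"role": "system", "content": "You are a helpful assistant."}]

def send_message(user_message: str) -> str:
    # Append the new message, then re-send the ENTIRE conversation so far.
    history.append({"role": "user", "content": user_message})
    reply = call_llm(history)   # the model re-reads every prior turn on every call
    history.append({"role": "assistant", "content": reply})
    return reply
```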
The second takeaway from Ma's excerpt is that by expanding the context window, all we're doing is keeping the LLM confined to its existing constraints, i.e. its "current understanding zone," while throwing more data at it. As Ma puts it, "You cannot count on an LLM to execute any tasks out of its zone."
Continuing down this path, treating an ever-larger context window as the be-all and end-all solution for LLMs interacting with substantial amounts of data, is pure laziness, and it will stifle future innovation if we're not cautious.
Therefore, we must abandon this mindset and expand our awareness of alternative solutions.
Improvise, Adapt, Overcome
"Complaining about a problem without posing a solution is called whining."
Theodore Roosevelt
When all is said and done, if you want to digest and analyze large amounts of data with LLMs, you have three main options:
Context Window
The first is what we've discussed up to this point in the article; find the LLM with the most sizable context window available and throw everything and the kitchen sink into its prompt. While I've been overly critical about this option and its pitfalls so far, I'd like to mention that there will still be niche use cases, though few and far between, where this will be the appropriate option.
Fine-Tuning
The second option is to exploit few-shot learning by fine-tuning the LLM on the large data set. The details of this option are still being researched, as academics are trying to determine how to keep models from overfitting on fine-tuned tasks. Unfortunately, even once this is solved, the option will still only be applicable in niche cases, due to the high cost of the fine-tuning process and the requirement that the fine-tuning data be immutable.
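For a rough sense of what that entails, here's a minimal sketch of preparing training data in the prompt/completion JSONL layout that hosted fine-tuning services commonly expect; the file name and example records are made up for illustration.

```python
import json

# Hypothetical examples distilled from the large data set we want the model to learn.
examples = [
    {"prompt": "Summarize our returns policy:", "completion": " Customers have 90 days to..."},
    {"prompt": "What warranty covers refurbished items?", "completion": " Refurbished items carry a..."},
]

# Write one JSON object per line, the JSONL layout most fine-tuning endpoints accept.
with open("fine_tune_data.jsonl", "w") as f:
    for example in examples:
        f.write(json.dumps(example) + "\n")
```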
External Search
The third option, which I'll broadly call External Search, is currently seen by some as just a hacky workaround to the context window problem. However, I believe it's not going anywhere anytime soon due to its cost-effectiveness and versatility in comparison to the other options.
External Search can be widely defined as empowering LLMs to search for and retrieve the exact information they need within a large data set without directly inspecting all of the data during every inference.
In his article, mentioned earlier in this post, Carter comments that this can currently be done by "using embeddings and praying to the dot product gods that whatever distance function you use to pluck a 'relevant subset' out of the embedding is actually relevant." However, this is one of the most basic and naive searches one can do.
The backbone of External Search is information retrieval, a massive problem space. So massive that this academic paper exists, whose sole purpose is to assemble the recent advances in information retrieval and provide a starting point for those who want to learn more about the domain.
Combine information retrieval with other recently developed LLM techniques, such as chunking data and chaining calls, and you end up with External Search.
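Here's a minimal sketch of that combination, assuming a hypothetical `embed()` helper that wraps whichever embedding model you choose: chunk the corpus, embed each chunk once up front, then at question time retrieve only the most relevant chunks and place those, rather than the entire corpus, into the prompt.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Stand-in for a real embedding model (hypothetical helper)."""
    raise NotImplementedError("wire this up to your embedding provider")

def chunk(document: str, size: int = 500) -> list[str]:
    """Naive fixed-size chunking by character count."""
    return [document[i:i + size] for i in range(0, len(document), size)]

def build_index(documents: list[str]) -> list[tuple[str, np.ndarray]]:
    """Embed every chunk once, ahead of time."""
    return [(c, embed(c)) for doc in documents for c in chunk(doc)]

def retrieve(question: str, index: list[tuple[str, np.ndarray]], k: int = 3) -> list[str]:
    """Score chunks by cosine similarity against the question and keep the top k."""
    q = embed(question)
    scored = sorted(
        index,
        key=lambda item: float(np.dot(q, item[1]) / (np.linalg.norm(q) * np.linalg.norm(item[1]))),
        reverse=True,
    )
    return [text for text, _ in scored[:k]]

def build_prompt(question: str, index: list[tuple[str, np.ndarray]]) -> str:
    # Only the retrieved chunks, not the whole corpus, go into the prompt.
    context = "\n\n".join(retrieve(question, index))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"
```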
And if you're still not sold, a Hacker News user by the name of binarymax provides this excellent perspective in this comment: "You can scan an entire book for the part you're looking for (a huge context window), or you can look it up in the index in the back (a good retriever). The latter is a better approach when you're serving production use cases. It's both faster and less expensive."
In Closing
As time progresses, models with million-token context windows and beyond will certainly emerge. However, until breakthroughs that mitigate the aforementioned issues occur, an enlarged context window should not be treated as a silver bullet.
Instead, ingenuity and innovation from both academics and engineers will be the alternative that prevails, far beyond the short term.