Transformer AI Models Explained: Why Can Your ChatGPT “Understand” Bahasa Malaysia?
It’s amusing to remember that, not long ago, if you asked your device to summarise a lengthy email, you would have received just the first two lines back. The device wasn’t much help at all. Today you can ask ChatGPT to “translate this ‘Manglish’ text into ‘proper’ English for my boss”, and it actually gets the overall feel and vibe of the writing, not just the exact words. How did so much change? The major difference is not simply ever-increasing amounts of data; it was the arrival of Transformer AI models. Don’t let the name fool you, though, it is much less complex than it sounds. The best way to explain all of this is over a nice cup of “teh tarik” at a “mamak”!
Wait, so what exactly is a Transformer?

Before 2017, most AI that attempted to comprehend language behaved like a patient but forgetful person. Envision reading a lengthy message from your aunt about her neighbour’s cat, then a recipe, and then suddenly, “can someone pick up my kids?”. By the time the AI reads the word “kids”, it has already forgotten the cat and the recipe. This was the dilemma with earlier AI models such as RNNs and LSTMs: they did not have prolonged attention spans. In 2017, Google released a paper called “Attention Is All You Need”, which introduced a new method of language comprehension called the Transformer architecture.
Rather than scanning text from left to right, Transformers look at all the words in a sentence at once. An internal mechanism called self-attention then determines which words carry the most weight for understanding the sentence as a whole. For instance, in the sentence “He gave me that old kayu cabinet from his tok kedai”, a Transformer does not think about the word “kayu” by itself; it connects it to “old cabinet”, and connects “he gave me” to “tok kedai”. In this sense, it builds relationships between terms. So when people talk about Transformer AI models, they really mean AI that finally has a proper grasp of context.
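To make that concrete, here is a minimal sketch of self-attention in Python using NumPy. The sentence and its embedding numbers are made up for illustration, and a real Transformer would use learned query, key and value projections; this just shows the “every word looks at every other word” idea.

```python
import numpy as np

def self_attention(X):
    """Scaled dot-product self-attention over a toy sentence.

    X: (seq_len, d) matrix, one row per word embedding.
    Here queries, keys and values are all just X itself (no learned
    projections), to keep the core idea visible.
    """
    d = X.shape[-1]
    scores = X @ X.T / np.sqrt(d)   # how much each word "looks at" every other word
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax: each row sums to 1
    return weights @ X              # each word becomes a blend of the whole sentence

# Toy 4-word "sentence" with made-up 3-dimensional embeddings.
sentence = np.array([
    [1.0, 0.0, 0.0],   # "he"
    [0.0, 1.0, 0.0],   # "gave"
    [0.9, 0.1, 0.0],   # "kayu"
    [0.0, 0.0, 1.0],   # "cabinet"
])
print(self_attention(sentence))
```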
How does attention actually work?
Now let’s make it really practical. You know when you’re in a meeting with five other people and you’re writing down just the important bits? You skip the “erms”, the side chats, and the people repeating themselves, and keep the key points. Multi-head attention works in a similar manner, but eight or more times simultaneously. One “head” may focus on grammar, another on names, another on emotion or tone. The model then weighs what each head found to interpret the sentence as a whole. That’s why GPT, Claude, or any other LLM can handle requests like “reword this email to sound more professional”, “turn this 10-page PDF into 3 bullet points for my boss”, or “explain cloud computing to me like I’m 15 years old”.
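Here is a rough sketch of the multi-head idea, building on the NumPy example above: split the embedding into slices, let each slice compute its own attention pattern, then stitch the results back together. Real models also learn a separate projection matrix per head, which is omitted here.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, num_heads=2):
    """Each head gets its own slice of every word's embedding and
    computes its own attention pattern over the sentence."""
    seq_len, d_model = X.shape
    d_head = d_model // num_heads
    outputs = []
    for h in range(num_heads):
        head = X[:, h * d_head:(h + 1) * d_head]   # this head's slice of every word
        scores = head @ head.T / np.sqrt(d_head)   # this head's own attention pattern
        outputs.append(softmax(scores) @ head)
    return np.concatenate(outputs, axis=-1)        # combine all the heads' views

X = np.random.rand(4, 8)   # 4 words, 8-dimensional embeddings
print(multi_head_attention(X, num_heads=2).shape)  # (4, 8)
```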
The model learned which portions of a sentence are critical from millions of examples. One thing people don’t consider is that, by itself, the Transformer does not know the order of words. If you toss the words into a bag, “Ali eats fish” and “Fish eats Ali” look the same, which causes confusion. To remedy this, the Transformer employs positional encoding, a method of attaching a position number to each word before it is processed, similar to a seat number in a cinema. You can look at all the seats at once and still know who sits where. If word order were unknown to the AI, it would conclude that “I love you” is the same as “you love me”. That would be problematic.
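For the curious, this is roughly what the sinusoidal positional encoding from the “Attention Is All You Need” paper looks like in code; the dimensions here are tiny just so the output is readable.

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Each position gets a unique pattern of sine/cosine values -
    the 'seat number' that is added to the word embedding before
    attention ever sees it."""
    positions = np.arange(seq_len)[:, None]             # 0, 1, 2, ...
    dims = np.arange(0, d_model, 2)[None, :]
    angles = positions / (10000 ** (dims / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)    # even dimensions use sine
    pe[:, 1::2] = np.cos(angles)    # odd dimensions use cosine
    return pe

# "Ali eats fish" and "Fish eats Ali" now produce different inputs,
# because each word's embedding is shifted by its seat number.
print(positional_encoding(seq_len=3, d_model=4))
```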
But how do Transformer AI models learn to write so well?

Every Transformer model begins as a blank slate; it knows nothing. During pretraining, the model is not yet asked to understand or produce language; it merely processes vast amounts of text (books, webpages, and articles) in order to learn to predict the next word of an input sentence. Billions of repetitions of this process teach it the patterns, grammar, facts, and biases within language, but at this stage it has not developed any real sense of how to follow instructions or produce useful responses.
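You can watch this next-word game for yourself. A minimal sketch, assuming the Hugging Face transformers library and the small public gpt2 checkpoint (the prompt is just an example):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# The pretraining objective in action: given some text, the model
# outputs a probability for every possible next token.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("Nasi lemak is my favourite", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits          # shape: (1, seq_len, vocab_size)

# Probabilities for whatever token comes right after the prompt.
next_token_probs = logits[0, -1].softmax(dim=-1)
top = next_token_probs.topk(5)
for prob, token_id in zip(top.values, top.indices):
    print(f"{tokenizer.decode(int(token_id))!r}: {prob:.3f}")
```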
Fine-tuning is the next stage in developing a Transformer language model. After pretraining is complete, the model is trained on a smaller, structured dataset of questions and answers or dialogue, which teaches it how to provide useful responses, follow instructions, and interact with users more naturally than just predicting the next word. Together, pretraining gives modern large language models (LLMs) their general knowledge, and fine-tuning gives that knowledge a practical application.
Transfer learning is an important part of this process. Once a Transformer has been trained in one knowledge domain, it can be quickly adapted to others. For example, a model trained mostly on English can be readily fine-tuned to perform well on Malay legal documents, even though it has never seen that type of data before. The model does not begin again from zero; it builds on its prior learning, much like how, if you can drive one vehicle, you can quickly learn to drive another.
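As a rough sketch of what that reuse looks like in practice, here is transfer learning in miniature, again assuming Hugging Face transformers and the gpt2 checkpoint. The two “Malay legal” strings and the hyperparameters are placeholders, not a real recipe:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Start from pretrained weights (all the "driving skill" is already
# there), then nudge them on a tiny domain-specific dataset.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")   # NOT from scratch
model.train()

examples = [  # pretend Malay legal text, purely illustrative
    "Klausa 5: Penyewa hendaklah membayar deposit sebelum berpindah masuk.",
    "Klausa 6: Perjanjian ini boleh ditamatkan dengan notis bertulis.",
]
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

for text in examples:
    batch = tokenizer(text, return_tensors="pt")
    # Same next-word objective as pretraining, just on the new domain.
    loss = model(**batch, labels=batch["input_ids"]).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```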
Why do bigger models sometimes feel smarter?
AI models often get hyped based on size, particularly the number of parameters. Parameters are like adjustment knobs that get twisted during training: the more of them you have, the more finely the model can be tuned. Think of a convenience store versus a shopping mall. The mall is big and powerful, but also slower and more expensive to run, while a small shop is fast and efficient within a very limited range. It turns out bigger is not always better; large, costly generative models will sometimes blurt out long-winded answers to direct questions, while a little model that costs you a coffee a day can answer better, especially if you care about saving tokens. So what about benchmarks? They matter: perplexity, BLEU and ROUGE scores, human scoring and others all help measure how well a model actually performs.
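As a toy illustration, perplexity is just the exponential of the model’s average “surprise” over a test text; the probabilities below are made up:

```python
import math

# probs = probability the model assigned to each actual next word.
probs = [0.5, 0.1, 0.25, 0.05]                    # made-up example values
avg_log_loss = -sum(math.log(p) for p in probs) / len(probs)
print(math.exp(avg_log_loss))  # ~6.3; lower means the model predicts the text better
```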
Many everyday NLP tasks, simple things like translation, work perfectly well with smaller purpose-built models. Where the big Transformers shine is long text: their self-attention can run across an entire document and back again, keeping track of what has already been said, so the text a generative AI produces stays connected from start to finish.
So does the AI actually understand what it’s saying?

AI doesn’t understand language like we do. When I tell you “Nasi Lemak is yum”, it conjures up images and memories: a smell, a taste, perhaps a favourite moment when you ate it. Not vectors and probability patterns. To you, the word “bank” means something; to models like ChatGPT, it is just a cluster of numbers.
Words are represented as contextual embeddings, which is just a way of saying that the “meaning” of a word depends on its surrounding words. Is it “bank” as in the side of a river, or a money bank? The model resolves this from patterns in its training data, not true understanding. This is why AI isn’t sentient, or even aware. It is one of the most advanced pattern recognisers ever built, yet it doesn’t even know what a river is; it has simply learnt which words tend to appear together around terms like “Nasi Lemak”, and it makes correlations and adjusts its scores accordingly.
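You can see contextual embeddings in action with a small sketch, assuming Hugging Face transformers and the bert-base-uncased checkpoint: the same word “bank” gets noticeably different vectors in the two sentences.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def embedding_of(sentence, word):
    """Return the contextual vector the model assigns to `word`
    inside this particular sentence."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]   # one vector per token
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
    return hidden[tokens.index(word)]

river = embedding_of("he sat on the bank of the river", "bank")
money = embedding_of("she deposited cash at the bank", "bank")

# Well below 1.0: the two "bank"s are not the same point in space.
print(torch.cosine_similarity(river, money, dim=0))
```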
The performance leap behind the latest round of “really amazing AI” comes down to the attention mechanism. It enables the model to learn relationships between words across a sentence, or indeed entire paragraphs, and that is what gives its responses this feeling of coherent context awareness. So when it summarises your notes, or even writes them from a prompt, there is nothing magical about it: the maths is now very powerful, and the system is very good at context. Which is more than enough for me.