Written by Venkatesh Ramamrat
Language is a defining characteristic of our species, and perhaps the most important one: it enabled us to coordinate with each other, leading to civilizations. It is through language that we formulate thoughts and communicate them to one another. Language lets us express abstract thoughts and turn them into complex ideas, and generation after generation we have built on and learned from these ideas, recorded in texts, books, and other media, furthering our knowledge across time and space. We have always sought to create technology that understands our language. This is the core philosophy of Natural Language Processing (NLP), one of the core technologies Wranga utilizes in its AI.
Some of the first ideas in the field of NLP date back to the 17th century, when Descartes and Leibniz proposed dictionaries of universal numerical codes for translating text between languages. Unambiguous universal languages based on logic and iconography were then developed by Cave Beck, Athanasius Kircher, and Johann Joachim Becher.
In 1957, Noam Chomsky published Syntactic Structures, a monograph considered one of the most significant studies in 20th-century linguistics. It constructed a formal linguistic framework, using phrase structure rules and syntax trees, to analyze English sentences. “Colorless green ideas sleep furiously”, a famous sentence constructed from these phrase structure rules, is grammatically correct but makes no sense at all.
Noam Chomsky constructed this sentence to illustrate that phrase structure rules can generate sentences that are syntactically correct but semantically meaningless. Phrase structure rules break sentences down into their constituent parts, and these constituents are often represented as tree structures. This formal, rule-based analysis of sentence structure laid the groundwork for modern NLP algorithms.
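The idea can be sketched in a few lines of Python. The phrase structure rules and part-of-speech tags below are a simplified, hand-written grammar just for Chomsky's sentence, not a general parser:

```python
# Phrase structure rules (illustrative): S -> NP VP, NP -> Adj Adj N, VP -> V Adv
rules = {
    "S":  ["NP", "VP"],
    "NP": ["Adj", "Adj", "N"],
    "VP": ["V", "Adv"],
}

# Part-of-speech tags for each word in the sentence
lexicon = {
    "colorless": "Adj", "green": "Adj", "ideas": "N",
    "sleep": "V", "furiously": "Adv",
}

def build_tree(symbol, words):
    """Recursively expand a symbol, consuming words left to right."""
    if symbol not in rules:              # terminal symbol: attach the next word
        return (symbol, words.pop(0))
    return (symbol, [build_tree(child, words) for child in rules[symbol]])

tree = build_tree("S", "colorless green ideas sleep furiously".split())
print(tree)
```

The output is a nested tuple mirroring the constituency tree: the sentence splits into a noun phrase ("colorless green ideas") and a verb phrase ("sleep furiously").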
Evolution of NLP
Neural networks and machine learning models take inputs in the form of numerical vectors, but language is not numerical, so word embeddings were created to map words into numerical vectors. Two important technologies for modeling these vectors have been the RNN and the LSTM:
Recurrent Neural Network (RNN)
Long Short-Term Memory (LSTM)
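The embedding idea above can be sketched with a count-based approach: each word's vector records which words it co-occurs with, so words used in similar contexts get similar vectors. The tiny corpus and window size below are illustrative; real embeddings are learned from enormous text collections:

```python
# Toy corpus; in practice embeddings are trained on huge text collections.
corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "the cat chased the dog",
]

window = 1  # co-occurrence window: one word on each side
vocab = sorted({w for line in corpus for w in line.split()})
index = {w: i for i, w in enumerate(vocab)}

# Each word's vector counts its neighbours within the window.
vectors = {w: [0] * len(vocab) for w in vocab}
for line in corpus:
    words = line.split()
    for i, w in enumerate(words):
        for j in range(max(0, i - window), min(len(words), i + window + 1)):
            if j != i:
                vectors[w][index[words[j]]] += 1

# "cat" and "dog" appear in similar contexts, so their vectors overlap.
print(vectors["cat"])
print(vectors["dog"])
```

An RNN or LSTM then consumes a sentence as a sequence of such vectors, one word at a time, carrying a hidden state forward.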
Readers keen to understand the more advanced technologies that were precursors to modern NLP should also study:
Hierarchical Attention Network (HAN)
Gated Recurrent Unit (GRU)
Over the last five years, NLP and machine learning in the field of AI have exploded with the invention of the Transformer, introduced by a team of Google researchers in 2017, and of Bidirectional Encoder Representations from Transformers (BERT), which built on it in 2018 and unleashed vast new possibilities in AI. The reason such large training datasets are possible is that transformers use self-supervised learning, meaning that they learn from unlabeled data. This is a crucial difference between today’s cutting-edge language AI models and the previous generation of NLP models, which had to be trained with labeled data. Today’s self-supervised models can train on far larger datasets than was ever previously possible: after all, there is many orders of magnitude more unlabeled than labeled text data in the world.
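The self-supervised trick can be sketched as follows: the labels are manufactured from the raw text itself by hiding some words and asking the model to predict them, in the style of masked language modeling. The sentence and masking rate below are illustrative:

```python
import random

def make_mlm_example(sentence, mask_rate=0.15, seed=0):
    """Turn one unlabeled sentence into a (masked input, targets) training pair:
    some words are replaced with [MASK], and the hidden words become the labels."""
    rng = random.Random(seed)
    words = sentence.split()
    masked, targets = [], {}
    for i, w in enumerate(words):
        if rng.random() < mask_rate:
            masked.append("[MASK]")
            targets[i] = w          # the hidden word is the label to predict
        else:
            masked.append(w)
    return " ".join(masked), targets

inp, labels = make_mlm_example("language models learn from unlabeled text",
                               mask_rate=0.5, seed=1)
print(inp)
print(labels)
```

No human annotation is needed: any pile of raw text can be turned into training pairs this way, which is why the datasets can be so large.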
Modern NLP Ecosystem
By definition, recurrent neural networks process data sequentially—that is, one word at a time, in the order that the words appear. Transformers’ great innovation is to make language processing parallelized, meaning that all the tokens in a given body of text are analyzed at the same time rather than in sequence.
To support this parallelization, transformers rely on an AI mechanism known as attention. Attention enables a model to consider the relationships between words, even if they are far apart in a text, and to determine which words and phrases in a passage are most important to “pay attention to.” Hence the title of the original Transformer paper: “Attention Is All You Need”.
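Scaled dot-product attention, the core operation, can be sketched in a few lines. This is a simplified single-query version with made-up numbers; real models use learned query, key, and value projections and many attention heads in parallel:

```python
import math

def attention(query, keys, values):
    """Scaled dot-product attention for one query vector:
    score each key against the query, softmax the scores into weights,
    and return the weighted sum of the value vectors."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    # softmax turns scores into positive weights that sum to 1
    exps = [math.exp(s - max(scores)) for s in scores]
    weights = [e / sum(exps) for e in exps]
    out = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(len(values[0]))]
    return out, weights

# Three "words", each with a 2-d key and value; the query matches the first key best.
keys = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
out, weights = attention([1.0, 0.0], keys, values)
print(out, weights)
```

Because every query attends to every key in one matrix operation, all tokens in a passage can be processed at the same time, which is what makes the architecture parallelizable.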
Training involves two basic phases:
Pre-Training: The model is trained on unlabeled data over different pre-training tasks. In this first phase, a tech giant creates and open-sources a large language model; for instance, Google’s BERT, Facebook’s RoBERTa, or most recently, China’s WuDao 2.0. Because they can be adapted to any number of specific end uses, these base models are referred to as “pre-trained.”
Fine-Tuning: The BERT model is first initialized with the pre-trained parameters, and all of the parameters are fine-tuned using labeled data from the downstream task. Downstream users like us at Wranga take these pre-trained models and refine them with a small amount of additional training data to optimize them for our specific use case or market.
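The two phases can be sketched with a toy example. The "pre-trained" word vectors and the sentiment labels below are invented for illustration: the frozen vectors stand in for phase one, and a tiny perceptron classifier fitted on a handful of labeled examples stands in for phase two:

```python
# Phase 1 stand-in: frozen "pre-trained" word vectors (invented numbers).
pretrained = {
    "great": [1.0, 0.2], "awful": [-1.0, 0.1],
    "fun":   [0.8, 0.3], "boring": [-0.7, 0.2],
}

def embed(text):
    """Average the frozen pre-trained vectors of the known words in the text."""
    vecs = [pretrained[w] for w in text.split() if w in pretrained]
    return [sum(col) / len(vecs) for col in zip(*vecs)]

# Phase 2: fine-tune a tiny classifier on a handful of labeled examples.
train = [("great fun", 1), ("awful boring", 0), ("great", 1), ("boring", 0)]
w, b = [0.0, 0.0], 0.0
for _ in range(20):                       # a few perceptron epochs
    for text, label in train:
        x = embed(text)
        pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
        if pred != label:                 # update weights only on mistakes
            sign = 1 if label == 1 else -1
            w = [wi + sign * xi for wi, xi in zip(w, x)]
            b += sign

def classify(text):
    x = embed(text)
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0

print(classify("great fun"), classify("awful"))
```

The expensive general-purpose representation is learned once; the cheap task-specific part is all that each downstream user retrains, which is why a small labeled dataset suffices.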
Reference: Center for Research on Foundation Models (CRFM), Stanford
Leading companies in NLP technology: Alphabet (BERT), Meta AI (RoBERTa), Spectrum Labs, OpenAI, Cohere, Hugging Face, AI21 Labs, Primer, Inflection AI, and Twelve Labs
Applications of NLP: Search, Writing Assistants, Language translation, Video Search/translation, Sales Intelligence, Chatbots, Employee Engagement, Voice Assistants, Contact Centres, Moderation
Challenges and Opportunities
Mastering language is what is known as an “AI-complete” problem: an AI that could truly understand language the way a human can would, by implication, be capable of any other human-level intellectual activity. Foundation models learn language by ingesting what humans have written online, and because that data is unlabeled, it includes plenty of negative behaviour (abuse, prejudice, violence, sexism, and more) which gets picked up from the vast training data.
At Wranga, having worked with large tech firms for over a decade, we have long been aware of these problems; we label and apply ethical safety checks to the data we process, as we aim to make the content we review suitable for children. This issue will only grow more acute as foundation models become increasingly influential in society. We believe that as these systems become more complex, the problems become harder to fix and more visible at scale, which is why we adopted such checks early on, and why our AI algorithms are trained toward more focused, cultural-context-driven outcomes.
Language is at the heart of human intelligence, and it therefore is, and must be, at the heart of our efforts to build artificial intelligence. No sophisticated AI can exist without mastery of language. At Wranga, focusing on the Indian subcontinent, we also work with languages whose roots are not Latin. This is an exciting challenge going forward: most NLP systems are built on English and a Western cultural context, the field is still nascent in India, and we look forward to finding solutions in the future.