Natural Language Processing (NLP) is about computers understanding and deriving meaning from human language. Framed this way, the challenge seems extremely difficult. The average 20-year-old native speaker of American English knows 42,000 words (ranging from 27,000 for the lowest 5% to 52,000 for the highest 5%) and thousands of grammatical constructions. We need this large body of linguistic knowledge to communicate in a professional context or to write books and articles, and we spend many years getting good at it. In everyday life, however, our needs are lower: a vocabulary of 3,000 words is enough to cover around 95% of common texts such as news items, blogs, and tweets, and to learn new words from context. This makes meaning extraction easier for computers, especially for "simple" tasks like summarization, relationship extraction, and topic segmentation.
Historically, research on NLP started with the IBM-Georgetown demonstration in 1954, which exhibited automatic translation (in a very rudimentary form) from Russian to English. From the 1940s through the 1960s, NLP researchers focused mainly on machine translation, and it is remarkable how much was achieved in grammar and lexicon building with very limited computational resources.
NLP started to grow much faster with the spread of the World Wide Web in the 1990s, and it boomed in the early 2010s when neural network models began to be adopted.
Nowadays, commercial NLP-based products exist in healthcare, media, finance, and human resources. In this article, we will cover the basic principles of NLP, or simply how to get a computer to understand unstructured text and extract data from it.
First, let's say a few words about the specifics of the data used in NLP. Textual data is generated from conversations, tweets, declarations, and posts and comments on Facebook; it is arbitrary and ambiguous. These pieces of machine-readable text are collected in a dataset called a text corpus (pl. corpora), the term usually used in linguistics and NLP. The content of a corpus is defined by the specification of the NLP task. For topic segmentation, corpora contain a balanced mix of texts from different topic areas: religion, science, science fiction, detective fiction, and so on. The Brown corpus is an example: it consists of 500 English text samples (about 1 million words) tagged with roughly 80 different tags. For machine translation, corpora contain texts in multiple languages. Another example is the Switchboard corpus, collected for speech recognition in telephone settings; it contains conversational speech, with matching transcripts, from speakers of both sexes and a wide range of dialects across the United States.
The first step in the pipeline is to split the working document into units. In text analysis, the unit is called a token: a string of symbols that represents text. Usually, one token corresponds to one word. Splitting into tokens starts with sentence segmentation. A machine can easily separate sentences using punctuation marks (modern NLP pipelines use more complex techniques for unformatted documents). Then we split apart words wherever there is a space or a punctuation mark between them; this is called tokenization.
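The two steps above can be sketched with plain regular expressions. This is a deliberately simplified illustration, not what production pipelines do; real tokenizers (such as NLTK's) handle abbreviations, contractions, and unformatted text with trained models.

```python
import re

def sentence_segment(text):
    # Split on ., !, or ? followed by whitespace (a simplified heuristic).
    return [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]

def tokenize(sentence):
    # A token is either a run of word characters or a single punctuation mark.
    return re.findall(r"\w+|[^\w\s]", sentence)

text = "NLP is fun. It helps computers read!"
for sentence in sentence_segment(text):
    print(tokenize(sentence))
```

Running this prints each sentence's tokens, with the punctuation marks kept as separate tokens.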
The next step is to predict the part of speech of each token and tag it. Parts of speech (e.g., verbs, nouns, adjectives) are important because they indicate how a word functions within the context of a sentence. NLTK is one of the most widely used tools for part-of-speech tagging.
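To show the idea without a trained model, here is a toy tagger built from a tiny hand-made lexicon plus a suffix fallback. The lexicon and tag set here are illustrative assumptions; NLTK's `pos_tag` does the real work with a statistical model trained on tagged corpora.

```python
# Tiny illustrative lexicon mapping words to Penn Treebank-style tags.
LEXICON = {"the": "DT", "a": "DT", "cat": "NN", "dog": "NN",
           "sat": "VBD", "on": "IN", "mat": "NN"}

def tag(tokens):
    tagged = []
    for tok in tokens:
        pos = LEXICON.get(tok.lower())
        if pos is None:
            # Crude fallback: "-ing" suggests a gerund, otherwise guess noun.
            pos = "VBG" if tok.endswith("ing") else "NN"
        tagged.append((tok, pos))
    return tagged

print(tag(["The", "cat", "sat", "on", "the", "mat"]))
```

A real tagger also uses the neighboring words to disambiguate (e.g., "book" as noun vs. verb), which a pure lexicon lookup cannot do.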
Since in most languages words appear in different forms (e.g., be, is, was, are, or apple, apples), it is very helpful to find the base form, or lemma, of each word in the sentence. This is called lemmatization, and it is often used to reduce the number of distinct word forms. Lemmatization can be performed with a table that maps words to their lemmas based on their parts of speech (WordNet provides such data for English).
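The table-based approach described above can be sketched in a few lines. The table here is a hand-made stand-in for a real resource like WordNet, and the part-of-speech labels are illustrative.

```python
# Minimal lemma table keyed by (word, part of speech).
# In practice this data would come from a resource like WordNet.
LEMMA_TABLE = {
    ("is", "VERB"): "be", ("was", "VERB"): "be", ("are", "VERB"): "be",
    ("apples", "NOUN"): "apple",
}

def lemmatize(word, pos):
    # Fall back to the (lowercased) word itself when it is not in the table.
    return LEMMA_TABLE.get((word.lower(), pos), word.lower())

print(lemmatize("was", "VERB"))     # -> be
print(lemmatize("apples", "NOUN"))  # -> apple
```

Note that the part of speech matters: lemmatizing "saw" as a verb should give "see", but as a noun it stays "saw", which is why tagging comes before lemmatization.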
There are also words you may want to filter out before statistical analysis, like "the", "and", and "is". These are called stop words, and because of their high frequency in texts they produce a lot of noise.
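Filtering stop words is a simple set-membership test. The stop-word list below is a small illustrative sample; libraries like NLTK ship curated lists of a hundred or more entries per language.

```python
# A tiny sample stop-word list; real lists are much longer.
STOP_WORDS = {"the", "and", "is", "a", "an", "of", "to", "in"}

def remove_stop_words(tokens):
    return [t for t in tokens if t.lower() not in STOP_WORDS]

print(remove_stop_words(["The", "movie", "is", "good"]))  # -> ['movie', 'good']
```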
These steps are the basis for extracting deeper levels of understanding from your data: morphological, syntactic, and semantic. Depending on the NLP task, one or several models are then used for further analysis. We will consider several of them.
Text classification is a very popular technique used for sentiment analysis of reviews, message polarity, reactions, and so on. For example, a simple classifier can compute the relative frequencies of positive words ("good", "awesome") and negative words ("bland", "horrible") in order to separate negative reviews from positive ones.
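The frequency-counting approach just described can be sketched as a lexicon-based classifier. The two word lists are tiny illustrative samples; real sentiment lexicons contain thousands of scored entries.

```python
# Illustrative sentiment lexicons (real ones are far larger).
POSITIVE = {"good", "awesome", "great"}
NEGATIVE = {"bland", "horrible", "bad"}

def sentiment(tokens):
    # Compare counts of positive vs. negative words in the review.
    pos = sum(t.lower() in POSITIVE for t in tokens)
    neg = sum(t.lower() in NEGATIVE for t in tokens)
    if pos > neg:
        return "positive"
    if neg > pos:
        return "negative"
    return "neutral"

print(sentiment("the food was awesome and the service was good".split()))  # -> positive
```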
The most commercially successful text classification model is the one used for spam filtering: a Naïve Bayes classifier uses a corpus of spam and non-spam emails to estimate, from word frequencies, the probability that a given word appears in spam.
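A minimal Naïve Bayes sketch, assuming a toy four-email corpus and equal class priors, looks like this. It compares the log-probability of a message's words under the spam model vs. the ham model, with Laplace smoothing so unseen words do not zero out a score.

```python
import math
from collections import Counter

# Toy training corpus (assumption: two spam and two non-spam emails).
spam = ["win money now", "free money offer"]
ham  = ["meeting at noon", "project update attached"]

def word_counts(docs):
    return Counter(w for d in docs for w in d.split())

spam_counts, ham_counts = word_counts(spam), word_counts(ham)
spam_total, ham_total = sum(spam_counts.values()), sum(ham_counts.values())
vocab = set(spam_counts) | set(ham_counts)

def log_likelihood(msg, counts, total):
    # Add-one (Laplace) smoothing avoids zero probability for unseen words.
    return sum(math.log((counts[w] + 1) / (total + len(vocab)))
               for w in msg.split())

def classify(msg):
    # Equal class priors assumed, so we compare likelihoods directly.
    spam_score = log_likelihood(msg, spam_counts, spam_total)
    ham_score = log_likelihood(msg, ham_counts, ham_total)
    return "spam" if spam_score > ham_score else "ham"

print(classify("free money"))  # -> spam
```

Despite its "naïve" independence assumption between words, this model works remarkably well on real email at scale.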
Effective language models should take into account the context in which words occur; such an approach can detect sentiment much better than simple word-counting classifiers. The "bag-of-words" technique evaluates the frequency with which words occur, alone and alongside other words, within a document. The model then explores relationships between documents that contain the same mixture of individual words. This can be very useful for topic segmentation, because texts from a specific discipline usually contain specific vocabulary.
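In the bag-of-words representation, each document becomes a vector of word counts over a shared vocabulary, so documents with the same mixture of words end up with similar vectors. A minimal sketch:

```python
from collections import Counter

docs = ["the cat sat on the mat", "the dog sat on the log"]

# Shared vocabulary over the whole corpus, in a fixed (sorted) order.
vocab = sorted({w for d in docs for w in d.split()})

def bag_of_words(doc):
    counts = Counter(doc.split())
    return [counts[w] for w in vocab]

vectors = [bag_of_words(d) for d in docs]
print(vocab)
print(vectors)
```

Note that word order is discarded entirely; only the mixture of words survives, which is exactly why the next refinement (n-grams) exists.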
The context in which words appear plays a huge role in conveying meaning. Extensions of the bag-of-words model consider co-occurrences of phrases in texts. This is called n-gram analysis (n is simply the number of consecutive words in each phrase to scan), and it is very effective at extracting meaning from texts.
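Extracting n-grams is just sliding a window of size n over the token list. With n = 2 (bigrams), a phrase like "new york" is kept together, which a plain bag of words would split apart:

```python
def ngrams(tokens, n):
    # Slide a window of size n across the token list.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "new york city is big".split()
print(ngrams(tokens, 2))
# -> [('new', 'york'), ('york', 'city'), ('city', 'is'), ('is', 'big')]
```

The bigram ('new', 'york') now appears as a single countable unit, so the model can distinguish "New York" from other uses of "new".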
Nowadays, neural networks play an increasingly significant role in text processing. Almost all of the big challenges of NLP are now addressed with neural networks: machine translation, summarization, paraphrasing, question answering, dialogue, and more. Alexa's speech recognition and Google Translate are just two examples of such applications. Still, existing NLP models have many limitations: it is very difficult for a machine to recognize sarcasm in sentiment analysis, and natural language changes over time (new words appear, and the meanings of existing words shift).