Python NLP

In this Python NLP lesson we are going to learn about tokenization. We will look at what tokenization is in Natural Language Processing (NLP) and what the different types of NLP tokenization are.



What is NLP Tokenization?

Tokenization is the process of splitting text into smaller parts, and each of these parts is called a token. It is one of the most important steps in natural language processing. There are two levels of tokenization: sentence-level tokenization and word-level tokenization.




1: Sentence Tokenization 

Using sentence tokenization we can split a text into sentences. In NLTK this is done with the sent_tokenize() function, which uses an instance of PunktSentenceTokenizer. The tokenizer behind sent_tokenize() is pre-trained, so it does not require training text and can tokenize straight away.



If you run the code this will be the result. You can see that the text is split into separate sentences.



We can also use the PunktSentenceTokenizer class directly. It is a good choice when we have huge chunks of data.





This tokenizer divides a text into a list of sentences by using an unsupervised algorithm to build a model of abbreviations. In other words, it is a trainable model that needs no labels, so it can be trained on unlabeled data.
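As a rough sketch, training PunktSentenceTokenizer on your own text can be done by passing that text to the constructor. The training text and the sample below are made up for illustration. In practice the training text would be a large unlabeled corpus from the same domain as the text you want to split, and the result depends on what the model learns from it.

```python
from nltk.tokenize import PunktSentenceTokenizer

# Illustrative training text. A real training set would be much larger.
train_text = (
    "Tokenization is the first step in many NLP pipelines. "
    "It breaks text into smaller units. "
    "Those units are then passed on to later steps."
)

# Passing text to the constructor trains the Punkt model on it.
tokenizer = PunktSentenceTokenizer(train_text)

# Illustrative text to split with the trained tokenizer.
sample = "This is the first sentence. Here is the second one. And a third."
sentences = tokenizer.tokenize(sample)

for sentence in sentences:
    print(sentence)
```

This version needs no downloaded data, because the model is built from train_text instead of being loaded from a pre-trained file.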




If you run the code this will be the result.




You can also tokenize sentences in languages other than English by using a different pickle file. In this example we are going to tokenize a Spanish text.




If you run the code this will be the result.




2: Word Tokenization 

We can do word tokenization using the word_tokenize() function. word_tokenize() is a wrapper around a Treebank-style word tokenizer from NLTK (TreebankWordTokenizer, or an improved variant of it in recent releases).



The simplest tokenizer in Python is the split() method of the string type. It is the most basic tokenizer and uses whitespace as the delimiter.
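A short illustration of whitespace tokenization with split(), using a made-up sample sentence:

```python
# Illustrative sample sentence.
sentence = "Hello world, this is tokenization."

# With no arguments, split() cuts the string on runs of whitespace.
words = sentence.split()
print(words)
# ['Hello', 'world,', 'this', 'is', 'tokenization.']
```

Note that punctuation stays attached to the neighbouring word ('world,' and 'tokenization.'), which is the main limitation of this approach.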




This is the result for the code. You can see that our sentence is split into separate words.



Now let’s use word_tokenize() from NLTK. This is the most commonly used tokenizer, and we can think of it as the default one.




This will be the result.




Regular Expression Tokenizer 

A RegexpTokenizer splits a string into substrings using a regular expression. Most of the other tokenizers can be derived from this tokenizer. You can also build a very specific tokenizer by using a different pattern.
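A small sketch of RegexpTokenizer with two patterns, \w+ and \d+, could look like this. The sample text and variable names are illustrative. Only the nltk package is assumed, since this tokenizer needs no downloaded data.

```python
from nltk.tokenize import RegexpTokenizer

# Illustrative sample text.
text = "Room 42 costs 150 dollars, room 7 costs 90."

# First part: keep every run of word characters.
word_tokenizer = RegexpTokenizer(r"\w+")
word_tokens = word_tokenizer.tokenize(text)
print(word_tokens)
# ['Room', '42', 'costs', '150', 'dollars', 'room', '7', 'costs', '90']

# Second part: keep only the runs of digits.
digit_tokenizer = RegexpTokenizer(r"\d+")
digit_tokens = digit_tokenizer.tokenize(text)
print(digit_tokens)
# ['42', '150', '7', '90']
```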


We have used \w+ as the regular expression, which means we keep every run of word characters (letters, digits and underscores) from the string, and all other symbols act as separators.



In the second part we specify \d+ as the regex, so the result contains only the runs of digits from the string.



If you run the code this is the result.