Python NLP Tokenization

In this Python NLP lesson we are going to learn about tokenization: what tokenization means in Natural Language Processing (NLP) and what the different types of NLP tokenization are.

What is NLP Tokenization?

Tokenization is the process of splitting text into smaller parts, and each of these smaller parts is called a token. It is one of the most important steps in natural language processing. There are two levels of tokenization: sentence-level tokenization and word-level tokenization.
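
As a rough illustration of the two levels, the sketch below uses only Python's standard re module. The splitting rules are deliberately naive assumptions made for this example (a sentence ends at ., ! or ? followed by white space); they are not how NLTK tokenizes.

```python
import re

text = "Tokenization splits text. It works at two levels."

# Sentence level: break after '.', '!' or '?' when white space follows (naive rule).
sentences = re.split(r"(?<=[.!?])\s+", text.strip())

# Word level: a token is a run of word characters or a single punctuation mark.
words = [re.findall(r"\w+|[^\w\s]", sentence) for sentence in sentences]

print(sentences)  # ['Tokenization splits text.', 'It works at two levels.']
print(words[0])   # ['Tokenization', 'splits', 'text', '.']
```

A rule this simple would wrongly break after an abbreviation such as "Dr.", which is one reason to prefer the NLTK tokenizers covered in the rest of this lesson.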

1: Sentence Tokenization

Using sentence tokenization we can split a text into sentences. This is done with the sent_tokenize() function, which uses an instance of PunktSentenceTokenizer. The tokenizer behind sent_tokenize() is pretrained, so it does not require training text and can tokenize straightaway.

from nltk.tokenize import sent_tokenize

mytext = "Hello friends. welcome to geekscoders.com. like the course "

print(sent_tokenize(mytext))

If you run the code, this will be the result. You can see that the text is split into separate sentences.

['Hello friends.', 'welcome to geekscoders.com.', 'like the course']

We can also use the PunktSentenceTokenizer class directly, which is a good choice when we have huge chunks of data.

PunktSentenceTokenizer

This tokenizer divides a text into a list of sentences by using an unsupervised algorithm to build a model for abbreviation words. In other words, it is an unsupervised, trainable model that can be trained on unlabeled data.

import nltk

# Loading PunktSentenceTokenizer with English pickle
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')

text = " Hello friends . How are you . welcome to geekscoders.com "

print(tokenizer.tokenize(text))

If you run the code this will be the result.

[' Hello friends .', 'How are you .', 'welcome to geekscoders.com']
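
Because Punkt is trainable, you can also build a tokenizer from your own unlabeled text instead of loading a pickle. The sketch below assumes NLTK is installed. The train_text sample is made up for illustration, and a useful model would need a far larger corpus, so the exact splits it produces are not shown here.

```python
from nltk.tokenize.punkt import PunktSentenceTokenizer

# Hypothetical unlabeled training text; real training needs a much larger corpus.
train_text = (
    "Dr. Smith teaches NLP. Dr. Jones teaches Python. "
    "Dr. Brown reviews the course. The lessons are short. "
    "Dr. Smith and Dr. Jones answer questions every week."
)

# Passing raw text to the constructor trains the Punkt parameters on it.
custom_tokenizer = PunktSentenceTokenizer(train_text)

sample = "Dr. Smith is here. He teaches NLP."
sentences = custom_tokenizer.tokenize(sample)
print(sentences)
```

No labels are supplied anywhere: the tokenizer only sees raw text, which is what "unsupervised" means in this context.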

You can also tokenize sentences from other languages by loading a pickle file other than the English one. In this example we are going to tokenize a Spanish text.

import nltk.data

spanish_tokenizer = nltk.data.load('tokenizers/punkt/spanish.pickle')

mytext = 'Hola amigos . Cómo estás . Por favor suscribete a mi canal'

print(spanish_tokenizer.tokenize(mytext))

If you run the code this will be the result.

['Hola amigos .', 'Cómo estás .', 'Por favor suscribete a mi canal']

2: Word Tokenization

We can do word tokenization using the word_tokenize() function, which is built on NLTK's Treebank-style word tokenizer (TreebankWordTokenizer).

The simplest tokenizer available in Python is the split() method of the string type. It is the most basic tokenizer and uses white space as the delimiter.

mytext = "Hello World ! @ Welcome to Python Natural Language Processing Course 4 you"


print(mytext.split())

This is the result of the code. You can see that the sentence is split into separate words.

['Hello', 'World', '!', '@', 'Welcome', 'to', 'Python', 'Natural', 
'Language', 'Processing', 
'Course', '4', 'you']
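
One limitation of split() is worth seeing. Assuming a text where punctuation is attached to the words (unlike the spaced-out sample above), the punctuation stays glued to its neighbouring word. The sentence below is a made-up example.

```python
text = "Hello, world! Tokenization is fun."

# split() only cuts at white space, so ',', '!' and '.' stay attached to the words.
tokens = text.split()

print(tokens)  # ['Hello,', 'world!', 'Tokenization', 'is', 'fun.']
```

For many NLP tasks 'fun.' and 'fun' should count as the same word, which is why a dedicated word tokenizer is usually preferred.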

Now let’s use word_tokenize() from NLTK. This is the most commonly used tokenizer, and we can consider it the default one.

from nltk.tokenize import word_tokenize

print(word_tokenize(mytext))

This will be the result.

['Hello', 'World', '!', '@', 'Welcome', 'to', 
'Python', 'Natural', 'Language', 'Processing', 
'Course', '4', 'you']
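
The sample sentence already has spaces around its symbols, so split() and word_tokenize() return the same list. A text with contractions and attached punctuation shows the difference more clearly. The sketch below calls TreebankWordTokenizer directly, an assumption made so that no model download is needed; word_tokenize() may treat some cases differently depending on the NLTK version.

```python
from nltk.tokenize import TreebankWordTokenizer

text = "Don't hesitate, it's free."

# White-space splitting keeps punctuation and contractions inside the words.
print(text.split())  # ["Don't", 'hesitate,', "it's", 'free.']

# The Treebank rules separate punctuation and split contractions.
tokens = TreebankWordTokenizer().tokenize(text)
print(tokens)
```

Here the comma and the final period become tokens of their own, and the contraction "Don't" is split so that "n't" is a separate token.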

Regular Expression Tokenizer

A RegexpTokenizer splits a string into substrings using a regular expression. Most of the other tokenizers can be derived from this one, and you can build a very specific tokenizer by supplying a different pattern.

from nltk.tokenize import regexp_tokenize

mytext = "Hello World ! @ Welcome to Python Natural Language Processing Course 4 you"

# Raw strings (r'...') keep the backslashes intact for the regex engine
print(regexp_tokenize(mytext, pattern=r'\w+'))
print(regexp_tokenize(mytext, pattern=r'\d+'))

We used \w+ as the regular expression, which matches runs of letters, digits, and underscores. Every word and number in the string therefore becomes a token, and all other symbols act as separators.

In the second call we pass \d+ as the regex, so the result contains only the digits from the string.

If you run the code this is the result.

['Hello', 'World', 'Welcome', 'to', 'Python', 'Natural', 
'Language', 'Processing', 'Course', '4', 'you']
['4']
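
To see what the two patterns match without NLTK, the same regular expressions can be tried with re.findall from the standard library. This is only a sketch of the matching behaviour, not NLTK's implementation, and the third pattern is a hypothetical example of a more specific tokenizer that keeps symbols as tokens.

```python
import re

mytext = "Hello World ! @ Welcome to Python Natural Language Processing Course 4 you"

# \w+ matches runs of letters, digits and underscores; '!' and '@' are skipped.
words = re.findall(r"\w+", mytext)

# \d+ matches runs of digits only.
digits = re.findall(r"\d+", mytext)

# Hypothetical custom pattern: a word, or any single non-space symbol.
with_symbols = re.findall(r"\w+|[^\w\s]", mytext)

print(words)
print(digits)        # ['4']
print(with_symbols)  # same tokens as mytext.split() for this sentence
```

Changing only the pattern changes what counts as a token, which is the idea behind building specialised tokenizers from regular expressions.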
