In this Python NLP lesson we are going to learn about POS tagging. POS is short for Part of Speech, so this is also called Part-of-Speech tagging.
What is a Part of Speech?
Identifying parts of speech (POS) is one of the many tasks in NLP. In English the main parts of speech are noun, pronoun, adjective, determiner, verb, adverb, preposition, conjunction, and interjection. You are probably already familiar with what adjectives and adverbs are and how they differ. As a human you know this intuitively, but think about how a system could encode all of that knowledge. A part-of-speech tag identifies whether a word is a noun, verb, adjective, and so on. Part-of-speech tagging has numerous applications, such as information retrieval, machine translation and so on.
What is Part-of-Speech (POS) Tagging?
Part-of-speech tagging is the process of assigning a category tag (for example, noun, verb, adjective, and so on) to each individual token in a sentence. In NLTK, taggers live in the nltk.tag package, and they all inherit from the TaggerI base class.
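As a quick sketch of that class hierarchy, NLTK's simplest tagger, DefaultTagger, inherits from TaggerI and just assigns one fixed tag to every token:

```python
from nltk.tag import DefaultTagger
from nltk.tag.api import TaggerI

# DefaultTagger lives in nltk.tag and inherits from the TaggerI base class;
# it assigns the same tag ('NN' here) to every token it sees
tagger = DefaultTagger('NN')
print(isinstance(tagger, TaggerI))          # → True
print(tagger.tag(['python', 'is', 'fun']))  # → [('python', 'NN'), ('is', 'NN'), ('fun', 'NN')]
```

A default tagger is obviously too crude on its own, but it is often used as the fallback at the end of a chain of smarter taggers.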
OK, now let's create a simple POS tagging example.
```python
from nltk.tokenize import word_tokenize
from nltk import pos_tag

text = word_tokenize("python is a good language")
print(pos_tag(text))
```
If you run the code you will see this result:
```
[('python', 'NN'), ('is', 'VBZ'), ('a', 'DT'), ('good', 'JJ'), ('language', 'NN')]
```
If you don't know what a tag such as NN or VBZ means, NLTK has a help function you can use.
```python
from nltk.tokenize import word_tokenize
from nltk import pos_tag, help

text = word_tokenize("python is a good language")
print(pos_tag(text))

# upenn_tagset() prints the tag's documentation directly
help.upenn_tagset('NNS')
```

This will be the result:

```
[('python', 'NN'), ('is', 'VBZ'), ('a', 'DT'), ('good', 'JJ'), ('language', 'NN')]
NNS: noun, common, plural
    undergraduates scotches bric-a-brac products bodyguards facets coasts
    divestitures storehouses designs clubs fragrances averages
    subjectivists apprehensions muses factory-jobs ...
```
Let's create another example. This time we want to use the wikipedia library to extract some data from Wikipedia; first of all you need to install the library using pip.
```
pip install wikipedia
```
This is our example:
```python
from nltk.tokenize import word_tokenize
from nltk import pos_tag
import wikipedia

# Fetch the summary of the Wikipedia article on grammar
grammar = wikipedia.summary('grammar')

gra_token = word_tokenize(grammar)
sample_pos = pos_tag(gra_token)
print(sample_pos)
```
Run the code and this will be the result:
```
[('In', 'IN'), ('linguistics', 'NNS'), (',', ','), ('grammar', 'NN'), ('(', '('),
 ('from', 'IN'), ('Ancient', 'NNP'), ('Greek', 'NNP'), ('γραμματική', 'NNP'), (')', ')'),
 ('is', 'VBZ'), ('the', 'DT'), ('set', 'NN'), ('of', 'IN'), ('structural', 'JJ'),
 ('rules', 'NNS'), ('governing', 'VBG'), ('the', 'DT'), ('composition', 'NN'), ('of', 'IN'),
 ('clauses', 'NNS'), (',', ','), ('phrases', 'NNS'), ('and', 'CC'), ('words', 'NNS'),
 ('in', 'IN'), ('a', 'DT'), ('natural', 'JJ'), ('language', 'NN'), ('.', '.'),
 ('The', 'DT'), ('term', 'NN'), ('refers', 'NNS'), ('also', 'RB'), ('to', 'TO'),
 ('the', 'DT'), ('study', 'NN'), ('of', 'IN'), ('such', 'JJ'), ('rules', 'NNS'),
 ('and', 'CC'), ('this', 'DT'), ('field', 'NN'), ('includes', 'VBZ'), ('phonology', 'NN'),
 (',', ','), ('morphology', 'NN'), ('and', 'CC'), ('syntax', 'NN'), (',', ','),
 ('often', 'RB'), ('complemented', 'VBN'), ('by', 'IN'), ...]
```
Now we are going to pull the NN and NNP tokens out of our text; you can use this code for that.
```python
from nltk.tokenize import word_tokenize
from nltk import pos_tag
import wikipedia

grammar = wikipedia.summary('grammar')
gra_token = word_tokenize(grammar)
sample_pos = pos_tag(gra_token)

# Keep only the tokens tagged as singular common nouns (NN) or proper nouns (NNP)
all_noun = [word for word, pos in sample_pos if pos in ['NN', 'NNP']]
print(all_noun)
```