Python NLP Default Tagger

In this Python NLP lesson we are going to learn about the Default Tagger. Default tagging provides a baseline for part-of-speech tagging and is performed with the DefaultTagger class, which simply assigns the same part-of-speech tag to every token. The DefaultTagger class takes a single argument, the tag to assign. For example, NN is the tag for a singular noun.

from nltk.tag import DefaultTagger

tagger = DefaultTagger('NN')

print(tagger.tag(['Hello', 'World']))
print(tagger.tag(['Good', 'Morning']))

Every tagger has a tag method that takes a list of tokens and returns a list of (token, tag) tuples. If you run the code, this will be the result.

[('Hello', 'NN'), ('World', 'NN')]
[('Good', 'NN'), ('Morning', 'NN')]
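
To make the behaviour concrete, here is a minimal pure-Python sketch of what a default tagger does. The function name default_tag is made up for illustration and is not part of NLTK; it only mimics the shape of the output shown above, not how the library is implemented.

```python
def default_tag(tokens, tag="NN"):
    """Pair every token with the same tag, ignoring the token itself."""
    return [(token, tag) for token in tokens]


# The tag never depends on the word, so every token gets 'NN'.
print(default_tag(["Hello", "World"]))
print(default_tag(["Good", "Morning"]))
```

Because the word is never inspected, a default tagger is only useful as a baseline or as a fallback for words that another tagger cannot handle.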

You can also remove the tags from a tagged sentence with the untag function.

from nltk.tag import untag

print(untag([('Hello', 'NN'), ('World', 'NN')]))

This is the result.

['Hello', 'World']
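
Conceptually, untagging just keeps the first element of each (word, tag) pair. The sketch below shows that idea in plain Python; strip_tags is a hypothetical helper name used here for illustration, not an NLTK function.

```python
def strip_tags(tagged_tokens):
    """Return only the words from a list of (word, tag) pairs."""
    return [word for word, _tag in tagged_tokens]


tagged = [("Hello", "NN"), ("World", "NN")]
print(strip_tags(tagged))
```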

Every tagger also has an evaluate method that measures its accuracy against a tagged corpus. Newer NLTK releases offer an equivalent accuracy method and mark evaluate as deprecated. For this we are going to use the Brown Corpus. The Brown Corpus was the first million-word electronic corpus of English, created in 1961 at Brown University. It contains text from 500 sources, and the sources have been categorized by genre, such as news and editorial.

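from nltk.corpus import brown
from nltk.tag import DefaultTagger

# Requires the corpus data: run nltk.download('brown') once beforehand.
brown_tagged_sents = brown.tagged_sents(categories='news')

default_tagger = DefaultTagger('NN')

# Fraction of tokens whose gold-standard tag is 'NN'.
print(default_tagger.evaluate(brown_tagged_sents))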

Run the code and you will see a poor result: the accuracy is only about 13 percent, because only about 13 percent of the tokens in the news category are tagged NN.

0.13089484257215028
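
The accuracy figure is simply the share of tokens whose predicted tag matches the gold-standard tag. The sketch below computes that for a default tagger on a tiny hand-made sample. The function name tagging_accuracy and the toy sentences are assumptions made for illustration; they are not taken from the Brown Corpus or from NLTK.

```python
def tagging_accuracy(gold_sents, tag="NN"):
    """Share of tokens a default tagger would get right when it always predicts `tag`."""
    correct = 0
    total = 0
    for sent in gold_sents:
        for _word, gold_tag in sent:
            total += 1
            if gold_tag == tag:
                correct += 1
    # Guard against an empty corpus.
    return correct / total if total else 0.0


# Hypothetical gold-standard data: 2 of the 5 tokens are tagged NN.
toy_gold = [
    [("The", "AT"), ("dog", "NN"), ("barked", "VBD")],
    [("Time", "NN"), ("flies", "VBZ")],
]
print(tagging_accuracy(toy_gold))  # 0.4
```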

There are other taggers you can use, for example the unigram tagger. A unigram generally refers to a single token, so a unigram tagger uses only a single word as its context for determining the part-of-speech tag.

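from nltk.tag import UnigramTagger
from nltk.corpus import treebank

# Requires the corpus sample: run nltk.download('treebank') once beforehand.
# Train on the first 2000 tagged sentences.
train_sents = treebank.tagged_sents()[:2000]

tagger = UnigramTagger(train_sents)

print(treebank.sents()[0])
print(tagger.tag(treebank.sents()[0]))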

In the above example we used the first 2000 tagged sentences of the treebank corpus as the training set for the UnigramTagger class. If you run the code, this is the result.

['Pierre', 'Vinken', ',', '61', 'years', 'old', ',', 'will', 'join', 
'the', 'board', 'as', 'a', 'nonexecutive', 'director', 'Nov.', '29', '.']
[('Pierre', 'NNP'), ('Vinken', 'NNP'), (',', ','), ('61', 'CD'),
 ('years', 'NNS'), ('old', 'JJ'), (',', ','), ('will', 'MD'), ('join', 'VB'),
 ('the', 'DT'), ('board', 'NN'), ('as', 'IN'), ('a', 'DT'),
 ('nonexecutive', 'JJ'), ('director', 'NN'), ('Nov.', 'NNP'), 
('29', 'CD'), ('.', '.')]
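
The idea behind a unigram tagger can be sketched in a few lines of plain Python: count how often each word appears with each tag in the training data, then tag each word with its most frequent tag. This is a simplified illustration, not NLTK's implementation. The names train_unigram and unigram_tag and the toy training sentences are hypothetical. Words that never appeared in training get None here, which matches how UnigramTagger behaves when no backoff tagger is supplied.

```python
from collections import Counter, defaultdict


def train_unigram(tagged_sents):
    """Map each word to the tag it was seen with most often."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag in sent:
            counts[word][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}


def unigram_tag(model, tokens):
    """Tag each token from the lookup table; unseen words get None."""
    return [(token, model.get(token)) for token in tokens]


# Hypothetical training data: 'will' is seen as MD twice and as NN once.
toy_train = [
    [("the", "DT"), ("board", "NN"), ("will", "MD"), ("meet", "VB")],
    [("the", "DT"), ("will", "NN"), ("was", "VBD"), ("read", "VBN")],
    [("they", "PRP"), ("will", "MD"), ("join", "VB")],
]
model = train_unigram(toy_train)
print(unigram_tag(model, ["the", "board", "will", "adjourn"]))
```

Here 'will' is tagged MD because that is its most frequent tag in the toy data, and 'adjourn' is tagged None because it was never seen. The surrounding words play no role in either decision.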

Now let’s check the accuracy.

# Evaluate on the sentences that were not used for training.
test_sents = treebank.tagged_sents()[2000:]

print("Accuracy : ", tagger.evaluate(test_sents))

As you can see, the accuracy is now about 83 percent.

Accuracy :  0.8289714062852803
