In this Python NLP lesson we are going to learn about POS tagging. POS is short for Part of Speech, so this is also called Part-of-Speech tagging.
What is a Part of Speech?
Identifying parts of speech (POS) is one of the many tasks in NLP. In English the main parts of speech are noun, pronoun, adjective, determiner, verb, adverb, preposition, conjunction, and interjection. You are probably already familiar with what adjectives and adverbs are and how they differ. As a human you know this intuitively, but think about how a system could encode all of that knowledge. A part-of-speech tag identifies whether a word is a noun, verb, adjective, and so on. Part-of-speech tagging has numerous applications, such as information retrieval, machine translation and so on.
What is Part-of-Speech (POS) Tagging?
Part-of-speech tagging is the process of assigning a category tag (for example, noun, verb, adjective, and so on) to each individual token in a sentence. In NLTK, taggers live in the nltk.tag package, and they all inherit from the TaggerI base class.
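As a quick sketch of that class hierarchy, NLTK's simplest tagger, DefaultTagger, inherits from TaggerI and just assigns one fixed tag to every token:

```python
from nltk.tag import DefaultTagger
from nltk.tag.api import TaggerI

# DefaultTagger lives in nltk.tag and inherits from the TaggerI base class;
# it assigns the same tag ('NN' here) to every token it sees
tagger = DefaultTagger('NN')
print(isinstance(tagger, TaggerI))          # → True
print(tagger.tag(['python', 'is', 'fun']))  # → [('python', 'NN'), ('is', 'NN'), ('fun', 'NN')]
```

A default tagger is obviously too crude on its own, but it is often used as the fallback at the end of a chain of smarter taggers.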
OK, now let's create a simple POS tagging example.
```python
from nltk.tokenize import word_tokenize
from nltk import pos_tag

text = word_tokenize("python is a good language")
print(pos_tag(text))
```
If you run the code you will see this result:
```
[('python', 'NN'), ('is', 'VBZ'), ('a', 'DT'), ('good', 'JJ'), ('language', 'NN')]
```
If you don't know what a tag such as NN or VBZ means, NLTK has a help function you can use.
```python
from nltk.tokenize import word_tokenize
from nltk import pos_tag, help

text = word_tokenize("python is a good language")
print(pos_tag(text))

# upenn_tagset() prints the tag's documentation directly
help.upenn_tagset('NNS')
```

This will be the result:

```
[('python', 'NN'), ('is', 'VBZ'), ('a', 'DT'), ('good', 'JJ'), ('language', 'NN')]
NNS: noun, common, plural
    undergraduates scotches bric-a-brac products bodyguards facets coasts
    divestitures storehouses designs clubs fragrances averages
    subjectivists apprehensions muses factory-jobs ...
```
Let's create another example. This time we want to use the wikipedia library to extract some data from Wikipedia; first of all you need to install the library using pip.
```
pip install wikipedia
```
This is our example:
```python
from nltk.tokenize import word_tokenize
from nltk import pos_tag
import wikipedia

# Fetch the summary of the Wikipedia article on grammar
grammar = wikipedia.summary('grammar')

gra_token = word_tokenize(grammar)
sample_pos = pos_tag(gra_token)
print(sample_pos)
```
Run the code and this will be the result:
```
[('In', 'IN'), ('linguistics', 'NNS'), (',', ','), ('grammar', 'NN'), ('(', '('),
 ('from', 'IN'), ('Ancient', 'NNP'), ('Greek', 'NNP'), ('γραμματική', 'NNP'), (')', ')'),
 ('is', 'VBZ'), ('the', 'DT'), ('set', 'NN'), ('of', 'IN'), ('structural', 'JJ'),
 ('rules', 'NNS'), ('governing', 'VBG'), ('the', 'DT'), ('composition', 'NN'), ('of', 'IN'),
 ('clauses', 'NNS'), (',', ','), ('phrases', 'NNS'), ('and', 'CC'), ('words', 'NNS'),
 ('in', 'IN'), ('a', 'DT'), ('natural', 'JJ'), ('language', 'NN'), ('.', '.'),
 ('The', 'DT'), ('term', 'NN'), ('refers', 'NNS'), ('also', 'RB'), ('to', 'TO'),
 ('the', 'DT'), ('study', 'NN'), ('of', 'IN'), ('such', 'JJ'), ('rules', 'NNS'),
 ('and', 'CC'), ('this', 'DT'), ('field', 'NN'), ('includes', 'VBZ'), ('phonology', 'NN'),
 (',', ','), ('morphology', 'NN'), ('and', 'CC'), ('syntax', 'NN'), (',', ','),
 ('often', 'RB'), ('complemented', 'VBN'), ('by', 'IN'), ...]
```
Now we are going to pull the NN and NNP tokens out of our text; you can use this code for that.
```python
from nltk.tokenize import word_tokenize
from nltk import pos_tag
import wikipedia

grammar = wikipedia.summary('grammar')
gra_token = word_tokenize(grammar)
sample_pos = pos_tag(gra_token)

# Keep only the tokens tagged as singular common nouns (NN) or proper nouns (NNP)
all_noun = [word for word, pos in sample_pos if pos in ['NN', 'NNP']]
print(all_noun)
```