Ariel University

Course name: Deep learning and natural language processing

Course number: 7061510-1

Lecturer: Dr. Amos Azaria

Edited by: Moshe Hanukoglu

Date: First Semester 2018-2019

Based on presentations by Dr. Amos Azaria

Natural Language Toolkit (NLTK)

For the computer to understand what we are talking about, we need to simplify and analyze the text.
One of the libraries that helps us a lot in this area is nltk.

You have to install nltk.
In a terminal, run: pip install nltk

Before you run the rest of the code, you have to run the following two commands:

import nltk
nltk.download('all')

You have to run the second command only the first time you use this notebook.

In [1]:
import nltk
In [2]:
nltk.download('all')
[nltk_data] Downloading collection 'all'
[nltk_data]    | 
[nltk_data]    | Downloading package abc to /home/mcsa/nltk_data...
[nltk_data]    |   Package abc is already up-to-date!
[nltk_data]    | Downloading package alpino to /home/mcsa/nltk_data...
[nltk_data]    |   Package alpino is already up-to-date!
[nltk_data]    |   ...
[nltk_data]    | 
[nltk_data]  Done downloading collection all
Out[2]:
True

There are several methods for separating a sentence into words.
The first method is splitting on spaces.

In [3]:
my_text = "Where is St. Paul located? I don't seem to find it. It isn't in my map."
my_text.split(" ") # or my_text.split()
Out[3]:
['Where',
 'is',
 'St.',
 'Paul',
 'located?',
 'I',
 "don't",
 'seem',
 'to',
 'find',
 'it.',
 'It',
 "isn't",
 'in',
 'my',
 'map.']

But this method is not good enough, because there are many cases where it does not work well.
For example, take "Dad went home." The period at the end of the sentence does not belong to the last word, yet splitting on spaces leaves the period attached to the last word.

So we will use a second method, provided by nltk, which separates the text into sentences and words correctly.

^ Contents

Tokenize

In [4]:
from nltk.tokenize import word_tokenize, sent_tokenize
sent_tokenize(my_text)
Out[4]:
['Where is St. Paul located?',
 "I don't seem to find it.",
 "It isn't in my map."]
In [5]:
word_tokenize(my_text)
Out[5]:
['Where',
 'is',
 'St.',
 'Paul',
 'located',
 '?',
 'I',
 'do',
 "n't",
 'seem',
 'to',
 'find',
 'it',
 '.',
 'It',
 'is',
 "n't",
 'in',
 'my',
 'map',
 '.']

In addition, we want to reduce all words to their base form. For example, walking, walked, and walks are all inflected forms of walk.

^ Contents

Stemmer

In [6]:
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
ps = PorterStemmer()
my_text = "Whoever eats many cookies is regretting doing so"
stemmed_sentence = []
for word in word_tokenize(my_text):
    stemmed_sentence.append(ps.stem(word))
stemmed_sentence
Out[6]:
['whoever', 'eat', 'mani', 'cooki', 'is', 'regret', 'do', 'so']
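
Note that the stems produced ('mani', 'cooki') are not necessarily valid English words; the Porter stemmer only strips suffixes heuristically.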

After separating the text into single words and finding each word's "root", we would like to know the syntactic role of each word in the sentence.

^ Contents

Part of Speech Tagging (POS)

In [7]:
my_tokenized_text = word_tokenize(my_text)
nltk.pos_tag(my_tokenized_text)
Out[7]:
[('Whoever', 'NNP'),
 ('eats', 'VBZ'),
 ('many', 'JJ'),
 ('cookies', 'NNS'),
 ('is', 'VBZ'),
 ('regretting', 'VBG'),
 ('doing', 'VBG'),
 ('so', 'RB')]

The following command lists all the part-of-speech tags and what they mean.

In [8]:
nltk.help.upenn_tagset()
$: dollar
    $ -$ --$ A$ C$ HK$ M$ NZ$ S$ U.S.$ US$
'': closing quotation mark
    ' ''
(: opening parenthesis
    ( [ {
): closing parenthesis
    ) ] }
,: comma
    ,
--: dash
    --
.: sentence terminator
    . ! ?
:: colon or ellipsis
    : ; ...
CC: conjunction, coordinating
    & 'n and both but either et for less minus neither nor or plus so
    therefore times v. versus vs. whether yet
CD: numeral, cardinal
    mid-1890 nine-thirty forty-two one-tenth ten million 0.5 one forty-
    seven 1987 twenty '79 zero two 78-degrees eighty-four IX '60s .025
    fifteen 271,124 dozen quintillion DM2,000 ...
DT: determiner
    all an another any both del each either every half la many much nary
    neither no some such that the them these this those
EX: existential there
    there
FW: foreign word
    gemeinschaft hund ich jeux habeas Haementeria Herr K'ang-si vous
    lutihaw alai je jour objets salutaris fille quibusdam pas trop Monte
    terram fiche oui corporis ...
IN: preposition or conjunction, subordinating
    astride among uppon whether out inside pro despite on by throughout
    below within for towards near behind atop around if like until below
    next into if beside ...
JJ: adjective or numeral, ordinal
    third ill-mannered pre-war regrettable oiled calamitous first separable
    ectoplasmic battery-powered participatory fourth still-to-be-named
    multilingual multi-disciplinary ...
JJR: adjective, comparative
    bleaker braver breezier briefer brighter brisker broader bumper busier
    calmer cheaper choosier cleaner clearer closer colder commoner costlier
    cozier creamier crunchier cuter ...
JJS: adjective, superlative
    calmest cheapest choicest classiest cleanest clearest closest commonest
    corniest costliest crassest creepiest crudest cutest darkest deadliest
    dearest deepest densest dinkiest ...
LS: list item marker
    A A. B B. C C. D E F First G H I J K One SP-44001 SP-44002 SP-44005
    SP-44007 Second Third Three Two * a b c d first five four one six three
    two
MD: modal auxiliary
    can cannot could couldn't dare may might must need ought shall should
    shouldn't will would
NN: noun, common, singular or mass
    common-carrier cabbage knuckle-duster Casino afghan shed thermostat
    investment slide humour falloff slick wind hyena override subhumanity
    machinist ...
NNP: noun, proper, singular
    Motown Venneboerger Czestochwa Ranzer Conchita Trumplane Christos
    Oceanside Escobar Kreisler Sawyer Cougar Yvette Ervin ODI Darryl CTCA
    Shannon A.K.C. Meltex Liverpool ...
NNPS: noun, proper, plural
    Americans Americas Amharas Amityvilles Amusements Anarcho-Syndicalists
    Andalusians Andes Andruses Angels Animals Anthony Antilles Antiques
    Apache Apaches Apocrypha ...
NNS: noun, common, plural
    undergraduates scotches bric-a-brac products bodyguards facets coasts
    divestitures storehouses designs clubs fragrances averages
    subjectivists apprehensions muses factory-jobs ...
PDT: pre-determiner
    all both half many quite such sure this
POS: genitive marker
    ' 's
PRP: pronoun, personal
    hers herself him himself hisself it itself me myself one oneself ours
    ourselves ownself self she thee theirs them themselves they thou thy us
PRP$: pronoun, possessive
    her his mine my our ours their thy your
RB: adverb
    occasionally unabatingly maddeningly adventurously professedly
    stirringly prominently technologically magisterially predominately
    swiftly fiscally pitilessly ...
RBR: adverb, comparative
    further gloomier grander graver greater grimmer harder harsher
    healthier heavier higher however larger later leaner lengthier less-
    perfectly lesser lonelier longer louder lower more ...
RBS: adverb, superlative
    best biggest bluntest earliest farthest first furthest hardest
    heartiest highest largest least less most nearest second tightest worst
RP: particle
    aboard about across along apart around aside at away back before behind
    by crop down ever fast for forth from go high i.e. in into just later
    low more off on open out over per pie raising start teeth that through
    under unto up up-pp upon whole with you
SYM: symbol
    % & ' '' ''. ) ). * + ,. < = > @ A[fj] U.S U.S.S.R * ** ***
TO: "to" as preposition or infinitive marker
    to
UH: interjection
    Goodbye Goody Gosh Wow Jeepers Jee-sus Hubba Hey Kee-reist Oops amen
    huh howdy uh dammit whammo shucks heck anyways whodunnit honey golly
    man baby diddle hush sonuvabitch ...
VB: verb, base form
    ask assemble assess assign assume atone attention avoid bake balkanize
    bank begin behold believe bend benefit bevel beware bless boil bomb
    boost brace break bring broil brush build ...
VBD: verb, past tense
    dipped pleaded swiped regummed soaked tidied convened halted registered
    cushioned exacted snubbed strode aimed adopted belied figgered
    speculated wore appreciated contemplated ...
VBG: verb, present participle or gerund
    telegraphing stirring focusing angering judging stalling lactating
    hankerin' alleging veering capping approaching traveling besieging
    encrypting interrupting erasing wincing ...
VBN: verb, past participle
    multihulled dilapidated aerosolized chaired languished panelized used
    experimented flourished imitated reunifed factored condensed sheared
    unsettled primed dubbed desired ...
VBP: verb, present tense, not 3rd person singular
    predominate wrap resort sue twist spill cure lengthen brush terminate
    appear tend stray glisten obtain comprise detest tease attract
    emphasize mold postpone sever return wag ...
VBZ: verb, present tense, 3rd person singular
    bases reconstructs marks mixes displeases seals carps weaves snatches
    slumps stretches authorizes smolders pictures emerges stockpiles
    seduces fizzes uses bolsters slaps speaks pleads ...
WDT: WH-determiner
    that what whatever which whichever
WP: WH-pronoun
    that what whatever whatsoever which who whom whosoever
WP$: WH-pronoun, possessive
    whose
WRB: Wh-adverb
    how however whence whenever where whereby whereever wherein whereof why
``: opening quotation mark
    ` ``

Now that we have seen a way to obtain the syntactic role of each word in the text, we can improve the accuracy of finding the base form of each word in the sentence.
This method is more accurate than stemming but also more complex.

^ Contents

Lemmatization

The WordNetLemmatizer in NLTK uses a shorter list of part-of-speech tags than the Penn Treebank, so we use a short function to convert between the two.

In [9]:
from nltk.corpus import wordnet as wn

def is_noun(tag):
    return tag in ['NN', 'NNS', 'NNP', 'NNPS']

def is_verb(tag):
    return tag in ['VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ']

def is_adverb(tag):
    return tag in ['RB', 'RBR', 'RBS']

def is_adjective(tag):
    return tag in ['JJ', 'JJR', 'JJS']

def penn2wn(tag):
    # convert a Penn Treebank tag to the corresponding WordNet POS constant
    if is_adjective(tag):
        return wn.ADJ
    elif is_noun(tag):
        return wn.NOUN
    elif is_adverb(tag):
        return wn.ADV
    elif is_verb(tag):
        return wn.VERB
    return wn.NOUN
In [10]:
from nltk.tokenize import word_tokenize
from nltk import pos_tag
from nltk.stem.wordnet import WordNetLemmatizer
lzr = WordNetLemmatizer()
my_text = "Whoever eats many cookies is regretting doing so"
lemed = []
for (word,pos) in nltk.pos_tag(word_tokenize(my_text)):
    lemed.append(lzr.lemmatize(word,penn2wn(pos)))
In [12]:
lemed
Out[12]:
['Whoever', 'eat', 'many', 'cooky', 'be', 'regret', 'do', 'so']
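
Unlike the stemmer, the lemmatizer returns dictionary forms: using the POS tags, 'is' is mapped to 'be' and 'cookies' to the WordNet lemma 'cooky'.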

After dividing the sentence into words and analyzing them, we now want to find small parts of the sentence, called chunks. These groups of words allow us to treat a word not only as a single word but as part of a phrase, so that we can better understand its context. To find such groups of words, we construct regular expressions that represent the structure we want to find.
For example, the expression "{<DT>?<JJ>*<NN>}" matches an optional determiner (DT), followed by any number of adjectives (JJ), followed by a noun (NN), e.g. "the pretty girl".

To learn about regular expression syntax, see the link.

^ Contents

Chunking

In [13]:
my_text = "Dogs or small cats saw Sara, John, Tom, the pretty girl and the big bat"
tagged = nltk.pos_tag(nltk.tokenize.word_tokenize(my_text))
grammar = """NP: {<DT>?<JJ>*<NN.?>}
             NounList: {(<NP><,>?)+<CC><NP>}"""
cp = nltk.RegexpParser(grammar)
result = cp.parse(tagged)
In [14]:
print(result)
(S
  (NounList (NP Dogs/NNS) or/CC (NP small/JJ cats/NNS))
  saw/VBD
  (NounList
    (NP Sara/NNP)
    ,/,
    (NP John/NNP)
    ,/,
    (NP Tom/NNP)
    ,/,
    (NP the/DT pretty/JJ girl/NN)
    and/CC
    (NP the/DT big/JJ bat/NN)))

To display the chunk analysis visually:

In [15]:
result.draw()

There are cases where the methods we have learned so far are not enough to analyze a sentence, because it contains names of people, places, organizations, and so on, which are recognized correctly only when the whole phrase is considered together, not each word separately.
So we will use the following tool to analyze such expressions more accurately.

^ Contents

Named Entity Recognition (NER)

In [16]:
result = nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize("Bill Clinton is the president of the United States")))
print (result)
(S
  (PERSON Bill/NNP)
  (PERSON Clinton/NNP)
  is/VBZ
  the/DT
  president/NN
  of/IN
  the/DT
  (GPE United/NNP States/NNPS))
In [17]:
result.draw()

^ Contents

N-Grams

Before we go any further, let us introduce the n-gram function. This function creates groups of N consecutive words.
These sequences give us a perspective on the whole context of a word, rather than just looking at the word by itself.

In [18]:
text = "It is a simple text this, this is a simple text, is it simple?"
list(nltk.ngrams(nltk.word_tokenize(text),3))
Out[18]:
[('It', 'is', 'a'),
 ('is', 'a', 'simple'),
 ('a', 'simple', 'text'),
 ('simple', 'text', 'this'),
 ('text', 'this', ','),
 ('this', ',', 'this'),
 (',', 'this', 'is'),
 ('this', 'is', 'a'),
 ('is', 'a', 'simple'),
 ('a', 'simple', 'text'),
 ('simple', 'text', ','),
 ('text', ',', 'is'),
 (',', 'is', 'it'),
 ('is', 'it', 'simple'),
 ('it', 'simple', '?')]

Creating a random story with N-grams

In [19]:
import urllib
from random import randint
paragraph_len = 100
all_text = urllib.request.urlopen("https://s3.amazonaws.com/text-datasets/nietzsche.txt").read().decode("utf-8") #for Python 2 remove "request"
In [20]:
tokens = nltk.word_tokenize(all_text)
my_grams = list(nltk.ngrams(tokens,3))

We pick seed words for the beginning of the story and run the 3-gram function on the text.
We take the first two words as the beginning of the story, and then look, among all the trigrams created by the 3-gram function, for those whose first two words match the last two words added to the story so far, collecting each matching trigram's third word into an array.
From this array we select the next word at random and append it to the story. We then repeat the same steps, now matching on the newly added word and the word before it.
We continue this way as many times as we want (here, paragraph_len times).

In [21]:
sentence = ["It", "is"]
In [22]:
for i in range(paragraph_len):
    options = []
    for trig in my_grams:
        if trig[0].lower() == sentence[len(sentence)-2].lower() and trig[1].lower() == sentence[len(sentence)-1].lower():
            options.append(trig[2])
    if len(options) > 0:
        sentence.append(options[randint(0, len(options)-1)])
print(" ".join(sentence))
It is highly esteemed one ( for example , are of this contest should always be kept up in sitting still ; and whosoever has experienced any of it would nevertheless remain incontrovertible that of contributing its own eccentricities and sudden outbreaks and that at present sprawls in the distance , so that a certain narrowness , aridity , and with the anxious counsel of physicians , without a sort of INTUITIVE perception , like a gilded , blue , wanton , lightsome , and imperious -- all judgments of its `` intelligible character '' -- In order finally to enjoy life

^ Contents

Context-Free Grammar (CFG)

As you have learned in Automata and Formal Languages, a CFG is composed of four components:
T: terminal vocabulary (the words of the language being defined)
N: non-terminal vocabulary
P: a set of productions of the form a -> b, where a is a non-terminal and b is a sequence of one or more symbols from T ∪ N
S: the start symbol (a member of N)


To read more about context-free grammars, see the following link.


As with the methods so far, here too we want to recover a certain structure for the sentence, so that it is easier to analyze it and to understand the meaning of each of its parts.

CFG Parser

Similar to chunking's search by regular expressions, we can also find the structure of a sentence using a context-free grammar.
We write down our grammar and then look for a derivation of the sentence according to the grammar we have given the parser.

In [23]:
grammar1 = nltk.CFG.fromstring(""" 
S -> NP VP 
VP -> V NP | V NP PP 
PP -> P NP 
V -> "saw" | "ate" | "walked" 
NP -> "John" | "Mary" | "Bob" | Det N | Det N PP 
Det -> "a" | "an" | "the" | "my" 
N -> "man" | "dog" | "cat" | "telescope" | "park" 
P -> "in" | "on" | "by" | "with" """)
sentence = nltk.word_tokenize("Mary saw Bob")
In [24]:
rd_parser = nltk.RecursiveDescentParser(grammar1) 
print(list(rd_parser.parse(sentence))[0]) 
(S (NP Mary) (VP (V saw) (NP Bob)))
In [25]:
print(list(rd_parser.parse("Mary saw a dog with my telescope".split()))[0])
(S
  (NP Mary)
  (VP
    (V saw)
    (NP (Det a) (N dog) (PP (P with) (NP (Det my) (N telescope))))))
In [26]:
(list(rd_parser.parse("Mary saw a dog with my telescope".split()))[0]).draw()

A drawback of such grammars is that sometimes there are several parse trees for the same sentence; we now show one such case.
The following sentence is ambiguous.

In [27]:
groucho_grammar = nltk.CFG.fromstring("""
S -> NP VP
PP -> P NP
NP -> Det N | Det N PP | 'I'
VP -> V NP | VP PP
Det -> 'an' | 'my'
N -> 'elephant' | 'pajamas'
V -> 'shot'
P -> 'in'
""")
sent = ['I', 'shot', 'an', 'elephant', 'in', 'my', 'pajamas'] 
parser = nltk.ChartParser(groucho_grammar) 
for tree in parser.parse(sent): 
    print(tree)
(S
  (NP I)
  (VP
    (VP (V shot) (NP (Det an) (N elephant)))
    (PP (P in) (NP (Det my) (N pajamas)))))
(S
  (NP I)
  (VP
    (V shot)
    (NP (Det an) (N elephant) (PP (P in) (NP (Det my) (N pajamas))))))
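
The first tree attaches the prepositional phrase "in my pajamas" to the verb phrase (the speaker was in pajamas while shooting); the second attaches it to the noun phrase (the elephant is in the pajamas).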
In [29]:
tree.draw()

^ Contents

CYK Algorithm

  • The CYK algorithm, which uses dynamic programming, is used to find a tree (a constituency parse).
  • The complexity of CYK is $O(n^3 \cdot |G|)$, where $n$ is the length of the sentence and $|G|$ is the size of the grammar.
  • We will go over the details in the appendix classes; a minimal sketch is shown below.
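
To make the algorithm concrete, here is a minimal CYK recognizer sketch (our own illustration, not from the lecture). It assumes the grammar is already in Chomsky Normal Form; the grammar encoding and the toy rules are hypothetical.

In [ ]:
def cyk(words, lexical, binary, start='S'):
    # table[i][j] holds the set of non-terminals that derive words[i..j]
    n = len(words)
    table = [[set() for _ in range(n)] for _ in range(n)]
    for i, w in enumerate(words):
        table[i][i] = {A for (A, terminal) in lexical if terminal == w}
    for length in range(2, n + 1):          # length of the span
        for i in range(n - length + 1):     # start of the span
            j = i + length - 1              # end of the span
            for k in range(i, j):           # split point
                for (A, B, C) in binary:    # binary rule A -> B C
                    if B in table[i][k] and C in table[k + 1][j]:
                        table[i][j].add(A)
    return start in table[0][n - 1]

# Toy CNF grammar: S -> NP VP, VP -> V NP (hypothetical example)
lexical = [('NP', 'Mary'), ('V', 'saw'), ('NP', 'Bob')]
binary = [('S', 'NP', 'VP'), ('VP', 'V', 'NP')]
print(cyk(['Mary', 'saw', 'Bob'], lexical, binary))  # True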


The next method, the probabilistic CFG (PCFG), is built like a CFG parser but attaches a probability to each production of the grammar; this way we can compute the probability of deriving a given sentence under the suggested grammar.
Now, even if there are two ways to parse the same sentence, we can choose the more likely option, i.e., the parse with the higher probability.

In [30]:
grammar = nltk.PCFG.fromstring("""
    S    -> NP VP              [1.0]
    VP   -> TV NP              [0.4]
    VP   -> IV                 [0.3]
    VP   -> DatV NP NP         [0.3]
    TV   -> 'saw'              [1.0]
    IV   -> 'ate'              [1.0]
    DatV -> 'gave'             [1.0]
    NP   -> 'telescopes'       [0.8]
    NP   -> 'Jack'             [0.2] """)
viterbi_parser = nltk.ViterbiParser(grammar)
In [31]:
for tree in viterbi_parser.parse(['Jack', 'saw', 'telescopes']):
    print(tree)
(S (NP Jack) (VP (TV saw) (NP telescopes))) (p=0.064)
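
The probability reported is the product of the probabilities of all productions used in the tree: P(S -> NP VP) · P(NP -> Jack) · P(VP -> TV NP) · P(TV -> saw) · P(NP -> telescopes) = 1.0 × 0.2 × 0.4 × 1.0 × 0.8 = 0.064.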

^ Contents

CoreNLP

CoreNLP is more powerful than NLTK, and includes the following features, which we will examine next:

  • A large built-in grammar
  • Dependency Parsing
  • Coreference Resolution

A large built-in grammar

We can load our own grammars into the software; we are not restricted to the default grammars that ship with the library.

Dependency Parsing

Dependency parsing tells us, for each word, on which other word in the sentence it depends, and what type the dependency is.

Before you run this code you should read this link.

In [ ]:
from nltk.parse.stanford import StanfordDependencyParser
path_to_jar = 'path_to/stanford-parser-full-2014-08-27/stanford-parser.jar'
path_to_models_jar = 'path_to/stanford-parser-full-2014-08-27/stanford-parser-3.4.1-models.jar'
dependency_parser = StanfordDependencyParser(path_to_jar=path_to_jar, path_to_models_jar=path_to_models_jar)
result = dependency_parser.raw_parse('I shot an elephant in my sleep')
dep = next(result)
list(dep.triples())
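
Each triple has the form ((governor, governor_tag), relation, (dependent, dependent_tag)); for the sentence above, one of the triples should be (('shot', 'VBD'), 'nsubj', ('I', 'PRP')).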

Coreference Resolution

In our speech we use many references. For example: "The bus was full. It drove very fast." Here "It" refers to the bus.


One of the tools this library gives us is an analysis of such references.
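
Below is a minimal sketch (our own, not from the lecture) of how one might query a locally running CoreNLP server for coreference chains; the server command, port 9000, and the use of the third-party requests library are all assumptions.

In [ ]:
import json
import requests  # third-party HTTP library (pip install requests)

# Assumes a CoreNLP server was started separately, e.g. with:
#   java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000
text = "The bus was full. It drove very fast."
props = {"annotators": "coref", "outputFormat": "json"}
resp = requests.post("http://localhost:9000",
                     params={"properties": json.dumps(props)},
                     data=text.encode("utf-8"))
ann = resp.json()
for chain in ann["corefs"].values():
    # each chain lists all the mentions that refer to the same entity
    print([mention["text"] for mention in chain])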

^ Contents

Sentiment Analysis

There are many cases where we want to know the sentiment of a sentence, so let us check whether the text contains negative or positive content.

In [32]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer
sna = SentimentIntensityAnalyzer()

If you get a warning, look at the link.

Note: compound is a normalized value between -1 (most negative) and +1 (most positive).

In [33]:
sna.polarity_scores("The movie was great!")
Out[33]:
{'compound': 0.6588, 'neg': 0.0, 'neu': 0.406, 'pos': 0.594}
In [34]:
sna.polarity_scores("I liked the book, especially the ending.")
Out[34]:
{'compound': 0.4215, 'neg': 0.0, 'neu': 0.641, 'pos': 0.359}
In [35]:
sna.polarity_scores("The staff were nice, but the food was terrible.")
Out[35]:
{'compound': -0.5023, 'neg': 0.318, 'neu': 0.536, 'pos': 0.146}