Playing with Grammar in Python

Suppose our goal is to have Python read a sentence and extract some content from it. The most common application is sentiment analysis, wherein Python scans over a sentence and tells us whether the sentence has a particular sentiment (e.g. "good" or "bad").

For example:

"We had an awful quarter, sales have been terrible."

has a negative tone. Python can detect this tone by being fed a list of negative words (which would include "awful" and "terrible") and then finding those words in the example sentence. This application is fairly straightforward; the sample code below tells us the sentence is 100% negative.

In [1]:
# example sentence
sentence = "We had an awful quarter, sales have been terrible."

# example tone lists (real lists would be much longer than these)
positive_words = ["great", "tremendous", "amazing"]
negative_words = ["awful", "terrible", "horrific"]

# tone = num. neg. words / (num. neg. words + num. pos. words)
# strip punctuation so that "terrible." is matched as "terrible"
words = [word.strip('.,;:!?"') for word in sentence.lower().split()]
num_pos = len([word for word in words if word in positive_words])
num_neg = len([word for word in words if word in negative_words])
tone = num_neg / (num_neg + num_pos)
print(tone)
1.0
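
The same calculation can be wrapped in a function so that it can be reused across sentences. What follows is a minimal sketch, not a definitive implementation: tone_score is a hypothetical name, and both the regex tokenization and the choice to return None when no tone words are found are assumptions.

```python
import re

def tone_score(sentence, positive_words, negative_words):
    """Share of tone words that are negative, or None if no tone words are found."""
    # lower-case the text and keep only alphabetic tokens, so "terrible." matches "terrible"
    tokens = re.findall(r"[a-z']+", sentence.lower())
    num_pos = sum(token in positive_words for token in tokens)
    num_neg = sum(token in negative_words for token in tokens)
    total = num_pos + num_neg
    # assumption: report None rather than dividing by zero when no tone words appear
    return num_neg / total if total else None

positive_words = {"great", "tremendous", "amazing"}
negative_words = {"awful", "terrible", "horrific"}

print(tone_score("We had an awful quarter, sales have been terrible.",
                 positive_words, negative_words))
print(tone_score("Sales were flat.", positive_words, negative_words))
```

The first call prints 1.0 and the second prints None, since "Sales were flat." contains no words from either list.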

We can go deeper than this. Python has modules that allow us to unpack the grammar of a sentence. By doing so, we can look for more specific types of content. Here, we'll consider a search for news articles that report management-issued guidance.

To begin, consider an obvious instance of management guidance:

XYZ announced that earnings will increase this year.

No sentence could be more plain than this. XYZ, the hypothetical company in the example above, announces that earnings are expected to increase this year. Because the topic ("earnings") pertains to a future period ("this year") rather than a prior period (e.g. "last quarter"), the statement is forward-looking.

The task of finding management-issued guidance can be broken down into three parts:

  1. Does the sentence pertain to relevant financial information (e.g. "earnings")?
  2. Does the financial information pertain to a future period (e.g. "next quarter")?
  3. Is the forward-looking statement being made by a company representative?

Let's start with task (1). Given a sentence:

In [2]:
sent = 'XYZ announced that earnings will increase this year.'

Begin by looking for earnings-related words:

In [3]:
# list of financial words/phrases, the full list could be much longer
earnings_words = ['earnings', 'profitability', 'dollars per share']

# scan over earnings_words and check whether these words appear in the sentence of interest
[w in sent for w in earnings_words]
Out[3]:
[True, False, False]

Of the three entries in the list earnings_words, only the first ("earnings") appears in the sentence.

Next look for forward-looking language:

In [4]:
# list of forward-looking words, the full list could be much longer
forward_words = ['forecasted', 'estimated', 'will', 'expected']

# scan over forward_words and check whether these words appear in the sentence of interest
[w in sent for w in forward_words]
Out[4]:
[False, False, True, False]

Of the four words in forward_words, only the third ("will") appears in the sentence.
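
One caveat: `w in sent` is a substring test, so "will" would also be found inside a word such as "goodwill". A minimal sketch of a whole-word check using the standard re module follows; the helper name contains_term and the second example sentence are hypothetical.

```python
import re

def contains_term(text, term):
    """True if term appears in text as a whole word or phrase, ignoring case."""
    pattern = r"\b" + re.escape(term) + r"\b"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None

forward_words = ['forecasted', 'estimated', 'will', 'expected']

sent = 'XYZ announced that earnings will increase this year.'
print([contains_term(sent, w) for w in forward_words])

# hypothetical sentence: the substring test flags "will" because of "goodwill"
other = 'XYZ wrote down goodwill last year.'
print('will' in other, contains_term(other, 'will'))
```

The first line of output matches the result above; the second shows the substring test returning True while the whole-word check returns False.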

We must be careful that the forward-looking language is being applied to the earnings-related word, rather than elsewhere in the sentence. For example, in the sentence below, the earnings word ("earnings") is in a separate clause from the forward word ("will").

In [5]:
bad_sent = '''XYZ stated that although earnings had fallen last year,
              the board remained confident in how the new CEO will manage the company.'''
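
Running the same keyword checks on this sentence shows the problem. The sketch below simply repeats the word lists from earlier cells; both checks pass even though the two words are unrelated.

```python
earnings_words = ['earnings', 'profitability', 'dollars per share']
forward_words = ['forecasted', 'estimated', 'will', 'expected']

bad_sent = '''XYZ stated that although earnings had fallen last year,
              the board remained confident in how the new CEO will manage the company.'''

has_earnings = any(w in bad_sent for w in earnings_words)
has_forward = any(w in bad_sent for w in forward_words)

# both are True, even though "will" describes the CEO rather than earnings
print(has_earnings, has_forward)
```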

To ensure that the forward-looking word and earnings-related word are connected in the sentence, the grammar of the sentence must be considered.

To do this, one can run the sentence through spaCy to analyze the text.

In [6]:
# load spaCy module
import spacy

# pass the sentence through spaCy's text-processing pipeline
nlp = spacy.load("en_core_web_lg")
doc = nlp(sent)

# display the grammar of the sentence
spacy.displacy.render(doc,
               style="dep", # show the dependency structure
               options={'distance':110, # make the output smaller
                        'collapse_phrases':True}, # collapse noun phrases
               jupyter=True) # visualizer being run within Jupyter environment
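[Dependency diagram rendered by displaCy. Tokens and parts of speech: XYZ (PROPN), announced (VERB), that (ADP), earnings (NOUN), will (VERB), increase (VERB), this (DET), year. (NOUN). Arc labels: nsubj, mark, nsubj, aux, ccomp, det, npadvmod.]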

Every word has a part of speech (e.g. VERB, NOUN) as well as a dependency, which links it to a head word. For example, "XYZ" is a proper noun (PROPN) whose dependency type is nsubj (nominal subject) and whose head is the verb "announced".

We can access all of this information from the doc object returned from nlp().

In [7]:
for w in doc:
    print(w.text, w.pos_, w.dep_, w.head.text)
XYZ PROPN nsubj announced
announced VERB ROOT announced
that ADP mark increase
earnings NOUN nsubj increase
will VERB aux increase
increase VERB ccomp announced
this DET det year
year NOUN npadvmod increase
. PUNCT punct announced

One simple way to verify that the earnings-related word and the forward-looking word belong to the same part of the sentence is to check that both words attach to the same verb. This ignores more complicated sentence structures, so additional checks should be added to the code.

The verb for the earnings-related word is found:

In [8]:
e_words = [w for w in doc if w.text in earnings_words]

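def get_verb(w):
    # climb the dependency tree until a non-auxiliary verb is reached
    h = w
    while not (h.pos_ == 'VERB' and h.dep_ != 'aux'):
        if h.head == h:
            # the root is its own head; stop here to avoid an infinite loop
            break
        h = h.head
    return h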

e_verbs = {w:get_verb(w) for w in e_words}
    
for e, v in e_verbs.items():
    print(e.text, v.text)
earnings increase

The verb for the forward-looking word is similarly found:

In [9]:
f_words = [w for w in doc if w.text in forward_words]

f_verbs = {w:get_verb(w) for w in f_words}
    
for f, v in f_verbs.items():
    print(f.text, v.text)
will increase

Because "earnings" (the earnings-related word) and "will" (the forward-looking word) share the verb "increase", we can understand that the forward-looking language is being used to discuss the earnings-related topic.
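
The comparison itself can be written out explicitly. Below is a minimal sketch that does not require spaCy: Tok is a hypothetical stand-in for a spaCy token, filled in by hand from the parse printed earlier, and governing_verb is an assumed variant of get_verb that returns None when no verb is found.

```python
from dataclasses import dataclass

@dataclass(eq=False)
class Tok:
    """Hypothetical stand-in for a spaCy token."""
    text: str
    pos_: str
    dep_: str
    head: "Tok" = None

def governing_verb(tok):
    """Climb the head links until a non-auxiliary verb is reached."""
    h = tok
    while not (h.pos_ == 'VERB' and h.dep_ != 'aux'):
        if h.head is h:
            # the root is its own head: no verb was found
            return None
        h = h.head
    return h

# hand-copied from the parse printed above (text, pos_, dep_, head)
announced = Tok('announced', 'VERB', 'ROOT')
announced.head = announced
increase = Tok('increase', 'VERB', 'ccomp', announced)
earnings = Tok('earnings', 'NOUN', 'nsubj', increase)
will = Tok('will', 'VERB', 'aux', increase)

# the two words are connected if they resolve to the same verb
same_verb = governing_verb(earnings) is governing_verb(will)
print(governing_verb(earnings).text, governing_verb(will).text, same_verb)
```

Returning None instead of the root makes the "no verb found" case explicit, at the cost of requiring the caller to check for it.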

Note that we ignored verbs with dependency "aux" in the above. Auxiliary verbs modify other verbs; they are not the principal verb of the subject-verb pair that we are looking for. However, auxiliary verbs are important because they help us verify forward-looking language. English does not have an inflected future tense; future actions are instead indicated by auxiliary constructions. For instance, "this year's earnings increase" is in the present tense, whereas "next year's earnings will increase" refers to the future. In the latter case, the verb "increase" is modified by the auxiliary verb "will". Auxiliary verbs do not always indicate the future, however; their role is more nuanced. For example:

In [10]:
sent1 = 'XYZ had expected earnings to increase last year.'
sent2 = 'XYZ expected earnings to increase next year.'

In sent1, the auxiliary "had" modifies "expected" to form the past perfect, which places both the expectation and the expected increase in the past ("last year"). In sent2, "expected" has no auxiliary modifier; because an expectation concerns events that have not yet happened, and the period named is "next year", the sentence is forward-looking even though no auxiliary such as "will" is present.
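
The time expression offers a cheap cross-check on the auxiliary verb. The sketch below is an assumption layered on top of the approach described here, not part of it: the cue lists are hypothetical and deliberately short, and period_direction is a hypothetical helper.

```python
import re

# hypothetical, deliberately short cue lists; real lists would be much longer
FUTURE_CUES = ['next', 'coming', 'upcoming']
PAST_CUES = ['last', 'prior', 'previous']
PERIODS = ['quarter', 'year', 'month']

def _cue_pattern(cues):
    # builds e.g. \b(next|coming|upcoming)\s+(quarter|year|month)\b
    return re.compile(r"\b(" + "|".join(cues) + r")\s+(" + "|".join(PERIODS) + r")\b",
                      flags=re.IGNORECASE)

FUTURE_RE = _cue_pattern(FUTURE_CUES)
PAST_RE = _cue_pattern(PAST_CUES)

def period_direction(text):
    """Return 'future', 'past', 'mixed' or None based on the time expressions found."""
    has_future = FUTURE_RE.search(text) is not None
    has_past = PAST_RE.search(text) is not None
    if has_future and has_past:
        return 'mixed'
    if has_future:
        return 'future'
    if has_past:
        return 'past'
    return None

sent1 = 'XYZ had expected earnings to increase last year.'
sent2 = 'XYZ expected earnings to increase next year.'
print(period_direction(sent1), period_direction(sent2))
```

This prints "past future". A phrase such as "this year" matches neither list and returns None, so the check can only supplement the grammatical analysis.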

What remains is to determine whether the forward-looking statement about an earnings-related topic is being made by management. We don't, for instance, wish to include forecasts made by analysts. To determine the speaker in the sentence, we need to find the other subjects in the sentence. The word "earnings" in sent is the subject of "increase" whereas the noun phrase "XYZ" is the subject of "announced". These two verbs are linked together ("increase" is a clausal complement, ccomp, of "announced"). We begin by mapping each verb to its subject:

In [11]:
def get_subjMap(doc):
    subj_map = {}
    for s in doc.sents:
        for w in s:
            if w.dep_ == 'nsubj':
                subj_map.update({w.head: w})
                
    return subj_map
        
subj_map = get_subjMap(doc)
for v, w in subj_map.items():
    print(w, v)
XYZ announced
earnings increase

Then, starting at the verb we discovered earlier (and saved in e_verbs), we look for related subject-verb phrases.

In [12]:
for e, v in e_verbs.items():
    subj = subj_map[v.head]
    print(e, subj)
earnings XYZ

This gives confirmation that the agent doing the forecasting is XYZ.
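
Note that `subj_map[v.head]` raises a KeyError whenever the head of the verb has no subject of its own. A more defensive lookup can climb the tree until it finds a verb that does. The sketch below avoids spaCy by using a hypothetical Tok stand-in filled in by hand from the parse above; find_speaker is likewise a hypothetical helper.

```python
from dataclasses import dataclass

@dataclass(eq=False)
class Tok:
    """Hypothetical stand-in for a spaCy token."""
    text: str
    dep_: str
    head: "Tok" = None

def find_speaker(verb, subj_map):
    """Climb from verb toward the root; return the subject of the first enclosing verb."""
    h = verb
    while h.head is not h:
        h = h.head
        if h in subj_map:
            return subj_map[h]
    # reached the root without finding an enclosing subject
    return None

# hand-copied from the parse printed earlier
announced = Tok('announced', 'ROOT')
announced.head = announced
increase = Tok('increase', 'ccomp', announced)
xyz = Tok('XYZ', 'nsubj', announced)
earnings = Tok('earnings', 'nsubj', increase)

# map each verb to its subject, as get_subjMap does
subj_map = {t.head: t for t in (xyz, earnings) if t.dep_ == 'nsubj'}

print(find_speaker(increase, subj_map).text)
print(find_speaker(announced, subj_map))
```

The first call prints XYZ; the second prints None because "announced" is the root and has no enclosing clause.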

What about instances in which it is not immediately clear from the subject of the sentence what the affiliation of the speaker is? For example:

In [13]:
para = '''
XYZ announced strong results for the quarter.
Alice Smith, CEO of XYZ, remains optimistic.
Bob Johnson, an analyst covering XYZ pressured Smith for details on the latest earnings call.
Smith stated that she expected earnings growth over the next year.
'''

It is the last sentence that has a forecast. However, the subject doing the forecasting is "Smith". Absent any other context, it is unclear from that sentence alone whether "Smith" is affiliated with the company. Note that her affiliation is clarified two sentences earlier.

Because we've expanded the text to contain multiple sentences, before going any further let's define a function to check each sentence for the information we've thus far been able to extract. If the function finds a forward-looking statement about an earnings-related item, it should return:

  1. the earnings-related word
  2. the forward-looking word
  3. the verb corresponding to the earnings-related word
  4. the sentence

A sentence may have multiple instances of items (1)-(3), so the function should be structured to return a list of those instances as well as item (4).

In [14]:
docp = nlp(para)

def find_sentence(doc):
    
    return_items = {}
    
    for s in doc.sents:

        # look for earnings-related words
        ep_words = [w for w in s if w.text in earnings_words]
        ep_verbs = {w:get_verb(w) for w in ep_words}

        # look for forward-looking words
        fp_words = [w for w in s if w.text in forward_words]
        fp_verbs = {w:get_verb(w) for w in fp_words}

        # verify that the forward and earnings word match
        for e, ev in ep_verbs.items():
            for f, fv in fp_verbs.items():
                if ev == fv:
                    if s not in return_items:
                        return_items.update({s: [[e.text, f.text, ev]]})
                    else:
                        return_items[s].append([e.text, f.text, ev])

    return return_items
    
found_sentences = find_sentence(docp)
for sentence, instances in found_sentences.items():
    print(sentence)
    for instance in instances:
        print('\t', instance)
Smith stated that she expected earnings growth over the next year.

	 ['earnings', 'expected', expected]

The map of subject-verb pairs in the paragraph is given by:

In [15]:
subj_map = get_subjMap(docp)
for v, w in subj_map.items():
    print(w, v)
XYZ announced
Smith remains
Johnson pressured
Smith stated
she expected

And so if we go looking for the speaker in our forecast sentence:

In [16]:
for instances in found_sentences.values():
    for instance in instances:
        e, f, v = instance
        subj = subj_map[v]
        print(e, f, v, subj)
earnings expected expected she

We find that the speaker is simply "she".

To figure out who "she" refers to, we can use a coreference tool. The tool is in the neuralcoref module and can be added to a spaCy pipeline.

(Technical note: neuralcoref requires spaCy==2.1.0, though a version for spaCy 3+ is in development.)

In [17]:
import neuralcoref

# create a new spaCy pipeline
nlp2 = spacy.load('en_core_web_lg')

# add neuralcoref to this pipeline
neuralcoref.add_to_pipe(nlp2)
Out[17]:
<spacy.lang.en.English at 0x7f311cddf0a0>

Now, when we pass the paragraph to spaCy, the output model includes a list of coreference clusters.

In [18]:
docp2 = nlp2(para)

for item in docp2._.coref_clusters:
    print(item.main, item.mentions)
XYZ [
XYZ, XYZ]
Smith [Alice Smith, CEO of XYZ, Smith, Smith, she]

The second coreference cluster shows us that the "she" we're interested in belongs to the same cluster as "Alice Smith", indicating that "she" refers to "Alice Smith". Also within this cluster is the phrase "CEO of XYZ". Given that "XYZ" is the company we are interested in, we can usually deduce that "she" is speaking on behalf of XYZ.
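
That last deduction can be automated in a rough way by scanning the mentions of a cluster for the company name next to a management title. The sketch below works on plain strings rather than neuralcoref objects; the title list, the helper is_management and the analyst mentions are all hypothetical, and the Smith mentions are copied by hand from the output above.

```python
import re

# hypothetical list of titles that mark a company representative
MANAGEMENT_TITLES = ['CEO', 'CFO', 'COO', 'chief executive', 'president', 'chairman']

TITLE_RE = re.compile(
    r"\b(" + "|".join(re.escape(t) for t in MANAGEMENT_TITLES) + r")\b",
    flags=re.IGNORECASE)

def is_management(mentions, company):
    """True if any single mention names both the company and a management title."""
    company_re = re.compile(r"\b" + re.escape(company) + r"\b", flags=re.IGNORECASE)
    for mention in mentions:
        if company_re.search(mention) and TITLE_RE.search(mention):
            return True
    return False

# mentions copied by hand from the coreference output above
smith_mentions = ['Alice Smith', 'CEO of XYZ', 'Smith', 'Smith', 'she']
# hypothetical mentions for the analyst, for contrast
johnson_mentions = ['Bob Johnson', 'an analyst covering XYZ']

print(is_management(smith_mentions, 'XYZ'))
print(is_management(johnson_mentions, 'XYZ'))
```

This prints True for the Smith cluster and False for the analyst. A fixed title list is a blunt instrument, so the result is better treated as a hint than as a final classification.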

In [19]:
found_sentences2 = find_sentence(docp2)
subj_map2 = get_subjMap(docp2)

for instances in found_sentences2.values():
    for instance in instances:
        e, f, v = instance       
        subj = subj_map2[v]
        print(e, f, v, subj, subj._.coref_clusters)
earnings expected expected she [Smith: [Alice Smith, CEO of XYZ, Smith, Smith, she]]

Hence, with a little bit of grammar parsing, it is possible to find reports of management-issued guidance in a news article. Obviously, the English language can be far more complex than what's shown above. A fully developed text parser will need to consider a much richer array of problems (a text parser I built for this sort of project needed about 800 lines of Python code just to read over the document and check various grammatical constructs). However, it's nice to see what Python can do in this simplified example.