CAP 5771 Spring 25


This is the web page for Introduction to Data Science at the University of Florida.

Text Retrieval and Extraction Demo



In this demo, we will explore the use of SpaCy and sklearn to perform text processing tasks. We will see how to use SpaCy to perform part of speech tagging and named entity recognition. We will also see how to use sklearn to vectorize text and perform a search. Finally, we will see how to use the KNN classifier to perform a search over the data. We will use a dataset of tweets that were collected from Twitter for the purpose of identifying possible mentions of foodborne illness symptoms. The data set contains real tweets, so it includes profanity and other inappropriate content.

Downloading the Necessary Libraries

We will use the following packages for this demo. Depending on your Python version you may need to downgrade numpy to version 1.26.4 in order to avoid a conflict between spacy and pandas.

pip install --upgrade pip pandas scikit-learn spacy numpy==1.26.4

For more information on SpaCy, visit the SpaCy documentation. We are going to use pretrained models from SpaCy to do the next tasks. Let’s download one of the smaller models, trained over an English text corpus for core tasks. Let’s also use a multilingual model for Named Entity Recognition.

python -m spacy download en_core_web_sm
python -m spacy download xx_ent_wiki_sm
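
If you prefer to stay inside Python, the same models can also be downloaded programmatically; a minimal sketch using SpaCy’s CLI helpers:

import spacy
spacy.cli.download("en_core_web_sm")
spacy.cli.download("xx_ent_wiki_sm")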

We can also import the libraries we will be using.

import numpy as np
import pandas as pd
import sklearn
import spacy as sp
import spacy

If working with pandas in a notebook or IPython, it is helpful to set the display options to show all columns and rows. Even though tweets are short, the text content may be longer than the default display width.

pd.set_option('display.width', None)
pd.set_option('display.max_columns', 20)
pd.set_option('display.max_rows', 20)
pd.set_option('display.max_colwidth', 140)

Because we will be working with a lot of text data, it is also helpful to remove the maximum column width. This will let us see all the text in a tweet.

pd.set_option('display.max_colwidth', None)

Loading the Data

Let’s first grab a dataset of tweets (not exes). The tweets are in a compressed JSON Lines file. We can use the pandas library to read the file directly from the URL and decompress it on the fly. Let’s read only the first 200 lines and then inspect the data.

df = pd.read_json("https://cise.ufl.edu/~cgrant/files/tweets.json.gz", compression='gzip', lines=True, nrows=200)
df.head()
df.describe()
df.columns

The data was collected using a service called Gnip that was later acquired by Twitter. There is a column called “gnip” that contains the rules used to match the tweet. Let’s see if we can use what we have learned so far to find the important words in the tweets.

Note: Command line download and matching rules
If using the command line, you can use the following one-line command to get the gnip matching rules.

curl -s https://cise.ufl.edu/~cgrant/files/tweets.json.gz | zcat | jq '.gnip | .matching_rules[].value' | sort | uniq -c | sort -n

The following is the result.

   1 "\"abdominal pain\""
   1 "\"liquid shit\""
   1 "\"tummy pain\""
   1 "diarhea"
   1 "farted"
   1 "puked"
   2 "\"dolor de estomago\""
   2 "\"my stomach hurts\""
   2 "\"stomach pain\""
   2 "\"stomach virus\""
   2 "farting"
   2 "gastro"
   2 "gastroenteritis"
   2 "heartburn"
   2 "shat"
   3 "\"pepto bismol\""
   3 "\"the runs\""
   3 "anti-diarrheal"
   4 "constipation"
   4 "norovirus"
   5 "indigestion"
   6 "\"cant hold it in\""
   6 "\"tummy upset\""
   6 "pepto"
   7 "\"threw up\""
   7 "\"upset stomach\""
   7 "fart"
   7 "nauseous"
   7 "spew"
   7 "vomitaron"
   8 "shits"
   9 "cramps"
  10 "puking"
  13 "\"throw up\""
  13 "gas"
  15 "sickness"
  24 "puke"
  27 "vomitaste"
  44 "nausea"
  52 "\"stomach bug\""
 112 "diarrea"
 154 "sick"
 206 "vomite"
 263 "vomited"
 735 "\"food poisoning\""
 738 "vomiting"
 803 "diarrhea"
1251 "vomito"
2582 "vomitar"
4376 "vomit"


Part of Speech tagging

Now that we have access to the textual data, we can use the Spacy library to perform part of speech tagging.

We load SpaCy’s nlp object for one of the downloaded models. We can then pass text to the nlp object to get a document object. Iterating over the document object gives us access to tokens, where we can print the parts of speech.

nlp = spacy.load("en_core_web_sm")
doc = nlp("I have a stomach bug and I am sick.")
for token in doc:
    print(f"{token.text:7} --- {token.pos_:6} {spacy.explain(token.pos_)}")
Output:
I       --- PRON   pronoun
have    --- VERB   verb
a       --- DET    determiner
stomach --- NOUN   noun
bug     --- NOUN   noun
and     --- CCONJ  coordinating conjunction
I       --- PRON   pronoun
am      --- AUX    auxiliary
sick    --- ADJ    adjective
.       --- PUNCT  punctuation

Now that we know how to extract the parts of speech from a document, we can use the SpaCy library to extract the parts of speech from the tweets.

nlp = spacy.load("en_core_web_sm")
df['pos'] = df.text.apply(lambda x: [(t.text, t.pos_) for t in nlp(x)])
print(df['pos'].to_dict()[0])
[('"', 'PUNCT'),
 ('His', 'PRON'),
 ('palms', 'NOUN'),
 ('are', 'AUX'),
 ('sweaty', 'NOUN'),
 (',', 'PUNCT'),
 ('knees', 'NOUN'),
 ('weak', 'ADJ'),
 (',', 'PUNCT'),
 ('arms', 'NOUN'),
 ('are', 'AUX'),
 ('heavyThere', 'PRON'),
 ("'s", 'PART'),
 ('vomit', 'NOUN'),
 ('on', 'ADP'),
 ('his', 'PRON'),
 ('sweater', 'NOUN'),
 ('already', 'ADV'),
 (',', 'PUNCT'),
 ('mom', 'NOUN'),
 ("'s", 'PART'),
 ('spaghetti', 'NOUN'),
 ('"', 'PUNCT')]


Named Entity Recognition

Similarly, we can use SpaCy’s built-in models to do Named Entity Recognition. In SpaCy, we can iterate over the entities in a document by inspecting its ents attribute.

nlp = spacy.load("xx_ent_wiki_sm")
doc = nlp("My dad, Sean, and my Mom, Beyoncé, flew Delta to Walgreens in Gainesville for medicine.")
for token in doc.ents:
    print(f"{token.text:11} --- {token.label_:4} {spacy.explain(token.label_)}")
Output:
Sean        --- PER  Named person or family.
Beyoncé     --- PER  Named person or family.
Delta       --- ORG  Companies, agencies, institutions, etc.
Walgreens   --- LOC  Non-GPE locations, mountain ranges, bodies of water
Gainesville --- LOC  Non-GPE locations, mountain ranges, bodies of water

Let’s add the recognized named entities to the tweets.

nlp = spacy.load("en_core_web_sm")
df['ner'] = df.text.apply(lambda x: [(t.text, t.label_) for t in nlp(x).ents])
print(df['ner'].to_dict()[1])
[('@LaycNichole @tantanns', 'PER'), ('Haha', 'PER'), ('😟', 'LOC')]


Vectorizing text

The SpaCy library gives us powerful tools for working with text. Under the hood, each of the pretrained models creates a pipeline that transforms the text from words into a representation that can be used for machine learning; you can peek at that pipeline directly, as shown below. Let’s then use sklearn to explore the process of vectorization for ourselves.
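
A small sketch (the exact component names depend on the model and SpaCy version):

nlp = spacy.load("en_core_web_sm")
print(nlp.pipe_names)
# e.g. ['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']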

CountVectorizer

from sklearn.feature_extraction.text import CountVectorizer

The CountVectorizer has several parameters that control how text is turned into tokens and features. We can specify stop words to remove from the text, build features from n-grams instead of single words, and set a maximum number of features so that only the most frequent features are kept.

Let’s create a CountVectorizer object that will remove the English stop words, create bigrams, and limit the number of features to 100.

cv = CountVectorizer(stop_words='english',  
                     ngram_range=(2, 2), 
                     max_features=100)

We can then fit the vectorizer to the text data and transform the text into a matrix of token counts.

X = cv.fit_transform(df['text'].to_list())
Let's print the shape of X and the feature names.

X.shape
(200, 100)

cv.get_feature_names_out()
array(['1dsedithgee diarrhea', 'amigas quieren', 'arms heavythere',
       'asco que', 'blood vomit', 'boyfriends daddy', 'cada vez',
       'chica llama', 'cosasdemujer vomitar', 'creo que', 'cuando una',
       'cólicos menstruales', 'daddy just', 'dan ganas', 'diarrhea face',
       'drogaron detergente', 'en la', 'en vos', 'food poisoning',
       'ganas vomitar', 'gas turns', 'girls boyfriends', 'got blood',
       'gt gt', 'heavythere vomit', 'hize la', 'hospitalised got',
       'just want', 'knees weak', 'la noche', 'les hize', 'lethal weapon',
       'llama su', 'lt lt', 'makes want', 'mariposas muertas',
       'maruhagopian mis', 'merlinadice vomito', 'mis amigas',
       'mom spaghetti', 'noche paran', 'non lethal', 'novio papi',
       'palms sweaty', 'papi dan', 'paran reirse', 'pd creo',
       'pepper gas', 'pienso en', 'por cólicos', 'que asco', 'que pienso',
       'que se', 'quieren vomite', 'quiero vomitar', 'reirse pd',
       'restaurants cause', 'ride hockey', 'rodriguezvanee_ cuando',
       'rositas comeré', 'rt adamlevine', 'rt arroia',
       'rt camiladenewells', 'rt captainallant', 'rt maruhagopian',
       'rt merlinadice', 'se drogaron', 'severe coughing', 'si te',
       'si vomito', 'sick hospitalised', 'siento caida', 'sin la',
       'slenge babe', 'sobre toda', 'su existencia', 'su novio',
       'sweater mom', 'sweaty knees', 'te vomito', 'tengo ganas',
       'toda su', 'turns sick', 'una chica', 'vez que', 'vomit existence',
       'vomit severe', 'vomit sweater', 'vomitar en', 'vomitar por',
       'vomitar sobre', 'vomite les', 'vomito en', 'vomito mariposas',
       'vos vomito', 'voy vomitar', 'want vomit', 'weak arms',
       'weapon pepper', 'word vomit'], dtype=object)


TF x IDF Vectorizer

CountVectorizer uses the bag-of-words approach to vectorization. To use the TF x IDF approach, which combines term frequency with inverse document frequency to give a more informative weighting to each term, we can use the TfidfVectorizer. This downweights terms that appear in many documents, so common but uninformative terms contribute less to classification.
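
Under sklearn's defaults (smooth_idf=True), the inverse document frequency is smoothed and each row of the resulting matrix is l2-normalized. A tiny sketch of that formula with made-up numbers:

# Smoothed idf used by TfidfVectorizer by default:
# idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents
# and df(t) is the number of documents containing term t.
import numpy as np

n = 200      # number of documents (our tweet sample)
df_t = 3     # hypothetical number of documents containing a term
idf = np.log((1 + n) / (1 + df_t)) + 1
print(idf)   # rarer terms get a larger idf, so they carry more weight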

from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(stop_words='english',  
                         ngram_range=(2, 2), 
                         max_features=100)

X = tfidf.fit_transform(df['text'].to_list())  

We can then get the feature names and weights from the tfidf object by inspecting the vocabulary_ attribute for the term-to-index mapping and the idf_ attribute for the inverse document frequency weights.

sorted([(int(v), k) for (k,v) in tfidf.vocabulary_.items()])
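
We can also pair each feature name with its idf weight; a small illustrative one-liner:

sorted(zip(tfidf.get_feature_names_out(), tfidf.idf_), key=lambda kv: kv[1])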

You can look at the tfidf scores by printing the X matrix, the result of the transform.

X.toarray()
# Printing sorted vocabulary
[(0, '1dsedithgee diarrhea'),
 (1, 'amigas quieren'),
 (2, 'arms heavythere'),
 (3, 'asco que'),
 (4, 'blood vomit'),
 (5, 'boyfriends daddy'),
 (6, 'cada vez'),
 (7, 'chica llama'),
 (8, 'cosasdemujer vomitar'),
 (9, 'creo que'),
 (10, 'cuando una'),
 (11, 'cólicos menstruales'),
 (12, 'daddy just'),
 (13, 'dan ganas'),
 (14, 'diarrhea face'),
 (15, 'drogaron detergente'),
 (16, 'en la'),
 (17, 'en vos'),
 (18, 'food poisoning'),
 (19, 'ganas vomitar'),
 (20, 'gas turns'),
 (21, 'girls boyfriends'),
 (22, 'got blood'),
 (23, 'gt gt'),
 (24, 'heavythere vomit'),
 (25, 'hize la'),
 (26, 'hospitalised got'),
 (27, 'just want'),
 (28, 'knees weak'),
 (29, 'la noche'),
 (30, 'les hize'),
 (31, 'lethal weapon'),
 (32, 'llama su'),
 (33, 'lt lt'),
 (34, 'makes want'),
 (35, 'mariposas muertas'),
 (36, 'maruhagopian mis'),
 (37, 'merlinadice vomito'),
 (38, 'mis amigas'),
 (39, 'mom spaghetti'),
 (40, 'noche paran'),
 (41, 'non lethal'),
 (42, 'novio papi'),
 (43, 'palms sweaty'),
 (44, 'papi dan'),
 (45, 'paran reirse'),
 (46, 'pd creo'),
 (47, 'pepper gas'),
 (48, 'pienso en'),
 (49, 'por cólicos'),
 (50, 'que asco'),
 (51, 'que pienso'),
 (52, 'que se'),
 (53, 'quieren vomite'),
 (54, 'quiero vomitar'),
 (55, 'reirse pd'),
 (56, 'restaurants cause'),
 (57, 'ride hockey'),
 (58, 'rodriguezvanee_ cuando'),
 (59, 'rositas comeré'),
 (60, 'rt adamlevine'),
 (61, 'rt arroia'),
 (62, 'rt camiladenewells'),
 (63, 'rt captainallant'),
 (64, 'rt maruhagopian'),
 (65, 'rt merlinadice'),
 (66, 'se drogaron'),
 (67, 'severe coughing'),
 (68, 'si te'),
 (69, 'si vomito'),
 (70, 'sick hospitalised'),
 (71, 'siento caida'),
 (72, 'sin la'),
 (73, 'slenge babe'),
 (74, 'sobre toda'),
 (75, 'su existencia'),
 (76, 'su novio'),
 (77, 'sweater mom'),
 (78, 'sweaty knees'),
 (79, 'te vomito'),
 (80, 'tengo ganas'),
 (81, 'toda su'),
 (82, 'turns sick'),
 (83, 'una chica'),
 (84, 'vez que'),
 (85, 'vomit existence'),
 (86, 'vomit severe'),
 (87, 'vomit sweater'),
 (88, 'vomitar en'),
 (89, 'vomitar por'),
 (90, 'vomitar sobre'),
 (91, 'vomite les'),
 (92, 'vomito en'),
 (93, 'vomito mariposas'),
 (94, 'vos vomito'),
 (95, 'voy vomitar'),
 (96, 'want vomit'),
 (97, 'weak arms'),
 (98, 'weapon pepper'),
 (99, 'word vomit')]


Querying data

We can use the vectorized data as a document representation. If we want to perform a search, we can do a simple string lookup or a regular expression match.

query = "cough"
df[df['text'].str.contains(query, case=False, na=False)].text
Output:

{38: "<<<{(€(QNS)€)}>>>\n\n>> Non Lethal weapon ‘Pepper gas’ turns many sick, 8 hospitalised. Several got blood vomit by severe coughing..", 54: "Non Lethal weapon ‘Pepper gas’ turns many sick, 8 hospitalised. Several got blood vomit by severe coughing.\n\n#owais", 86: "======>SNA<======\nNon Lethal weapon ‘Pepper gas’ turns many sick, 8 hospitalised. Several got blood vomit by severe coughing….\n\nnps"}
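
Since str.contains treats the pattern as a regular expression by default, we can also match several word forms at once; for example, a hypothetical pattern for "vomit", "vomited", or "vomiting":

query = r"vomit(ed|ing)?"
df[df['text'].str.contains(query, case=False, na=False)].text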

We can also use the vectorized data to perform a search. Given the query, we can get the vector representation of the query using the model and then use the cosine similarity to find the most similar documents.

from sklearn.metrics.pairwise import pairwise_distances
# Other pairwise metrics are available in sklearn.metrics.pairwise as well,
# e.g. cosine_similarity, linear_kernel, euclidean_distances,
# manhattan_distances, and cosine_distances.

query = 'vomit sweater'
q = tfidf.transform([query])

a = pairwise_distances(q, X, metric="cosine")

# Use argsort to get the lowest (closest) cosine distances

df.text.iloc[np.argsort(a)[0][0:4]]

Search results
4                                        His palms are sweaty, knees weak, arms are heavy. There's vomit on his sweater already. MOM'S SPAGHETTI.
0                                       " His palms are sweaty, knees weak, arms are heavyThere's vomit on his sweater already, mom's spaghetti "
50    LOL VINCENT!!! RT“@CaptainAllanT: " His palms are sweaty, knees weak, arms are heavyThere's vomit on his sweater already, mom's spaghett...
95                                       @admmoody Well I clearly think it's better than you do. I don't like his repetitive kaleidoscopic vomit.
Name: text, dtype: object


The KNN classifier is a simple and effective way to classify data. The method works by building a data structure that makes it efficient to compute pairwise distances between the data points. After the classifier is fit, you can "query" it for the nearest neighbors of a point with the .kneighbors method.

Below is the full code to create the classifier. For convenience we will use the sklearn Pipeline to chain the vectorizer and the classifier. With the pipeline we do not have to remember to perform the vectorization step ourselves, and we can save the whole pipeline to a file and load it later instead of saving each part separately (see the joblib sketch after the code below). Also, there is no train/test split because we are searching over the whole data set.

from sklearn.neighbors import KNeighborsClassifier

from sklearn.pipeline import Pipeline

tfidf = TfidfVectorizer(stop_words='english',  
                         ngram_range=(2, 2), 
                         max_features=100)
knn = KNeighborsClassifier(n_neighbors=5, metric='cosine', algorithm="auto", n_jobs=-1)

pipe = Pipeline([
    ('tfidf', tfidf),
    ('knn', knn)
])

# Note: the labels (y) do not matter here; we only use the fitted neighbor index
pipe.fit(df['text'].to_list(), df['text'].to_list())

query = 'vomit sweater'
q = pipe['tfidf'].transform([query])

distances, indices = pipe['knn'].kneighbors(q, n_neighbors=4, return_distance=True)
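
Since we mentioned saving the pipeline to a file, here is a minimal sketch using joblib (the file name is made up; joblib ships alongside scikit-learn but is its own package):

import joblib

joblib.dump(pipe, "tweet_knn_pipeline.joblib")
pipe_loaded = joblib.load("tweet_knn_pipeline.joblib")
pipe_loaded['knn'].kneighbors(pipe_loaded['tfidf'].transform([query]), n_neighbors=4)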

You can see that querying this way is a little clunky. Instead, consider using the KNeighborsTransformer class from sklearn. Transformers implement the fit_transform and transform methods, so they fit naturally into a pipeline. The neighbors graph can be returned in "connectivity" mode, where each neighbor simply gets a weight of 1, or, by passing mode='distance', with the actual distance to each neighbor.

from sklearn.neighbors import KNeighborsTransformer

knn = KNeighborsTransformer(n_neighbors=4, metric='cosine', algorithm="auto", n_jobs=-1, mode="distance")
tfidf = TfidfVectorizer(stop_words='english',  
                         ngram_range=(2, 2), 
                         max_features=100)
pipe = Pipeline([('tfidf', tfidf), ('knn', knn)])
pipe.fit(df['text'].to_list(), df['text'].to_list())

query = 'vomit sweater'
Xt = pipe.transform([query])

Xt.todok().items()

We extract each of the non-zero items from the sparse matrix and see that the results are the same as above.


dict_items([((0, 0), 0.6220355269907727), ((0, 4), 0.6220355269907727), ((0, 50), 0.6220355269907727), ((0, 134), 1.0), ((0, 137), 1.0)])
df.text.iloc[indices[0]]
4                                        His palms are sweaty, knees weak, arms are heavy. There's vomit on his sweater already. MOM'S SPAGHETTI.
0                                       " His palms are sweaty, knees weak, arms are heavyThere's vomit on his sweater already, mom's spaghetti "
50    LOL VINCENT!!! RT“@CaptainAllanT: " His palms are sweaty, knees weak, arms are heavyThere's vomit on his sweater already, mom's spaghett...
67                                                                                                        Me quiiero vomitar... Comí demasiadoo 🐷
Name: text, dtype: object


Conclusion

In this demo, we have explored the use of Spacy and sklearn to perform text processing tasks. We have seen how to use Spacy to perform part of speech tagging and named entity recognition. We have also seen how to use sklearn to vectorize text and perform a search. Finally, we have seen how to use the KNN classifier to perform a search over the data.

There are many more features available. For example, if using the en_core_web_sm model you can extract the word shape for each token with token.shape_. Below is an example of the word shape features.

Have fun!

Word Shape example
nlp = spacy.load("en_core_web_sm")
doc = nlp("I have a stomach bug and I am sick.")
for token in doc:
    print(f"{token.text:7} --- {token.shape_:6}}")
