CAP 5771 Spring 25


This is the web page for Introduction to Data Science at the University of Florida.

Text Retrieval and Extraction Demo



This is a continuation of the previous demo. Here we will use networkx and the PageRank algorithm to improve our ranking values for search.

Fetch our data

import pandas as pd

# nrows requires line-delimited JSON, so pass lines=True as well
df = pd.read_json("https://cise.ufl.edu/~cgrant/files/tweets.json.gz",
                  compression='gzip', lines=True, nrows=200)
df.head()
df.describe()
df.columns

PageRank

As discussed in the lecture, PageRank is a link analysis algorithm that assigns a numerical weighting to each element of a hyperlinked set of pages, with the purpose of “measuring” its relative importance within the set.

The algorithm was designed by Larry Page and Sergey Brin, the founders of Google, while they were Ph.D. students at Stanford University. The algorithm is based on the idea that important pages are likely to be linked to by other important pages.

The algorithm works by assigning a score to each page based on the number and quality of links pointing to it. The score is calculated iteratively, with each page’s score being influenced by the scores of the pages that link to it. The algorithm continues to iterate until the scores converge to a stable set of values.
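As a quick illustration, here is a minimal sketch using networkx on a tiny link graph (the pages and edges are invented for this example):

import networkx as nx

# A tiny made-up link graph: an edge (u, v) means "page u links to page v"
G = nx.DiGraph()
G.add_edges_from([("A", "B"), ("A", "C"), ("B", "C"), ("C", "A"), ("D", "C")])

# Iterate until the scores converge; alpha is the usual damping factor
scores = nx.pagerank(G, alpha=0.85)
print(scores)  # "C" scores highest: it is linked to by the most pages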

A nice visualization of PageRank is available online.

We will use networkx to calculate the PageRank of our tweets. Because tweets do not contain hyperlinks, we will use the tokens in the tweets as the nodes of the PageRank graph; the edges will be the co-occurrence of tokens within the same tweet.
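The construction below is a minimal sketch of that idea, assuming df['text'] holds the tweets fetched above and using a fresh TfidfVectorizer (the previous demo's vectorizer settings may differ). It also produces X_reweighted, the TF-IDF matrix with each token column scaled by its PageRank score, which the next section uses:

import numpy as np
import networkx as nx
from itertools import combinations
from sklearn.feature_extraction.text import TfidfVectorizer

# TF-IDF features over the tweets, as in the previous demo
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(df['text'])
tokens = vectorizer.get_feature_names_out()

# Co-occurrence graph: tokens are nodes; two tokens share an edge
# whenever they appear together in the same tweet
analyzer = vectorizer.build_analyzer()
G = nx.Graph()
for text in df['text']:
    for u, v in combinations(set(analyzer(text)), 2):
        G.add_edge(u, v)

# PageRank over the token graph
pagerank_scores = nx.pagerank(G)

# Scale each token's TF-IDF column by its PageRank score
weights = np.array([pagerank_scores.get(t, 0.0) for t in tokens])
X_reweighted = X.multiply(weights)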

Getting the top documents

We can check out the top documents with the following code:

import numpy as np
import pandas as pd

# Step 1: Compute the total PageRank-weighted score for each document.
# X_reweighted is the reweighted TF-IDF matrix built above.
doc_scores = X_reweighted.sum(axis=1)  # returns a matrix, not a flat array

# Step 2: Flatten to a 1D array
doc_scores = np.array(doc_scores).flatten()

# Step 3: Get indices sorted by score, descending
top_doc_indices = np.argsort(doc_scores)[::-1]

# Step 4: Show the top documents
top_n = 10
print(df['text'].iloc[top_doc_indices[:top_n]])

# Step 5: Show the top documents alongside their scores
top_docs = df['text'].iloc[top_doc_indices[:top_n]].reset_index(drop=True)
df_docs = pd.DataFrame({'text': top_docs,
                        'score': doc_scores[top_doc_indices[:top_n]]})
print(df_docs)

This approach is called TextRank, a graph-based ranking algorithm for natural language processing tasks such as keyword extraction and text summarization. Here we ranked words; if we used sentences as the nodes instead, we could use it for text summarization. The algorithm constructs a graph where nodes represent words or sentences and edges represent relationships between them; the importance of each node is determined by its connections to other nodes, just as PageRank does for web pages. To summarize a document (a sketch follows the steps below):

  1. Tokenize and extract sentences from the document to be summarized.
  2. Decide on the number of sentences k that we want in the final summary.
  3. Build a document-term feature matrix using weights such as TF-IDF or Bag of Words.
  4. Compute a document similarity matrix by multiplying the matrix by its transpose.
  5. Use these documents (sentences in our case) as the vertices and the pairwise similarities as the edge weights, and feed the graph to the PageRank algorithm.
  6. Get the score for each sentence.
  7. Rank the sentences by score and return the top k.
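
A compact sketch of those steps, assuming the document has already been split into a list of sentences (textrank_summary is a hypothetical helper name, not from the demo):

import numpy as np
import networkx as nx
from sklearn.feature_extraction.text import TfidfVectorizer

def textrank_summary(sentences, k=3):
    # Steps 3-4: TF-IDF features, then similarity = X times X-transpose
    X = TfidfVectorizer().fit_transform(sentences)
    sim = (X @ X.T).toarray()
    np.fill_diagonal(sim, 0.0)  # drop self-similarity loops

    # Step 5: sentences are vertices, similarities are edge weights
    G = nx.from_numpy_array(sim)

    # Step 6: PageRank score for each sentence
    scores = nx.pagerank(G)

    # Step 7: take the top-k sentences, returned in document order
    top = sorted(scores, key=scores.get, reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]

Splitting the document into sentences can be done with any sentence tokenizer, e.g. nltk's sent_tokenize.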
