CIS 6930 Spring 24


This is the web page for Data Engineering at the University of Florida.

View the Project on GitHub ufdatastudio/cis6930fa24

Project 2

The Unredactor

Introduction

Whenever sensitive information is shared with the public, the data must go through a redaction process. That is, all sensitive names, places, and other sensitive information must be hidden. Documents such as police reports, court transcripts, and hospital records all contain sensitive information. Redacting this information is often expensive and time consuming.

For project 2, you will be creating an Unredactor. The unredactor will take redacted documents and return the most likely candidates to fill in each redacted location. The unredactor only needs to unredact people's names. However, if there is another category that you would prefer to unredact, please discuss this with the instructor/TA.

As you can guess, discovering names is not easy. To discover the best names, you may use NER approaches, such as training a model to help predict missing words. For this assignment, you are expected to use the Large Movie Review Dataset and a custom subset. Please take a look at the information about the dataset and download it using this link. The tar.gz file is also available at /blue/cis6930/share/aclImdb_v1.tar.gz. This is a data set of movie reviews from IMDB. The original goal of the data set is to discover the sentiment of each review; for this project, we will only use the reviews for their textual content.

Unredactor Training Set

I have provided a data set of the redacted names from the corpus unredactor.tsv. The first column specifies whether the file is in training, testing, or validation. The second column contains the name of the entity that was redacted. The final column is the redaction context. Each of these examples comes from somewhere in the review dataset. Below is a table showing the dataset format.

| split | name | context |
| --- | --- | --- |
| validation | Dan Duryea | Lots of obvious symbolism about achieving manhood but mainly it’s the acting by Stewart, his partner Millard Mitchell, Shelly Winters and the Waco Johnny Dean- ██████████. |
| training | Mari Honjo | He appears in the best scene in this positively dreadful and near unwatchable crime drama about a Dragon Lady (██████████, who wisely hung up her acting spurs after completing this film) who controls the local syndicate. |
| validation | Lawrence Fishburne | ██████████████████ does an over-the-top performance as the sagacious Profesor Phipps. |

Use the training examples for your model and the validation examples to evaluate its performance. We are withholding a list of testing examples that we will release in the coming days.
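To get started with the file, the three-column TSV described above can be loaded with pandas and split by the first column. This is a minimal sketch, assuming the file has no header row and that contexts may contain stray quote characters or tabs; `load_unredactor` is a hypothetical helper name.

```python
import pandas as pd

def load_unredactor(path):
    """Load the tab-separated unredactor file (columns: split, name, context)."""
    return pd.read_csv(
        path,
        sep="\t",
        header=None,
        names=["split", "name", "context"],
        quoting=3,            # csv.QUOTE_NONE: contexts may contain quote characters
        on_bad_lines="skip",  # tolerate the occasional row with a stray tab
    )

# Example usage (assuming unredactor.tsv is in the working directory):
# df = load_unredactor("unredactor.tsv")
# train = df[df["split"] == "training"]
# val = df[df["split"] == "validation"]
```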

The Redaction Context Column

The redaction context contains a single run of block characters (█) representing the redacted text. The length of the redaction block equals the number of characters in the redacted name; no literal spaces appear when the redaction spans multiple words. The redaction context window is reasonable in size and at most 1024 characters long. You could use your previous redactor.py code to generate additional sample redactions from the original data set. You may also search for the redaction context in the original reviews to create additional custom features. The redacted examples are drawn from both the test and train portions of the IMDB dataset and may appear in any of the training, validation, or testing splits.
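The properties above (one contiguous block, block length equal to name length) can be turned into simple features with a regular expression. A minimal sketch, assuming the full-block character U+2588 is used; `redaction_features` is a hypothetical helper name.

```python
import re

BLOCK = "\u2588"  # full-block character used for redactions

def redaction_features(context):
    """Return simple features describing the single redaction block in a context."""
    match = re.search(f"{BLOCK}+", context)
    if match is None:
        return None
    before = context[: match.start()].split()
    after = context[match.end():].split()
    return {
        "length": match.end() - match.start(),  # equals characters in the name
        "prev_word": before[-1] if before else "",
        "next_word": after[0] if after else "",
    }
```

Feature dictionaries like this one plug directly into scikit-learn's DictVectorizer mentioned later in this page.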

The Task

The task for this project is to train a model that learns from the training set and is evaluated on the validation set. You need to design your README.md to describe your code (as usual) and include instructions about your pipeline that are clear enough to be replicated. The key to this task is to (1) make it easy for a peer to use your code to execute the model on the validation set and (2) generate a precision, recall, and F1-score of the code for the dataset. Note that to do this, you will have to write code that understands where the example redaction is in the training set, creates features, and runs the model.
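Since each prediction is a name, the precision, recall, and F1-score above can be computed by treating names as class labels. A minimal sketch using scikit-learn's `precision_recall_fscore_support`; `evaluate` is a hypothetical helper name, and weighted averaging is one reasonable choice among several.

```python
from sklearn.metrics import precision_recall_fscore_support

def evaluate(y_true, y_pred):
    """Weighted precision/recall/F1 over predicted names."""
    p, r, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="weighted", zero_division=0
    )
    return {"precision": p, "recall": r, "f1": f1}
```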

**Due to the uncertainty about Gradescope submission capabilities, more information will be provided in the coming days.**

Helpful code

Below is a sample code snippet that uses the default NLTK model to extract and print names from the movie reviews. (Errors may exist.)

#!/usr/bin/python3
# -*- coding: utf-8 -*-
# grabthenames.py
# grabs the names from the movie review data set

import glob
import io
import sys

import nltk
from nltk import sent_tokenize
from nltk import word_tokenize
from nltk import pos_tag
from nltk import ne_chunk

# Run once to fetch the required NLTK models:
# nltk.download('punkt')
# nltk.download('averaged_perceptron_tagger')
# nltk.download('maxent_ne_chunker')
# nltk.download('words')


def get_entity(text):
    """Prints the entity inside of the text."""
    for sent in sent_tokenize(text):
        for chunk in ne_chunk(pos_tag(word_tokenize(sent))):
            if hasattr(chunk, 'label') and chunk.label() == 'PERSON':
                print(chunk.label(), ' '.join(c[0] for c in chunk.leaves()))


def doextraction(glob_text):
    """Get all the files from the given glob and pass them to the extractor."""
    for thefile in glob.glob(glob_text):
        with io.open(thefile, 'r', encoding='utf-8') as fyl:
            text = fyl.read()
            get_entity(text)


if __name__ == '__main__':
    # Usage: python3 grabthenames.py 'train/pos/*.txt'
    doextraction(sys.argv[-1])

Note that in the code above, we restrict the entity type to PERSON. Given a new document, such as the one below, that contains at least one redacted name, create Python code to help you predict the most likely unredacted name. (If you are curious, the redacted name is Ashton Kutcher.)

'''This movie was sadly under-promoted but proved to be truly exceptional.
Entering the theatre I knew nothing about the film except that a friend wanted to see it.

I was caught off guard with the high quality of the film.
I couldn't image ██████████████ in a serious role, but his performance truly 
exemplified his character.
This movie is exceptional and deserves our monetary support, unlike so many other movies.
It does not come lightly for me to recommend any movie, 
but in this case I highly recommend that everyone see it.

This films is Truly Exceptional!'''
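One simple baseline for a document like this is to collect candidate PERSON names (for example, with the NLTK extractor above) and keep only those whose character count matches the redaction block, since the block length equals the name length. A minimal sketch; `rank_candidates` is a hypothetical helper name, and a real model would score the surviving candidates rather than return them unranked.

```python
import re

def rank_candidates(context, candidates):
    """Keep candidate names whose length matches the redaction block."""
    match = re.search("\u2588+", context)
    if match is None:
        return []
    block_len = match.end() - match.start()
    return [name for name in candidates if len(name) == block_len]
```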

About 90K files are available in the IMDB data set, divided into test and train folders. One of the most critical aspects of this project is creating the appropriate set of features. Features may include n-grams in the document, the number of letters in the redacted word, the number of spaces in the redacted word, the sentiment score of the review, the previous word, and the next word. Also, keep in mind that we do not have infinite resources to train and evaluate the model.

We are looking for well-thought-out and reasoned approaches; use the tests and your README file to help us evaluate your code. There are several techniques for creating the unredactor. Many students will use spaCy for this assignment, but you may use other Python libraries with approval from the instructor. scikit-learn has a DictVectorizer that may be very useful. The Google Knowledge Graph Search API may also prove useful; with it, you can find a list of important entities (Entity Search). Given a set of candidate matches, you can use the information in the Google Knowledge Graph to produce a ranked set of people.
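The DictVectorizer route can be sketched as follows: feature dictionaries go in, a classifier over names comes out. The feature dicts and names below are made-up placeholders standing in for whatever your own feature-extraction code produces, and LogisticRegression is just one reasonable classifier choice.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny hypothetical feature dicts, one per redaction context; real ones
# would come from your own feature-extraction code over unredactor.tsv.
X_train = [
    {"length": 10, "prev_word": "the", "next_word": "stars"},
    {"length": 9, "prev_word": "by", "next_word": "in"},
]
y_train = ["Dan Duryea", "Tom Hanks"]

# DictVectorizer one-hot encodes string features and passes numbers through.
model = make_pipeline(DictVectorizer(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)
pred = model.predict([{"length": 10, "prev_word": "the", "next_word": "stars"}])
```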

In this project, we want you to create a function that executes the unredaction process. It is your task to create a reproducible method of training, testing, and validating your code. Use the README to explain all assumptions. Be sure to give examples of usage. Give clear directions. Add tests for each part of your code so we can better evaluate it.

Submission

For this project, supply one README document, a code package, and a less-than-5-minute demonstration walkthrough of your code. We have given you several tools in class; you may choose how you would like to develop and present your approach. All code and a link to the video should be posted in a private GitHub repository named cis6930fa24-project2; cegme and WillCogginsUFL should be added as collaborators.

Peer evaluation assignments and evaluation rubrics will be given at a later date.

When ready to submit, create a tag on your repository using git tag on the latest commit:

git tag v1.0
git push origin v1.0

The version v1.0 lets us know when and what version of code you would like us to grade. If you need to submit an updated version, you can use the tag v1.1.

We will also ask you to submit all code files on Gradescope.

In summary, you should submit:

- a README.md that describes your approach, assumptions, and usage examples
- your code and tests in the private cis6930fa24-project2 repository, tagged v1.0
- a link to your less-than-5-minute demonstration video
- your code files on Gradescope

Grading

You will receive follow-up instructions with rubrics in the following week.

|  | Percentage |
| --- | --- |
| Correctness | 40% |
| README and Evaluation | 60% |
|  | 100% |

New: Submission Instructions

On the project due date, I will release a test.tsv that holds examples. The test.tsv file will have two columns: id and context. It is your job to add the file submission.tsv to the top-level directory of your GitHub repository. The file should have two columns: id and name. You should run your model on test.tsv to produce submission.tsv.
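Producing the submission file can be sketched as below. This assumes test.tsv has no header row (the page does not say either way) and that `predict` is a placeholder for your own context-to-name function; `make_submission` is a hypothetical helper name.

```python
import pandas as pd

def make_submission(test_path, out_path, predict):
    """Read test.tsv (id, context) and write submission.tsv (id, name).

    `predict` is a callable mapping a context string to a predicted name.
    Assumes no header row in test.tsv.
    """
    test = pd.read_csv(
        test_path, sep="\t", header=None, names=["id", "context"], quoting=3
    )
    test["name"] = test["context"].map(predict)
    test[["id", "name"]].to_csv(out_path, sep="\t", header=False, index=False)

# Example usage:
# make_submission("test.tsv", "submission.tsv", my_model_predict)
```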

**The test file is available**

| file name | link |
| --- | --- |
| test.tsv | test.tsv |

Extra links and Notes

To use some APIs, a larger instance may be needed. If you decide to use a larger instance, please let us know, and be sure to note this in your README. Ease of use is an important factor in designing this system.