CIS 6930, Spring 2024 Assignment 1

The Censoror

Introduction

Whenever sensitive information is shared with the public, the data must go through a redaction process. That is, all names, places, and other sensitive information must be hidden. Documents such as police reports, court transcripts, and hospital records all contain sensitive information. Censoring this information is often expensive and time-consuming.

Task Overview

In this assignment, you will use your knowledge of data pipelines to design a system that accepts plain text documents then detects and censors “sensitive” items. Below is an example execution of the program. If your program cannot be run using this command you may lose points.

pipenv run python censoror.py --input '*.txt' \
                    --names --dates --phones --address \
                    --output 'files/' \
                    --stats stderr

Running the program with these command-line arguments should read all files matched by the glob, in this case all the files ending in .txt in the current folder. All of these files will run through your program’s censoring process. The program will look to censor all names, dates, phone numbers, and addresses.

Each input file should be censored and written as a new file of the same name with the .censored extension appended, in the folder given by the --output flag. The final parameter, --stats, gives the file or location to which the statistics of the censored files are written. Below we discuss each of the parameters in additional detail.
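A minimal sketch of how the command line above could be parsed with the standard argparse module. The function name build_parser and the default value for --stats are assumptions made for illustration; the assignment does not prescribe either.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for the flags described in this assignment."""
    parser = argparse.ArgumentParser(prog="censoror.py")
    # action="append" collects every --input occurrence into a list.
    parser.add_argument("--input", action="append", required=True,
                        help="glob of files to censor; may be repeated")
    parser.add_argument("--output", required=True,
                        help="directory for the .censored files")
    # Assumption: default to stderr when --stats is not given.
    parser.add_argument("--stats", default="stderr",
                        help="'stderr', 'stdout', or a file name")
    # Each censor flag is a simple on/off switch.
    for flag in ("names", "dates", "phones", "address"):
        parser.add_argument(f"--{flag}", action="store_true",
                            help=f"censor {flag}")
    return parser


# Parse an explicit argument list that mirrors the example command.
args = build_parser().parse_args(
    ["--input", "*.txt", "--names", "--dates", "--phones", "--address",
     "--output", "files/", "--stats", "stderr"])
```

Quoting the glob on the command line, as in '*.txt', keeps the shell from expanding it, so the program receives the pattern itself.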

Parameters

--input

This parameter takes a glob that represents the files that can be accepted. More than one --input flag may be used to specify groups of files. If a file cannot be read or censored, an appropriate error message should be displayed to the user.
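One possible way to expand several globs and report unreadable files, sketched with the standard glob module. The helper names collect_files and read_text, the UTF-8 encoding, and the wording of the error message are assumptions, not requirements.

```python
import glob
import sys


def collect_files(patterns):
    """Expand each glob pattern and return the matching paths without duplicates."""
    paths = []
    for pattern in patterns:
        for path in sorted(glob.glob(pattern)):
            if path not in paths:
                paths.append(path)
    return paths


def read_text(path):
    """Return the file's text, or None after printing an error if it is unreadable."""
    try:
        # Assumption: input files are UTF-8 encoded text.
        with open(path, encoding="utf-8") as handle:
            return handle.read()
    except (OSError, UnicodeDecodeError) as error:
        print(f"censoror: cannot read {path}: {error}", file=sys.stderr)
        return None
```

Returning None instead of raising lets the caller skip a bad file and continue with the rest of the batch.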

--output

This flag should specify a directory to store all the censored files. The censored files, regardless of their input type, should be written as text files. Each file should have the same name as the original file with the extension .censored appended to the file name.
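The naming rule can be sketched with pathlib as follows. The helper names censored_path and write_censored are hypothetical, and creating the output directory when it is missing is an assumption rather than a stated requirement.

```python
from pathlib import Path


def censored_path(input_path, output_dir):
    """Return the output path: the same file name with '.censored' appended."""
    return Path(output_dir) / (Path(input_path).name + ".censored")


def write_censored(input_path, output_dir, censored_text):
    """Write the censored text and return the path that was written."""
    target = censored_path(input_path, output_dir)
    # Assumption: create the output directory if it does not exist yet.
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(censored_text, encoding="utf-8")
    return target
```

Appending to the full file name, rather than swapping the suffix, is what turns sample.txt into sample.txt.censored.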

Censor flags

The censor flags list the entity types that should be extracted from all the input documents. The flags you are required to implement are:

--names corresponds to any type of name; it is up to you to define this.
--dates corresponds to any written date (4/9/2025, April 9th, 22/2/22, etc.).
--phones describes any phone number in its various forms.
--address corresponds to any physical (postal) address (not e-mail address).

In your README discussion file, clearly describe the parameters you apply to each of the flags. You are free to add your own flags! The censored characters in the document should be replaced with a character of your choice. One popular choice is the Unicode full block character █ (U+2588). You may choose to censor both words and the whitespace between phrases. If you believe that one should also censor the whitespace between words (e.g., between a first and last name), please discuss why in your README.md.
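As an illustration of the replacement step, here is a minimal sketch that blocks out character spans and optionally leaves whitespace visible. The phone pattern is a deliberately simple assumption covering a few North American formats, and the names redact_spans and find_phones are hypothetical.

```python
import re

BLOCK = "\u2588"  # Unicode full block character

# Illustrative assumption: matches forms such as (352) 555-0100 and
# 352-555-0199. It is not a complete definition of a phone number.
PHONE_PATTERN = re.compile(r"(?:\(\d{3}\)\s?|\b\d{3}[-.\s])\d{3}[-.\s]\d{4}\b")


def find_phones(text):
    """Return (start, end) spans of phone-like strings."""
    return [match.span() for match in PHONE_PATTERN.finditer(text)]


def redact_spans(text, spans, keep_whitespace=True):
    """Replace the characters inside each (start, end) span with BLOCK."""
    chars = list(text)
    for start, end in spans:
        for index in range(start, end):
            # Optionally leave whitespace inside a span uncensored.
            if keep_whitespace and chars[index].isspace():
                continue
            chars[index] = BLOCK
    return "".join(chars)


sample = "Call (352) 555-0100 or 352-555-0199 today."
censored = redact_spans(sample, find_phones(sample))
```

Replacing character for character keeps the censored text the same length as the original, so any start and end indexes recorded for the stats stay valid.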

--stats

The --stats flag takes either the name of a file or a special file (stderr, stdout) and writes a summary of the censoring process to it. **You need to support all three of the cases.** Some statistics to include are the types and counts of censored terms and the statistics of each censored file. Be sure to describe the format of your output file in your README file. Stats should help you while developing your code. Stats may also include the beginning and end index of each censored item. It is up to you to design your stats output.
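A minimal sketch of handling the three destinations. The report layout and the names format_stats and write_stats are assumptions; the design of the stats output is left to you.

```python
import sys
from collections import Counter


def format_stats(per_file_counts):
    """Render {file name: Counter of entity type -> count} as text lines."""
    lines = []
    total = Counter()
    for name, counts in sorted(per_file_counts.items()):
        total.update(counts)
        detail = ", ".join(f"{kind}={counts[kind]}" for kind in sorted(counts))
        lines.append(f"{name}: {detail}")
    summary = ", ".join(f"{kind}={total[kind]}" for kind in sorted(total))
    lines.append(f"TOTAL: {summary}")
    return "\n".join(lines) + "\n"


def write_stats(report, destination):
    """Write the report to stderr, stdout, or the named file."""
    if destination == "stderr":
        sys.stderr.write(report)
    elif destination == "stdout":
        sys.stdout.write(report)
    else:
        # Any other value is treated as a file name.
        with open(destination, "w", encoding="utf-8") as handle:
            handle.write(report)
```

Separating formatting from writing makes the report easy to check in a test without capturing the standard streams.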

Dataset

For example documents, we ask you to use the Enron Email Dataset. The dataset is 1.7 GB (compressed) and contains 500,000 email messages. Scour the dataset for example text files that you can censor. The full dataset is too large for your GitHub repository, so you can use a portion of it that you find useful. Use the following commands in your /tmp directory.

wget https://www.cs.cmu.edu/~enron/enron_mail_20150507.tar.gz
tar xvzf enron_mail_20150507.tar.gz

You can then use ls, or a handy tool called ranger, to examine all of the files.

Submission

README.md

The README file name should be uppercase with an .md extension. It should contain your name, directions on how to install and use the code, an example of how to run it, and a list of any web or external resources or people you used for help. The README file should also list any known bugs and any assumptions made while writing the program. Note that you should not copy code from any website not provided by the instructor.

COLLABORATORS file

This file should contain a comma-separated list describing who you worked with and a short text description of the nature of the collaboration. This information should be listed in three fields, as in the example below:

Katherine Johnson, kj@nasa.gov, Helped me understand calculations
Dorothy Vaughan, doro@dod.gov, Helped me with multiplexed time management

Assignment Descriptions

Your code structure should be in a directory with something similar to the following format:

cis6930sp24-assignment1/
├── COLLABORATORS
├── LICENSE
├── README.md
├── Pipfile
├── assignment1
│   ├── main.py
│   └── ... 
├── docs/
├── censoror.py
├── setup.cfg
├── setup.py
└── tests/
    ├── test_names.py
    ├── test_phones.py
    ├── test_address.py
    └── ... 

setup.py

from setuptools import setup, find_packages

setup(
    name='assignment1',
    version='1.0',
    author='Your Name',
    author_email='your ufl email',
    packages=find_packages(exclude=('tests', 'docs')),
    setup_requires=['pytest-runner'],
    tests_require=['pytest']
)

Note, the setup.cfg file should have at least the following text inside:

[aliases]
test=pytest

[tool:pytest]
norecursedirs = .*, CVS, _darcs, {arch}, *.egg, venv

Tests

The general rule is that you should aim to have a test for each feature. Including tests helps people understand how your code works, in addition to verifying assumptions during development. Tests should be runnable by using pipenv run python -m pytest. If your tests cannot be run with the command above, you may lose points. You should discuss your tests in the README.
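As a hedged illustration, a test file for one feature might look like the sketch below. The function censor_dates and its numeric date pattern are hypothetical. In a real project the function would live in the package (for example assignment1/main.py) and be imported by the test; it is defined inline here only so the example is self-contained.

```python
import re

# Hypothetical feature under test: numeric dates such as 4/9/2025 or 22/2/22.
DATE_PATTERN = re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b")


def censor_dates(text, block="\u2588"):
    """Replace each numeric date with block characters of the same length."""
    return DATE_PATTERN.sub(lambda match: block * len(match.group()), text)


def test_numeric_date_is_censored():
    censored = censor_dates("Due 4/9/2025 at noon")
    assert "4/9/2025" not in censored
    assert censored == "Due " + "\u2588" * 8 + " at noon"


def test_text_without_dates_is_unchanged():
    assert censor_dates("No dates here") == "No dates here"
```

pytest collects functions whose names start with test_ from files named test_*.py, so saving this under tests/ lets the command above find it.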

Extra links and Notes

It is expected that you will use spaCy or scikit-learn to complete this assignment. However, you are welcome to use other popular APIs. Some of these APIs may require a larger instance; please let us know if you plan to use one. Also, some APIs require specialized keys; please let the TA know how you plan to use your keys. Below is the information for using Google tools.

Creating API Keys Docs
spaCy
Google NLP
Using Google Natural Language Client GCP
NLTK
Huggingface 🤗

Create a repository for your assignment on GitHub

Create a private repository on GitHub called cis6930sp24-assignment1. Note that if the repository is public, this may be considered academic dishonesty. We will also compare code contents to investigate cases of academic dishonesty.

Add collaborators cegme, wyfunique by going to Settings > Collaborators.

We will be testing your code on the VM instances described in class and through the autograder. Please ensure your code runs correctly on the instances. If any extra information is necessary such as an extra large instance size you must specify that in your instances.

You should regularly git add <file>, git commit -m, and git push origin main your code changes to GitHub.

Submitting your code

When ready to submit, create a tag on your repository using git tag on the latest commit:

git tag v1.0
git push origin v1.0

Version v1.0 lets us know when and what version of code you would like us to grade. If you need to submit an updated version, you can use the tag v1.1. If you would like to submit a second version, use the tag v2.0.

If you need to update a tag, view the commands in the following StackOverflow post.

Grading

Grades will be assessed according to the following distribution:

60%: Correctness.
- This will be assessed by giving your code a range of inputs and checking the output.
- Use the creation of tests to prove correctness.
40%: Documentation.
- Your README file should fully explain your process for developing your code.
- All other commands should be well-documented.
5% (extra credit): If you use Snorkel labeling functionality for each category.

Note: we will be running your code in batch, so it is important that you follow directions closely.

Make sure you

When submitting your assignment please quadruple check that you have done the following items. These will ensure you do not mistakenly lose points.

Submit your code to GradeScope and add cegme and wyfunique as collaborators on GitHub
Be sure to have a working Pipfile to allow us to run your code
Be sure to use Python 3.11
Ensure your code can be run with pipenv run python censoror.py
Ensure your pytest can be run with the command pipenv run python -m pytest from the main directory
If using external models (e.g., pretrained models, NLTK trained models), make sure your code or Pipfile configuration file will download them
If you are using an external API where a Key is required, please email us to discuss it.
For output files, you only need to add .censored. (e.g., sample.txt -> sample.txt.censored; 1. -> 1..censored)
Here is a tip on how to make sure your model is installed with Pipfile https://github.com/explosion/spaCy/issues/1099#issuecomment-357615551. For example:

[packages]
spacy = "*"
pytest = "*"
en_core_web_md = {file = "https://github.com/explosion/spacy-models/releases/download/en_core_web_md-3.2.0/en_core_web_md-3.2.0-py3-none-any.whl"}

In the code you could use the model as follows:

import spacy
import en_core_web_md
# ...

nlp = en_core_web_md.load()
# ...

To silence warnings from updated versions of spaCy, use the following code:

import en_core_web_md
from warnings import filterwarnings
filterwarnings('ignore')

pipeline = en_core_web_md.load()

Addendum
