Project 2 - Redacting statements

CIS 6930 Spring 2025

After your time as an MIB analyst collecting data, you have been reassigned to the Witness Protection and Processing team. Sensitive information about witnesses must be redacted from all statements and documents. In this project, you will be given a series of documents and parameters. It will be your job to build a Pythonic system to redact the sensitive information from the documents.

Project Overview

In this project, you will build a system that redacts sensitive information from a set of documents. The documents will be in PDF format. You will use a Python package to identify named entities in the text and create a redacted version of each document. You will also identify the coreferent mentions of those entities in the text and redact those as well.

Below is an example execution of the system.

uv run python main.py --input "data/*.pdf" --output myoutput/ --names Bill --names Carter --entities --coref

Here is the output when the command-line help option is run.

uv run python main.py -h
usage: main.py [-h] [--input INPUT [INPUT ...]] [--output OUTPUT] [--names NAMES [NAMES ...]]
                  [--entities] [--coref] [--stats STATS]

optional arguments:
  -h, --help            show this help message and exit
  --input INPUT [INPUT ...]
                        input files globs
  --output OUTPUT       output directory for pdf files
  --names NAMES [NAMES ...]
                        Takes one or more case sensitive tokens as input
  --entities            Get all `named` entities
  --coref               Redact all coreferences
  --stats STATS         Specify the location of the stats file
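A parser that produces help text close to the listing above could be sketched with argparse as follows. This is one possible setup, not the required implementation: the names build_parser and expand_inputs, the use of action="extend" (Python 3.8+) so that a flag can be repeated, and the literal string "stderr" as the --stats default are all assumptions made for illustration.

```python
import argparse
import glob


def build_parser() -> argparse.ArgumentParser:
    """Build a command-line parser mirroring the flags in the help text."""
    parser = argparse.ArgumentParser(prog="main.py")
    # action="extend" lets a flag be repeated and also accept several values
    parser.add_argument("--input", nargs="+", action="extend", default=[],
                        help="input files globs")
    parser.add_argument("--output", help="output directory for pdf files")
    parser.add_argument("--names", nargs="+", action="extend", default=[],
                        help="Takes one or more case sensitive tokens as input")
    parser.add_argument("--entities", action="store_true",
                        help="Get all `named` entities")
    parser.add_argument("--coref", action="store_true",
                        help="Redact all coreferences")
    # Assumption: the literal string "stderr" stands for standard error
    parser.add_argument("--stats", default="stderr",
                        help="Specify the location of the stats file")
    return parser


def expand_inputs(patterns: list[str]) -> list[str]:
    """Expand glob patterns into a sorted list of unique file paths."""
    matches = set()
    for pattern in patterns:
        matches.update(glob.glob(pattern))
    return sorted(matches)
```

Quoting the pattern on the command line (for example "data/*.pdf") keeps the shell from expanding it, so the program receives the pattern itself and can expand it with expand_inputs.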

Commandline Parameters

Below we discuss the expected command line parameters and their purpose.

--input - This is a glob pattern that specifies the local input files to be redacted. This parameter should support both glob patterns (e.g., --input "saturn*crafts.pdf") and multiple inputs (e.g., --input "saturn80crafts.pdf" --input "saturn90crafts.pdf").
--output - This is the directory where the redacted files will be saved. Each redacted file will be saved with the same name as its input file. There will be no duplicate file names.
--names - This is a list of names that should be redacted from the text. Matching should be case sensitive. This flag can be used multiple times to redact multiple names. Names will only contain ASCII characters, and multi-word names will be quoted. For example, --names "Rick Scott" --names "Adam Silver".
--entities - This flag will redact named entities from the text. At a minimum, all PER (person) entities must be redacted. (You may optionally redact other entity types.)
--coref - This flag will redact all coreferences from the text.
--stats - This is the location of the stats file that will be generated. The default location is standard error.
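As a rough illustration of how --names and --entities might be handled, the sketch below collects character spans to redact. The Redaction tuple, both function names, and the "Name" and "Entity" type strings are hypothetical. The entity function assumes a spaCy Doc as input; note that spaCy's English pipelines label people as PERSON rather than PER, so both labels are accepted here.

```python
from typing import Iterable, NamedTuple


class Redaction(NamedTuple):
    start: int  # character offset where the span begins
    end: int    # character offset just past the span
    text: str   # the text being redacted
    kind: str   # hypothetical type label, e.g. "Name" or "Entity"


# spaCy's English models use "PERSON"; some other models use "PER"
PERSON_LABELS = {"PERSON", "PER"}


def name_redactions(text: str, names: Iterable[str]) -> list[Redaction]:
    """Find every case-sensitive occurrence of each name in the text."""
    found = []
    for name in names:
        start = text.find(name)
        while start != -1:
            found.append(Redaction(start, start + len(name), name, "Name"))
            start = text.find(name, start + len(name))
    return sorted(found)


def entity_redactions(doc) -> list[Redaction]:
    """Collect person entities from a spaCy Doc (assumed input type)."""
    return [
        Redaction(ent.start_char, ent.end_char, ent.text, "Entity")
        for ent in doc.ents
        if ent.label_ in PERSON_LABELS
    ]
```

With spaCy and a model installed, doc would typically come from spacy.load("en_core_web_sm") applied to the page text.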

Format for `--stats`

Takes a file path as an argument and writes the metadata and log of the program run to it. The output should be a <tab>-separated file with the following columns and content. Note that the stats output should not contain a header row. The table below lists each column name, in order, the description of the content in the column, and an example of the content.

Column Name	Description	Example
File	The name of the file	saturn80crafts.pdf
Location	The location of the redacted token in the file	10 or 25x39
Token	The token redacted	Harold Pryor
Length	The length of the token redacted	12
Type	The type of token redacted	Name
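One way to produce rows in this format is sketched below. The function names and the convention that the string "stderr" selects standard error are assumptions for illustration; only the column order and the tab separator come from the table above.

```python
import sys


def stats_row(file_name: str, location: str, token: str, token_type: str) -> str:
    """Format one redaction as a tab-separated line (no header row)."""
    # Length is derived from the token, so it cannot drift out of sync
    return "\t".join([file_name, location, token, str(len(token)), token_type])


def write_stats(rows: list[str], destination: str = "stderr") -> None:
    """Write rows to standard error by default, otherwise to the given file."""
    text = "".join(row + "\n" for row in rows)
    if destination == "stderr":
        sys.stderr.write(text)
    else:
        with open(destination, "w", encoding="utf-8") as handle:
            handle.write(text)
```

For the example values in the table, stats_row("saturn80crafts.pdf", "10", "Harold Pryor", "Name") yields the five fields File, Location, Token, Length (12), and Type joined by tabs.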

Where are the documents to test?

The goal is to use the dataset of documents from MIB Witness Protection and Processing. These are documents related to unidentified anomalous phenomena (UAP) that have been declassified. For now, you can use any PDF to evaluate your system.

Here is an example input file and output file. Develop a solution that you are sure works for your set of documents.

Reading and editing PDFs

To read and edit PDFs, you can use the PyMuPDF package. The package is installed under the name pymupdf and can be imported as pymupdf or, under its legacy name, fitz. Explore the documentation to learn how to use the library to read and edit PDF files. You can use the package to identify tokens or names and replace them in the PDF text.
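A minimal sketch of redacting strings with PyMuPDF is shown below; it is one possible approach under stated assumptions, not the required design. The function names are hypothetical. The calls search_for, add_redact_annot, and apply_redactions are PyMuPDF Page methods; import pymupdf needs a recent release (older ones use import fitz). Note that search_for ignores case, so case-sensitive --names handling would need an extra check on the matched text.

```python
from pathlib import Path


def output_path_for(input_file: str, output_dir: str) -> Path:
    """Redacted files keep the input file's name inside the output directory."""
    return Path(output_dir) / Path(input_file).name


def redact_pdf(input_file: str, output_dir: str, targets: list[str]) -> int:
    """Black out every occurrence of each target string; return the hit count."""
    import pymupdf  # imported lazily so output_path_for works without PyMuPDF

    out_path = output_path_for(input_file, output_dir)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    hits = 0
    with pymupdf.open(input_file) as doc:
        for page in doc:
            for target in targets:
                # search_for returns one rectangle per match on the page
                for rect in page.search_for(target):
                    page.add_redact_annot(rect, fill=(0, 0, 0))
                    hits += 1
            # Removes the text under the annotations rather than only covering it
            page.apply_redactions()
        doc.save(str(out_path))
    return hits
```

Applying the redactions, rather than just drawing a black box, matters because text hidden under a drawing can still be copied out of the PDF.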

{Submit your rubrics to the submission form by the deadline.}

Extra Credit: Cross Document ER

For 10 points of extra credit, implement a cross-document entity resolution system for your code by adding a --crossdoc flag. Demonstrate that you are able to identify entities that appear across documents. If you are interested in attempting this portion of the assignment, please speak to Dr. Grant for demonstration instructions.

A REST API will be made available for you to submit coref requests.

curl -X POST "gpu002.cm.cluster:65535/resolve_coref" -H "Content-Type: application/json" -d '{"text": "John said he would help Mary. She was grateful."}'

The URL for the server may be between gpu001 and gpu022.

The output is a JSON object with the key "coreference_mapping". Its value is an array of dictionaries, one per entity. Each dictionary holds key-value pairs where the key is the token index in the document and the value is the token of the entity mention discovered.

{"coreference_mapping":[{"0":"John","2":"he"},{"5":"Mary","7":"She"}]}

Packaging

In this project, you will be using the uv tool to package your project. This is different from the Pipfile you have been using in the past. It is a more modern way to package your project and is more in line with the current best practices in the Python community. I have been experiencing issues with Pipenv, enough to warrant the change in Project 2.

Pin your Python version with uv python pin 3.10. This updates the .python-version file in your project directory. You should use Python 3.10 or 3.12 to ensure the best compatibility for this project.

When inside your repository directory, create your pyproject.toml file with the command uv init . This will create a pyproject.toml file in your project directory. To include the dependencies you need, you can use the uv add command.

Using uv instead of `pipenv`

The uv tool is a modern Python packaging tool that is more in line with the current best practices in the Python community. You can install the dependencies to the virtual environment managed by uv using:

uv add <package>

uv pip install <package>

The former command will add the package to the project's pyproject.toml file. The latter command will only install the package into the virtual environment. You should use uv add when possible to ensure that the project's dependencies are properly managed. When working with a project (application or library) managed by uv, you can also use the command uv add -r requirements.txt. This will add the listed requirements to the project's pyproject.toml.

Project Submission

Create a private repository called cis6930sp25-project2. Please ensure you use this exact repository name, all lowercase. Add collaborators cegme, tzhan024, and abbasidaniyal by going to Settings > Collaborators and teams > add people.

Setting up the repository

Create an empty GitHub repository named cis6930sp25-project2. Clone the repository to your machine (cloud or local). Ensure the .python-version file is set to your desired Python version. If it is not, you can manually change the file or use the command uv python pin 3.10. Create a Python project using uv init . in the repository directory. Create a virtual environment using uv venv in the repository directory. Activate your environment using the command source .venv/bin/activate on Linux and macOS.

You can add all necessary packages using the uv add command.

uv add --no-cache pip setuptools wheel pymupdf spacy-experimental tqdm
uv run -m spacy download en_core_web_sm
uv run -m spacy download en_core_web_trf

Directory Structure

Here is an example directory structure for your project.

cis6930sp25-project2/
├── COLLABORATORS.md
├── LICENSE
├── README.md
├── main.py
├── pyproject.toml
├── resources
│   ├── test1in.pdf
│   ├── test2in.pdf
│   ├── test1out.pdf
│   └── test2out.pdf
├── docs
└── tests
    ├── test_coref.py
    ├── test_ner.py
    └── test_token.py

pyproject.toml

This is a template for the standard pyproject.toml file. Please adjust as needed.

[build-system]
requires = ["setuptools"]
build-backend = "setuptools.build_meta"

[project]
authors = [{name = "Your name", email = "your.name@ufl.edu"}]
name = "project2"
description = "Project One from <github id> -- Spring 2025"
version = "1.0"
readme = "README.md"
dependencies = ...

[project.urls]
Task = "https://ufdatastudio.com/cis6930sp25/project/2"
repository = "https://github.com/<userid>/cis6930sp25-project2"

[tool.setuptools]
py-modules = []

[tool.pytest.ini_options]
testpaths = ["tests"]

Submission

README.md

The README.md file name should be all uppercase with the .md extension. You should write your name in it and give an example of how to run the program. You should describe all features of your code. The README.md file should contain a list of any known bugs and any assumptions made while writing the program. You should include directions on how to install and use the Python package. We know your code will not be perfect, so be sure to include any assumptions you make for your solution. Note: You should not copy code from any website not provided by the instructor.

Below is an example template:

# cis6930sp25 -- Project 2

Name:

## Assignment Description

In your own words...

## Installing models
`uv add pip`
`uv run -m spacy download en_core_web_sm`
`uv run -m spacy download en_core_web_trf`

## How to run

...

## Example

![video](video)

## Features and functions

#### main.py

Files and functions

## Bugs and Assumptions

COLLABORATORS.md

This file should contain a pipe-separated list describing who you worked with and a short text description of the nature of the collaboration. If you visited a website for inspiration, include the website. This information should be listed in three fields, as in the example below:

Katherine Johnson | kj@nasa.gov | Helped me understand calculations
Dorothy Vaughan | doro@dod.gov | Helped me with multiplexed time management
Stackoverflow | https://example | helped me with a compilation of python test

The collaborator file is mainly used to ensure that code similarities are coincidental. Be sure to abide by the academic integrity guidelines outlined in the syllabus. Generative AI tools may produce code that is very similar to other students' submissions and should be avoided.

Tests

You should have your own test data set that you can use to test your code. Add test flags as appropriate for you. Tests should be runnable by using uv run python -m pytest -v. The tests should show that all the functionality works. We are not necessarily looking for bulletproof code. Visit the pytest docs for details.

All tests should go in the tests/ folder. The names of the files containing the test functions should be prefixed with test_. For example, data size tests could go in a file with the name test_download.py. Functions in the test file that should run as tests must be prefixed with the string test. We will run your tests from the root directory using uv run python -m pytest -v . as the command. It is important to know that running pytest this way adds the current path to sys.path, so you do not have to hack the run path in your test files.

Consider installing the pytest-cov package to measure the code coverage of your tests.
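As an illustration of the naming rules above, a test file such as tests/test_token.py might look like the sketch below. The redact_names function is a hypothetical stand-in defined inline so the example is self-contained; in a real project it would be imported from your own module, and the block character used for redaction is also an assumption.

```python
# Sketch of tests/test_token.py

REDACTION_CHAR = "\u2588"  # full block character, an assumed redaction mark


def redact_names(text: str, names: list[str]) -> str:
    """Hypothetical stand-in for a function that would live in main.py."""
    for name in names:
        # str.replace is case sensitive, matching the --names requirement
        text = text.replace(name, REDACTION_CHAR * len(name))
    return text


def test_listed_names_are_redacted():
    result = redact_names("Bill met Carter.", ["Bill", "Carter"])
    assert "Bill" not in result
    assert "Carter" not in result


def test_matching_is_case_sensitive():
    result = redact_names("bill met Bill.", ["Bill"])
    assert result == "bill met " + REDACTION_CHAR * 4 + "."


def test_length_is_preserved():
    original = "Rick Scott spoke first."
    assert len(redact_names(original, ["Rick Scott"])) == len(original)
```

pytest collects the three functions automatically because both the file name and the function names start with test.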

Evaluation

You will be assigned two students' code to try out and score (Blind Peer Review). Create instructions for the students to follow. You will evaluate your peers using the set of documents you worked on. You will be asked to invite the students to your repository. Directions for peer evaluation will be provided later.

Submission

For this project, supply one README document, a code package, and a demonstration walkthrough of your code that is less than 5 minutes long. We have given you several tools in class. You may choose how you would like to develop and present your approach. All code and a link to the video should be posted in a private GitHub repository named cis6930sp25-project2; cegme, tzhan024, and abbasidaniyal should be added as collaborators.

Peer evaluation assignments and evaluation rubrics will be given at a later date.

When ready to submit, create a tag on your repository using git tag on the latest commit:

git tag v1.0
git push origin v1.0

The version v1.0 lets us know when and what version of code you would like us to grade. If you need to submit an updated version, you can use the tag v1.1.

We will also ask you to submit all code files on Gradescope.

In summary, you should submit:

A shared GitHub repo with all code contents.
All source documents to Gradescope from a linked GitHub account.
A less than 5-minute video describing the approach (Video on README) (Optional)
A README.md with description and instructions

Grading

You will receive follow-up instructions with rubrics in the following week.

	Percentage
Correctness	40%
README and Evaluation	60%
	100%

Tips

Coreference resolution blog: https://explosion.ai/blog/coref

If you are having numpy versioning issues, try pinning the numpy version to 1.26.4 with uv add numpy==1.26.4. This is a common issue with the current version of spaCy.

% uv add numpy==1.26.4
Resolved 47 packages in 561ms
Prepared 1 package in 518ms
Uninstalled 1 package in 84ms
Installed 1 package in 12ms
 - numpy==2.2.3
 + numpy==1.26.4

OSError: [E050] Can't find model 'en_core_web_sm'. It doesn't seem to be a Python package or a valid path to a data directory.

This is a common error when the model is not installed. You can install the model using the command uv run -m spacy download en_core_web_sm.

ImportError: Coref requires PyTorch: pip install thinc[torch]

To fix this error, you can (re)install the thinc package with the torch dependency using the command uv add 'thinc[torch]'. The single quotes are necessary to prevent the shell from interpreting the brackets as a glob pattern.

uv add 'thinc[torch]'

ValueError: [E109] Component 'experimental_coref' could not be run. Did you forget to call initialize()?

You are using an old or untrained coreference model. Instead, use the following pretrained model.

Retrieve the pretrained coref model:

pip install https://github.com/explosion/spacy-experimental/releases/download/v0.6.1/en_coreference_web_trf-3.4.0a2-py3-none-any.whl

Addenda

2025-04-07: Emphasized the possible urls for the coref server (GPU001 to GPU022).
2025-04-02: Note: --entities means all PER entities; the other entities are optional.
2025-03-24: Submitted version updated
