This is the web page for Data Engineering at the University of Florida.
After your time as an MIB analyst collecting data, you have been reassigned to the Witness Protection and Processing team. Sensitive information about witnesses must be redacted from all statements and documents. In this project, you will be given a series of documents and parameters. It will be your job to build a pythonic system to redact the sensitive information from the documents.
In this project, you will be building a system that redacts sensitive information from a set of documents. The documents will be in PDF format. You will use a Python package to identify named entities in the text. You will create a redacted version of each document. You will also identify the coreference entities in the text and redact those as well.
Below is an example execution of the system.
uv run python main.py --input "data/*.pdf" --output myoutput/ --names Bill --names Carter --entities --coref
Here is an example of the output when the command line help function is run.
uv run python main.py -h
usage: main.py [-h] [--input INPUT [INPUT ...]] [--output OUTPUT] [--names NAMES [NAMES ...]]
[--entities] [--coref] [--stats STATS]
optional arguments:
-h, --help show this help message and exit
--input INPUT [INPUT ...]
input files globs
--output OUTPUT output directory for pdf files
--names NAMES [NAMES ...]
Takes one or more case sensitive tokens as input
--entities Get all entities
--coref Redact all coreferences
--stats STATS Specify the location of the stats file
Below we discuss the expected command line parameters and their purpose.
--input
- This is a glob pattern that specifies the local input files to be redacted. This parameter should support both wildcard patterns (e.g., --input "saturn*crafts.pdf") and multiple inputs (e.g., --input "saturn80crafts.pdf" --input "saturn90crafts.pdf").
--output
- This is the directory where the redacted files will be saved. Each redacted file will be saved with the same name as its input file. There will be no duplicate file names.
--names
- This is a list of names that should be redacted from the text. Matching should be case sensitive. This flag can be used multiple times to redact multiple names. Names will only contain ASCII characters, and multi-word names will be quoted. For example, --names "Rick Scott" --names "Adam Silver".
--entities
- This flag will redact all entities from the text.
--coref
- This flag will redact all coreferences from the text.
--stats
- This is the location of the stats file that will be generated. The default location is standard error. The flag takes a file path string as an argument, and the program writes the metadata and log of the run to that location.
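The parameters described above map naturally onto Python's argparse module. The following is a minimal sketch of one possible layout, not a reference solution. The helper names build_parser and expand_inputs are hypothetical, the use of action="extend" (Python 3.8+) is an assumption made so that repeated flags accumulate into one list, and the string "stderr" is an assumed sentinel for the default stats location.

```python
import argparse
import glob


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that mirrors the help text shown earlier (hypothetical helper)."""
    parser = argparse.ArgumentParser(prog="main.py")
    # nargs="+" with action="extend" accepts both `--input a.pdf b.pdf`
    # and repeated `--input a.pdf --input b.pdf`.
    parser.add_argument("--input", nargs="+", action="extend", default=[],
                        help="input files globs")
    parser.add_argument("--output", help="output directory for pdf files")
    parser.add_argument("--names", nargs="+", action="extend", default=[],
                        help="Takes one or more case sensitive tokens as input")
    parser.add_argument("--entities", action="store_true", help="Get all entities")
    parser.add_argument("--coref", action="store_true", help="Redact all coreferences")
    # Assumption: the literal string "stderr" stands for standard error.
    parser.add_argument("--stats", default="stderr",
                        help="Specify the location of the stats file")
    return parser


def expand_inputs(patterns):
    """Expand each glob pattern into file paths, keeping first-seen order."""
    paths = []
    for pattern in patterns:
        for path in sorted(glob.glob(pattern)):
            if path not in paths:
                paths.append(path)
    return paths
```

Quoting the pattern on the command line (for example "data/*.pdf") keeps the shell from expanding it, so the program receives the pattern itself and can expand it with glob.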
The stats file should be a <tab>-separated file with the following columns and content.
Note that the stats output should not contain a header row.
The table below lists, in order, each column name, a description of the content in the column, and an example of the content.
Column Name | Description | Example |
---|---|---|
File | The name of the file | saturn80crafts.pdf |
Location | The location of the token redacted file | TBD |
Token | The token redacted | Harold Pryor |
Length | The length of the token redacted | 12 |
Type | The type of token redacted | Name |
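One way to produce rows in this shape is with the standard csv module using a tab delimiter. This is only a sketch under stated assumptions: the Redaction record and write_stats function are hypothetical names, and because the Location format is still listed as TBD, the sketch simply passes through whatever string it is given.

```python
import csv
import io
import sys
from typing import Iterable, NamedTuple, TextIO


class Redaction(NamedTuple):
    """One redacted token (hypothetical record type for illustration)."""
    file: str
    location: str  # format is TBD in the specification, so kept as free text
    token: str
    kind: str      # the "Type" column, e.g. "Name"


def write_stats(records: Iterable[Redaction], stream: TextIO = sys.stderr) -> None:
    """Write one tab-separated row per redaction, with no header row."""
    writer = csv.writer(stream, delimiter="\t", lineterminator="\n")
    for record in records:
        # Length is derived from the token rather than stored separately.
        writer.writerow([record.file, record.location, record.token,
                         len(record.token), record.kind])


# Small demonstration that writes to an in-memory buffer instead of stderr.
buffer = io.StringIO()
write_stats([Redaction("saturn80crafts.pdf", "TBD", "Harold Pryor", "Name")], buffer)
```

Passing an open file object instead of the buffer would write the same rows to a stats file on disk.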
The goal is to apply your system to the dataset of documents from MIB Witness Protection and Processing. These are documents related to unidentified anomalous phenomena (UAP) that have been declassified. For now, you can use any PDF to evaluate your system.
Here is an example input and output file. Develop a solution that you are sure works for your set of documents.
To read and edit PDFs, you can use the PyMuPDF package.
This package is installed under the name pymupdf and can be imported as pymupdf or, under its legacy name, fitz.
Explore the documentation to learn how to use the library to read and edit PDF files.
You can use the package to identify tokens or names and replace them in the PDF text.
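As a rough illustration of the read-and-edit workflow, the sketch below searches each page for a list of terms and blacks out every hit using PyMuPDF's redaction annotations. The function names output_path_for and redact_terms are hypothetical, and the import assumes a recent PyMuPDF release (older releases expose the same module as fitz). Note that page.search_for matches case-insensitively, so a case-sensitive check on the extracted text would still be needed to honor the --names requirement.

```python
from pathlib import Path


def output_path_for(input_path: str, output_dir: str) -> Path:
    """The redacted copy keeps the input file name inside the output directory."""
    return Path(output_dir) / Path(input_path).name


def redact_terms(input_path: str, output_dir: str, terms) -> int:
    """Black out every occurrence of each term; return the number of hits."""
    # Imported lazily so the path helper above works without PyMuPDF installed.
    import pymupdf  # assumption: recent release; use `import fitz` on older ones

    out_path = output_path_for(input_path, output_dir)
    out_path.parent.mkdir(parents=True, exist_ok=True)

    hits = 0
    doc = pymupdf.open(input_path)
    for page in doc:
        for term in terms:
            # search_for returns one rectangle per match (case-insensitive).
            for rect in page.search_for(term):
                page.add_redact_annot(rect, fill=(0, 0, 0))
                hits += 1
        # Applying the annotations removes the underlying text on the page.
        page.apply_redactions()
    doc.save(str(out_path))
    doc.close()
    return hits
```

A call such as redact_terms("data/saturn80crafts.pdf", "myoutput/", ["Harold Pryor"]) would write myoutput/saturn80crafts.pdf, assuming the input file exists.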
Submit your rubrics to the submission form by the deadline.
For 10 points of extra credit, implement a cross-document entity resolution system for your code by adding a --crossdoc flag.
Demonstrate that you are able to identify entities that appear across documents.
If you are interested in attempting this portion of the assignment, please speak to Dr. Grant for demonstration instructions.
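A very simple starting point for the cross-document idea is to group mentions by a normalized form and keep those that occur in more than one file. The sketch below is a naive baseline under stated assumptions, not a full entity resolution system: normalize and cross_document_entities are hypothetical names, and folding case and whitespace is an assumed matching rule.

```python
from collections import defaultdict


def normalize(mention: str) -> str:
    """Collapse whitespace and fold case (an assumed matching rule)."""
    return " ".join(mention.split()).casefold()


def cross_document_entities(mentions_by_file):
    """Map each normalized entity to the sorted files it appears in, if more than one."""
    files_by_entity = defaultdict(set)
    for filename, mentions in mentions_by_file.items():
        for mention in mentions:
            files_by_entity[normalize(mention)].add(filename)
    return {
        entity: sorted(files)
        for entity, files in files_by_entity.items()
        if len(files) > 1
    }


# Example input: entity mentions already extracted from each document.
example = {
    "a.pdf": ["Harold Pryor", "Bill"],
    "b.pdf": ["harold  pryor"],
    "c.pdf": ["Carter"],
}
shared = cross_document_entities(example)
```

A stronger approach would also link variants such as a full name and a surname, which exact matching on a normalized string cannot do.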
In this project, you will be using the uv tool to package your project.
This is different from the Pipfile you have been using in the past.
It is a more modern way to package your project and is more in line with the current best practices in the Python community.
I have been experiencing issues with Pipenv, enough to warrant the change in project 2.
Pin your Python version with uv python pin 3.10.
This updates the .python-version file in your project directory.
You should use Python 3.10 or 3.12 to ensure the best compatibility for this project.
When inside your repository directory, create your pyproject.toml file with the command uv init . (the dot refers to the current directory).
This will create a pyproject.toml file in your project directory.
To include the dependencies you need, you can use the uv add command.
The uv tool is a modern Python packaging tool that is more in line with the current best practices in the Python community.
You can install the dependencies to the virtual environment managed by uv using:
uv add <package>
uv pip install <package>
The former command will add the package to the project’s pyproject.toml file.
The latter command will install the package to the virtual environment.
You should use uv add when possible to ensure that the project’s dependencies are properly managed.
When working with a project (application or library) managed by uv, you can import an existing requirements file with uv add -r requirements.txt.
This will also add the requirements to the project’s pyproject.toml.
Create a private repository called cis6930sp25-project2.
Please ensure you use this exact repository name, all lowercase.
Add collaborators cegme, tzhan024, and abbasidaniyal by going to Settings > Collaborators and teams > add people.
Create an empty GitHub repository named cis6930sp25-project2.
Clone the repository to your machine (cloud or local).
Ensure the .python-version file is set to your desired Python version.
If it is not, you can manually change the file or use the command uv python pin 3.10.
Create a Python project using uv init . in the repository directory.
Create a virtual environment using uv venv in the repository directory.
Activate your environment using the command source .venv/bin/activate on Linux and macOS.
You can add all necessary packages using the uv add command.
uv add --no-cache pip setuptools wheel pymupdf spacy-experimental tqdm
uv run -m spacy download en_core_web_sm
uv run -m spacy download en_core_web_trf
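Once a model is downloaded, named entities can be read from the ents attribute of a processed spaCy document. The sketch below is illustrative only: load_pipeline and find_entities are hypothetical helper names, and the optional label filter is an assumption about how you might restrict redaction to certain entity types.

```python
def load_pipeline(model_name: str = "en_core_web_sm"):
    """Load a downloaded spaCy model (imported lazily; requires spaCy)."""
    import spacy

    return spacy.load(model_name)


def find_entities(doc, labels=None):
    """Return (start_char, end_char, text, label) for each entity in a spaCy Doc.

    If labels is given, only entities whose label is in that collection are kept.
    """
    return [
        (ent.start_char, ent.end_char, ent.text, ent.label_)
        for ent in doc.ents
        if labels is None or ent.label_ in labels
    ]
```

In use, something like find_entities(load_pipeline()("Harold Pryor visited Saturn."), labels={"PERSON"}) would return the character spans to redact. The exact entities found depend on the model.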
Here is an example directory structure for your project.
cis6930sp25-project2/
├── COLLABORATORS.md
├── LICENSE
├── README.md
├── main.py
├── pyproject.toml
├── resources
│ ├── test1in.pdf
│ ├── test2in.pdf
│ ├── test1out.pdf
│ └── test2out.pdf
├── docs
└── tests
  ├── test_coref.py
  ├── test_ner.py
  └── test_token.py
This is a template for the standard pyproject.toml file. Please adjust as needed.
[build-system]
requires = ["setuptools"]
build-backend = "setuptools.build_meta"
[project]
authors = [{name = "Your name", email = "your.name@ufl.edu"}]
name = "project2"
description = "Project Two from <github id> -- Spring 2025"
version = "1.0"
readme = "README.md"
dependencies = ...
[project.urls]
Task = "https://ufdatastudio.com/cis6930sp25/project/2"
repository = "https://github.com/<userid>/cis6930sp25-project2"
[tool.setuptools]
py-modules = []
[tool.pytest.ini_options]
testpaths = ["tests"]
The README file name should be all uppercase with a .md extension.
It should include your name and an example of how to run the program, along with any bugs that should be expected.
You should describe all features of your code.
The README.md file should contain a list of any bugs or assumptions made while writing the program.
You should include directions on how to install and use the Python package.
We know your code will not be perfect, so be sure to include any assumptions you make for your solution.
Note: You should not be copying code from any website not provided by the instructor.
Below is an example template:
# cis6930sp25 -- Project 2
Name:
## Assignment Description
In your own words...
## Installing models
`uv add pip`
`uv run -m spacy download en_core_web_sm`
`uv run -m spacy download en_core_web_trf`
## How to run
...
## Example

## Features and functions
#### main.py
Files and functions
## Bugs and Assumptions
The COLLABORATORS.md file should contain a pipe-separated list describing who you worked with and a short text description of the nature of the collaboration. If you visited a website for inspiration, include the website. This information should be listed in three fields, as in the example below:
Katherine Johnson | kj@nasa.gov | Helped me understand calculations
Dorothy Vaughan | doro@dod.gov | Helped me with multiplexed time management
Stackoverflow | https://example | helped me with a compilation of python test
The collaborator file is mainly used to ensure that code similarities are coincidental. Be sure to abide by the academic integrity guidelines outlined in the syllabus. Generative AI tools may result in code that is very similar to other student submissions and should be avoided.
You should have your own test data set that you can use to test your code.
Add test flags as appropriate for you.
Tests should be runnable by using uv run python -m pytest -v.
The tests should show that all the functionality works.
We are not necessarily looking for bulletproof code.
Visit the pytest docs for details.
All tests should go in the tests/ folder.
The names of the files containing the test functions should be prefixed with test_ (pytest collects files named test_*.py by default).
For example, data size tests could go in a file with the name test_download.py.
Functions in the test file that should run as tests must be prefixed with the string test.
We will run your tests from the root directory with the line uv run python -m pytest -v .
Running pytest this way adds the current directory to sys.path, so you do not have to modify the import path in your test files.
Consider installing the pytest-cov package to measure the code coverage of your tests.
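To make these conventions concrete, here is a small sketch of what a file such as tests/test_token.py might contain. The function find_name_spans is a hypothetical helper that would normally live in your own module and be imported by the test; it is defined inline here only to keep the example self-contained.

```python
import re


def find_name_spans(text, names):
    """Return (start, end, name) for each case-sensitive occurrence of each name."""
    spans = []
    for name in names:
        # re.escape treats the name as literal text, not as a pattern.
        for match in re.finditer(re.escape(name), text):
            spans.append((match.start(), match.end(), name))
    return sorted(spans)


def test_names_are_case_sensitive():
    text = "Bill met bill and Rick Scott."
    # Only the capitalized occurrence matches.
    assert find_name_spans(text, ["Bill"]) == [(0, 4, "Bill")]


def test_multi_word_names_match_as_one_token():
    text = "Bill met bill and Rick Scott."
    assert find_name_spans(text, ["Rick Scott"]) == [(18, 28, "Rick Scott")]
```

Because both functions start with test, pytest would collect them automatically when run from the repository root.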
You will be assigned two students' code to try out and score (Blind Peer Review). Create instructions for the students to follow. You will evaluate your peers using the set of documents you worked on. You will be asked to invite the students to your repository. Directions for peer evaluation will be provided later.
For this project, supply one README document, a code package, and a demonstration walkthrough of your code that is less than 5 minutes long.
We have given you several tools in class.
You may choose how you would like to develop and present your approach.
All code and a link to the video should be posted in a private GitHub repository cis6930sp25-project2; cegme, tzhan024, and abbasidaniyal should be added as collaborators.
Peer evaluation assignments and evaluation rubrics will be given at a later date.
When ready to submit, create a tag on your repository using git tag on the latest commit:
git tag v1.0
git push origin v1.0
The version v1.0 lets us know when and what version of code you would like us to grade.
If you need to submit an updated version, you can use the tag v1.1.
We will also ask you to submit all code files on Gradescope.
In summary, you should submit:
You will receive follow-up instructions with rubrics in the following week.
Component | Percentage |
---|---|
Correctness | 40% |
README and Evaluation | 60% |
Total | 100% |
If you are having numpy versioning issues, try changing the numpy version with uv add numpy==1.26.4. This will pin the numpy version to 1.26.4. This is a common issue with the current version of spacy.
% uv add numpy==1.26.4
Resolved 47 packages in 561ms
Prepared 1 package in 518ms
Uninstalled 1 package in 84ms
Installed 1 package in 12ms
- numpy==2.2.3
+ numpy==1.26.4
OSError: [E050] Can't find model 'en_core_web_sm'. It doesn't seem to be a Python package or a valid path to a data directory.
This is a common error when the model is not installed. You can install the model using the command uv run -m spacy download en_core_web_sm.
ImportError: Coref requires PyTorch: pip install thinc[torch]
To fix this error, you can (re)install the thinc package with the torch dependency using the command uv add 'thinc[torch]'. The single quotes are necessary to prevent the shell from interpreting the brackets as a glob pattern.
uv add 'thinc[torch]'
ValueError: [E109] Component 'experimental_coref' could not be run. Did you forget to call initialize()?
You are using an old or untrained coreference model. Instead, use the following pretrained model.
Retrieve the pretrained coref model:
pip install https://github.com/explosion/spacy-experimental/releases/download/v0.6.1/en_coreference_web_trf-3.4.0a2-py3-none-any.whl
Coreference resolution blog: https://explosion.ai/blog/coref
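With the pretrained model loaded, coreference clusters are exposed as span groups on the processed document. The sketch below collects every mention that shares a cluster with one of the target names, so that pronouns referring to a redacted name can be redacted too. It is a sketch under assumptions: coref_mentions_for is a hypothetical helper, the "coref_clusters" key prefix is assumed to be the spacy-experimental default, and substring matching against the targets is an assumed rule.

```python
def coref_mentions_for(doc, targets, prefix="coref_clusters"):
    """Return sorted (start_char, end_char, text) for mentions coreferent with a target.

    `doc` is expected to behave like a spaCy Doc whose `spans` mapping holds
    one span group per coreference cluster (assumed key prefix: "coref_clusters").
    """
    found = set()
    for key, cluster in doc.spans.items():
        if not key.startswith(prefix):
            continue  # skip span groups written by other components
        mentions = list(cluster)
        # Keep the whole cluster if any mention contains a target name.
        if any(target in mention.text for mention in mentions for target in targets):
            for mention in mentions:
                found.add((mention.start_char, mention.end_char, mention.text))
    return sorted(found)
```

In use, the document would come from a pipeline such as spacy.load("en_coreference_web_trf"), the model name suggested by the wheel file above. The resulting character spans can then be mapped back to locations in the PDF.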