This is the web page for Data Engineering at the University of Florida.
Congratulations! You have just been hired as a data engineer for the Men in Black (MIB), the top-secret organization responsible for monitoring extraterrestrial activity on Earth. Your role? To develop a data system that can detect near-paranormal activity—strange disturbances in public safety data that might indicate alien interactions. Once identified, the MIB can deploy agents to investigate and, if necessary, use a Neuralyzer to erase the memories of any witnesses. A Neuralyzer is a device used by MIB agents to erase short-term memory. It emits a bright flash that wipes the recollection of recent events from a person’s mind, ensuring that classified extraterrestrial encounters remain hidden from the public.
Your task is to develop a Python package that extracts public safety data from the city of Gainesville. This will help the MIB identify potential alien encounters by analyzing arrest reports, traffic crashes, and crime responses. Your package should dynamically store this data in a DuckDB database and provide a command-line interface (CLI) for querying incidents based on specific timeframes.
Your package will take three mandatory command-line parameters: --year, --month, and --day.
The code will look for all the sets of Crime Responses, Arrests, and Traffic Crashes that occurred on that day.
Find the event that involved the most people.
Identify all the incidents that day that are within 1 kilometer of the incident that involved the most people.
Print to STDOUT the number of people in each incident, then the case number, separated by a tab.
The list should be ordered by the number of people, then the case number.
If no incidents are given for the time frame, write nothing.
MIB Headquarters will use this information to identify potential alien encounters and deploy agents to investigate.
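The steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not a reference solution: the Incident record, its field names, and the nearby_report function are hypothetical, the distance function is passed in so any method can be plugged in, the radius is read as "within 1 kilometer", and the descending order for both columns is inferred from the example output on this page.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class Incident:
    """Hypothetical cleaned record; the field names are assumptions."""
    case_number: str
    people: int
    lat: float
    lon: float


def nearby_report(
    incidents: Iterable[Incident],
    distance_km: Callable[[Incident, Incident], float],
    radius_km: float = 1.0,
) -> List[str]:
    """Return tab-separated lines for incidents near the largest incident."""
    rows = list(incidents)
    if not rows:
        return []  # no incidents that day, so nothing is printed
    # The anchor is the incident that involved the most people
    # (on a tie, max() keeps the first one it sees).
    anchor = max(rows, key=lambda r: r.people)
    near = [r for r in rows if distance_km(anchor, r) <= radius_km]
    # Most people first, then highest case number (assumed from the examples;
    # case numbers are compared as equal-length strings).
    near.sort(key=lambda r: (r.people, r.case_number), reverse=True)
    return [f"{r.people}\t{r.case_number}" for r in near]


def rough_km(a: Incident, b: Incident) -> float:
    """Stand-in flat-earth distance, used only to make this demo runnable."""
    return 111.0 * ((a.lat - b.lat) ** 2 + (a.lon - b.lon) ** 2) ** 0.5


sample = [
    Incident("100", 2, 29.650, -82.32),
    Incident("101", 5, 29.651, -82.32),
    Incident("102", 2, 29.652, -82.32),
    Incident("103", 3, 30.000, -82.32),  # far away, filtered out
]
print("\n".join(nearby_report(sample, rough_km)))
```

Keeping the distance function as a parameter means the selection and ordering logic can be tested with a small hand-made data set, independently of whichever distance method you choose.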
Your Python package must retrieve and store data from the following sources:
Data should be extracted from the API on the fly instead of being saved, because the data is updated frequently.
The data can be cached during development, and you can add other debugging flags, but we will only test the ones above.
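One way to fetch only the requested day at run time is sketched below. It rests on assumptions that are not confirmed on this page: that each source exposes a Socrata-style JSON endpoint that understands the $where and $limit query parameters, and that the date column is named report_date (other sources may use a different field). The endpoint URL and both function names are hypothetical placeholders.

```python
import json
from datetime import date, timedelta
from urllib.parse import urlencode
from urllib.request import urlopen

# Hypothetical endpoint; substitute the JSON URL of each data source.
EXAMPLE_ENDPOINT = "https://example.org/resource/crime-responses.json"


def build_day_query(endpoint: str, day: date, date_field: str,
                    limit: int = 50000) -> str:
    """Build a URL that asks only for rows whose date_field falls on `day`.

    Assumes a Socrata-style API that understands $where and $limit.
    """
    start = day.isoformat() + "T00:00:00"
    end = (day + timedelta(days=1)).isoformat() + "T00:00:00"
    # Half-open window [start, end) so midnight of the next day is excluded.
    where = f"{date_field} >= '{start}' AND {date_field} < '{end}'"
    return endpoint + "?" + urlencode({"$where": where, "$limit": limit})


def fetch_day(endpoint: str, day: date, date_field: str) -> list:
    """Download the matching rows at call time; nothing is written to disk."""
    url = build_day_query(endpoint, day, date_field)
    with urlopen(url, timeout=30) as response:
        return json.load(response)


# Only the URL is built here; no network request is made in this demo.
print(build_day_query(EXAMPLE_ENDPOINT, date(2025, 1, 1), "report_date"))
```

Filtering on the server keeps each run small and always reflects the latest published data, which matches the on-the-fly requirement.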
The main.py file will be used to run the main portion of the code.
The code takes three parameters, --year, --month, and --day, and will print to STDOUT the number of people in each incident, then the case number, separated by a tab.
The list should be ordered by the number of people, then the case number.
If no incidents are found during the time frame, output nothing.
You will also need to ensure that your code works on Linux-based systems.
We will use the pipenv environment to run your code.
Below is an example of how to run the code followed by example (fictitious) output.
You do not have to output the geographic distances.
pipenv run python main.py --year 2025 --month 1 --day 1
25 211009186
17 211012687
17 211012686
2 211012571
...
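The command line shown above could be parsed with the standard argparse module. The sketch below is one possible approach, not a required design: the parse_args name is hypothetical, and converting the three integers into a datetime.date is an added convenience so that impossible dates are rejected early.

```python
import argparse
from datetime import date
from typing import Optional, Sequence


def parse_args(argv: Optional[Sequence[str]] = None) -> date:
    """Parse --year, --month and --day into a validated date."""
    parser = argparse.ArgumentParser(
        description="List incidents near the largest incident of a day."
    )
    parser.add_argument("--year", type=int, required=True)
    parser.add_argument("--month", type=int, required=True)
    parser.add_argument("--day", type=int, required=True)
    args = parser.parse_args(argv)
    try:
        return date(args.year, args.month, args.day)
    except ValueError as exc:
        # parser.error prints a usage message and exits with status 2.
        parser.error(str(exc))


# An explicit argument list is passed here; main.py would call parse_args()
# with no arguments so that sys.argv is used instead.
print(parse_args(["--year", "2025", "--month", "1", "--day", "1"]))
```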
The pseudocode for your code is roughly as follows:
Please note that the data is not perfect and you may need to do some data cleaning. Also, you can assume that all the resulting records from Arrests, Traffic Crashes, and Crime Responses will have linked identifiers. Include any other assumptions you make in your README.md file.
You have several choices for calculating the distance between two points on the Earth’s surface.
The Haversine formula is a popular choice for measuring the distance along the surface of a sphere.
To be more precise, you can also use the geopy library, which provides a variety of distance calculation methods.
You have many options for calculating the geodesic distance with the library.
If you are storing the data in a database such as DuckDB, you can use its spatial functions to calculate the geodesic distance after installing the extension with duckdb -c "INSTALL SPATIAL".
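As an illustration of the first option, here is a minimal Haversine sketch using only the standard library. The function name and the choice of mean Earth radius are assumptions made for this example; treating the Earth as a sphere is itself an approximation, so results can differ slightly from an ellipsoidal geodesic.

```python
import math

# Mean Earth radius in kilometers; the spherical model is an approximation.
EARTH_RADIUS_KM = 6371.0088


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometers between two points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    # Clamp to 1.0 to guard against floating-point overshoot.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


# One degree of longitude along the equator is roughly 111 km.
print(round(haversine_km(0.0, 0.0, 0.0, 1.0), 1))
```

For a 1 kilometer radius the difference between the spherical and ellipsoidal models is very small, so either approach should select nearly the same incidents; state the one you choose in your README.md.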
Create a private repository called cis6930sp25-project1.
Please ensure you use this exact repository name, all lowercase.
Add collaborators cegme, tzhan024, and abbasidaniyal by going to Settings > Collaborators and teams > Add people.
Please follow the packaging structure below.
We explicitly use the flat layout instead of the src layout because we will be running the submission as top-level code through the main[^1].
The docs folder can remain empty or contain only a placeholder file.
This directory could be used to hold autogenerated docs; autogenerated docs are not required for this project.
Please include your Pipfile, but you do not need to commit a Pipfile.lock.
The entry point for your code should be the main.py file.
cis6930sp25-project1/
├── COLLABORATORS.md
├── LICENSE
├── Pipfile
├── README.md
├── main.py
├── docs/
├── pyproject.toml
└── tests
    ├── test_joindata.py
    ├── ...
    └── test_geoquery.py
Do not alter your submission code or hard-code your solutions based on test cases. This will be considered academic dishonesty.
This is a template for the standard pyproject.toml file. Please adjust as needed.
[build-system]
requires = ["setuptools"]
build-backend = "setuptools.build_meta"
[project]
authors = [{name = "Your name", email = "your.name@ufl.edu"}]
name = "project1"
description = "Project One from <github id> -- Spring 2025"
version = "1.0"
readme = "README.md"
[project.urls]
Task = "https://ufdatastudio.com/cis6930sp25/project/1"
repository = "https://github.com/<userid>/cis6930sp25-project1"
[tool.setuptools]
py-modules = []
[tool.pytest.ini_options]
testpaths = ["tests"]
The README file name should be all uppercase with a .md extension.
Write your name in it, and include an example of how to run the code, including any bugs that should be expected.
You should describe all features of your code.
The README.md file should contain a list of any bugs or assumptions made while writing the program.
You should include directions on how to install and use the Python package.
We know your code will not be perfect; be sure to include any assumptions you make for your solution.
Note: You should not be copying code from any website not provided by the instructor.
Below is an example template:
# cis6930sp25 -- Project 1
Name:
## Assignment Description
In your own words...
## How to install
pipenv install -e .
## How to run
pipenv run python ...
## Example

## Features and functions
#### main.py
downloaddata() - this function...
...other functions
#### arrests.py
dojsonparse() - this function ...
## Bugs and Assumptions
...
The COLLABORATORS.md file should contain a pipe-separated list describing who you worked with and a short description of the nature of the collaboration. If you visited a website for inspiration, include the website. This information should be listed in three fields, as in the example below:
Katherine Johnson | kj@nasa.gov | Helped me understand calculations
Dorothy Vaughan | doro@dod.gov | Helped me with multiplexed time management
Stackoverflow | https://example | helped me with a compilation of python test
The collaborator file is mainly used to ensure that code similarities are coincidental. Be sure to abide by the academic integrity guidelines outlined in the syllabus. Generative AI tools may result in code that is very similar to other students' submissions and should be avoided.
You should have your own test data set that you can use to test your code.
Add test flags as appropriate for you.
Tests should be runnable by using pipenv run python -m pytest -v.
The tests should show that all the functionality works.
We are not necessarily looking for bullet proof code.
Visit the pytest docs for details.
All tests should go in the tests/ folder.
The names of the files containing the test functions should be prefixed with the word test.
For example, data size tests could go in a file with the name test_download.py.
Functions in the test file that should run as tests must be prefixed with the string test.
We will run your tests from the root directory with the line pipenv run python -m pytest -v .
Running pytest this way adds the current directory to sys.path, so you do not have to hack the run path in your test files.
An example test file is below:
# -*- coding: utf-8 -*-
# Example test_first_sanity.py
import main


def test_sanity() -> None:
    # Replace EXPECTED_RESULT with the value your function should return.
    assert main.somefunction() == EXPECTED_RESULT
...
Consider installing the pytest-cov package to measure the code coverage of your tests.
When ready to submit, create a tag on your repository using git tag on the latest commit:
git tag v1.0
git push origin v1.0
Version v1.0 lets us know when and what version of code you would like us to grade.
If you would like to submit a second version before the deadline, use the tag v2.0.
If you need to update a tag, view the commands in the following StackOverflow post.
You will also submit your repository to GradeScope. Submit your whole repository; do not upload individual files. A link will appear on Canvas when submissions are available. Consider also adding a YAML file to your repository for continuous integration in your GitHub project.
pipenv run python main.py --year 2024 --month 12 --day 12
5 224018560
4 224018592
3 224018581
3 224018555
2 224018578
2 224018577
2 224018570
2 224018566
2 224018565
2 224018557
2 224018556
2 224018552
1 224018583
Grades will be assessed according to the following distribution:
We would like to note that this data set contains real people and incidents and we ask you to respect identities and incidents referred to within the data.
2025-03-01: There are two types of dates in the crime response data. The test cases were generated with report_date, so please use report_date to make sure you pass the test cases.