CIS 6930 Spring 25

This is the web page for Data Engineering at the University of Florida.

Project 1 - Canvassing the Scene

CIS 6930 Spring 2025

Your Mission as a Data Engineer for the MIB

Congratulations! You have just been hired as a data engineer for the Men in Black (MIB), the top-secret organization responsible for monitoring extraterrestrial activity on Earth. Your role? To develop a data system that can detect near-paranormal activity—strange disturbances in public safety data that might indicate alien interactions. Once identified, the MIB can deploy agents to investigate and, if necessary, use a Neuralyzer to erase the memories of any witnesses. A Neuralyzer is a device used by MIB agents to erase short-term memory. It emits a bright flash that wipes the recollection of recent events from a person’s mind, ensuring that classified extraterrestrial encounters remain hidden from the public.



[Image: two MIB agents, each holding a pen-like Neuralyzer, with a small crowd in the background reacting to the bright flashes.]

Project Overview

Your task is to develop a Python package that extracts public safety data from the city of Gainesville. This will help the MIB identify potential alien encounters by analyzing arrest reports, traffic crashes, and crime responses. Your package should dynamically store this data in a DuckDB database and provide a command-line interface (CLI) for querying incidents based on specific timeframes.

Your package will take three mandatory command-line parameters: --year, --month, and --day. The code will look for all Crime Responses, Arrests, and Traffic Crashes that occurred on that day. Find the event that involved the most people. Then identify all incidents from that day that are within 1 kilometer of that event. Print to STDOUT the number of people in each incident, then the case number, separated by a tab. The list should be ordered by the number of people, then the case number. If no incidents are found for the time frame, write nothing. MIB Headquarters will use this information to identify potential alien encounters and deploy agents to investigate.

Project Requirements

Data Sources

Your Python package must retrieve and store data from the following sources:

Data should be extracted from the API on the fly instead of being saved, because the data is updated frequently. The data can be cached during development, and you can add other debugging flags, but we will only test the ones above.
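One way to fetch only a single day's records on the fly is to put the date filter into the request URL. The sketch below assumes the endpoint is a Socrata-style API that accepts `$where` and `$limit` query parameters; that is an assumption about the data portal, not something this page states. The function name, the base URL, and the date field name are hypothetical placeholders, so check the actual dataset documentation before relying on any of them.

```python
from datetime import date, timedelta
from urllib.parse import urlencode


def build_day_url(base_url, date_field, year, month, day, limit=50000):
    """Build a URL that selects rows whose date_field falls on one calendar day.

    Assumption: the endpoint understands SoQL-style $where and $limit parameters.
    """
    start = date(year, month, day)  # raises ValueError for impossible dates
    end = start + timedelta(days=1)
    # Half-open interval [start, end) so that midnight of the next day is excluded.
    where = (
        f"{date_field} >= '{start.isoformat()}T00:00:00' "
        f"AND {date_field} < '{end.isoformat()}T00:00:00'"
    )
    return f"{base_url}?{urlencode({'$where': where, '$limit': limit})}"


# Hypothetical endpoint and column name, for illustration only.
url = build_day_url("https://example.org/resource/abcd-1234.json", "report_date", 2025, 1, 31)
print(url)
```

The resulting URL can be passed to any HTTP client. Building the date with `datetime.date` also rejects invalid input such as month 13 before a request is made.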

main.py

The main.py file will be used to run the main portion of the code. The code takes three parameters, --year, --month, and --day, and prints to STDOUT the number of people in each incident, then the case number, separated by a tab. The list should be ordered by the number of people, then the case number; in the example output both are in descending order. If no incidents are found during the time frame, output nothing. You will also need to ensure that your code works on Linux-based systems. We will use the pipenv environment to run your code. Below is an example of how to run the code, followed by example (fictitious) output. You do not have to output the geographic distances.

pipenv run python main.py --year 2025 --month 1 --day 1
25	211009186
17	211012687 
17	211012686 
2	211012571 
...
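The three mandatory flags can be handled with the standard library's argparse module. This is a minimal sketch, not a required design; the function name `parse_args` and the help strings are illustrative choices.

```python
import argparse


def parse_args(argv=None):
    """Parse the three mandatory date flags; argv=None means use sys.argv."""
    parser = argparse.ArgumentParser(
        description="List incidents near the day's largest incident."
    )
    # required=True makes argparse exit with a usage error if a flag is missing.
    parser.add_argument("--year", type=int, required=True, help="four-digit year, e.g. 2025")
    parser.add_argument("--month", type=int, required=True, help="month number, 1-12")
    parser.add_argument("--day", type=int, required=True, help="day of the month, 1-31")
    return parser.parse_args(argv)


# Passing a list makes the parser easy to exercise from a test.
args = parse_args(["--year", "2025", "--month", "1", "--day", "1"])
print(args.year, args.month, args.day)
```

Accepting an optional `argv` list keeps the parser testable with pytest without touching `sys.argv`.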

Pseudocode

The pseudocode for your code is roughly as follows:

  1. Find all Arrests, Traffic Crashes, and Crime Responses that correspond to the given date.
  2. Find the traffic incident that affected the most Total People; call it x.
  3. Extract the location of the incident x.
  4. Compare the location of x to all incidents that occur on the same day.
  5. Remove all incidents that are not within 1 kilometer of x.
  6. Print the sorted number and incident pair to standard out, separated by a tab.

Please note that the data is not perfect and you may need to do some data cleaning. Also, you can assume that all the resulting records from Arrests, Traffic Crashes, and Crime Responses will have linked identifiers. Include any other assumptions you make in your README.md file.
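Steps 2 through 6 of the pseudocode can be sketched as follows. The record layout (dictionaries with `case_number`, `total_people`, `latitude`, and `longitude` keys) is hypothetical, and the sample records are fictitious. The sketch also makes several assumptions that the assignment does not specify: x is chosen among all of the day's records, ties for the largest incident are broken by the highest case number, records without coordinates are dropped, and a flat-earth approximation stands in for a proper geodesic distance.

```python
from math import cos, radians, sqrt


def flat_distance_km(lat1, lon1, lat2, lon2):
    """Equirectangular approximation; only a stand-in, adequate at ~1 km scale."""
    km_per_degree = 111.195  # approximate length of one degree of latitude
    dy = (lat2 - lat1) * km_per_degree
    dx = (lon2 - lon1) * km_per_degree * cos(radians((lat1 + lat2) / 2))
    return sqrt(dx * dx + dy * dy)


def rank(record):
    """Sort key: number of people first, then case number."""
    return (record["total_people"], record["case_number"])


def nearby_incidents(incidents, distance_km=flat_distance_km, radius_km=1.0):
    """Return (people, case_number) pairs within radius_km of the largest incident."""
    # Assumption: records without coordinates cannot be compared, so drop them.
    located = [r for r in incidents
               if r.get("latitude") is not None and r.get("longitude") is not None]
    if not located:
        return []  # nothing to report, so the caller prints nothing
    # Step 2: the incident with the most people (ties: highest case number).
    x = max(located, key=rank)
    # Steps 3-5: keep only incidents within the radius of x (x itself is kept).
    kept = [r for r in located
            if distance_km(x["latitude"], x["longitude"],
                           r["latitude"], r["longitude"]) <= radius_km]
    # Step 6: descending order, matching the example output.
    kept.sort(key=rank, reverse=True)
    return [(r["total_people"], r["case_number"]) for r in kept]


# Fictitious records for one day (step 1 is assumed to have happened already).
sample = [
    {"case_number": "225000100", "total_people": 5, "latitude": 29.6516, "longitude": -82.3248},
    {"case_number": "225000101", "total_people": 2, "latitude": 29.6520, "longitude": -82.3250},
    {"case_number": "225000102", "total_people": 3, "latitude": 29.7000, "longitude": -82.4000},
]
for people, case_number in nearby_incidents(sample):
    print(f"{people}\t{case_number}")
```

Taking the distance function as a parameter lets you swap in a more precise method later without changing the filtering logic.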

Geographic distance

You have several choices for calculating the distance between two points on the Earth’s surface. The Haversine formula is a popular choice for measuring the distance along the surface of a sphere. To be more precise, you can also use the geopy library, which provides a variety of geodesic distance calculation methods.
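For reference, a minimal Haversine sketch using only the standard library is shown below. The function name and the Earth radius constant are illustrative choices. The formula treats the Earth as a sphere, so expect small differences from an ellipsoidal geodesic calculation.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius; the sphere is an approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two points given in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlambda = radians(lon2 - lon1)
    # Haversine of the central angle between the two points.
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


# One degree of latitude along a meridian is roughly 111.2 km on this sphere.
print(round(haversine_km(0.0, 0.0, 1.0, 0.0), 1))
```

An incident is then within range of x when `haversine_km(...) <= 1.0`.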

If you are storing the data in a database such as DuckDB, you can use its spatial functions to calculate the geodesic distance. Install the spatial extension first with duckdb -c "INSTALL SPATIAL".

Project Submission

Create a private repository called cis6930sp25-project1. Please ensure you use this exact repository name, all lowercase. Add collaborators cegme, tzhan024, and abbasidaniyal by going to Settings > Collaborators and teams > add people.

Create a Python package

Please follow the packaging structure below. We explicitly use the flat layout instead of the src layout because we will be running the submission as top-level code through the main[^1]. The docs folder can remain empty or contain only a placeholder file. This directory could be used to hold autogenerated docs, but autogenerated docs are not required for this project. Please include your Pipfile; you do not need to commit a Pipfile.lock. The entry point for your code should be the main.py file.

cis6930sp25-project1/
├── COLLABORATORS.md
├── LICENSE
├── Pipfile
├── README.md
├── main.py
├── docs/
├── pyproject.toml
└── tests
    ├── test_joindata.py
    ├── ...
    └── test_geoquery.py

Do not alter your submission code or hard-code your solutions based on test cases. This will be considered academic dishonesty.

pyproject.toml

This is a template for the standard pyproject.toml file. Please adjust as needed.

[build-system]
requires = ["setuptools"]
build-backend = "setuptools.build_meta"

[project]
authors = [{name = "Your name", email = "your.name@ufl.edu"}]
name = "project1"
description = "Project One from <github id> -- Spring 2025"
version = "1.0"
readme = "README.md"

[project.urls]
Task = "https://ufdatastudio.com/cis6930sp25/project/1"
repository = "https://github.com/<userid>/cis6930sp25-project1"

[tool.setuptools]
py-modules = []

[tool.pytest.ini_options]
testpaths = ["tests"]

README.md

The README.md file name should be all uppercase with the .md extension. Include your name and an example of how to run the code, including any bugs that should be expected. Describe all features of your code, and list any bugs or assumptions made while writing the program. Include directions on how to install and use the Python package. We know your code will not be perfect, so be sure to include any assumptions you make for your solution. Note: You should not copy code from any website not provided by the instructor.

Below is an example template:

# cis6930sp25 -- Project 1 

Name:

## Assignment Description

In your own words...

## How to install

pipenv install -e .

## How to run

pipenv run python ...

## Example

![video](video)

## Features and functions

#### main.py

downloaddata() - this function...
...other functions

#### arrests.py

dojsonparse() - this function ...

## Bugs and Assumptions

...

COLLABORATORS.md

This file should contain a pipe-separated list describing who you worked with and a short description of the nature of the collaboration. If you visited a website for inspiration, include the website. List this information in three fields, as in the example below:

Katherine Johnson | kj@nasa.gov | Helped me understand calculations
Dorothy Vaughan | doro@dod.gov | Helped me with multiplexed time management
Stackoverflow | https://example | helped me with a compilation of python test

The collaborator file is mainly used to ensure that code similarities are coincidental. Be sure to abide by the academic integrity guidelines outlined in the syllabus. Generative AI tools may produce code that is very similar to other students' submissions and should be avoided.

Tests

You should have your own test data set that you can use to test your code. Add test flags as appropriate for you. Tests should be runnable using pipenv run python -m pytest -v. The tests should show that all the functionality works. We are not necessarily looking for bulletproof code. Visit the pytest docs for details.

All tests should go in the tests/ folder. The names of files containing test functions should be prefixed with the word test. For example, data size tests could go in a file named test_download.py. Functions in a test file that should run as tests must be prefixed with the string test. We will run your tests from the root directory with the line pipenv run python -m pytest -v . (the trailing dot is the current directory). Running pytest this way adds the current path to sys.path, so you do not have to hack the run path in your test files.

An example test file is below

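# -*- coding: utf-8 -*-
# Example test_first_sanity.py
import main

# Placeholder: replace with the value you expect main.somefunction() to return.
EXPECTED_RESULT = ...


def test_sanity() -> None:
    # somefunction() stands in for whichever function of main.py you are testing.
    assert main.somefunction() == EXPECTED_RESULT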

...

Consider installing the pytest-cov package to measure the code coverage of your tests.

Submitting your code

When ready to submit, create a tag on your repository using git tag on the latest commit:

git tag v1.0
git push origin v1.0

Version v1.0 lets us know when and what version of code you would like us to grade. If you would like to submit a second version before the deadline, use the tag v2.0. If you need to update a tag, view the commands in the following StackOverflow post.

You will also submit your repository to GradeScope. You will have to submit your whole repository; do not upload individual files. A link will appear on Canvas when submissions are available. Consider also adding a YAML file to your repository for continuous integration in your GitHub project.

Some Running Examples

pipenv run python main.py --year 2024 --month 12 --day 12
5       224018560
4       224018592
3       224018581
3       224018555
2       224018578
2       224018577
2       224018570
2       224018566
2       224018565
2       224018557
2       224018556
2       224018552
1       224018583

Grading

Grades will be assessed according to the following distribution:

We would like to note that this data set contains real people and incidents and we ask you to respect identities and incidents referred to within the data.

Addenda

