CIS 6930 Spring 24

This is the web page for Data Engineering at the University of Florida.

View the Project on GitHub ufdatastudio/cis6930sp24

CIS 6930, Spring 2024 Assignment 2

Augmenting Data

Introduction

In assignment zero, you wrote code to extract records from a public police department website. Each PDF allows people to view incidents. The data we extracted is in a structured format and helpful for analysis. For downstream purposes, we need to perform data augmentation on the extracted records. While performing the augmentation, we will need to keep fairness and bias issues in mind.

Task Overview

In this assignment, we will perform a subsequent task for the data pipeline. Using your submission from assignment 0, you will take records from several PDF files and augment the data. You will also create a datasheet for the dataset you are creating. Review the discussion from the literature to guide your creation of the datasheet.

Your code should be executable via the command line. The main entry point of the code should be in a file named assignment2.py. It should take one parameter, --urls <filename>, which points to a file with a list of incident URLs. Each line contains only a URL and no other information.

pipenv run python assignment2.py --urls files.csv
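As a minimal sketch of the command-line handling described above, the following uses argparse from the standard library. The function names parse_args and read_urls are hypothetical, and skipping blank lines is an assumption rather than a stated requirement.

```python
import argparse


def parse_args(argv=None):
    """Parse the command line; --urls names a file with one URL per line."""
    parser = argparse.ArgumentParser(description="Augment incident records.")
    parser.add_argument("--urls", required=True,
                        help="file containing one incident PDF URL per line")
    return parser.parse_args(argv)


def read_urls(path):
    """Return the URLs in file order, skipping blank lines (an assumption)."""
    with open(path, encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]
```

In assignment2.py, something like read_urls(parse_args().urls) would then yield the URLs in the order they are listed, which is the order the files must be processed in.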

Your code should read each URL listed in the file passed in and extract the incident records from the corresponding file. Then you are going to perform data augmentation to make the data easier to pass on to another process in the pipeline. The output, tab-separated content, should be printed to stdout.

Below we describe the output format.

Data Augmentation

Given each file, you are to produce the following tab-separated rows. Each file should be processed in the order it is listed in the --urls file that is passed, and each record should be ordered by its appearance in the corresponding PDF.

Day of the Week | Time of Day | Weather | Location Rank | Side of Town | Incident Rank | Nature | EMSSTAT
integer | integer | integer | integer | string | integer | string | boolean | integer
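To illustrate the row layout, here is a small sketch that joins the eight named columns with tabs. The function name format_row and the sample values are made up for illustration, and the sketch assumes one field per column name listed above.

```python
def format_row(day, hour, weather, location_rank, side, incident_rank,
               nature, emsstat):
    """Join the eight fields, in column order, into one tab-separated line."""
    fields = (day, hour, weather, location_rank, side, incident_rank,
              nature, emsstat)
    return "\t".join(str(field) for field in fields)


# Illustrative values only; real rows come from the augmented records.
print(format_row(1, 14, 3, 1, "NW", 2, "Traffic Stop", False))
```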

Day of Week

The day of week is a numeric value in the range 1-7, where 1 corresponds to Sunday and 7 corresponds to Saturday.

Time of Day

The time of day is a numeric code from 0 to 24 describing the hour of the incident.
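One possible way to derive both the day of week and the time of day from a record's timestamp is sketched below. The timestamp format "%m/%d/%Y %H:%M" is an assumption about how the date/time column was extracted in assignment 0, and day_and_hour is a hypothetical helper name.

```python
from datetime import datetime

# Assumed timestamp layout, e.g. "2/1/2024 14:30"; adjust to your extraction.
TIMESTAMP_FORMAT = "%m/%d/%Y %H:%M"


def day_and_hour(timestamp):
    """Return (day_of_week, hour) with Sunday=1 ... Saturday=7."""
    moment = datetime.strptime(timestamp, TIMESTAMP_FORMAT)
    # isoweekday() gives Monday=1 ... Sunday=7; shift so that Sunday becomes 1.
    day_of_week = moment.isoweekday() % 7 + 1
    # datetime.hour is an integer in the range 0-23.
    return day_of_week, moment.hour
```

For example, day_and_hour("2/4/2024 0:04") falls on a Sunday shortly after midnight, so it yields (1, 0).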

Weather

Determine the weather at the time and location of the incident. The weather is determined by the WMO code, an integer that represents a weather condition.

Location Rank

Sort all listed locations. Give an integer ranking of the frequency of locations with ties preserved. For instance, if there is a three-way tie for the most popular location, each location will be ranked 1; the next most popular location should be ranked 4. You can use the exact text of the location.
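The tie-preserving ranking described above can be sketched with collections.Counter. The helper name rank_by_frequency is hypothetical; the same approach would apply to any column ranked by frequency, such as the Nature text.

```python
from collections import Counter


def rank_by_frequency(values):
    """Map each distinct value to its frequency rank, keeping ties.

    Values with equal counts share a rank, and the next rank skips ahead
    by the size of the tie (1, 1, 1, 4, ...).
    """
    ranks = {}
    rank = 0
    previous_count = None
    # most_common() yields (value, count) pairs from most to least frequent.
    for position, (value, count) in enumerate(Counter(values).most_common(), start=1):
        if count != previous_count:
            rank = position
            previous_count = count
        ranks[value] = rank
    return ranks
```

With the locations A, A, B, B, C, C, D, the three tied locations A, B, and C are each ranked 1 and D is ranked 4.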

Side of Town

The side of town is one of eight items {N, S, E, W, NW, NE, SW, SE}. Side of town is determined by the approximate orientation of the incident relative to the center of town (35.220833, -97.443611). You can use the geopy library for assistance.
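Once an incident location has been geocoded to a latitude and longitude (for example with geopy, not shown here), one way to pick the side of town is to compute the compass bearing from the center of town and bucket it into eight 45-degree sectors. This is a sketch under that assumption; side_of_town is a hypothetical helper, and the sector boundaries are a design choice, not something the assignment specifies.

```python
import math

CENTER = (35.220833, -97.443611)  # center of town from the assignment text
SIDES = ["N", "NE", "E", "SE", "S", "SW", "W", "NW"]


def side_of_town(lat, lon, center=CENTER):
    """Return the compass sector of (lat, lon) as seen from the center."""
    lat1, lon1 = math.radians(center[0]), math.radians(center[1])
    lat2, lon2 = math.radians(lat), math.radians(lon)
    d_lon = lon2 - lon1
    # Initial great-circle bearing, measured clockwise from north.
    x = math.sin(d_lon) * math.cos(lat2)
    y = (math.cos(lat1) * math.sin(lat2)
         - math.sin(lat1) * math.cos(lat2) * math.cos(d_lon))
    bearing = (math.degrees(math.atan2(x, y)) + 360) % 360
    # Each sector spans 45 degrees and is centered on its compass direction.
    return SIDES[int((bearing + 22.5) // 45) % 8]
```

A point directly north of the center, such as (35.3, -97.443611), has a bearing of 0 degrees and falls in the N sector.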

Incident Rank

Sort all of the Natures. Give an integer ranking of the frequency of natures with ties preserved. For instance, if there is a three-way tie for the most popular incident, each incident will be ranked 1; the next most popular nature should be ranked 4.

Nature

The Nature is the direct text of the Nature from the source record.

EMSSTAT

This is a boolean value that is True in two cases: first, if the Incident ORI was EMSSTAT; second, if one of the next one or two records contains an EMSSTAT at the same time and location.
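The two EMSSTAT cases could be checked along the following lines. This sketch assumes each record is a dict with hypothetical keys "time", "location", and "ori", and that the records are in the order they appear in the PDF.

```python
def emsstat_flags(records):
    """Return one boolean per record, following the two EMSSTAT cases."""
    flags = []
    for index, record in enumerate(records):
        # Case 1: the record's own Incident ORI is EMSSTAT.
        flagged = record["ori"] == "EMSSTAT"
        if not flagged:
            # Case 2: one of the next two records is an EMSSTAT record
            # at the same time and location.
            for later in records[index + 1:index + 3]:
                if (later["ori"] == "EMSSTAT"
                        and later["time"] == record["time"]
                        and later["location"] == record["location"]):
                    flagged = True
                    break
        flags.append(flagged)
    return flags
```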

Submission

DATASHEET.md

Use the template from the Datasheets for Datasets paper or from a more recent source to create the datasheet for this dataset. Your answers should be completed to the best of your ability. Be sure you work on this portion individually because we will examine submissions for academic dishonesty. We understand that not every question can be answered, but you should still fill out each question as much as possible.

README.md

The README file name should be uppercase with an .md extension. You should write your name in it, an example of how to run it, and a list of any web or external resources that you used for help. The README file should also contain a list of any bugs or assumptions made while writing the program. Note that you should not be copying code from any website not provided by the instructor. You should include directions on how to install and use the code. You should describe any known bugs and cite any sources or people you used for help. Be sure to include any assumptions you make for your solution.

COLLABORATORS file

This file should contain a comma-separated list describing who you worked with and a short text description of the nature of the collaboration. This information should be listed in three fields, as in the example below:

Katherine Johnson, kj@nasa.gov, Helped me understand calculations
Dorothy Vaughan, doro@dod.gov, Helped me with multiplexed time management

Assignment Descriptions

Your code structure should be in a directory with something similar to the following format:

cis6930sp24-assignment2/
├── COLLABORATORS
├── DATASHEET
├── LICENSE
├── README
├── Pipfile
├── src
│   └── ...
├── docs/
├── assignment2.py
├── setup.cfg
├── setup.py
└── tests/
    ├── test_time.py
    ├── test_geo.py
    ├── test_nature.py
    └── ... 

setup.py

from setuptools import setup, find_packages

setup(
    name='assignment2',
    version='1.0',
    author='Your Name',
    author_email='your ufl email',
    packages=find_packages(exclude=('tests', 'docs')),
    setup_requires=['pytest-runner'],
    tests_require=['pytest']
)

Note, the setup.cfg file should have at least the following text inside:

[aliases]
test=pytest

[tool:pytest]
norecursedirs = .*, CVS, _darcs, {arch}, *.egg, venv

Grading

Grades will be assessed according to the following distribution:

Note that we will be running your code in batch, so it is important that you follow directions closely.

Resources

