This is the web page for Data Engineering at the University of Florida.
In assignment zero, you wrote code to extract records from a public police department website. Each pdf allows people to view incidents. The data we extracted is in a structured format and helpful for analysis. For further downstream purposes, we need to perform data augmentation on the extracted records. To perform augmentation we will need to keep fairness and bias issues in mind.
In this assignment, we will perform a subsequent task for the data pipeline. Using the submission from assignment 0, you will take records from several instances of pdf files and augment the data. You will also create a datasheet for the dataset you are creating. Review the discussion from the literature to guide your creation of the datasheet.
Your code should be executable via the command line.
The main entry point of the code should be in a file named assignment2.py.
It should take one parameter, --urls <filename>, which points to a file with a list of incident PDF URLs.
Each line contains only a URL and no other information.
pipenv run python assignment2.py --urls files.csv
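As a minimal sketch of the command-line handling described above, the standard-library `argparse` module can accept the `--urls` parameter. The helper names `parse_args`, `clean_urls`, and `load_urls` are hypothetical and not part of the assignment.

```python
import argparse


def parse_args(argv=None):
    """Parse the command line; --urls names a file with one URL per line."""
    parser = argparse.ArgumentParser(description="Augment incident records.")
    parser.add_argument("--urls", required=True,
                        help="file containing one incident PDF URL per line")
    return parser.parse_args(argv)


def clean_urls(lines):
    """Strip whitespace and drop blank lines, keeping the original order."""
    return [line.strip() for line in lines if line.strip()]


def load_urls(path):
    """Read the URL file named on the command line."""
    with open(path, encoding="utf-8") as handle:
        return clean_urls(handle)
```

Keeping the line cleaning separate from the file access makes the ordering behavior easy to unit test without touching the disk.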
Your code should extract the URLs from the file passed in and read each file listed. You will then perform data augmentation so the data is easier to pass on to another process in the pipeline. The tab-separated output should be printed to stdout.
Below we describe the output format.
Given each file, you are to produce the following tab-separated rows. Each file should be processed in the order it is listed in the --urls file that is passed, and each record should be ordered by its appearance in the corresponding pdf.
Day of the Week | Time of Day | Weather | Location Rank | Side of Town | Incident Rank | Nature | EMSSTAT |
---|---|---|---|---|---|---|---|
integer | integer | integer | integer | string | integer | string | boolean integer |
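One possible way to emit a row in the column order of the table is sketched below. The dictionary keys in `COLUMNS` are hypothetical names, and writing the EMSSTAT boolean as `1`/`0` is an assumption based on the "boolean integer" type in the table; adjust it if `True`/`False` is expected.

```python
# Hypothetical field names, listed in the same order as the table columns.
COLUMNS = ("day_of_week", "time_of_day", "weather", "location_rank",
           "side_of_town", "incident_rank", "nature", "emsstat")


def format_row(record):
    """Join one augmented record's fields with tabs, in column order."""
    values = []
    for name in COLUMNS:
        value = record[name]
        # Assumption: booleans are printed as 1 or 0.
        if isinstance(value, bool):
            value = int(value)
        values.append(str(value))
    return "\t".join(values)
```

Each formatted row can then be passed to `print`, which writes to stdout by default.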
The day of the week is a numeric value in the range 1-7, where 1 corresponds to Sunday and 7 corresponds to Saturday.
The time of day is a numeric code from 0 to 24 describing the hour of the incident.
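The two time-based columns can be derived from a single timestamp, as in this rough sketch. The timestamp layout in `TIMESTAMP_FORMAT` is an assumption (the assignment text does not state it), and `day_and_hour` is a hypothetical helper name.

```python
from datetime import datetime

# Assumed layout of the incident timestamp, for example "3/1/2024 13:45".
TIMESTAMP_FORMAT = "%m/%d/%Y %H:%M"


def day_and_hour(timestamp):
    """Return (day_of_week, hour) with Sunday=1 and Saturday=7."""
    moment = datetime.strptime(timestamp, TIMESTAMP_FORMAT)
    # isoweekday() gives Monday=1 ... Sunday=7; shift so that Sunday=1.
    day_of_week = moment.isoweekday() % 7 + 1
    return day_of_week, moment.hour
```

For example, 3/3/2024 was a Sunday, so an incident at 0:04 that day maps to day 1 and hour 0.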
Determine the weather at the time and location of the incident. The weather is determined by the WMO code, an integer that represents a weather condition.
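The assignment does not name a weather source. As one possibility, the sketch below assumes Open-Meteo's historical archive endpoint and its `weather_code` hourly variable; both the service choice and the helper names are assumptions. It only builds the query URL and reads a decoded JSON response, leaving the HTTP request and time zone handling out.

```python
from urllib.parse import urlencode

# Assumption: Open-Meteo's historical archive endpoint (not named in the assignment).
ARCHIVE_ENDPOINT = "https://archive-api.open-meteo.com/v1/archive"


def archive_url(latitude, longitude, day):
    """Build a query URL for one day's hourly WMO codes (day is YYYY-MM-DD)."""
    params = {
        "latitude": latitude,
        "longitude": longitude,
        "start_date": day,
        "end_date": day,
        "hourly": "weather_code",
    }
    return f"{ARCHIVE_ENDPOINT}?{urlencode(params)}"


def code_for_hour(payload, hour):
    """Pick the WMO code for an hour from a decoded JSON response.

    Assumes the hourly list starts at hour 0 of the requested day.
    """
    return payload["hourly"]["weather_code"][hour]
```

The latitude and longitude would come from geocoding the incident location, which is not shown here.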
Sort all listed locations. Give an integer ranking of the frequency of locations with ties preserved.
For instance, if there is a three-way tie for the most popular location, each location will be ranked 1; the next most popular location should be ranked 4.
You can use the exact text of the location.
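The ranking with ties preserved can be sketched with `collections.Counter`. The function name `frequency_ranks` is hypothetical; the same approach would apply to any column that needs a frequency ranking.

```python
from collections import Counter


def frequency_ranks(values):
    """Map each distinct value to its frequency rank, with ties sharing a rank."""
    counts = Counter(values)
    ranks = {}
    previous_count = None
    current_rank = 0
    # most_common() yields (value, count) pairs from most to least frequent.
    for position, (value, count) in enumerate(counts.most_common(), start=1):
        if count != previous_count:
            # A new, lower count starts at its position in the sorted list,
            # so a three-way tie at rank 1 is followed by rank 4.
            current_rank = position
            previous_count = count
        ranks[value] = current_rank
    return ranks
```

Each record's rank is then a dictionary lookup on the exact text of its location.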
The side of town is one of eight items {N, S, E, W, NW, NE, SW, SE}.
Side of town is determined by the approximate orientation relative to the center of town, 35.220833, -97.443611.
You can use the geopy library for assistance.
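Once a location has coordinates (for instance from a geocoder, which is not shown), one way to pick the side of town is to compute the compass bearing from the center and bucket it into eight 45-degree sectors. This is a sketch using only the `math` module; `initial_bearing` and `side_of_town` are hypothetical names, and the sector boundaries are an assumption.

```python
import math

CENTER = (35.220833, -97.443611)  # center of town from the assignment
SIDES = ("N", "NE", "E", "SE", "S", "SW", "W", "NW")


def initial_bearing(origin, target):
    """Compass bearing in degrees (0 = north, 90 = east) from origin to target."""
    lat1, lon1 = map(math.radians, origin)
    lat2, lon2 = map(math.radians, target)
    delta_lon = lon2 - lon1
    x = math.sin(delta_lon) * math.cos(lat2)
    y = (math.cos(lat1) * math.sin(lat2)
         - math.sin(lat1) * math.cos(lat2) * math.cos(delta_lon))
    return math.degrees(math.atan2(x, y)) % 360.0


def side_of_town(latitude, longitude, center=CENTER):
    """Return one of N, NE, E, SE, S, SW, W, NW for a coordinate."""
    bearing = initial_bearing(center, (latitude, longitude))
    # Each side covers a 45-degree sector centered on its compass direction.
    index = int((bearing + 22.5) // 45) % 8
    return SIDES[index]
```

A point exactly at the center has a bearing of 0 in this sketch and is therefore labeled N.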
Sort all of the natures. Give an integer ranking of the frequency of natures with ties preserved.
For instance, if there is a three-way tie for the most popular incident, each incident will be ranked 1; the next most popular nature should be ranked 4.
The Nature is the direct text of the Nature from the source record.
EMSSTAT is a boolean value that is True in two cases: first, if the Incident ORI was EMSSTAT; second, if the subsequent record or two contain an EMSSTAT at the same time and location.
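The two cases can be sketched as a look-ahead over the ordered records. The dictionary keys `ori`, `time`, and `location` are hypothetical field names, and "the subsequent record or two" is read here as the next two records in the same file.

```python
def emsstat_flags(records):
    """Return one boolean per record, in the records' original order."""
    flags = []
    for index, record in enumerate(records):
        # Case 1: the record's own Incident ORI is EMSSTAT.
        flagged = record["ori"] == "EMSSTAT"
        # Case 2: one of the next two records is an EMSSTAT entry
        # that shares this record's time and location.
        for later in records[index + 1:index + 3]:
            if (later["ori"] == "EMSSTAT"
                    and later["time"] == record["time"]
                    and later["location"] == record["location"]):
                flagged = True
        flags.append(flagged)
    return flags
```

The slice stops safely at the end of the list, so the last records need no special handling.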
Use the template from the Datasheets for Datasets paper, or from a more recent source, to create the datasheet for this dataset. Your answers should be completed to the best of your ability. Be sure you work on this portion individually because we will examine submissions for academic dishonesty. We understand that not all questions can be answered, but you should still fill out each question as much as possible.
The README file name should be uppercase with an .md extension.
You should write your name in it, an example of how to run it, and a list of any web or external resources that you used for help.
The README file should also contain a list of any bugs or assumptions made while writing the program.
Note that you should not be copying code from any website not provided by the instructor.
You should include directions on how to install and use the code.
You should describe any known bugs and cite any sources or people you used for help.
Be sure to include any assumptions you make for your solution.
The COLLABORATORS file should contain a comma-separated list describing who you worked with and a short text description of the nature of the collaboration. This information should be listed in three fields, as in the example below:
Katherine Johnson, kj@nasa.gov, Helped me understand calculations
Dorothy Vaughan, doro@dod.gov, Helped me with multiplexed time management
Your code structure should be in a directory with something similar to the following format:
cis6930sp24-assignment2/
├── COLLABORATORS
├── DATASHEET
├── LICENSE
├── README
├── Pipfile
├── src
│ └── ...
├── docs/
├── assignment2.py
├── setup.cfg
├── setup.py
└── tests/
    ├── test_time.py
    ├── test_geo.py
    ├── test_nature.py
    └── ...
from setuptools import setup, find_packages

setup(
    name='assignment2',
    version='1.0',
    author='Your Name',
    author_email='your ufl email',
    packages=find_packages(exclude=('tests', 'docs')),
    setup_requires=['pytest-runner'],
    tests_require=['pytest']
)
Note, the setup.cfg file should have at least the following text inside:
[aliases]
test=pytest
[tool:pytest]
norecursedirs = .*, CVS, _darcs, {arch}, *.egg, venv
Grades will be assessed according to the following distribution:
Note: we will be running your code in batch, so it is important that you follow directions closely.