CIS 6930 Fall 24

This is the web page for Data Engineering at the University of Florida.

View the Project on GitHub ufdatastudio/cis6930fa24

Project 0 - CIS 6930 Fall 2024

In this project, you will practice extracting data from an online source and reformatting it. Use your knowledge of Python 3, SQL, regular expressions, and Linux command-line tools to extract information from a PDF file on the web.

The Norman, Oklahoma police department regularly reports incidents, arrests, and other activities. This data is hosted on its website and distributed to the public as PDF files.

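The website contains three types of summaries: arrests, incidents, and case summaries. Your assignment is to build a function that collects only the incidents. To do so, you need to write Python 3 function(s) to do each of the following:

- Download the data from an incident summary URL
- Extract the fields: date/time, incident number, location, nature, and incident ORI
- Create an SQLite database to store the data
- Insert the data into the database
- Print each nature and the number of times it appears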

Below we describe the assignment structure and each required function. Please read through this whole document before starting!

README.md

The README file name should be all uppercase with the .md extension. It should include your name, an example of how to run the code, a demo (gif/video) demonstrating the execution, and directions on how to install and use the Python package. You should describe all functions and your approach to developing the database. We know your code will not be perfect, so be sure to list any known bugs and any assumptions you made while writing the program. Note: You should not copy code from any website not provided by the instructor.

Below is an example template:

# cis6930fa24 -- Project 0 -- Template

Name:

# Project Description (in your own words)


# How to install
pipenv install

## How to run
pipenv run ...
![video](video)


## Functions
#### main.py \
extractincidents() - this function has these parameters, does this process, returns answer.
...other functions

## Database Development
...

## Bugs and Assumptions
...

COLLABORATORS.md file

This file should contain a pipe-separated list describing who you worked with and a short text description of the nature of the collaboration. If you visited a website for inspiration, include the website. This information should be listed in three fields, as in the example below:

Katherine Johnson | kj@nasa.gov | Helped me understand calculations
Dorothy Vaughan | doro@dod.gov | Helped me with multiplexed time management
Stackoverflow | https://example | helped me with a compilation of python test

The collaborator file is mainly used to ensure that code similarities are coincidental. Be sure to abide by the academic integrity guidelines outlined in the syllabus. Generative AI tools should not be used for this assignment.

Assignment Description

Create a private repository called cis6930fa24-project0. Add collaborators cegme and WillCogginsUFL by going to Settings > Collaborators and teams > Add people. Your code should be organized in a directory with the structure shown below.

Feel free to combine or optimize functions as long as your code preserves the behavior of main.py. You may have more or fewer files in your directory as needed. You may have several additional tests and modules in your code.

Create a Python package

cis6930fa24-project0/
├── COLLABORATORS.md
├── LICENSE
├── Pipfile
├── README.md
├── project0
│   └── main.py
├── docs
├── resources
├── setup.cfg
├── setup.py
└── tests
    ├── test_download.py
    └── test_random.py

setup.py / setup.cfg

from setuptools import setup, find_packages

setup(
	name='project0',
	version='1.0',
	author='Your Name',
	author_email='your ufl email',
	packages=find_packages(exclude=('tests', 'docs', 'resources')),
	setup_requires=['pytest-runner'],
	tests_require=['pytest']	
)

Note, the setup.cfg file should have at least the following text inside:

[aliases]
test=pytest

[tool:pytest]
norecursedirs = .*, CVS, _darcs, {arch}, *.egg, venv

main.py

Here is an example main.py file; we expect that yours will be different. This snippet shows an outline of the expected functionality: calling the main function should download the data, insert it into a database, and print a summary of the incidents. Your code will likely differ significantly, and you may have more or fewer individual steps. The outline also shows how to use argparse to pass parameters to the code. You will also need to ensure that your code works on Linux-based systems. We will use the pipenv environment to run your code.

# -*- coding: utf-8 -*-
# Example main.py
import argparse

import project0

def main(url):
    # Download data
    incident_data = project0.fetchincidents(url)

    # Extract data
    incidents = project0.extractincidents(incident_data)
	
    # Create new database
    db = project0.createdb()
	
    # Insert data
    project0.populatedb(db, incidents)
	
    # Print incident counts
    project0.status(db)


if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument("--incidents", type=str, required=True, 
                         help="Incident summary url.")
     
    args = parser.parse_args()
    if args.incidents:
        main(args.incidents)

Your code should take a URL from the command line and perform each operation. After the code is installed, you should be able to run the code using the command below. We will use a Pipfile to manage the package installation (more on Pipfiles here).

pipenv run python project0/main.py --incidents <url>

Each run of the above command should create a new normanpd.db database file. You can add other command-line parameters to test each operation, but the --incidents <url> flag is required.

Below is a discussion of each interface. Note that the function names are suggestions and may be changed to suit your program.

Download Data

The function fetchincidents(url) takes a URL string and uses the Python urllib.request library to grab one incident PDF from the Norman Police Report webpage.

Below is an example snippet that grabs an incident PDF document from the URL.

import urllib.request

url = ("https://www.normanok.gov/sites/default/files/documents/"
       "2024-01/2024-01-01_daily_incident_summary.pdf")
# Send a browser-like User-Agent header with the request
headers = {}
headers['User-Agent'] = "Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.27 Safari/537.17"

# Download the PDF and keep its raw bytes
data = urllib.request.urlopen(urllib.request.Request(url, headers=headers)).read()

The location of the downloaded file can be kept on the file system (perhaps in the /tmp directory), in a local variable, in a config file, or by any other method of your choosing, as long as the next function can read the data from the incident PDF. For standard ways of handling temporary files and packaged resource files, see the tempfile and importlib.resources modules. Please discuss your choice in your README file.
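One way to combine the download with temporary storage is sketched below. It assumes the standard-library tempfile module is an acceptable place to keep the PDF. The helper name save_to_tempfile and the shortened User-Agent string are illustrative and not part of the assignment.

```python
import tempfile
import urllib.request


def fetchincidents(url):
    """Download the file at `url` and return its raw bytes."""
    # Some servers reject requests that lack a browser-like User-Agent.
    headers = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64)"}
    request = urllib.request.Request(url, headers=headers)
    with urllib.request.urlopen(request) as response:
        return response.read()


def save_to_tempfile(data, suffix=".pdf"):
    """Hypothetical helper: write bytes to a temp file and return its path."""
    # delete=False keeps the file on disk after the handle is closed,
    # so the extraction step can reopen it by path.
    with tempfile.NamedTemporaryFile(suffix=suffix, delete=False) as handle:
        handle.write(data)
        return handle.name
```

Returning the bytes and the path separately leaves the choice open: the extraction step can work from memory or from disk, whichever you document in your README.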

Extract Data

The function extractincidents(incident_data) takes data from a PDF file and extracts the incidents. Each incident includes a date_time, incident number, incident location, nature, and incident ori. To extract the data from the PDF files, use the pypdf.PdfReader class. It lets you read the pages of a PDF file and extract their text so you can search for the rows. Extract each row and add it to a list. To install the module, use the command pipenv install pypdf. You must use the pypdf package for this assignment.


import pypdf
from pypdf import PdfReader

reader = PdfReader("example.pdf")
page = reader.pages[0]
print(page.extract_text()) # Shows the extracted text


You can return the information from the pdf to insert the data into a database. For this assignment, you will need to consider EACH PAGE of the linked pdf.
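A minimal sketch of a page-by-page extraction loop follows. It rests on several assumptions that you should check against a real file: that pypdf's layout extraction mode is available in your installed version, that columns come out separated by two or more spaces, and that every data row starts with a date and time. The names page_texts, parse_rows, and ROW_START are hypothetical.

```python
import io
import re

# Assumption: a data row starts with a date and time such as "1/1/2024 0:04".
ROW_START = re.compile(r"^\d{1,2}/\d{1,2}/\d{4} \d{1,2}:\d{2}")


def page_texts(pdf_bytes):
    """Yield the extracted text of every page in the PDF."""
    from pypdf import PdfReader  # third-party: pipenv install pypdf

    reader = PdfReader(io.BytesIO(pdf_bytes))
    for page in reader.pages:
        # Layout mode tries to preserve the spacing between columns.
        yield page.extract_text(extraction_mode="layout")


def parse_rows(text):
    """Split page text into 5-field tuples, skipping anything else."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not ROW_START.match(line):
            continue  # header, footer, or continuation of a multi-line cell
        fields = re.split(r"\s{2,}", line)
        if len(fields) == 5:
            rows.append(tuple(fields))
    return rows


def extractincidents(pdf_bytes):
    """Collect rows from every page of the PDF."""
    incidents = []
    for text in page_texts(pdf_bytes):
        incidents.extend(parse_rows(text))
    return incidents
```

This sketch silently drops rows with empty cells and lines that continue a wrapped cell, so a complete solution needs extra handling for those cases.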

Create Database

The createdb() function creates an SQLite database file named normanpd.db and creates a table with the schema below. You should save this database file in the resources/ directory.

CREATE TABLE incidents (
    incident_time TEXT,
    incident_number TEXT,
    incident_location TEXT,
    nature TEXT,
    incident_ori TEXT
);

Note that some “cells” have information on multiple lines; your code should take care of these edge cases.

You will need to access sqlite3 from Python, so be sure to look at the official docs: https://docs.python.org/3.12/library/sqlite3.html
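The sketch below shows one possible createdb() using the standard sqlite3 module. The path parameter and the choice to delete a leftover file so that each run starts from an empty database are assumptions, not requirements stated above.

```python
import os
import sqlite3

SCHEMA = """
CREATE TABLE incidents (
    incident_time TEXT,
    incident_number TEXT,
    incident_location TEXT,
    nature TEXT,
    incident_ori TEXT
);
"""


def createdb(path="resources/normanpd.db"):
    """Create a fresh database file at `path` and return the connection."""
    directory = os.path.dirname(path)
    if directory:
        os.makedirs(directory, exist_ok=True)
    # Remove any database left over from a previous run.
    if os.path.exists(path):
        os.remove(path)
    db = sqlite3.connect(path)
    db.execute(SCHEMA)
    db.commit()
    return db
```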

Insert Data

The function populatedb(db, incidents) takes the rows created by the extractincidents() function and adds them to the normanpd.db database. Again, the signature of this function can be changed as needed.
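A minimal sketch of the insert step, assuming db is an open sqlite3 connection that already contains the incidents table and that each incident is a sequence of five values in schema order. Returning the row count is an illustrative choice.

```python
import sqlite3


def populatedb(db: sqlite3.Connection, incidents) -> int:
    """Insert an iterable of 5-field rows and return how many were added."""
    rows = [tuple(row) for row in incidents]
    # "?" placeholders let sqlite3 handle quoting, so values containing
    # apostrophes or pipe characters are stored verbatim.
    db.executemany("INSERT INTO incidents VALUES (?, ?, ?, ?, ?)", rows)
    db.commit()
    return len(rows)
```

Parameterized queries are preferable to building the SQL with string formatting, since location and nature text can contain quote characters.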

Status Print

The status() function prints to standard output a list of incident natures and the number of times each has occurred. The list should be sorted alphabetically and case-sensitively by nature. Each field of the row should be separated by the pipe character (|). Each row is terminated solely by the \n character.

Abdominal Pains/Problems|2
Alarm|14
Animal at Large|2
Animal Complaint|2
Animal Inured|1
Animal Vicious|1
Assult EMS Needed|2
...
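The summary can be produced with a single GROUP BY query, as in the sketch below. The helper status_lines is a hypothetical name added to make the output easy to test. The sort relies on SQLite's default collation, which compares text byte by byte (so uppercase letters sort before lowercase ones); treat that as an assumption and confirm it matches the ordering expected by the graders.

```python
import sqlite3


def status_lines(db: sqlite3.Connection) -> list[str]:
    """Return one 'nature|count' string per distinct nature."""
    query = """
        SELECT nature, COUNT(*)
        FROM incidents
        GROUP BY nature
        ORDER BY nature
    """
    return [f"{nature}|{count}" for nature, count in db.execute(query)]


def status(db: sqlite3.Connection) -> None:
    """Print the summary to standard output, one row per line."""
    for line in status_lines(db):
        print(line)  # print() appends a newline after each row
```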

Test

We expect you to create your own test files to test each function. Some tests involve downloading and processing data. To create your own tests, you can download and save test files locally. This is recommended, particularly because Norman PD irregularly removes the arrest files. Tests should be runnable using pipenv run python -m pytest. At a minimum, you should test each function. You should discuss your tests in your README.
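As a sketch of a test that works from a locally saved file rather than the live website, the example below reads a file through a file:// URL. The function fetch_bytes is a hypothetical stand-in for your own download function, and the placeholder bytes stand in for a real saved PDF.

```python
import urllib.request


def fetch_bytes(url):
    """Stand-in for the project's download function (hypothetical name)."""
    with urllib.request.urlopen(url) as response:
        return response.read()


def test_fetch_reads_saved_copy(tmp_path):
    # tmp_path is a pytest fixture that provides a fresh temporary directory.
    saved = tmp_path / "incident_summary.pdf"
    saved.write_bytes(b"%PDF-1.4 placeholder")

    # Path.as_uri() builds a file:// URL, which urlopen can read without
    # touching the network.
    assert fetch_bytes(saved.as_uri()) == b"%PDF-1.4 placeholder"
```

In a real test suite you would import the function from your package and point the test at a PDF stored in your repository.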

Submitting your code

When ready to submit, create a tag on your repository using git tag on the latest commit:

git tag v1.0
git push origin v1.0

Version v1.0 lets us know when and what version of code you would like us to grade. If you would like to submit a second version before the deadline, use the tag v2.0.

If you need to update a tag, view the commands in the following StackOverflow post.

You will also submit your repository to GradeScope. You will have to submit your whole repository; do not upload individual files. A link will appear on Canvas when submissions are available.

Grading

Grades will be assessed according to the following distribution:

Addendum

