CIS 6930 Fall 24


This is the web page for Data Engineering at the University of Florida.

View the Project on GitHub ufdatastudio/cis6930fa24

Assignment 0 - CIS 6930 Fall 2024

In this assignment you will practice extracting data from an online source and reformatting it. You will create a Python package that takes a search parameter and outputs information about wanted people. Given a page, your code should retrieve the data from the FBI’s most wanted list and produce CSV-style output in which the title, subjects, and field offices of each record are joined with the thorn character as the separator. The output should be printed to standard out.

About the Wanted API

Please view the FBI Wanted API page at its source, https://www.fbi.gov/wanted/api. The FBI API exposes a REST endpoint, and the suggested way to retrieve information about people who are wanted by the FBI is the Python requests library. You can make a REST GET request to https://api.fbi.gov/wanted/v1/list to retrieve the first 20 records. You may add a parameter page=N, where N is an integer, to request the Nth page of records (20 records per page). Each record on a page contains several pieces of information. For this assignment, we are concerned with three fields: title, which gives the name of the wanted person or an event; subjects, which gives a reason for the wanted post; and field_offices, which gives the FBI office assigned to the case.
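
As a rough sketch of the request described above, the snippet below builds the URL for a given page and pulls the three fields of interest out of a decoded response. It assumes the JSON response keeps its records in a top-level items list, which you should confirm against the API page. The names build_url and extract_fields are hypothetical, and the sample response is made up for illustration.

```python
import json
from urllib.parse import urlencode

BASE_URL = "https://api.fbi.gov/wanted/v1/list"


def build_url(page=None):
    """Return the list URL, adding ?page=N when a page is given."""
    if page is None:
        return BASE_URL
    return BASE_URL + "?" + urlencode({"page": page})


def extract_fields(response_text):
    """Pull title, subjects, and field_offices from each record.

    Assumption: the records sit in a top-level "items" list.
    """
    payload = json.loads(response_text)
    rows = []
    for item in payload.get("items", []):
        rows.append((item.get("title"), item.get("subjects"), item.get("field_offices")))
    return rows


# Made-up response text, used only to show the shape of the result.
sample = (
    '{"items": [{"title": "Florida Man", '
    '"subjects": ["Seeking Information"], '
    '"field_offices": ["gainesville"]}]}'
)
print(build_url(2))
print(extract_fields(sample))
```

Keeping the URL building and the field extraction separate from the network call makes both pieces easy to test without contacting the API.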

Assignment Task

Your assignment is to build a Python function that collects elements from a given page and represents them in the requested format.

In your Python package, you should create a command line Python file that takes a page parameter. The page parameter corresponds to the FBI API page. The system will use the --page parameter to fetch data from the FBI API. The --file parameter should contain the location of a JSON file that can be used for testing. Only one of the two parameters can be specified at a time.

pipenv run python main.py --page <integer>

or

pipenv run python main.py --file <file-location>

Your code should then extract and format the data on the page and print it to stdout. Each row should have the following format:

{title}{thorn}{subjects}{thorn}{field_offices}

In the output, you will use the lowercase thorn character (þ) to separate data fields. Fields with multiple entries should be separated by commas. Fields with null or empty entries should remain blank. For example, output from a page could look like the following:

Extreme lossþsebastian,Pit BullþMiami
Dissapointing teamþDJþTallahassee,Dublin
Florida ManþSeeking InformationþGainesville
Data Engineerþþall over
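
One possible way to produce rows in this format is sketched below. The helper names join_field and format_record are hypothetical, and the records are invented to mirror the example output above; the sketch assumes that subjects and field_offices arrive as lists (or null) and that title is a plain string.

```python
THORN = "\u00fe"  # lowercase thorn character


def join_field(value):
    """Turn a field into text: lists are comma-joined, null or empty becomes ''."""
    if not value:
        return ""
    if isinstance(value, (list, tuple)):
        return ",".join(str(v) for v in value if v)
    return str(value)


def format_record(record):
    """Build one output row from a record dictionary."""
    fields = (record.get("title"), record.get("subjects"), record.get("field_offices"))
    return THORN.join(join_field(f) for f in fields)


# Invented records that mirror the example rows.
records = [
    {"title": "Florida Man", "subjects": ["Seeking Information"], "field_offices": ["Gainesville"]},
    {"title": "Data Engineer", "subjects": None, "field_offices": ["all over"]},
]
for record in records:
    print(format_record(record))
```

Every row produced this way contains exactly two thorn characters, even when a field is blank, which is a convenient property to check in a test.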

You will also create simple test cases for each of the features. See the check list below for possible test cases.

Tasks Check list

You should create Python tests for the following components

Project submission

Create a private repository called cis6930fa24-assignment0. Please ensure you use this exact repository name, all lowercase. Add collaborators cegme and WillCogginsUFL by going to Settings > Collaborators and teams > add people.

Create a Python package

cis6930fa24-assignment0/
├── COLLABORATORS.md
├── LICENSE
├── Pipfile
├── README.md
├── main.py
├── docs
├── setup.cfg
├── setup.py
└── tests
    ├── test_download.py
    └── test_randompage.py

setup.py / setup.cfg

from setuptools import setup, find_packages

setup(
	name='assignment0',
	version='1.0',
	author='Your Name',
	author_email='your UFL email',
	packages=find_packages(exclude=('tests', 'docs')),
	setup_requires=['pytest-runner'],
	tests_require=['pytest']	
)

Note, the setup.cfg file should have at least the following text inside:

[aliases]
test=pytest

[tool:pytest]
norecursedirs = .*, CVS, _darcs, {arch}, *.egg, venv

README.md

The README.md file name should be all uppercase with the .md extension. You should write your name in it and give an example of how to run the code. You should describe all functions and your approach to developing the solution. The README.md file should contain a list of any known bugs and any assumptions made while writing the program. You should include directions on how to install and use the Python package. We know your code will not be perfect, so be sure to include any assumptions you make for your solution. Note: You should not copy code from any website not provided by the instructor.

Below is an example template:

# cis6930fa24 -- Assignment0 -- Template

Name:

# Assignment Description 

In your own words...


# How to install
pipenv install -e .

## How to run
pipenv run ...

## Example (optional)
![video](video)


## Functions

#### main.py
downloaddata() - this function...
...other functions

#### parsefile.py
dojsonparse() - this function ...

## Bugs and Assumptions
...

COLLABORATORS.md

This file should contain a pipe-separated list describing who you worked with and a short text description of the nature of the collaboration. If you visited a website for inspiration, include the website. This information should be listed in three fields, as in the example below:

Katherine Johnson | kj@nasa.gov | Helped me understand calculations
Dorothy Vaughan | doro@dod.gov | Helped me with multiplexed time management
Stackoverflow | https://example | helped me with a compilation of python test

The collaborator file is mainly used to ensure that code similarities are coincidental. Be sure to abide by the academic integrity guidelines outlined in the syllabus. Generative AI tools should not be used for this assignment.

main.py

Here is an example main.py file; we expect that yours will be different. It outlines the expected functionality and shows how to use argparse to pass parameters to the code. Calling the main function should initiate the project functionality. You will also need to ensure that your code works on Linux-based systems. We will use the pipenv environment to run your code.

# -*- coding: utf-8 -*-
# Example main.py
import argparse
import sys

def main(page=None, thefile=None):
    # Get the data from exactly one source
    if page is not None:
        # TODO Download data function
        pass
    elif thefile is not None:
        # TODO Retrieve the file contents
        pass

    # TODO call formatting data function

    # TODO print the thorn separated file


if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument("--file", type=str, required=False, help="Location of an example API JSON file.")
    parser.add_argument("--page", type=int, required=False, help="The FBI API page number to fetch.")

    args = parser.parse_args()
    # Compare against None so that --page 0 is not mistaken for a missing value
    if args.page is not None:
        main(page=args.page)
    elif args.file:
        main(thefile=args.file)
    else:
        parser.print_help(sys.stderr)
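
If you want argparse itself to enforce the rule that only one parameter can be specified, a mutually exclusive group is one option. The sketch below is only an illustration: build_parser is a hypothetical name, and marking the group as required means argparse exits with an error, rather than printing help, when neither option is given.

```python
import argparse


def build_parser():
    """Create a parser that accepts exactly one of --page or --file."""
    parser = argparse.ArgumentParser(
        description="Print FBI wanted data as thorn-separated rows."
    )
    # argparse rejects the command line if both options are supplied,
    # and (because required=True) if neither is supplied.
    source = parser.add_mutually_exclusive_group(required=True)
    source.add_argument("--page", type=int, help="FBI API page number to fetch.")
    source.add_argument("--file", type=str, help="Path to a saved JSON response.")
    return parser


# Parse an explicit argument list so the sketch runs without a real command line.
args = build_parser().parse_args(["--page", "3"])
print(args.page, args.file)
```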

The requests library is a nice way to download data, but instead of installing that package you can use urllib from the standard library. Below is an example snippet that grabs the wanted list from the URL. It is not necessary that you use it.

import urllib.request

def fetch_data():
    url = "https://api.fbi.gov/wanted/v1/list"
    headers = {}

    # Browser-like user agent string
    headers['User-Agent'] = "Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.27 Safari/537.17"

    # Build the request with the headers, then open it and read the raw bytes
    request = urllib.request.Request(url, headers=headers)
    data = urllib.request.urlopen(request).read()

    return data

Tests

Tests should be runnable by using pipenv run python -m pytest -v. The tests should show that all the functionality works. We are not necessarily looking for bulletproof code. Visit the pytest docs for details.

All tests should go in the tests/ folder. The names of the files containing the test functions should be prefixed with the word test. For example, data size tests could go in a file with the name test_download.py. Functions in the test file that should run as tests must be prefixed with the string test. From the root directory, we will run your tests with the following line: pipenv run python -m pytest -v .

An example test file is below

# -*- coding: utf-8 -*-
# Example test_first_sanity.py
import main

def test_sanity() -> None:
    # Replace somefunction and EXPECTED_RESULT with your own function and value
    assert main.somefunction() == EXPECTED_RESULT

...
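
To make the template more concrete, here is a minimal sketch of a test that works from a saved JSON file instead of the live API. The helper load_records is hypothetical and is defined inline only so the example is self-contained; in your project it would live in your package and be imported by the test file. The sketch also assumes the records are stored under a top-level items key.

```python
import json


def load_records(path):
    """Hypothetical helper: read a saved API response and return its records."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle).get("items", [])


def test_load_records_reads_items(tmp_path):
    # tmp_path is pytest's built-in temporary directory fixture.
    saved = tmp_path / "page.json"
    saved.write_text(json.dumps({"items": [{"title": "Florida Man"}]}), encoding="utf-8")

    records = load_records(saved)

    assert len(records) == 1
    assert records[0]["title"] == "Florida Man"
```

Because the test writes its own input file, it does not depend on the network or on the current contents of the wanted list.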

Submitting your code

When ready to submit, create a tag on your repository using git tag on the latest commit:

git tag v1.0
git push origin v1.0

Version v1.0 lets us know when and what version of code you would like us to grade. If you would like to submit a second version, use the tag v2.0. If you need to update a tag, view the commands in the following StackOverflow post.

You will also submit your repository to GradeScope. You will have to submit your whole repository; do not upload individual files. A link will appear on Canvas when submissions are available. Consider also adding a YAML file to your repository for continuous integration in your GitHub project.

Grading

Grades will be assessed according to the following distribution:

