3.1.1. Internet Access

One of the simplest and most common ways to acquire data is to download it from the internet. Many datasets are publicly available and can be downloaded directly from websites as CSV, JSON, ZIP, or other file formats. This is often the first step in a data science project.

You can download these files to your local machine or to cloud storage, and then load them into your data analysis environment.

3.1.1.1. Downloading from websites

A lot of datasets can be downloaded directly from websites using your browser. Let’s look at a few examples:

Example: Wine Quality Dataset

Let’s start with a popular example: the Wine Quality Dataset, hosted in the UC Irvine Machine Learning Repository. It records physicochemical properties of wine samples (such as acidity, residual sugar, and alcohol content) together with a quality score for each sample.

You can manually download the dataset by clicking the “Download” button on the page. This gives you a ZIP file that contains a CSV file. Once extracted, you can view it using Excel, Google Sheets, or any text editor.
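If you would rather unpack the archive with Python than by hand, the standard library's zipfile module can do it. The sketch below is one minimal way to pull the CSV files out of a downloaded ZIP. The function name extract_csv_files and the default data folder are illustrative choices, not part of the dataset or of any library.

```python
import zipfile
from pathlib import Path


def extract_csv_files(zip_path, dest_dir="data"):
    """Extract every .csv member of a ZIP archive and return the extracted paths."""
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    extracted = []
    with zipfile.ZipFile(zip_path) as archive:
        for name in archive.namelist():
            # Skip documentation and other non-CSV members of the archive.
            if name.lower().endswith(".csv"):
                # extract() writes the member to disk and returns its path.
                extracted.append(Path(archive.extract(name, path=dest_dir)))
    return extracted
```

Calling extract_csv_files with the path to the ZIP file your browser saved should return the list of CSV files it found, which you can then open as described above.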

Example: Kaggle Dataset

Platforms like Kaggle also host a wide range of datasets. After creating a free account, you can search for any dataset and download it with one click.

For example, try downloading the Melbourne Housing Dataset. The download gives you a CSV file that you can open in spreadsheet software or load programmatically with Python.

3.1.1.2. Downloading Files Programmatically

Manually downloading datasets is fine for one-off tasks, but automation becomes essential when you’re working with multiple files or regularly updated data.

Here’s how to download the Wine Quality dataset using Python:

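import requests

wine_dataset_url = "https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv"

# Send an HTTP GET request; the timeout keeps the call from hanging indefinitely
response = requests.get(wine_dataset_url, timeout=30)
# Raise an exception on 4xx/5xx responses instead of saving an error page
response.raise_for_status()

# Write the raw response bytes to a local file
with open("winequality-red.csv", "wb") as file:
    file.write(response.content)

print("Wine Quality Dataset downloaded successfully!")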
Wine Quality Dataset downloaded successfully!

Note

We use the requests library, a third-party Python package, to make an HTTP request and download the file, the same way your browser does. If it is not already installed, you can install it with pip install requests. Read more in the requests documentation.
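When you need several files, the single download above can be wrapped in a small helper and called in a loop. The sketch below is one way to do that, under a few assumptions: it uses the standard library's urllib.request rather than requests so that it runs without extra installs, and the names download_file and download_all, as well as the data folder, are made up for illustration.

```python
import shutil
import urllib.request
from pathlib import Path
from urllib.parse import urlparse


def download_file(url, dest_dir="data", timeout=30):
    """Download one URL into dest_dir and return the path of the saved file."""
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    # Name the local file after the last component of the URL path.
    filename = Path(urlparse(url).path).name or "download.bin"
    target = dest_dir / filename
    # urlopen raises an exception for HTTP error responses (4xx/5xx),
    # and it runs before open(), so no file is created in that case.
    with urllib.request.urlopen(url, timeout=timeout) as response, open(target, "wb") as out:
        shutil.copyfileobj(response, out)  # copy in chunks rather than all at once
    return target


def download_all(urls, dest_dir="data"):
    """Download each URL in turn and return the list of saved paths."""
    return [download_file(url, dest_dir) for url in urls]
```

Calling download_all([wine_dataset_url]) with the URL from the example above should save winequality-red.csv under data/. Add more URLs to the list to fetch several files in one go.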

You can now load the downloaded dataset using pandas, another Python library that makes it easy to manage and work with tabular data. This particular file separates values with semicolons rather than commas, which is why sep=';' is passed to read_csv. You can learn more in the pandas documentation. We’ll also cover some basic functionality in later chapters.

import pandas as pd

wine_data = pd.read_csv("winequality-red.csv", sep=';')
wine_data.head()
   fixed acidity  volatile acidity  citric acid  residual sugar  chlorides  free sulfur dioxide  total sulfur dioxide  density    pH  sulphates  alcohol  quality
0            7.4              0.70         0.00             1.9      0.076                 11.0                  34.0   0.9978  3.51       0.56      9.4        5
1            7.8              0.88         0.00             2.6      0.098                 25.0                  67.0   0.9968  3.20       0.68      9.8        5
2            7.8              0.76         0.04             2.3      0.092                 15.0                  54.0   0.9970  3.26       0.65      9.8        5
3           11.2              0.28         0.56             1.9      0.075                 17.0                  60.0   0.9980  3.16       0.58      9.8        6
4            7.4              0.70         0.00             1.9      0.076                 11.0                  34.0   0.9978  3.51       0.56      9.4        5
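If you are not sure which separator a downloaded file uses, you can inspect its first few lines before loading it. The sketch below uses csv.Sniffer from the standard library to make a guess. detect_delimiter is a hypothetical helper name, and the list of candidate separators and the UTF-8 encoding are assumptions you may need to adjust for your file.

```python
import csv
from itertools import islice


def detect_delimiter(path, candidates=",;\t|", sample_lines=20):
    """Guess the column separator of a text file from its first few lines."""
    with open(path, newline="", encoding="utf-8") as f:
        # Read whole lines only, so the sample never ends mid-row.
        sample = "".join(islice(f, sample_lines))
    # Sniffer raises csv.Error if it cannot settle on one of the candidates.
    return csv.Sniffer().sniff(sample, delimiters=candidates).delimiter
```

You could then pass the result to pandas, for example pd.read_csv(path, sep=detect_delimiter(path)). Treat the guess as a starting point and still check the loaded columns with head().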

3.1.1.4. Exercise

In the cell below, choose a small dataset from either Kaggle or Hugging Face and download it programmatically using Python. Then load it with pandas and display the first few rows.

Note: Avoid large datasets, as memory and execution limits may apply in your environment.

# Your code here