3.1.1. Internet Access

One of the simplest and most common ways to acquire data is to download it from the internet. Many datasets are publicly available as CSV, JSON, ZIP, or other file formats and can be downloaded directly from websites. This is the first step in many data science projects.

You can download these files to your local machine or to cloud storage, and then load them into your data analysis environment.

3.1.1.1. Downloading from websites

A lot of datasets can be downloaded directly from websites using your browser. Let’s look at a few examples:

Example: Wine Quality Dataset

Let’s start with a popular example: the Wine Quality Dataset hosted by the UC Irvine Machine Learning Repository. It records physicochemical properties of red wine samples (such as acidity, residual sugar, and alcohol content) together with a quality score for each sample.

You can manually download the dataset by clicking the “Download” button on the page. This gives you a ZIP file that contains a CSV file. Once extracted, you can view it using Excel, Google Sheets, or any text editor.
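If you would rather not extract the archive by hand, Python's standard zipfile module can do it. The sketch below is one possible approach; the helper name extract_csv_files and the archive name in the usage comment are hypothetical, chosen only for illustration.

```python
import zipfile
from pathlib import Path


def extract_csv_files(zip_path, dest_dir="."):
    """Extract every .csv member of a ZIP archive and return the extracted paths."""
    dest_dir = Path(dest_dir)
    with zipfile.ZipFile(zip_path) as archive:
        # Keep only members whose name ends in .csv (case-insensitive).
        csv_members = [
            name for name in archive.namelist() if name.lower().endswith(".csv")
        ]
        # extract() writes the member under dest_dir and returns its path.
        return [Path(archive.extract(name, path=dest_dir)) for name in csv_members]


# Usage (hypothetical archive name):
# extract_csv_files("wine_quality.zip", dest_dir="data")
```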

Example: Kaggle Dataset

Platforms like Kaggle also host a wide range of datasets. After creating a free account, you can search for any dataset and download it with one click.

For example, try downloading the Melbourne Housing Dataset. It will provide a CSV file that you can open using spreadsheet software or programmatically load using Python.

3.1.1.2. Downloading Files Programmatically

Manually downloading datasets is fine for one-off tasks, but automation becomes essential when you’re working with multiple files or regularly updated data.

Here’s how to download the Wine Quality dataset using Python:

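import requests

wine_dataset_url = "https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv"

# Fetch the file over HTTP. raise_for_status() stops on a 4xx/5xx response
# instead of silently saving an error page as the CSV.
response = requests.get(wine_dataset_url, timeout=30)
response.raise_for_status()

# Write the raw bytes to disk unchanged.
with open("winequality-red.csv", "wb") as file:
    file.write(response.content)

print("Wine Quality Dataset downloaded successfully!")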

Wine Quality Dataset downloaded successfully!

Note

We use Python’s requests library to make an HTTP request and download the file, much as your browser does. The official requests documentation covers the library in more detail.
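When you need several files, or want to re-run a notebook without downloading everything again, the same idea extends to a small helper. The sketch below is only one way to do it: the function name download_if_missing and the data directory are hypothetical, and it uses urllib.request from the standard library instead of requests so that it runs without extra installs.

```python
import shutil
import urllib.request
from pathlib import Path


def download_if_missing(url, dest_dir="data", timeout=30):
    """Download url into dest_dir unless a file with that name already exists."""
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    # Use the last path segment of the URL as the local file name.
    target = dest_dir / url.rsplit("/", 1)[-1]
    if target.exists():
        return target  # already downloaded, skip the request
    # Write to a temporary name first so an interrupted download
    # is never mistaken for a complete file on the next run.
    partial = target.with_name(target.name + ".part")
    # urlopen raises urllib.error.HTTPError for 4xx/5xx responses.
    with urllib.request.urlopen(url, timeout=timeout) as response, open(partial, "wb") as out:
        shutil.copyfileobj(response, out)
    partial.replace(target)
    return target


# Usage (performs a real download):
# urls = ["https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv"]
# for url in urls:
#     print(download_if_missing(url))
```

Skipping files that already exist keeps repeated runs fast, but it also means a dataset that changes upstream is not refreshed until you delete the local copy.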

You can now load the downloaded file with pandas, a Python library for working with tabular data. The official pandas documentation is a good place to learn more, and we’ll cover some basic functionality in later chapters.

import pandas as pd

# This file separates fields with semicolons rather than commas.
wine_data = pd.read_csv("winequality-red.csv", sep=";")
wine_data.head()

	fixed acidity	volatile acidity	citric acid	residual sugar	chlorides	free sulfur dioxide	total sulfur dioxide	density	pH	sulphates	alcohol	quality
0	7.4	0.70	0.00	1.9	0.076	11.0	34.0	0.9978	3.51	0.56	9.4	5
1	7.8	0.88	0.00	2.6	0.098	25.0	67.0	0.9968	3.20	0.68	9.8	5
2	7.8	0.76	0.04	2.3	0.092	15.0	54.0	0.9970	3.26	0.65	9.8	5
3	11.2	0.28	0.56	1.9	0.075	17.0	60.0	0.9980	3.16	0.58	9.8	6
4	7.4	0.70	0.00	1.9	0.076	11.0	34.0	0.9978	3.51	0.56	9.4	5
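When you don't know in advance which delimiter a downloaded file uses, you can guess it from a small sample instead of opening the file by hand. The sketch below uses csv.Sniffer from the standard library; the helper name detect_delimiter and the list of candidate delimiters are assumptions made for illustration, and the guess can be wrong on irregular files.

```python
import csv


def detect_delimiter(path, candidates=";,\t|", sample_size=4096):
    """Guess the field delimiter of a text file from its first few kilobytes."""
    with open(path, newline="", encoding="utf-8") as f:
        sample = f.read(sample_size)
    # sniff() raises csv.Error if none of the candidates fits the sample.
    return csv.Sniffer().sniff(sample, delimiters=candidates).delimiter


# Usage with pandas:
# sep = detect_delimiter("winequality-red.csv")
# wine_data = pd.read_csv("winequality-red.csv", sep=sep)
```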

3.1.1.3. Popular Dataset Sources

As data-driven work has grown, so has the number of publicly available datasets, and several platforms have emerged to host them. These platforms spare publishers from running their own distribution infrastructure and give users a consistent way to find and download data.

Two of the most commonly used platforms for dataset hosting today are:

Kaggle

Kaggle, a Google-owned platform, offers thousands of datasets across various domains. You can use Kaggle’s kagglehub Python library to download datasets directly into your workspace.

Here’s how to download the Melbourne Housing dataset:

Warning

You’ll need to set up API credentials for this to work.
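Before running the download, it can help to check whether credentials are already configured. The sketch below assumes the two commonly documented options: the KAGGLE_USERNAME and KAGGLE_KEY environment variables, or a kaggle.json file in the .kaggle folder of your home directory. The helper name kaggle_credentials_present is hypothetical, and newer kagglehub releases may support other mechanisms that this check does not cover.

```python
import json
import os
from pathlib import Path


def kaggle_credentials_present(env=None, home=None):
    """Return True if Kaggle credentials appear to be configured locally."""
    env = os.environ if env is None else env
    home = Path.home() if home is None else Path(home)
    # Option 1: both environment variables are set to non-empty values.
    if env.get("KAGGLE_USERNAME") and env.get("KAGGLE_KEY"):
        return True
    # Option 2: a kaggle.json file with "username" and "key" entries.
    config_file = home / ".kaggle" / "kaggle.json"
    if not config_file.is_file():
        return False
    try:
        config = json.loads(config_file.read_text())
    except (OSError, json.JSONDecodeError):
        return False
    return isinstance(config, dict) and bool(config.get("username") and config.get("key"))


# Usage:
# if not kaggle_credentials_present():
#     print("Kaggle credentials not found; set them up before downloading.")
```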

import os

import kagglehub
import pandas as pd

# dataset_download returns the local directory that holds the dataset files.
download_path = kagglehub.dataset_download("anthonypino/melbourne-housing-market")

housing_data = pd.read_csv(os.path.join(download_path, "Melbourne_housing_FULL.csv"))
housing_data.head()

Warning: Looks like you're using an outdated `kagglehub` version (installed: 0.3.12), please consider upgrading to the latest version (1.0.0).

Downloading from https://www.kaggle.com/api/v1/datasets/download/anthonypino/melbourne-housing-market?dataset_version_number=27...

  0%|          | 0.00/2.28M [00:00<?, ?B/s]

100%|██████████| 2.28M/2.28M [00:00<00:00, 76.5MB/s]

Extracting files...

	Suburb	Address	Rooms	Type	Price	Method	SellerG	Date	Distance	Postcode	...	Bathroom	Car	Landsize	BuildingArea	YearBuilt	CouncilArea	Lattitude	Longtitude	Regionname	Propertycount
0	Abbotsford	68 Studley St	2	h	NaN	SS	Jellis	3/09/2016	2.5	3067.0	...	1.0	1.0	126.0	NaN	NaN	Yarra City Council	-37.8014	144.9958	Northern Metropolitan	4019.0
1	Abbotsford	85 Turner St	2	h	1480000.0	S	Biggin	3/12/2016	2.5	3067.0	...	1.0	1.0	202.0	NaN	NaN	Yarra City Council	-37.7996	144.9984	Northern Metropolitan	4019.0
2	Abbotsford	25 Bloomburg St	2	h	1035000.0	S	Biggin	4/02/2016	2.5	3067.0	...	1.0	0.0	156.0	79.0	1900.0	Yarra City Council	-37.8079	144.9934	Northern Metropolitan	4019.0
3	Abbotsford	18/659 Victoria St	3	u	NaN	VB	Rounds	4/02/2016	2.5	3067.0	...	2.0	1.0	0.0	NaN	NaN	Yarra City Council	-37.8114	145.0116	Northern Metropolitan	4019.0
4	Abbotsford	5 Charles St	3	h	1465000.0	SP	Biggin	4/03/2017	2.5	3067.0	...	2.0	0.0	134.0	150.0	1900.0	Yarra City Council	-37.8093	144.9944	Northern Metropolitan	4019.0

5 rows × 21 columns
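The example above hard-codes the CSV file name. For a dataset you haven't seen before, you can list what the download directory contains first. A minimal sketch, with a hypothetical helper name find_data_files:

```python
from pathlib import Path


def find_data_files(directory, pattern="*.csv"):
    """Return all files under directory that match pattern, sorted by path."""
    # rglob searches the directory and all of its subdirectories.
    return sorted(p for p in Path(directory).rglob(pattern) if p.is_file())


# Usage with the directory returned by kagglehub:
# for path in find_data_files(download_path):
#     print(path.name, path.stat().st_size, "bytes")
```

Printing the file sizes alongside the names also tells you whether a file is small enough to load in one go.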

Hugging Face Datasets

Hugging Face Datasets is another rich resource, especially for natural language processing and other machine learning tasks. Its datasets Python library lets you load a dataset by name.

Let’s load the IMDB movie review dataset:

from datasets import load_dataset

# Load only the training split, then convert it to a pandas DataFrame to inspect it.
dataset = load_dataset("imdb", split="train")
dataset.to_pandas().head()

	text	label
0	I rented I AM CURIOUS-YELLOW from my video sto...	0
1	"I Am Curious: Yellow" is a risible and preten...	0
2	If only to avoid making this type of film in t...	0
3	This film was probably inspired by Godard's Ma...	0
4	Oh, brother...after hearing about this ridicul...	0

3.1.1.4. Exercise

In the cell below, choose a small dataset from either Kaggle or Hugging Face and download it programmatically using Python. Load it with pandas and display the first few rows.

Note: Avoid large datasets, as memory and execution limits may apply in your environment.
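If you're unsure how large a file is, one way to stay within those limits is to look at only its first few rows before loading all of it. The sketch below does this with the standard csv module; the helper name preview_rows is hypothetical. With pandas, passing nrows to read_csv has a similar effect.

```python
import csv
from itertools import islice


def preview_rows(path, n=5, delimiter=","):
    """Return the header and the first n data rows without reading the whole file."""
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f, delimiter=delimiter)
        header = next(reader, [])  # empty list if the file has no lines
        rows = list(islice(reader, n))  # stops after n rows
    return header, rows


# Usage:
# header, rows = preview_rows("winequality-red.csv", n=3, delimiter=";")
```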

# Your code here