3.1.1. Internet Access
One of the simplest and most common ways to acquire data is by downloading it from the internet. Many datasets are publicly available and can be downloaded directly from websites as CSV, JSON, ZIP, or other file formats. This is often the first step in many data science projects.
You can download these files to your local machine or to cloud storage, and then load them into your data analysis environment.
3.1.1.1. Downloading from Websites
A lot of datasets can be downloaded directly from websites using your browser. Let’s look at a few examples:
Example: Wine Quality Dataset
Let’s start with a popular example: the Wine Quality Dataset published by UC Irvine. It describes red wine samples through physicochemical properties such as acidity, residual sugar, and alcohol content, along with a quality rating for each sample.
You can manually download the dataset by clicking the “Download” button on the page. This gives you a ZIP file that contains a CSV file. Once extracted, you can view it using Excel, Google Sheets, or any text editor.
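If you would rather not extract the archive by hand, the same step can be scripted. The following is a minimal sketch using only Python's standard library; the function name `extract_csv_files` and the in-memory demo archive are illustrative assumptions, not part of the dataset page.

```python
import io
import tempfile
import zipfile
from pathlib import Path


def extract_csv_files(zip_source, target_dir):
    """Extract every .csv member of a ZIP archive into target_dir.

    zip_source may be a file path or a binary file-like object.
    Returns the paths of the extracted files.
    """
    target_dir = Path(target_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    extracted = []
    with zipfile.ZipFile(zip_source) as archive:
        for name in archive.namelist():
            if name.lower().endswith(".csv"):
                # extract() returns the path of the file it wrote
                extracted.append(Path(archive.extract(name, path=target_dir)))
    return extracted


# Demo: a small in-memory archive stands in for a downloaded ZIP file.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("example.csv", "a;b\n1;2\n")
    archive.writestr("README.txt", "not a CSV")
buffer.seek(0)

with tempfile.TemporaryDirectory() as tmp:
    paths = extract_csv_files(buffer, tmp)
    print([p.name for p in paths])
```

For a real download, you would pass the path of the ZIP file saved by your browser instead of the in-memory buffer.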
Example: Kaggle Dataset
Platforms like Kaggle also host a wide range of datasets. After creating a free account, you can search for any dataset and download it with one click.
For example, try downloading the Melbourne Housing Dataset. It will provide a CSV file that you can open using spreadsheet software or programmatically load using Python.
3.1.1.2. Downloading Files Programmatically
Manually downloading datasets is fine for one-off tasks, but automation becomes essential when you’re working with multiple files or regularly updated data.
Here’s how to download the Wine Quality dataset using Python:
import requests

wine_dataset_url = "https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv"
response = requests.get(wine_dataset_url, timeout=30)
response.raise_for_status()  # stop on HTTP errors instead of saving an error page

# Save the raw response bytes to disk
with open("winequality-red.csv", "wb") as file:
    file.write(response.content)

print("Wine Quality Dataset downloaded successfully!")
Wine Quality Dataset downloaded successfully!
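When you need several files, or rerun the same notebook often, it helps to wrap the download in a function that skips files you already have. The sketch below is one possible approach, not the method used in this chapter: it swaps `requests` for the standard library's `urllib.request` so it has no third-party dependency, and the names `download_file` and `download_all` are hypothetical helpers.

```python
import shutil
import urllib.request
from pathlib import Path


def download_file(url, dest, timeout=30):
    """Download url to dest unless dest already exists; return the path."""
    dest = Path(dest)
    if dest.exists():
        return dest  # already downloaded, skip the network round trip
    dest.parent.mkdir(parents=True, exist_ok=True)
    # Write to a temporary name first so an interrupted download
    # is never mistaken for a complete file on the next run.
    partial = dest.with_name(dest.name + ".part")
    # urlopen raises urllib.error.HTTPError for 4xx/5xx responses
    with urllib.request.urlopen(url, timeout=timeout) as response, open(partial, "wb") as out:
        shutil.copyfileobj(response, out)  # copy in chunks, not all at once
    partial.replace(dest)
    return dest


def download_all(urls, target_dir):
    """Download each URL into target_dir, named after the URL's last path segment."""
    target_dir = Path(target_dir)
    return [download_file(url, target_dir / url.rsplit("/", 1)[-1]) for url in urls]
```

With these helpers, `download_all([wine_dataset_url], "data")` would save the file as `data/winequality-red.csv` on the first run and return immediately on later runs.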
Note
We use Python’s requests library to make an HTTP request and download the file, the same way your browser does. See the requests documentation to learn more.
You can now load the downloaded dataset using pandas, another Python library that makes it easy to manage and work with tabular data. See the pandas documentation to learn more. We’ll also cover some basic functionality in later chapters.
import pandas as pd

# This file separates fields with semicolons rather than commas
wine_data = pd.read_csv("winequality-red.csv", sep=";")
wine_data.head()
| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.4 | 0.70 | 0.00 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |
| 1 | 7.8 | 0.88 | 0.00 | 2.6 | 0.098 | 25.0 | 67.0 | 0.9968 | 3.20 | 0.68 | 9.8 | 5 |
| 2 | 7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15.0 | 54.0 | 0.9970 | 3.26 | 0.65 | 9.8 | 5 |
| 3 | 11.2 | 0.28 | 0.56 | 1.9 | 0.075 | 17.0 | 60.0 | 0.9980 | 3.16 | 0.58 | 9.8 | 6 |
| 4 | 7.4 | 0.70 | 0.00 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |
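The `sep` argument matters because not every "CSV" file uses commas. If you are unsure which delimiter a file uses, you can inspect a small sample before loading it. The sketch below uses `csv.Sniffer` from the standard library; the helper name `detect_delimiter` and the three-line sample are made up for illustration.

```python
import csv


def detect_delimiter(sample_text, candidates=";,\t|"):
    """Guess which of the candidate characters separates the fields."""
    return csv.Sniffer().sniff(sample_text, delimiters=candidates).delimiter


# A made-up sample shaped like the first lines of a semicolon-separated file.
sample = (
    "fixed acidity;volatile acidity;citric acid\n"
    "7.4;0.70;0.00\n"
    "7.8;0.88;0.00\n"
)
print(repr(detect_delimiter(sample)))
```

For a file on disk, you could read the first couple of kilobytes with `open(path, newline="").read(2048)` and pass that text to `detect_delimiter`, then hand the result to `pd.read_csv` as `sep`. The sniffer is a heuristic, so it is worth checking the loaded columns afterwards.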
3.1.1.3. Popular Dataset Sources
As data-driven work has gained prominence, a large number of datasets are now available and continuously being generated. In response, several platforms have emerged to host these datasets. These platforms not only relieve dataset publishers from the burden of managing infrastructure for data distribution but also provide users with a streamlined and consistent way to access datasets.
Two of the most commonly used platforms for dataset hosting today are:
Kaggle
Kaggle, a Google-owned platform, offers thousands of datasets across various domains. You can use Kaggle’s kagglehub Python library, which calls the Kaggle API, to download datasets directly into your workspace.
Here’s how to download the Melbourne Housing dataset:
Warning
You’ll need to set up API credentials for this to work.
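Before running the download, it can save time to check whether credentials are configured at all. The sketch below assumes the two commonly documented locations for Kaggle credentials: the `KAGGLE_USERNAME` and `KAGGLE_KEY` environment variables, or a `~/.kaggle/kaggle.json` file with `username` and `key` fields. Your setup may use a different mechanism, and the helper name `kaggle_credentials_available` is hypothetical.

```python
import json
import os
from pathlib import Path


def kaggle_credentials_available(env=None, home=None):
    """Return True if Kaggle credentials appear to be configured.

    Assumption: credentials come either from the KAGGLE_USERNAME and
    KAGGLE_KEY environment variables or from ~/.kaggle/kaggle.json.
    """
    env = os.environ if env is None else env
    home = Path.home() if home is None else Path(home)

    if env.get("KAGGLE_USERNAME") and env.get("KAGGLE_KEY"):
        return True

    config_file = home / ".kaggle" / "kaggle.json"
    if not config_file.is_file():
        return False
    try:
        config = json.loads(config_file.read_text())
    except (OSError, json.JSONDecodeError):
        return False  # unreadable or malformed file counts as "not configured"
    return isinstance(config, dict) and bool(config.get("username")) and bool(config.get("key"))


print("Kaggle credentials found:", kaggle_credentials_available())
```

This only checks that credentials exist, not that they are valid; an expired or mistyped key would still fail at download time.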
import os

import pandas as pd
import kagglehub

# dataset_download returns the local directory that holds the dataset files
download_path = kagglehub.dataset_download("anthonypino/melbourne-housing-market")
housing_data = pd.read_csv(os.path.join(download_path, "Melbourne_housing_FULL.csv"))
housing_data.head()
Downloading from https://www.kaggle.com/api/v1/datasets/download/anthonypino/melbourne-housing-market?dataset_version_number=27...
Extracting files...
| | Suburb | Address | Rooms | Type | Price | Method | SellerG | Date | Distance | Postcode | ... | Bathroom | Car | Landsize | BuildingArea | YearBuilt | CouncilArea | Lattitude | Longtitude | Regionname | Propertycount |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Abbotsford | 68 Studley St | 2 | h | NaN | SS | Jellis | 3/09/2016 | 2.5 | 3067.0 | ... | 1.0 | 1.0 | 126.0 | NaN | NaN | Yarra City Council | -37.8014 | 144.9958 | Northern Metropolitan | 4019.0 |
| 1 | Abbotsford | 85 Turner St | 2 | h | 1480000.0 | S | Biggin | 3/12/2016 | 2.5 | 3067.0 | ... | 1.0 | 1.0 | 202.0 | NaN | NaN | Yarra City Council | -37.7996 | 144.9984 | Northern Metropolitan | 4019.0 |
| 2 | Abbotsford | 25 Bloomburg St | 2 | h | 1035000.0 | S | Biggin | 4/02/2016 | 2.5 | 3067.0 | ... | 1.0 | 0.0 | 156.0 | 79.0 | 1900.0 | Yarra City Council | -37.8079 | 144.9934 | Northern Metropolitan | 4019.0 |
| 3 | Abbotsford | 18/659 Victoria St | 3 | u | NaN | VB | Rounds | 4/02/2016 | 2.5 | 3067.0 | ... | 2.0 | 1.0 | 0.0 | NaN | NaN | Yarra City Council | -37.8114 | 145.0116 | Northern Metropolitan | 4019.0 |
| 4 | Abbotsford | 5 Charles St | 3 | h | 1465000.0 | SP | Biggin | 4/03/2017 | 2.5 | 3067.0 | ... | 2.0 | 0.0 | 134.0 | 150.0 | 1900.0 | Yarra City Council | -37.8093 | 144.9944 | Northern Metropolitan | 4019.0 |
5 rows × 21 columns
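The example above relies on knowing the CSV file name in advance. When you do not know which files a dataset contains, you can list the download directory first. This is a small standard-library sketch; `list_csv_files` is a hypothetical helper, and the temporary directory in the demo stands in for the directory returned by the download call.

```python
import tempfile
from pathlib import Path


def list_csv_files(directory):
    """Return all .csv files under directory (searched recursively), sorted by path."""
    return sorted(Path(directory).rglob("*.csv"))


# Demo: a temporary directory stands in for a downloaded dataset folder.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "a.csv").write_text("x\n1\n")
    (Path(tmp) / "notes.txt").write_text("ignore me")
    print([p.name for p in list_csv_files(tmp)])
```

In the Kaggle example, calling `list_csv_files(download_path)` would show the available CSV files, and any of the returned paths can be passed straight to `pd.read_csv`.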
Hugging Face Datasets
Hugging Face Datasets is another rich resource, especially for natural language processing and machine learning. You can access its datasets using the datasets library.
Let’s load the IMDB movie review dataset:
from datasets import load_dataset

# Load only the training split of the IMDB reviews dataset
dataset = load_dataset("imdb", split="train")
dataset.to_pandas().head()
| | text | label |
|---|---|---|
| 0 | I rented I AM CURIOUS-YELLOW from my video sto... | 0 |
| 1 | "I Am Curious: Yellow" is a risible and preten... | 0 |
| 2 | If only to avoid making this type of film in t... | 0 |
| 3 | This film was probably inspired by Godard's Ma... | 0 |
| 4 | Oh, brother...after hearing about this ridicul... | 0 |
3.1.1.4. Exercise
In the cell below, choose a small dataset from either Kaggle or Hugging Face and download it programmatically using Python. Load it with pandas and display the first few rows.
Note: Avoid large datasets, as memory and execution limits may apply in your environment.
# Your code here