This is the web page for Introduction to Data Science at the University of Florida.
We are going to explore a weather API.
The data we will look at comes from weather.gov.
Let's choose five cities to investigate. We need to get the latitude and longitude of each city. We can use the Nominatim API.
Nominatim is an open-source geocoding service. Its API can retrieve lots of information about cities and other places¹.
Use the search API https://nominatim.org/release-docs/develop/api/Search/
We can use the UI to search https://nominatim.openstreetmap.org/ui/search.html
UI: https://nominatim.openstreetmap.org/ui/search.html?q=Gainesville%2C+Florida
API: https://nominatim.openstreetmap.org/search?q=Gainesville,Florida&format=json
[
{
"place_id": 280324401,
"licence": "Data © OpenStreetMap contributors, ODbL 1.0. http://osm.org/copyright",
"osm_type": "relation",
"osm_id": 118870,
"lat": "29.6519684",
"lon": "-82.3249846",
"category": "boundary",
"type": "administrative",
"place_rank": 16,
"importance": 0.606121342054947,
"addresstype": "city",
"name": "Gainesville",
"display_name": "Gainesville, Alachua County, Florida, United States",
"boundingbox": [
"29.5978621",
"29.7783723",
"-82.4222531",
"-82.2223771"
]
}
]
UI: https://nominatim.openstreetmap.org/ui/search.html?q=Norman%2C+Oklahoma
API: https://nominatim.openstreetmap.org/search.php?q=Norman%2C+Oklahoma&format=jsonv2
[
{
"place_id": 307249622,
"licence": "Data © OpenStreetMap contributors, ODbL 1.0. http://osm.org/copyright",
"osm_type": "relation",
"osm_id": 185049,
"lat": "35.2225717",
"lon": "-97.4394816",
"category": "boundary",
"type": "administrative",
"place_rank": 16,
"importance": 0.578488326850917,
"addresstype": "city",
"name": "Norman",
"display_name": "Norman, Cleveland County, Oklahoma, United States",
"boundingbox": [
"35.1453180",
"35.3483240",
"-97.5474454",
"-97.1769180"
]
}
]
UI: https://nominatim.openstreetmap.org/ui/search.html?q=Compton%2C+California
API: https://nominatim.openstreetmap.org/search.php?q=Compton%2C+California&format=jsonv2
[
{
"place_id": 292329971,
"licence": "Data © OpenStreetMap contributors, ODbL 1.0. http://osm.org/copyright",
"osm_type": "relation",
"osm_id": 112057,
"lat": "33.894927",
"lon": "-118.226624",
"category": "boundary",
"type": "administrative",
"place_rank": 16,
"importance": 0.564093251307199,
"addresstype": "city",
"name": "Compton",
"display_name": "Compton, Los Angeles County, California, United States",
"boundingbox": [
"33.8630465",
"33.9231979",
"-118.2636590",
"-118.1799460"
]
}
]
import json
import urllib.request

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36',
    'Accept-Language': 'en-US,en;q=0.9',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8',
    'Connection': 'keep-alive',
    'Referer': 'https://www.google.com/',
    'DNT': '1',  # Do Not Track request header
    'Upgrade-Insecure-Requests': '1',
    'Cache-Control': 'no-cache',
    'Pragma': 'no-cache',
    'Sec-Fetch-Dest': 'document',
    'Sec-Fetch-Mode': 'navigate',
    'Sec-Fetch-Site': 'none',
    'Sec-Fetch-User': '?1',
}
url = "https://nominatim.openstreetmap.org/search.php?q=Compton%2C+California&format=json"
req = urllib.request.Request(url, headers=headers)
with urllib.request.urlopen(req) as response:
    data = response.read().decode('utf-8')
json_data = json.loads(data)
# json.dump writes valid JSON; print() would write Python's repr instead
with open('mydata.json', 'a') as f:
    json.dump(json_data, f)
    f.write('\n')  # one JSON document per line
Note that it is good to have a set of headers you can reuse to fetch data when needed. We can turn this into a function².
We could do this same request from the command line and save it to a file with either of the following commands:
curl -s ${URL} > citydata.json
wget -O citydata.json ${URL}
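Either way, note that the query string must be URL-encoded (the comma becomes %2C, the space becomes +). As a sketch, urllib.parse can build the encoded query for any city name:

```python
from urllib.parse import urlencode

# Build a Nominatim search URL from a plain city string
params = {'q': 'Compton, California', 'format': 'json'}
url = "https://nominatim.openstreetmap.org/search?" + urlencode(params)
print(url)
```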
Now that we have the latitude and longitude info, let's get the weather data.
You should manually explore the weather service API.
Forecasts are given on a 2.5 km × 2.5 km grid.
To get a forecast for our cities, we will need to grab the grid location that corresponds to each city's location.
The request needs to be of the form https://api.weather.gov/points/{latitude},{longitude}.
# From previous JSON output
lat = json_data[0]['lat']
lon = json_data[0]['lon']
# Template request
requesturl = f"https://api.weather.gov/points/{lat},{lon}"
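One practical detail worth checking against the docs: api.weather.gov reportedly redirects requests whose coordinates carry more than four decimal places, so it can help to round before building the URL. A minimal sketch:

```python
# Sketch: build a points URL, rounding coordinates to 4 decimal places
# (the API reportedly redirects more precise coordinates)
def points_url(lat, lon):
    return f"https://api.weather.gov/points/{round(float(lat), 4)},{round(float(lon), 4)}"

print(points_url("29.6519684", "-82.3249846"))
```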
Looking at a snippet of the result, we can retrieve the grid information
{
# ...
"forecastOffice": "https://api.weather.gov/offices/LOX",
"gridId": "LOX",
"gridX": 155,
"gridY": 38,
"forecast": "https://api.weather.gov/gridpoints/LOX/155,38/forecast",
"forecastHourly": "https://api.weather.gov/gridpoints/LOX/155,38/forecast/hourly",
"forecastGridData": "https://api.weather.gov/gridpoints/LOX/155,38",
"observationStations": "https://api.weather.gov/gridpoints/LOX/155,38/stations",
# ...
}
We will skip a few steps and use the returned link to one of the three forecast types.
To get the raw data, we can directly fetch the URL provided by forecastGridData: https://api.weather.gov/gridpoints/LOX/155,38
Inside properties we can see a list of temperature values and times. The units are wmoUnit:degC, which according to the documentation is degrees Celsius³.
# See all the keys
fetchdata("https://api.weather.gov/gridpoints/LOX/155,38")["properties"].keys()
# Let's look at temperature
fetchdata("https://api.weather.gov/gridpoints/LOX/155,38")["properties"]["temperature"]["values"]
pandas
Pandas' read_json function does not accept a Python object (and passing a literal JSON string is deprecated).
We will have to either write the data to a file or wrap it in StringIO.
import io
import pandas as pd
rawtemps = fetchdata("https://api.weather.gov/gridpoints/LOX/155,38")["properties"]["temperature"]["values"]
df = pd.read_json(io.StringIO(json.dumps(rawtemps)))
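As an aside, since rawtemps is already a list of dicts, pd.DataFrame will also accept it directly, with no JSON round trip (the values below are illustrative, not real API output):

```python
import pandas as pd

# A list of dicts shaped like the API's "values" entries
rawtemps_example = [
    {"validTime": "2025-01-01T06:00:00+00:00/PT1H", "value": 11.1},
    {"validTime": "2025-01-01T07:00:00+00:00/PT1H", "value": 10.6},
]
demo = pd.DataFrame(rawtemps_example)
print(demo.columns.tolist())
```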
We now have a dataframe with two columns: validTime (a string) and value (a float).
Let’s describe the dataframe and get a few statistics from the data.
df = pd.read_json(io.StringIO(json.dumps(rawtemps)))
print(df.describe())
Let’s rename value to Celsius and create a Fahrenheit column.
df.rename(columns={'value': 'Celsius'}, inplace=True)
df['Fahrenheit'] = df['Celsius'] * 9/5 + 32
df.head()
Let's query and group by the hour.
df['datetime'] = pd.to_datetime(df['validTime'].str.split('/').str[0])
df['hour'] = df['datetime'].dt.hour
df.groupby('hour')['Fahrenheit'].mean() # The avg temp across that hour of the day
# Which hour is most unstable?
df.groupby('hour')['Fahrenheit'].std()
PT1H is an ISO 8601 duration: P marks a period, T the time part, and 1H one hour.
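As a sketch, a validTime such as 2025-01-01T06:00:00+00:00/PT1H can be split into its start timestamp and its duration in hours (assuming whole-hour durations like PT1H or PT6H; real grid data may use other forms):

```python
import re

def split_valid_time(valid_time):
    # ISO 8601 time interval: "start/duration"
    start, duration = valid_time.split('/')
    # Pull the hour count out of durations like PT1H or PT6H
    m = re.fullmatch(r'PT(\d+)H', duration)
    hours = int(m.group(1)) if m else None
    return start, hours

print(split_valid_time("2025-01-01T06:00:00+00:00/PT1H"))
```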
Let's write this dataframe to a spreadsheet with one sheet per city.
If we create dataframes for each city, we could do something like the following.
Note: openpyxl is needed (pip install -U openpyxl).
gnv_df = df.copy()
nor_df = df.copy()
com_df = df.copy()
with pd.ExcelWriter('weather_data.xlsx') as writer:
    gnv_df.to_excel(writer, sheet_name="Gainesville")
    nor_df.to_excel(writer, sheet_name="Norman")
    com_df.to_excel(writer, sheet_name="Compton")
Note: Excel cannot store timezone-aware datetimes, so we need to strip the timezones before writing.
# datetime is already a timezone-aware datetime; just drop the timezone
df['datetime'] = df['datetime'].dt.tz_localize(None)
# validTime is still a string, so parse it first, then drop the timezone
df['validTime'] = pd.to_datetime(df['validTime'].str.split('/').str[0], errors='coerce')
df['validTime'] = df['validTime'].dt.tz_localize(None)
# Hopefully not needed to fix the timezone issue
df['validTime'] = df['validTime'].apply(lambda x: x.replace(tzinfo=None) if pd.notnull(x) and hasattr(x, 'tzinfo') else x)
Let's dump the data to SQLite.
import sqlite3

db_connection = sqlite3.connect('weather_data.db')
df.to_sql('weather', db_connection, if_exists='replace', index=False)
Let's view the schema using Python.
cursor = db_connection.cursor()
cursor.execute("PRAGMA table_info(weather)")
print(cursor.fetchall())
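Each tuple returned by PRAGMA table_info has the shape (cid, name, type, notnull, dflt_value, pk). A self-contained in-memory demonstration, using a made-up two-column table:

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute("CREATE TABLE weather (validTime TEXT, Celsius REAL)")
# Each row: (cid, name, type, notnull, dflt_value, pk)
cols = conn.execute("PRAGMA table_info(weather)").fetchall()
for cid, name, coltype, notnull, default, pk in cols:
    print(cid, name, coltype)
```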
From the sqlite3 command line, we can view the schema and check the table's size with a query.
sqlite> .schema
CREATE TABLE IF NOT EXISTS "weather" (
"validTime" TEXT,
"Celsius" REAL,
"Fahrenheit" REAL,
"datetime" TIMESTAMP,
"hour" INTEGER
);
sqlite> select count(rowid) from weather; -- count all
Let's run a SQL query over the data from Python.
import sqlite3
import pandas as pd
db_connection = sqlite3.connect('weather_data.db')
df.to_sql('weather', db_connection, if_exists='replace', index=False)
query = "SELECT hour, AVG(Celsius) AS avg_ctemp, AVG(Fahrenheit) AS avg_ftemp FROM weather GROUP BY hour"
df_sql = pd.read_sql_query(query, db_connection)
print(df_sql)
We can use read_sql_query to run SQL against SQLite through pandas.
Note: In Python, if you would like to access each column by name, you will need to set a row factory on the connection object.
db_connection = sqlite3.connect('weather_data.db')
db_connection.row_factory = sqlite3.Row
cursor = db_connection.cursor()
cursor.execute("SELECT * FROM weather")
rows = cursor.fetchall()
for row in rows:
    print(row['hour'], row['Celsius'], row['Fahrenheit'])
Let's use matplotlib.
%pip install -U matplotlib
import matplotlib.pyplot as plt
plt.scatter(df['Celsius'], df['Fahrenheit'])
plt.title('Celsius vs Fahrenheit')
plt.xlabel('Celsius')
plt.ylabel('Fahrenheit')
plt.show()
plt.hist(df['Celsius'], bins=20, edgecolor='black')
plt.title('Temperature Distribution')
plt.xlabel('Celsius')
plt.ylabel('Frequency')
plt.show()
df[['Celsius']].boxplot()
plt.title('Temperature Boxplot')
plt.show()
Find the range of temperatures (difference between max and min) for each hour.
SELECT hour,
MAX(Celsius) - MIN(Celsius) AS temp_range_celsius,
MAX(Fahrenheit) - MIN(Fahrenheit) AS temp_range_fahrenheit
FROM weather
GROUP BY hour
ORDER BY temp_range_celsius DESC;
Using window functions to compute a running average of Celsius temperatures over time.
SELECT datetime, Celsius,
AVG(Celsius) OVER (ORDER BY datetime ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS running_avg_celsius
FROM weather;
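For comparison, the same 7-row running average can be sketched in pandas with rolling (made-up values; min_periods=1 mirrors how the SQL window shrinks at the start of the table):

```python
import pandas as pd

temps = pd.DataFrame({'Celsius': [10.0, 12.0, 11.0, 15.0, 14.0]})
# ROWS BETWEEN 6 PRECEDING AND CURRENT ROW ~ a trailing window of 7 rows
temps['running_avg_celsius'] = temps['Celsius'].rolling(window=7, min_periods=1).mean()
print(temps['running_avg_celsius'].tolist())
```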
Find instances where the temperature rises by more than 11°C compared to the previous hour.
SELECT w1.datetime, w1.Celsius, w1.hour,
       w1.Celsius - w2.Celsius AS temp_difference
FROM weather w1
JOIN weather w2
  ON w2.datetime = datetime(w1.datetime, '-1 hour') -- the row exactly one hour earlier
WHERE (w1.Celsius - w2.Celsius) > 11
ORDER BY w1.datetime;
Using window functions, determine the hottest and coldest hours per day.
WITH RankedTemps AS (
SELECT *,
RANK() OVER (PARTITION BY DATE(datetime) ORDER BY Celsius DESC) AS hottest_rank,
RANK() OVER (PARTITION BY DATE(datetime) ORDER BY Celsius ASC) AS coldest_rank
FROM weather
)
SELECT datetime, Celsius, Fahrenheit, hour
FROM RankedTemps
WHERE hottest_rank = 1 OR coldest_rank = 1
ORDER BY datetime;
Categorize hours into peak (8 AM - 8 PM) and off-peak hours and calculate their respective average temperatures.
SELECT
CASE
WHEN hour BETWEEN 8 AND 20 THEN 'Peak Hours'
ELSE 'Off-Peak Hours'
END AS time_period,
AVG(Celsius) AS avg_temp_celsius,
AVG(Fahrenheit) AS avg_temp_fahrenheit
FROM weather
GROUP BY time_period;
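The same peak/off-peak bucketing can be sketched in pandas (illustrative values):

```python
import pandas as pd

sample = pd.DataFrame({'hour': [3, 9, 14, 22], 'Celsius': [8.0, 15.0, 21.0, 10.0]})
# hours 8-20 inclusive count as peak, everything else off-peak
sample['time_period'] = sample['hour'].between(8, 20).map({True: 'Peak Hours', False: 'Off-Peak Hours'})
avg = sample.groupby('time_period')['Celsius'].mean()
print(avg)
```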
Identify periods where the temperature continuously rises for at least 5 consecutive hours.
WITH TempTrends AS (
SELECT datetime, hour, Celsius,
Celsius - LAG(Celsius, 1) OVER (ORDER BY datetime) AS temp_diff
FROM weather
),
ConsecutiveIncreases AS (
SELECT datetime, hour, Celsius,
SUM(CASE WHEN temp_diff > 0 THEN 1 ELSE 0 END)
OVER (ORDER BY datetime ROWS BETWEEN 4 PRECEDING AND CURRENT ROW) AS rising_count
FROM TempTrends
)
SELECT datetime, hour, Celsius
FROM ConsecutiveIncreases
WHERE rising_count = 5
ORDER BY datetime;
Find days where the difference between the highest and lowest temperatures was the greatest.
SELECT DATE(datetime) AS date,
MAX(Celsius) - MIN(Celsius) AS temp_fluctuation
FROM weather
GROUP BY date
ORDER BY temp_fluctuation DESC;
Determine the average temperature for each hour of the day across all recorded days.
SELECT hour,
AVG(Celsius) AS avg_celsius,
AVG(Fahrenheit) AS avg_fahrenheit
FROM weather
GROUP BY hour
ORDER BY hour;
Identify records where the temperature exceeded the daily average.
WITH DailyAvg AS (
SELECT DATE(datetime) AS date, AVG(Celsius) AS avg_temp
FROM weather
GROUP BY date
)
SELECT w.datetime, w.Celsius, d.avg_temp
FROM weather w
JOIN DailyAvg d ON DATE(w.datetime) = d.date
WHERE w.Celsius > d.avg_temp
ORDER BY w.datetime;
Aggregate the temperatures over a weekly period to analyze trends.
SELECT strftime('%Y-%W', datetime) AS week,
AVG(Celsius) AS avg_weekly_temp_celsius,
AVG(Fahrenheit) AS avg_weekly_temp_fahrenheit
FROM weather
GROUP BY week
ORDER BY week;
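In pandas, an equivalent weekly aggregation can be sketched with resample (made-up dates; note that resample('W') buckets by week ending Sunday, while strftime('%W') uses Monday-based week numbers, so bucket edges may differ):

```python
import pandas as pd

obs = pd.DataFrame({
    'datetime': pd.to_datetime(['2025-01-06', '2025-01-07', '2025-01-13']),
    'Celsius': [10.0, 12.0, 8.0],
})
# Weekly mean temperature, indexed by the week-ending date
weekly = obs.set_index('datetime')['Celsius'].resample('W').mean()
print(weekly.tolist())
```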
A city is just one place rank (place_rank 16 in the results above). ↩
This is a function for fetching data.
import json
import urllib.request

def fetchdata(url, isjson=True):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36',
        'Accept-Language': 'en-US,en;q=0.9',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8',
        'Connection': 'keep-alive',
        'Referer': 'https://www.google.com/',
        'DNT': '1',  # Do Not Track request header
        'Upgrade-Insecure-Requests': '1',
        'Cache-Control': 'no-cache',
        'Pragma': 'no-cache',
        'Sec-Fetch-Dest': 'document',
        'Sec-Fetch-Mode': 'navigate',
        'Sec-Fetch-Site': 'none',
        'Sec-Fetch-User': '?1',
    }
    req = urllib.request.Request(url, headers=headers)
    with urllib.request.urlopen(req) as response:
        data = response.read().decode('utf-8')
    if isjson:
        return json.loads(data)
    return data
The WMO code says it is degrees Celsius. ↩