This is the web page for Introduction to Data Science at the University of Florida.
We are going to explore a weather API.
The data we will look at comes from weather.gov.
Let's choose five cities to investigate. We need to get the latitude and longitude of each city. We can use the Nominatim API.
Nominatim is an open-source geocoding service. Its API can retrieve lots of information about cities and other places¹.
Use the search API https://nominatim.org/release-docs/develop/api/Search/
We can use the UI to search https://nominatim.openstreetmap.org/ui/search.html
UI: https://nominatim.openstreetmap.org/ui/search.html?q=Gainesville%2C+Florida
API: https://nominatim.openstreetmap.org/search?q=Gainesville,Florida&format=json
[
{
"place_id": 280324401,
"licence": "Data © OpenStreetMap contributors, ODbL 1.0. http://osm.org/copyright",
"osm_type": "relation",
"osm_id": 118870,
"lat": "29.6519684",
"lon": "-82.3249846",
"category": "boundary",
"type": "administrative",
"place_rank": 16,
"importance": 0.606121342054947,
"addresstype": "city",
"name": "Gainesville",
"display_name": "Gainesville, Alachua County, Florida, United States",
"boundingbox": [
"29.5978621",
"29.7783723",
"-82.4222531",
"-82.2223771"
]
}
]
UI: https://nominatim.openstreetmap.org/ui/search.html?q=Norman%2C+Oklahoma
API: https://nominatim.openstreetmap.org/search.php?q=Norman%2C+Oklahoma&format=jsonv2
[
{
"place_id": 307249622,
"licence": "Data © OpenStreetMap contributors, ODbL 1.0. http://osm.org/copyright",
"osm_type": "relation",
"osm_id": 185049,
"lat": "35.2225717",
"lon": "-97.4394816",
"category": "boundary",
"type": "administrative",
"place_rank": 16,
"importance": 0.578488326850917,
"addresstype": "city",
"name": "Norman",
"display_name": "Norman, Cleveland County, Oklahoma, United States",
"boundingbox": [
"35.1453180",
"35.3483240",
"-97.5474454",
"-97.1769180"
]
}
]
UI: https://nominatim.openstreetmap.org/ui/search.html?q=Compton%2C+California
API: https://nominatim.openstreetmap.org/search.php?q=Compton%2C+California&format=jsonv2
[
{
"place_id": 292329971,
"licence": "Data © OpenStreetMap contributors, ODbL 1.0. http://osm.org/copyright",
"osm_type": "relation",
"osm_id": 112057,
"lat": "33.894927",
"lon": "-118.226624",
"category": "boundary",
"type": "administrative",
"place_rank": 16,
"importance": 0.564093251307199,
"addresstype": "city",
"name": "Compton",
"display_name": "Compton, Los Angeles County, California, United States",
"boundingbox": [
"33.8630465",
"33.9231979",
"-118.2636590",
"-118.1799460"
]
}
]
import json
import urllib.request

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36',
    'Accept-Language': 'en-US,en;q=0.9',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8',
    'Connection': 'keep-alive',
    'Referer': 'https://www.google.com/',
    'DNT': '1',  # Do Not Track request header
    'Upgrade-Insecure-Requests': '1',
    'Cache-Control': 'no-cache',
    'Pragma': 'no-cache',
    'Sec-Fetch-Dest': 'document',
    'Sec-Fetch-Mode': 'navigate',
    'Sec-Fetch-Site': 'none',
    'Sec-Fetch-User': '?1',
}
url = "https://nominatim.openstreetmap.org/search.php?q=Compton%2C+California&format=json"
req = urllib.request.Request(url, headers=headers)
with urllib.request.urlopen(req) as response:
    data = response.read().decode('utf-8')
json_data = json.loads(data)
# json.dump writes valid JSON; print() would write Python's repr instead
with open('mydata.json', 'a') as f:
    json.dump(json_data, f)
    f.write('\n')  # one JSON document per line
Note that it is good to have a set of headers you can reuse to fetch data when needed. We can turn this into a function².
We could do this same request from the command line and save it to a file with either of the following commands:
curl -s ${URL} > citydata.json
wget -O citydata.json ${URL}
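Either way, note that the query string must be URL-encoded (the comma becomes %2C, the space becomes +). As a sketch, urllib.parse can build the encoded query for any city name:

```python
from urllib.parse import urlencode

# Build a Nominatim search URL from a plain city string
params = {'q': 'Compton, California', 'format': 'json'}
url = "https://nominatim.openstreetmap.org/search?" + urlencode(params)
print(url)
```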
Now that we have the latitude and longitude info, let's get the weather data.
You should manually explore the weather service API.
Forecasts are given on a 2.5 km × 2.5 km grid.
To get a forecast for our cities, we will need to grab the grid location that corresponds to each city's location.
The request needs to be of the form https://api.weather.gov/points/{latitude},{longitude}.
# From previous JSON output
lat = json_data[0]['lat']
lon = json_data[0]['lon']
# Template request
requesturl = f"https://api.weather.gov/points/{lat},{lon}"
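One practical detail worth checking against the docs: api.weather.gov reportedly redirects requests whose coordinates carry more than four decimal places, so it can help to round before building the URL. A minimal sketch:

```python
# Sketch: build a points URL, rounding coordinates to 4 decimal places
# (the API reportedly redirects more precise coordinates)
def points_url(lat, lon):
    return f"https://api.weather.gov/points/{round(float(lat), 4)},{round(float(lon), 4)}"

print(points_url("29.6519684", "-82.3249846"))
```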
Looking at a snippet of the result, we can retrieve the grid information
{
# ...
"forecastOffice": "https://api.weather.gov/offices/LOX",
"gridId": "LOX",
"gridX": 155,
"gridY": 38,
"forecast": "https://api.weather.gov/gridpoints/LOX/155,38/forecast",
"forecastHourly": "https://api.weather.gov/gridpoints/LOX/155,38/forecast/hourly",
"forecastGridData": "https://api.weather.gov/gridpoints/LOX/155,38",
"observationStations": "https://api.weather.gov/gridpoints/LOX/155,38/stations",
# ...
}
We will skip a few steps and use the returned link to one of the three forecast types.
To get the raw data, we can directly fetch the URL provided by forecastGridData: https://api.weather.gov/gridpoints/LOX/155,38
Inside properties we can see a list of temperature values and times. The units are wmoUnit:degC, which according to the documentation is degrees Celsius³.
# See all the keys
fetchdata("https://api.weather.gov/gridpoints/LOX/155,38")["properties"].keys()
# Let's look at temperature
fetchdata("https://api.weather.gov/gridpoints/LOX/155,38")["properties"]["temperature"]["values"]
pandas
Pandas' read_json function does not accept a Python object (and passing a literal JSON string is deprecated).
We will have to either write the data to a file or wrap it in StringIO.
import io
import pandas as pd
rawtemps = fetchdata("https://api.weather.gov/gridpoints/LOX/155,38")["properties"]["temperature"]["values"]
df = pd.read_json(io.StringIO(json.dumps(rawtemps)))
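As an aside, since rawtemps is already a list of dicts, pd.DataFrame will also accept it directly, with no JSON round trip (the values below are illustrative, not real API output):

```python
import pandas as pd

# A list of dicts shaped like the API's "values" entries
rawtemps_example = [
    {"validTime": "2025-01-01T06:00:00+00:00/PT1H", "value": 11.1},
    {"validTime": "2025-01-01T07:00:00+00:00/PT1H", "value": 10.6},
]
demo = pd.DataFrame(rawtemps_example)
print(demo.columns.tolist())
```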
We now have a dataframe with two columns: validTime (a string) and value (a float).
Let’s describe the dataframe and get a few statistics from the data.
df = pd.read_json(io.StringIO(json.dumps(rawtemps)))
print(df.describe())
Let’s rename value to Celsius and create a Fahrenheit column.
df.rename(columns={'value': 'Celsius'}, inplace=True)
df['Fahrenheit'] = df['Celsius'] * 9/5 + 32
df.head()
Let's query and group by the hour.
df['datetime'] = pd.to_datetime(df['validTime'].str.split('/').str[0])
df['hour'] = df['datetime'].dt.hour
df.groupby('hour')['Fahrenheit'].mean() # The avg temp across that hour of the day
# Which hour is most unstable?
df.groupby('hour')['Fahrenheit'].std()
PT1H is an ISO 8601 duration: P marks a period, T the time part, and 1H one hour.
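As a sketch, a validTime such as 2025-01-01T06:00:00+00:00/PT1H can be split into its start timestamp and its duration in hours (assuming whole-hour durations like PT1H or PT6H; real grid data may use other forms):

```python
import re

def split_valid_time(valid_time):
    # ISO 8601 time interval: "start/duration"
    start, duration = valid_time.split('/')
    # Pull the hour count out of durations like PT1H or PT6H
    m = re.fullmatch(r'PT(\d+)H', duration)
    hours = int(m.group(1)) if m else None
    return start, hours

print(split_valid_time("2025-01-01T06:00:00+00:00/PT1H"))
```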
Let's write this dataframe to a spreadsheet with one sheet per city.
If we create dataframes for each city, we could do something like the following.
Note: openpyxl is needed (pip install -U openpyxl).
gnv_df = df.copy()
nor_df = df.copy()
com_df = df.copy()
with pd.ExcelWriter('weather_data.xlsx') as writer:
    gnv_df.to_excel(writer, sheet_name="Gainesville")
    nor_df.to_excel(writer, sheet_name="Norman")
    com_df.to_excel(writer, sheet_name="Compton")
Note: Excel cannot store timezone-aware datetimes, so we need to strip the timezones before writing.
# datetime is already a timezone-aware datetime; just drop the timezone
df['datetime'] = df['datetime'].dt.tz_localize(None)
# validTime is still a string, so parse it first, then drop the timezone
df['validTime'] = pd.to_datetime(df['validTime'].str.split('/').str[0], errors='coerce')
df['validTime'] = df['validTime'].dt.tz_localize(None)
# Hopefully not needed to fix the timezone issue
df['validTime'] = df['validTime'].apply(lambda x: x.replace(tzinfo=None) if pd.notnull(x) and hasattr(x, 'tzinfo') else x)
Let's dump the data to SQLite.
import sqlite3

db_connection = sqlite3.connect('weather_data.db')
df.to_sql('weather', db_connection, if_exists='replace', index=False)
Let's view the schema using Python.
cursor = db_connection.cursor()
cursor.execute("PRAGMA table_info(weather)")
print(cursor.fetchall())
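Each tuple returned by PRAGMA table_info has the shape (cid, name, type, notnull, dflt_value, pk). A self-contained in-memory demonstration, using a made-up two-column table:

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute("CREATE TABLE weather (validTime TEXT, Celsius REAL)")
# Each row: (cid, name, type, notnull, dflt_value, pk)
cols = conn.execute("PRAGMA table_info(weather)").fetchall()
for cid, name, coltype, notnull, default, pk in cols:
    print(cid, name, coltype)
```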
From the sqlite3 command line, we can view the schema and check the table's size with a query.
sqlite> .schema
CREATE TABLE IF NOT EXISTS "weather" (
"validTime" TEXT,
"Celsius" REAL,
"Fahrenheit" REAL,
"datetime" TIMESTAMP,
"hour" INTEGER
);
sqlite> select count(rowid) from weather; -- count all
Let's run a SQL query over the data from Python.
import sqlite3
import pandas as pd
db_connection = sqlite3.connect('weather_data.db')
df.to_sql('weather', db_connection, if_exists='replace', index=False)
query = "SELECT hour, AVG(Celsius) AS avg_ctemp, AVG(Fahrenheit) AS avg_ftemp FROM weather GROUP BY hour"
df_sql = pd.read_sql_query(query, db_connection)
print(df_sql)
We can use read_sql_query to run SQL against SQLite through pandas.
Note: In Python, if you would like to access each column by name, you will need to set a row factory on the connection object.
db_connection = sqlite3.connect('weather_data.db')
db_connection.row_factory = sqlite3.Row
cursor = db_connection.cursor()
cursor.execute("SELECT * FROM weather")
rows = cursor.fetchall()
for row in rows:
    print(row['hour'], row['Celsius'], row['Fahrenheit'])
Let's use matplotlib.
%pip install -U matplotlib
import matplotlib.pyplot as plt
plt.scatter(df['Celsius'], df['Fahrenheit'])
plt.title('Celsius vs Fahrenheit')
plt.xlabel('Celsius')
plt.ylabel('Fahrenheit')
plt.show()
plt.hist(df['Celsius'], bins=20, edgecolor='black')
plt.title('Temperature Distribution')
plt.xlabel('Celsius')
plt.ylabel('Frequency')
plt.show()
df[['Celsius']].boxplot()
plt.title('Temperature Boxplot')
plt.show()
Find the range of temperatures (difference between max and min) for each hour.
SELECT hour,
MAX(Celsius) - MIN(Celsius) AS temp_range_celsius,
MAX(Fahrenheit) - MIN(Fahrenheit) AS temp_range_fahrenheit
FROM weather
GROUP BY hour
ORDER BY temp_range_celsius DESC;
Using window functions to compute a running average of Celsius temperatures over time.
SELECT datetime, Celsius,
AVG(Celsius) OVER (ORDER BY datetime ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS running_avg_celsius
FROM weather;
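For comparison, the same 7-row running average can be sketched in pandas with rolling (made-up values; min_periods=1 mirrors how the SQL window shrinks at the start of the table):

```python
import pandas as pd

temps = pd.DataFrame({'Celsius': [10.0, 12.0, 11.0, 15.0, 14.0]})
# ROWS BETWEEN 6 PRECEDING AND CURRENT ROW ~ a trailing window of 7 rows
temps['running_avg_celsius'] = temps['Celsius'].rolling(window=7, min_periods=1).mean()
print(temps['running_avg_celsius'].tolist())
```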
Find instances where the temperature rises by more than 11°C compared to the previous hour.
SELECT w1.datetime, w1.Celsius, w1.hour,
       w1.Celsius - w2.Celsius AS temp_difference
FROM weather w1
JOIN weather w2
  ON w2.datetime = datetime(w1.datetime, '-1 hour') -- the row exactly one hour earlier
WHERE (w1.Celsius - w2.Celsius) > 11
ORDER BY w1.datetime;
Using window functions, determine the hottest and coldest hours per day.
WITH RankedTemps AS (
SELECT *,
RANK() OVER (PARTITION BY DATE(datetime) ORDER BY Celsius DESC) AS hottest_rank,
RANK() OVER (PARTITION BY DATE(datetime) ORDER BY Celsius ASC) AS coldest_rank
FROM weather
)
SELECT datetime, Celsius, Fahrenheit, hour
FROM RankedTemps
WHERE hottest_rank = 1 OR coldest_rank = 1
ORDER BY datetime;
Categorize hours into peak (8 AM - 8 PM) and off-peak hours and calculate their respective average temperatures.
SELECT
CASE
WHEN hour BETWEEN 8 AND 20 THEN 'Peak Hours'
ELSE 'Off-Peak Hours'
END AS time_period,
AVG(Celsius) AS avg_temp_celsius,
AVG(Fahrenheit) AS avg_temp_fahrenheit
FROM weather
GROUP BY time_period;
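The same peak/off-peak bucketing can be sketched in pandas (illustrative values):

```python
import pandas as pd

sample = pd.DataFrame({'hour': [3, 9, 14, 22], 'Celsius': [8.0, 15.0, 21.0, 10.0]})
# hours 8-20 inclusive count as peak, everything else off-peak
sample['time_period'] = sample['hour'].between(8, 20).map({True: 'Peak Hours', False: 'Off-Peak Hours'})
avg = sample.groupby('time_period')['Celsius'].mean()
print(avg)
```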
Identify periods where the temperature continuously rises for at least 5 consecutive hours.
WITH TempTrends AS (
SELECT datetime, hour, Celsius,
Celsius - LAG(Celsius, 1) OVER (ORDER BY datetime) AS temp_diff
FROM weather
),
ConsecutiveIncreases AS (
SELECT datetime, hour, Celsius,
SUM(CASE WHEN temp_diff > 0 THEN 1 ELSE 0 END)
OVER (ORDER BY datetime ROWS BETWEEN 4 PRECEDING AND CURRENT ROW) AS rising_count
FROM TempTrends
)
SELECT datetime, hour, Celsius
FROM ConsecutiveIncreases
WHERE rising_count = 5
ORDER BY datetime;
Find days where the difference between the highest and lowest temperatures was the greatest.
SELECT DATE(datetime) AS date,
MAX(Celsius) - MIN(Celsius) AS temp_fluctuation
FROM weather
GROUP BY date
ORDER BY temp_fluctuation DESC;
Determine the average temperature for each hour of the day across all recorded days.
SELECT hour,
AVG(Celsius) AS avg_celsius,
AVG(Fahrenheit) AS avg_fahrenheit
FROM weather
GROUP BY hour
ORDER BY hour;
Identify records where the temperature exceeded the daily average.
WITH DailyAvg AS (
SELECT DATE(datetime) AS date, AVG(Celsius) AS avg_temp
FROM weather
GROUP BY date
)
SELECT w.datetime, w.Celsius, d.avg_temp
FROM weather w
JOIN DailyAvg d ON DATE(w.datetime) = d.date
WHERE w.Celsius > d.avg_temp
ORDER BY w.datetime;
Aggregate the temperatures over a weekly period to analyze trends.
SELECT strftime('%Y-%W', datetime) AS week,
AVG(Celsius) AS avg_weekly_temp_celsius,
AVG(Fahrenheit) AS avg_weekly_temp_fahrenheit
FROM weather
GROUP BY week
ORDER BY week;
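In pandas, an equivalent weekly aggregation can be sketched with resample (made-up dates; note that resample('W') buckets by week ending Sunday, while strftime('%W') uses Monday-based week numbers, so bucket edges may differ):

```python
import pandas as pd

obs = pd.DataFrame({
    'datetime': pd.to_datetime(['2025-01-06', '2025-01-07', '2025-01-13']),
    'Celsius': [10.0, 12.0, 8.0],
})
# Weekly mean temperature, indexed by the week-ending date
weekly = obs.set_index('datetime')['Celsius'].resample('W').mean()
print(weekly.tolist())
```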
A city is just one place rank (place_rank 16 in the results above). ↩
This is a function for fetching data.
import json
import urllib.request

def fetchdata(url, isjson=True):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36',
        'Accept-Language': 'en-US,en;q=0.9',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8',
        'Connection': 'keep-alive',
        'Referer': 'https://www.google.com/',
        'DNT': '1',  # Do Not Track request header
        'Upgrade-Insecure-Requests': '1',
        'Cache-Control': 'no-cache',
        'Pragma': 'no-cache',
        'Sec-Fetch-Dest': 'document',
        'Sec-Fetch-Mode': 'navigate',
        'Sec-Fetch-Site': 'none',
        'Sec-Fetch-User': '?1',
    }
    req = urllib.request.Request(url, headers=headers)
    with urllib.request.urlopen(req) as response:
        data = response.read().decode('utf-8')
    if isjson:
        return json.loads(data)
    return data
The WMO code says it is degrees Celsius. ↩