A data visualization app to map the City of San Francisco's Envista construction projects using data from the city's Open Data portal.
Note: this is a work in progress.
Two databases on the portal contain Envista projects:
1) Envista Projects for Utility Excavation and Paving
API Endpoint: https://data.sfgov.org/resource/dsie-4zfb.json
2) Envista Projects for Paving
API Endpoint: https://data.sfgov.org/resource/vad7-rtjc.json
Start by importing needed Python libraries:
%matplotlib inline
import datetime
import requests
import json
import pandas as pd
import seaborn as sns
Query the utility API and decode the JSON response (a list of record dictionaries):
url_util = 'https://data.sfgov.org/resource/dsie-4zfb.json'
r_util = requests.get(url_util)
response_dict_util = json.loads(r_util.text)
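Note that Socrata endpoints return a limited page of rows per request by default, so a plain GET may not pull the full dataset. A minimal sketch using Socrata's $limit parameter (the 50,000 cap is an arbitrary assumption; $offset can be added for paging):
# Request up to 50,000 rows in one call via Socrata's $limit parameter
r_util = requests.get(url_util, params={'$limit': 50000})
response_dict_util = r_util.json()  # decodes the JSON body to a list of dicts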
Look at the first record in the response:
response_dict_util[0]
Load the records into a pandas DataFrame, df_util, and drop the first five columns, which are not needed:
df_util = pd.DataFrame.from_dict(response_dict_util)
df_util = df_util.drop(df_util.columns[[0, 1, 2, 3, 4]], axis=1)
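Since the drop is positional, it may be worth confirming which columns sit in those first five slots; a quick check (run before the drop):
# Inspect the names of the first five columns before dropping them
print(df_util.columns[:5].tolist())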
Look at the new dataframe:
df_util.head()
df_util.tail()
Key fields:
create_date
description_full (incomplete data)
dpw_project_class (StreetScape, Sewer, Gas, Paving, ...)
owner (PG&E, SF DPW IDC - Hydraulic, SF DPW IDC - Streets & Highways, ...)
project_id
project_name_full
project_status (Planned, Committed, Started, Completed)
start_date
end_date
duration
street_name
latitude
longitude
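The date fields listed above arrive from the API as strings; converting them to pandas datetimes makes them usable for filtering and plotting. A sketch, assuming ISO-formatted date strings:
# Coerce unparseable values to NaT rather than raising
df_util['start_date'] = pd.to_datetime(df_util['start_date'], errors='coerce')
df_util['end_date'] = pd.to_datetime(df_util['end_date'], errors='coerce')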
Perform the same steps for the other Envista database:
url = 'https://data.sfgov.org/resource/vad7-rtjc.json'
r = requests.get(url)
response_dict = json.loads(r.text)
Look at the first record of response_dict to see if it has the same structure as the utility response:
response_dict[0]
It does, so we can drop the first five columns after converting to a DataFrame, as before:
df = pd.DataFrame.from_dict(response_dict)
df = df.drop(df.columns[[0, 1, 2, 3, 4]], axis=1)
df.head()
df.tail()
For testing purposes, turn the "df" dataframe into a GeoJSON FeatureCollection to plot on a map (since it has fewer records than "df_util"). For now, only six columns of data are sent to the GeoJSON file: "project_id", "owner", "dpw_project_class", "project_status", "start_date", "end_date".
def df_2_geojson(df, properties, lat='latitude', lon='longitude'):
    geojson = {"type": "FeatureCollection", "features": []}
    for _, row in df.iterrows():
        feature = {"type": "Feature",
                   "properties": {},
                   "geometry": {"type": "Point",
                                "coordinates": []}}
        feature["geometry"]["coordinates"] = [row[lon], row[lat]]
        for prop in properties:
            feature["properties"][prop] = row[prop]
        geojson['features'].append(feature)
    return geojson
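One caveat before calling df_2_geojson: Socrata often returns latitude/longitude as strings, and some records may lack coordinates entirely, which would yield invalid Point geometries. A sketch of a cleanup step (an assumed safeguard, not part of the original flow):
# Drop rows with missing coordinates and cast the rest to floats
df = df.dropna(subset=['latitude', 'longitude']).copy()
df[['latitude', 'longitude']] = df[['latitude', 'longitude']].astype(float)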
cols = ["project_id", "owner", "dpw_project_class", "project_status", "start_date", "end_date"]
geojson = df_2_geojson(df, cols)
output_filename = 'envista.geojson'
with open(output_filename, 'w') as output_file:
    json.dump(geojson, output_file, indent=2)
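As a quick sanity check, the file can be read back and the feature count compared with the dataframe's row count:
# Re-read the GeoJSON and verify the feature count matches the row count
with open(output_filename) as f:
    fc = json.load(f)
print(len(fc['features']), len(df))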
The map that plots the geojson points is located here: https://hdavis.github.io/Envista/envista.html
Now that a preliminary map works, let's check for duplicates between the two databases. A quick visual inspection shows many duplicates. For now, let's work with the Utility Excavation and Paving database (the "df_util" dataframe) since it appears to be more comprehensive.
A more detailed check will need to be done to confirm that everything in the "Envista Projects for Paving" database ("df" dataframe) with 5354 records is in the "Envista Projects for Utility Excavation and Paving" database ("df_util" dataframe) with 16093 records (as of 8/7/2016).
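A sketch of that check, assuming project_id identifies the same project in both datasets:
# Which paving project_ids are absent from the utility dataset?
paving_ids = set(df['project_id'])
util_ids = set(df_util['project_id'])
missing_ids = paving_ids - util_ids
print(len(missing_ids), "paving project_ids not found in the utility dataset")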
Working with the "df_util" dataframe, let's subset it into slices based on the values/categories in the "dpw_project_class" field. These categories are:
df_util.dpw_project_class.value_counts()
Creating the slices based on these categories:
df_comm = df_util[df_util.dpw_project_class == 'Communication']
df_curb = df_util[df_util.dpw_project_class == 'Curb Ramps']
df_electric = df_util[df_util.dpw_project_class == 'Electric']
df_gas = df_util[df_util.dpw_project_class == 'Gas']
df_landUse = df_util[df_util.dpw_project_class == 'Land-Use']
df_paving = df_util[df_util.dpw_project_class == 'Paving']
df_ped = df_util[df_util.dpw_project_class == 'Pedestrian Safety']
df_roadway = df_util[df_util.dpw_project_class == 'Roadway']
df_sewer = df_util[df_util.dpw_project_class == 'Sewer']
df_storm = df_util[df_util.dpw_project_class == 'Storm Drain']
df_streetScape = df_util[df_util.dpw_project_class == 'StreetScape']
df_trafficImprove = df_util[df_util.dpw_project_class == 'Traffic Improvement']
df_transit = df_util[df_util.dpw_project_class == 'Transit']
df_water = df_util[df_util.dpw_project_class == 'Water']
Checking to make sure there are no missing/stray records or rows:
numRows_df_util = len(df_util.index) # total number of records in "df_util"
numRows_df_util
numRows_all_slices = len(df_comm.index) + len(df_curb.index) + \
    len(df_electric.index) + len(df_gas.index) + len(df_landUse.index) + \
    len(df_paving.index) + len(df_ped.index) + len(df_roadway.index) + \
    len(df_sewer.index) + len(df_storm.index) + len(df_streetScape.index) + \
    len(df_trafficImprove.index) + len(df_transit.index) + len(df_water.index)
numRows_all_slices
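An equivalent check with less typing is a pandas groupby (a sketch; not the approach above). Rows with a missing dpw_project_class would be excluded from the groups, so a mismatch with the total row count flags stray records:
# Group once instead of slicing fourteen times
class_counts = df_util.groupby('dpw_project_class').size()
print(class_counts.sum() == numRows_df_util)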
We can convert these different slices based on project class into GeoJSON files, though it might be best to load the map data directly from the SF City API and display the different categories on the map using queries.
Looking at the breakdown of project owners to see if that might be a good way to display the map data:
df_util.owner.value_counts()
There are 15 different owners, counting the individual departments within a given owner (e.g., SF DPW IDC) separately. The project class field has 14 categories. Using different colors to display the status of each project (started, completed, etc.) is the easiest way to divide up the data on the map and a good place to start.
Looking at the distribution of the 4 categories in the project_status field:
df_util.project_status.value_counts()
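A sketch of how the status-to-color mapping might look (the colors and the marker_color column name are illustrative assumptions, not part of the current map):
# Hypothetical mapping from project_status to a marker color
status_colors = {'Planned': '#999999', 'Committed': '#f0ad4e',
                 'Started': '#d9534f', 'Completed': '#5cb85c'}
df_util['marker_color'] = df_util['project_status'].map(status_colors)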