Neighborhood Venue Category Patterns vs. Neighborhood Prevailing Social Classes
1. Introduction
For an entrepreneur, choosing the location of a new establishment within a city is an important and often difficult task, so it is advisable to gather as much information as possible about each neighborhood.
Similarly, the city government needs as much information as possible about each neighborhood to manage it properly.
Among all the information about a neighborhood, one item that stands out is its prevailing social class. The needs and opportunities of a neighborhood are often associated with it.
Here, we seek to develop a model capable of predicting the prevailing social class of each neighborhood from the categories of venues found there. The model is trained on two sources: the venues reported in each neighborhood, retrieved from the Foursquare API, and a report from UFMG (Federal University of Minas Gerais) that gives the majority social class of each neighborhood of Belo Horizonte.
If the model works well, it could be used to estimate the prevailing social class of neighborhoods in cities similar to Belo Horizonte that have no such report.
2. Data acquisition and cleaning
The report on the prevailing social classes of Belo Horizonte neighborhoods is published in PDF format. Fortunately, it is easy to copy its contents and paste them into a CSV file. The resulting columns are “Neighborhood” and “Class”. Let’s see the head of this dataset.
import pandas as pd
bairros = pd.read_csv('bh_bairros_classes.csv') # it was previously saved
bairros.head()
| | Neighborhood | Class |
---|---|---|
0 | AARAO REIS | low |
1 | ALTO DOS PINHEIROS | low |
2 | ALTO PARAISO | low |
3 | ALVARO CAMARGOS | low |
4 | ALVORADA | low |
We will retrieve the neighborhood venues with the Foursquare API by calling the “explore” endpoint for each neighborhood, which requires location data (latitude and longitude). We use the geopy library to retrieve the coordinates of each neighborhood.
from geopy.geocoders import Nominatim
from geopy.exc import GeocoderUnavailable, GeocoderTimedOut, GeocoderServiceError
geolocator = Nominatim(user_agent="ny_explorer")
def geolocator_belohorizonte(neighborhood, max_retries=5):
    # geocode "<neighborhood>, Belo Horizonte", retrying on transient errors
    for _ in range(max_retries):
        try:
            return geolocator.geocode(neighborhood + ', Belo Horizonte')
        except (GeocoderUnavailable, GeocoderTimedOut, GeocoderServiceError):
            # geocoder unavailable or timed out: try again
            continue
    return None  # give up after max_retries failed attempts
print(bairros.shape)
bairros.head()
(245, 4)
| | Neighborhood | Class | Latitude | Longitude |
---|---|---|---|---|
0 | AARAO REIS | low | -19.847221 | -43.919508 |
1 | ALTO DOS PINHEIROS | low | -19.932567 | -44.004875 |
2 | ALVARO CAMARGOS | low | -19.916339 | -44.007857 |
3 | ALVORADA | low | -30.031715 | -51.049711 |
4 | ANA LUCIA | low | -19.887783 | -43.906368 |
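The step that adds the Latitude and Longitude columns is not shown above. A minimal sketch of how it could look, assuming a lookup function that returns a (latitude, longitude) pair or None. The `fake_geocode` stand-in and the `add_coordinates` helper are hypothetical names used so the example runs offline; in practice the lookup would wrap `geolocator_belohorizonte`.

```python
import pandas as pd

# Hypothetical stand-in for a geocoder: maps a query string to (lat, lon) or None.
FAKE_INDEX = {
    "AARAO REIS, Belo Horizonte": (-19.847221, -43.919508),
    "ALTO DOS PINHEIROS, Belo Horizonte": (-19.932567, -44.004875),
}

def fake_geocode(query):
    return FAKE_INDEX.get(query)

def add_coordinates(frame, geocode, city="Belo Horizonte"):
    """Return a copy of frame with Latitude/Longitude columns; unresolved rows are dropped."""
    coords = [geocode(f"{name}, {city}") for name in frame["Neighborhood"]]
    out = frame.copy()
    out["Latitude"] = [c[0] if c else None for c in coords]
    out["Longitude"] = [c[1] if c else None for c in coords]
    return out.dropna(subset=["Latitude", "Longitude"]).reset_index(drop=True)

sample = pd.DataFrame({
    "Neighborhood": ["AARAO REIS", "ALTO DOS PINHEIROS", "ALTO PARAISO"],
    "Class": ["low", "low", "low"],
})
located = add_coordinates(sample, fake_geocode)
print(located)
```

Dropping unresolved rows and resetting the index would explain why ALTO PARAISO no longer appears in the second table, although the original notebook may have handled this differently.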
The geopy library is sometimes inaccurate, so we had to remove the neighborhoods whose coordinates could not be retrieved accurately. Such cases were identified manually with the help of a map visualization.
import folium
import numpy as np
import matplotlib.cm as cm
import matplotlib.colors as colors
classes = {'low':0, 'regular':1, 'high':2, 'luxury':3}
ilat = -19.9227318
ilon = -43.9450948
# create map
map_classes = folium.Map(location=[ilat, ilon], zoom_start=11)
# set color scheme for the social classes (one color per class)
colors_array = cm.rainbow(np.linspace(0, 1, len(classes)))
rainbow = [colors.rgb2hex(i) for i in colors_array]
# add one marker per neighborhood to the map
for lat, lon, poi, socialclass in zip(bairros['Latitude'], bairros['Longitude'], bairros['Neighborhood'], bairros['Class']):
    label = folium.Popup(str(poi) + ' Class ' + socialclass, parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[classes[socialclass]-1],
        fill=True,
        fill_color=rainbow[classes[socialclass]-1],
        fill_opacity=0.7).add_to(map_classes)
map_classes
The map makes it clear that geopy placed many points outside the bounds of Belo Horizonte. Besides that, as I live in Belo Horizonte, I could detect some neighborhoods far from downtown that geopy located inaccurately.
So, we chose to restrict the analysis to neighborhoods not far from downtown.
Also, the location of the “Pindorama” neighborhood, which is near downtown, is remarkably wrong, so it will be removed too.
df = bairros[(bairros['Latitude']<ilat+0.09) & (bairros['Latitude']>ilat-0.09) & (bairros['Longitude']>ilon-0.06) & (bairros['Longitude']<ilon+0.06)]
df = df[df['Neighborhood'] != 'PINDORAMA']
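For a sense of scale, the half-widths of this bounding box can be converted to kilometers, and a great-circle distance can flag points that geopy placed far outside the city. A minimal sketch, assuming a spherical Earth of radius 6371 km; the `haversine_km` helper is an illustrative name, not part of the original notebook.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean radius, spherical approximation

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

ilat, ilon = -19.9227318, -43.9450948  # map center used above

# Half-widths of the bounding box, in km
north_south = haversine_km(ilat, ilon, ilat + 0.09, ilon)
east_west = haversine_km(ilat, ilon, ilat, ilon + 0.06)
print(round(north_south, 1), round(east_west, 1))

# The ALVORADA row from the table above was geocoded far from the city
alvorada = haversine_km(ilat, ilon, -30.031715, -51.049711)
print(round(alvorada))
```

The box extends roughly 10 km north and south and about 6 km east and west of the center, while the ALVORADA coordinates lie more than a thousand kilometers away.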
Now that we have the latitude and longitude of each neighborhood, we are ready to call the “explore” endpoint of the Foursquare API.
For each neighborhood we made 4 API calls: one for each of the categories “food”, “stores and services” and “professional”, and one for all categories combined.
This yields 4 datasets. The code and the head of one resulting dataset look like this:
categories = {
    'food': '4d4b7105d754a06374d81259',
    'stores': '4d4b7105d754a06378d81259',
    'professional': '4d4b7105d754a06375d81259'
}
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        # flattened responses store the list under 'venue.categories'
        categories_list = row['venue.categories']
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']
import requests
# function that returns nearby venues by accessing foursquare
# (CLIENT_ID, CLIENT_SECRET and VERSION are assumed to be defined beforehand)
def getNearbyVenues(names, latitudes, longitudes, category, radius=500, LIMIT=150):
    venues_list = []
    for name, lat, lng in zip(names, latitudes, longitudes):
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}&categoryId={}'.format(
            CLIENT_ID,
            CLIENT_SECRET,
            VERSION,
            lat,
            lng,
            radius,
            LIMIT,
            category
        )
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        # return only relevant information for each nearby venue
        venues_list.append([(
            name,
            lat,
            lng,
            v['venue']['id'],
            v['venue']['name'],
            v['venue']['location']['lat'],
            v['venue']['location']['lng'],
            v['venue']['categories'][0]['name']) for v in results])
    # flatten the per-neighborhood lists into one table
    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood',
                             'Neighborhood Latitude',
                             'Neighborhood Longitude',
                             'ID',
                             'Venue',
                             'Venue Latitude',
                             'Venue Longitude',
                             'Venue Category']
    return nearby_venues
bh_food_venues = getNearbyVenues(names=df['Neighborhood'],
                                 latitudes=df['Latitude'],
                                 longitudes=df['Longitude'],
                                 category=categories['food']
                                 )
bh_pro_venues = getNearbyVenues(names=df['Neighborhood'],
                                latitudes=df['Latitude'],
                                longitudes=df['Longitude'],
                                category=categories['professional']
                                )
bh_stores_venues = getNearbyVenues(names=df['Neighborhood'],
                                   latitudes=df['Latitude'],
                                   longitudes=df['Longitude'],
                                   category=categories['stores']
                                   )
# an empty categoryId requests venues of all categories
bh_all_venues = getNearbyVenues(names=df['Neighborhood'],
                                latitudes=df['Latitude'],
                                longitudes=df['Longitude'],
                                category=''
                                )
bh_food_venues.head()
| | Neighborhood | ID | Venue | Latitude | Longitude | Venue Category |
---|---|---|---|---|---|---|
0 | AARAO REIS | 4eb1d86e77c814d925751c99 | Chapa Mágica | -19.845448 | -43.921754 | BBQ Joint |
1 | AARAO REIS | 5bf1cc4275eee40039f91adf | Burger King | -19.846823 | -43.919360 | Fast Food Restaurant |
2 | AARAO REIS | 4daba4b84b22f071ead33715 | Celo Burguer | -19.847524 | -43.919394 | Burger Joint |
3 | AARAO REIS | 516dc84d498e618c69124919 | bobs | -19.846710 | -43.917326 | Burger Joint |
4 | AARAO REIS | 539991af498ea6a823188d29 | Padaria Vila Verde | -19.847362 | -43.921778 | Bakery |
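The request URL in `getNearbyVenues` is assembled with `str.format`. A hedged alternative sketch builds the same query string with the standard library's `urlencode`, which handles escaping and lets the `categoryId` parameter be omitted for the all-categories call. The helper name `build_explore_url`, the placeholder credentials and the version string are assumptions for illustration only.

```python
from urllib.parse import urlencode, urlparse, parse_qs

EXPLORE_ENDPOINT = "https://api.foursquare.com/v2/venues/explore"

def build_explore_url(lat, lng, client_id, client_secret, version,
                      category_id=None, radius=500, limit=150):
    """Build the explore URL; categoryId is only sent when a category is given."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": f"{lat},{lng}",
        "radius": radius,
        "limit": limit,
    }
    if category_id:
        params["categoryId"] = category_id
    return f"{EXPLORE_ENDPOINT}?{urlencode(params)}"

# Placeholder credentials and version string, not real values
url = build_explore_url(-19.847221, -43.919508, "MY_ID", "MY_SECRET", "20180605",
                        category_id="4d4b7105d754a06374d81259")
query = parse_qs(urlparse(url).query)
print(query["ll"], query["categoryId"])
```

Keeping the URL construction in a separate function also makes it possible to check the parameters without sending a request.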
3. Methodology
So now we have 4 clean datasets:
- bh_food_venues
- bh_stores_venues
- bh_pro_venues
- bh_all_venues
They will be our input for training classification models: the venue categories will be the features (after one-hot encoding) and the social class will be the target variable.
The resulting models will be evaluated and we will show the best dataset and best classification algorithm for our goal.
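To make the feature construction concrete: one-hot encoding each venue's category and then averaging per neighborhood yields the share of each category among that neighborhood's venues. A minimal sketch on toy data (the neighborhoods `A` and `B` and their venues are invented for illustration):

```python
import pandas as pd

# Toy data, invented for illustration
venues = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "A", "B", "B"],
    "Venue Category": ["Bakery", "Bakery", "Bar", "Gym", "Bar", "Bar"],
})

# One indicator column per category, one row per venue
onehot = pd.get_dummies(venues["Venue Category"], dtype=float)
onehot.insert(0, "Neighborhood", venues["Neighborhood"])

# Averaging the indicator columns gives each category's share per neighborhood
features = onehot.groupby("Neighborhood").mean()
print(features)
```

Each row of `features` sums to 1, so neighborhoods with very different venue counts become comparable.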
Let’s pick one dataset, the food venues, and check whether it is ready to use:
bh_food_venues = pd.read_csv('bh_food_venues.csv')
bh_food_venues.head()
| | Neighborhood | ID | Venue | Latitude | Longitude | Venue Category |
---|---|---|---|---|---|---|
0 | AARAO REIS | 4eb1d86e77c814d925751c99 | Chapa Mágica | -19.845448 | -43.921754 | BBQ Joint |
1 | AARAO REIS | 5bf1cc4275eee40039f91adf | Burger King | -19.846823 | -43.919360 | Fast Food Restaurant |
2 | AARAO REIS | 4daba4b84b22f071ead33715 | Celo Burguer | -19.847524 | -43.919394 | Burger Joint |
3 | AARAO REIS | 516dc84d498e618c69124919 | bobs | -19.846710 | -43.917326 | Burger Joint |
4 | AARAO REIS | 539991af498ea6a823188d29 | Padaria Vila Verde | -19.847362 | -43.921778 | Bakery |
As we are going to use the venue categories as features for our classification algorithms, it is appropriate to exclude neighborhoods with a small number of venues, because they are likely to become outliers.
Unfortunately, the stores dataset becomes too small after that restriction, so it will be discarded.
Now let’s take a quick look at the most common venue categories in bh_all_venues:
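The minimum-venue restriction can be expressed directly on the venue table. A small sketch with invented data, using the threshold of 8 venues that the `prepare` function applies; `keep_dense_neighborhoods` is a hypothetical helper name.

```python
import pandas as pd

def keep_dense_neighborhoods(venues, min_venues=8):
    """Keep only the rows of neighborhoods that have at least min_venues venues."""
    counts = venues.groupby("Neighborhood")["Venue"].transform("count")
    return venues[counts >= min_venues]

# Invented data: neighborhood X has 9 venues, Y has 3
toy = pd.DataFrame({
    "Neighborhood": ["X"] * 9 + ["Y"] * 3,
    "Venue": [f"venue {i}" for i in range(12)],
})
dense = keep_dense_neighborhoods(toy)
print(dense["Neighborhood"].unique())
```

Counting how many neighborhoods survive for each dataset is a quick way to see which datasets stay large enough to train on.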
bh_all_venues['Venue Category'].value_counts()
Bakery 296
Bar 259
Brazilian Restaurant 230
Gym / Fitness Center 193
Burger Joint 165
...
Fish Market 1
Speakeasy 1
Travel Agency 1
Design Studio 1
Butcher 1
Name: Venue Category, Length: 272, dtype: int64
It looks like we are now ready to prepare our dataset for the classification algorithms.
First we apply one-hot encoding and drop the columns that are neither features nor the target.
Then we split the data into training and test sets and build KNN, SVM and logistic regression models.
The target (y) will be tested in the following formats:
- the actual class
- if the class == ‘luxury’
- if the class == ‘high’
- if the class == ‘regular’
- if the class == ‘low’
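One detail matters when building the per-class targets: scikit-learn's `LabelEncoder` assigns integer codes in sorted (alphabetical) order of the class names, not in the order passed to `fit`, so code 0 is 'high' rather than 'low'. A minimal sketch that builds each one-vs-rest target by class name instead of by a hard-coded integer (the sample labels are invented):

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
le.fit(["low", "regular", "high", "luxury"])
print(list(le.classes_))  # position in this list = integer code

labels = np.array(["low", "high", "luxury", "low", "regular"])  # invented sample
y = le.transform(labels)

# One boolean target per class, looked up by name rather than by a fixed integer
targets = {str(name): y == le.transform([name])[0] for name in le.classes_}
print(targets["low"])
```

Looking codes up by name keeps the printed label and the underlying target in agreement regardless of how the encoder orders the classes.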
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn import svm
from sklearn import metrics
from sklearn.linear_model import LogisticRegression
le = preprocessing.LabelEncoder()
le.fit(["low", "regular", "high", "luxury"])
def prepare(df_v): # returns a dataset with only features and target
    df_venues = df_v
    # one hot encoding
    onehot = pd.get_dummies(df_venues[['Venue Category']], prefix="", prefix_sep="")
    # add neighborhood column back to dataframe
    onehot.insert(0, 'Neighborhood', df_venues['Neighborhood'].values)
    # mean of the one-hot columns = share of each category per neighborhood
    df_grouped = onehot.groupby('Neighborhood').mean().reset_index()
    # dropping neighborhoods with fewer than 8 venues to avoid outliers
    min_venues_mask = df_venues.groupby('Neighborhood').count()['Venue']>=8
    df_grouped = df_grouped[min_venues_mask.values]
    # join the class table (df) with the grouped features;
    # neighborhoods without features get NaN and are dropped
    df_merged = df.join(df_grouped.set_index('Neighborhood'), on='Neighborhood')
    df_merged = df_merged.dropna()
    return df_merged.drop(columns=['Latitude', 'Longitude', 'Neighborhood'])
def build_evaluate(X, y):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=4)
    # logistic regression
    LR = LogisticRegression(C=0.1, solver='liblinear').fit(X_train, y_train)
    yhat_lr = LR.predict(X_test)
    yhat_lr_prob = LR.predict_proba(X_test)
    print(' Logistic regression accuracy score', metrics.accuracy_score(y_test, yhat_lr))
    print(' Logistic regression log loss', metrics.log_loss(y_test, yhat_lr_prob))
    # KNN: try k = 1..9 and report the k with the best test accuracy
    Ks = 10
    mean_acc = np.zeros((Ks-1))
    for n in range(1,Ks):
        # train model and predict
        neigh = KNeighborsClassifier(n_neighbors = n).fit(X_train,y_train)
        yhat = neigh.predict(X_test)
        mean_acc[n-1] = metrics.accuracy_score(y_test, yhat)
    print(" KNN accuracy score", mean_acc.max(), " with k=", mean_acc.argmax()+1)
    # SVM with an RBF kernel
    clf = svm.SVC(kernel='rbf', gamma='auto')
    clf.fit(X_train, y_train)
    yhat_svm = clf.predict(X_test)
    print(" SVM accuracy score", metrics.accuracy_score(y_test, yhat_svm))
prepare(bh_food_venues)
| | Class | Acai House | American Restaurant | Arepa Restaurant | Argentinian Restaurant | Asian Restaurant | Australian Restaurant | BBQ Joint | Bagel Shop | Baiano Restaurant | ... | Soup Place | South American Restaurant | Spanish Restaurant | Steakhouse | Sushi Restaurant | Syrian Restaurant | Taco Place | Tapas Restaurant | Vegetarian / Vegan Restaurant | Wings Joint |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | low | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.090909 | 0.0 | 0.000000 | ... | 0.0 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 |
4 | low | 0.041667 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.0 | 0.000000 | ... | 0.0 | 0.0 | 0.0 | 0.041667 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 |
5 | low | 0.050000 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.100000 | 0.0 | 0.000000 | ... | 0.0 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 |
6 | low | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.052632 | 0.0 | 0.000000 | ... | 0.0 | 0.0 | 0.0 | 0.157895 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 |
11 | low | 0.055556 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.0 | 0.000000 | ... | 0.0 | 0.0 | 0.0 | 0.055556 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
240 | luxury | 0.037037 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.074074 | 0.0 | 0.000000 | ... | 0.0 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.074074 | 0.0 |
241 | luxury | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.011628 | 0.0 | 0.011628 | 0.0 | 0.011628 | ... | 0.0 | 0.0 | 0.0 | 0.023256 | 0.011628 | 0.0 | 0.0 | 0.0 | 0.058140 | 0.0 |
242 | luxury | 0.058824 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.0 | 0.000000 | ... | 0.0 | 0.0 | 0.0 | 0.058824 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 |
243 | luxury | 0.022727 | 0.0 | 0.0 | 0.022727 | 0.000000 | 0.0 | 0.000000 | 0.0 | 0.022727 | ... | 0.0 | 0.0 | 0.0 | 0.000000 | 0.022727 | 0.0 | 0.0 | 0.0 | 0.045455 | 0.0 |
244 | luxury | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.086957 | 0.0 | 0.000000 | ... | 0.0 | 0.0 | 0.0 | 0.000000 | 0.043478 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 |
130 rows × 72 columns
bh_food_venues = pd.read_csv('bh_food_venues.csv')
bh_pro_venues = pd.read_csv('bh_pro_venues.csv')
names = ['all', 'food', 'professional']
datasets = [bh_all_venues, bh_food_venues, bh_pro_venues]
for name, df_venues in zip(names, datasets):
    print('### {} venues category dataset ###'.format(name))
    df_merged = prepare(df_venues)
    y = le.transform(df_merged['Class'].values)
    X = df_merged.drop(columns='Class')
    print(' ### Actual class ###')
    build_evaluate(X, y)
    # LabelEncoder sorts class names alphabetically, so look up each code by name
    for class_name in ['low', 'regular', 'high', 'luxury']:
        print(' ### if y == \'{}\' ###'.format(class_name))
        build_evaluate(X, y == le.transform([class_name])[0])
### all venues category dataset ###
### Actual class ###
Logistic regression accuracy score 0.6923076923076923
Logistic regression log loss 1.1909560870176388
KNN accuracy score 0.5769230769230769 with k= 8
SVM accuracy score 0.6923076923076923
### if y == 'low' ###
Logistic regression accuracy score 0.9230769230769231
Logistic regression log loss 0.40203381145443345
KNN accuracy score 0.9230769230769231 with k= 6
SVM accuracy score 0.9230769230769231
### if y == 'regular' ###
Logistic regression accuracy score 0.3076923076923077
Logistic regression log loss 0.7406334499138543
KNN accuracy score 0.6538461538461539 with k= 1
SVM accuracy score 0.3076923076923077
### if y == 'high' ###
Logistic regression accuracy score 0.9615384615384616
Logistic regression log loss 0.34060287134407913
KNN accuracy score 1.0 with k= 1
SVM accuracy score 0.9615384615384616
### if y == 'luxury' ###
Logistic regression accuracy score 0.8076923076923077
Logistic regression log loss 0.5148135568320493
KNN accuracy score 0.8076923076923077 with k= 2
SVM accuracy score 0.8076923076923077
### food venues category dataset ###
### Actual class ###
Logistic regression accuracy score 0.6923076923076923
Logistic regression log loss 1.1909560870176388
KNN accuracy score 0.5769230769230769 with k= 8
SVM accuracy score 0.6923076923076923
### if y == 'low' ###
Logistic regression accuracy score 0.9230769230769231
Logistic regression log loss 0.40203381145443345
KNN accuracy score 0.9230769230769231 with k= 6
SVM accuracy score 0.9230769230769231
### if y == 'regular' ###
Logistic regression accuracy score 0.3076923076923077
Logistic regression log loss 0.7406334499138543
KNN accuracy score 0.6538461538461539 with k= 1
SVM accuracy score 0.3076923076923077
### if y == 'high' ###
Logistic regression accuracy score 0.9615384615384616
Logistic regression log loss 0.34060287134407913
KNN accuracy score 1.0 with k= 1
SVM accuracy score 0.9615384615384616
### if y == 'luxury' ###
Logistic regression accuracy score 0.8076923076923077
Logistic regression log loss 0.5148135568320493
KNN accuracy score 0.8076923076923077 with k= 2
SVM accuracy score 0.8076923076923077
### professional venues category dataset ###
### Actual class ###
Logistic regression accuracy score 0.5714285714285714
Logistic regression log loss 1.2284283597098578
KNN accuracy score 0.5357142857142857 with k= 2
SVM accuracy score 0.5714285714285714
### if y == 'low' ###
Logistic regression accuracy score 0.8214285714285714
Logistic regression log loss 0.488417167619632
KNN accuracy score 0.8214285714285714 with k= 4
SVM accuracy score 0.8214285714285714
### if y == 'regular' ###
Logistic regression accuracy score 0.42857142857142855
Logistic regression log loss 0.6960464112425366
KNN accuracy score 0.6785714285714286 with k= 5
SVM accuracy score 0.42857142857142855
### if y == 'high' ###
Logistic regression accuracy score 0.8928571428571429
Logistic regression log loss 0.3983842504318268
KNN accuracy score 0.9285714285714286 with k= 2
SVM accuracy score 0.8928571428571429
### if y == 'luxury' ###
Logistic regression accuracy score 0.8571428571428571
Logistic regression log loss 0.4717453131865441
KNN accuracy score 0.8571428571428571 with k= 2
SVM accuracy score 0.8571428571428571
4. Results
As mentioned before, we built KNN, SVM and logistic regression models, but we show only the results obtained by logistic regression, because it achieves the best scores (accuracy score) in most cases.
4.1 All venue categories dataset scores
- the actual class: 0.6666
- if the class == ‘luxury’: 0.9
- if the class == ‘high’: 0.8666
- if the class == ‘regular’: 0.3333
- if the class == ‘low’: 0.9
4.2 Food venue categories dataset scores
- the actual class: 0.6923
- if the class == ‘luxury’: 0.8077
- if the class == ‘high’: 0.9615
- if the class == ‘regular’: 0.3077
- if the class == ‘low’: 0.9231
4.3 Professional venue categories dataset scores
- the actual class: 0.5714
- if the class == ‘luxury’: 0.8571
- if the class == ‘high’: 0.8929
- if the class == ‘regular’: 0.4286
- if the class == ‘low’: 0.8214
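Because the test split holds 20% of the neighborhoods, each score is a fraction over a small number of test rows. A quick sketch of the arithmetic, assuming 26 test rows for the food dataset (20% of the 130 rows shown earlier) and 28 for the professional dataset (a size inferred from the scores themselves, not stated in the output):

```python
from fractions import Fraction

def score(correct, total):
    """Accuracy as the fraction of correctly classified test rows."""
    return float(Fraction(correct, total))

# Correct predictions implied by the logistic regression scores
food = {"actual": 18, "low": 24, "regular": 8, "high": 25, "luxury": 21}
professional = {"actual": 16, "low": 23, "regular": 12, "high": 25, "luxury": 24}

for target, correct in food.items():
    print("food", target, round(score(correct, 26), 4))
for target, correct in professional.items():
    print("professional", target, round(score(correct, 28), 4))

# Change in the food score, in percentage points, per misclassified neighborhood
points_per_row = 100 / 26
print(round(points_per_row, 1))
```

With so few test rows, a difference of one or two neighborhoods between models or datasets may not be meaningful.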
5. Discussion
It’s interesting to see that the food venue categories dataset got the best overall results, closely followed by the all categories dataset. This suggests that it may be possible to combine two or more categories to get better scores, since some categories clearly hurt the score (see the professional venue categories dataset).
It’s also interesting that the model built with the food venue categories dataset predicts the ‘high’ and ‘low’ classes remarkably well, and it could plausibly be applied to a different city similar to Belo Horizonte.
6. Conclusion
Even though we couldn’t get a great model to predict the actual class of a neighborhood, we got interesting results when predicting the ‘high’ and ‘low’ classes using the food categories dataset, and when predicting the ‘luxury’ and ‘low’ classes using the all categories dataset.