1. Introduction

For an entrepreneur, choosing the location of a new establishment within a city can be an important and difficult task. It helps to have as much information as possible about each neighborhood.

Similarly, the city government needs as much information as possible about each neighborhood to manage it properly.

Among all the information about a neighborhood, one item that stands out is its prevailing social class. The needs and opportunities of a neighborhood are often associated with it.

Here, we seek to develop a model capable of predicting the prevailing social class of each neighborhood based on the categories of venues found there. The model will be trained on two sources: the venues reported for each neighborhood, retrieved from the Foursquare API, and a report from UFMG (Federal University of Minas Gerais) that gives the majority social class of each neighborhood.

If the model works well, we may use it to infer this valuable information for neighborhoods in cities similar to Belo Horizonte that have no report on the prevailing social class of their neighborhoods.

2. Data acquisition and cleaning

The report on the prevailing social classes of the Belo Horizonte neighborhoods is published in PDF format. Fortunately, it is very easy to copy the data and paste it into a CSV file. The resulting columns are “Neighborhood” and “Class”. Let’s look at the head of this data set.

import pandas as pd

bairros = pd.read_csv('bh_bairros_classes.csv') # it was previously saved
bairros.head()
   Neighborhood         Class
0  AARAO REIS           low
1  ALTO DOS PINHEIROS   low
2  ALTO PARAISO         low
3  ALVARO CAMARGOS      low
4  ALVORADA             low
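
The copy-and-paste step can also be scripted. The sketch below assumes, hypothetically, that each pasted line holds the neighborhood name followed by its class as the last word (for example `AARAO REIS low`). The function name `pasted_lines_to_csv` and that line layout are assumptions for illustration, not the actual layout of the report.

```python
import csv
import io

def pasted_lines_to_csv(lines):
    """Turn lines like 'ALTO DOS PINHEIROS low' into CSV text.

    Assumes (hypothetically) that the class is always the last word of the line.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Neighborhood", "Class"])
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines left over from the PDF
        name, social_class = line.rsplit(" ", 1)
        writer.writerow([name, social_class])
    return buffer.getvalue()

csv_text = pasted_lines_to_csv(["AARAO REIS low", "", "ALTO DOS PINHEIROS low"])
```

The resulting text can be written to a file and loaded with `pd.read_csv` as shown above.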

We will retrieve the venues of each neighborhood from the Foursquare API by calling the “explore” endpoint, which requires location data. We use the geopy library to retrieve the coordinates of each neighborhood.

from geopy.geocoders import Nominatim
from geopy.exc import GeocoderUnavailable, GeocoderTimedOut, GeocoderServiceError

geolocator = Nominatim(user_agent="ny_explorer")

def geolocator_belohorizonte(neighborhood, retries=5):
    # returns a geopy Location, or None if the neighborhood is not found
    try:
        locator = geolocator.geocode(neighborhood + ', Belo Horizonte')
    except (GeocoderUnavailable, GeocoderTimedOut, GeocoderServiceError):
        if retries == 0:
            raise  # give up instead of recursing forever
        # geocoder unavailable or timed out: try again
        locator = geolocator_belohorizonte(neighborhood, retries - 1)
    return locator

print(bairros.shape)
bairros.head()
(245, 4)
   Neighborhood         Class   Latitude     Longitude
0  AARAO REIS           low     -19.847221   -43.919508
1  ALTO DOS PINHEIROS   low     -19.932567   -44.004875
2  ALVARO CAMARGOS      low     -19.916339   -44.007857
3  ALVORADA             low     -30.031715   -51.049711
4  ANA LUCIA            low     -19.887783   -43.906368
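
The step that fills the Latitude and Longitude columns is not shown above. A minimal sketch of how it could look follows, assuming the geocoder returns either `None` or an object with `latitude` and `longitude` attributes (as geopy locations do). The helper name `add_coordinates` and the stub geocoder are hypothetical and exist only for illustration.

```python
from collections import namedtuple

def add_coordinates(rows, geocode):
    """Attach Latitude/Longitude to each row; skip rows the geocoder cannot find.

    rows: list of dicts with a 'Neighborhood' key
    geocode: callable returning an object with .latitude/.longitude, or None
    """
    located = []
    for row in rows:
        location = geocode(row["Neighborhood"])
        if location is None:
            continue  # neighborhood not found by the geocoder
        located.append({**row,
                        "Latitude": location.latitude,
                        "Longitude": location.longitude})
    return located

# stub geocoder so the example runs without network access (hypothetical)
FakeLocation = namedtuple("FakeLocation", ["latitude", "longitude"])
known = {"AARAO REIS": FakeLocation(-19.847221, -43.919508)}

rows = [{"Neighborhood": "AARAO REIS", "Class": "low"},
        {"Neighborhood": "UNKNOWN PLACE", "Class": "low"}]
result = add_coordinates(rows, known.get)
```

With the real data, something like `pd.DataFrame(add_coordinates(bairros.to_dict('records'), geolocator_belohorizonte))` would produce a table with the four columns shown above.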

The geopy library is not always accurate, so we had to remove the neighborhoods whose location could not be retrieved accurately. These cases were identified manually with the help of a map visualization.

import folium
import numpy as np
import matplotlib.cm as cm
import matplotlib.colors as colors

classes = {'low':0, 'regular':1, 'high':2, 'luxury':3}


ilat = -19.9227318
ilon = -43.9450948

# create map
map_classes = folium.Map(location=[ilat,  ilon], zoom_start=11)

# set color scheme: one color per social class
colors_array = cm.rainbow(np.linspace(0, 1, len(classes)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, socialclass in zip(bairros['Latitude'], bairros['Longitude'], bairros['Neighborhood'], bairros['Class']):
    label = folium.Popup(str(poi) + ' Class ' + socialclass, parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[classes[socialclass]],
        fill=True,
        fill_color=rainbow[classes[socialclass]],
        fill_opacity=0.7).add_to(map_classes)
    
map_classes

Viewing the map, it’s clear that geopy placed many points outside the bounds of Belo Horizonte. Besides that, as I live in Belo Horizonte, I could spot some neighborhoods far from downtown that geopy located inaccurately.

So, we chose to restrict the analysis to neighborhoods not far from downtown.

Also, the location retrieved for the “Pindorama” neighborhood, near downtown, is remarkably wrong, so it will be removed too.

df = bairros[(bairros['Latitude']<ilat+0.09) & (bairros['Latitude']>ilat-0.09) & (bairros['Longitude']>ilon-0.06) & (bairros['Longitude']<ilon+0.06)]
df = df[df['Neighborhood'] != 'PINDORAMA']
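
As an alternative to the rectangular latitude/longitude box, the same restriction can be expressed as a distance from downtown. The sketch below uses the haversine formula, which is a swapped-in technique and not what the analysis above does. The 10 km threshold and the names `haversine_km` and `near_downtown` are illustrative assumptions.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two (lat, lon) points."""
    radius_km = 6371.0  # mean Earth radius
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(a))

DOWNTOWN = (-19.9227318, -43.9450948)  # ilat, ilon used for the map

def near_downtown(lat, lon, max_km=10.0):
    # max_km is an illustrative threshold, not a value from the analysis
    return haversine_km(DOWNTOWN[0], DOWNTOWN[1], lat, lon) <= max_km

aarao_reis_ok = near_downtown(-19.847221, -43.919508)  # coordinates from the table
alvorada_ok = near_downtown(-30.031715, -51.049711)    # the misplaced ALVORADA row
```

A circular filter treats all directions equally, while the box used above is narrower in longitude than in latitude.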

Now we have the latitude and longitude of each neighborhood, so we are ready to call the Foursquare API “explore” endpoint.

In fact, we made 4 API calls for each neighborhood: one for each of the categories “food”, “stores and services” and “professional”, and one for all categories combined.

So there will be 4 resulting datasets, one per category plus one for all categories combined. The code and the head of one resulting dataset look like this:

categories = {
    'food': '4d4b7105d754a06374d81259',
    'stores': '4d4b7105d754a06378d81259', 
    'professional': '4d4b7105d754a06375d81259'
}

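import requests

# CLIENT_ID, CLIENT_SECRET and VERSION must be defined beforehand
# with your own Foursquare credentials

# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']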
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']
    

# function that returns nearby venues by accessing foursquare    
def getNearbyVenues(names, latitudes, longitudes, category, radius=500, LIMIT=150):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):              
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}&categoryId={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT,
            category
        )
                    
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['id'], 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'ID',                              
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_venues

# the category must be passed by keyword: a positional argument
# cannot follow keyword arguments
bh_food_venues = getNearbyVenues(names=df['Neighborhood'],
                                   latitudes=df['Latitude'],
                                   longitudes=df['Longitude'],
                                   category=categories['food']
                                  )

bh_pro_venues = getNearbyVenues(names=df['Neighborhood'],
                                   latitudes=df['Latitude'],
                                   longitudes=df['Longitude'],
                                   category=categories['professional']
                                  )

bh_stores_venues = getNearbyVenues(names=df['Neighborhood'],
                                   latitudes=df['Latitude'],
                                   longitudes=df['Longitude'],
                                   category=categories['stores']
                                  )

bh_all_venues = getNearbyVenues(names=df['Neighborhood'],
                                   latitudes=df['Latitude'],
                                   longitudes=df['Longitude'],
                                   category=''  # empty categoryId: all categories
                                  )

bh_food_venues.head()
   Neighborhood  ID                        Venue               Latitude    Longitude   Venue Category
0  AARAO REIS    4eb1d86e77c814d925751c99  Chapa Mágica        -19.845448  -43.921754  BBQ Joint
1  AARAO REIS    5bf1cc4275eee40039f91adf  Burger King         -19.846823  -43.919360  Fast Food Restaurant
2  AARAO REIS    4daba4b84b22f071ead33715  Celo Burguer        -19.847524  -43.919394  Burger Joint
3  AARAO REIS    516dc84d498e618c69124919  bobs                -19.846710  -43.917326  Burger Joint
4  AARAO REIS    539991af498ea6a823188d29  Padaria Vila Verde  -19.847362  -43.921778  Bakery
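
The request URL inside getNearbyVenues is assembled with string formatting. A possible alternative, sketched below, builds the query string with `urllib.parse.urlencode`, which escapes values and makes it easy to leave out `categoryId` for the all-categories call. The helper name `build_explore_url`, the credentials and the version string are placeholders, not real values.

```python
from urllib.parse import urlencode

def build_explore_url(client_id, client_secret, version, lat, lng,
                      radius=500, limit=150, category_id=None):
    """Build the venues/explore URL; omit categoryId when no category is given."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": "{},{}".format(lat, lng),
        "radius": radius,
        "limit": limit,
    }
    if category_id:
        params["categoryId"] = category_id
    return "https://api.foursquare.com/v2/venues/explore?" + urlencode(params)

# placeholder credentials and version (hypothetical values)
url_food = build_explore_url("MY_ID", "MY_SECRET", "20200101",
                             -19.847221, -43.919508,
                             category_id="4d4b7105d754a06374d81259")
url_all = build_explore_url("MY_ID", "MY_SECRET", "20200101",
                            -19.847221, -43.919508)
```

The comma in the `ll` value is percent-encoded as `%2C` by `urlencode`.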

3. Methodology

So now we have 4 clean datasets:

  • bh_food_venues
  • bh_stores_venues
  • bh_pro_venues
  • bh_all_venues

They will be our input for training classification models: the venue categories will be the features (after one-hot encoding) and the social class will be the target variable.

The resulting models will be evaluated, and we will report the best dataset and the best classification algorithm for our goal.
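
To make the feature construction concrete, here is a small sketch on toy data (hypothetical venues, not taken from the real datasets). One-hot encoding the venue category and averaging per neighborhood yields, for each neighborhood, the share of its venues that falls in each category.

```python
import pandas as pd

# toy data (hypothetical venues)
toy = pd.DataFrame({
    "Neighborhood": ["A", "A", "B", "B", "B", "B"],
    "Venue Category": ["Bakery", "Bar", "Bakery", "Bakery", "Bar", "Burger Joint"],
})

# one column per category, 1.0 where the venue belongs to it
onehot = pd.get_dummies(toy[["Venue Category"]], prefix="", prefix_sep="", dtype=float)
onehot.insert(0, "Neighborhood", toy["Neighborhood"].values)

# mean of the one-hot columns = share of each category within the neighborhood
features = onehot.groupby("Neighborhood").mean()
```

In this toy example neighborhood A is half bakeries and half bars, while a quarter of the venues in B are burger joints.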

Let’s pick one dataset and see if it’s ready to go:

bh_food_venues = pd.read_csv('bh_food_venues.csv')
bh_food_venues.head()
   Neighborhood  ID                        Venue               Latitude    Longitude   Venue Category
0  AARAO REIS    4eb1d86e77c814d925751c99  Chapa Mágica        -19.845448  -43.921754  BBQ Joint
1  AARAO REIS    5bf1cc4275eee40039f91adf  Burger King         -19.846823  -43.919360  Fast Food Restaurant
2  AARAO REIS    4daba4b84b22f071ead33715  Celo Burguer        -19.847524  -43.919394  Burger Joint
3  AARAO REIS    516dc84d498e618c69124919  bobs                -19.846710  -43.917326  Burger Joint
4  AARAO REIS    539991af498ea6a823188d29  Padaria Vila Verde  -19.847362  -43.921778  Bakery

As we are going to use the venue categories as features of our classification algorithms, it’s appropriate to drop neighborhoods with a small number of venues, because they have a high potential to become outliers.

Unfortunately, the stores dataset becomes too small after that restriction, so it will be discarded.
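
The restriction can be sketched in plain Python as a count of venues per neighborhood followed by a threshold. The threshold of 8 matches the one used in the prepare function further down. The function name and the toy rows are hypothetical.

```python
from collections import Counter

def neighborhoods_with_enough_venues(venue_neighborhoods, min_venues=8):
    """Return the set of neighborhoods that have at least `min_venues` venues.

    venue_neighborhoods: one neighborhood name per venue row
    """
    counts = Counter(venue_neighborhoods)
    return {name for name, n in counts.items() if n >= min_venues}

rows = ["A"] * 8 + ["B"] * 3  # hypothetical: A has 8 venues, B has 3
kept = neighborhoods_with_enough_venues(rows)
```

A neighborhood with only a handful of venues would otherwise get category shares such as 0.5 or 1.0 from one or two venues, which is why such rows tend to behave as outliers.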

Now let’s take a quick look at the most common venue categories in bh_all_venues:

bh_all_venues['Venue Category'].value_counts()
Bakery                  296
Bar                     259
Brazilian Restaurant    230
Gym / Fitness Center    193
Burger Joint            165
                       ... 
Fish Market               1
Speakeasy                 1
Travel Agency             1
Design Studio             1
Butcher                   1
Name: Venue Category, Length: 272, dtype: int64

It looks like we are now ready to prepare our dataset for the classification algorithms.

First we apply one-hot encoding and drop the columns that are neither features nor the target.

Then we split the data into training and test sets and build KNN, SVM and logistic regression models.

The target (y) will be tested in the following formats:

  • the actual class
  • if the class == ‘luxury’
  • if the class == ‘high’
  • if the class == ‘regular’
  • if the class == ‘low’
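
One detail matters when building these binary targets: scikit-learn's LabelEncoder stores its classes in sorted order, so the integer codes follow alphabetical order rather than the order in which the names were passed to `fit`. The sketch below mimics that ordering in plain Python on toy labels and builds each binary target from the class name instead of a hard-coded integer.

```python
labels = ["low", "regular", "high", "luxury", "low"]  # toy labels

# codes in sorted order, mirroring how LabelEncoder assigns them
codes = {name: i for i, name in enumerate(sorted(set(labels)))}

# one boolean target per class, built from names rather than integer codes
binary_targets = {name: [label == name for label in labels] for name in codes}
```

Here 'high' gets code 0 and 'low' gets code 1, so a comparison such as `y == 0` selects the 'high' class, not 'low'.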
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn import svm
from sklearn import metrics
from sklearn.linear_model import LogisticRegression

le = preprocessing.LabelEncoder()
le.fit(["low", "regular", "high", "luxury"])

def prepare(df_v):  # returns a dataset with only features and target
    df_venues = df_v

    # one hot encoding
    onehot = pd.get_dummies(df_venues[['Venue Category']], prefix="", prefix_sep="")

    # add neighborhood column back to dataframe
    onehot.insert(0, 'Neighborhood', df_venues['Neighborhood'].values)  

    # creating dataset of onehot categories means
    df_grouped = onehot.groupby('Neighborhood').mean().reset_index()

    # dropping neighborhoods with fewer than 8 venues to avoid outliers
    # (both groupby results are sorted by neighborhood, so the mask lines up)
    min_venues_mask = df_venues.groupby('Neighborhood').count()['Venue'] >= 8
    df_grouped = df_grouped[min_venues_mask.values]

    # join the category frequencies onto df (the neighborhood table defined
    # earlier) so that each row also carries the class of its neighborhood
    df_merged = df.join(df_grouped.set_index('Neighborhood'), on='Neighborhood')

    # neighborhoods that were filtered out have no frequencies and are dropped
    df_merged = df_merged.dropna()
    return df_merged.drop(columns=['Latitude', 'Longitude', 'Neighborhood'])


def build_evaluate(X, y):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=4)

    LR = LogisticRegression(C=0.1, solver='liblinear', multi_class='auto').fit(X_train, y_train)
    yhat_lr = LR.predict(X_test)
    yhat_lr_prob = LR.predict_proba(X_test)

    print('        Logistic regression accuracy score', metrics.accuracy_score(y_test, yhat_lr))
    print('        Logistic regression log loss', metrics.log_loss(y_test, yhat_lr_prob))

    # try k = 1..9 and keep the best test accuracy
    Ks = 10
    mean_acc = np.zeros((Ks-1))
    std_acc = np.zeros((Ks-1))
    for n in range(1,Ks):

        #Train Model and Predict  
        neigh = KNeighborsClassifier(n_neighbors = n).fit(X_train,y_train)
        yhat=neigh.predict(X_test)
        mean_acc[n-1] = metrics.accuracy_score(y_test, yhat)


        std_acc[n-1]=np.std(yhat==y_test)/np.sqrt(yhat.shape[0])

    print("        KNN accuracy score", mean_acc.max(), " with k=", mean_acc.argmax()+1)
    
    clf = svm.SVC(kernel='rbf', gamma='auto')
    clf.fit(X_train, y_train) 

    yhat_svm = clf.predict(X_test)
    print("        SVM accuracy score", metrics.accuracy_score(y_test, yhat_svm))      

prepare(bh_food_venues)
Class Acai House American Restaurant Arepa Restaurant Argentinian Restaurant Asian Restaurant Australian Restaurant BBQ Joint Bagel Shop Baiano Restaurant ... Soup Place South American Restaurant Spanish Restaurant Steakhouse Sushi Restaurant Syrian Restaurant Taco Place Tapas Restaurant Vegetarian / Vegan Restaurant Wings Joint
0 low 0.000000 0.0 0.0 0.000000 0.000000 0.0 0.090909 0.0 0.000000 ... 0.0 0.0 0.0 0.000000 0.000000 0.0 0.0 0.0 0.000000 0.0
4 low 0.041667 0.0 0.0 0.000000 0.000000 0.0 0.000000 0.0 0.000000 ... 0.0 0.0 0.0 0.041667 0.000000 0.0 0.0 0.0 0.000000 0.0
5 low 0.050000 0.0 0.0 0.000000 0.000000 0.0 0.100000 0.0 0.000000 ... 0.0 0.0 0.0 0.000000 0.000000 0.0 0.0 0.0 0.000000 0.0
6 low 0.000000 0.0 0.0 0.000000 0.000000 0.0 0.052632 0.0 0.000000 ... 0.0 0.0 0.0 0.157895 0.000000 0.0 0.0 0.0 0.000000 0.0
11 low 0.055556 0.0 0.0 0.000000 0.000000 0.0 0.000000 0.0 0.000000 ... 0.0 0.0 0.0 0.055556 0.000000 0.0 0.0 0.0 0.000000 0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
240 luxury 0.037037 0.0 0.0 0.000000 0.000000 0.0 0.074074 0.0 0.000000 ... 0.0 0.0 0.0 0.000000 0.000000 0.0 0.0 0.0 0.074074 0.0
241 luxury 0.000000 0.0 0.0 0.000000 0.011628 0.0 0.011628 0.0 0.011628 ... 0.0 0.0 0.0 0.023256 0.011628 0.0 0.0 0.0 0.058140 0.0
242 luxury 0.058824 0.0 0.0 0.000000 0.000000 0.0 0.000000 0.0 0.000000 ... 0.0 0.0 0.0 0.058824 0.000000 0.0 0.0 0.0 0.000000 0.0
243 luxury 0.022727 0.0 0.0 0.022727 0.000000 0.0 0.000000 0.0 0.022727 ... 0.0 0.0 0.0 0.000000 0.022727 0.0 0.0 0.0 0.045455 0.0
244 luxury 0.000000 0.0 0.0 0.000000 0.000000 0.0 0.086957 0.0 0.000000 ... 0.0 0.0 0.0 0.000000 0.043478 0.0 0.0 0.0 0.000000 0.0

130 rows × 72 columns

bh_food_venues = pd.read_csv('bh_food_venues.csv')
bh_pro_venues = pd.read_csv('bh_pro_venues.csv')
names = ['all', 'food', 'professional']
datasets = [bh_all_venues, bh_food_venues, bh_pro_venues]

for name, df_venues in zip(names, datasets):
    
    print('### {} venues category dataset ###'.format(name))
    
    df_merged = prepare(df_venues)

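    class_names = df_merged['Class'].values
    y = le.transform(class_names)
    # LabelEncoder assigns codes in alphabetical order (high=0, low=1,
    # luxury=2, regular=3), so the binary targets are built from the names
    y0 = class_names == 'low'
    y1 = class_names == 'regular'
    y2 = class_names == 'high'
    y3 = class_names == 'luxury'

    X = df_merged.drop(columns='Class')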

    print('    ### Actual class ###')
    build_evaluate(X, y)

    print('    ### if y == \'low\' ###')
    build_evaluate(X, y0)

    print('    ### if y == \'regular\' ###')
    build_evaluate(X, y1)

    print('    ### if y == \'high\' ###')
    build_evaluate(X, y2)

    print('    ### if y == \'luxury\' ###')
    build_evaluate(X, y3)
### all venues category dataset ###
    ### Actual class ###
        Logistic regression accuracy score 0.6923076923076923
        Logistic regression log loss 1.1909560870176388
        KNN accuracy score 0.5769230769230769  with k= 8
        SVM accuracy score 0.6923076923076923
    ### if y == 'low' ###
        Logistic regression accuracy score 0.9230769230769231
        Logistic regression log loss 0.40203381145443345
        KNN accuracy score 0.9230769230769231  with k= 6
        SVM accuracy score 0.9230769230769231
    ### if y == 'regular' ###
        Logistic regression accuracy score 0.3076923076923077
        Logistic regression log loss 0.7406334499138543
        KNN accuracy score 0.6538461538461539  with k= 1
        SVM accuracy score 0.3076923076923077
    ### if y == 'high' ###
        Logistic regression accuracy score 0.9615384615384616
        Logistic regression log loss 0.34060287134407913
        KNN accuracy score 1.0  with k= 1
        SVM accuracy score 0.9615384615384616
    ### if y == 'luxury' ###
        Logistic regression accuracy score 0.8076923076923077
        Logistic regression log loss 0.5148135568320493
        KNN accuracy score 0.8076923076923077  with k= 2
        SVM accuracy score 0.8076923076923077
### food venues category dataset ###
    ### Actual class ###
        Logistic regression accuracy score 0.6923076923076923
        Logistic regression log loss 1.1909560870176388
        KNN accuracy score 0.5769230769230769  with k= 8
        SVM accuracy score 0.6923076923076923
    ### if y == 'low' ###
        Logistic regression accuracy score 0.9230769230769231
        Logistic regression log loss 0.40203381145443345
        KNN accuracy score 0.9230769230769231  with k= 6
        SVM accuracy score 0.9230769230769231
    ### if y == 'regular' ###
        Logistic regression accuracy score 0.3076923076923077
        Logistic regression log loss 0.7406334499138543
        KNN accuracy score 0.6538461538461539  with k= 1
        SVM accuracy score 0.3076923076923077
    ### if y == 'high' ###
        Logistic regression accuracy score 0.9615384615384616
        Logistic regression log loss 0.34060287134407913
        KNN accuracy score 1.0  with k= 1
        SVM accuracy score 0.9615384615384616
    ### if y == 'luxury' ###
        Logistic regression accuracy score 0.8076923076923077
        Logistic regression log loss 0.5148135568320493
        KNN accuracy score 0.8076923076923077  with k= 2
        SVM accuracy score 0.8076923076923077
### professional venues category dataset ###
    ### Actual class ###
        Logistic regression accuracy score 0.5714285714285714
        Logistic regression log loss 1.2284283597098578
        KNN accuracy score 0.5357142857142857  with k= 2
        SVM accuracy score 0.5714285714285714
    ### if y == 'low' ###
        Logistic regression accuracy score 0.8214285714285714
        Logistic regression log loss 0.488417167619632
        KNN accuracy score 0.8214285714285714  with k= 4
        SVM accuracy score 0.8214285714285714
    ### if y == 'regular' ###
        Logistic regression accuracy score 0.42857142857142855
        Logistic regression log loss 0.6960464112425366
        KNN accuracy score 0.6785714285714286  with k= 5
        SVM accuracy score 0.42857142857142855
    ### if y == 'high' ###
        Logistic regression accuracy score 0.8928571428571429
        Logistic regression log loss 0.3983842504318268
        KNN accuracy score 0.9285714285714286  with k= 2
        SVM accuracy score 0.8928571428571429
    ### if y == 'luxury' ###
        Logistic regression accuracy score 0.8571428571428571
        Logistic regression log loss 0.4717453131865441
        KNN accuracy score 0.8571428571428571  with k= 2
        SVM accuracy score 0.8571428571428571
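
Each accuracy score in the log is the fraction of test neighborhoods predicted correctly. With test_size=0.2, the 130-row dataset shown earlier leaves 26 neighborhoods for testing, so a score of 0.6923 is consistent with 18 correct predictions out of 26. This is an inference from the numbers, not something the log states. The sketch below reproduces that arithmetic with hypothetical labels.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# hypothetical labels: 26 test samples, 18 of them predicted correctly
y_true = [1] * 26
y_pred = [1] * 18 + [0] * 8
score = accuracy(y_true, y_pred)
```

With so few test samples, a single neighborhood changes the score by almost 4 percentage points, which is worth keeping in mind when comparing models.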

4. Results

As mentioned before, we built KNN, SVM and logistic regression models, but we will show only the results obtained by logistic regression, because it gets the best scores (accuracy score) in most cases.

4.1 All venue categories dataset scores

  • the actual class: 0.6666
  • if the class == ‘luxury’: 0.9
  • if the class == ‘high’: 0.8666
  • if the class == ‘regular’: 0.3333
  • if the class == ‘low’: 0.9

4.2 Food venue categories dataset scores

  • the actual class: 0.6923
  • if the class == ‘luxury’: 0.8
  • if the class == ‘high’: 0.96
  • if the class == ‘regular’: 0.3076
  • if the class == ‘low’: 0.923

4.3 Professional venue categories dataset scores

  • the actual class: 0.5714
  • if the class == ‘luxury’: 0.8571
  • if the class == ‘high’: 0.8928
  • if the class == ‘regular’: 0.4286
  • if the class == ‘low’: 0.8214

5. Discussion

It’s interesting to see that the food venue categories dataset got the best overall results, closely followed by the all categories dataset. This suggests that it may be possible to combine two or more categories to get better scores, since some categories clearly hurt the score (see the professional venue categories dataset).

Besides that, it’s also interesting to see that the model built with the food venue categories dataset predicts the ‘high’ and ‘low’ classes remarkably well, and could potentially be used in a different city similar to Belo Horizonte.
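
Using the model in another city requires the new city's features to have exactly the same columns, in the same order, as the training data. A minimal sketch of that alignment follows. The helper name `align_features`, the column list and the example row are hypothetical.

```python
def align_features(city_row, training_columns):
    """Order a neighborhood's category shares like the training columns.

    Categories absent from the new neighborhood get 0.0; categories the
    model never saw during training are dropped.
    """
    return [city_row.get(column, 0.0) for column in training_columns]

training_columns = ["Bakery", "Bar", "Burger Joint"]  # hypothetical
new_city_row = {"Bar": 0.5, "Bakery": 0.25, "Sushi Restaurant": 0.25}
x = align_features(new_city_row, training_columns)
```

Dropping unseen categories means the aligned shares may no longer sum to 1, which is one reason the model may transfer less well to cities whose venue mix differs from Belo Horizonte's.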

6. Conclusion

Even though we couldn’t get a great model to predict the actual class of a neighborhood, we got interesting results predicting the ‘high’ and ‘low’ classes using the food categories dataset, and predicting the ‘luxury’ and ‘low’ classes using the all categories dataset.