Why this blog? To begin my exploration of recommendation systems, I picked the most obvious use-case: restaurant recommendation. In my journey of applying AI/ML to business problems, this turned out to be quite an experience in feature engineering, handling geospatial features, the intensity of cold-start problems, dealing with extremely imbalanced data (only 1.5% positive classes) and experimenting with deep learning models for content filtering. I intend to document and share these learnings in the hope that they will be helpful to ML practitioners.

Where does “restaurant recommendation” fit in? Food delivery services like Zomato and Swiggy partner with restaurants and connect them to a large customer base via their online ordering platforms. They typically recommend restaurants to app users based on proximity to the customer’s location, prior food choices, restaurant ratings and promotional offers. Customers can order food from multiple locations: home, office, or a friend’s place. Based on the customer’s location or zip code, restaurants in the neighbourhood are listed to the app user.


Exploratory Data Analysis

Chosen data and its problem statement: The food delivery service data on Kaggle comprises customer and restaurant (vendor) metadata. The objective is to build a restaurant recommender system that leverages this data to predict whether a customer at a certain location would order from a certain restaurant vendor. The target is a binary outcome.

Customer Data

| Analysis Outcomes | So What? |
| --- | --- |
| 91% of train customers had ‘DOB’ missing, and 99.9% of customers had the same ‘status’ value. | These features were deemed not useful for customer profiling. |
| Duplicate customer records were present, as every update to a customer record was retained. | The most recent records had a verified account status; older records were removed. |
| No overlap between the train and test customer files! | This implied that model predictions were for new customers, and modelling approaches had to be adapted accordingly. |
| 19 train customer locations and 9 test customer locations had missing latitudes and longitudes. | Imputed with the mean coordinates of the respective customer (see the sketch below this table). |
| 877 train customers had only locations, without demographic details. | Null was treated as a valid category for the demographic features. |
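A minimal sketch of the coordinate imputation, assuming the location table carries customer_id, latitude and longitude columns as in the Kaggle data:

# Fill missing coordinates with the mean coordinates of the same customer
for col in ["latitude", "longitude"]:
    train_locations[col] = train_locations[col].fillna(
        train_locations.groupby("customer_id")[col].transform("mean"))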

De-duplication of customer demographics

The transform function in Pandas came in handy:

# Check for duplicates
train_customers[train_customers.duplicated("customer_id")]["customer_id"].value_counts()

# Keep only the chronologically latest record per customer
latest_update = train_customers.groupby("customer_id")["updated_at"].transform("max")
train_customers_dedup = train_customers[train_customers["updated_at"] == latest_update]

Geospatial analysis

import reverse_geocode

# Reverse-geocode each (latitude, longitude) pair to its nearest city and country
coordinates = list(zip(train_locations_valid["latitude"], train_locations_valid["longitude"]))
geocoded = pd.DataFrame(reverse_geocode.search(coordinates))
train_locations_valid[["country_code","city","country"]] = geocoded[["country_code","city","country"]].values

Based on the above findings, I had to conclude that the customers’ coordinates were masked in the data, so instances could not be discarded simply for having seemingly incorrect locations.

Vendor data

def add_food_cols(row, food_items):
  # One 0/1 column per food item; vendors without tags get all zeros
  if isinstance(row["vendor_tag_name"], str):
    tags = row["vendor_tag_name"].split(',')
    for item in food_items:
      row[item] = int(item in tags)
  else:  # vendor_tag_name is NaN for vendors without tags
    for item in food_items:
      row[item] = 0
  return row

# Food specialities of the vendor
import itertools
vendor_foods = pd.unique(vendors.loc[~vendors["vendor_tag_name"].isna(), "vendor_tag_name"])
food_list = [foods.split(',') for foods in vendor_foods]
food_items = list(set(itertools.chain(*food_list)))  # unique food tags across all vendors
print("Number of food items available across all vendors is ", len(food_items))
print("Sample food items available in ", food_items[:20])

# Count feature for the vendor_tag attribute (number of food specialities available in the restaurant)
vendors["tag_counts"] = vendors["vendor_tag"].str.split(',').str.len()

# One categorical (0/1) feature per food item
vendors = vendors.apply(lambda x: add_food_cols(x, food_items=food_items), axis=1)

Number of food items available across all vendors is 68

Sample food items available in ['Pizzas', 'Cafe', 'Healthy Food', 'Grills', 'Pizza', 'Crepes', 'Ice creams', 'Smoothies', 'Italian', 'Combos', 'Shuwa', 'Family Meal', 'Bagels', 'Thali', 'Hot Dogs', 'Pastry', 'Mandazi', 'Coffee', 'Soups', 'Japanese']

Vendor summary features

Combined Feature Analysis

Further feature engineering was done on the merged customer and vendor features.

Great Circle Distance: This is the shortest distance between the customer and vendor coordinates along the earth’s surface. It is the length of the arc connecting the two points, where the arc lies on an imaginary circle whose centre coincides with the centre of the earth and whose radius equals the earth’s radius. This imaginary circle is called a great circle, and it divides the earth into two equal halves. USGS has a nice illustration of the great circle. The great circle distance was calculated using the haversine formula, as below; I used Wikipedia and Plus Maths as references.
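For reference, this is the haversine formula the code below implements, with \varphi for latitude, \lambda for longitude and R = 6371 km for the earth’s mean radius:

d = 2R \arcsin\left(\sqrt{\sin^2\left(\frac{\Delta\varphi}{2}\right) + \cos\varphi_1 \cos\varphi_2 \sin^2\left(\frac{\Delta\lambda}{2}\right)}\right)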

For performance, the distance calculation was implemented in a vectorized way using numpy functions.

import numpy as np

def calculate_haversine(lon1, lat1, lon2, lat2):
    """
    Vectorized haversine distance in km. All args must be of equal length.
    """
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])
    delta_lon = lon2 - lon1
    delta_lat = lat2 - lat1

    haversine_angle = np.sin(delta_lat/2.0)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(delta_lon/2.0)**2
    haversine_distance = 2 * 6371 * np.arcsin(np.sqrt(haversine_angle))  # 6371 km = earth's mean radius
    return haversine_distance

train_merged['h_distance_new'] = calculate_haversine(train_merged['longitude_x'],train_merged['latitude_x'],train_merged['longitude_y'],train_merged['latitude_y'])

Distribution of the haversine distance

The peaks in the probability distribution plot correspond to the isolated clusters of customer locations we observed in the map earlier.
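A minimal sketch of how such a density plot can be produced, assuming seaborn is available (the original figure may have been generated differently):

import seaborn as sns
import matplotlib.pyplot as plt

# Kernel density estimate of the customer-vendor haversine distances
sns.kdeplot(train_merged["h_distance_new"])
plt.xlabel("Haversine distance (km)")
plt.show()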

Analysis on the presence of the Cold-Start Problem: Since the train and test customer files do not overlap, every test customer is new to the model, making this a full customer cold-start scenario.

Modelling Approaches:

Keeping the EDA findings in mind, I ended up experimenting with the approaches below:

Neighbourhood-based collaborative filtering method

The idea: represent each customer location by its vector of haversine distances to every vendor, find the k train locations most similar to a test location via cosine similarity, and take a majority vote over those neighbours’ vendor orders.

def calculate_vendor_distances(customer_locations, vendor_locations):
  cols = list(vendor_locations["vendor_id"])
  nbr_customer_locations = customer_locations.shape[0]
  customer_vendor_distances = pd.DataFrame(columns=["customer_id","location_number"]+cols)
  customer_vendor_distances[["customer_id","location_number"]] = customer_locations[["customer_id","location_number"]]

  # compute the haversine distance of each vendor against all customer locations
  # (assumes vendor_locations has a 0..n-1 RangeIndex, so cols[ind1] is the vendor at ind1)
  for ind1 in vendor_locations.index:
    start_latit = np.full(nbr_customer_locations, vendor_locations.loc[ind1,"latitude"])
    start_longit = np.full(nbr_customer_locations, vendor_locations.loc[ind1,"longitude"])
    end_latit = customer_locations["latitude"].values
    end_longit = customer_locations["longitude"].values
    customer_vendor_distances[cols[ind1]] = calculate_haversine(start_longit, start_latit, end_longit, end_latit)
  return customer_vendor_distances

train_custvend_distances = calculate_vendor_distances(train_locations[['customer_id', 'location_number', 'latitude', 'longitude']],vendor_locations)
test_custvend_distances = calculate_vendor_distances(test_locations[['customer_id', 'location_number', 'latitude', 'longitude']],vendor_locations)
from sklearn.metrics.pairwise import cosine_similarity
from scipy.stats import mode

def predict_cosine_neighbours(nearest_neighbours, train_custvend_distances, test_custvend_distances, train_custlocn_vendor_order, out_cols):
  # similarity of every test customer location to every train customer location; shape is (n_test, n_train)
  similarity_scores = cosine_similarity(test_custvend_distances.iloc[:,2:], Y=train_custvend_distances.iloc[:,2:], dense_output=True)

  # determine the indices of the top 'k' (nearest_neighbours) most similar train customer locations per test location
  similar_customerlocn_indices = np.argpartition(similarity_scores, kth=-nearest_neighbours, axis=-1)[:,-nearest_neighbours:]
  flatind = similar_customerlocn_indices.ravel()  # flatten all the indices obtained

  test_out_vendor = pd.DataFrame(columns=out_cols)
  test_out_vendor[["customer_id","location_number"]] = test_custvend_distances[["customer_id","location_number"]]

  # for every vendor column (ordered_vendor_cols is the global list of vendor target columns),
  # take the label with the maximum votes among the nearest neighbours
  for vend in ordered_vendor_cols:
    vend_y = mode(train_custlocn_vendor_order.loc[flatind, vend].values.reshape(similar_customerlocn_indices.shape[0], -1), axis=1)[0]
    test_out_vendor[vend] = np.ravel(vend_y)  # populate the vendor column in the output DF

  test_out_vendor_melt = unpivot_data(test_out_vendor)
  return test_out_vendor_melt

def unpivot_data(df):
  # reorganize a dataframe with one column per vendor into a single vendor_id column
  df_melt = pd.melt(df,
            id_vars=['customer_id','location_number'],
            value_vars=list(df.columns[2:]),  # list of vendor columns
            var_name='vendor_id',
            value_name='target')

  df_melt["vendor_id"] = df_melt["vendor_id"].str.split("vendor_id_").str[1]
  return df_melt

test_out_pred = predict_cosine_neighbours(nn, train_custvend_distances,test_custvend_distances,train_custlocn_vendor_order,out_cols)

Content-based filtering - Clustering & tree-based classifiers
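The full code for this approach is in the notebooks; below is a minimal illustrative sketch of the idea, where the cluster count, the feature list and the target column name are assumptions rather than the exact configuration used. scale_pos_weight offsets the ~1.5% positive class (see the scale_pos_weight reference at the end).

from sklearn.cluster import KMeans
from xgboost import XGBClassifier

# Cluster customer coordinates so the isolated geographic groups become a feature (cluster count assumed)
kmeans = KMeans(n_clusters=5, n_init=10, random_state=42)
train_merged["location_cluster"] = kmeans.fit_predict(train_merged[["latitude_x", "longitude_x"]])

# Illustrative feature list: customer-vendor distance, location cluster and vendor attributes
features = ["h_distance_new", "location_cluster", "tag_counts"]

# scale_pos_weight = negatives / positives counters the extreme class imbalance
pos_weight = (train_merged["target"] == 0).sum() / (train_merged["target"] == 1).sum()
model = XGBClassifier(n_estimators=200, scale_pos_weight=pos_weight)
model.fit(train_merged[features], train_merged["target"])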

Content-based filtering - Deep Learning Model

For the categorical inputs, the embedding size was set with the common fourth-root rule of thumb:

int(np.ceil(nbr_of_categories**0.25))
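As an illustration only (the actual architecture is in the notebooks), a minimal Keras model using that rule; nbr_of_categories, the layer sizes and the single categorical input are assumptions:

import numpy as np
import tensorflow as tf

nbr_of_categories = 68  # e.g. the number of vendor food tags found earlier
embedding_dim = int(np.ceil(nbr_of_categories**0.25))  # fourth-root rule -> 3 here

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=nbr_of_categories, output_dim=embedding_dim),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # binary target: ordered or not
])
model.compile(optimizer="adam", loss="binary_crossentropy")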

Model Evaluation

Since both classes carry equal weightage here, their precision and recall need to be balanced; hence the F1-score was used to evaluate these models.
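For completeness, the metric computation, assuming y_true and y_pred hold the actual and predicted binary targets:

from sklearn.metrics import f1_score

# F1 = 2 * precision * recall / (precision + recall), computed on the positive class by default
print(f1_score(y_true, y_pred))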

| Model Variant | F1-Score |
| --- | --- |
| Neighbourhood-based collaborative filtering | 0.53 |
| Content-based filtering - Clustering & Gradient Boosted Trees | 0.60 |
| Content-based filtering - Clustering & Random Forest | 0.57 |
| Content-based filtering - Deep Learning Model | 0.56 |

You can find the notebooks as well as the code for building the GUI on my GitHub, and the GUI here.

References:

https://www.kaggle.com/mrmorj/restaurant-recommendation-challenge

https://datascientyst.com/reverse-geocoding-latitude-longitude-city-country-python-pandas/

https://skelouse.github.io/faster_mapping_with_folium

https://www.geeksforgeeks.org/python-datetime-strptime-function/

https://www.kaggle.com/code/speedoheck/calculate-distance-with-geo-coordinates/notebook

https://plus.maths.org/content/lost-lovely-haversine

https://en.wikipedia.org/wiki/Haversine_formula

https://stats.stackexchange.com/questions/243207/what-is-the-proper-usage-of-scale-pos-weight-in-xgboost-for-imbalanced-datasets