14 May 2021

Recommendation engine using Text data, Cosine Similarity and TFIDF technique, Azure ML

recommender engine tfidf azure

What are we trying to do

We will build a very simple recommendation engine using text data. To demonstrate this, we use a case study approach and build a recommendation engine for the non-profit organization Career Village. I have a detailed post on the methodology of the recommendation engine here. In this post we show how we train the solution, deploy it and run inference with it in Azure.

We divide our approach into 2 major blocks:

Building the Model in Azure ML
Inference from the Model in Azure ML

Building the model in Azure ML has the following steps:

Create the Azure ML workspace
Upload data into the Azure ML Workspace
Create the code folder
Create the Compute Cluster
Create the Model
Create the Compute Environment
Create the Estimator
Create the Experiment and Run
Register the Model

Inferencing from the model in Azure ML has the following steps:

Create the Inference Script
Create the Inference Dependencies
Create the Inference Config
Create the Inference Clusters
Deploy the Model in the Inference Cluster
Get the predictions
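
Before going through the steps, the core idea (TF-IDF vectors compared with cosine similarity) can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the helper names `fit_idf`, `to_tfidf` and `cosine` and the toy questions are made up for this sketch, and the weighting is a simplified version of what scikit-learn's `TfidfVectorizer` computes in the actual training script.

```python
import math
from collections import Counter

def fit_idf(docs):
    """Learn a smoothed inverse document frequency for every term in the corpus."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc.lower().split()))
    return {term: math.log((1 + n_docs) / (1 + df)) + 1 for term, df in doc_freq.items()}

def to_tfidf(doc, idf):
    """Turn a document into a sparse {term: tf * idf} vector, ignoring unseen terms."""
    counts = Counter(term for term in doc.lower().split() if term in idf)
    return {term: count * idf[term] for term, count in counts.items()}

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Toy corpus (made up for illustration, not taken from the Career Village data)
past_questions = [
    "how do i become a data scientist",
    "what should i study to become a nurse",
    "how to prepare for a software engineering interview",
]
idf = fit_idf(past_questions)
question_vectors = [to_tfidf(q, idf) for q in past_questions]

# Score a new question against every past question and pick the closest one
new_question = "i want to be a data scientist"
scores = [cosine(to_tfidf(new_question, idf), v) for v in question_vectors]
best = max(range(len(scores)), key=scores.__getitem__)
print(best, past_questions[best])
```

The recommendation is then simply the answers attached to the most similar past questions, which is what the deployed service does at a larger scale.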

Create the Azure ML workspace

We need an Azure ML workspace, which houses the experiments, runs, models and everything else we create. Let’s create one.

import azureml.core
print(azureml.core.VERSION)

from azureml.core import Workspace
from azureml.core.authentication import InteractiveLoginAuthentication

sid = '<your-subscription-id>'
forced_interactive_auth = InteractiveLoginAuthentication(tenant_id='<your-tenant-id>')
ws = Workspace.create(name='azureml_workspace',
            subscription_id=sid,
            resource_group='rgazureml',
            create_resource_group=True,
            location='centralus',
            auth=forced_interactive_auth  # use the interactive login created above
            )

This code segment deploys the following in the resource group rgazureml:

An Azure ML workspace azureml_workspace
An Application Insights instance
A Key Vault
A Storage Account

You can navigate to the resource group to view these resources.

Upload data into the Azure ML Workspace

Machine Learning is about data. We need data to build our models. The data which we are going to use is the Career Village dataset. Let us store the data in the Azure ML workspace.

# Upload data into the Azure ML Workspace

#upload data by using get_default_datastore()
ds = ws.get_default_datastore()
# The target path must match the 'recodata' folder that train.py reads from
ds.upload(src_dir='./recodata', target_path='recodata', overwrite=True, show_progress=True)

print('Done')

We upload the data to the workspace's default datastore, which is backed by the storage account created earlier.
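
Before uploading, it can help to confirm that the local folder really contains the three CSV files the training script reads later. The sketch below is only a convenience check; the helper name `missing_files` is hypothetical, and a temporary directory stands in for `./recodata`.

```python
import os
import tempfile

# File names read later by the training script
REQUIRED_FILES = ["questions.csv", "professionals.csv", "answers.csv"]

def missing_files(src_dir, required=REQUIRED_FILES):
    """Return the required file names that are not present in src_dir."""
    return [name for name in required
            if not os.path.isfile(os.path.join(src_dir, name))]

# Demonstration on a temporary directory standing in for ./recodata
with tempfile.TemporaryDirectory() as demo_dir:
    for name in ["questions.csv", "answers.csv"]:
        open(os.path.join(demo_dir, name), "w").close()
    missing = missing_files(demo_dir)
    print(missing)  # ['professionals.csv']
```

Calling `missing_files('./recodata')` and uploading only when the result is empty avoids a failed training run later on.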

Create the code folder

We store the code files in a local directory.

import os

# create the folder
folder_training_script = './recocode'
os.makedirs(folder_training_script, exist_ok=True)

print('Done')

Up to this point, we have uploaded the data for machine learning. Next we need compute on which to train the model, so let’s create a compute cluster using the Azure ML SDK.

Create the Compute Cluster

from azureml.core.compute import AmlCompute
from azureml.core.compute import ComputeTarget
import os

# Step 1: name the cluster and set the minimal and maximal number of nodes 
compute_name = os.environ.get("AML_COMPUTE_CLUSTER_NAME", "cpucluster")
# Environment variables are strings, so convert the node counts to integers
min_nodes = int(os.environ.get("AML_COMPUTE_CLUSTER_MIN_NODES", 0))
max_nodes = int(os.environ.get("AML_COMPUTE_CLUSTER_MAX_NODES", 1))

# Step 2: choose the virtual machine size
vm_size = os.environ.get("AML_COMPUTE_CLUSTER_SKU", "STANDARD_D2_V2")

provisioning_config = AmlCompute.provisioning_configuration(
    vm_size = vm_size, min_nodes = min_nodes, max_nodes = max_nodes)

# create the cluster
compute_target = ComputeTarget.create(ws, compute_name, provisioning_config)

print('Compute target created')
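
One detail about reading settings from the environment is worth a small illustration: `os.environ.get` returns a string whenever the variable is set, while node counts are integers. The sketch below shows the difference; the helper name `int_from_env` is hypothetical, and the variable values are set by hand only to simulate an exported setting.

```python
import os

def int_from_env(name, default):
    """Read an integer setting from the environment, falling back to a default."""
    return int(os.environ.get(name, default))

os.environ["AML_COMPUTE_CLUSTER_MAX_NODES"] = "4"       # simulate an exported variable
os.environ.pop("AML_COMPUTE_CLUSTER_MIN_NODES", None)   # make sure this one is unset

raw_value = os.environ.get("AML_COMPUTE_CLUSTER_MAX_NODES", 1)  # a string, not an int
max_nodes = int_from_env("AML_COMPUTE_CLUSTER_MAX_NODES", 1)
min_nodes = int_from_env("AML_COMPUTE_CLUSTER_MIN_NODES", 0)
print(type(raw_value).__name__, max_nodes, min_nodes)  # str 4 0
```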

Create the Model

The methodology of the recommendation engine is detailed in the post here.

Once the training is complete, we save the following:

TF-IDF fitted vectorizer
TF-IDF matrix of the questions

%%writefile $folder_training_script/train.py

import argparse
import os
import numpy as np
import pandas as pd
import glob
import gc
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import string
import pickle

from azureml.core import Run
# from utils import load_data

parser = argparse.ArgumentParser()
parser.add_argument('--data-folder', type=str, dest='data_folder', help='data folder mounting point')
args = parser.parse_args()

def clean_text(text):
    '''Make text lowercase and remove punctuation and newlines.'''
    text = str(text).lower()
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)
    # Replace newlines with a space so that words on adjacent lines are not merged
    text = re.sub('\n', ' ', text)
    return text


data_folder = os.path.join(args.data_folder, 'recodata')
print('Data folder:', data_folder)

questions  = pd.read_csv(os.path.join(data_folder, 'questions.csv'))
professionals = pd.read_csv(os.path.join(data_folder, 'professionals.csv'))
answers = pd.read_csv(os.path.join(data_folder, 'answers.csv'))

prof_ans = pd.merge(professionals, answers, how = 'left' ,
                    left_on = 'professionals_id', right_on = 'answers_author_id')
prof_ans_q = pd.merge(prof_ans, questions, how = 'left' ,
                      left_on = 'answers_question_id', right_on = 'questions_id')

prof_ans_q = prof_ans_q[(~prof_ans_q["questions_title"].isna()) | (~prof_ans_q["questions_body"].isna()) ]

# Treat a missing title or body as empty text; the number of rows is unchanged
q = prof_ans_q["questions_title"].fillna('') + " " + prof_ans_q["questions_body"].fillna('')
q = q.apply(clean_text)

MAX_DF     = 0.95
MIN_DF     = 2
LANGUAGE   = 'english'

tfidf_vectorizer = TfidfVectorizer(max_df=MAX_DF, 
                                   min_df=MIN_DF,
                                   stop_words=LANGUAGE)

q = q.dropna()
tfidf_vectorizer.fit(q)
q_tfidf = tfidf_vectorizer.transform((q))

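# Get the experiment run context
run = Run.get_context()

# Files written to the outputs folder are uploaded to the run automatically
os.makedirs('outputs', exist_ok=True)
with open('outputs/tfidf_vectorizer.pkl', 'wb') as f:
    pickle.dump(tfidf_vectorizer, f)
with open('outputs/q_tfidf.pkl', 'wb') as f:
    pickle.dump(q_tfidf, f)

run.complete()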
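
The text cleaning step can be sanity-checked locally without any Azure resources. The sketch below re-declares a cleaning helper so that it runs on its own; it maps newlines to a space so that words on adjacent lines stay separate, and the sample sentence is made up for illustration.

```python
import re
import string

def clean_text(text):
    """Lowercase the text, strip punctuation, and turn newlines into spaces."""
    text = str(text).lower()
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)
    text = re.sub(r'\n', ' ', text)
    return text

sample = "What should I study?\nI want to be a Data-Scientist!"
cleaned = clean_text(sample)
print(cleaned)

# A missing value passes through str() and becomes the literal word 'nan'
print(clean_text(float("nan")))
```

The second call shows why missing titles or bodies deserve attention before cleaning: after `str()`, a NaN is no longer a missing value that `dropna` can remove.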

Create the Compute Environment

So far, we have done the following:

Uploaded the data into Azure
Created the compute resources for the machine learning model
Created the model in a file

We need certain packages to perform machine learning, such as scikit-learn. This is a small machine learning model; a larger one would typically require more packages. In the next step, we create an Environment which houses the packages for the machine learning model.

from azureml.core import Environment
from azureml.core.conda_dependencies import CondaDependencies

# Create a Python environment for the experiment
reco_env = Environment("reco-experiment-env")
reco_env.python.user_managed_dependencies = False # Let Azure ML manage dependencies
reco_env.docker.enabled = True # Use a docker container

# Create a set of package dependencies (conda or pip as required)
# pandas is included because train.py imports it
reco_packages = CondaDependencies.create(conda_packages=['scikit-learn', 'pandas'])

# Add the dependencies to the environment
reco_env.python.conda_dependencies = reco_packages

print(reco_env.name, 'defined.')

# Register the environment
reco_env.register(workspace=ws)

Create the Estimator

So far, we have done the following:

Uploaded the data into Azure
Created the compute resources for the machine learning model
Created the model in a file
Created the environment for the compute clusters

We now bind all of this together with an Estimator, which combines the source code, the script parameters, the compute cluster and the compute environment. The only script parameter this model needs is the data folder, which we pass as a mount of the default datastore.

from azureml.train.estimator import Estimator

script_params = {
    '--data-folder': ds.as_mount()
}

registered_env = Environment.get(ws, 'reco-experiment-env')

# Create an estimator
estimator = Estimator(source_directory=folder_training_script,
                      script_params=script_params,
                      compute_target = compute_target, # Run the experiment on the remote compute target
                      environment_definition = registered_env,
                      entry_script='train.py')
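
To see how the `--data-folder` entry in `script_params` reaches the training script, the argument parsing can be tried on its own. This is a minimal sketch: the path `/mnt/datastore` is a made-up stand-in for the mount point that Azure ML substitutes at run time.

```python
import argparse

# Same argument definition as in train.py
parser = argparse.ArgumentParser()
parser.add_argument('--data-folder', type=str, dest='data_folder',
                    help='data folder mounting point')

# Simulate the command line built for the run; '/mnt/datastore' is a made-up path
args = parser.parse_args(['--data-folder', '/mnt/datastore'])
print(args.data_folder)
```

Because of `dest='data_folder'`, the value is available as `args.data_folder`, which the training script then joins with the `recodata` subfolder.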

We are now ready for an Azure ML Experiment, which houses a number of Runs. The following code segment creates an experiment and submits a Run with the Estimator created above.

# Create the Experiment and Run               

from azureml.core import Experiment

#Create an experiment
experiment = Experiment(workspace = ws, name = "reco_expt")

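print('Experiment created')
run = experiment.submit(config=estimator)

# Block until training finishes so that the output files exist before registration
run.wait_for_completion(show_output=True)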

Register the Model

We register the following artifacts in the workspace:

TF-IDF fitted vectorizer
TF-IDF matrix of the questions

tfidf_vectorizer = run.register_model(model_name='tfidf_vectorizer_model',
                           model_path='outputs/tfidf_vectorizer.pkl',
                           tags = {'area': "tfidf_vectorizer", 'type': "sklearn"},
                           description = "tfidf_vectorizer")

q_tfidf = run.register_model(model_name='q_tfidf_model',
                           model_path='outputs/q_tfidf.pkl',
                           tags = {'area': "q_tfidf", 'type': "sklearn"},
                           description = "q_tfidf")

print(tfidf_vectorizer.name, q_tfidf.name, sep='\t')

Create the Inference Script

Up to this point, we have built the model and registered it in our workspace. We will now use this model to predict on new data. The first step is to create an inference script.

%%writefile $folder_training_script/score.py

import json
import joblib
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
import pickle
from azureml.core.model import Model

# Called when the service is loaded
def init():
    global tfidf_vectorizer, q_tfidf

    # Get the paths to the registered model files
    vectorizer_path = Model.get_model_path('tfidf_vectorizer_model')
    q_tfidf_path = Model.get_model_path('q_tfidf_model')

    # Load both artifacts once here, rather than on every request
    with open(vectorizer_path, 'rb') as f:
        tfidf_vectorizer = pickle.load(f)
    with open(q_tfidf_path, 'rb') as f:
        q_tfidf = pickle.load(f)


# Called when a request is received
def run(raw_data):
    # Get the input data as a numpy array
    data = np.array(json.loads(raw_data)['data'])

    # Vectorize the new questions and compare them with every stored question
    q_new_tfidf = tfidf_vectorizer.transform(data)
    predictions = cosine_similarity(q_new_tfidf, q_tfidf)

    return predictions.tolist()
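
The request and response contract of the scoring script can be tried locally before anything is deployed. The sketch below keeps the same shape (a JSON string with a `data` list goes in, one row of scores per input text comes out) but swaps the real TF-IDF comparison for a toy word-overlap scorer; `KNOWN_QUESTIONS` and `toy_similarity` are hypothetical stand-ins, not part of the service.

```python
import json

# Hypothetical stored questions; the real service compares against the TF-IDF matrix
KNOWN_QUESTIONS = ["how do i become a data scientist", "what should a nurse study"]

def toy_similarity(text):
    """Stand-in scorer: number of words shared with each known question."""
    words = set(text.lower().split())
    return [float(len(words & set(q.split()))) for q in KNOWN_QUESTIONS]

def run(raw_data):
    """Same contract as the scoring script: JSON string in, list of score rows out."""
    texts = json.loads(raw_data)["data"]
    return [toy_similarity(text) for text in texts]

request_body = json.dumps({"data": ["I want to be a data scientist"]})
scores = run(request_body)
response_body = json.dumps(scores)   # what the web service would send back
print(response_body)
```

Each inner list has one score per stored question, so its positions line up with the rows used to build the question matrix.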

We also define the packages which the inference script requires.

Create the Inference Dependencies

from azureml.core.conda_dependencies import CondaDependencies

# Add the dependencies for your model
myenv = CondaDependencies()
myenv.add_conda_package("scikit-learn")

# Save the environment config as a .yml file
env_file = './recocode/env.yml'
with open(env_file,"w") as f:
    f.write(myenv.serialize_to_string())
print("Saved dependency info in", env_file)

So far, we have the following:

Inference script
Dependencies required for the inference script

We need to combine these, and the InferenceConfig helps us do this.

Create the Inference Config

from azureml.core.model import InferenceConfig

classifier_inference_config = InferenceConfig(runtime= "python",
                                              source_directory = './recocode',
                                              entry_script="score.py",
                                              conda_file="env.yml")

Create the Inference Clusters

We create the inference cluster, an AKS cluster on which the model will be served.

from azureml.core.compute import ComputeTarget, AksCompute

cluster_name = 'aks-cluster'
compute_config = AksCompute.provisioning_configuration(cluster_purpose = AksCompute.ClusterPurpose.DEV_TEST)
production_cluster = ComputeTarget.create(ws, cluster_name, compute_config)
production_cluster.wait_for_completion(show_output=True)

Deploy the Model in the Inference Cluster

We create the deployment config for the AKS web service.

from azureml.core.webservice import AksWebservice

classifier_deploy_config = AksWebservice.deploy_configuration(cpu_cores = 1,
                                                              memory_gb = 1)

The final step is to combine the following to create an endpoint which can be used for predicting on new data:

Inference config
Deployment config
Inference cluster

from azureml.core.model import Model

model1 = ws.models['tfidf_vectorizer_model']
model2 = ws.models['q_tfidf_model']
service = Model.deploy(workspace=ws,
                       name = 'reco-service',
                       models = [model1,model2],
                       inference_config = classifier_inference_config,
                       deployment_config = classifier_deploy_config,
                       deployment_target = production_cluster)
service.wait_for_deployment(show_output = True)

We print the endpoint for the model

endpoint = service.scoring_uri
print(endpoint)

Get the recommendations

We get the service keys, which we need to authenticate prediction requests.

primary_key, secondary_key = service.get_keys()

The final step is getting predictions from the endpoint, as illustrated below.

import requests
import json

# An array of new data cases
x_new = ["I want to be a data scientist. What should I study"]

# Convert the array to a serializable list in a JSON document
json_data = json.dumps({"data": x_new})

# Set the content type in the request headers
request_headers = { "Content-Type":"application/json",
                    "Authorization":"Bearer " + primary_key }

# Call the service
response = requests.post(url = endpoint,
                         data = json_data,
                         headers = request_headers)

print(response)

Get the 3 best answers

import os
import numpy as np
import pandas as pd

data_folder = "recodata"

questions  = pd.read_csv(os.path.join(data_folder, 'questions.csv'))
professionals = pd.read_csv(os.path.join(data_folder, 'professionals.csv'))
answers = pd.read_csv(os.path.join(data_folder, 'answers.csv'))

prof_ans = pd.merge(professionals, answers, how = 'left' ,
                    left_on = 'professionals_id', right_on = 'answers_author_id')
prof_ans_q = pd.merge(prof_ans, questions, how = 'left' ,
                      left_on = 'answers_question_id', right_on = 'questions_id')

prof_ans_q = prof_ans_q[(~prof_ans_q["questions_title"].isna()) | (~prof_ans_q["questions_body"].isna()) ]

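# Convert the similarity scores returned by the service into an array
result = np.array(response.json())

# Row positions of the 10 most similar questions, in ascending order of similarity
index = np.argsort(result)[:,-10:]
print(index[0])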

print(prof_ans_q.iloc[index[0][-1]]["answers_body"])

print(prof_ans_q.iloc[index[0][-2]]["answers_body"])

print(prof_ans_q.iloc[index[0][-3]]["answers_body"])
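
The top-3 selection can also be written without sorting the whole score row. The sketch below uses `heapq.nlargest` from the standard library as a dependency-free counterpart to the `np.argsort` call; the scores are made up for illustration.

```python
import heapq

# Made-up similarity scores for one new question against six stored rows
scores = [0.12, 0.80, 0.05, 0.47, 0.91, 0.33]

# Pair each score with its row position and keep the three largest
top3 = heapq.nlargest(3, enumerate(scores), key=lambda pair: pair[1])
top3_rows = [row for row, _ in top3]
print(top3_rows)  # [4, 1, 3]
```

The resulting row positions would then be passed to `prof_ans_q.iloc[...]` in the same way as the indices above, best match first.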

Thoughts - Ambarish