Time to Merge Prediction Inference Service

In the previous notebook, we explored some basic machine learning models for predicting the time to merge of a PR. We then deployed the model with the highest f1-score as a service using Seldon. The purpose of this notebook is to check whether this service is running as intended, and more specifically to ensure that its performance is what we expect it to be. To do that, we will use the test set from the aforementioned notebook as the query payload for the service, and then verify that the returned predictions match those obtained during training/testing locally.

import sys
import json
import os
import requests
from dotenv import load_dotenv, find_dotenv
import numpy as np

from sklearn.metrics import classification_report

metric_template_path = "../data-sources/TestGrid/metrics"
if metric_template_path not in sys.path:
    sys.path.insert(1, metric_template_path)

from ipynb.fs.defs.metric_template import (  # noqa: E402
    CephCommunication,
)

load_dotenv(find_dotenv())
True
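The CephCommunication helper imported above is defined in the metric_template notebook. For readers without that notebook handy, here is a minimal boto3-based sketch of roughly what its read_from_ceph method does (the internals here are an approximation, not the actual implementation):

import boto3
import pandas as pd
from io import BytesIO


class CephCommunicationSketch:
    """Hypothetical stand-in for the CephCommunication helper."""

    def __init__(self, endpoint_url, access_key, secret_key, bucket):
        # S3-compatible client pointed at the Ceph endpoint
        self.s3 = boto3.resource(
            "s3",
            endpoint_url=endpoint_url,
            aws_access_key_id=access_key,
            aws_secret_access_key=secret_key,
        )
        self.bucket = bucket

    def read_from_ceph(self, path, filename):
        # Fetch the parquet object and load it into a pandas dataframe
        obj = self.s3.Object(self.bucket, f"{path}/{filename}")
        return pd.read_parquet(BytesIO(obj.get()["Body"].read()))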
## CEPH Bucket variables
## Create a .env file on your local machine with the correct configs.
s3_endpoint_url = os.getenv("S3_ENDPOINT")
s3_access_key = os.getenv("S3_ACCESS_KEY")
s3_secret_key = os.getenv("S3_SECRET_KEY")
s3_bucket = os.getenv("S3_BUCKET")
s3_path = "github"
REMOTE = os.getenv("REMOTE")
INPUT_DATA_PATH = "../../../data/processed/github"
if REMOTE:
    cc = CephCommunication(s3_endpoint_url, s3_access_key, s3_secret_key, s3_bucket)
    X_test = cc.read_from_ceph(s3_path, "X_test.parquet")
    y_test = cc.read_from_ceph(s3_path, "y_test.parquet")

else:
    print(
        "The X_test.parquet and y_test.parquet files are not included in the ocp-ci-analysis github repo."
    )
    print(
        "Please set REMOTE=1 in the .env file and read this data from the S3 bucket instead."
    )
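If you have already generated these files locally by running the training notebook, a minimal sketch of reading them from INPUT_DATA_PATH instead (the local file layout is an assumption):

import pandas as pd

# Local fallback (sketch): read the test set saved by the training notebook
X_test = pd.read_parquet(f"{INPUT_DATA_PATH}/X_test.parquet")
y_test = pd.read_parquet(f"{INPUT_DATA_PATH}/y_test.parquet")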
X_test
size is_reviewer is_approver created_at_day created_at_month created_at_weekday created_at_hour change_in_.github change_in_docs change_in_pkg ... title_wordcount_fix title_wordcount_haproxy title_wordcount_oc title_wordcount_publishing title_wordcount_revert title_wordcount_router title_wordcount_sh title_wordcount_staging title_wordcount_support title_wordcount_travis
3599 3 True True 6 7 0 21 0 0 1 ... 0 0 0 0 0 0 0 0 0 0
14562 4 True True 9 6 4 22 0 0 1 ... 0 0 0 0 0 0 0 0 0 0
10102 0 False False 29 7 4 3 0 0 1 ... 0 0 0 0 0 0 0 0 0 0
578 3 False False 16 12 1 13 0 0 1 ... 0 0 0 0 0 0 0 0 0 0
9401 1 True True 17 6 4 5 0 0 1 ... 0 0 0 0 0 0 0 0 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
10438 1 False True 16 8 1 2 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
6027 3 False False 23 11 0 16 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
11558 3 False False 25 10 1 8 0 0 1 ... 0 0 0 0 0 0 0 0 0 0
18442 4 False False 5 2 0 10 0 0 0 ... 1 0 0 0 0 0 0 0 0 0
23485 0 False False 25 7 3 13 0 0 1 ... 0 0 0 0 0 0 0 0 0 0

2706 rows × 96 columns

y_test
ttm_class
3599 9
14562 5
10102 2
578 4
9401 6
... ...
10438 6
6027 8
11558 5
18442 4
23485 6

2706 rows × 1 columns

# endpoint from the Seldon deployment
base_url = "http://ttm-pipeline-opf-seldon.apps.zero.massopen.cloud/predict"
# convert the dataframe into a numpy array and then to a list (the format required by Seldon)
data = {"data": {"ndarray": X_test.to_numpy().tolist()}}

# create the query payload
json_data = json.dumps(data)
headers = {"content-Type": "application/json"}
# query our inference service
response = requests.post(base_url, data=json_data, headers=headers)
response
<Response [200]>
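Before parsing the payload, it is worth failing fast on an unsuccessful response rather than trying to decode an error body. A small sketch (not in the original notebook):

# Sketch: raise immediately if the service returned a non-2xx status
response.raise_for_status()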
# what are the names of the prediction classes
json_response = response.json()
json_response["data"]["names"]
['t:0', 't:1', 't:2', 't:3', 't:4', 't:5', 't:6', 't:7', 't:8', 't:9']
# probability estimates for each class for a sample PR
json_response["data"]["ndarray"][0]
[0.02, 0.03, 0.085, 0.165, 0.09, 0.155, 0.09, 0.135, 0.07, 0.16]
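As a quick sanity check, the service should return one probability per class for every PR in the test set, and each row of probabilities should sum to (approximately) 1. A sketch:

# Sanity check (sketch): one probability per class, summing to ~1 per PR
probs = np.array(json_response["data"]["ndarray"])
assert probs.shape == (len(X_test), 10)
assert np.allclose(probs.sum(axis=1), 1.0, atol=1e-3)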
# get predicted classes from probabilities for each PR
preds = np.argmax(json_response["data"]["ndarray"], axis=1)
# evaluate results
print(classification_report(y_test, preds))
              precision    recall  f1-score   support

           0       0.31      0.42      0.36       249
           1       0.14      0.10      0.12       217
           2       0.23      0.27      0.25       364
           3       0.15      0.17      0.16       240
           4       0.13      0.10      0.11       275
           5       0.14      0.10      0.12       236
           6       0.23      0.23      0.23       333
           7       0.16      0.14      0.15       270
           8       0.18      0.17      0.17       260
           9       0.23      0.28      0.25       262

    accuracy                           0.20      2706
   macro avg       0.19      0.20      0.19      2706
weighted avg       0.19      0.20      0.20      2706
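To confirm that the service is wrapping the very same model we trained, we could also compare its predictions against the locally saved model artifact. A sketch, assuming the trained model was serialized with joblib as "model.joblib" (the artifact name and location are assumptions):

from joblib import load

# Sketch: load the local model artifact and compare its predicted classes
# against the classes returned by the inference service ("model.joblib" is assumed)
model = load("model.joblib")
local_preds = model.predict(X_test)
print(f"Fraction of matching predictions: {(local_preds == preds).mean():.3f}")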

Conclusion

The evaluation scores in the classification report above match the ones we saw in the training notebook. Great, it looks like our inference service and model are working as expected, and are ready to predict the time to merge of incoming PRs!