Mapping Image SHA to Image Name using Pyxis database

In order to answer most of the customer usage and business insights questions raised in the OpenShift Workload Fingerprinting project, we need to connect two disparate datasets: the insights operator archives and the Pyxis database. That is, we want to use Pyxis to determine the product name, architecture, vulnerabilities, vendor, etc. corresponding to the container image SHAs in the insights dataset. In a previous issue, we figured out how to do this for a given SHA by using curl in the terminal. In this notebook, we will do this programmatically, for all the SHAs available in our dataset. We will then store the merged dataset in an s3 bucket and use it for the rest of the analysis in the project going forward.

Prerequisites

In order to fetch the image name (and other details) for a given image_id SHA, please complete the prerequisites described below.

  1. Follow the link to set up a kerberos ticket and Red Hat IdM on your machine.

  2. Update /etc/krb5.conf on your machine by setting dns_canonicalize_hostname to false, as described in the first ‘red box’ in this guide.

  3. Obtain the kerberos ticket by running $ kinit <your_kerberos_username>@IPA.REDHAT.COM (a quick way to verify the ticket is sketched below).
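As a quick sanity check (an optional sketch, assuming the klist command is available on your machine), you can confirm that a valid ticket exists before running the rest of the notebook:

# Optional sanity check: `klist -s` exits with status 0 only if a valid
# (non-expired) kerberos ticket is present in the credential cache.
import subprocess

if subprocess.run(["klist", "-s"]).returncode == 0:
    print("Valid kerberos ticket found.")
else:
    print("No valid kerberos ticket found - please run kinit first.")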

In this notebook, we map the image_id SHAs from the image layers dataset and the containers dataset provided in the workload data of the insights operator.
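Conceptually, the curl-based lookup from the earlier issue translates into a kerberos-authenticated GET request against the Pyxis images endpoint. The snippet below is a minimal sketch of such a single-SHA lookup (assuming a valid kerberos ticket; the sha256 value is a placeholder, and the actual helper functions used for the bulk mapping are defined later in this notebook):

# Minimal sketch of a single-SHA lookup against Pyxis (python equivalent of the curl call)
import requests
from requests_kerberos import HTTPKerberosAuth, OPTIONAL

kerberos_auth = HTTPKerberosAuth(mutual_authentication=OPTIONAL)
url = "https://pyxis.engineering.redhat.com/v1/images?filter=image_id==" + "sha256:<image digest>"
response = requests.get(url, auth=kerberos_auth, verify=False)
if response.status_code == 200:
    # "data" holds the matching image records (possibly empty)
    print(response.json()["data"][:1])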

Importing useful packages

import io
import boto3
import requests
import os
import json
import warnings
import pandas as pd
import multiprocessing as mp

from requests_kerberos import HTTPKerberosAuth, OPTIONAL
from dotenv import find_dotenv, load_dotenv
from tqdm import tqdm
load_dotenv(find_dotenv())
warnings.filterwarnings("ignore")
pd.set_option("display.max_columns", None)

Data Collection

In this section, we will fetch from our s3 bucket the containers dataset and the image layers dataset that have been curated from insights operator archives. To learn more about the general content of these datasets, please check out the getting_started notebook.

# CEPH Bucket variables
s3_endpoint_url = os.getenv("S3_ENDPOINT")
s3_access_key = os.getenv("S3_ACCESS_KEY")
s3_secret_key = os.getenv("S3_SECRET_KEY")
s3_bucket = os.getenv("S3_BUCKET")

# s3 resource to communicate with storage
s3 = boto3.resource(
    "s3",
    endpoint_url=s3_endpoint_url,
    aws_access_key_id=s3_access_key,
    aws_secret_access_key=s3_secret_key,
)

dates = [
    "2021-08-05",
    "2021-08-08",
    "2021-08-10",
    "2021-08-11",
    "2021-08-15",
    "2021-08-17",
    "2021-08-19",
    "2021-08-22",
    "2021-08-23",
]

image_layers_dfs = []
for date in dates:
    obj1 = s3.Object(
        s3_bucket,
        "prototype/workload/image_layers/date={date}/{date}.parquet".format(date=date),
    )
    buffer1 = io.BytesIO()
    obj1.download_fileobj(buffer1)
    image_layers_dfs.append(pd.read_parquet(buffer1))
# Combine the daily image layers files into a single dataframe
image_layers_df = pd.concat(image_layers_dfs)

containers_dfs = []
for date in dates:
    obj2 = s3.Object(
        s3_bucket,
        "prototype/workload/containers/date={date}/{date}.parquet".format(date=date),
    )
    buffer2 = io.BytesIO()
    obj2.download_fileobj(buffer2)
    containers_dfs.append(pd.read_parquet(buffer2))
# Combine the daily containers files into a single dataframe
containers_df = pd.concat(containers_dfs)
image_layers_df.head(3)
cluster_id image_id layer_image_id layer_image_level first_command first_arg archive_path
0 9e97b920-2876-4076-8bfb-36fe123bc273 sha256:3bc831c3d6614afcd5a8e1728b8bbe6709c957d... sha256:1cadda38f72dece653de82063e3c8e910265fe7... 0 U7Yi5SISAtKW <NA> archives/compressed/9e/9e97b920-2876-4076-8bfb...
1 9e97b920-2876-4076-8bfb-36fe123bc273 sha256:3bc831c3d6614afcd5a8e1728b8bbe6709c957d... sha256:a50df8fd88fecefc26fd331f832672108deb08c... 1 U7Yi5SISAtKW <NA> archives/compressed/9e/9e97b920-2876-4076-8bfb...
2 9e97b920-2876-4076-8bfb-36fe123bc273 sha256:3bc831c3d6614afcd5a8e1728b8bbe6709c957d... sha256:904d3325f999f09cad1ba9676937fc8b72ff285... 2 U7Yi5SISAtKW <NA> archives/compressed/9e/9e97b920-2876-4076-8bfb...
containers_df.head(3)
cluster_id namespace shape shape_instances image_id first_command first_arg init_container archive_path
0 98df2866-2131-41c1-97f3-aba6f8761c3d 0LiT6ZNtbpYL sha256:7ac9e625af2e30671ebec339821489da205116c... 6 sha256:6c05d74eb1fa37a77ed9215d83933265564d661... N9KxLV2avCo2 BuLIUMMJnyP_ False archives/compressed/98/98df2866-2131-41c1-97f3...
1 98df2866-2131-41c1-97f3-aba6f8761c3d 0LiT6ZNtbpYL sha256:b1adc9101829bec6f71530547b1151891a99116... 6 sha256:6c05d74eb1fa37a77ed9215d83933265564d661... N9KxLV2avCo2 EbplhSJxzSTF False archives/compressed/98/98df2866-2131-41c1-97f3...
2 98df2866-2131-41c1-97f3-aba6f8761c3d 0LiT6ZNtbpYL sha256:b1adc9101829bec6f71530547b1151891a99116... 6 sha256:8b9ecf20324c62d92b4a812a9f502b1059cfed0... Cl6kTzfbYztA <NA> True archives/compressed/98/98df2866-2131-41c1-97f3...

Function to extract mapped information for image_id

def mapped_df(image_id):
    """Query Pyxis for a given image_id SHA and return its labels as a one-row dataframe."""
    base_url = "https://pyxis.engineering.redhat.com/v1/images?filter=image_id=="
    team_url = base_url + str(image_id)
    r = requests.get(team_url, auth=kerberos_auth, verify=False)
    if r.status_code == 200:
        data = json.loads(r.content)
        if len(data["data"]) > 0 and len(data["data"][0]["parsed_data"]["labels"]) > 0:
            # Pivot the list of {name, value} label records into a single wide row
            df = pd.DataFrame(data["data"][0]["parsed_data"]["labels"])
            table = pd.pivot_table(
                df, values="value", aggfunc=lambda x: x, columns="name"
            )
            table["image_id"] = image_id
            return table.set_index("image_id")
    # No match (or no labels) in Pyxis: return an empty dataframe
    return pd.DataFrame([])

Function to extract mapped information for layer_image_id

def mapped_layer_df(image_id):
    """Query Pyxis by top_layer_id for a given layer SHA and return its labels as a one-row dataframe."""
    base_url = "https://pyxis.engineering.redhat.com/v1/images?filter=top_layer_id=="
    team_url = base_url + str(image_id)
    r = requests.get(team_url, auth=kerberos_auth, verify=False)
    if r.status_code == 200:
        data = json.loads(r.content)
        if len(data["data"]) > 0 and len(data["data"][0]["parsed_data"]["labels"]) > 0:
            # Pivot the list of {name, value} label records into a single wide row
            df = pd.DataFrame(data["data"][0]["parsed_data"]["labels"])
            table = pd.pivot_table(
                df, values="value", aggfunc=lambda x: x, columns="name"
            )
            table["image_id"] = image_id
            return table.set_index("image_id")
    # No match (or no labels) in Pyxis: return an empty dataframe
    return pd.DataFrame([])

Mapping the SHAs in the image_id column of the Image Layers Dataset

First, we form a list of unique image_id values from the image layers dataset. Using that list, we query the Pyxis API for each SHA and build a dataframe with image_id and the corresponding product name, summary, vendor, version, and other attributes.

# Creating the list of image_id
arr_imageid = image_layers_df.image_id.unique()
kerberos_auth = HTTPKerberosAuth(mutual_authentication=OPTIONAL)
# Size of the list
len(arr_imageid)
5315
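Before kicking off the full run over all 5315 SHAs, it can help to sanity-check the helper on a single image_id (a quick, illustrative check, assuming the kerberos ticket is still valid):

# Quick check on the first image_id in the list; an empty dataframe simply
# means Pyxis has no record (or no labels) for that SHA.
sample = mapped_df(arr_imageid[0])
sample.head()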
# number of max processes
n_max_processes = mp.cpu_count()
print(n_max_processes)
8
with mp.Pool(processes=n_max_processes) as pool:
    df = list(tqdm(pool.imap(mapped_df, arr_imageid), total=len(arr_imageid)))
    dataframe_image_id = pd.concat(df)
100%|██████████| 5315/5315 [10:54<00:00,  8.13it/s]
dataframe_image_id.shape
(958, 73)

We mapped 958 (~18%) of the 5315 unique image_id SHAs from the image layers dataset.
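The coverage figure above follows directly from the two sizes (a small sketch of the arithmetic):

# Each row of dataframe_image_id corresponds to one successfully mapped image_id
coverage = len(dataframe_image_id) / len(arr_imageid)
print(f"Mapped {len(dataframe_image_id)} of {len(arr_imageid)} image_ids ({coverage:.1%})")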

"""
# Uploading the mapping dataset in the bucket
parquet_buffer = io.BytesIO()
dataframe_image_id.to_parquet(parquet_buffer)
s3_obj = s3.Object(
    s3_bucket, "prototype/workload/image_layers/dataframe_image_id.parquet"
)
status = s3_obj.put(Body=parquet_buffer.getvalue())
"""

The image_id values mapped to their product names (and other attributes) are saved in the bucket as a dataframe (dataframe_image_id.parquet).
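In downstream notebooks, the saved mapping can be read back from the bucket in the same way the raw datasets were fetched above (a sketch, assuming the same bucket and key):

# Sketch: read the saved mapping back from s3 for later analysis
obj = s3.Object(s3_bucket, "prototype/workload/image_layers/dataframe_image_id.parquet")
buffer = io.BytesIO()
obj.download_fileobj(buffer)
dataframe_image_id = pd.read_parquet(buffer)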


Mapping the SHAs in the layer_image_id column of the Image Layers Dataset

In addition to the image_id column, the layer_image_id column also contains image SHAs. These SHAs correspond to the layers that make up the image in image_id. In this section, we form a list of unique layer_image_id values from the image layers dataset. Using that list, we query the Pyxis API (filtering on top_layer_id) for each SHA and build a dataframe with layer_image_id and the corresponding product name, summary, vendor, and other attributes.

arr_layer_imageid = image_layers_df.layer_image_id.unique()
# Size of the list
len(arr_layer_imageid)
17817
with mp.Pool(processes=n_max_processes) as pool:
    df_image_layerid = list(
        tqdm(
            pool.imap(mapped_layer_df, arr_layer_imageid),
            total=len(arr_layer_imageid),
        )
    )
    df_image_layerid = pd.concat(df_image_layerid)
100%|██████████| 17817/17817 [35:21<00:00,  8.40it/s]
df_image_layerid.shape
(1292, 105)

We were able to create a dataframe that maps 1292 (~7%) of the 17817 unique layer_image_id SHAs in the image layers dataset.

"""
# Uploading the mapping dataset in the bucket
parquet_buffer = io.BytesIO()
df_image_layerid.to_parquet(parquet_buffer)
s3_obj = s3.Object(
    s3_bucket, "prototype/workload/image_layers/df_image_layer_id.parquet"
)
status = s3_obj.put(Body=parquet_buffer.getvalue())
"""

Mapping the SHAs in the image_id column of the Containers Dataset

In this section, we will map the SHAs in the containers dataset to their product name, summary, vendor, and other attributes. We first form a list of unique image_id values from the containers dataset and then query the Pyxis API for each SHA to build a dataframe with image_id and the corresponding attributes.

# Listing out the SHA's of image_id
arr_cont_imageid = containers_df.image_id.unique()
# Size of the list
len(arr_cont_imageid)
33488
with mp.Pool(processes=n_max_processes) as pool:
    df_cont_image_id = list(
        tqdm(pool.imap(mapped_df, arr_cont_imageid), total=len(arr_cont_imageid))
    )
    df_cont_image_id = pd.concat(df_cont_image_id)
100%|██████████| 33488/33488 [1:07:39<00:00,  8.25it/s]
df_cont_image_id.shape
(8928, 262)

Here, we successfully mapped 8928 (~26%) of the 33488 unique image_id SHAs in the containers dataset.

"""
# Uploading the mapping dataset in the bucket
parquet_buffer = io.BytesIO()
df_cont_image_id.to_parquet(parquet_buffer)
s3_obj = s3.Object(
    s3_bucket, "prototype/workload/containers/df_cont_image_id.parquet"
)
status = s3_obj.put(Body=parquet_buffer.getvalue())
"""

The corresponding mapped dataframe is saved in the bucket (df_cont_image_id.parquet).


Conclusion

This notebook takes some time to run. In it, we were able to map product names (and other attributes) to the corresponding image_id SHAs from the image layers dataset and the containers dataset. The mapped dataframes are then saved in the bucket.

As next steps, we will extract the telemetry information (CPU usage, memory usage) corresponding to the different cluster_id values in the workload dataset.