ngrok config add-authtoken YOUR_NGROK_AUTH_TOKEN

# Install dependencies

!pip install psycopg2-binary
!pip install tensorflow
!pip install torch torchvision

Requirement already satisfied: psycopg2-binary in /usr/local/lib/python3.11/dist-packages (2.9.10)

!pip install tensorflow
!pip install torch torchvision

Requirement already satisfied: tensorflow in /usr/local/lib/python3.11/dist-packages (2.18.0)
Requirement already satisfied: absl-py>=1.0.0 in /usr/local/lib/python3.11/dist-packages (from tensorflow) (1.4.0)
Requirement already satisfied: astunparse>=1.6.0 in /usr/local/lib/python3.11/dist-packages (from tensorflow) (1.6.3)
Requirement already satisfied: flatbuffers>=24.3.25 in /usr/local/lib/python3.11/dist-packages (from tensorflow) (25.2.10)
Requirement already satisfied: gast!=0.5.0,!=0.5.1,!=0.5.2,>=0.2.1 in /usr/local/lib/python3.11/dist-packages (from tensorflow) (0.6.0)
Requirement already satisfied: google-pasta>=0.1.1 in /usr/local/lib/python3.11/dist-packages (from tensorflow) (0.2.0)
Requirement already satisfied: libclang>=13.0.0 in /usr/local/lib/python3.11/dist-packages (from tensorflow) (18.1.1)
Requirement already satisfied: opt-einsum>=2.3.2 in /usr/local/lib/python3.11/dist-packages (from tensorflow) (3.4.0)
Requirement already satisfied: packaging in /usr/local/lib/python3.11/dist-packages (from tensorflow) (24.2)
Requirement already satisfied: protobuf!=4.21.0,!=4.21.1,!=4.21.2,!=4.21.3,!=4.21.4,!=4.21.5,<6.0.0dev,>=3.20.3 in /usr/local/lib/python3.11/dist-packages (from tensorflow) (5.29.5)
Requirement already satisfied: requests<3,>=2.21.0 in /usr/local/lib/python3.11/dist-packages (from tensorflow) (2.32.3)
Requirement already satisfied: setuptools in /usr/local/lib/python3.11/dist-packages (from tensorflow) (75.2.0)
Requirement already satisfied: six>=1.12.0 in /usr/local/lib/python3.11/dist-packages (from tensorflow) (1.17.0)
Requirement already satisfied: termcolor>=1.1.0 in /usr/local/lib/python3.11/dist-packages (from tensorflow) (3.1.0)
Requirement already satisfied: typing-extensions>=3.6.6 in /usr/local/lib/python3.11/dist-packages (from tensorflow) (4.14.0)
Requirement already satisfied: wrapt>=1.11.0 in /usr/local/lib/python3.11/dist-packages (from tensorflow) (1.17.2)
Requirement already satisfied: grpcio<2.0,>=1.24.3 in /usr/local/lib/python3.11/dist-packages (from tensorflow) (1.73.0)
Requirement already satisfied: tensorboard<2.19,>=2.18 in /usr/local/lib/python3.11/dist-packages (from tensorflow) (2.18.0)
Requirement already satisfied: keras>=3.5.0 in /usr/local/lib/python3.11/dist-packages (from tensorflow) (3.8.0)
Requirement already satisfied: numpy<2.1.0,>=1.26.0 in /usr/local/lib/python3.11/dist-packages (from tensorflow) (2.0.2)
Requirement already satisfied: h5py>=3.11.0 in /usr/local/lib/python3.11/dist-packages (from tensorflow) (3.14.0)
Requirement already satisfied: ml-dtypes<0.5.0,>=0.4.0 in /usr/local/lib/python3.11/dist-packages (from tensorflow) (0.4.1)
Requirement already satisfied: tensorflow-io-gcs-filesystem>=0.23.1 in /usr/local/lib/python3.11/dist-packages (from tensorflow) (0.37.1)
Requirement already satisfied: wheel<1.0,>=0.23.0 in /usr/local/lib/python3.11/dist-packages (from astunparse>=1.6.0->tensorflow) (0.45.1)
Requirement already satisfied: rich in /usr/local/lib/python3.11/dist-packages (from keras>=3.5.0->tensorflow) (13.9.4)
Requirement already satisfied: namex in /usr/local/lib/python3.11/dist-packages (from keras>=3.5.0->tensorflow) (0.1.0)
Requirement already satisfied: optree in /usr/local/lib/python3.11/dist-packages (from keras>=3.5.0->tensorflow) (0.16.0)
Requirement already satisfied: charset-normalizer<4,>=2 in /usr/local/lib/python3.11/dist-packages (from requests<3,>=2.21.0->tensorflow) (3.4.2)
Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.11/dist-packages (from requests<3,>=2.21.0->tensorflow) (3.10)
Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.11/dist-packages (from requests<3,>=2.21.0->tensorflow) (2.4.0)
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.11/dist-packages (from requests<3,>=2.21.0->tensorflow) (2025.6.15)
Requirement already satisfied: markdown>=2.6.8 in /usr/local/lib/python3.11/dist-packages (from tensorboard<2.19,>=2.18->tensorflow) (3.8)
Requirement already satisfied: tensorboard-data-server<0.8.0,>=0.7.0 in /usr/local/lib/python3.11/dist-packages (from tensorboard<2.19,>=2.18->tensorflow) (0.7.2)
Requirement already satisfied: werkzeug>=1.0.1 in /usr/local/lib/python3.11/dist-packages (from tensorboard<2.19,>=2.18->tensorflow) (3.1.3)
Requirement already satisfied: MarkupSafe>=2.1.1 in /usr/local/lib/python3.11/dist-packages (from werkzeug>=1.0.1->tensorboard<2.19,>=2.18->tensorflow) (3.0.2)
Requirement already satisfied: markdown-it-py>=2.2.0 in /usr/local/lib/python3.11/dist-packages (from rich->keras>=3.5.0->tensorflow) (3.0.0)
Requirement already satisfied: pygments<3.0.0,>=2.13.0 in /usr/local/lib/python3.11/dist-packages (from rich->keras>=3.5.0->tensorflow) (2.19.1)
Requirement already satisfied: mdurl~=0.1 in /usr/local/lib/python3.11/dist-packages (from markdown-it-py>=2.2.0->rich->keras>=3.5.0->tensorflow) (0.1.2)
Requirement already satisfied: torch in /usr/local/lib/python3.11/dist-packages (2.6.0+cu124)
Requirement already satisfied: torchvision in /usr/local/lib/python3.11/dist-packages (0.21.0+cu124)
Requirement already satisfied: filelock in /usr/local/lib/python3.11/dist-packages (from torch) (3.18.0)
Requirement already satisfied: typing-extensions>=4.10.0 in /usr/local/lib/python3.11/dist-packages (from torch) (4.14.0)
Requirement already satisfied: networkx in /usr/local/lib/python3.11/dist-packages (from torch) (3.5)
Requirement already satisfied: jinja2 in /usr/local/lib/python3.11/dist-packages (from torch) (3.1.6)
Requirement already satisfied: fsspec in /usr/local/lib/python3.11/dist-packages (from torch) (2025.3.2)
Requirement already satisfied: nvidia-cuda-nvrtc-cu12==12.4.127 in /usr/local/lib/python3.11/dist-packages (from torch) (12.4.127)
Requirement already satisfied: nvidia-cuda-runtime-cu12==12.4.127 in /usr/local/lib/python3.11/dist-packages (from torch) (12.4.127)
Requirement already satisfied: nvidia-cuda-cupti-cu12==12.4.127 in /usr/local/lib/python3.11/dist-packages (from torch) (12.4.127)
Requirement already satisfied: nvidia-cudnn-cu12==9.1.0.70 in /usr/local/lib/python3.11/dist-packages (from torch) (9.1.0.70)
Requirement already satisfied: nvidia-cublas-cu12==12.4.5.8 in /usr/local/lib/python3.11/dist-packages (from torch) (12.4.5.8)
Requirement already satisfied: nvidia-cufft-cu12==11.2.1.3 in /usr/local/lib/python3.11/dist-packages (from torch) (11.2.1.3)
Requirement already satisfied: nvidia-curand-cu12==10.3.5.147 in /usr/local/lib/python3.11/dist-packages (from torch) (10.3.5.147)
Requirement already satisfied: nvidia-cusolver-cu12==11.6.1.9 in /usr/local/lib/python3.11/dist-packages (from torch) (11.6.1.9)
Requirement already satisfied: nvidia-cusparse-cu12==12.3.1.170 in /usr/local/lib/python3.11/dist-packages (from torch) (12.3.1.170)
Requirement already satisfied: nvidia-cusparselt-cu12==0.6.2 in /usr/local/lib/python3.11/dist-packages (from torch) (0.6.2)
Requirement already satisfied: nvidia-nccl-cu12==2.21.5 in /usr/local/lib/python3.11/dist-packages (from torch) (2.21.5)
Requirement already satisfied: nvidia-nvtx-cu12==12.4.127 in /usr/local/lib/python3.11/dist-packages (from torch) (12.4.127)
Requirement already satisfied: nvidia-nvjitlink-cu12==12.4.127 in /usr/local/lib/python3.11/dist-packages (from torch) (12.4.127)
Requirement already satisfied: triton==3.2.0 in /usr/local/lib/python3.11/dist-packages (from torch) (3.2.0)
Requirement already satisfied: sympy==1.13.1 in /usr/local/lib/python3.11/dist-packages (from torch) (1.13.1)
Requirement already satisfied: mpmath<1.4,>=1.1.0 in /usr/local/lib/python3.11/dist-packages (from sympy==1.13.1->torch) (1.3.0)
Requirement already satisfied: numpy in /usr/local/lib/python3.11/dist-packages (from torchvision) (2.0.2)
Requirement already satisfied: pillow!=8.3.*,>=5.3.0 in /usr/local/lib/python3.11/dist-packages (from torchvision) (11.2.1)
Requirement already satisfied: MarkupSafe>=2.0 in /usr/local/lib/python3.11/dist-packages (from jinja2->torch) (3.0.2)

# Import all the required libraries

import os
import getpass
import psycopg2
import traceback
from psycopg2.extras import execute_values
from google.colab import drive
import pandas as pd
import matplotlib.pyplot as plt
from IPython.display import display
import cv2
from torch.utils.data import Dataset, DataLoader, WeightedRandomSampler
import albumentations as A
from albumentations.pytorch import ToTensorV2
from PIL import Image
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix
import torch
import torchvision
import torchvision.models as torchmodels
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import torchvision.models as models
from torchvision.models import resnet18, ResNet18_Weights
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score
from tensorflow.keras.applications import ResNet50
from tensorflow.keras import layers, models
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.utils import to_categorical
import tensorflow as tf
import matplotlib.pyplot as plt
from tensorflow.keras import layers, models, optimizers
import seaborn as sns
from datetime import datetime
import random
from torch.nn import CrossEntropyLoss
from io import BytesIO
from io import StringIO
from sqlalchemy import create_engine

# Defining Database Setup Functions

# ---------------------- DB Connection ----------------------
def connect_to_postgres():
    DB_HOST = "4.tcp.ngrok.io"
    DB_PORT = "11218"
    DB_NAME = "artiszen_db"
    DB_USER = input("Enter PostgreSQL username: ")
    DB_PASS = getpass.getpass("Enter PostgreSQL password: ")

    try:
        conn = psycopg2.connect(
            host=DB_HOST,
            port=DB_PORT,
            dbname=DB_NAME,
            user=DB_USER,
            password=DB_PASS
        )
        print(f"✅ Connected as '{DB_USER}'.")
        return conn
    except Exception as e:
        print("❌ Connection failed:", e)
        return None

# ---------------------- Create Schema & Tables ----------------------
def create_media_schema_and_tables(conn):
    cur = conn.cursor()

    cur.execute("""
    CREATE SCHEMA IF NOT EXISTS media;
    """)

    cur.execute("""
    CREATE TABLE IF NOT EXISTS media.product_category (
        category_id SERIAL PRIMARY KEY,
        category_name VARCHAR(100) UNIQUE NOT NULL,
        has_images BOOLEAN DEFAULT TRUE,
        has_videos BOOLEAN DEFAULT FALSE
    );
    """)

    cur.execute("""
    CREATE TABLE IF NOT EXISTS media.media_files (
        media_id SERIAL PRIMARY KEY,
        file_name TEXT NOT NULL,
        file_path TEXT NOT NULL,
        file_type VARCHAR(10) CHECK (file_type IN ('image', 'video')),
        category_id INT REFERENCES media.product_category(category_id),
        created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
    );
    """)

    cur.execute("""
    CREATE TABLE IF NOT EXISTS media.media_labels (
        label_id SERIAL PRIMARY KEY,
        media_id INT REFERENCES media.media_files(media_id),
        label TEXT NOT NULL,
        confidence FLOAT,
        created_by TEXT,
        created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
    );
    """)

    conn.commit()
    cur.close()
    print("✅ Schema and tables verified or created.")

# ---------------------- Unique Constraint Setup ----------------------
def add_unique_constraints(conn):
    cur = conn.cursor()
    try:
        # Unique on media_files
        cur.execute("""
            ALTER TABLE media.media_files
            ADD CONSTRAINT uq_media_file
            UNIQUE (file_name, file_path, file_type, category_id);
        """)
    except Exception as e:
        conn.rollback()
        if "already exists" not in str(e):
            print("⚠️ Error adding uq_media_file:", e)
    else:
        conn.commit()

    try:
        # Unique on media_labels: media_id + label
        cur.execute("""
            ALTER TABLE media.media_labels
            ADD CONSTRAINT uq_media_label
            UNIQUE (media_id, label);
        """)
    except Exception as e:
        conn.rollback()
        if "already exists" not in str(e):
            print("⚠️ Error adding uq_media_label:", e)
    else:
        conn.commit()

    cur.close()
    print("✅ Unique constraints for media_files and media_labels verified/created.")

# ---------------------- Truncate Tables (optional) ----------------------
def truncate_all_media_tables(conn):
    cur = conn.cursor()
    try:
        cur.execute("""
        TRUNCATE TABLE
            media.media_labels,
            media.media_files,
            media.product_category
        RESTART IDENTITY CASCADE;
        """)
        conn.commit()
        print("🧹 Tables truncated.")
    except Exception as e:
        print("❌ Truncate failed:", e)
    cur.close()

# Mount Google Drive

drive.mount('/content/drive')

Mounted at /content/drive

# Define path for folders

base_dir = "/content/drive/MyDrive/DL/artiszen_media"
image_dir = base_dir + "/artiszen_dataset_images"
video_dir = base_dir + "/artiszen_dataset_videos"

print("📁 Image folder exists:", os.path.exists(image_dir))
print("📁 Video folder exists:", os.path.exists(video_dir))

📁 Image folder exists: True
📁 Video folder exists: True

insert_categories(conn, image_dir, video_dir)

# Defining Data Insertion Functions

# ------------------ Insert or Update Categories ------------------
def insert_categories(conn, image_dir, video_dir):
    cur = conn.cursor()
    try:
        image_categories = set(os.listdir(image_dir))
        video_categories = set(os.listdir(video_dir))
        all_categories = sorted(image_categories | video_categories)

        values = []
        for cat in all_categories:
            has_images = cat in image_categories
            has_videos = cat in video_categories
            values.append((cat.lower(), has_images, has_videos))

        print("🧪 Categories to insert:")
        for v in values:
            print("   ", v)

        insert_cat_query = """
        INSERT INTO media.product_category (category_name, has_images, has_videos)
        VALUES %s
        ON CONFLICT (category_name) DO UPDATE
        SET has_images = EXCLUDED.has_images,
            has_videos = EXCLUDED.has_videos;
        """
        execute_values(cur, insert_cat_query, values)
        print(f"✅ Inserted/updated {len(values)} categories.")
        conn.commit()
    except Exception as e:
        conn.rollback()
        print("❌ Error inserting categories:", e)
        raise
    finally:
        cur.close()

# ------------------ Helper Function: Collect Nested Files ------------------
def collect_nested_files(cur, root_folder, media_type, valid_exts, base_dir):
    collected = []
    for category in os.listdir(root_folder):
        cat_folder = os.path.join(root_folder, category)
        if not os.path.isdir(cat_folder):
            continue

        cur.execute("SELECT category_id FROM media.product_category WHERE category_name = %s", (category.lower(),))
        result = cur.fetchone()
        if not result:
            print(f"⚠️ Category '{category}' not found. Skipping.")
            continue
        category_id = result[0]

        count = 0
        for subdir, _, files in os.walk(cat_folder):
            for f in files:
                ext = os.path.splitext(f)[1].lower()
                if ext not in valid_exts:
                    continue
                abs_path = os.path.join(subdir, f)
                rel_path = os.path.relpath(abs_path, base_dir).replace("\\", "/")
                collected.append((f, rel_path, media_type, category_id))
                count += 1
        print(f"📂 {category.lower()} [{media_type}]: {count} file(s)")
    return collected

# ------------------ Insert or Update Media Files ------------------
def insert_media_files(conn, image_dir, video_dir, base_dir):
    cur = conn.cursor()
    try:
        image_exts = {".jpg", ".jpeg", ".png", ".gif", ".bmp", ".tiff", ".webp"}
        video_exts = {".mp4", ".mov", ".avi", ".mkv", ".wmv", ".flv", ".webm"}
        records = []
        records += collect_nested_files(cur, image_dir, "image", image_exts, base_dir)
        records += collect_nested_files(cur, video_dir, "video", video_exts, base_dir)

        if records:
            invalid_records = [r for r in records if None in r]
            if invalid_records:
                print(f"❌ Found {len(invalid_records)} invalid record(s) with NULL values. Aborting insert.")
                for r in invalid_records[:5]:
                    print("❗", r)
                raise ValueError("Found invalid records with NULLs.")

            print("🔍 First 5 valid records:")
            for r in records[:5]:
                print(r, type(r))

            insert_media_query = """
            INSERT INTO media.media_files (file_name, file_path, file_type, category_id)
            VALUES %s
            ON CONFLICT (file_name, file_path, file_type, category_id)
            DO UPDATE SET
                file_name = EXCLUDED.file_name,
                file_path = EXCLUDED.file_path,
                file_type = EXCLUDED.file_type,
                category_id = EXCLUDED.category_id;
            """
            execute_values(cur, insert_media_query, records)
            print(f"✅ Total media files inserted/updated: {len(records)}")
        else:
            print("⚠️ No media files to insert.")

        conn.commit()
    except Exception as e:
        traceback.print_exc()
        conn.rollback()
        print("❌ Error inserting media files:", e)
        raise
    finally:
        cur.close()

# ------------------ Insert or Update Temporary Labels ------------------
def insert_temp_labels(conn):
    cur = conn.cursor()
    try:
        cur.execute("""
            INSERT INTO media.media_labels (media_id, label, confidence, created_by)
            SELECT mf.media_id, pc.category_name, 1.0, 'auto-category'
            FROM media.media_files mf
            JOIN media.product_category pc ON mf.category_id = pc.category_id
            ON CONFLICT (media_id, label) DO UPDATE
            SET confidence = EXCLUDED.confidence,
                created_by = EXCLUDED.created_by;
        """)
        conn.commit()
        print("✅ Temporary labels inserted/updated.")
    except Exception as e:
        conn.rollback()
        print("❌ Error inserting labels:", e)
        raise
    finally:
        cur.close()

def run_full_ingestion_pipeline(conn, image_dir, video_dir, base_dir):
    if conn is None or conn.closed != 0:
        print("❌ Invalid or closed database connection.")
        return None

    # STEP 1: Insert Categories
    try:
        insert_categories(conn, image_dir, video_dir)
    except Exception as e:
        print("⛔ insert_categories() failed:", e)
        conn.rollback()
        print("ℹ️ Rolled back transaction after failure in categories.")
        return None

    # STEP 2: Insert Media Files
    try:
        insert_media_files(conn, image_dir, video_dir, base_dir)
    except Exception as e:
        print("⛔ insert_media_files() failed:", e)
        conn.rollback()
        print("ℹ️ Rolled back transaction after failure in media files.")
        return None

    # STEP 3: Insert Temporary Labels
    try:
        insert_temp_labels(conn)
    except Exception as e:
        print("⛔ insert_temp_labels() failed:", e)
        conn.rollback()
        print("ℹ️ Rolled back transaction after failure in temp labels.")
        return None

    print("✅ All steps in ingestion pipeline completed successfully.")
    return conn

def main():

    print("🚀 Starting media ingestion pipeline...")

    try:
        # Mount Google Drive
        print("🔄 Mounting Google Drive...")
        drive.mount('/content/drive')
    except Exception as e:
        print("❌ Failed to mount Google Drive:", e)
        return

    try:
        # Define Media Paths
        base_dir = "/content/drive/MyDrive/DL/artiszen_media"
        image_dir = base_dir + "/artiszen_dataset_images"
        video_dir = base_dir + "/artiszen_dataset_videos"

        if not os.path.exists(image_dir) or not os.path.exists(video_dir):
            raise FileNotFoundError("Image or video directory not found.")

        print("📁 Image folder exists:", os.path.exists(image_dir))
        print("📁 Video folder exists:", os.path.exists(video_dir))
    except Exception as e:
        print("❌ Failed to verify media paths:", e)
        return

    try:
        # Connect to PostgreSQL
        conn = connect_to_postgres()
        if conn is None or conn.closed != 0:
            raise ConnectionError("Failed to connect to PostgreSQL.")
        print("✅ PostgreSQL connection established.")
    except Exception as e:
        print("❌ PostgreSQL connection error:", e)
        return

    try:
        # Create schema, tables, and unique constraints
        print("🧱 Creating schema and tables...")
        create_media_schema_and_tables(conn)
        add_unique_constraints(conn)
        #truncate_all_media_tables(conn)  # Optional
    except Exception as e:
        conn.rollback()
        print("❌ Failed during schema setup:", e)
        conn.close()
        return

    try:
        # Run full ingestion pipeline
        print("📥 Running ingestion pipeline...")
        conn = run_full_ingestion_pipeline(
            conn=conn,
            image_dir=image_dir,
            video_dir=video_dir,
            base_dir=base_dir
        )
        if conn:
            print("🎉 Main pipeline finished successfully.")
            conn.close()
        else:
            print("⚠️ Ingestion completed with connection issues.")

    except Exception as e:
        conn.rollback()
        print("❌ Pipeline execution failed:", e)
        conn.close()

# Run the pipeline
main()

🚀 Starting media ingestion pipeline...
🔄 Mounting Google Drive...
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
📁 Image folder exists: True
📁 Video folder exists: True
✅ PostgreSQL connection established.
🧱 Creating schema and tables...
✅ Schema and tables verified or created.
✅ Unique constraints for media_files and media_labels verified/created.
🧹 Tables truncated.
📥 Running ingestion pipeline...
🧪 Categories to insert:
    ('agendas', True, True)
    ('bundles', True, False)
    ('business cards', True, False)
    ('dft_prints', False, True)
    ('dtf_prints', True, False)
    ('hats', True, False)
    ('laser', True, False)
    ('misc', True, False)
    ('mugs', True, True)
    ('resin', True, True)
    ('shirts', True, True)
    ('stickers', True, False)
    ('sublimation', True, False)
    ('tumblers', True, True)
    ('uvdtf', True, False)
✅ Inserted/updated 15 categories.
📂 mugs [image]: 5 file(s)
📂 agendas [image]: 2 file(s)
📂 stickers [image]: 5 file(s)
📂 dtf_prints [image]: 9 file(s)
📂 hats [image]: 5 file(s)
📂 laser [image]: 17 file(s)
📂 resin [image]: 8 file(s)
📂 tumblers [image]: 5 file(s)
📂 sublimation [image]: 3 file(s)
📂 bundles [image]: 6 file(s)
📂 uvdtf [image]: 10 file(s)
📂 business cards [image]: 2 file(s)
📂 misc [image]: 6 file(s)
📂 shirts [image]: 54 file(s)
📂 dft_prints [video]: 1 file(s)
📂 agendas [video]: 2 file(s)
📂 mugs [video]: 7 file(s)
📂 resin [video]: 1 file(s)
📂 shirts [video]: 1 file(s)
📂 tumblers [video]: 5 file(s)
🔍 First 5 valid records:
('MG2301IMGIflintstones12ozBLK.jpg', 'artiszen_dataset_images/mugs/MG2301IMGIflintstones12ozBLK.jpg', 'image', 9) <class 'tuple'>
('MG2201IMGpochacco12ozWHT.jpg', 'artiszen_dataset_images/mugs/MG2201IMGpochacco12ozWHT.jpg', 'image', 9) <class 'tuple'>
('MG2201IMGspottiedot12ozWHT.jpg', 'artiszen_dataset_images/mugs/MG2201IMGspottiedot12ozWHT.jpg', 'image', 9) <class 'tuple'>
('MG2201IMGhellokitty12ozWHT.jpg', 'artiszen_dataset_images/mugs/MG2201IMGhellokitty12ozWHT.jpg', 'image', 9) <class 'tuple'>
('MG2501IMGstitch12ozBLK.jpg', 'artiszen_dataset_images/mugs/MG2501IMGstitch12ozBLK.jpg', 'image', 9) <class 'tuple'>
✅ Total media files inserted/updated: 154
✅ Temporary labels inserted/updated.
✅ All steps in ingestion pipeline completed successfully.
🎉 Main pipeline finished successfully.

# 1) Connect and load data
conn = connect_to_postgres()
df_media_files = pd.read_sql("SELECT * FROM media.media_files;", conn)
df_categories  = pd.read_sql("SELECT * FROM media.product_category;", conn)
df_labels      = pd.read_sql("SELECT * FROM media.media_labels;", conn)
conn.close()

# 2) Display samples
print("=== Media Files Sample ===")
display(df_media_files.head(10))
print("=== Categories ===")
display(df_categories)
print("=== Media Labels Sample ===")
display(df_labels.head(10))

# 3) Count of media files per category
counts = (
    df_media_files
      .groupby('category_id')
      .media_id.count()
      .reset_index(name='file_count')
      .merge(df_categories[['category_id','category_name']], on='category_id')
      .sort_values('file_count', ascending=False)
)
print("=== Media Files Count per Category ===")
display(counts)

# 4) Bar chart: Media files per category
plt.figure(figsize=(8,4))
plt.bar(counts['category_name'], counts['file_count'])
plt.xticks(rotation=45, ha='right')
plt.title("Media Files per Category")
plt.ylabel("Count")
plt.tight_layout()
plt.show()

# 5) Distribution by file type
type_counts = df_media_files['file_type'].value_counts().reset_index()
type_counts.columns = ['file_type', 'count']
print("=== Media Files by Type ===")
display(type_counts)

plt.figure(figsize=(4,3))
plt.bar(type_counts['file_type'], type_counts['count'])
plt.title("Distribution of Media by Type")
plt.ylabel("Count")
plt.tight_layout()
plt.show()

/tmp/ipython-input-56-3257625878.py:3: UserWarning: pandas only supports SQLAlchemy connectable (engine/connection) or database string URI or sqlite3 DBAPI2 connection. Other DBAPI2 objects are not tested. Please consider using SQLAlchemy.
  df_media_files = pd.read_sql("SELECT * FROM media.media_files;", conn)
/tmp/ipython-input-56-3257625878.py:4: UserWarning: pandas only supports SQLAlchemy connectable (engine/connection) or database string URI or sqlite3 DBAPI2 connection. Other DBAPI2 objects are not tested. Please consider using SQLAlchemy.
  df_categories  = pd.read_sql("SELECT * FROM media.product_category;", conn)

=== Media Files Sample ===

/tmp/ipython-input-56-3257625878.py:5: UserWarning: pandas only supports SQLAlchemy connectable (engine/connection) or database string URI or sqlite3 DBAPI2 connection. Other DBAPI2 objects are not tested. Please consider using SQLAlchemy.
  df_labels      = pd.read_sql("SELECT * FROM media.media_labels;", conn)

=== Categories ===

=== Media Labels Sample ===

=== Media Files Count per Category ===

=== Media Files by Type ===

# Phase 2: Setup for Model Development


# Reconnect to PostgreSQL
def connect_to_postgres():
    import getpass
    conn = psycopg2.connect(
        host="4.tcp.ngrok.io",        # e.g., "2.tcp.ngrok.io"
        port=11218,                   # e.g., 18083
        dbname="artiszen_db",
        user= input("Enter DB user: "),
        password= getpass.getpass("Enter DB password: ")
    )
    return conn

# Set base directory (same as Phase 1)
base_dir = "/content/drive/MyDrive/DL/artiszen_media"

# Load media metadata
conn = connect_to_postgres()
df_media_files = pd.read_sql("SELECT * FROM media.media_files WHERE file_type = 'image';", conn)
conn.close()

print("✅ df_media_files loaded:", df_media_files.shape)
print("Sample file path:", df_media_files.iloc[0]['file_path'])


# Ensure full reproducibility by fixing all random seeds
def seed_everything(seed: int = 42):
    os.environ['PYTHONHASHSEED'] = str(seed)
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

# Call once at the very start
seed_everything(42)

/tmp/ipython-input-5-1588994231.py:21: UserWarning: pandas only supports SQLAlchemy connectable (engine/connection) or database string URI or sqlite3 DBAPI2 connection. Other DBAPI2 objects are not tested. Please consider using SQLAlchemy.
  df_media_files = pd.read_sql("SELECT * FROM media.media_files WHERE file_type = 'image';", conn)

✅ df_media_files loaded: (137, 6)
Sample file path: artiszen_dataset_images/mugs/MG2301IMGIflintstones12ozBLK.jpg

# Confirm File Existence before Loading

# Verify the image file path is valid and the file physically exists in the local file system.
missing_files = []

for _, row in df_media_files.iterrows():
    full_path = os.path.join(base_dir, row['file_path'])
    if not os.path.exists(full_path):
        missing_files.append(full_path)

print(f"❌ Missing files: {len(missing_files)}")
for path in missing_files[:5]:
    print(path)

if missing_files:
    raise FileNotFoundError(f"{len(missing_files)} image(s) missing; aborting pipeline.")

❌ Missing files: 0

corrupt_files = []

for i, row in df_media_files.iterrows():
    if row['file_type'] != 'image':
        continue  # Skip non-image files

    img_path = os.path.join(base_dir, row['file_path'])

    try:
        img = cv2.imread(img_path)
        if img is None:
            raise ValueError("OpenCV failed")
        _ = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
    except:
        try:
            img = Image.open(img_path).convert("RGB")
        except:
            corrupt_files.append(img_path)

print(f"🧨 Total problematic images: {len(corrupt_files)}")
for path in corrupt_files[:5]:
    print("❌", path)

🧨 Total problematic images: 0

from PIL import Image

sizes = []
for _, row in df_media_files.iterrows():
    if row['file_type'] != 'image':
        continue
    path = os.path.join(base_dir, row['file_path'])
    try:
        with Image.open(path) as img:
            sizes.append(img.size)
    except:
        continue

# Analyze
import collections
size_counts = collections.Counter(sizes)
print("🖼️ Top 5 image resolutions:")
print(size_counts.most_common(5))

🖼️ Top 5 image resolutions:
[((4000, 3000), 39), ((4000, 2252), 25), ((3000, 4000), 4), ((2252, 4000), 2), ((2249, 2997), 2)]

train_transform = A.Compose([
    A.Resize(224, 224),
    A.HorizontalFlip(p=0.5),
    A.RandomBrightnessContrast(p=0.5),
    A.Rotate(limit=30, p=0.5),
    A.Affine(scale=(0.9, 1.1), translate_percent=0.05, rotate=(-15, 15), p=0.5),
    A.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)),
    ToTensorV2()
])

val_transform = A.Compose([
    A.Resize(224, 224),
    A.Normalize(mean=(0.485,0.456,0.406), std=(0.229,0.224,0.225)),
    ToTensorV2()
])

# Step 1: Filter only image files
df_images = df_media_files[df_media_files['file_type'] == 'image'].copy()

# Step 2: Stratified split (train 70%, temp 30%)
train_df, temp_df = train_test_split(
    df_images,
    test_size=0.30,
    stratify=df_images['category_id'],
    random_state=42
)

# Step 3: Filter out classes with only one image in temp_df
valid_classes = temp_df['category_id'].value_counts()
valid_classes = valid_classes[valid_classes >= 2].index
temp_df_filtered = temp_df[temp_df['category_id'].isin(valid_classes)]

# Step 4: Stratified split of temp_df into validation (15%) and test (15%)
val_df, test_df = train_test_split(
    temp_df_filtered,
    test_size=0.50,  # 50% of 30% = 15%
    stratify=temp_df_filtered['category_id'],
    random_state=42
)

# Step 5: Print result summary
print("✅ Train size:", len(train_df))
print("✅ Validation size:", len(val_df))
print("✅ Test size:", len(test_df))

✅ Train size: 95
✅ Validation size: 18
✅ Test size: 18

class ArtiszenImageDataset(Dataset):
    def __init__(self, df, base_dir, transform=None, fallback_size=(224, 224)):
        self.df = df.reset_index(drop=True)
        self.base_dir = base_dir
        self.transform = transform
        self.fallback_size = fallback_size

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        row = self.df.iloc[idx]
        img_path = os.path.join(self.base_dir, row['file_path'])

        # ❌ Skip if file is not an image
        if not img_path.lower().endswith((".jpg", ".jpeg", ".png", ".bmp", ".gif", ".webp")):
            raise FileNotFoundError(f"⚠️ Skipping non-image file: {img_path}")

        image = cv2.imread(img_path)
        if image is None:
            print(f"⚠️ Fallback: Could not read image at {img_path}")
            image = np.zeros((*self.fallback_size, 3), dtype=np.uint8)  # black image
        else:
            image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)

        if self.transform:
            image = self.transform(image=image)['image']

        label = row['category_id']
        return image, label

# Data Loaders

# Step 6: Fit label encoder on training labels only
encoder = LabelEncoder()
encoder.fit(train_df['category_id'])

# Step 7: Transform all splits using the same encoder
train_df['category_id'] = encoder.transform(train_df['category_id'])
val_df['category_id']   = encoder.transform(val_df['category_id'])
test_df['category_id']  = encoder.transform(test_df['category_id'])

# Step 8: Confirm number of classes
num_classes = len(encoder.classes_)

# Step 9: Create datasets from the splits
#    – train uses random augmentations
#    – val & test use only resize + normalize
train_dataset = ArtiszenImageDataset(train_df, base_dir, transform=train_transform)
val_dataset   = ArtiszenImageDataset(val_df,   base_dir, transform=val_transform)
test_dataset  = ArtiszenImageDataset(test_df,  base_dir, transform=val_transform)

# Step 10: Create data loaders
#    – train loader shuffles with a fixed seed generator
#    – val/test loaders don’t shuffle so order is stable
g = torch.Generator().manual_seed(42)

train_loader = DataLoader(
    train_dataset,
    batch_size=32,
    shuffle=True,
    generator=g
)
val_loader = DataLoader(
    val_dataset,
    batch_size=32,
    shuffle=False
)
test_loader = DataLoader(
    test_dataset,
    batch_size=32,
    shuffle=False
)

# Test train_loader batch output
for imgs, labels in train_loader:
    print("✅ Train batch shape:", imgs.shape)
    print("🟧 Train labels:", labels)
    break

# Test val_loader batch output
for imgs, labels in val_loader:
    print("✅ Validation batch shape:", imgs.shape)
    print("🟧 Validation labels:", labels)
    break

# Test test_loader batch output
for imgs, labels in test_loader:
    print("✅ Test batch shape:", imgs.shape)
    print("🟧 Test labels:", labels)
    break

✅ Train batch shape: torch.Size([32, 3, 224, 224])
🟧 Train labels: tensor([ 6,  9, 12,  3,  9,  9,  7,  9,  9,  9, 12,  1, 12,  9,  9,  6,  9, 11,
         2,  5,  9,  5,  5,  9,  3,  5,  3, 10,  9,  0,  9,  1])
✅ Validation batch shape: torch.Size([18, 3, 224, 224])
🟧 Validation labels: tensor([13,  3,  6,  1,  9,  9,  5,  9,  9,  5,  8,  9,  5, 13,  9,  9, 10,  9])
✅ Test batch shape: torch.Size([18, 3, 224, 224])
🟧 Test labels: tensor([10,  6,  9,  9,  1,  3,  9,  9,  8, 13,  9,  5,  9,  9,  9,  3,  9,  5])

# Load pretrained model
model = resnet18(pretrained=True)

# Freeze all layers (optional)
for param in model.parameters():
    param.requires_grad = False

# Replace the classifier head
num_classes = len(encoder.classes_)
model.fc = nn.Linear(model.fc.in_features, num_classes)

/usr/local/lib/python3.11/dist-packages/torchvision/models/_utils.py:208: UserWarning: The parameter 'pretrained' is deprecated since 0.13 and may be removed in the future, please use 'weights' instead.
  warnings.warn(
/usr/local/lib/python3.11/dist-packages/torchvision/models/_utils.py:223: UserWarning: Arguments other than a weight enum or `None` for 'weights' are deprecated since 0.13 and may be removed in the future. The current behavior is equivalent to passing `weights=ResNet18_Weights.IMAGENET1K_V1`. You can also use `weights=ResNet18_Weights.DEFAULT` to get the most up-to-date weights.
  warnings.warn(msg)

model.fc = nn.Linear(model.fc.in_features, num_classes)

print("Unique category IDs:", train_df['category_id'].unique())
print("Max category ID:", train_df['category_id'].max())
print("Num classes:", len(train_df['category_id'].unique()))

Unique category IDs: [12 13 10  9  4  5  8  6  3  7  1  2 11  0]
Max category ID: 13
Num classes: 14

num_classes = len(df_media_files['category_id'].unique())

# Move model to device (GPU if available)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# Define loss function and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.fc.parameters(), lr=0.001)  # Only training the classifier head

# Training loop
num_epochs = 50
for epoch in range(num_epochs):
    model.train()
    running_loss = 0.0
    all_preds, all_labels = [], []

    for images, labels in train_loader:
        images = images.to(device)
        labels = labels.to(device)

        optimizer.zero_grad()
        outputs = model(images)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

        running_loss += loss.item()
        all_preds.extend(torch.argmax(outputs, dim=1).cpu().numpy())
        all_labels.extend(labels.cpu().numpy())

    acc = accuracy_score(all_labels, all_preds)
    print(f"Epoch [{epoch+1}/{num_epochs}] - Loss: {running_loss:.4f} - Train Accuracy: {acc:.4f}")

Epoch [1/50] - Loss: 8.0196 - Train Accuracy: 0.0947
Epoch [2/50] - Loss: 6.6480 - Train Accuracy: 0.3895
Epoch [3/50] - Loss: 6.2731 - Train Accuracy: 0.3895
Epoch [4/50] - Loss: 6.1012 - Train Accuracy: 0.3895
Epoch [5/50] - Loss: 5.7023 - Train Accuracy: 0.3789
Epoch [6/50] - Loss: 5.3910 - Train Accuracy: 0.4421
Epoch [7/50] - Loss: 5.1585 - Train Accuracy: 0.4947
Epoch [8/50] - Loss: 4.6779 - Train Accuracy: 0.5158
Epoch [9/50] - Loss: 4.4031 - Train Accuracy: 0.4737
Epoch [10/50] - Loss: 4.1637 - Train Accuracy: 0.4947
Epoch [11/50] - Loss: 4.1705 - Train Accuracy: 0.5053
Epoch [12/50] - Loss: 3.8192 - Train Accuracy: 0.5789
Epoch [13/50] - Loss: 3.6142 - Train Accuracy: 0.6000
Epoch [14/50] - Loss: 3.3693 - Train Accuracy: 0.7053
Epoch [15/50] - Loss: 3.3207 - Train Accuracy: 0.7158
Epoch [16/50] - Loss: 2.9542 - Train Accuracy: 0.7368
Epoch [17/50] - Loss: 2.8934 - Train Accuracy: 0.7263
Epoch [18/50] - Loss: 2.8260 - Train Accuracy: 0.8211
Epoch [19/50] - Loss: 2.5318 - Train Accuracy: 0.7789
Epoch [20/50] - Loss: 2.5559 - Train Accuracy: 0.8211
Epoch [21/50] - Loss: 2.2529 - Train Accuracy: 0.9158
Epoch [22/50] - Loss: 2.2186 - Train Accuracy: 0.9263
Epoch [23/50] - Loss: 2.0144 - Train Accuracy: 0.8842
Epoch [24/50] - Loss: 2.1004 - Train Accuracy: 0.8316
Epoch [25/50] - Loss: 2.0657 - Train Accuracy: 0.8526
Epoch [26/50] - Loss: 1.8693 - Train Accuracy: 0.8842
Epoch [27/50] - Loss: 1.7492 - Train Accuracy: 0.9158
Epoch [28/50] - Loss: 1.7816 - Train Accuracy: 0.9263
Epoch [29/50] - Loss: 1.7237 - Train Accuracy: 0.8947
Epoch [30/50] - Loss: 1.6201 - Train Accuracy: 0.9368
Epoch [31/50] - Loss: 1.6029 - Train Accuracy: 0.9053
Epoch [32/50] - Loss: 1.4315 - Train Accuracy: 0.9579
Epoch [33/50] - Loss: 1.4318 - Train Accuracy: 0.9474
Epoch [34/50] - Loss: 1.4348 - Train Accuracy: 0.9474
Epoch [35/50] - Loss: 1.1751 - Train Accuracy: 0.9474
Epoch [36/50] - Loss: 1.2470 - Train Accuracy: 0.9684
Epoch [37/50] - Loss: 1.1359 - Train Accuracy: 0.9789
Epoch [38/50] - Loss: 1.2200 - Train Accuracy: 0.9895
Epoch [39/50] - Loss: 1.0606 - Train Accuracy: 0.9895
Epoch [40/50] - Loss: 1.0176 - Train Accuracy: 0.9789
Epoch [41/50] - Loss: 1.2289 - Train Accuracy: 0.9579
Epoch [42/50] - Loss: 0.9895 - Train Accuracy: 0.9684
Epoch [43/50] - Loss: 1.0499 - Train Accuracy: 0.9684
Epoch [44/50] - Loss: 0.9508 - Train Accuracy: 1.0000
Epoch [45/50] - Loss: 0.8993 - Train Accuracy: 0.9895
Epoch [46/50] - Loss: 0.8489 - Train Accuracy: 0.9895
Epoch [47/50] - Loss: 0.9469 - Train Accuracy: 0.9789
Epoch [48/50] - Loss: 0.8619 - Train Accuracy: 0.9789
Epoch [49/50] - Loss: 0.7910 - Train Accuracy: 0.9684
Epoch [50/50] - Loss: 0.9531 - Train Accuracy: 0.9789

# Evaluate on validation set
model.eval()
all_preds, all_labels = [], []
with torch.no_grad():
    for images, labels in val_loader:
        images, labels = images.to(device), labels.to(device)
        outputs = model(images)
        all_preds.extend(torch.argmax(outputs, dim=1).cpu().numpy())
        all_labels.extend(labels.cpu().numpy())

val_acc = accuracy_score(all_labels, all_preds)
print(f"✅ PyTorch Validation Accuracy: {val_acc:.4f}")

✅ PyTorch Validation Accuracy: 0.7778

# Evaluate on test set
model.eval()
all_preds, all_labels = [], []

with torch.no_grad():
    for images, labels in test_loader:
        images, labels = images.to(device), labels.to(device)
        outputs = model(images)
        preds = torch.argmax(outputs, dim=1)
        all_preds.extend(preds.cpu().numpy())
        all_labels.extend(labels.cpu().numpy())

test_acc = accuracy_score(all_labels, all_preds)
print(f"✅ PyTorch Test Accuracy:       {test_acc:.4f}")

✅ PyTorch Test Accuracy:       0.6111

#PyTorch

# Build class name list from label encoder
class_names = encoder.classes_.astype(str).tolist()
all_class_indices = list(range(len(class_names)))  # Ensure all expected class indices are included

# Print classification report
def print_classification_report(y_true, y_pred, class_names, labels, dataset_name="Validation"):
    print(f"\n📋 Classification Report for {dataset_name} Set:\n")
    report = classification_report(y_true, y_pred, labels=labels, target_names=class_names)
    print(report)

# Confusion matrix plot
def plot_confusion_matrix(y_true, y_pred, class_names, labels, dataset_name="Validation"):
    cm = confusion_matrix(y_true, y_pred, labels=labels)
    plt.figure(figsize=(10, 8))
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
                xticklabels=class_names,
                yticklabels=class_names)
    plt.xlabel("Predicted")
    plt.ylabel("Actual")
    plt.title(f"Confusion Matrix - {dataset_name} Set")
    plt.show()

# Model evaluation
def evaluate_model(model, data_loader, device):
    model.eval()
    model.to(device)

    all_preds = []
    all_labels = []

    with torch.no_grad():
        for images, labels in data_loader:
            images = images.to(device)
            labels = labels.to(device)
            outputs = model(images)
            preds = torch.argmax(outputs, dim=1)
            all_preds.extend(preds.cpu().numpy())
            all_labels.extend(labels.cpu().numpy())

    return all_labels, all_preds

# Evaluation on Validation Set
val_true, val_pred = evaluate_model(model, val_loader, device='cuda' if torch.cuda.is_available() else 'cpu')
print_classification_report(val_true, val_pred, class_names, all_class_indices, dataset_name="Validation")
plot_confusion_matrix(val_true, val_pred, class_names, all_class_indices, dataset_name="Validation")

# Evaluation on Test Set
test_true, test_pred = evaluate_model(model, test_loader, device='cuda' if torch.cuda.is_available() else 'cpu')
print_classification_report(test_true, test_pred, class_names, all_class_indices, dataset_name="Test")
plot_confusion_matrix(test_true, test_pred, class_names, all_class_indices, dataset_name="Test")

📋 Classification Report for Validation Set:

              precision    recall  f1-score   support

           0       0.00      0.00      0.00         0
           1       1.00      1.00      1.00         1
           2       0.00      0.00      0.00         0
           3       0.00      0.00      0.00         1
           4       0.00      0.00      0.00         0
           5       0.60      1.00      0.75         3
           6       0.00      0.00      0.00         1
           7       0.00      0.00      0.00         0
           8       1.00      1.00      1.00         1
           9       0.80      1.00      0.89         8
          10       0.00      0.00      0.00         1
          11       0.00      0.00      0.00         0
          12       0.00      0.00      0.00         0
          13       1.00      0.50      0.67         2

    accuracy                           0.78        18
   macro avg       0.31      0.32      0.31        18
weighted avg       0.68      0.78      0.71        18

/usr/local/lib/python3.11/dist-packages/sklearn/metrics/_classification.py:1565: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
/usr/local/lib/python3.11/dist-packages/sklearn/metrics/_classification.py:1565: UndefinedMetricWarning: Recall is ill-defined and being set to 0.0 in labels with no true samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
/usr/local/lib/python3.11/dist-packages/sklearn/metrics/_classification.py:1565: UndefinedMetricWarning: F-score is ill-defined and being set to 0.0 in labels with no true nor predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
/usr/local/lib/python3.11/dist-packages/sklearn/metrics/_classification.py:1565: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
/usr/local/lib/python3.11/dist-packages/sklearn/metrics/_classification.py:1565: UndefinedMetricWarning: Recall is ill-defined and being set to 0.0 in labels with no true samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
/usr/local/lib/python3.11/dist-packages/sklearn/metrics/_classification.py:1565: UndefinedMetricWarning: F-score is ill-defined and being set to 0.0 in labels with no true nor predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
/usr/local/lib/python3.11/dist-packages/sklearn/metrics/_classification.py:1565: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
/usr/local/lib/python3.11/dist-packages/sklearn/metrics/_classification.py:1565: UndefinedMetricWarning: Recall is ill-defined and being set to 0.0 in labels with no true samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
/usr/local/lib/python3.11/dist-packages/sklearn/metrics/_classification.py:1565: UndefinedMetricWarning: F-score is ill-defined and being set to 0.0 in labels with no true nor predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))

📋 Classification Report for Test Set:

              precision    recall  f1-score   support

           0       0.00      0.00      0.00         0
           1       0.00      0.00      0.00         1
           2       0.00      0.00      0.00         0
           3       0.00      0.00      0.00         2
           4       0.00      0.00      0.00         0
           5       0.33      0.50      0.40         2
           6       0.00      0.00      0.00         1
           7       0.00      0.00      0.00         0
           8       1.00      1.00      1.00         1
           9       0.90      1.00      0.95         9
          10       0.00      0.00      0.00         1
          11       0.00      0.00      0.00         0
          12       0.00      0.00      0.00         0
          13       0.00      0.00      0.00         1

    accuracy                           0.61        18
   macro avg       0.16      0.18      0.17        18
weighted avg       0.54      0.61      0.57        18

/usr/local/lib/python3.11/dist-packages/sklearn/metrics/_classification.py:1565: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
/usr/local/lib/python3.11/dist-packages/sklearn/metrics/_classification.py:1565: UndefinedMetricWarning: Recall is ill-defined and being set to 0.0 in labels with no true samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
/usr/local/lib/python3.11/dist-packages/sklearn/metrics/_classification.py:1565: UndefinedMetricWarning: F-score is ill-defined and being set to 0.0 in labels with no true nor predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
/usr/local/lib/python3.11/dist-packages/sklearn/metrics/_classification.py:1565: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
/usr/local/lib/python3.11/dist-packages/sklearn/metrics/_classification.py:1565: UndefinedMetricWarning: Recall is ill-defined and being set to 0.0 in labels with no true samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
/usr/local/lib/python3.11/dist-packages/sklearn/metrics/_classification.py:1565: UndefinedMetricWarning: F-score is ill-defined and being set to 0.0 in labels with no true nor predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
/usr/local/lib/python3.11/dist-packages/sklearn/metrics/_classification.py:1565: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
/usr/local/lib/python3.11/dist-packages/sklearn/metrics/_classification.py:1565: UndefinedMetricWarning: Recall is ill-defined and being set to 0.0 in labels with no true samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
/usr/local/lib/python3.11/dist-packages/sklearn/metrics/_classification.py:1565: UndefinedMetricWarning: F-score is ill-defined and being set to 0.0 in labels with no true nor predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))

tf.random.set_seed(42)
tf.config.experimental.enable_op_determinism()

# Define number of output classes
num_classes = df_media_files['category_id'].nunique()

# Load pretrained base model
base_model = ResNet50(
    weights='imagenet',
    include_top=False,
    input_shape=(224, 224, 3)
)

# Optionally freeze the base model for transfer learning
base_model.trainable = False

# Build full model
model_tf = models.Sequential([
    base_model,
    layers.GlobalAveragePooling2D(),
    layers.Dense(256, activation='relu'),
    layers.Dense(num_classes, activation='softmax')
])

# Compile the model
model_tf.compile(
    optimizer=optimizers.Adam(),
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy']
)

# Optional: show model summary
model_tf.summary()

Model: "sequential_1"

┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Layer (type)                    ┃ Output Shape           ┃       Param # ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ resnet50 (Functional)           │ (None, 7, 7, 2048)     │    23,587,712 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ global_average_pooling2d_1      │ (None, 2048)           │             0 │
│ (GlobalAveragePooling2D)        │                        │               │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_2 (Dense)                 │ (None, 256)            │       524,544 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_3 (Dense)                 │ (None, 14)             │         3,598 │
└─────────────────────────────────┴────────────────────────┴───────────────┘

 Total params: 24,115,854 (91.99 MB)

 Trainable params: 528,142 (2.01 MB)

 Non-trainable params: 23,587,712 (89.98 MB)

# TENSORFLOW

def load_and_preprocess_image(path, label):
    image = tf.io.read_file(path)
    image = tf.image.decode_jpeg(image, channels=3)
    image = tf.image.resize(image, [224, 224]) / 255.0
    return image, label

# Step 1: Extract image paths and labels
train_paths = train_df['file_path'].apply(lambda x: os.path.join(base_dir, x)).tolist()
val_paths   = val_df['file_path'].apply(lambda x: os.path.join(base_dir, x)).tolist()

train_labels = train_df['category_id'].astype(np.int32).tolist()
val_labels   = val_df['category_id'].astype(np.int32).tolist()

# Step 2: Create TensorFlow datasets
train_tf_dataset = tf.data.Dataset.from_tensor_slices((train_paths, train_labels))
val_tf_dataset   = tf.data.Dataset.from_tensor_slices((val_paths, val_labels))

# Step 3: Apply image loading + preprocessing
train_tf_dataset = (
    train_tf_dataset
      .map(load_and_preprocess_image)
      .shuffle(buffer_size=100, seed=42)   # ← now deterministic
      .batch(32)
      .prefetch(tf.data.AUTOTUNE)
)

val_tf_dataset   = val_tf_dataset.map(load_and_preprocess_image).batch(32).prefetch(tf.data.AUTOTUNE)

# Step 4: Extract test image paths and labels
test_paths  = test_df['file_path'].apply(lambda x: os.path.join(base_dir, x)).tolist()
test_labels = test_df['category_id'].astype(np.int32).tolist()

# Step 5: Create TensorFlow test dataset
test_tf_dataset = (
    tf.data.Dataset.from_tensor_slices((test_paths, test_labels))
      .map(load_and_preprocess_image)
      .batch(32)
      .prefetch(tf.data.AUTOTUNE)
)

history = model_tf.fit(train_tf_dataset, validation_data=val_tf_dataset, epochs=50)

Epoch 1/50
3/3 ━━━━━━━━━━━━━━━━━━━━ 39s 9s/step - accuracy: 0.2007 - loss: 2.5783 - val_accuracy: 0.4444 - val_loss: 2.0330
Epoch 2/50
3/3 ━━━━━━━━━━━━━━━━━━━━ 28s 8s/step - accuracy: 0.4018 - loss: 2.2425 - val_accuracy: 0.5556 - val_loss: 1.9432
Epoch 3/50
3/3 ━━━━━━━━━━━━━━━━━━━━ 27s 8s/step - accuracy: 0.4081 - loss: 2.1572 - val_accuracy: 0.5556 - val_loss: 1.9697
Epoch 4/50
3/3 ━━━━━━━━━━━━━━━━━━━━ 27s 8s/step - accuracy: 0.4057 - loss: 2.1570 - val_accuracy: 0.4444 - val_loss: 1.9391
Epoch 5/50
3/3 ━━━━━━━━━━━━━━━━━━━━ 42s 8s/step - accuracy: 0.4135 - loss: 2.0607 - val_accuracy: 0.4444 - val_loss: 1.9496
Epoch 6/50
3/3 ━━━━━━━━━━━━━━━━━━━━ 27s 8s/step - accuracy: 0.3940 - loss: 2.0486 - val_accuracy: 0.4444 - val_loss: 1.9093
Epoch 7/50
3/3 ━━━━━━━━━━━━━━━━━━━━ 32s 10s/step - accuracy: 0.4176 - loss: 1.9853 - val_accuracy: 0.5000 - val_loss: 1.9131
Epoch 8/50
3/3 ━━━━━━━━━━━━━━━━━━━━ 28s 8s/step - accuracy: 0.4384 - loss: 1.9552 - val_accuracy: 0.4444 - val_loss: 1.8762
Epoch 9/50
3/3 ━━━━━━━━━━━━━━━━━━━━ 40s 8s/step - accuracy: 0.4189 - loss: 1.9365 - val_accuracy: 0.4444 - val_loss: 1.8707
Epoch 10/50
3/3 ━━━━━━━━━━━━━━━━━━━━ 32s 10s/step - accuracy: 0.4332 - loss: 1.9124 - val_accuracy: 0.4444 - val_loss: 1.8711
Epoch 11/50
3/3 ━━━━━━━━━━━━━━━━━━━━ 41s 10s/step - accuracy: 0.4865 - loss: 1.7424 - val_accuracy: 0.4444 - val_loss: 1.8878
Epoch 12/50
3/3 ━━━━━━━━━━━━━━━━━━━━ 41s 10s/step - accuracy: 0.4463 - loss: 1.8657 - val_accuracy: 0.4444 - val_loss: 1.9109
Epoch 13/50
3/3 ━━━━━━━━━━━━━━━━━━━━ 29s 8s/step - accuracy: 0.3863 - loss: 1.8915 - val_accuracy: 0.5000 - val_loss: 1.9113
Epoch 14/50
3/3 ━━━━━━━━━━━━━━━━━━━━ 27s 8s/step - accuracy: 0.4529 - loss: 1.8418 - val_accuracy: 0.4444 - val_loss: 1.9033
Epoch 15/50
3/3 ━━━━━━━━━━━━━━━━━━━━ 28s 8s/step - accuracy: 0.4072 - loss: 1.8996 - val_accuracy: 0.4444 - val_loss: 1.9075
Epoch 16/50
3/3 ━━━━━━━━━━━━━━━━━━━━ 27s 8s/step - accuracy: 0.4593 - loss: 1.7750 - val_accuracy: 0.4444 - val_loss: 1.8956
Epoch 17/50
3/3 ━━━━━━━━━━━━━━━━━━━━ 32s 10s/step - accuracy: 0.4593 - loss: 1.7489 - val_accuracy: 0.5000 - val_loss: 1.9028
Epoch 18/50
3/3 ━━━━━━━━━━━━━━━━━━━━ 28s 8s/step - accuracy: 0.4281 - loss: 1.7720 - val_accuracy: 0.5000 - val_loss: 1.8978
Epoch 19/50
3/3 ━━━━━━━━━━━━━━━━━━━━ 28s 8s/step - accuracy: 0.4671 - loss: 1.6903 - val_accuracy: 0.4444 - val_loss: 1.9255
Epoch 20/50
3/3 ━━━━━━━━━━━━━━━━━━━━ 32s 10s/step - accuracy: 0.4828 - loss: 1.6496 - val_accuracy: 0.4444 - val_loss: 1.9367
Epoch 21/50
3/3 ━━━━━━━━━━━━━━━━━━━━ 41s 10s/step - accuracy: 0.4919 - loss: 1.6611 - val_accuracy: 0.5000 - val_loss: 1.9517
Epoch 22/50
3/3 ━━━━━━━━━━━━━━━━━━━━ 27s 8s/step - accuracy: 0.4998 - loss: 1.6509 - val_accuracy: 0.5000 - val_loss: 1.9520
Epoch 23/50
3/3 ━━━━━━━━━━━━━━━━━━━━ 32s 10s/step - accuracy: 0.4437 - loss: 1.7014 - val_accuracy: 0.4444 - val_loss: 1.9643
Epoch 24/50
3/3 ━━━━━━━━━━━━━━━━━━━━ 36s 8s/step - accuracy: 0.4541 - loss: 1.6283 - val_accuracy: 0.4444 - val_loss: 1.9556
Epoch 25/50
3/3 ━━━━━━━━━━━━━━━━━━━━ 27s 8s/step - accuracy: 0.3968 - loss: 1.7200 - val_accuracy: 0.5000 - val_loss: 1.9483
Epoch 26/50
3/3 ━━━━━━━━━━━━━━━━━━━━ 26s 8s/step - accuracy: 0.4425 - loss: 1.6760 - val_accuracy: 0.5000 - val_loss: 1.9512
Epoch 27/50
3/3 ━━━━━━━━━━━━━━━━━━━━ 48s 11s/step - accuracy: 0.4816 - loss: 1.5742 - val_accuracy: 0.4444 - val_loss: 1.9897
Epoch 28/50
3/3 ━━━━━━━━━━━━━━━━━━━━ 28s 8s/step - accuracy: 0.4152 - loss: 1.7099 - val_accuracy: 0.4444 - val_loss: 2.0171
Epoch 29/50
3/3 ━━━━━━━━━━━━━━━━━━━━ 32s 10s/step - accuracy: 0.4425 - loss: 1.5996 - val_accuracy: 0.4444 - val_loss: 2.0194
Epoch 30/50
3/3 ━━━━━━━━━━━━━━━━━━━━ 29s 8s/step - accuracy: 0.3669 - loss: 1.7325 - val_accuracy: 0.4444 - val_loss: 2.0172
Epoch 31/50
3/3 ━━━━━━━━━━━━━━━━━━━━ 41s 8s/step - accuracy: 0.4384 - loss: 1.5623 - val_accuracy: 0.4444 - val_loss: 2.0091
Epoch 32/50
3/3 ━━━━━━━━━━━━━━━━━━━━ 28s 8s/step - accuracy: 0.4464 - loss: 1.5977 - val_accuracy: 0.4444 - val_loss: 1.9982
Epoch 33/50
3/3 ━━━━━━━━━━━━━━━━━━━━ 40s 8s/step - accuracy: 0.4817 - loss: 1.5617 - val_accuracy: 0.4444 - val_loss: 2.0030
Epoch 34/50
3/3 ━━━━━━━━━━━━━━━━━━━━ 32s 10s/step - accuracy: 0.4999 - loss: 1.4963 - val_accuracy: 0.4444 - val_loss: 2.0259
Epoch 35/50
3/3 ━━━━━━━━━━━━━━━━━━━━ 36s 8s/step - accuracy: 0.4685 - loss: 1.5027 - val_accuracy: 0.4444 - val_loss: 2.0612
Epoch 36/50
3/3 ━━━━━━━━━━━━━━━━━━━━ 46s 10s/step - accuracy: 0.4687 - loss: 1.4987 - val_accuracy: 0.3889 - val_loss: 2.0647
Epoch 37/50
3/3 ━━━━━━━━━━━━━━━━━━━━ 36s 8s/step - accuracy: 0.5052 - loss: 1.4540 - val_accuracy: 0.4444 - val_loss: 2.0487
Epoch 38/50
3/3 ━━━━━━━━━━━━━━━━━━━━ 46s 11s/step - accuracy: 0.4609 - loss: 1.4732 - val_accuracy: 0.4444 - val_loss: 2.0677
Epoch 39/50
3/3 ━━━━━━━━━━━━━━━━━━━━ 27s 8s/step - accuracy: 0.5142 - loss: 1.4065 - val_accuracy: 0.4444 - val_loss: 2.0625
Epoch 40/50
3/3 ━━━━━━━━━━━━━━━━━━━━ 41s 8s/step - accuracy: 0.5300 - loss: 1.3800 - val_accuracy: 0.4444 - val_loss: 2.0516
Epoch 41/50
3/3 ━━━━━━━━━━━━━━━━━━━━ 32s 10s/step - accuracy: 0.5183 - loss: 1.4256 - val_accuracy: 0.3889 - val_loss: 2.0613
Epoch 42/50
3/3 ━━━━━━━━━━━━━━━━━━━━ 28s 8s/step - accuracy: 0.5183 - loss: 1.4302 - val_accuracy: 0.4444 - val_loss: 2.0830
Epoch 43/50
3/3 ━━━━━━━━━━━━━━━━━━━━ 27s 8s/step - accuracy: 0.4661 - loss: 1.4003 - val_accuracy: 0.4444 - val_loss: 2.1111
Epoch 44/50
3/3 ━━━━━━━━━━━━━━━━━━━━ 27s 8s/step - accuracy: 0.5442 - loss: 1.3906 - val_accuracy: 0.4444 - val_loss: 2.1136
Epoch 45/50
3/3 ━━━━━━━━━━━━━━━━━━━━ 27s 8s/step - accuracy: 0.5040 - loss: 1.4182 - val_accuracy: 0.3889 - val_loss: 2.1069
Epoch 46/50
3/3 ━━━━━━━━━━━━━━━━━━━━ 46s 11s/step - accuracy: 0.5536 - loss: 1.3613 - val_accuracy: 0.4444 - val_loss: 2.1051
Epoch 47/50
3/3 ━━━━━━━━━━━━━━━━━━━━ 41s 10s/step - accuracy: 0.5210 - loss: 1.3323 - val_accuracy: 0.4444 - val_loss: 2.1104
Epoch 48/50
3/3 ━━━━━━━━━━━━━━━━━━━━ 27s 8s/step - accuracy: 0.5235 - loss: 1.3175 - val_accuracy: 0.4444 - val_loss: 2.1319
Epoch 49/50
3/3 ━━━━━━━━━━━━━━━━━━━━ 32s 10s/step - accuracy: 0.5483 - loss: 1.2732 - val_accuracy: 0.4444 - val_loss: 2.1283
Epoch 50/50
3/3 ━━━━━━━━━━━━━━━━━━━━ 27s 8s/step - accuracy: 0.5419 - loss: 1.3109 - val_accuracy: 0.4444 - val_loss: 2.1354

# Evaluate on Train, Validation and Test Sets

# Evaluate on the Training Set
train_loss, train_acc = model_tf.evaluate(train_tf_dataset, verbose=0)
print(f"✅ TensorFlow Training Accuracy:   {train_acc:.4f}")

# Evaluate on the Validation Set
val_loss, val_acc = model_tf.evaluate(val_tf_dataset, verbose=0)
print(f"✅ TensorFlow Validation Accuracy: {val_acc:.4f}")

# Evaluate on the Test Set
test_loss, test_acc = model_tf.evaluate(test_tf_dataset, verbose=0)
print(f"✅ TensorFlow Test Accuracy:       {test_acc:.4f}")

✅ TensorFlow Training Accuracy:   0.5368
✅ TensorFlow Validation Accuracy: 0.4444
✅ TensorFlow Test Accuracy:       0.5556

# 1) Extract the true labels from each dataset
train_true = np.concatenate([y.numpy() for x, y in train_tf_dataset], axis=0)
val_true   = np.concatenate([y.numpy() for x, y in val_tf_dataset],   axis=0)
test_true  = np.concatenate([y.numpy() for x, y in test_tf_dataset],  axis=0)

# 2) Get the model’s predictions for each split
train_pred = np.argmax(model_tf.predict(train_tf_dataset, verbose=0), axis=1)
val_pred   = np.argmax(model_tf.predict(val_tf_dataset,   verbose=0), axis=1)
test_pred  = np.argmax(model_tf.predict(test_tf_dataset,  verbose=0), axis=1)

# Make sure you have a list of human-readable class names:
# class_names = encoder.classes_.tolist()

# 3) Plot side-by-side confusion matrices
fig, axes = plt.subplots(1, 3, figsize=(18, 5), constrained_layout=True)
for ax, (name, y_true, y_pred) in zip(axes, [
    ("Train",      train_true, train_pred),
    ("Validation", val_true,   val_pred),
    ("Test",       test_true,  test_pred),
]):
    cm = confusion_matrix(y_true, y_pred)
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
                xticklabels=class_names, yticklabels=class_names,
                ax=ax)
    ax.set_title(f"Confusion Matrix – {name}")
    ax.set_xlabel("Predicted")
    ax.set_ylabel("Actual")

plt.show()

# Model names
model_names = ['PyTorch ResNet18', 'TensorFlow ResNet50']

# Replace these with your real accuracy values
train_accuracies = [0.9789, 0.5368]   # e.g. PyTorch train acc, TensorFlow train acc
val_accuracies   = [0.7778, 0.4444]   # your validation accuracies
test_accuracies  = [0.6111, 0.5556]   # e.g. PyTorch test acc, TensorFlow test acc

# X-axis positions and bar width
x = np.arange(len(model_names))
width = 0.25

fig, ax = plt.subplots(figsize=(10, 6))
bar1 = ax.bar(x - width, train_accuracies, width, label='Train')
bar2 = ax.bar(x,         val_accuracies,   width, label='Validation')
bar3 = ax.bar(x + width, test_accuracies,  width, label='Test')

# Labels, title, and legend
ax.set_xticks(x)
ax.set_xticklabels(model_names)
ax.set_ylim(0, 1)
ax.set_ylabel('Accuracy')
ax.set_title('Model Comparison: PyTorch vs TensorFlow')
ax.legend()

# Annotate each bar with its height
for bar_group in (bar1, bar2, bar3):
    for bar in bar_group:
        height = bar.get_height()
        ax.annotate(f'{height:.4f}',
                    xy=(bar.get_x() + bar.get_width() / 2, height),
                    xytext=(0, 3),  # 3 points vertical offset
                    textcoords='offset points',
                    ha='center')

ax.grid(axis='y', linestyle='--', alpha=0.6)
plt.tight_layout()
plt.show()

# Get Predictions for Train and Validation Sets

def get_preds_and_labels_tf(dataset):
    all_preds, all_labels = [], []
    for images, labels in dataset:
        preds = model_tf.predict(images)
        predicted_classes = np.argmax(preds, axis=1)
        all_preds.extend(predicted_classes)
        all_labels.extend(labels.numpy())
    return np.array(all_labels), np.array(all_preds)

# Get predictions
train_true, train_preds_class = get_preds_and_labels_tf(train_tf_dataset)
val_true, val_preds_class     = get_preds_and_labels_tf(val_tf_dataset)

WARNING:tensorflow:5 out of the last 11 calls to <function TensorFlowTrainer.make_predict_function.<locals>.one_step_on_data_distributed at 0x7bed36e7eac0> triggered tf.function retracing. Tracing is expensive and the excessive number of tracings could be due to (1) creating @tf.function repeatedly in a loop, (2) passing tensors with different shapes, (3) passing Python objects instead of tensors. For (1), please define your @tf.function outside of the loop. For (2), @tf.function has reduce_retracing=True option that can avoid unnecessary retracing. For (3), please refer to https://www.tensorflow.org/guide/function#controlling_retracing and https://www.tensorflow.org/api_docs/python/tf/function for  more details.

1/1 ━━━━━━━━━━━━━━━━━━━━ 8s 8s/step
1/1 ━━━━━━━━━━━━━━━━━━━━ 6s 6s/step

WARNING:tensorflow:5 out of the last 11 calls to <function TensorFlowTrainer.make_predict_function.<locals>.one_step_on_data_distributed at 0x7bed36e7eac0> triggered tf.function retracing. Tracing is expensive and the excessive number of tracings could be due to (1) creating @tf.function repeatedly in a loop, (2) passing tensors with different shapes, (3) passing Python objects instead of tensors. For (1), please define your @tf.function outside of the loop. For (2), @tf.function has reduce_retracing=True option that can avoid unnecessary retracing. For (3), please refer to https://www.tensorflow.org/guide/function#controlling_retracing and https://www.tensorflow.org/api_docs/python/tf/function for  more details.

1/1 ━━━━━━━━━━━━━━━━━━━━ 7s 7s/step
1/1 ━━━━━━━━━━━━━━━━━━━━ 3s 3s/step

# Classification Reports


# 1) Identify which labels actually appear in each split
train_labels = np.unique(train_true)
val_labels   = np.unique(val_true)

# 2) Map those back to human-readable names
train_target_names = [class_names[i] for i in train_labels]
val_target_names   = [class_names[i] for i in val_labels]

# 3) Print your TensorFlow classification reports
print("📋 Train Report (TensorFlow):")
print(classification_report(
    train_true,
    train_preds_class,
    labels=train_labels,
    target_names=train_target_names,
    zero_division=0
))

print("📋 Validation Report (TensorFlow):")
print(classification_report(
    val_true,
    val_preds_class,
    labels=val_labels,
    target_names=val_target_names,
    zero_division=0
))

📋 Train Report (TensorFlow):
              precision    recall  f1-score   support

           0       0.00      0.00      0.00         1
           1       1.00      0.25      0.40         4
           2       1.00      1.00      1.00         1
           3       0.00      0.00      0.00         6
           4       0.00      0.00      0.00         4
           5       0.89      0.67      0.76        12
           6       0.67      0.50      0.57         4
           7       0.67      0.50      0.57         4
           8       0.00      0.00      0.00         6
           9       0.47      0.97      0.63        37
          10       0.00      0.00      0.00         3
          11       1.00      0.50      0.67         2
          12       0.00      0.00      0.00         4
          13       0.00      0.00      0.00         7

    accuracy                           0.54        95
   macro avg       0.41      0.31      0.33        95
weighted avg       0.42      0.54      0.43        95

📋 Validation Report (TensorFlow):
              precision    recall  f1-score   support

           1       0.00      0.00      0.00         1
           3       0.00      0.00      0.00         1
           5       0.00      0.00      0.00         3
           6       0.00      0.00      0.00         1
           8       0.00      0.00      0.00         1
           9       0.50      1.00      0.67         8
          10       0.00      0.00      0.00         1
          13       0.00      0.00      0.00         2

   micro avg       0.47      0.44      0.46        18
   macro avg       0.06      0.12      0.08        18
weighted avg       0.22      0.44      0.30        18

# Confusion Matrices


# 4) Plot side‐by‐side confusion matrices
fig, axes = plt.subplots(1, 2, figsize=(20, 8), constrained_layout=True)

# — Train Confusion Matrix —
cm_train = confusion_matrix(train_true, train_preds_class, labels=train_labels)
sns.heatmap(
    cm_train, annot=True, fmt="d", cmap="YlGnBu",
    xticklabels=train_target_names,
    yticklabels=train_target_names,
    ax=axes[0]
)
axes[0].set_title("Confusion Matrix – Train (TensorFlow)")
axes[0].set_xlabel("Predicted")
axes[0].set_ylabel("True")

# — Validation Confusion Matrix —
cm_val = confusion_matrix(val_true, val_preds_class, labels=val_labels)
sns.heatmap(
    cm_val, annot=True, fmt="d", cmap="YlGnBu",
    xticklabels=val_target_names,
    yticklabels=val_target_names,
    ax=axes[1]
)
axes[1].set_title("Confusion Matrix – Validation (TensorFlow)")
axes[1].set_xlabel("Predicted")
axes[1].set_ylabel("True")

plt.show()

print("Train set:", train_true.shape, train_preds_class.shape)
print("Val set:", val_true.shape, val_preds_class.shape)

Train set: (95,) (95,)
Val set: (18,) (18,)

# 1) Get true & pred labels for the test split ---

test_true, test_preds_class = get_preds_and_labels_tf(test_tf_dataset)

# 2) Identify which classes actually appear ---
unique_labels = np.unique(test_true)

# 3) Map those label‐IDs back to names ---
target_names = [class_names[i] for i in unique_labels]

# 4) Print the Test Report ---
print("📋 Test Report (TensorFlow):\n")
print(classification_report(
    test_true,
    test_preds_class,
    labels=unique_labels,
    target_names=target_names,
    zero_division=0
))

1/1 ━━━━━━━━━━━━━━━━━━━━ 3s 3s/step
📋 Test Report (TensorFlow):

              precision    recall  f1-score   support

           1       0.00      0.00      0.00         1
           3       0.00      0.00      0.00         2
           5       0.00      0.00      0.00         2
           6       0.00      0.00      0.00         1
           8       0.00      0.00      0.00         1
           9       0.56      1.00      0.72         9
          10       0.00      0.00      0.00         1
          13       1.00      1.00      1.00         1

    accuracy                           0.56        18
   macro avg       0.20      0.25      0.21        18
weighted avg       0.34      0.56      0.42        18

# 1. Reconnect via the helper function
conn = connect_to_postgres()

# 2. Only fetch the category mapping from the database
query = "SELECT category_id, category_name FROM media.product_category;"
df_media_product_category = pd.read_sql(query, conn)

# 3. Filter only image files (used by the model)
df_images = df_media_files[df_media_files['file_type'] == 'image'].copy()

# 4. Merge image entries with category names
df_labeled = df_images.merge(
    df_media_product_category[['category_id', 'category_name']],
    on='category_id',
    how='left'
)

# 5. Preview unique category names actually used in training
df_class_labels = df_labeled[['category_id', 'category_name']].drop_duplicates().sort_values('category_id')

# Optional: display the result
df_class_labels

Enter PostgreSQL username: postgres
Enter PostgreSQL password: ··········
✅ Connected as 'postgres'.

/tmp/ipython-input-66-1999008503.py:6: UserWarning: pandas only supports SQLAlchemy connectable (engine/connection) or database string URI or sqlite3 DBAPI2 connection. Other DBAPI2 objects are not tested. Please consider using SQLAlchemy.
  df_media_product_category = pd.read_sql(query, conn)

conn = connect_to_postgres()
# ————————————————————————————————————————
# 1. Build original ID→name lookup from SQL
query = "SELECT category_id, category_name FROM media.product_category;"
df_media_product_category = pd.read_sql(query, conn)
orig2name = {
    int(row['category_id']): row['category_name']
    for _, row in df_media_product_category.iterrows()
}

# 2. Build encoded-index→name lookup via your LabelEncoder
#    encoder.classes_ is an array of the original IDs in sorted order
index2name = [
    orig2name.get(int(orig_id), f"ID_{int(orig_id)}")
    for orig_id in encoder.classes_
]
# e.g. index2name[0] is the name for encoded label "0", or "ID_0" if missing

# ————————————————————————————————————————
def plot_cm(y_true, y_pred, index2name, title):
    labels_present = np.unique(y_true)
    names_present  = [index2name[i] for i in labels_present]
    cm = confusion_matrix(y_true, y_pred, labels=labels_present)
    plt.figure(figsize=(8,6))
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
                xticklabels=names_present, yticklabels=names_present)
    plt.title(title)
    plt.xlabel("Predicted")
    plt.ylabel("Actual")
    plt.tight_layout()
    plt.show()
    return labels_present, names_present

def print_cr(y_true, y_pred, index2name, dataset_name):
    labels_present, names_present = plot_cm(
        y_true, y_pred, index2name,
        title=f"{dataset_name} Confusion Matrix"
    )
    print(f"\n📋 Classification Report for {dataset_name} Set:\n")
    print(classification_report(
        y_true, y_pred,
        labels=labels_present,
        target_names=names_present,
        zero_division=0
    ))


def evaluate_torch(model, loader, device):
    """
    Run model in eval-mode over a DataLoader and return
    (all_true_labels, all_predicted_labels) as NumPy arrays.
    """
    model.eval()
    model.to(device)
    all_preds, all_labels = [], []

    with torch.no_grad():
        for images, labels in loader:
            images, labels = images.to(device), labels.to(device)
            outputs = model(images)
            preds   = torch.argmax(outputs, dim=1)
            all_preds.extend(preds.cpu().numpy())
            all_labels.extend(labels.cpu().numpy())

    return np.array(all_labels), np.array(all_preds)

# ————————————————————————————————————————
# Now run for train/val/test:
train_true, train_pred = evaluate_torch(model, train_loader, device)
val_true,   val_pred   = evaluate_torch(model, val_loader,   device)
test_true,  test_pred  = evaluate_torch(model, test_loader,  device)

print_cr(train_true, train_pred, index2name, "Training")
print_cr(  val_true,   val_pred, index2name, "Validation")
print_cr( test_true,  test_pred, index2name, "Test")

Enter PostgreSQL username: postgres
Enter PostgreSQL password: ··········
✅ Connected as 'postgres'.

/tmp/ipython-input-73-2310655759.py:5: UserWarning: pandas only supports SQLAlchemy connectable (engine/connection) or database string URI or sqlite3 DBAPI2 connection. Other DBAPI2 objects are not tested. Please consider using SQLAlchemy.
  df_media_product_category = pd.read_sql(query, conn)

📋 Classification Report for Training Set:

                precision    recall  f1-score   support

          ID_0       1.00      1.00      1.00         1
       agendas       1.00      1.00      1.00         4
       bundles       1.00      1.00      1.00         1
business cards       1.00      1.00      1.00         6
    dft_prints       1.00      1.00      1.00         4
    dtf_prints       1.00      1.00      1.00        12
          hats       1.00      1.00      1.00         4
         laser       1.00      1.00      1.00         4
          misc       1.00      1.00      1.00         6
          mugs       1.00      1.00      1.00        37
         resin       1.00      1.00      1.00         3
        shirts       1.00      1.00      1.00         2
      stickers       1.00      1.00      1.00         4
   sublimation       1.00      1.00      1.00         7

      accuracy                           1.00        95
     macro avg       1.00      1.00      1.00        95
  weighted avg       1.00      1.00      1.00        95

📋 Classification Report for Validation Set:

                precision    recall  f1-score   support

       agendas       1.00      1.00      1.00         1
business cards       0.00      0.00      0.00         1
    dtf_prints       0.60      1.00      0.75         3
          hats       0.00      0.00      0.00         1
          misc       1.00      1.00      1.00         1
          mugs       0.80      1.00      0.89         8
         resin       0.00      0.00      0.00         1
   sublimation       1.00      0.50      0.67         2

      accuracy                           0.78        18
     macro avg       0.55      0.56      0.54        18
  weighted avg       0.68      0.78      0.71        18

📋 Classification Report for Test Set:

                precision    recall  f1-score   support

       agendas       0.00      0.00      0.00         1
business cards       0.00      0.00      0.00         2
    dtf_prints       0.33      0.50      0.40         2
          hats       0.00      0.00      0.00         1
          misc       1.00      1.00      1.00         1
          mugs       0.90      1.00      0.95         9
         resin       0.00      0.00      0.00         1
   sublimation       0.00      0.00      0.00         1

      accuracy                           0.61        18
     macro avg       0.28      0.31      0.29        18
  weighted avg       0.54      0.61      0.57        18

# 0) Reconnect and fetch mapping
conn = connect_to_postgres()
query = "SELECT category_id, category_name FROM media.product_category;"
df_media_product_category = pd.read_sql(query, conn)
conn.close()

# 1) Build original ID→name dict, with int keys
orig2name = {
    int(row['category_id']): row['category_name']
    for _, row in df_media_product_category.iterrows()
}

# 2) Build index2name list via the encoder, with fallback for missing IDs
index2name = [
    orig2name.get(int(orig_id), f"ID_{int(orig_id)}")
    for orig_id in encoder.classes_
]
# Now index2name[0] == orig2name[0] if it exists, else "ID_0"


# ————————————————————————————————————————
def evaluate_model(model, loader, device):
    model.eval()
    model.to(device)
    all_preds, all_labels = [], []
    with torch.no_grad():
        for imgs, labels in loader:
            imgs = imgs.to(device); labels = labels.to(device)
            outs = model(imgs)
            preds = torch.argmax(outs, dim=1)
            all_preds.extend(preds.cpu().numpy())
            all_labels.extend(labels.cpu().numpy())
    return np.array(all_labels), np.array(all_preds)

def plot_cm(y_true, y_pred, names, title):
    labels = np.unique(y_true)
    names_present = [names[i] for i in labels]
    cm = confusion_matrix(y_true, y_pred, labels=labels)
    plt.figure(figsize=(8,6))
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
                xticklabels=names_present,
                yticklabels=names_present)
    plt.title(f"{title} Confusion Matrix")
    plt.xlabel("Predicted")
    plt.ylabel("Actual")
    plt.tight_layout()
    plt.show()

def print_report(y_true, y_pred, names, title):
    labels = np.unique(y_true)
    names_present = [names[i] for i in labels]
    print(f"\n📋 Classification Report — {title} Set:\n")
    print(classification_report(
        y_true, y_pred,
        labels=labels,
        target_names=names_present,
        zero_division=0
    ))

# ————————————————————————————————————————
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Evaluate each split
train_true, train_pred = evaluate_model(model, train_loader, device)
val_true,   val_pred   = evaluate_model(model, val_loader,   device)
test_true,  test_pred  = evaluate_model(model, test_loader,  device)

# 1. Overall accuracies
print(f"✅ PyTorch Training Accuracy:   {accuracy_score(train_true, train_pred):.4f}")
print(f"✅ PyTorch Validation Accuracy: {accuracy_score(val_true,   val_pred):.4f}")
print(f"✅ PyTorch Test Accuracy:       {accuracy_score(test_true,  test_pred):.4f}")

# 2. Detailed reports + confusion matrices
print_report(train_true, train_pred, index2name, "Training")
plot_cm   (train_true, train_pred, index2name, "Training")

print_report(val_true, val_pred, index2name, "Validation")
plot_cm   (val_true, val_pred, index2name, "Validation")

print_report(test_true, test_pred, index2name, "Test")
plot_cm   (test_true, test_pred, index2name, "Test")

Enter PostgreSQL username: postgres
Enter PostgreSQL password: ··········
✅ Connected as 'postgres'.

/tmp/ipython-input-75-297475426.py:4: UserWarning: pandas only supports SQLAlchemy connectable (engine/connection) or database string URI or sqlite3 DBAPI2 connection. Other DBAPI2 objects are not tested. Please consider using SQLAlchemy.
  df_media_product_category = pd.read_sql(query, conn)

✅ PyTorch Training Accuracy:   0.9895
✅ PyTorch Validation Accuracy: 0.7778
✅ PyTorch Test Accuracy:       0.6111

📋 Classification Report — Training Set:

                precision    recall  f1-score   support

          ID_0       1.00      1.00      1.00         1
       agendas       1.00      1.00      1.00         4
       bundles       1.00      1.00      1.00         1
business cards       1.00      1.00      1.00         6
    dft_prints       1.00      1.00      1.00         4
    dtf_prints       1.00      1.00      1.00        12
          hats       1.00      1.00      1.00         4
         laser       1.00      1.00      1.00         4
          misc       1.00      1.00      1.00         6
          mugs       0.97      1.00      0.99        37
         resin       1.00      1.00      1.00         3
        shirts       1.00      1.00      1.00         2
      stickers       1.00      1.00      1.00         4
   sublimation       1.00      0.86      0.92         7

      accuracy                           0.99        95
     macro avg       1.00      0.99      0.99        95
  weighted avg       0.99      0.99      0.99        95

📋 Classification Report — Validation Set:

                precision    recall  f1-score   support

       agendas       1.00      1.00      1.00         1
business cards       0.00      0.00      0.00         1
    dtf_prints       0.60      1.00      0.75         3
          hats       0.00      0.00      0.00         1
          misc       1.00      1.00      1.00         1
          mugs       0.80      1.00      0.89         8
         resin       0.00      0.00      0.00         1
   sublimation       1.00      0.50      0.67         2

      accuracy                           0.78        18
     macro avg       0.55      0.56      0.54        18
  weighted avg       0.68      0.78      0.71        18

📋 Classification Report — Test Set:

                precision    recall  f1-score   support

       agendas       0.00      0.00      0.00         1
business cards       0.00      0.00      0.00         2
    dtf_prints       0.33      0.50      0.40         2
          hats       0.00      0.00      0.00         1
          misc       1.00      1.00      1.00         1
          mugs       0.90      1.00      0.95         9
         resin       0.00      0.00      0.00         1
   sublimation       0.00      0.00      0.00         1

      accuracy                           0.61        18
     macro avg       0.28      0.31      0.29        18
  weighted avg       0.54      0.61      0.57        18

#CV and Error for Baseline Model
import os
import random
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import classification_report, confusion_matrix
import tensorflow as tf
from tensorflow.keras import layers, models
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.preprocessing.image import load_img, img_to_array
from sklearn.preprocessing import LabelEncoder
import psycopg2

# Reproducibility function
def seed_everything(seed=42):
    os.environ['PYTHONHASHSEED'] = str(seed)
    random.seed(seed)
    np.random.seed(seed)
    tf.random.set_seed(seed)

# Image loading
def load_images_and_labels(df, base_dir, target_size=(224, 224)):
    images, labels = [], []
    for _, row in df.iterrows():
        img_path = os.path.join(base_dir, row['file_path'])
        try:
            img = load_img(img_path, target_size=target_size)
            img_array = img_to_array(img) / 255.0
            images.append(img_array)
            labels.append(row['category_id'])
        except:
            continue
    return np.array(images), np.array(labels)

# Connect to PostgreSQL
# FIXED ORDER
conn = connect_to_postgres()

# ✅ Use the connection while it's open
df_media_files = pd.read_sql(
    "SELECT file_path, category_id FROM media.media_files WHERE file_type = 'image';",
    conn
)

# ✅ Close only after the query is done
conn.close()

# Filter and load images
df = df_media_files.sample(n=min(150, len(df_media_files)), random_state=42)
base_dir = "/content/drive/MyDrive/DL/artiszen_media"
images, labels = load_images_and_labels(df, base_dir)

# Encode labels
le = LabelEncoder()
labels_encoded = le.fit_transform(labels)
num_classes = len(le.classes_)
labels_categorical = to_categorical(labels_encoded, num_classes)

# Cross-validation
kfold = StratifiedKFold(n_splits=2, shuffle=True, random_state=42)
fold_results = []

for fold, (train_idx, test_idx) in enumerate(kfold.split(images, labels_encoded)):
    print(f"🔁 Fold {fold + 1}")
    seed_everything(42 + fold)

    x_train, x_test = images[train_idx], images[test_idx]
    y_train, y_test = labels_categorical[train_idx], labels_categorical[test_idx]

    model = models.Sequential([
        layers.Input(shape=(224, 224, 3)),
        layers.Conv2D(32, (3, 3), activation='relu'),
        layers.MaxPooling2D((2, 2)),
        layers.Flatten(),
        layers.Dense(64, activation='relu'),
        layers.Dense(num_classes, activation='softmax')
    ])

    model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
    model.fit(x_train, y_train, epochs=3, batch_size=16, verbose=0)

    y_pred_probs = model.predict(x_test)
    y_pred = np.argmax(y_pred_probs, axis=1)
    y_true = np.argmax(y_test, axis=1)

    # ✅ ADD THIS:
    report = classification_report(y_true, y_pred, output_dict=True, zero_division=0)
    fold_results.append(report)
# Convert nested classification reports into flat DataFrames
report_dfs = []
for report in fold_results:
    df = pd.DataFrame(report).T
    if not df.empty:
        report_dfs.append(df)
# Display results
if report_dfs:
    df_results = pd.concat(report_dfs).groupby(level=0).mean().reset_index()
    print("✅ TensorFlow Cross-Validation Results:")
    print(df_results)

    # Optional: Save to CSV
    df_results.to_csv("tf_cv_results.csv", index=False)
else:
    print("❌ No valid reports to aggregate. Try lowering n_splits or reviewing data quality.")

#CV And EA with Pytorch Post Mapping
import os, random
import numpy as np
import pandas as pd
import torch
from torch import nn
from torch.utils.data import Dataset, DataLoader, Subset
from torchvision import transforms
from PIL import Image
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.preprocessing import LabelEncoder
import matplotlib.pyplot as plt
import seaborn as sns

# 1. Set seed for reproducibility
def seed_everything(seed=42):
    os.environ['PYTHONHASHSEED'] = str(seed)
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

# 2. Dataset class
class ImageDataset(Dataset):
    def __init__(self, df, image_dir, transform=None):
        self.df = df.reset_index(drop=True)
        self.image_dir = image_dir
        self.transform = transform

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        row = self.df.iloc[idx]
        img_path = os.path.join(self.image_dir, row['file_path'])
        image = Image.open(img_path).convert("RGB")
        if self.transform:
            image = self.transform(image)
        label = row['encoded_label']
        return image, label

# 3. Simple CNN
class SimpleCNN(nn.Module):
    def __init__(self, num_classes):
        super(SimpleCNN, self).__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),
            nn.ReLU(), nn.MaxPool2d(2)
        )
        self.fc = nn.Sequential(
            nn.Flatten(),
            nn.Linear(16 * 112 * 112, 128),
            nn.ReLU(),
            nn.Linear(128, num_classes)
        )

    def forward(self, x):
        x = self.conv(x)
        x = self.fc(x)
        return x

# 4. Evaluation function
def evaluate_model(model, loader, device):
    model.eval()
    y_true, y_pred = [], []
    with torch.no_grad():
        for images, labels in loader:
            images, labels = images.to(device), labels.to(device)
            preds = model(images).argmax(1)
            y_true.extend(labels.cpu().numpy())
            y_pred.extend(preds.cpu().numpy())
    return np.array(y_true), np.array(y_pred)

# 5. Confusion matrix and report
def plot_confusion(y_true, y_pred, label_names, title):
    cm = confusion_matrix(y_true, y_pred)
    plt.figure(figsize=(8,6))
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
                xticklabels=label_names, yticklabels=label_names)
    plt.title(title)
    plt.xlabel("Predicted")
    plt.ylabel("Actual")
    plt.tight_layout()
    plt.show()

# 6. MAIN PIPELINE
seed_everything()
image_dir = "/content/drive/MyDrive/DL/artiszen_media"
os.path.join(image_dir, row['file_path'])


# Encode labels
le = LabelEncoder()
df_labeled = df_labeled.dropna(subset=['category_name', 'file_path'])
df_labeled['encoded_label'] = le.fit_transform(df_labeled['category_name'])
label_names = le.classes_
num_classes = len(label_names)

# Transform
transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])

# Dataset
full_dataset = ImageDataset(df_labeled, image_dir, transform=transform)

# Cross-validation
kfold = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

for fold, (train_idx, test_idx) in enumerate(kfold.split(df_labeled, df_labeled['encoded_label'])):
    print(f"\n🔁 Fold {fold+1}")

    train_ds = Subset(full_dataset, train_idx)
    test_ds  = Subset(full_dataset, test_idx)
    train_loader = DataLoader(train_ds, batch_size=16, shuffle=True)
    test_loader  = DataLoader(test_ds,  batch_size=16, shuffle=False)

    model = SimpleCNN(num_classes).to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    criterion = nn.CrossEntropyLoss()

    # Train 3 epochs
    model.train()
    for epoch in range(3):
        for images, labels in train_loader:
            images, labels = images.to(device), labels.to(device)
            optimizer.zero_grad()
            outputs = model(images)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()

    # Evaluation
    y_true, y_pred = evaluate_model(model, test_loader, device)
    acc = accuracy_score(y_true, y_pred)
    print(f"✅ Accuracy: {acc:.4f}")
    present_labels = np.unique(y_true)
    present_names = [label_names[i] for i in present_labels]
    print(classification_report(y_true, y_pred, labels=present_labels, target_names=present_names, zero_division=0))

    plot_confusion(y_true, y_pred, label_names, title=f"Fold {fold+1} Confusion Matrix")

# 1) Fetch a random sample of auto-category labels from Postgres


# Reconnect if needed
conn = connect_to_postgres()

# Pull 20 random media_ids with auto-category labels
query = """
SELECT mf.media_id,
       mf.file_path,
       mf.file_type,
       ml.label      AS auto_label,
       pc.category_name
FROM media.media_files mf
JOIN media.media_labels ml
  ON mf.media_id = ml.media_id
JOIN media.product_category pc
  ON ml.label = pc.category_name
WHERE ml.created_by = 'auto-category'
  AND mf.file_type = 'image'
ORDER BY RANDOM()
LIMIT 20;
"""
df_audit = pd.read_sql(query, conn)
conn.close()

# 2) Display the DataFrame for review
print("=== Random Sample of auto-category Labels ===")
display(df_audit)

# 3) Plot thumbnails in a grid for visual inspection
n = len(df_audit)
cols = 5
rows = (n + cols - 1) // cols
plt.figure(figsize=(cols * 2.5, rows * 2.5))

for idx, row in df_audit.iterrows():
    img_path = os.path.join(base_dir, row["file_path"])
    try:
        img = Image.open(img_path).convert("RGBA")
    except:
        img = Image.new("RGBA", (224,224), (255,0,0,255))  # red placeholder for missing
    ax = plt.subplot(rows, cols, idx + 1)
    ax.imshow(img)
    ax.set_title(f"{row['auto_label']}", fontsize=8)
    ax.axis("off")

plt.suptitle("Thumbnail Audit of auto-category Labels", y=1.02)
plt.tight_layout()
plt.show()

Enter PostgreSQL username: postgres
Enter PostgreSQL password: ··········
✅ Connected as 'postgres'.
=== Random Sample of auto-category Labels ===

/tmp/ipython-input-77-3121270237.py:28: UserWarning: pandas only supports SQLAlchemy connectable (engine/connection) or database string URI or sqlite3 DBAPI2 connection. Other DBAPI2 objects are not tested. Please consider using SQLAlchemy.
  df_audit = pd.read_sql(query, conn)

# 🛠️ Automated Label Mismatch Correction from Audit Sample

import os
import pandas as pd
import matplotlib.pyplot as plt
from PIL import Image

# 1️⃣ Reconnect and fetch the audit sample
conn = connect_to_postgres()
query = """
SELECT mf.media_id,
       mf.file_path,
       mf.file_type,
       ml.label      AS auto_label,
       pc.category_name
FROM media.media_files mf
JOIN media.media_labels ml
  ON mf.media_id = ml.media_id
JOIN media.product_category pc
  ON ml.label = pc.category_name
WHERE ml.created_by = 'auto-category'
  AND mf.file_type = 'image'
ORDER BY RANDOM()
LIMIT 20;
"""
df_audit = pd.read_sql(query, conn)

# 2️⃣ Ensure the new_label column exists
with conn.cursor() as cur:
    cur.execute("""
        ALTER TABLE media.media_labels
        ADD COLUMN IF NOT EXISTS new_label TEXT;
    """)
    conn.commit()

# 3️⃣ Apply corrections for every sampled mismatch
corrected = []
with conn.cursor() as cur:
    for _, row in df_audit.iterrows():
        if row["auto_label"] != row["category_name"]:
            cur.execute("""
                UPDATE media.media_labels
                   SET new_label = %s
                 WHERE media_id = %s
                   AND created_by = 'auto-category';
            """, (row["category_name"], row["media_id"]))
            corrected.append(row["media_id"])
    conn.commit()

# 4️⃣ Verify corrections
if corrected:
    verify_q = sql = """
    SELECT media_id, label, new_label, created_by
      FROM media.media_labels
     WHERE media_id = ANY(%s)
       AND created_by = 'auto-category';
    """
    df_verify = pd.read_sql(verify_q, conn, params=(corrected,))
    print(f"✅ Corrected {len(corrected)} rows. Verification:")
    display(df_verify)
else:
    print("ℹ️ No mismatches found in this sample.")

conn.close()

Enter PostgreSQL username: postgres
Enter PostgreSQL password: ··········
✅ Connected as 'postgres'.

/tmp/ipython-input-78-2353606184.py:26: UserWarning: pandas only supports SQLAlchemy connectable (engine/connection) or database string URI or sqlite3 DBAPI2 connection. Other DBAPI2 objects are not tested. Please consider using SQLAlchemy.
  df_audit = pd.read_sql(query, conn)

ℹ️ No mismatches found in this sample.

# I intentionally modified the following
# Label_id =1, Media_id = 29, Label = was: 'Laser' now: 'Resin', created_by: 'manually-edited(testing)
# 🛠️ Automated Label Mismatch Correction from Audit Sample


conn = connect_to_postgres()

# ─── 2) Load all auto-category image labels + their true category ───────────
df_all = pd.read_sql("""
SELECT
  mf.media_id,
  mf.file_path,
  mf.file_type,
  ml.label       AS auto_label,
  pc.category_name  AS true_category
FROM media.media_files mf
JOIN media.media_labels ml
  ON mf.media_id = ml.media_id
JOIN media.product_category pc
  ON mf.category_id = pc.category_id
WHERE ml.created_by = 'auto-category'
  AND mf.file_type  = 'image'
""", conn)

print("ℹ️  Found", len(df_all), "auto-category images.")
display(df_all.head())

# ─── 3) Identify all mismatches ────────────────────────────────────────────────
df_mismatch = df_all[df_all["auto_label"] != df_all["true_category"]].copy()
if df_mismatch.empty:
    print("✅ No mismatches detected. Nothing to update.")
else:
    print("⚠️  Mismatches detected for media_id(s):", df_mismatch["media_id"].tolist())
    display(df_mismatch)

    # ─── 4) Ensure the `new_label` column exists ─────────────────────────────
    with conn.cursor() as cur:
        cur.execute("""
            ALTER TABLE media.media_labels
            ADD COLUMN IF NOT EXISTS new_label TEXT;
        """)
        conn.commit()
        print("✅ Ensured `new_label` column exists.")

    # ─── 5) Batch-update every mismatched row ────────────────────────────────
    corrected_ids = []
    with conn.cursor() as cur:
        for _, row in df_mismatch.iterrows():
            cur.execute("""
                UPDATE media.media_labels
                   SET new_label   = %s
                 WHERE media_id    = %s
                   AND created_by  = 'auto-category'
            """, (row["true_category"], row["media_id"]))
            corrected_ids.append(row["media_id"])
        conn.commit()
    print(f"✅ Applied corrections to {len(corrected_ids)} row(s).")

    # ─── 6) Verification ─────────────────────────────────────────────────────
    verify_q = """
    SELECT media_id, label AS auto_label, new_label, created_by, created_at
      FROM media.media_labels
     WHERE media_id = ANY(%s)
       AND created_by = 'auto-category';
    """
    df_verify = pd.read_sql(verify_q, conn, params=(corrected_ids,))
    print("🔍 Verification of corrected rows:")
    display(df_verify)

conn.close()

ℹ️  Found 136 auto-category images.

/tmp/ipython-input-19-1940780161.py:13: UserWarning: pandas only supports SQLAlchemy connectable (engine/connection) or database string URI or sqlite3 DBAPI2 connection. Other DBAPI2 objects are not tested. Please consider using SQLAlchemy.
  df_all = pd.read_sql("""

✅ No mismatches detected. Nothing to update.

# 1) Prep display & open connection
pd.set_option('display.max_rows', None)      # show all rows
pd.set_option('display.max_colwidth', None)  # show full file paths & labels
conn = connect_to_postgres()
# NOTE: For media_id=29, the original auto-generated label was 'laser' (created_by='auto-category').
#       I manually corrected it to 'resin' and updated created_by to 'manual-edited'.
# SQL to apply that manual correction
  # UPDATE media.media_labels
  #    SET label      = 'resin',
  #        created_by = 'manual-edited'
  #  WHERE media_id  = 29
  #    AND created_by IN ('auto-category','manual-edited');


# 2) Load the full set of image labels (both auto and manual)
df_all = pd.read_sql("""
    SELECT
      mf.media_id,
      mf.file_path,
      mf.file_type,
      ml.label         AS auto_label,
      pc.category_name AS true_category,
      ml.created_by
    FROM media.media_files AS mf
    JOIN media.media_labels AS ml
      ON mf.media_id = ml.media_id
    JOIN media.product_category AS pc
      ON mf.category_id = pc.category_id
    WHERE ml.created_by IN ('auto-category','manual-edited')
      AND mf.file_type   = 'image'
    ORDER BY mf.media_id;
""", conn)

print(f"ℹ️ Found {len(df_all)} image-label rows:\n")
display(df_all)

# 3) Identify mismatches
df_mismatch = df_all[df_all.auto_label != df_all.true_category]
if df_mismatch.empty:
    print("✅ No mismatches detected. Nothing to update.")
else:
    print("⚠️ Mismatches found for media_id(s):", df_mismatch.media_id.tolist(), "\n")
    display(df_mismatch)

    # 4) Ensure new_label column exists
    with conn.cursor() as cur:
        cur.execute("""
            ALTER TABLE media.media_labels
            ADD COLUMN IF NOT EXISTS new_label TEXT;
        """)
    conn.commit()
    print("✅ Ensured `new_label` column exists.")

    # 5) Apply corrections
    corrected = []
    with conn.cursor() as cur:
        for _, row in df_mismatch.iterrows():
            cur.execute("""
                UPDATE media.media_labels
                   SET new_label   = %s
                 WHERE media_id    = %s
                   AND created_by IN ('auto-category','manual-edited');
            """, (row.true_category, row.media_id))
            if cur.rowcount:
                corrected.append(row.media_id)
    conn.commit()
    print(f"✅ Corrected {len(corrected)} row(s):", corrected)

    # 6) Verify the updates
    df_verify = pd.read_sql("""
        SELECT
          media_id,
          label      AS auto_label,
          new_label,
          created_by,
          created_at
        FROM media.media_labels
        WHERE media_id = ANY(%s)
          AND created_by IN ('auto-category','manual-edited')
        ORDER BY media_id;
    """, conn, params=(corrected,))
    print("\n🔍 Verification of corrected rows:")
    display(df_verify)

# 7) Close connection
conn.close()

Enter PostgreSQL username: postgres
Enter PostgreSQL password: ··········
✅ Connected as 'postgres'.

/tmp/ipython-input-79-147989872.py:12: UserWarning: pandas only supports SQLAlchemy connectable (engine/connection) or database string URI or sqlite3 DBAPI2 connection. Other DBAPI2 objects are not tested. Please consider using SQLAlchemy.
  df_all = pd.read_sql("""

ℹ️ Found 137 image-label rows:

⚠️ Mismatches found for media_id(s): [29]

✅ Ensured `new_label` column exists.
✅ Corrected 1 row(s): [29]

🔍 Verification of corrected rows:

/tmp/ipython-input-79-147989872.py:66: UserWarning: pandas only supports SQLAlchemy connectable (engine/connection) or database string URI or sqlite3 DBAPI2 connection. Other DBAPI2 objects are not tested. Please consider using SQLAlchemy.
  df_verify = pd.read_sql("""

# 1. Connect
conn = connect_to_postgres()

with conn.cursor() as cur:
    # 2. Add official_label and updated_date columns if needed
    cur.execute("""
        ALTER TABLE media.media_labels
        ADD COLUMN IF NOT EXISTS official_label TEXT,
        ADD COLUMN IF NOT EXISTS updated_date TIMESTAMPTZ;
    """)
    conn.commit()
    print("ℹ️ Ensured `official_label` and `updated_date` columns exist.")

    # 3. Populate official_label and updated_date in one go
    cur.execute("""
        UPDATE media.media_labels
           SET official_label = CASE
                                   WHEN new_label IS NOT NULL THEN new_label
                                   ELSE label
                                END,
               updated_date   = NOW()
    """)
    conn.commit()
    print(f"✅ official_label populated and updated_date set on {cur.rowcount} rows.")

    # 4. (Optional) Drop the old `new_label` column if you no longer need it
    # cur.execute("ALTER TABLE media.media_labels DROP COLUMN IF EXISTS new_label;")
    # conn.commit()
    # print("ℹ️ Dropped `new_label` column.")

# 5. Verify
with conn.cursor() as cur:
    cur.execute("""
        SELECT media_id, label AS auto_label, official_label, updated_date
          FROM media.media_labels
         ORDER BY media_id
         LIMIT 20;
    """)
    print("🔍 Sample of updated media_labels:")
    for row in cur.fetchall():
        print(row)

Enter PostgreSQL username: postgres
Enter PostgreSQL password: ··········
✅ Connected as 'postgres'.
ℹ️ Ensured `official_label` and `updated_date` columns exist.
✅ official_label populated and updated_date set on 154 rows.
🔍 Sample of updated media_labels:
(1, 'mugs', 'mugs', datetime.datetime(2025, 6, 20, 6, 17, 28, 668599, tzinfo=datetime.timezone(datetime.timedelta(days=-1, seconds=61200))))
(2, 'mugs', 'mugs', datetime.datetime(2025, 6, 20, 6, 17, 28, 668599, tzinfo=datetime.timezone(datetime.timedelta(days=-1, seconds=61200))))
(3, 'mugs', 'mugs', datetime.datetime(2025, 6, 20, 6, 17, 28, 668599, tzinfo=datetime.timezone(datetime.timedelta(days=-1, seconds=61200))))
(4, 'mugs', 'mugs', datetime.datetime(2025, 6, 20, 6, 17, 28, 668599, tzinfo=datetime.timezone(datetime.timedelta(days=-1, seconds=61200))))
(5, 'mugs', 'mugs', datetime.datetime(2025, 6, 20, 6, 17, 28, 668599, tzinfo=datetime.timezone(datetime.timedelta(days=-1, seconds=61200))))
(6, 'agendas', 'agendas', datetime.datetime(2025, 6, 20, 6, 17, 28, 668599, tzinfo=datetime.timezone(datetime.timedelta(days=-1, seconds=61200))))
(7, 'agendas', 'agendas', datetime.datetime(2025, 6, 20, 6, 17, 28, 668599, tzinfo=datetime.timezone(datetime.timedelta(days=-1, seconds=61200))))
(8, 'stickers', 'stickers', datetime.datetime(2025, 6, 20, 6, 17, 28, 668599, tzinfo=datetime.timezone(datetime.timedelta(days=-1, seconds=61200))))
(9, 'stickers', 'stickers', datetime.datetime(2025, 6, 20, 6, 17, 28, 668599, tzinfo=datetime.timezone(datetime.timedelta(days=-1, seconds=61200))))
(10, 'stickers', 'stickers', datetime.datetime(2025, 6, 20, 6, 17, 28, 668599, tzinfo=datetime.timezone(datetime.timedelta(days=-1, seconds=61200))))
(11, 'stickers', 'stickers', datetime.datetime(2025, 6, 20, 6, 17, 28, 668599, tzinfo=datetime.timezone(datetime.timedelta(days=-1, seconds=61200))))
(12, 'stickers', 'stickers', datetime.datetime(2025, 6, 20, 6, 17, 28, 668599, tzinfo=datetime.timezone(datetime.timedelta(days=-1, seconds=61200))))
(13, 'dtf_prints', 'dtf_prints', datetime.datetime(2025, 6, 20, 6, 17, 28, 668599, tzinfo=datetime.timezone(datetime.timedelta(days=-1, seconds=61200))))
(14, 'dtf_prints', 'dtf_prints', datetime.datetime(2025, 6, 20, 6, 17, 28, 668599, tzinfo=datetime.timezone(datetime.timedelta(days=-1, seconds=61200))))
(15, 'dtf_prints', 'dtf_prints', datetime.datetime(2025, 6, 20, 6, 17, 28, 668599, tzinfo=datetime.timezone(datetime.timedelta(days=-1, seconds=61200))))
(16, 'dtf_prints', 'dtf_prints', datetime.datetime(2025, 6, 20, 6, 17, 28, 668599, tzinfo=datetime.timezone(datetime.timedelta(days=-1, seconds=61200))))
(17, 'dtf_prints', 'dtf_prints', datetime.datetime(2025, 6, 20, 6, 17, 28, 668599, tzinfo=datetime.timezone(datetime.timedelta(days=-1, seconds=61200))))
(18, 'dtf_prints', 'dtf_prints', datetime.datetime(2025, 6, 20, 6, 17, 28, 668599, tzinfo=datetime.timezone(datetime.timedelta(days=-1, seconds=61200))))
(19, 'dtf_prints', 'dtf_prints', datetime.datetime(2025, 6, 20, 6, 17, 28, 668599, tzinfo=datetime.timezone(datetime.timedelta(days=-1, seconds=61200))))
(20, 'dtf_prints', 'dtf_prints', datetime.datetime(2025, 6, 20, 6, 17, 28, 668599, tzinfo=datetime.timezone(datetime.timedelta(days=-1, seconds=61200))))

# Prep display for full output
pd.set_option('display.max_rows', None)
pd.set_option('display.max_colwidth', None)

# Reconnect and query the updated media_labels
conn = connect_to_postgres()
df_labels = pd.read_sql("""
    SELECT
      media_id,
      label,
      created_by,
      new_label,
      official_label,
      created_at,
      updated_date
    FROM media.media_labels
    ORDER BY media_id;
""", conn)
conn.close()

# Show the full result
print("=== media.media_labels ===")
display(df_labels)

Enter PostgreSQL username: postgres
Enter PostgreSQL password: ··········
✅ Connected as 'postgres'.

/tmp/ipython-input-83-2906072900.py:7: UserWarning: pandas only supports SQLAlchemy connectable (engine/connection) or database string URI or sqlite3 DBAPI2 connection. Other DBAPI2 objects are not tested. Please consider using SQLAlchemy.
  df_labels = pd.read_sql("""

=== media.media_labels ===

# 1. Load the full set of official labels
conn = connect_to_postgres()
df_labels = pd.read_sql("""
    SELECT
      ml.media_id,
      ml.official_label
    FROM media.media_labels ml
    JOIN media.media_files mf
      ON ml.media_id = mf.media_id
    WHERE mf.file_type = 'image'
    ORDER BY ml.official_label;
""", conn)
conn.close()

# 2. Compute and display class counts
counts = df_labels['official_label'].value_counts().sort_values()
print("=== Number of images per official_label ===")
display(counts)

# 3. Identify minority classes (e.g. fewer than 20 images)
minority = counts[counts < 20]
print("\n⚠️ Under-represented classes (count < 20):")
display(minority)

Enter PostgreSQL username: postgres
Enter PostgreSQL password: ··········
✅ Connected as 'postgres'.
=== Number of images per official_label ===

/tmp/ipython-input-84-3082260124.py:3: UserWarning: pandas only supports SQLAlchemy connectable (engine/connection) or database string URI or sqlite3 DBAPI2 connection. Other DBAPI2 objects are not tested. Please consider using SQLAlchemy.
  df_labels = pd.read_sql("""

⚠️ Under-represented classes (count < 20):

# Compute class weights inversely proportional to their frequency
class_counts = train_df['category_id'].value_counts().sort_index().values
class_weights = 1.0 / torch.tensor(class_counts, dtype=torch.float)
# Normalize so that sum(weights) == number of classes
class_weights = class_weights / class_weights.sum() * len(class_counts)

# Use the weighted loss to penalize rare classes more heavily
criterion = nn.CrossEntropyLoss(weight=class_weights.to(device))

# 1) Recompute class weights from train_df
class_counts = train_df['category_id'].value_counts().sort_index().values
class_weights = 1.0 / torch.tensor(class_counts, dtype=torch.float)
class_weights = class_weights / class_weights.sum() * len(class_counts)
class_weights = class_weights.to(device)

# 2) Create a sampler so that minority classes are oversampled
#    Sample weights per example = weight[class_id]
example_weights = class_weights[torch.tensor(train_df['category_id'].values)]
sampler = WeightedRandomSampler(
    weights=example_weights,
    num_samples=len(example_weights),
    replacement=True
)

# 3) Rebuild train_loader to use the sampler
train_loader = DataLoader(
    train_dataset,
    batch_size=32,
    sampler=sampler,    # ← replaces shuffle=True
    num_workers=2
)

# 4) Define weighted loss
criterion = nn.CrossEntropyLoss(weight=class_weights)

# 5) Re-train the model head with weighted loss
optimizer = torch.optim.Adam(model.fc.parameters(), lr=0.001)

num_epochs = 20
for epoch in range(num_epochs):
    model.train()
    running_loss = 0.0
    all_preds, all_labels = [], []

    for images, labels in train_loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        outputs = model(images)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

        running_loss += loss.item() * images.size(0)
        preds = outputs.argmax(dim=1)
        all_preds.extend(preds.cpu().numpy())
        all_labels.extend(labels.cpu().numpy())

    epoch_loss = running_loss / len(train_loader.dataset)
    epoch_acc  = accuracy_score(all_labels, all_preds)
    print(f"Epoch {epoch+1:02d}/{num_epochs} — Loss: {epoch_loss:.4f} — Train Acc: {epoch_acc:.4f}")

    # Optional: run a quick validation pass each epoch
    model.eval()
    v_preds, v_labels = [], []
    with torch.no_grad():
        for imgs, lbls in val_loader:
            imgs, lbls = imgs.to(device), lbls.to(device)
            out = model(imgs)
            v_preds.extend(out.argmax(dim=1).cpu().numpy())
            v_labels.extend(lbls.cpu().numpy())
    val_acc = accuracy_score(v_labels, v_preds)
    print(f"           Validation Acc: {val_acc:.4f}\n")

# 6) Final evaluation on train/val/test with your utilities
train_true, train_pred = evaluate_torch(model, train_loader, device)
val_true,   val_pred   = evaluate_torch(model, val_loader,   device)
test_true,  test_pred  = evaluate_torch(model, test_loader,  device)

print_report(train_true, train_pred, index2name, "Training")
plot_cm(train_true, train_pred, index2name, "Training")

print_report(val_true, val_pred, index2name, "Validation")
plot_cm(val_true, val_pred, index2name, "Validation")

print_report(test_true, test_pred, index2name, "Test")
plot_cm(test_true, test_pred, index2name, "Test")

Epoch 01/20 — Loss: 0.3597 — Train Acc: 0.9053
           Validation Acc: 0.7222

Epoch 02/20 — Loss: 0.1737 — Train Acc: 0.9579
           Validation Acc: 0.5556

Epoch 03/20 — Loss: 0.1312 — Train Acc: 1.0000
           Validation Acc: 0.5556

Epoch 04/20 — Loss: 0.1240 — Train Acc: 0.9789
           Validation Acc: 0.5000

Epoch 05/20 — Loss: 0.1293 — Train Acc: 0.9684
           Validation Acc: 0.5000

Epoch 06/20 — Loss: 0.0986 — Train Acc: 0.9263
           Validation Acc: 0.6111

Epoch 07/20 — Loss: 0.0980 — Train Acc: 0.9789
           Validation Acc: 0.6111

Epoch 08/20 — Loss: 0.1082 — Train Acc: 0.9579
           Validation Acc: 0.5556

Epoch 09/20 — Loss: 0.1063 — Train Acc: 0.8737
           Validation Acc: 0.6111

Epoch 10/20 — Loss: 0.1110 — Train Acc: 0.9263
           Validation Acc: 0.6111

Epoch 11/20 — Loss: 0.1017 — Train Acc: 0.9263
           Validation Acc: 0.6111

Epoch 12/20 — Loss: 0.1060 — Train Acc: 0.9474
           Validation Acc: 0.6111

Epoch 13/20 — Loss: 0.0860 — Train Acc: 0.8947
           Validation Acc: 0.6111

Epoch 14/20 — Loss: 0.0575 — Train Acc: 0.9684
           Validation Acc: 0.6667

Epoch 15/20 — Loss: 0.0797 — Train Acc: 0.9263
           Validation Acc: 0.7222

Epoch 16/20 — Loss: 0.0747 — Train Acc: 0.9368
           Validation Acc: 0.7222

Epoch 17/20 — Loss: 0.0623 — Train Acc: 0.9474
           Validation Acc: 0.6667

Epoch 18/20 — Loss: 0.1052 — Train Acc: 0.9789
           Validation Acc: 0.7222

Epoch 19/20 — Loss: 0.0506 — Train Acc: 0.9789
           Validation Acc: 0.7222

Epoch 20/20 — Loss: 0.0684 — Train Acc: 0.9684
           Validation Acc: 0.7222


📋 Classification Report — Training Set:

                precision    recall  f1-score   support

          ID_0       1.00      1.00      1.00         9
       agendas       1.00      1.00      1.00        13
       bundles       1.00      1.00      1.00         6
business cards       0.89      1.00      0.94         8
    dft_prints       1.00      1.00      1.00         4
    dtf_prints       1.00      1.00      1.00         6
          hats       1.00      1.00      1.00         5
         laser       1.00      1.00      1.00         5
          misc       1.00      1.00      1.00        11
          mugs       1.00      0.60      0.75         5
         resin       0.89      1.00      0.94         8
        shirts       1.00      1.00      1.00         7
      stickers       1.00      1.00      1.00         6
   sublimation       1.00      1.00      1.00         2

      accuracy                           0.98        95
     macro avg       0.98      0.97      0.97        95
  weighted avg       0.98      0.98      0.98        95

📋 Classification Report — Validation Set:

                precision    recall  f1-score   support

       agendas       0.50      1.00      0.67         1
business cards       0.00      0.00      0.00         1
    dtf_prints       1.00      1.00      1.00         3
          hats       0.00      0.00      0.00         1
          misc       1.00      1.00      1.00         1
          mugs       1.00      0.75      0.86         8
         resin       1.00      1.00      1.00         1
   sublimation       1.00      0.50      0.67         2

     micro avg       0.93      0.72      0.81        18
     macro avg       0.69      0.66      0.65        18
  weighted avg       0.86      0.72      0.77        18

📋 Classification Report — Test Set:

                precision    recall  f1-score   support

       agendas       0.00      0.00      0.00         1
business cards       0.00      0.00      0.00         2
    dtf_prints       0.50      0.50      0.50         2
          hats       0.00      0.00      0.00         1
          misc       0.33      1.00      0.50         1
          mugs       1.00      0.44      0.62         9
         resin       0.00      0.00      0.00         1
   sublimation       0.50      1.00      0.67         1

     micro avg       0.41      0.39      0.40        18
     macro avg       0.29      0.37      0.29        18
  weighted avg       0.60      0.39      0.43        18

# Save PyTorch model
torch.save(model.state_dict(), "final_model.pt")
print("✅ PyTorch model saved as 'final_model.pt'")

# Save TensorFlow/Keras model
model.save("final_model.h5")
print("✅ TensorFlow model saved as 'final_model.h5'")

# Export requirements to a file
!pip freeze > requirements.txt
print("✅ requirements.txt exported")

Split	Previous Accuracy	New Accuracy
Train	0.54	1.00
Validation	0.44	0.78
Test	0.56	0.61

	media_id	file_name	file_path	file_type	category_id	created_at
0	1	MG2301IMGIflintstones12ozBLK.jpg	artiszen_dataset_images/mugs/MG2301IMGIflintst...	image	9	2025-06-20 01:59:25.470238
1	2	MG2201IMGpochacco12ozWHT.jpg	artiszen_dataset_images/mugs/MG2201IMGpochacco...	image	9	2025-06-20 01:59:25.470238
2	3	MG2201IMGspottiedot12ozWHT.jpg	artiszen_dataset_images/mugs/MG2201IMGspottied...	image	9	2025-06-20 01:59:25.470238
3	4	MG2201IMGhellokitty12ozWHT.jpg	artiszen_dataset_images/mugs/MG2201IMGhellokit...	image	9	2025-06-20 01:59:25.470238
4	5	MG2501IMGstitch12ozBLK.jpg	artiszen_dataset_images/mugs/MG2501IMGstitch12...	image	9	2025-06-20 01:59:25.470238
5	6	AG25A5flowers.jpg	artiszen_dataset_images/agendas/AG25A5flowers.jpg	image	1	2025-06-20 01:59:25.470238
6	7	AG25A5fridak.jpg	artiszen_dataset_images/agendas/AG25A5fridak.jpg	image	1	2025-06-20 01:59:25.470238
7	8	S25teacher3in.jpg	artiszen_dataset_images/stickers/S25teacher3in...	image	12	2025-06-20 01:59:25.470238
8	9	S25morewords2in.jpg	artiszen_dataset_images/stickers/S25morewords2...	image	12	2025-06-20 01:59:25.470238
9	10	STH25abelynda3insq.jpg	artiszen_dataset_images/stickers/STH25abelynda...	image	12	2025-06-20 01:59:25.470238

	category_id	category_name	has_images	has_videos
0	1	agendas	True	True
1	2	bundles	True	False
2	3	business cards	True	False
3	4	dft_prints	False	True
4	5	dtf_prints	True	False
5	6	hats	True	False
6	7	laser	True	False
7	8	misc	True	False
8	9	mugs	True	True
9	10	resin	True	True
10	11	shirts	True	True
11	12	stickers	True	False
12	13	sublimation	True	False
13	14	tumblers	True	True
14	15	uvdtf	True	False

	label_id	media_id	label	confidence	created_by	created_at	new_label	official_label	updated_date
0	1	1	mugs	1.0	auto-category	2025-06-20 01:59:27.230310	None	None	None
1	2	2	mugs	1.0	auto-category	2025-06-20 01:59:27.230310	None	None	None
2	3	3	mugs	1.0	auto-category	2025-06-20 01:59:27.230310	None	None	None
3	4	4	mugs	1.0	auto-category	2025-06-20 01:59:27.230310	None	None	None
4	5	5	mugs	1.0	auto-category	2025-06-20 01:59:27.230310	None	None	None
5	6	6	agendas	1.0	auto-category	2025-06-20 01:59:27.230310	None	None	None
6	7	7	agendas	1.0	auto-category	2025-06-20 01:59:27.230310	None	None	None
7	8	8	stickers	1.0	auto-category	2025-06-20 01:59:27.230310	None	None	None
8	9	9	stickers	1.0	auto-category	2025-06-20 01:59:27.230310	None	None	None
9	10	10	stickers	1.0	auto-category	2025-06-20 01:59:27.230310	None	None	None

Step	Applies to Images	Applies to Videos
Resize	✅	❌
Normalize	✅	❌
Augmentation (Flip, Crop)	✅	❌
Torch Dataset prep	✅	❌
Video preprocessing	❌	🔜 Planned

	media_id	file_path	file_type	auto_label	category_name
0	38	artiszen_dataset_images/laser/LEG23tampapromoS...	image	laser	laser
1	56	artiszen_dataset_images/tumblers/TB22IMGrstyli...	image	tumblers	tumblers
2	53	artiszen_dataset_images/tumblers/TB24IMGabwell...	image	tumblers	tumblers
3	29	artiszen_dataset_images/laser/LEG24lilostchSla...	image	laser	laser
4	40	artiszen_dataset_images/laser/LCUT23cheffMDF.jpg	image	laser	laser
5	60	artiszen_dataset_images/bundles/BDL25DTFsavage...	image	bundles	bundles
6	112	artiszen_dataset_images/shirts/KT24DTFfrogBLK.jpg	image	shirts	shirts
7	76	artiszen_dataset_images/business cards/BSCD250...	image	business cards	business cards
8	96	artiszen_dataset_images/shirts/WT25DTFmomcheer...	image	shirts	shirts
9	1	artiszen_dataset_images/mugs/MG2301IMGIflintst...	image	mugs	mugs
10	108	artiszen_dataset_images/shirts/WH24DTFlollypop...	image	shirts	shirts
11	103	artiszen_dataset_images/shirts/KT25DTFtrashpBL...	image	shirts	shirts
12	28	artiszen_dataset_images/laser/LEG25customPenG.jpg	image	laser	laser
13	63	artiszen_dataset_images/bundles/BDL24DTFpinish...	image	bundles	bundles
14	109	artiszen_dataset_images/shirts/WS24DTFlollypop...	image	shirts	shirts
15	130	artiszen_dataset_images/shirts/MST24DTFgatorGR...	image	shirts	shirts
16	32	artiszen_dataset_images/laser/LCUT2401kuromi30...	image	laser	laser
17	95	artiszen_dataset_images/shirts/KT25DTFstitchWH...	image	shirts	shirts
18	78	artiszen_dataset_images/misc/S25VINYLpursuing.jpg	image	misc	misc
19	134	artiszen_dataset_images/shirts/KT23VINYLfirstd...	image	shirts	shirts

	media_id	file_path	file_type	auto_label	true_category
0	83	artiszen_dataset_images/misc/DTF25popsiclebags.jpg	image	misc	misc
1	22	artiszen_dataset_images/hats/BC25DTFsavagesWHT.jpg	image	hats	hats
2	23	artiszen_dataset_images/hats/BC25DTFarmyRED.jpg	image	hats	hats
3	24	artiszen_dataset_images/hats/BC25DTFbvBLU.jpg	image	hats	hats
4	25	artiszen_dataset_images/hats/BC24DTFgatorGRY.jpg	image	hats	hats

	count
official_label
business cards	2
agendas	2
sublimation	3
mugs	5
hats	5
tumblers	5
stickers	5
bundles	6
misc	6
resin	8
dtf_prints	9
uvdtf	10
laser	17
shirts	54

Deep Learning-Based Product Image Classifier for Artiszën Crafts¶

IST 691 Deep Learning – Final Project¶

Project Objective¶

Connecting Google Colab to PostgreSQL using ngrok¶

Install Dependencies & Import Required Libraries¶

PostgreSQL Schema Setup¶

📁 Connect Google Drive & Define Dataset Paths¶

🧾 Data Ingestion Pipeline Summary (Structured Markdown & Code Guidance)¶

Full Ingestion Pipeline Overview¶

🔄 Defining main()¶

🎯 Outcome:¶

Exploratory Data Analysis¶

Model Development¶

Image Integrity Check: Validating Image Files from Database¶

Normalize and Analyze Image Properties¶

📌 Scope Clarification: Images vs Videos¶

Summary: Media File Type Handling¶

Data Augmentation and Preprocessing¶

Dataset Split Strategy¶

Create Custom Dataset Classes for PyTorch and DataLoaders¶

CNN model training¶

1. PyTorch CNN Model using ResNet18¶

Run Evaluation on Validation and Test Set Using PyTorch¶

Evaluation Summary for PyTorch Model (ResNet18)¶

Confusion Matrix Observations:¶

Classification Metrics:¶

Important Notes:¶

2. TensorFlow CNN Model using ResNet50¶

Run Evaluation on the Train and Validation Sets Using TensorFlow¶

Run Evaluation on the Test Set Using TensorFlow¶

Verifying Category Label Mapping for Image Classification Models¶

TensorFlow ResNet50: Before vs. After Improvements¶

Revisiting PyTorch (Changing the Classes to Their Actual Names)¶

Classification Report Comparison: PyTorch vs TensorFlow¶

A) Using Numeric Class IDs¶

B) After Mapping IDs → Category Names¶

Cross-Validation & Error Analysis¶

OUR BASELINE MODEL CV & ERROR ANALYSIS¶

Label Audit & Cleanup¶

Inspect Official Label Mapping and Identify Minority Classes¶

Under-represented Classes (Count < 20)¶

Fine-Tuning with Transfer Learning (Weighted Sampler & Loss)¶

📌 Conclusion¶