Data Mining Course Project

Smoking and Drinking Prediction

This data mining course project centers on analyzing a Smoking and Drinking Dataset with body signals, which provides critical insights into the patterns and factors contributing to drinking and smoking. The dataset serves as the foundation for exploring various data mining techniques and predictive modeling approaches. More detailed information about the dataset, including its structure, attributes, and data sources, can be found in the Dataset section of this page.


Relevant links:

  • Kaggle
  • Google Colab

Team

  • Luna Julio, Data Scientist (Universidad del Norte, Colombia)
  • Joseph Venegas, Data Scientist (Universidad del Norte, Colombia)
  • Jose Gomez, Data Analyst (Universidad del Norte, Colombia)
  • Edgar Torres, Backend Engineer (Universidad del Norte, Colombia)
  • Giuseppe Gomez, Frontend Engineer (Universidad del Norte, Colombia)

Dataset

Aug 16, 2024

Description

This dataset, sourced from Kaggle, has been obtained from the National Health Insurance Service in Korea, with all personal and sensitive information removed. The dataset aims to facilitate the classification of individuals as smokers or drinkers. Here's a small sample:

Index Sex Age Height Weight Waistline Sight Left Sight Right Hear Left Hear Right SBP DBP BLDS Total Cholesterol HDL Cholesterol LDL Cholesterol Triglyceride Hemoglobin Urine Protein Serum Creatinine SGOT AST SGOT ALT Gamma GTP Smoking Status Drinking Status
0 Male 35 170 75 90.0 1.0 1.0 1.0 1.0 120.0 80.0 99.0 193.0 48.0 126.0 92.0 17.1 1.0 1.0 21.0 35.0 40.0 1.0 Y
1 Male 30 180 80 89.0 0.9 1.2 1.0 1.0 130.0 82.0 106.0 228.0 55.0 148.0 121.0 15.8 1.0 0.9 20.0 36.0 27.0 3.0 N
2 Male 40 165 75 91.0 1.2 1.5 1.0 1.0 120.0 70.0 98.0 136.0 41.0 74.0 104.0 15.8 1.0 0.9 47.0 32.0 68.0 1.0 N
3 Male 50 175 80 91.0 1.5 1.2 1.0 1.0 145.0 87.0 95.0 201.0 76.0 104.0 106.0 17.6 1.0 1.1 29.0 34.0 18.0 1.0 N
4 Male 50 165 60 80.0 1.0 1.2 1.0 1.0 138.0 82.0 101.0 199.0 61.0 117.0 104.0 13.8 1.0 0.8 19.0 12.0 25.0 1.0 N

For detailed information about the measures used in each column, please refer to the Kaggle source.
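
For reference, the raw CSV can be loaded into pandas as follows. This is a minimal sketch; the filename is a placeholder for whatever the Kaggle download produces.

import pandas as pd

# Load the Kaggle export (filename is hypothetical; adjust to your download)
df_smk_drnk = pd.read_csv('smoking_drinking_dataset.csv')

print(df_smk_drnk.shape)  # expected: (991346, 24)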

Description of variables

Index Variable Description
1 Sex Indicates the individual's gender, which can be male or female.
2 Age The individual's age, rounded to the nearest 5 years.
3 Height The individual's height, rounded to the nearest 5 cm.
4 Weight The individual's weight, measured in kilograms.
5 Sight_left Visual acuity of the left eye, assessed as normal or abnormal.
6 Sight_right Visual acuity of the right eye, assessed as normal or abnormal.
7 Hear_left Hearing status of the left ear, classified as normal (1) or abnormal (2).
8 Hear_right Hearing status of the right ear, classified as normal (1) or abnormal (2).
9 SBP (Systolic Blood Pressure) The systolic blood pressure measurement, in mmHg.
10 DBP (Diastolic Blood Pressure) The diastolic blood pressure measurement, in mmHg.
11 BLDS (Fasting Blood Glucose) Blood glucose level after fasting, measured in mg/dL.
12 Tot_chole (Total Cholesterol) Total cholesterol level in the blood, measured in mg/dL.
13 HDL_chole (HDL Cholesterol) High-density lipoprotein cholesterol level in the blood, measured in mg/dL.
14 LDL_chole (LDL Cholesterol) Low-density lipoprotein cholesterol level in the blood, measured in mg/dL.
15 Triglyceride Triglyceride level in the blood, measured in mg/dL.
16 Hemoglobin Hemoglobin concentration in the blood, measured in g/dL.
17 Urine_protein Protein level in urine, classified from absence (-) to significant presence (+4).
18 Serum_creatinine Serum creatinine level, measured in mg/dL.
19 SGOT_AST (SGOT/AST) Serum level of aspartate aminotransferase (SGOT/AST), measured in IU/L.
20 SGOT_ALT (SGOT/ALT) Serum level of alanine aminotransferase (SGOT/ALT), measured in IU/L.
21 Gamma_GTP Serum level of gamma-glutamyl transpeptidase (Gamma-GTP), measured in IU/L.
22 SMK_stat_type_cd (Smoking Status) Classification of smoking status, which can be never smoked (1), used to smoke but quit (2), or currently smoking (3).
23 DRK_YN (Drinker or Not) Indicates whether the individual consumes alcohol or not.

Justification

When patients are diagnosed with a respiratory problem or another health condition, they are often reluctant to disclose their smoking or drinking habits, or how frequently they engage in them. This reluctance may stem from fear of being judged or of appearing careless about their health, and it can make it difficult for doctors to prescribe appropriate treatment. It is therefore valuable to obtain accurate information about a patient's smoking and drinking status without relying solely on self-reports. Research has shown that this information can be predicted from measurable body signals such as blood pressure, cholesterol, urine protein, and certain enzymes. The main objective of this project is thus to train a machine learning model that predicts whether a patient consumes alcohol and whether they smoke.

Goals

  1. Perform an exploratory analysis of the data that will be used to train a machine learning model.
  2. Use that model to predict whether a person consumes alcohol or smokes.

Relevance

The primary goal of this project is to analyze body signals and classify individuals based on their smoking and drinking habits. The dataset aligns perfectly with these objectives by providing key health metrics (e.g., blood pressure, cholesterol levels, and liver enzymes) alongside smoking and drinking status. These variables are essential for identifying potential health risks associated with these habits, enabling the development of predictive models and classifications.

Case studies

Classifying smoking urges via machine learning

This study explores the use of machine learning methods to predict high-urge states during a smoking cessation attempt. Three classification approaches (Bayes, discriminant analysis, and decision trees) were evaluated using data from over 300 participants. Decision trees combined with feature selection achieved the best performance, correctly classifying about 86% of cases. This highlights the potential of machine learning to enhance smoking cessation interventions through mobile technologies, optimizing healthcare resources and providing personalized real-time support.

Dumortier, A., Beckjord, E., Shiffman, S., & Sejdić, E. (2016). Classifying smoking urges via machine learning. Computer methods and programs in biomedicine, 137, 203-213.

Analyzing the influence of smoking and alcohol drinking behaviors on body type with deep learning, machine learning and data mining techniques

This study examines deep learning, machine learning, and data mining approaches to analyzing body types and patterns of alcohol consumption and smoking across individuals. While much of the literature highlights the risks of smoking and of excessive, long-term alcohol use, some studies also cite benefits of limited, moderate alcohol consumption. The analysis aims to provide insights to leaders and society in general by investigating behaviors that have been present for centuries across different individuals, groups, and societies.

Yavuz, Ö. (2022). Analyzing influence of smoking and alcohol drinking behaviors on body type with deep learning, machine learning and data mining techniques. Premium e-Journal of Social Sciences (PEJOSS), 6(19), 46-58.

Hypotheses Based on Health Metrics for Predicting Smoking and Drinking Behaviors

Cholesterol (HDL, LDL, and Triglycerides):

Smoking: Smoking affects lipid profiles, with variations observed between adults and adolescents. In adult smokers, studies show higher levels of total cholesterol, LDL, VLDL, and triglycerides, and lower levels of HDL and apolipoprotein A. Adolescent smokers, however, often have lower levels of total cholesterol and HDL compared to non-smokers. Gender differences are also noted, with male smokers exhibiting a higher LDL/HDL ratio and a more pronounced reduction in HDL, potentially due to hormonal changes during puberty.

Drinking: Alcohol consumption has mixed effects on cholesterol and triglycerides. Moderate alcohol intake tends to increase HDL ("good" cholesterol), which benefits heart health. However, excessive alcohol consumption can raise LDL ("bad" cholesterol) levels, contributing to plaque buildup in arteries and increasing cardiovascular disease risk. Excessive drinking is also linked to elevated triglyceride levels, exacerbating cardiovascular risks.

Blood Pressure (SBP and DBP):

Smoking: Nicotine in cigarettes can elevate both systolic and diastolic blood pressure due to its vasoconstrictive effects, causing the heart to pump harder.

Drinking: Alcohol consumption is associated with increased systolic and diastolic blood pressure. This effect is more pronounced in individuals over 40 years old, while the relationship is less clear in younger individuals. Some older non-drinkers may have slightly higher blood pressure compared to moderate drinkers, possibly due to factors like obesity. Generally, reducing alcohol intake can help prevent hypertension.

Blood Glucose (BLDS):

Smoking: Smokers may have higher fasting blood glucose levels due to smoking's impact on insulin sensitivity, increasing the risk of type 2 diabetes.

Drinking: Chronic alcohol consumption can disrupt glucose regulation, potentially increasing or decreasing blood glucose levels depending on liver function.

Liver Enzymes (SGOT/AST, SGPT/ALT, Gamma-GTP):

Drinking: Excessive alcohol consumption is linked to elevated levels of liver enzymes such as Gamma-GTP, AST, and ALT, indicating liver damage or fatty liver, which are common in chronic drinkers.

Serum Creatinine:

Drinking: Alcohol can affect kidney function, which may be reflected in abnormal serum creatinine levels, especially with prolonged excessive drinking.

Improving Patient Care through Predictive Modeling:

To enhance patient diagnosis and treatment, it is crucial to use objective and precise data to predict behaviors such as smoking and drinking. By developing a model that predicts these behaviors based on health metrics like sex, age, height, weight, blood pressure, and cholesterol levels, healthcare providers can offer more personalized interventions. An effective model could identify individuals at risk of smoking or drinking based on their health characteristics, enabling tailored treatment plans. Additionally, the model could facilitate early intervention strategies by identifying risk patterns before they become serious, promoting more effective prevention and better health management.

References

  • Bondía, J. P., Abaitua, F. R., López, R. E., Martínez, J. I., Mazón, H. F., Izurzu, Y. A., & Elízaga, I. V. (1997). Distribución de las variables lipídicas en adolescentes fumadores. An Esp Pediatr, 46, 245-251.
  • Nasiff-Hadad, A., Gira, P., & Bruckert, E. (1997). Efectos del alcohol sobre las lipoproteínas. Revista Cubana de Medicina, 36(1), 51-60.
  • Hernández, J. P., Zanón, E. B., Rodríguez, F. S., Fácila, L., Gil, R. R., Díaz, P., & Mercé, S. Índice de Rigidez Arterial (SIDVP) y Velocidad de Onda de Pulso (baPWV) como estimadores de la Presión Arterial.
  • Milon, H., Froment, A., Gaspard, P., Guidollet, J., & Ripoll, J. P. (1982). Alcohol consumption and blood pressure in a French epidemiological study. European heart journal, 3(suppl_C), 59-64.
  • Valenzuela Magdaleno, A. Monitoreo continuo y automonitoreo de glucosa como seguimiento glucémico y autocuidado de pacientes con diabetes mellitus tipo 2.

Exploratory Data Analysis

Sep 13, 2024

Note: these are summary conclusions; more detailed information can be found in the Google Colab notebook.

Data overview

The dataset consists of 991,346 observations and 24 variables.

Data Completeness

All variables are fully populated with no missing data.

Data Types:
  • int64: 3 variables (age, height, weight)
  • float64: 19 variables (e.g., waistline, cholesterol levels, blood pressure)
  • object: 2 variables (sex, DRK_YN)
Next Steps:
  • Data Quality: Check for consistency in categorical values.
  • Outliers: Identify and handle any potential outliers.

We will proceed with a thorough analysis to ensure the data's reliability and readiness for further processing.

Google Colab
df_smk_drnk.info()
Data columns (total 24 columns):
     #   Column            Non-Null Count   Dtype
    ---  ------            --------------   -----
     0   sex               991346 non-null  object
     1   age               991346 non-null  int64
     2   height            991346 non-null  int64
     3   weight            991346 non-null  int64
     4   waistline         991346 non-null  float64
     5   sight_left        991346 non-null  float64
     6   sight_right       991346 non-null  float64
     7   hear_left         991346 non-null  float64
     8   hear_right        991346 non-null  float64
     9   SBP               991346 non-null  float64
     10  DBP               991346 non-null  float64
     11  BLDS              991346 non-null  float64
     12  tot_chole         991346 non-null  float64
     13  HDL_chole         991346 non-null  float64
     14  LDL_chole         991346 non-null  float64
     15  triglyceride      991346 non-null  float64
     16  hemoglobin        991346 non-null  float64
     17  urine_protein     991346 non-null  float64
     18  serum_creatinine  991346 non-null  float64
     19  SGOT_AST          991346 non-null  float64
     20  SGOT_ALT          991346 non-null  float64
     21  gamma_GTP         991346 non-null  float64
     22  SMK_stat_type_cd  991346 non-null  float64
     23  DRK_YN            991346 non-null  object
    dtypes: float64(19), int64(3), object(2)

Converting Object Variables to Categorical Types

In this dataset, converting object variables to categorical types is a crucial step for efficient data handling and analysis. This conversion is essential for the following reasons:

  • Memory Optimization: Converting object types to categorical reduces memory usage significantly.
  • Ease of Analysis: Categorical data simplifies statistical analysis and modeling processes.
  • Discrete Value Representation: Categorical variables represent discrete values, which is particularly important for:
    • Group-Wise Operations: Facilitates operations and aggregations based on categories.
    • Machine Learning Models: Enables the creation of dummy variables and improves model performance.

Variables such as sex, SMK_stat_type_cd, and DRK_YN are prime candidates for this conversion.

This can be done with the pd.Categorical() constructor. Taking sex as an example:

Google Colab
df_smk_drnk['sex'].unique()
array(['Male', 'Female'], dtype=object)
Google Colab
# Map original values to new values
df_smk_drnk['sex'] = df_smk_drnk['sex'].map({'Male': 'M', 'Female': 'F'})
# Convert column to categorical type
df_smk_drnk['sex'] = pd.Categorical(df_smk_drnk['sex'], categories=['M', 'F'], ordered=True)

Doing the same for the remaining columns, SMK_stat_type_cd and DRK_YN:
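
A minimal sketch of those conversions, assuming the raw codes shown in the dataset description (the notebook may differ in detail):

# Smoking status: 1 = never, 2 = quit, 3 = still smoking (ordered categories)
df_smk_drnk['SMK_stat_type_cd'] = pd.Categorical(
    df_smk_drnk['SMK_stat_type_cd'], categories=[1.0, 2.0, 3.0], ordered=True)

# Drinking status: plain two-level category
df_smk_drnk['DRK_YN'] = pd.Categorical(
    df_smk_drnk['DRK_YN'], categories=['N', 'Y'])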

Google Colab
df_smk_drnk.info()
Data columns (total 24 columns):
     #   Column            Non-Null Count   Dtype
    ---  ------            --------------   -----
     0   sex               991346 non-null  category
     .
     .
     .
     22  SMK_stat_type_cd  991346 non-null  category
     23  DRK_YN            991346 non-null  category
    dtypes: category(3), float64(18), int64(3)

Data visualization

Analysis of categorical variables

Sex distribution
  • M (Male): 526,415 individuals.
  • F (Female): 464,931 individuals.

Interpretation: The dataset has a fairly balanced gender distribution, with a slightly higher number of males than females. This balance is beneficial for gender-specific analysis, as it minimizes the risk of gender bias in the results.

Smoking status (SMK_stat_type_cd)
  • Never smoked: 602,441 individuals
  • Still smoke: 213,954 individuals
  • Used to smoke but quit: 174,951 individuals

Interpretation: The majority of individuals (about 60.8%) in the dataset have never smoked. About 21.6% are current smokers, while 17.6% used to smoke but quit. This distribution allows for detailed analysis of smoking habits and their impact on health outcomes.

Drinking Status (DRK_YN)
  • N (Non-drinkers): 495,858 individuals
  • Y (Drinkers): 495,488 individuals

Interpretation: The dataset is almost perfectly balanced in terms of drinking status, with an almost equal split between drinkers and non-drinkers. This balance provides a solid foundation for comparative analysis between these two groups.

Age Distribution by Drinking Status
  • Drinking appears to peak between the ages of 30 and 50, with a gradual decline starting around 50 years old.
  • The number of non-drinkers increases with age, particularly after 60 years, potentially due to health concerns or lifestyle changes.
  • Younger individuals (ages 20–30) show a noticeable presence of drinkers, but non-drinkers remain relatively close in count.
  • Around ages 55–60, the proportion of non-drinkers surpasses that of drinkers. This shift suggests a critical age where drinking behavior declines substantially.
Age Distribution by Smoking Status
  • The "Still Smoking" category (blue) has the highest density across all age groups, indicating that a significant portion of the population continues to smoke.
  • The "Never Smoked" category (green) has a relatively consistent but lower density compared to "Still Smoking." This suggests that a smaller but steady group of individuals never pick up smoking across all ages.
  • The "Quit Smoking" category (orange) appears to increase slightly with age, especially beyond 40 years. This may indicate that older individuals are more likely to quit smoking, possibly due to health concerns or lifestyle changes.
  • Younger individuals (ages 20–30) show fewer instances of smoking (both active and quit), likely reflecting greater awareness or a shift in social norms discouraging smoking in recent years.
Pairplot of Key Body Signals by Drinking Status
  • Both drinkers and non-drinkers have similar hemoglobin distributions, with minor differences visible in density. This suggests hemoglobin might not strongly differentiate drinkers from non-drinkers.
  • Drinkers have a broader range and significantly higher values for gamma-GTP compared to non-drinkers. This confirms gamma-GTP as a potential key indicator for distinguishing drinkers due to its known association with liver stress from alcohol consumption.
  • Most values for serum creatinine are concentrated at lower levels for both groups, with no significant visual separation between drinkers and non-drinkers.

RELATIONSHIPS:

  • Hemoglobin vs. Gamma-GTP: No clear relationship is observed, but drinkers have more outliers for gamma-GTP at certain hemoglobin levels.
  • Gamma-GTP vs. Serum Creatinine: There is no evident correlation, as serum creatinine values remain low regardless of gamma-GTP levels.
Pairplot of Key Body Signals by Smoking Status
  • Never Smokers (1): Never-smokers generally have the highest hemoglobin levels, which is expected as they are not exposed to the harmful effects of smoking.
  • Former Smokers (2): Former smokers seem to have intermediate hemoglobin levels between current smokers and never-smokers. This suggests that quitting smoking can lead to some recovery in hemoglobin levels over time.
  • Smokers (3): Smokers tend to have slightly lower hemoglobin levels compared to never-smokers and former smokers. This could be due to various factors, including reduced oxygen-carrying capacity in the blood due to carbon monoxide exposure.
  • Gamma-GTP: There's a clear difference in distribution. The "still smoke" group has a higher concentration of gamma-GTP compared to the other two groups. This could indicate potential liver damage or dysfunction associated with smoking.
  • Serum Creatinine: The distribution is similar across all groups, suggesting no significant impact of smoking status on kidney function.

RELATIONSHIPS:

No clear relationship is visible between any of the variables.

Data cleaning

Outliers

Outliers are data points that significantly differ from other observations in the dataset. They can be unusually high or low compared to the rest of the data and can arise due to variability in the data or errors in measurement.

Outlier Detection Using Interquartile Range (IQR)

  • Technique: The IQR method was used to detect outliers. It involves calculating the first quartile (Q1) and the third quartile (Q3) of the data for each numeric feature. The interquartile range (IQR) is then calculated as the difference between Q3 and Q1. Outliers are identified as data points that fall below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR.
  • Reason: The IQR method is a robust technique that is less sensitive to extreme values than methods based on mean and standard deviation. It is commonly used for datasets that may have non-normal distributions, which is often the case in medical data.
  • Action: Identified outliers were visualized using boxplots, and their counts were plotted to determine the extent of the issue. Outliers were removed to create a cleaner dataset for further analysis.
Google Colab
import numpy as np

# Select only numeric columns for outlier detection
numeric_cols = df_smk_drnk.select_dtypes(include=[np.number]).columns

# Calculate the IQR for each column
Q1 = df_smk_drnk[numeric_cols].quantile(0.25)
Q3 = df_smk_drnk[numeric_cols].quantile(0.75)
IQR = Q3 - Q1

# Define outliers using the IQR method
outliers = ((df_smk_drnk[numeric_cols] < (Q1 - 1.5 * IQR)) | (df_smk_drnk[numeric_cols] > (Q3 + 1.5 * IQR)))
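
The removal step itself is not shown in the cell above; here is a minimal sketch of how the flagged rows can be dropped to produce the df_cleaned frame used in the rest of this section, assuming a row is discarded if any numeric column is flagged:

# Keep only rows with no outlier in any numeric column
df_cleaned = df_smk_drnk[~outliers.any(axis=1)].copy()
print(f'Rows before: {len(df_smk_drnk)}, after: {len(df_cleaned)}')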

Plotting: boxplots of each numeric feature and per-column outlier counts (figures in the Colab notebook).

Data imputation

There are no missing values in the dataset.

Google Colab
print((df_cleaned.isnull().sum() / len(df_cleaned)) * 100)
    sex                 0.0
    age                 0.0
    height              0.0
    weight              0.0
    waistline           0.0
    sight_left          0.0
    sight_right         0.0
    hear_left           0.0
    hear_right          0.0
    SBP                 0.0
    DBP                 0.0
    BLDS                0.0
    tot_chole           0.0
    HDL_chole           0.0
    LDL_chole           0.0
    triglyceride        0.0
    hemoglobin          0.0
    urine_protein       0.0
    serum_creatinine    0.0
    SGOT_AST            0.0
    SGOT_ALT            0.0
    gamma_GTP           0.0
    SMK_stat_type_cd    0.0
    DRK_YN              0.0
    dtype: float64

There were 14 duplicated rows.

Google Colab
num_duplicated = df_cleaned.duplicated().sum()
print(f'Number of duplicate rows: {num_duplicated}')
Number of duplicate rows: 14

After elimination:

Google Colab
df_cleaned.drop_duplicates(inplace=True)
num_duplicated = df_cleaned.duplicated().sum()
print(f'Number of duplicate rows: {num_duplicated}')
Number of duplicate rows: 0

Data Migration to Google BigQuery

Oct 18, 2024

Migration Plan

After completing the data cleaning process and obtaining a clean dataset, the migration of the data to BigQuery was carried out. Before performing the migration, the following steps were taken:

  • Transformation of Categorical Columns:

    • SMK_stat_type_cd: Converted to integer (int8).

    • sex and DRK_YN: Converted to text (string).

  • Encoding of Categorical Columns:

    • A label encoder was applied to transform the sex column, where "Male" was converted to 1 and "Female" to 0. Similarly, the DRK_YN column was transformed, assigning the value 1 for "Y" and 0 for "N".
  • Verification of Numeric Column Formats:

    • It was checked that the numeric columns were correctly formatted as int64 or float64.
  • Loading Strategy:

    • The pandas_gbq.to_gbq function was used to perform the migration in a single operation, utilizing the option if_exists='replace' to replace the table if it already existed.
    • The migration was carried out using credentials from a service account with write permissions in BigQuery.
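
A minimal sketch of these steps (the table, project, and key-file names are placeholders, not the project's actual values; column states follow the cleaned frame from the previous section):

import pandas_gbq
from google.oauth2 import service_account

# Service-account credentials (hypothetical key-file path)
credentials = service_account.Credentials.from_service_account_file('service_account.json')

# Transformations and encodings described above
df_cleaned['SMK_stat_type_cd'] = df_cleaned['SMK_stat_type_cd'].astype('int8')
df_cleaned['sex'] = df_cleaned['sex'].map({'M': 1, 'F': 0}).astype('int64')
df_cleaned['DRK_YN'] = df_cleaned['DRK_YN'].map({'Y': 1, 'N': 0}).astype('int64')

# Single-operation load; the table name uses the dataset.table format
pandas_gbq.to_gbq(
    df_cleaned,
    destination_table='health.smoking_drinking',  # hypothetical dataset.table
    project_id='my-gcp-project',                  # hypothetical project id
    if_exists='replace',
    credentials=credentials,
)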

Execution of the Migration Process

  • Environment Setup:

    • Credentials were configured using the JSON file of the service account to establish the connection with BigQuery.
  • Data Upload:

    • The pandas_gbq.to_gbq() function was employed to load the dataset from pandas to BigQuery.
  • Verification of the Load:

    • Once the load was completed, the correct migration was verified through simple SQL queries in BigQuery.
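
For example, a quick row-count check can be issued directly from the notebook (a sketch using the same hypothetical names as above):

# Sanity check: row count of the uploaded table
query = 'SELECT COUNT(*) AS n FROM health.smoking_drinking'
print(pandas_gbq.read_gbq(query, project_id='my-gcp-project', credentials=credentials))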

Challenges Encountered and Solutions

  • 403 Access Denied:

    • Problem: The service account did not have the necessary permissions to create datasets.
    • Solution: The roles BigQuery Data Editor and BigQuery Admin were assigned to the service account from the Google Cloud Console.
  • Incorrect Table Name Format:

    • Problem: Initially, an incorrect table name format was used.
    • Solution: The format was adjusted to dataset.table, removing the project reference from the table name.

Model

Oct 18, 2024

For this task, the RandomForestClassifier model was selected due to its robustness and ability to handle a large number of features effectively.

Training

Data Splitting

The dataset was split into training (70%) and test (30%) sets using train_test_split with stratification to maintain the balance of classes in both sets.

Model Training

  • Training Data: The training data was used to fit the RandomForestClassifier.
  • Feature Scaling: We applied StandardScaler to ensure all features were on the same scale.

Computational Resources

The training was conducted on Google Colab, which provided ample CPU resources for model training.

Stratified 10-Fold Cross-Validation

To evaluate the model's performance effectively, we implemented Stratified 10-Fold Cross-Validation. This method was chosen specifically to maintain the class distribution across each fold, which is crucial when dealing with imbalanced datasets. By ensuring that each training and testing set reflects the overall class distribution, we can mitigate potential biases and improve the reliability of our model evaluation.

Here is an overview of the function:

Google Colab
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix, classification_report)
from sklearn.model_selection import (train_test_split, StratifiedKFold,
                                     cross_val_score)
from sklearn.preprocessing import StandardScaler

def train_model(df, target_col, model_name):

    X = df.drop(columns=['SMK_stat_type_cd', 'DRK_YN'])  # Features (excluding both target variables)
    y = df[target_col]  # Target (smoking or drinking)

    # Split data into training and test sets (70% training, 30% testing)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)

    # Feature scaling
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)

    # Initialize RandomForestClassifier
    rf_model = RandomForestClassifier(random_state=42, n_estimators=100, max_depth=10, min_samples_split=5)

    # Train the model
    rf_model.fit(X_train_scaled, y_train)

    # Model Evaluation on the test set
    y_pred = rf_model.predict(X_test_scaled)

    # Performance metrics
    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred, average='macro')
    recall = recall_score(y_test, y_pred, average='macro')
    f1 = f1_score(y_test, y_pred, average='macro')

    # Confusion matrix
    conf_matrix = confusion_matrix(y_test, y_pred)

    # Print formatted results
    print(f'# {model_name} Classification Results\n')
    print(f'**Accuracy:** {accuracy:.3f}')
    print(f'**Precision:** {precision:.3f}')
    print(f'**Recall:** {recall:.3f}')
    print(f'**F1-Score:** {f1:.3f}\n')

    print(f'## Classification Report:')
    print(classification_report(y_test, y_pred))

    # Cross-validation with StratifiedKFold
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
    cv_scores = cross_val_score(rf_model, X_train_scaled, y_train, cv=cv, scoring='accuracy')

    # Print cross-validation results
    print(f'## Cross-Validation Accuracy Scores: {cv_scores}')
    print(f'**Mean CV Accuracy:** {np.mean(cv_scores):.3f}\n')

    # Return model, confusion matrix, test labels, and unscaled X_train for feature importance
    return rf_model, conf_matrix, y_test, X_train

Smoking Status

The classification report returned these results:

Classification Report

Accuracy: 0.721

Precision: 0.590

Recall: 0.579

F1-Score: 0.583

Classification Report:

Label                          Precision  Recall  F1-Score  Support
1.0 (never)                    0.84       0.86    0.85      126,741
2.0 (used to smoke but quit)   0.45       0.36    0.40       29,950
3.0 (still smoke)              0.49       0.51    0.50       35,138

Cross-Validation Accuracy Scores:

0.72062109, 0.72017426, 0.72231903, 0.71899017, 0.7169571, 0.72115728, 0.71925827, 0.71624218, 0.71919124, 0.72097232

Mean CV Accuracy: 0.720

The confusion matrix figure is available in the Colab notebook.
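
A minimal sketch of how such a matrix can be plotted from the values returned by train_model above (the display labels are assumptions based on the class codes):

import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay

# conf_matrix is the matrix returned by train_model
disp = ConfusionMatrixDisplay(conf_matrix,
                              display_labels=['never', 'quit', 'still smoke'])
disp.plot(cmap='Blues')
plt.show()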

Drinking Status

The classification report returned these results:

Classification Report

Accuracy: 0.709

Precision: 0.709

Recall: 0.709

F1-Score: 0.709

Classification Report:

Label      Precision  Recall  F1-Score  Support
0.0 (No)   0.73       0.72    0.72      100,976
1.0 (Yes)  0.69       0.70    0.69       90,853

Cross-Validation Accuracy Scores:

0.70828865, 0.70976318, 0.70989723, 0.70781948, 0.70958445, 0.70735031, 0.70504915, 0.70592046, 0.70607685, 0.70703099

Mean CV Accuracy: 0.708

The confusion matrix figure is available in the Colab notebook.

Hyperparameter Tuning

We implemented hyperparameter tuning to optimize our model's performance by finding the best combination of hyperparameters, which helps improve accuracy, reduce overfitting, and ensure that the model generalizes well to unseen data.

Google Colab
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import (train_test_split, StratifiedKFold,
                                     RandomizedSearchCV)

def tuning(df, target_col, n_iter=20):
    """
    Train a Random Forest model with hyperparameter tuning using a subset of data.

    Parameters:
    -----------
    df : pandas.DataFrame
        Input dataframe containing features and target
    target_col : str
        Name of the target column
    n_iter : int, optional
        Number of parameter combinations to try in random search

    Returns:
    --------
    dict
        Dictionary containing the best parameters
    """

    # Separate features and target
    list_of_columns = df.columns.tolist()
    list_of_columns.remove(target_col)
    feature_cols = list_of_columns

    # Use a 40% sample of the df for tuning
    df_sample = df.sample(frac=0.4, random_state=42)

    X = df_sample[feature_cols]
    y = df_sample[target_col]

    # Split data for training the hyperparameter tuning model
    X_train, _, y_train, _ = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)

    # Parameter grid for Random Forest
    params = {
        'n_estimators': [100, 200, 300],
        'max_depth': [10, 20, 30, 40, None],
        'min_samples_split': [2, 5, 10],
        'min_samples_leaf': [1, 2, 4],
        'max_features': ['sqrt', 'log2', None],
        'bootstrap': [True, False],
        'class_weight': ['balanced', 'balanced_subsample', None]
    }

    # Initialize Random Forest
    rf = RandomForestClassifier(random_state=42, n_jobs=-1)

    # Setup cross-validation
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

    # Initialize RandomizedSearchCV
    random_search = RandomizedSearchCV(
        estimator=rf,
        param_distributions=params,
        n_iter=n_iter,
        scoring='accuracy',  # Can adjust scoring if desired
        n_jobs=-1,
        cv=skf.split(X_train, y_train),
        verbose=1,
        random_state=42
    )

    # Start timing and fit (timer() is a notebook helper: called with no
    # argument it returns the start time; called with a start time it prints
    # the elapsed time)
    print("Starting parameter search...")
    start_time = timer()
    random_search.fit(X_train, y_train)
    timer(start_time)

    # Return only the best parameters
    return {
        'best_params': random_search.best_params_
    }

Tuning both smoking and drinking:

Smoking status after hyperparameter tuning

Google Colab
best_params_smoking = tuning(df_smoking_important, target_col='SMK_stat_type_cd')
Time taken: 0 hours 19 minutes and 23.08 seconds.

{'best_params': {'n_estimators': 300,
  'min_samples_split': 5,
  'min_samples_leaf': 4,
  'max_features': 'log2',
  'max_depth': 10,
  'class_weight': None,
  'bootstrap': True}}

Using these parameters for training, we obtained the following results:

Classification Report

Accuracy: 0.721

Precision: 0.591

Recall: 0.578

F1-Score: 0.583

Classification Report:

Label                          Precision  Recall  F1-Score  Support
1.0 (never)                    0.84       0.87    0.85      126,741
2.0 (used to smoke but quit)   0.45       0.36    0.40       29,950
3.0 (still smoke)              0.49       0.51    0.50       35,138

Cross-Validation Accuracy Scores:

0.72004021, 0.7204647, 0.72287757, 0.71970509, 0.71785076, 0.72033065, 0.72010724, 0.71726988, 0.72015192, 0.72092763

Mean CV Accuracy: 0.720

The confusion matrix figure is available in the Colab notebook.

The following features drive the Random Forest model's predictions.

Feature Importance

The most important features identified for predicting smoking status are:

  • Sex
  • Height
  • Hemoglobin
  • Weight
  • Gamma-GTP
  • Serum Creatinine
  • Age

These features play a significant role in the model's decision-making process, indicating that biological and demographic factors are crucial in assessing smoking status.
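
A minimal sketch of how these importances can be extracted from the fitted model, using the rf_model and unscaled X_train returned by train_model:

import pandas as pd

# Rank features by the impurity-based importances of the fitted forest
importances = pd.Series(rf_model.feature_importances_, index=X_train.columns)
print(importances.sort_values(ascending=False).head(7))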

Insights and Discrepancies

The model achieved an overall accuracy of 72.1%, which is indicative of a decent performance in classifying smoking status. However, the precision and recall metrics reveal some important insights:

  • Class Imbalance: The model performs significantly better for the first class (smoking status 1.0, never smoked), with a precision of 0.84 and a recall of 0.87. This suggests that the model is effective in identifying never-smokers. However, the lower precision (0.45) and recall (0.36) for the second class (status 2.0, former smokers) indicate that it struggles to accurately classify those who quit. The performance for the third class (status 3.0, current smokers) is slightly better but still lacks robustness (precision: 0.49, recall: 0.51).
  • F1-Score Analysis: The F1-scores highlight the trade-offs between precision and recall for different classes. The first class achieves an F1-score of 0.85, while the second and third classes only achieve 0.40 and 0.50, respectively. This suggests that while the model is reliable for the majority class, it is not performing well for the minority classes.
  • Cross-Validation Consistency: The mean cross-validation accuracy of 72.0% corroborates the overall accuracy, indicating that the model's performance is consistent across different subsets of the data. However, the variability in the scores suggests there might be fluctuations in performance depending on the specific data split used.

In summary, while the smoking status classification model shows promise, particularly in identifying never-smokers, there are significant challenges in accurately classifying former and current smokers. Addressing this class imbalance, for example through oversampling, undersampling, or model adjustments, could help improve the model's performance across all classes; a sketch of one option follows.
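
As one illustrative direction (not part of the project pipeline), the minority classes could be oversampled before fitting. This sketch assumes the imbalanced-learn package and the training variables from train_model:

from imblearn.over_sampling import RandomOverSampler

# Resample the training set so each smoking-status class is equally represented
ros = RandomOverSampler(random_state=42)
X_resampled, y_resampled = ros.fit_resample(X_train_scaled, y_train)

rf_model.fit(X_resampled, y_resampled)  # retrain on the balanced data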

Drinking status after hyperparameter tuning

Google Colab
best_params_drinking = tuning(df_drinking_important, target_col='DRK_YN')
Time taken: 0 hours 18 minutes and 35.05 seconds.

{'best_params': {'n_estimators': 300,
  'min_samples_split': 5,
  'min_samples_leaf': 4,
  'max_features': 'log2',
  'max_depth': None,
  'class_weight': None,
  'bootstrap': True}}

Using these parameters for training, we obtained the following results:

Classification Report

Accuracy: 0.716

Precision: 0.715

Recall: 0.715

F1-Score: 0.715

Classification Report:

Label      Precision  Recall  F1-Score  Support
0.0 (No)   0.73       0.73    0.73      100,976
1.0 (Yes)  0.70       0.70    0.70       90,853

Cross-Validation Accuracy Scores:

0.71423146, 0.71713584, 0.71456658, 0.71416443, 0.71525916, 0.71394102, 0.71251117, 0.71389634, 0.71134942, 0.71353247

Mean CV Accuracy: 0.714

The confusion matrix figure is available in the Colab notebook.

The following features drive the Random Forest model's predictions.

Feature Importance

The most important features identified for predicting drinking status are:

  • Gamma-GTP
  • Age
  • Hemoglobin
  • HDL Cholesterol
  • Triglycerides
  • Height
  • LDL Cholesterol

These features play a significant role in the model's decision-making process, suggesting that biological and lifestyle factors are crucial in assessing drinking status.

Insights and Discrepancies

The model achieved an overall accuracy of 71.6%, indicating a solid performance in classifying drinking status. The metrics reveal several key insights:

  • Balanced Performance: The precision, recall, and F1-scores are approximately equal across both classes, reflecting balanced performance between non-drinkers (0.0) and drinkers (1.0). Specifically, the model shows a precision and recall of 0.73 for non-drinkers, while for drinkers both are 0.70. This suggests that the model is fairly reliable for both categories.
  • F1-Score Consistency: The F1-scores are also close to each other, with 0.73 for non-drinkers and 0.70 for drinkers. This consistency indicates that the model maintains a good balance between false positives and false negatives across both classes.
  • Cross-Validation Results: The mean cross-validation accuracy of 71.4% aligns closely with the overall accuracy, suggesting that the model's performance is stable across different subsets of the data. The slight fluctuations in individual cross-validation scores (ranging from 0.711 to 0.717) indicate a reliable but slightly sensitive model, potentially influenced by the distribution of classes in each fold.
  • Class Support: The support for each class indicates that the dataset has a higher number of non-drinkers (100,976) compared to drinkers (90,853). This could suggest a potential class imbalance issue that may affect the model's performance in real-world applications, although the metrics do not currently indicate severe bias.

In conclusion, the drinking status classification model demonstrates solid performance, especially in distinguishing between non-drinkers and drinkers. However, continued monitoring and possibly addressing any latent class imbalance through data augmentation or model tuning could further enhance the model's robustness and predictive capability.

Dashboard

Nov 15, 2024