Implementation Description

General ML Functions

Provides several general-purpose ML functions.

This module lets the user reuse common functions across ML projects.

The module contains the following functions:

  • 'plot_boxplot_for_variables' - Plot a boxplot for all variables in variables_list.
  • 'print_count_cat_var_values' - Print the count of each attribute value for each categorical variable in lista_atributos.
  • 'search_for_categorical_variables' - Identify how many unique values exist in each column of df.
  • 'plot_frequencias_valores_atributos' - Plot the frequency chart of the attribute values for each variable in lista_atributos.
  • 'plot_correlation_heatmap' - Plot the correlation between pairs of continuous variables.
  • 'analyse_correlation_continuos_variables' - Analyse the correlation between pairs of continuous variables.
  • 'analyse_plot_correlation_categorical_variables' - Analyse and plot the correlation between pairs of categorical variables.
  • 'fill_categoric_field_with_value' - Replace categorical values with integer values.

@author: ulf Bergmann

analyse_correlation_continuos_variables(df, lista_variaveis, quant_maximos)

Analyse the correlation between pairs of continuous variables.

Parameters:

Name | Type | Description | Default
df | DataFrame | DataFrame to be analysed | required
lista_variaveis | list | variable list | required
quant_maximos | | number of most-correlated pairs to return | required

Returns:

Name | Type | Description
top_pairs_df | DataFrame | sorted DataFrame with the columns feature_1, feature_2 and corr_coef
corr_matrix | DataFrame | Correlation matrix with p-values on the upper triangle

Source code in templates\lib\funcoes_ulf.py
def analyse_correlation_continuos_variables(df, lista_variaveis , quant_maximos):
    """
    Analyse the correlation between pairs of continuous variables.

    Parameters:
        df (DataFrame): DataFrame to be analysed

        lista_variaveis (list): variable list

        quant_maximos : number of most-correlated pairs to return

    Returns:
        top_pairs_df (DataFrame): sorted DataFrame with feature_1 | feature_2 | corr_coef

        corr_matrix (DataFrame): Correlation matrix with p-values on the upper triangle
    """
    cv_df = df[lista_variaveis]

    # Available methods: 'pearson', 'kendall', 'spearman'.
    corr_matrix = cv_df.corr(method='pearson')

    # Build a correlation matrix whose upper triangle holds the p-values
    # of the correlations between the variables, starred by significance level
    matriz_corr_with_pvalues = pg.rcorr(
        cv_df, method='pearson', upper='pval', decimals=4,
        pval_stars={0.01: '***', 0.05: '**', 0.10: '*'})

    # Get the top n pairs with the highest correlation
    top_pairs = corr_matrix.unstack().sort_values(ascending=False)[
        :len(df.columns) + quant_maximos*2]

    # Create a list to store the top pairs without duplicates
    unique_pairs = []

    # Iterate over the top pairs and add only unique pairs to the list
    for pair in top_pairs.index:
        if pair[0] != pair[1] and (pair[1], pair[0]) not in unique_pairs:
            unique_pairs.append(pair)

    # Create a dataframe with the top pairs and their correlation coefficients
    top_pairs_df = pd.DataFrame(
        columns=['feature_1', 'feature_2', 'corr_coef'])
    for i, pair in enumerate(unique_pairs[:quant_maximos]):
        top_pairs_df.loc[i] = [pair[0], pair[1],
                               corr_matrix.loc[pair[0], pair[1]]]

    return top_pairs_df , matriz_corr_with_pvalues
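
The pair-selection step above can also be sketched with an upper-triangle mask, which keeps each pair once and drops the diagonal without a de-duplication loop. This is a minimal sketch under stated assumptions, not the module's implementation: `top_correlated_pairs` and the sample data are hypothetical names introduced here, and the p-value matrix produced by `pg.rcorr` is left out.

```python
import numpy as np
import pandas as pd


def top_correlated_pairs(df, columns, n_pairs):
    """Return the n_pairs column pairs with the highest Pearson correlation.

    Hypothetical helper, not part of funcoes_ulf.py.
    """
    corr = df[columns].corr(method="pearson")
    # Keep only the strict upper triangle so each pair appears once
    # and the diagonal (self-correlation of 1.0) is dropped.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().dropna()
    pairs = pairs.sort_values(ascending=False).head(n_pairs)
    result = pairs.reset_index()
    result.columns = ["feature_1", "feature_2", "corr_coef"]
    return result


data = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.1, 6.0, 8.2, 10.0],  # almost a multiple of "a"
    "c": [5.0, 3.0, 4.0, 1.0, 2.0],   # tends to decrease as "a" grows
})
top = top_correlated_pairs(data, ["a", "b", "c"], n_pairs=1)
```

Like the function above, the sketch sorts by the signed coefficient, so strongly negative correlations rank last. Sorting by the absolute value would be a different design choice.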

analyse_plot_correlation_categorical_variables(df, lista_variaveis)

Analyse and plot the correlation between pairs of categorical variables. Variables must not be continuous (not float).

Uses the chi-squared test and its p-value to decide between:

H0: the variables are independent
H1: the variables are dependent

Parameters:

Name | Type | Description | Default
df | DataFrame | DataFrame to be analysed | required
lista_variaveis | list | variable list | required

Returns:

Name | Type | Description
resultant | DataFrame | DataFrame with all p-values
lista_resultado_analise | DataFrame | DataFrame with the columns Var 1, Var 2 and p-value

Source code in templates\lib\funcoes_ulf.py
def analyse_plot_correlation_categorical_variables(df, lista_variaveis):
    """
    Analyse and plot the correlation between pairs of categorical variables. Variables must not be continuous (not float).

    Uses the chi-squared test and its p-value to decide between:
        H0: the variables are independent
        H1: the variables are dependent

    Parameters:
        df (DataFrame): DataFrame to be analysed

        lista_variaveis (list): variable list

    Returns:
        resultant (DataFrame): DataFrame with all p-values

        lista_resultado_analise (DataFrame): DataFrame with Var 1 | Var 2 | p-value
    """
    resultant = pd.DataFrame(data=[(0 for i in range(len(lista_variaveis))) for i in range(len(lista_variaveis))],
                             columns=list(lista_variaveis), dtype=float)
    resultant.set_index(pd.Index(list(lista_variaveis)), inplace=True)

    # Compute the p-value for each pair of variables and store it in a p-value matrix
    lista_resultado_analise = []
    for i in list(lista_variaveis):
        for j in list(lista_variaveis):
            if i != j:
                try:
                    chi2_val, p_val = chi2(
                        np.array(df[i]).reshape(-1, 1), np.array(df[j]).reshape(-1, 1))
                    p_val = round(p_val[0], 4)
                    resultant.loc[i, j] = p_val
                    lista_resultado_analise.append([i, j,  p_val])
                except ValueError:
                    print(f"Variable {j} is not categorical")
                    return

    fig = plt.figure(figsize=(25, 20))
    sns.heatmap(resultant, annot=True, cmap='Blues', fmt='.2f')
    plt.title('Chi-squared test results (p-value)')
    plt.show()
    df_lista_resultado_analise =  pd.DataFrame(lista_resultado_analise , columns=['Var 1' , 'Var 2' , 'p-value'])
    return resultant, df_lista_resultado_analise
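
To make the hypothesis test concrete, here is a minimal sketch of a chi-squared test of independence on a contingency table. It is an illustration under assumptions, not the function above: the listing calls a `chi2` function whose import is not shown, whereas this sketch swaps in `scipy.stats.chi2_contingency` applied to `pd.crosstab`. The helper name `chi2_independence_pvalue` and the sample data are hypothetical.

```python
import pandas as pd
from scipy.stats import chi2_contingency


def chi2_independence_pvalue(df, col_a, col_b):
    """Return the p-value of a chi-squared test of independence.

    Hypothetical helper, not part of funcoes_ulf.py.
    """
    # Contingency table: counts of every (col_a value, col_b value) combination.
    table = pd.crosstab(df[col_a], df[col_b])
    _, p_value, _, _ = chi2_contingency(table)
    return p_value


sample = pd.DataFrame({
    "colour": ["red", "red", "blue", "blue"] * 25,
    "copy_of_colour": ["red", "red", "blue", "blue"] * 25,  # fully determined by "colour"
    "unrelated": ["u", "v"] * 50,                           # evenly spread over both colours
})
p_dependent = chi2_independence_pvalue(sample, "colour", "copy_of_colour")
p_independent = chi2_independence_pvalue(sample, "colour", "unrelated")
```

A small p-value (for example below 0.05) rejects H0, which suggests the two variables are dependent. A large p-value gives no evidence against independence.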

fill_categoric_field_with_value(serie, replace_nan)

Replace categorical values with integer values.

Parameters:

Name | Type | Description | Default
serie | Series | data whose categorical values are replaced with integers | required
replace_nan | Boolean | flag to replace NaN with an index as well | required

Returns:

Type | Description
Series | replaced values

Source code in templates\lib\funcoes_ulf.py
def fill_categoric_field_with_value(serie, replace_nan):
    """Replace categorical values with integer values.

    Parameters:
        serie (Series): data whose categorical values are replaced with integers
        replace_nan (Boolean): flag to replace NaN with an index as well

    Returns:
        (Series): replaced values

    """
    names = serie.unique()
    values = list(range(1, names.size + 1))
    if not replace_nan:
        # The value table used to contain a float(nan) mapped to an integer value.
        # The fix was to put None in the value table instead.
        nan_index = np.where(pd.isna(names))
        if len(nan_index) > 0 and len(nan_index[0]) > 0:
            nan_index = nan_index[0][0]
            values[nan_index] = None
        # else:
            # print("NaN not found in " + str(names))

    return serie.replace(names, values)
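
The same idea, integer codes starting at 1 with optional handling of missing values, can be sketched with `pd.factorize`. This is an alternative illustration, not the function above: `encode_categories` is a hypothetical name, and unlike the original it always gives missing values the last code instead of the code matching their position among the unique values.

```python
import numpy as np
import pandas as pd


def encode_categories(serie, replace_nan):
    """Map each distinct value to an integer code starting at 1.

    Hypothetical helper, not part of funcoes_ulf.py.
    """
    # factorize() numbers values in order of first appearance
    # and marks missing values with the code -1.
    codes, uniques = pd.factorize(serie)
    encoded = pd.Series(codes + 1, index=serie.index, dtype="float64")
    missing = codes == -1
    if replace_nan:
        # Give missing values their own code, one past the last category.
        encoded[missing] = len(uniques) + 1
    else:
        # Keep missing values as NaN.
        encoded[missing] = np.nan
    return encoded


serie = pd.Series(["a", "b", None, "a"])
kept = encode_categories(serie, replace_nan=False)
filled = encode_categories(serie, replace_nan=True)
```

The result is stored as float so that NaN can coexist with the codes when `replace_nan` is False.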

plot_boxplot_for_variables(df, variables_list)

Plot a boxplot for all variables in variables_list.

Can be used to check whether the variables are on the same scale.

Examples:

>>> plot_boxplot_for_variables(df, ['var1', 'var2', 'var3'])
return None

Parameters:

Name | Type | Description | Default
df | DataFrame | DataFrame to be analysed. | required
variables_list | list | variable list. | required

Returns:

Type | Description
None |

Source code in templates\lib\funcoes_ulf.py
def plot_boxplot_for_variables(df , variables_list):
    """Plot a boxplot for all variables in variables_list.

    Can be used to check whether the variables are on the same scale.

    Examples:
        >>> plot_boxplot_for_variables(df, ['var1', 'var2', 'var3'])
        return None

    Parameters:
        df (DataFrame): DataFrame to be analysed.
        variables_list (list): variable list.

    Returns:
        (None):

    """   
    df_filtered = df[variables_list]

    plt.figure(figsize=(10,7))
    sns.boxplot(x='variable', y='value', data=pd.melt(df_filtered))
    plt.ylabel('Values', fontsize=16)
    plt.xlabel('Variables', fontsize=16)
    plt.show()

    return
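
The `x='variable', y='value'` arguments in the listing work because `pd.melt` reshapes the selected columns into long format. A minimal sketch of that reshaping, using made-up data, shows where those two column names come from and how the same long table can give a numeric scale check.

```python
import pandas as pd

# Hypothetical data: two variables on very different scales.
wide = pd.DataFrame({"var1": [1, 2, 3], "var2": [10, 20, 30]})

# melt() stacks every column into (variable, value) rows,
# which is the layout the boxplot call expects.
long = pd.melt(wide)

# The same long table gives a numeric view of each variable's range.
ranges = long.groupby("variable")["value"].agg(["min", "max"])
```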

plot_correlation_heatmap(df, lista_variaveis)

Plot the correlation between pairs of continuous variables.

Parameters:

Name | Type | Description | Default
df | DataFrame | DataFrame to be analysed | required
lista_variaveis | list | continuous variable list | required

Returns:

Type | Description
None |

Source code in templates\lib\funcoes_ulf.py
def plot_correlation_heatmap(df, lista_variaveis ):
    """
    Plot the correlation between pairs of continuous variables.

    Parameters:
        df (DataFrame): DataFrame to be analysed

        lista_variaveis (list): continuous variable list

    Returns:
        (None):

    """
    cv_df = df[lista_variaveis]

    # Available methods: 'pearson', 'kendall', 'spearman'.
    corr_matrix = cv_df.corr(method='pearson')

    fig = plt.figure(figsize=(15, 10))
    sns.heatmap(corr_matrix, annot=True, annot_kws={'size': 15} , cmap="Blues")
    plt.title("Correlation Heatmap")
    #fig.tight_layout()

    plt.show()
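
The comment in the listing names three correlation methods, while the function always uses 'pearson'. The following sketch, with made-up data, illustrates why the choice can matter: Pearson measures linear association, whereas Spearman works on ranks and so reports a perfect score for any strictly increasing relationship.

```python
import pandas as pd

# Hypothetical data: y grows with x, but not linearly.
x = list(range(1, 11))
curve = pd.DataFrame({"x": x, "x_cubed": [v ** 3 for v in x]})

# Pearson penalises the curvature; Spearman only looks at the ordering.
pearson = curve.corr(method="pearson").loc["x", "x_cubed"]
spearman = curve.corr(method="spearman").loc["x", "x_cubed"]
```

Here `spearman` is 1 up to floating-point error, while `pearson` is noticeably lower.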

plot_frequencias_valores_atributos(df, lista_atributos)

Plot the frequency chart of the attribute values for each variable in lista_atributos.

Parameters:

Name | Type | Description | Default
df | DataFrame | DataFrame to be analysed | required
lista_atributos | list | variable list | required

Returns:

Type | Description
None |

Source code in templates\lib\funcoes_ulf.py
def plot_frequencias_valores_atributos(df, lista_atributos):
    """
    Plot the frequency chart of the attribute values for each variable in lista_atributos.

    Parameters:
        df (DataFrame): DataFrame to be analysed

        lista_atributos (list): variable list 

    Returns:
        (None):

    """

    for atributo in lista_atributos:
        plt.figure(figsize=(10,7))
        sns.histplot(data=df, y=atributo)
        plt.ylabel('Variables', fontsize=16)
        plt.xlabel('Count', fontsize=16)
        plt.show()

print_count_cat_var_values(df, lista_atributos)

Print the count of each attribute value for each categorical variable in lista_atributos.

Parameters:

Name | Type | Description | Default
df | DataFrame | DataFrame to be analysed | required
lista_atributos | list | variable list | required

Returns:

Type | Description
None |

Source code in templates\lib\funcoes_ulf.py
def print_count_cat_var_values(df  , lista_atributos ):
    """
    Print the count of each attribute value for each categorical variable in lista_atributos.

    Parameters:
        df (DataFrame): DataFrame to be analysed

        lista_atributos (list): variable list 

    Returns:
        (None):

    """
    for j in lista_atributos:
        a = df[j].value_counts()
        a = a.sort_values(ascending=False)
        print(f"\n Count values for Variable {j}")
        for index, value in a.items():
            print(f"{index} ==> {value}")
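
The printed lines come from `value_counts()`, which returns the counts already sorted from most to least frequent. A minimal sketch with made-up data shows the resulting lines, and also that missing values are left out unless `dropna=False` is passed.

```python
import pandas as pd

# Hypothetical data with one missing value.
pets = pd.DataFrame({"pet": ["cat", "dog", "cat", None, "bird", "cat", "dog"]})

# Default behaviour: counts sorted in descending order, missing values ignored.
counts = pets["pet"].value_counts()
lines = [f"{value} ==> {count}" for value, count in counts.items()]

# dropna=False adds a row for the missing values as well.
counts_with_missing = pets["pet"].value_counts(dropna=False)
```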

search_for_categorical_variables(df)

Identify how many unique values exist in each column.

Parameters:

Name | Type | Description | Default
df | DataFrame | DataFrame to be analysed. | required

Returns:

Name | Type | Description
cat_stats | DataFrame | Result DataFrame with the columns listed below

  • Coluna => the variable name
  • Valores => list of values
  • Contagem de Categorias => count of unique values

Source code in templates\lib\funcoes_ulf.py
def search_for_categorical_variables(df):
    """Identify how many unique values exist in each column.

    Parameters:
        df (DataFrame): DataFrame to be analysed.

    Returns:
        cat_stats (DataFrame): Result DataFrame with

            - Coluna => the variable name
            - Valores => list of values
            - Contagem de Categorias => count of unique values

    """
    cat_stats = pd.DataFrame(
        columns=['Coluna', 'Valores', 'Contagem de Categorias'])
    tmp = pd.DataFrame()

    for c in df.columns:
        tmp['Coluna'] = [c]
        tmp['Valores'] = [df[c].unique()]
        tmp['Contagem de Categorias'] = f"{len(list(df[c].unique()))}"

        cat_stats = pd.concat([cat_stats, tmp], axis=0)
    return cat_stats
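
The same summary can be sketched by collecting one record per column and building the DataFrame once, which avoids calling `pd.concat` inside the loop. This is an illustrative variant, not the function above: `summarise_unique_values` is a hypothetical name, the sample data is made up, and the count is stored as an integer, whereas the listing stores it as a string.

```python
import pandas as pd


def summarise_unique_values(df):
    """Return one row per column with its unique values and their count.

    Hypothetical helper, not part of funcoes_ulf.py.
    """
    rows = []
    for column in df.columns:
        values = df[column].unique()
        rows.append({
            "Coluna": column,
            "Valores": values,
            # Stored as int here (assumption); the original keeps a string.
            "Contagem de Categorias": len(values),
        })
    return pd.DataFrame(
        rows, columns=["Coluna", "Valores", "Contagem de Categorias"])


frame = pd.DataFrame({
    "city": ["Rio", "Rio", "Natal"],
    "age": [30, 41, 52],
})
stats = summarise_unique_values(frame)
```

Columns with few unique values relative to the number of rows are the usual candidates for categorical treatment.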