
A Look Ahead at Best Practices for Big Data + AI in the Healthcare Sector

2023-10-07 · Author: 考证青年

Outline

How should the environment be configured? See this earlier post:

A simple hands-on with classification + feature importance using WSL

Banks face a steadily rising number of fraud cases. As new technologies appear, instances of fraud multiply, and it becomes impractical for banks to inspect every transaction and identify fraud patterns by hand. RPA uses an "if-then" approach to identify potential fraud and flag it to the relevant department. For example, if many transactions are made within a short period of time, RPA identifies the account and flags it as a potential threat. This helps the bank scrutinize the account and investigate the fraud.
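As a toy illustration of such an "if-then" rule, the sketch below flags accounts that make an unusually large number of transactions within a short window. This is only a hedged example: the dataframe txns_df, the columns account_id and txn_ts, and the thresholds are hypothetical, not taken from any real system.

from pyspark.sql import functions as F

# Hypothetical rule: flag accounts with 5 or more transactions in the last hour.
recent = txns_df.filter(F.col("txn_ts") >= F.expr("current_timestamp() - INTERVAL 1 HOUR"))
flagged = (recent.groupBy("account_id")
                 .agg(F.count(F.lit(1)).alias("txns_last_hour"))
                 .withColumn("potential_fraud", (F.col("txns_last_hour") >= 5).cast("int")))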

Overview of a general fraud detection workflow

Flowchart walkthrough

As shown above, we receive our input, which consists of data about individual insurance claims within the financial data (claim features, customer features, and policy features).

After some preprocessing and the addition of new features, we use the data to train a classifier.

Once the classifier has been trained, it can be used to determine whether a new record should be accepted (not fraud) or rejected (fraud).

The flow of this process is described in more detail below.
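At a high level, the demo code at the end of this post wires these stages together roughly as follows. This is a simplified sketch: claims_df, new_claims_df and param_dict are placeholders, while the function names are the ones defined in the demo.

spark, sc, sqlContext = setup_spark_session(param_dict)                          # set up the Spark environment
model, feats, param_dict, thresh = trainXGBModel(claims_df, param_dict)          # preprocess, add features, train
summary = getSummary(claims_df)                                                   # per-customer history summary
labels, probs = modelPredict(model, new_claims_df, summary, param_dict, thresh)   # accept / reject new claims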

After talking with the customer, we need to understand every field; the customer provides us with a data dictionary table:

Input

Our input is a dataset with one or more lines for each claim. The data covers the claimant, the type of claim, the claim amounts, and other attributes.

Data preparation (preprocessing, customer segmentation features, and code features)

We first do some initial preprocessing to convert the data fields into a suitable format. Then, based on the input, we generate features that segment the customer according to factors such as the number of previous claims, the number of previous occurrences of fraud, the total claim amount, and so on. These customer segmentation features are added to the existing dataset, along with features that record the presence (or absence) of warning codes, diagnosis codes, etc.
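A minimal sketch of how such customer-history features can be built with a Spark window over each customer's earlier claim lines. MBR_NO comes from the demo code below; claims_df and the columns CLAIM_DATE, CLAIM_AMT and STATUS are placeholder names.

from pyspark.sql import functions as F, Window

# All of a customer's claim lines strictly before the current one
w = (Window.partitionBy("MBR_NO")
     .orderBy(F.col("CLAIM_DATE").cast("timestamp").cast("long"))
     .rangeBetween(Window.unboundedPreceding, -1))

claims_feat = (claims_df
               .withColumn("PRIOR_NUM_CLAIMS", F.sum(F.lit(1)).over(w))
               .withColumn("PRIOR_TOT_AMT", F.sum("CLAIM_AMT").over(w))
               .withColumn("PRIOR_NUM_FRAUD",
                           F.sum(F.when(F.col("STATUS") == "Rejected", 1).otherwise(0)).over(w)))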

Examples of the generated features:

Customer segmentation features generated from the data include:

Count, total, maximum and mean values of the numeric claim fields for a customer, computed over all of that customer's claim lines up to (but excluding) the current claim

Number of claim lines for a customer

Ratio of the total rejected amount to the total presented amount

Ratio of the customer's claim lines of a given type (for example, each beneficiary type: applicant, spouse, child, parent) to all of that customer's claim lines
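The ratio features are simple column divisions over these per-customer aggregates; for example, mirroring getSummary in the demo code (and assuming the TOT_* aggregate columns have already been computed):

summary_df = (summary_df
              .withColumn("FRAC_REJECTED_AMT",
                          F.col("TOT_REJECTED_AMT") / F.col("TOT_ORG_PRES_AMT_VALUE"))
              .withColumn("FRAC_BEN_TYPE_Applicant",
                          F.col("TOT_is_BEN_TYPE_Applicant") / F.col("NUM_LINES")))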

Code features are generated by taking a code field, finding the top 20 most frequently occurring items in that field, and then adding an indicator for each of those 20 items that records whether or not it appears in the claim line, plus an indicator for any remaining codes outside the top 20. The code fields in the data for which we generate these features include the warning codes and diagnosis codes described above.
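A condensed sketch of this idea (the codeExtract and addCodes functions in the demo below implement the full version; claims_df and the DIAG_CODE field are placeholders):

from pyspark.sql import functions as F

# Top 20 most frequent codes in a comma-separated code field
exploded = claims_df.withColumn("one_code", F.explode(F.split("DIAG_CODE", ",")))
top20 = [r["one_code"] for r in
         exploded.groupBy("one_code").count().orderBy(F.desc("count")).limit(20).collect()]

# One indicator column per top-20 code: does the claim line contain that code?
for code in top20:
    claims_df = claims_df.withColumn(
        code, F.when(F.col("DIAG_CODE").like("%" + code + "%"), 1).otherwise(0))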

Model:

XGBoost is an implementation of gradient boosted decision trees designed for speed and performance. The implementation of the algorithm was engineered for efficient use of compute time and memory resources, with the design goal of making the best use of available resources to train the model. We use an XGBoost classifier to determine whether or not a claim is fraudulent.
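A minimal, self-contained sketch of this setup, where X is a numeric feature matrix and y a 0/1 fraud label (both placeholders); the full training routine with threshold tuning is run_xgboost in the demo below.

from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)

# Weight the rare positive (fraud) class by the negative/positive ratio
pos_weight = float((y_train == 0).sum()) / (y_train == 1).sum()
clf = XGBClassifier(scale_pos_weight=pos_weight)
clf.fit(X_train, y_train)

fraud_proba = clf.predict_proba(X_test)[:, 1]   # probability that a claim is fraudulent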

Output

The model classifies each claim line as accepted or rejected, i.e., Not Fraud or Fraud.

PySpark + XGBoost DEMO

# Imports and Initialization
import sys
import os
import json
import math
import numbers
import itertools

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from xgboost import XGBClassifier, plot_importance
from sklearn import preprocessing
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.metrics import (precision_recall_curve, average_precision_score,
                             confusion_matrix, accuracy_score, roc_curve,
                             roc_auc_score)

from pyspark import SparkConf
from pyspark.sql import SparkSession, SQLContext, Window
from pyspark.sql import functions as F
from pyspark.sql.types import *
from pyspark.storagelevel import StorageLevel

# UDF returning the day of the week for a timestamp column
day_of_week_udf = F.udf(lambda ts: ts.weekday() if ts is not None else None, StringType())


def getSummary(df):
    # Per-customer summary used when scoring new claims.
    # NOTE: the aggregations that produce the TOT_* columns referenced below
    # are elided in the source listing.
    summarydf = (df
                 .withColumn('is_BEN_TYPE_Applicant',
                             F.when(F.col("BEN_TYPE") == "Applicant", F.lit(1)).otherwise(F.lit(0)))
                 .groupby("MBR_NO")
                 .agg(F.max("NO_OF_YR").alias("MAX_NO_OF_YR"),
                      F.sum(F.lit(1)).alias("NUM_LINES"),)
                 .withColumn("FRAC_REJECTED_AMT", F.col("TOT_REJECTED_AMT")/F.col("TOT_ORG_PRES_AMT_VALUE"))
                 .withColumn("FRAC_BEN_TYPE_Applicant", F.col("TOT_is_BEN_TYPE_Applicant")/F.col("NUM_LINES"))
                 .withColumn("FRAC_BEN_TYPE_Spouse", F.col("TOT_is_BEN_TYPE_Spouse")/F.col("NUM_LINES"))
                 .withColumn("FRAC_BEN_TYPE_Child", F.col("TOT_is_BEN_TYPE_Child")/F.col("NUM_LINES"))
                 .withColumn("FRAC_BEN_TYPE_Parent", F.col("TOT_is_BEN_TYPE_Parent")/F.col("NUM_LINES"))
                 .persist())
    return summarydf
def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')
    print(cm)
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)
    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")
    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')


def checkContain(baseFeatures, allFeatures, transformsList):
    """
    Description : Used to indicate that we want to use the transformed features
                  and not the original features if a transform has been done
    Input : baseFeatures - Features included in the dataset before any
                           transforms are applied
            allFeatures - All the features present in the dataframe after all
                          transforms and prep have completed
            transformsList - The features from the original dataset that should
                             have transforms applied to them
    Output : List of the features we're going to use for the ML model
    """
    resList = []
    for baseFeat in baseFeatures:
        if baseFeat not in transformsList:
            resList.append(baseFeat)
        else:
            for feat in allFeatures:
                if baseFeat in feat:
                    if "~~" in feat or "log10" in feat:
                        resList.append(feat)
    return resList


def transform_ts_fields(df, ts_cols):
    """
    Description : Produces a timestamp in the standard dow-hod format for the
                  supplied field
    Input : df - dataframe
            ts_cols - timestamp features that need to be formatted correctly
    Output : dataframe with appropriately formatted timestamp features
    """
    col_list = df.columns
    for col in ts_cols:
        if(col in col_list):
            df = (df
                  .withColumn(col, F.col(col).cast("timestamp"))
                  .withColumn("{}_dow".format(col), day_of_week_udf(F.col(col)))
                  .withColumn("{}_hod".format(col), F.hour(F.col(col))))
    return df


def transform_numeric_fields(df, num_cols):
    """
    Description : Converts all numeric fields into float type
    Input : df - dataframe
            num_cols - numeric features that need to be converted to float type
    Output : dataframe with numerical features converted to float type
    """
    col_list = df.columns
    for col in num_cols:
        if(col in col_list):
            df = (df.withColumn(col, F.col(col).cast("float")))
    return df


def transform_log_fields(df, num_cols):
    """
    Description : Produces the log_10 of the fields passed to it
    Input : df - dataframe
            num_cols - numeric features whose log values need to be calculated
    Output : dataframe with added log values for the required numerical features
    """
    col_list = df.columns
    for col in num_cols:
        if(col in col_list):
            df = (df.withColumn(col + "_log10", F.log(10.0, F.col(col))))
    return df


def with_transform(df, param_dict):
    """
    Description : Applies transforms on relevant data fields in the data
    Input : df - dataframe
            param_dict - parameter dictionary
    Output : dataframe with all appropriate transforms
    """
    df = transform_ts_fields(df, param_dict['BASE_FEATURES_TIMESTAMP'])
    df = transform_numeric_fields(df, param_dict['BASE_FEATURES_NUMERIC'])
    df = transform_log_fields(df, param_dict['LOG_TRANSFORM_FEATURES'])
    df = (df
          .withColumn("INCUR_PERIOD_SECS",
                      F.col("INCUR_DATE_TO").cast("long") -
                      F.col("INCUR_DATE_FROM").cast("long")))
    return df


def run_xgboost(data, feats, scale_pos_weight=1.0, old_model=None):
    """
    Description : Generates an xgboost model based on training data
    Input : X_train_pd - Pandas Dataframe, training data input
            y_train - training data output/labels
            param_dict - parameter dictionary
            max_depth_list - list of max depths of trees
            n_estimators_list - list of number of trees
            scoring_metric - scoring metric used
            grid_scoring - grid scoring metric
            scale_pos_weight - weight applied to positive vals
            num_cv = cross-validation splitting strategy
    Output : Trained XGBoost Classifier
    """
    X_train, X_test, y_train, y_test = train_test_split(data[feats], data['label'], test_size=0.33)
    unique, counts = np.unique(y_train, return_counts=True)
    cdict = dict(zip(unique, counts))
    temp_pos_weight = cdict[0]/cdict[1]
    xgb_class = XGBClassifier(scale_pos_weight=temp_pos_weight)
    xgb_class.fit(X=X_train, y=y_train, xgb_model=old_model)
    y_pred_proba = xgb_class.predict_proba(X_test)
    threshs = np.arange(0.01, 1, 0.01)
    acc = 0
    prsum = 0
    abdist = 1
    bestthresh = 0
    for thresh in threshs:
        y_pred_temp = (y_pred_proba[:, 1] >= thresh).astype(int)
        '''
        precision, recall, thresholds = precision_recall_curve(y_test, y_pred_temp)
        average_precision = average_precision_score(y_test, y_pred_temp)
        if ((precision[1]+recall[1])>prsum) and (recall[1]>precision[1]):
            prsum = precision[1]+recall[1]
            bestthresh = thresh
        '''
        '''
        temp_acc = accuracy_score(np.array(y_test), np.array(y_pred_temp))
        if temp_acc > acc:
            acc = temp_acc
            bestthresh = thresh
        '''
        cnf_matrix_temp = confusion_matrix(y_test, y_pred_temp)
        cm = cnf_matrix_temp.astype('float') / cnf_matrix_temp.sum(axis=1)[:, np.newaxis]
        fp = cm[0][1] * 1.0
        fn = cm[1][0] * 1.0
        dist = abs((fn/fp)-1)
        if dist < abdist:
            abdist = dist
            bestthresh = thresh
    y_pred = (y_pred_proba[:, 1] >= bestthresh).astype(int)
    precision, recall, thresholds = precision_recall_curve(y_test, y_pred)
    average_precision = average_precision_score(y_test, y_pred)
    # Compute confusion matrix
    cnf_matrix = confusion_matrix(y_test, y_pred)
    np.set_printoptions(precision=2)
    # Plot non-normalized confusion matrix
    plt.figure()
    plot_confusion_matrix(cnf_matrix, classes=[0, 1],
                          title='Confusion matrix, without normalization')
    # Plot normalized confusion matrix
    plt.figure()
    plot_confusion_matrix(cnf_matrix, classes=[0, 1], normalize=True,
                          title='Normalized confusion matrix')
    plt.show()
    plt.step(recall, precision, color='b', alpha=0.2, where='post')
    plt.fill_between(recall, precision, step='post', alpha=0.2, color='b')
    plt.xlabel('Recall')
    plt.ylabel('Precision')
    plt.ylim([0.0, 1.05])
    plt.xlim([0.0, 1.0])
    plt.title('2-class Precision-Recall curve: AP={0:0.5f}'.format(average_precision))
    plt.show()
    auc = roc_auc_score(y_test, y_pred_proba[:, 1])
    print('AUC: %.3f' % auc)
    # calculate roc curve
    fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba[:, 1])
    # plot no skill
    plt.plot([0, 1], [0, 1], linestyle='--')
    # plot the roc curve for the model
    plt.plot(fpr, tpr, marker='.')
    # show the plot
    plt.show()
    unique, counts = np.unique(data['label'], return_counts=True)
    cdict = dict(zip(unique, counts))
    pos_weight = cdict[0]/cdict[1]
    full_model = XGBClassifier(scale_pos_weight=pos_weight)
    full_model.fit(data[feats], data['label'])
    return full_model, bestthresh


def setup_spark_session(param_dict):
    """
    Description : Used to setup spark session
    Input : param_dict - parameter dictionary
    Output : Spark Session, Spark Context, and SQL Context
    """
    pd.set_option('display.max_rows', 500)
    pd.set_option('display.max_columns', 500)
    os.environ["PYSPARK_PYTHON"] = "/home/hadoop/anaconda/envs/playground_py36/bin/python"
    try:
        spark.stop()
        print("Stopped a SparkSession")
    except Exception as e:
        print("No existing SparkSession")
    SPARK_DRIVER_MEMORY = param_dict["SPARK_DRIVER_MEMORY"]  # "10G"
    SPARK_DRIVER_CORE = param_dict["SPARK_DRIVER_CORE"]  # "5"
    SPARK_EXECUTOR_MEMORY = param_dict["SPARK_EXECUTOR_MEMORY"]  # "3G"
    SPARK_EXECUTOR_CORE = param_dict["SPARK_EXECUTOR_CORE"]  # "1"
    AWS_ACCESS_KEY = param_dict["AWS_ACCESS_KEY"]
    AWS_SECRET_KEY = param_dict["AWS_SECRET_KEY"]
    AWS_S3_ENDPOINT = param_dict["AWS_S3_ENDPOINT"]
    conf = SparkConf().\
        setAppName(param_dict["APP_NAME"]).\
        setMaster('yarn-client').\
        set('spark.executor.cores', SPARK_EXECUTOR_CORE).\
        set('spark.executor.memory', SPARK_EXECUTOR_MEMORY).\
        set('spark.driver.cores', SPARK_DRIVER_CORE).\
        set('spark.driver.memory', SPARK_DRIVER_MEMORY).\
        set('spark.driver.maxResultSize', '0')
    spark = SparkSession.builder.\
        config(conf=conf).\
        getOrCreate()
    sc = spark.sparkContext
    hadoop_conf = sc._jsc.hadoopConfiguration()
    hadoop_conf.set("fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
    hadoop_conf.set("fs.s3a.access.key", AWS_ACCESS_KEY)
    hadoop_conf.set("fs.s3a.secret.key", AWS_SECRET_KEY)
    hadoop_conf.set("fs.s3a.endpoint", AWS_S3_ENDPOINT)
    hadoop_conf.set("mapreduce.fileoutputcommitter.algorithm.version", "2")
    sqlContext = SQLContext(sc)
    return spark, sc, sqlContext


def loadDataset(vw_cl_lines_df, datefield, param_dict):
    """
    Description : Runs data through appropriate transforms to convert it
                  to a suitable format
    Input : vw_cl_lines_df - input dataframe
            datefield - field used to establish a window for the addFeats
                        function
            param_dict - parameter dictionary
    Output : Properly formatted dataframe
    """
    vw_cl_lines_df = (with_transform(vw_cl_lines_df, param_dict))
    vw_cl_lines_df = vw_cl_lines_df.withColumn(datefield + "_unix",
                                               (F.unix_timestamp(F.col(datefield),
                                                                 format='yyyy-MM-dd HH:mm:ss.000')))
    return vw_cl_lines_df


def addOneHotsTest(df, oneHots):
    for item in oneHots:
        field = item.split('~')[0]
        df[item] = np.where(df[field] == item, 1, 0)
    return df


def addCodes(res_df, codeField, topCodes, optUDF, knownPref):
    # Need to get same codes for testing
    for code in topCodes:
        likeCode = "%" + code + "%"
        res_df = res_df.withColumn(code,
                                   F.when(res_df[codeField].like(likeCode), 1).otherwise(0))

    def checkOtherCodes(x):
        if not x:
            return 0
        x = set(x)
        if x.issubset(topCodes):
            return 0
        else:
            return 1

    otherCodesUDF = F.udf(checkOtherCodes, IntegerType())
    if knownPref is not None:
        otherlabel = knownPref + "_" + codeField
    else:
        otherlabel = codeField
    res_df = res_df.withColumn("OTHER_" + otherlabel,
                               otherCodesUDF(res_df[codeField]))
    codesAdded = topCodes
    return res_df, codesAdded


def codeExtract(df, codeField, topCount, optUDF=None, knownPref=None):
    """
    Description : Function to extract code features
    Input : df - input dataframe
            codeField - field used to establish a window for the addFeats
                        function
            topCount - number of code features to be added
            optUDF - optional udf to apply to the field
            knownPref - prefix characterizing a field, if any
    Output : dataframe with code features added
    """
    codeEx_df = df
    if optUDF is not None:
        codeEx_df = codeEx_df.withColumn(codeField, optUDF(codeEx_df[codeField]))
    codeEx_df = codeEx_df.withColumn(codeField, F.explode(F.split(codeEx_df[codeField], ",")))
    code_counts = codeEx_df.groupBy(codeField).count().sort(F.desc("count"))
    if knownPref is not None:
        code_counts = code_counts.filter(code_counts[codeField].like("%" + knownPref + "%"))
    # code_counts.show(10)
    xy = code_counts.toPandas()
    # Generating a list of the top 20 most frequently occurring Reject Codes
    topCodes = xy[codeField].head(topCount).tolist()
    topCodes = [x.strip() for x in topCodes]
    res_df = df
    return addCodes(res_df, codeField, topCodes, optUDF, knownPref)
# checks for presence of values in a field
def isVal(df, field, value):
    return df.withColumn("is_" + field + "_" + value,
                         F.when(F.col(field) == value, F.lit(1)).otherwise(F.lit(0)))


# sums values of a field within a specified window
def sumVal(df, field, windowval):
    return df.withColumn("TOT_" + field, F.sum(field).over(windowval))


# finds the maximum value of a field within a specified window
def maxVal(df, field, windowval):
    return df.withColumn("MAX_" + field, F.max(field).over(windowval))


# finds the average value of a field within a specified window
def meanVal(df, field, windowval):
    return df.withColumn("MEAN_" + field, F.mean(field).over(windowval))


# finds the ratio between two fields of a record
def fracVal(df, numfield, denomfield, fracName):
    return df.withColumn(fracName, F.col(numfield) / F.col(denomfield))


# adds required fields to the dataframe
def addFeatsTrain(vw_cl_lines_df, param_dict):
    orig = vw_cl_lines_df
    windowval = (Window.partitionBy(param_dict["groupField"])
                 .orderBy(param_dict["windowField"] + "_unix")
                 .rangeBetween(Window.unboundedPreceding, -1))
    codes_df = orig.withColumn("NUM_LINES", F.sum(F.lit(1)).over(windowval))
    for field in param_dict["isFields"]:
        codes_df = isVal(codes_df, field[0], field[1])
    for field in param_dict["sumFields"]:
        codes_df = sumVal(codes_df, field, windowval)
    for field in param_dict["maxFields"]:
        codes_df = maxVal(codes_df, field, windowval)
    for field in param_dict["meanFields"]:
        codes_df = meanVal(codes_df, field, windowval)
    for fracTuple in param_dict["fracTuples"]:
        codes_df = fracVal(codes_df, fracTuple[0], fracTuple[1], fracTuple[2])

    def remPref(x):
        if x is None:
            return ""
        x = x.split(",")
        y = []
        for item in x:
            if (('T' not in item) & ('M' not in item)):
                y.append(item.strip())
        y = ','.join(y)
        return y

    remPrefUDF = F.udf(remPref, StringType())
    allCodes = {}
    for code in param_dict["codeFields"]:
        if len(code) == 1:
            codes_df, toAdd = codeExtract(codes_df, code[0], 20)
            if code[0] in allCodes:
                allCodes[code[0]] = allCodes[code[0]] + toAdd
            else:
                allCodes[code[0]] = toAdd
        else:
            codes_df, toAdd = codeExtract(codes_df, code[0], 20,
                                          optUDF=remPrefUDF, knownPref=code[1])
            if code[0] in allCodes:
                allCodes[code[0]] = allCodes[code[0]] + toAdd
            else:
                allCodes[code[0]] = toAdd
    addedCols = list(set(codes_df.columns) - set(vw_cl_lines_df.columns))
    return codes_df, addedCols, allCodes


def addFeatsTest(vw_cl_lines_df, param_dict, summary_df):
    orig = vw_cl_lines_df
    joinfields = [param_dict['groupField'], "NUM_LINES"]
    for field in param_dict["sumFields"]:
        joinfields.append("TOT_" + field)
    for field in param_dict["maxFields"]:
        joinfields.append("MAX_" + field)
    for field in param_dict["meanFields"]:
        joinfields.append("MEAN_" + field)
    for fracTuple in param_dict["fracTuples"]:
        joinfields.append(fracTuple[2])
    codes_df = orig.join(summary_df[joinfields], param_dict['groupField'], how='left')
    for field in param_dict["isFields"]:
        codes_df = isVal(codes_df, field[0], field[1])

    def remPref(x):
        if x is None:
            return ""
        x = x.split(",")
        y = []
        for item in x:
            if (('T' not in item) & ('M' not in item)):
                y.append(item.strip())
        y = ','.join(y)
        return y

    remPrefUDF = F.udf(remPref, StringType())
    allCodes = {}
    for code in param_dict["codeFields"]:
        presentInTrain = param_dict["allCodes"][code[0]]
        if len(code) == 1:
            codes_df, added = addCodes(codes_df, code[0], presentInTrain, None, None)
        else:
            codes_df, added = addCodes(codes_df, code[0], presentInTrain,
                                       optUDF=remPrefUDF, knownPref=code[1])
    addedCols = list(set(codes_df.columns) - set(vw_cl_lines_df.columns))
    return codes_df, addedCols


# prepares the data for use in training or inference by adding features
# and appropriate labels
def prepTrainData(df, baseFeatures, param_dict):
    trainData = loadDataset(df, param_dict["custSegOrder"], param_dict)
    negCount = trainData.filter(trainData[param_dict["labelField"]] ==
                                param_dict["negativeLabel"]).count()
    posCount = trainData.filter(trainData[param_dict["labelField"]] ==
                                param_dict["positiveLabel"]).count()
    pos_weight = negCount/posCount
    trainData, extraCols, param_dict["allCodes"] = addFeatsTrain(trainData, param_dict)
    vw_cl_lines_pd = trainData.toPandas()
    prep_labelled_data_pd = pd.get_dummies(vw_cl_lines_pd,
                                           columns=param_dict["BASE_FEATURES_CATEGORICAL"],
                                           drop_first=False,
                                           prefix_sep="~~")
    featureCols = extraCols + checkContain(baseFeatures,
                                           prep_labelled_data_pd.columns.tolist(),
                                           param_dict["LOG_TRANSFORM_FEATURES"] +
                                           param_dict["BASE_FEATURES_CATEGORICAL"])
    param_dict["oneHots"] = [x for x in prep_labelled_data_pd.columns.tolist() if "~~" in x]
    leakageFeats = ["is_" + str(x[0]) + "_" + str(x[1]) for x in
                    param_dict["isFields"] if x[0] == param_dict["labelField"]]
    featureCols = [x for x in featureCols if x not in leakageFeats]
    return prep_labelled_data_pd, featureCols, pos_weight, param_dict


def prepTestData(df, summary, baseFeatures, param_dict):
    trainData = loadDataset(df, param_dict["custSegOrder"], param_dict)
    trainData, extraCols = addFeatsTest(trainData, param_dict, summary)
    vw_cl_lines_pd = trainData.toPandas()
    prep_labelled_data_pd = addOneHotsTest(vw_cl_lines_pd, param_dict["oneHots"])
    featureCols = extraCols + checkContain(baseFeatures,
                                           prep_labelled_data_pd.columns.tolist(),
                                           param_dict["LOG_TRANSFORM_FEATURES"] +
                                           param_dict["BASE_FEATURES_CATEGORICAL"])
    leakageFeats = ["is_" + str(x[0]) + "_" + str(x[1]) for x in
                    param_dict["isFields"] if x[0] == param_dict["labelField"]]
    featureCols = [x for x in featureCols if x not in leakageFeats]
    return prep_labelled_data_pd, featureCols


# trains and returns an XGBoost Classifier
def trainXGBModel(df, param_dict):  # ,onlyWarn = False):
    pdf, feats, ratio, param_dict = prepTrainData(df, param_dict["baseFeatures"], param_dict)
    for col in param_dict["BASE_FEATURES_TIMESTAMP"]:
        pdf[col] = pd.to_datetime(pdf[col], errors='coerce')
    adf = pdf.replace([np.inf, -np.inf], 0)
    cols = pdf[feats].columns
    label = np.where(adf[param_dict["labelField"]] ==
                     param_dict["positiveLabel"], 1, 0)
    x = adf[feats].values  # returns a numpy array
    standard_scaler = preprocessing.StandardScaler()
    x_scaled = standard_scaler.fit_transform(x)
    adf = pd.DataFrame(x_scaled, columns=adf[feats].columns)
    adf['label'] = label
    # X_train, y_train = adf[feats], adf['label']
    xgb_model, bestThresh = run_xgboost(adf[feats + ['label']], feats, scale_pos_weight=ratio)
    param_dict["trainedCols"] = list(feats)
    return xgb_model, feats, param_dict, bestThresh


def updateXGBModel(df, param_dict, model):
    # NOTE: this call does not match the prepTestData signature defined above
    # (no summary argument, and prepTestData returns two values).
    pandas_df, featureCols, pos_weight = prepTestData(df, param_dict["baseFeatures"], param_dict)
    pandas_df['label'] = np.where(pandas_df[param_dict["labelField"]] ==
                                  param_dict["positiveLabel"], 1, 0)
    pandas_df = pandas_df.fillna(0)
    y_train = pandas_df['label'].values
    X_train_pd = pandas_df.drop('label', axis=1)
    if len(X_train_pd) > 100000:
        X = np.array_split(X_train_pd, 100000)
        y = np.array_split(y_train, 100000)
        for i in range(len(X)):
            xgb_class = XGBClassifier(scale_pos_weight=pos_weight)
            model = xgb_class.fit(X[i], y[i], xgb_model=model)
    xgb_model = model
    return xgb_model, featureCols, param_dict


# uses a model to predict values
def modelPredict(model, test_df, summary, param_dict, posThresh):
    test_pdf, feats1 = prepTestData(test_df, summary, param_dict["baseFeatures"], param_dict)
    for col in param_dict["BASE_FEATURES_TIMESTAMP"]:
        test_pdf[col] = pd.to_datetime(test_pdf[col], errors='coerce')
    test_adf = test_pdf.replace([np.inf, -np.inf], 0)
    x = test_adf[feats1].values  # returns a numpy array
    standard_scaler = preprocessing.StandardScaler()
    x_scaled = standard_scaler.fit_transform(x)
    test_adf = pd.DataFrame(x_scaled, columns=test_adf[feats1].columns)
    X_test = test_adf[param_dict["trainedCols"]]
    result_proba = model.predict_proba(X_test)
    result = []
    result = (result_proba[:, 1] >= posThresh).astype(int)
    # result = model.predict(X_test)
    return result, result_proba
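A hedged usage sketch of the demo functions above. Every value in param_dict is illustrative: MBR_NO, BEN_TYPE, NO_OF_YR, INCUR_DATE_FROM/INCUR_DATE_TO, ORG_PRES_AMT_VALUE and REJECTED_AMT are field names that appear in the listing, while CLAIM_STATUS, WARNING_CODE, DIAG_CODE, the S3 paths, and the Spark/AWS settings are placeholders.

# Illustrative parameter dictionary; every value here is a placeholder.
param_dict = {
    "APP_NAME": "fraud-detection-demo",
    "SPARK_DRIVER_MEMORY": "10G", "SPARK_DRIVER_CORE": "5",
    "SPARK_EXECUTOR_MEMORY": "3G", "SPARK_EXECUTOR_CORE": "1",
    "AWS_ACCESS_KEY": "...", "AWS_SECRET_KEY": "...", "AWS_S3_ENDPOINT": "...",
    "BASE_FEATURES_TIMESTAMP": ["INCUR_DATE_FROM", "INCUR_DATE_TO"],
    "BASE_FEATURES_NUMERIC": ["ORG_PRES_AMT_VALUE", "REJECTED_AMT"],
    "BASE_FEATURES_CATEGORICAL": ["BEN_TYPE"],
    "LOG_TRANSFORM_FEATURES": ["ORG_PRES_AMT_VALUE"],
    "baseFeatures": ["ORG_PRES_AMT_VALUE", "REJECTED_AMT", "BEN_TYPE"],
    "groupField": "MBR_NO",            # customer identifier
    "windowField": "INCUR_DATE_FROM",  # orders each customer's claim history
    "custSegOrder": "INCUR_DATE_FROM",
    "labelField": "CLAIM_STATUS", "positiveLabel": "Rejected", "negativeLabel": "Accepted",
    "isFields": [("BEN_TYPE", "Applicant")],
    "sumFields": ["ORG_PRES_AMT_VALUE"], "maxFields": ["NO_OF_YR"], "meanFields": ["ORG_PRES_AMT_VALUE"],
    "fracTuples": [("REJECTED_AMT", "ORG_PRES_AMT_VALUE", "FRAC_REJECTED_AMT")],
    "codeFields": [["WARNING_CODE"], ["DIAG_CODE", "D"]],
}

spark, sc, sqlContext = setup_spark_session(param_dict)
train_df = spark.read.parquet("s3a://bucket/claims/train")   # historical claim lines (placeholder path)
score_df = spark.read.parquet("s3a://bucket/claims/new")     # new claim lines to score (placeholder path)

model, feats, param_dict, best_thresh = trainXGBModel(train_df, param_dict)
summary = getSummary(train_df)   # per-customer history summary consumed by modelPredict (see getSummary note)
labels, probs = modelPredict(model, score_df, summary, param_dict, best_thresh)  # 1 = rejected / potential fraud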
