Feature Engineering SDK

This page lists functions commonly used for feature engineering in anomaly detection.

Sample API implementation of missingValue - https://mqidentity-my.sharepoint.com/:u:/g/personal/nkhadri_atomiton_com/EaCRIMfh5qRNt6ebuSgCxfkBF_n-tcymL1GCMptKMH2JXA?e=oFwHRi

 

Threshold

Threshold (fileLocation=None, indexColumn=None, thresholdColumns=None, thresholdLevels=None)

The class holds the threshold ranges for the parameters listed in the threshold file. Internally it parses the ranges for each threshold level, including multiple ranges and open lower or upper limits, and exposes a public method through which the threshold level for a value can be obtained.

Input parameters:

1. fileLocation: String - Path of the threshold file in CSV format.
2. indexColumn: String - Column in the CSV that holds the parameter names.
3. thresholdColumns: List of strings - Columns that hold the threshold ranges. Eg, if three thresholds are defined for a problem (good, moderate, and critical), these are the columns whose data denotes the range for each of them.
4. thresholdLevels: Dictionary {column: thresholdLevel} - Maps each entry of thresholdColumns to a threshold level. Eg, continuing the earlier example, good can be 1, moderate can be 2, and critical can be 3: {good:1, moderate:2, critical:3}

Example of the threshold file format:

Measure    Thr-1        Thr-2            Thr-3
Flatness   80:100       70:80            L:70
Symetry    -15:0,0:15   15:20, -20:-15   20:H,L:-20

Note:

  1. L stands for a missing lower limit: <50 is represented as L:50.

  2. H stands for a missing upper limit: >50 is represented as 50:H.

  3. If a threshold level has multiple ranges, separate them with a comma, eg "20:H,L:-20".

For the sample file format above, these are the inputs to be passed:

  1. indexColumn - Measure

  2. thresholdColumns - [Thr-1, Thr-2, Thr-3]

  3. thresholdLevels - {Thr-1:1, Thr-2:2, Thr-3:3}

Methods:

getThresholdLevel (parameterValueDict)
Calculates the threshold level for each parameter passed in parameterValueDict, based on the thresholds loaded from the file, and returns the levels as a dictionary.

Input parameters: dictionary {parameter1:value, parameter2:value}

Output: dictionary {parameter1:thresholdLevel, parameter2:thresholdLevel}

Eg, continuing with the sample data:
if parameterValueDict = {Flatness:50, Symetry:18} and thresholdColumns = [Thr-1, Thr-2, Thr-3], then Flatness 50 falls in L:70 (Thr-3) and Symetry 18 falls in 15:20 (Thr-2).
output - {Flatness:3, Symetry:2}
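
As a rough illustration of the range notation and the lookup described above, the following Python sketch parses the sample ranges and returns the level of the first matching column. The names parse_ranges and get_threshold_level, the in-memory copy of the sample file, and the choice of inclusive bounds are assumptions made for this example; they are not the SDK's implementation.

```python
import math

def parse_ranges(spec):
    """Parse a cell such as "20:H,L:-20" into a list of (low, high) tuples."""
    ranges = []
    for part in spec.split(","):
        low, high = part.strip().split(":")
        # L means no lower limit, H means no upper limit.
        low = -math.inf if low.strip() == "L" else float(low)
        high = math.inf if high.strip() == "H" else float(high)
        ranges.append((low, high))
    return ranges

# Hypothetical in-memory copy of the sample threshold file, keyed by Measure.
SAMPLE_THRESHOLDS = {
    "Flatness": {"Thr-1": "80:100", "Thr-2": "70:80", "Thr-3": "L:70"},
    "Symetry": {"Thr-1": "-15:0,0:15", "Thr-2": "15:20, -20:-15", "Thr-3": "20:H,L:-20"},
}
THRESHOLD_LEVELS = {"Thr-1": 1, "Thr-2": 2, "Thr-3": 3}

def get_threshold_level(parameter_values, thresholds=SAMPLE_THRESHOLDS,
                        levels=THRESHOLD_LEVELS):
    """Return {parameter: level}; None when no range matches (assumed behaviour)."""
    result = {}
    for name, value in parameter_values.items():
        result[name] = None
        for column, spec in thresholds[name].items():
            # Assumption: bounds are inclusive and the first matching column wins.
            if any(low <= value <= high for low, high in parse_ranges(spec)):
                result[name] = levels[column]
                break
    return result

print(get_threshold_level({"Flatness": 50, "Symetry": 18}))
```

With the sample data this prints {'Flatness': 3, 'Symetry': 2}, matching the example output. How the real class treats a value that sits exactly on a shared boundary (such as 80 for Flatness) is not stated here, so the inclusive, first-match rule is only one possible choice.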

HistoricalStats

HistoricalStats(HistoricalStatFileLocation=None, indexColumn=None, statsColumns=None)
Reads the file from the given location and loads its content into a DataFrame indexed by indexColumn. It exposes the DataFrame as an attribute, from which stats can be fetched for the required parameters.

Input parameters:

1. HistoricalStatFileLocation: String - Location of the historical stats file.
2. indexColumn: String - Column in the CSV that holds the parameter names.
3. statsColumns: List of strings - Columns that hold the stats.

Attributes:

historicalStatsDf: DataFrame - DataFrame loaded from the input file, from which stats can be fetched for the required parameter.

Eg format:

Measurements   hist-mean   hist-std
Vel            1434.423    33.88293
Vel P1         0.137383    0.163391

For the sample file, the inputs to the class are as follows:

  1. HistoricalStatFileLocation - location of the file

  2. indexColumn - Measurements

  3. statsColumns - ["hist-mean", "hist-std"]

If hist-mean has to be obtained for Vel:

historicalStats.historicalStatsDf.loc["Vel", "hist-mean"]

Note: If the requested parameter does not exist in the input file, an exception is raised.
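
The loading and lookup described above can be sketched with pandas as follows. This is a minimal sketch, not the class itself: load_historical_stats is a hypothetical helper, and an in-memory CSV stands in for the file at HistoricalStatFileLocation.

```python
import io
import pandas as pd

# In-memory stand-in for the sample historical stats CSV.
SAMPLE_CSV = io.StringIO(
    "Measurements,hist-mean,hist-std\n"
    "Vel,1434.423,33.88293\n"
    "Vel P1,0.137383,0.163391\n"
)

def load_historical_stats(file_location, index_column, stats_columns):
    """Load the stats file indexed by index_column, keeping only stats_columns."""
    df = pd.read_csv(file_location, index_col=index_column)
    return df[stats_columns]

historical_stats_df = load_historical_stats(
    SAMPLE_CSV, "Measurements", ["hist-mean", "hist-std"]
)

# Fetch hist-mean for Vel, as in the example above.
vel_mean = historical_stats_df.loc["Vel", "hist-mean"]
print(vel_mean)

# With a plain pandas DataFrame, an unknown parameter raises KeyError.
try:
    historical_stats_df.loc["Unknown", "hist-mean"]
    missing_raises = False
except KeyError:
    missing_raises = True
print(missing_raises)
```

Which exception type the SDK class raises is not specified here; KeyError is simply what a direct .loc lookup on a pandas DataFrame produces.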

MissingValue

missingValue (inputDataFrame, subset=None, valueForMissing=0, missingValueMag=1, initDurationForFirstEpisode=None, incrementalEpisodes=False)

For the input data frame and the required columns, each value is checked against valueForMissing or blank, and the output columns are filled accordingly.

Provides 3 new columns per input column: an episode flag, a value that denotes a missing value, and the duration of the episode (refer to the terminology section for more detail).

Note: By default, a blank is treated as a missing value.

Returns:

DataFrame or None, with 3 new columns for each input column.

  1. <ColumnName>_FMV_EPS: Whether the value is missing or not.

  2. <ColumnName>_FMV_Mag: missingValueMag where the input matches the rule, 0 for all other values.

  3. <ColumnName>_FMV_Dur: Duration of an episode, calculated as the number of consecutive rows in which the value was missing.

Note: Input data should be continuous in time, because duration is calculated as a number of rows. If 5 consecutive values meet the criteria, the duration is 5 units, where a unit can be seconds, minutes, or hours depending on the input data. The function does not check the continuity of the input index.

Input Parameters:

  1. inputDataFrame - data frame

  2. subset: sequence of columns,
    Only the listed columns are checked for missing values; by default all of the columns are used.

  3. valueForMissing: numeric - Value that is considered a missing value, in addition to blank/Null.

  4. missingValueMag: numeric - Value written to <ColumnName>_FMV_Mag (output column) for missing values. For non-missing values, <ColumnName>_FMV_Mag is zero.

  5. initDurationForFirstEpisode: Dictionary {columnName: duration}, where the key is the column name and the duration is numeric - Duration of an ongoing episode that continues from before the beginning of the input. It is used only when an episode starts at the first row. Eg, if this value is 10, <ColumnName>_FMV_Dur starts from 11 (when incrementalEpisodes = True), which denotes that the missing value episode is in its 11th unit, as 10 units (secs/mins/hours, based on the input data) of the episode are carried forward. If initDurationForFirstEpisode = 0, or for any episode that does not start at the first row, the duration starts from 1, denoting the first unit of the episode.

  6. incrementalEpisodes: bool, default False. Whether the duration is incremented on every row of the episode, or the total duration of the episode is reported on every row.
    Eg, if the episode lasts 5 minutes (assuming minutes is the unit of the input data):

    1. incrementalEpisodes = True: the duration is 1 for the first minute, 2 for the second minute, and so on up to 5.

    2. incrementalEpisodes = False: the duration is 5 for every row of the episode.


Note: Input data should be continuous in time, which means the difference between consecutive rows should be constant.
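
To make the interaction between the episode flag, the carried-over duration, and incrementalEpisodes concrete, here is a minimal pandas sketch for a single column. The function missing_value_features and its scalar init_duration argument are assumptions for illustration (the SDK takes a dictionary per column), and encoding the flag as 1/0 is likewise assumed.

```python
import pandas as pd

def missing_value_features(series, value_for_missing=0, missing_value_mag=1,
                           init_duration=0, incremental=False):
    """Return the three FMV columns for one input column (illustrative only)."""
    # Blank (NaN) and value_for_missing both count as missing.
    is_missing = series.isna() | (series == value_for_missing)
    # A new run starts whenever the missing flag changes between rows.
    run_id = is_missing.astype(int).diff().ne(0).cumsum()
    # Position of each row within its run: 1, 2, 3, ...
    position = is_missing.groupby(run_id).cumcount() + 1
    # The carried-over duration applies only to an episode starting on row one.
    carry = init_duration if is_missing.iloc[0] else 0
    position = position + carry * (run_id == run_id.iloc[0]).astype(int)
    if incremental:
        duration = position
    else:
        # Report the total episode duration on every row of the episode.
        duration = position.groupby(run_id).transform("max")
    duration = duration.where(is_missing, 0)
    name = series.name
    return pd.DataFrame({
        f"{name}_FMV_EPS": is_missing.astype(int),
        f"{name}_FMV_Mag": is_missing.astype(int) * missing_value_mag,
        f"{name}_FMV_Dur": duration,
    })

vel = pd.Series([None, 0, 3.0, None, None, None, 4.0], name="Vel")
incremental_out = missing_value_features(vel, init_duration=10, incremental=True)
total_out = missing_value_features(vel)
print(incremental_out["Vel_FMV_Dur"].tolist())
print(total_out["Vel_FMV_Dur"].tolist())
```

In this example the first episode starts on the first row, so with init_duration=10 and incremental=True its durations are 11 and 12, while the later three-row episode counts 1, 2, 3. With the defaults, the same rows report 2, 2 and 3, 3, 3.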

SuddenValueChange

SuddenValueChange (inputDataFrame, subset=None, rollingWIndowMins=3, HistoricalStats=None, zScoreThresholdForSVC=3, maxTimeDifferenceBetweenSignificantChange = 5, maxSignChange = 1, initDurationForFirstEpisode=None, incrementalEpisodes=False)

Provides 3 new columns per input column: episode flag, magnitude, and duration of the episode.

Returns:

DataFrame or None, with 3 new columns for each input column.

  1. <ColumnName>_SVC_Eps: Whether there is a sudden change or not.

  2. <ColumnName>_SVC_Mag: Magnitude of the change.

  3. <ColumnName>_SVC_Dur: Duration of the episode, calculated as the number of rows in which the value was significantly higher within an episode.

Input Parameters:

  1. inputDataFrame - data frame

  2. subset: sequence of columns,
    Only the listed columns are checked for sudden value changes; by default all of the columns are used.

  3. rollingWIndowMins: numeric - Size of the moving window, that is, the number of observations used for calculating the average.

  4. HistoricalStats: Object of HistoricalStats, loaded with the required historical stats file. Its historicalStatsDf attribute is used to get the stats for the required parameters.

  5. zScoreThresholdForSVC: z-score above which values are considered significant.

  6. maxTimeDifferenceBetweenSignificantChange: Maximum duration between two significant changes for them to be considered a single episode.

  7. maxSignChange: Maximum number of sign changes for an episode to be considered a sudden value change.

  8. initDurationForFirstEpisode: Dictionary {columnName: duration}, where the key is the column name and the duration is numeric - Duration of an ongoing episode that continues from before the beginning of the input. It is used only when an episode starts at the first row. If this value is 10, <ColumnName>_SVC_Dur starts from 11 (when incrementalEpisodes = True), which denotes that the sudden value change episode is in its 11th minute, as 10 minutes of the episode are carried forward. If initDurationForFirstEpisode = 0, or for any episode that does not start at the first row, the duration starts from 1, denoting the first minute of the episode.

  9. incrementalEpisodes: bool, default False. Whether the duration is incremented on every row of the episode, or the total duration of the episode is reported on every row.

Note: Input data should be continuous in time, which means the difference between consecutive rows should be constant.
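
The exact detection logic is not spelled out on this page. As one plausible reading of rollingWIndowMins, HistoricalStats, and zScoreThresholdForSVC, the sketch below smooths a column with a rolling mean and flags the rows whose z-score against the historical mean and standard deviation exceeds the threshold. The function name significant_points and the use of min_periods=1 are assumptions made for this example.

```python
import pandas as pd

def significant_points(series, hist_mean, hist_std, rolling_window=3, z_threshold=3):
    """Flag rows whose rolling-mean z-score exceeds z_threshold (illustrative)."""
    # Average over the last rolling_window observations.
    rolling_mean = series.rolling(window=rolling_window, min_periods=1).mean()
    # z-score against the historical mean and standard deviation.
    z_score = (rolling_mean - hist_mean) / hist_std
    return z_score.abs() > z_threshold

# Historical stats for Vel taken from the sample HistoricalStats file.
vel = pd.Series([1434, 1430, 1440, 1700, 1710, 1705, 1436], name="Vel")
flags = significant_points(vel, hist_mean=1434.423, hist_std=33.88293)
print(flags.tolist())
```

Here the first jump to 1700 is not flagged yet, because the three-row average has only moved about 2.6 standard deviations; the next three rows are flagged. Grouping flagged rows into episodes and applying maxSignChange would follow as separate steps.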

RappidValueFluctuation

rappidValueFluctuation (inputDataFrame, subset=None, rollingWIndowMins=3, HistoricalStats=None, zScoreThresholdForRVF=3, maxTimeDifferenceBetweenSignificantChange = 5, minSignChange = 2, initDurationForFirstEpisode=None, incrementalEpisodes=False)

Provides 3 new columns per input column: episode flag, magnitude, and duration of the episode.

Returns:

DataFrame or None, with 3 new columns for each input column.

  1. <ColumnName>_SVC_Eps: Whether there are rapid changes or not.

  2. <ColumnName>_SVC_Mag: Magnitude of the change.

  3. <ColumnName>_SVC_Dur: Duration of the episode, calculated as the number of rows in which the value was significantly higher within each episode.

Input Parameters:

  1. inputDataFrame - data frame

  2. subset: sequence of columns,
    Only the listed columns are checked for rapid value fluctuations; by default all of the columns are used.

  3. rollingWIndowMins: numeric - Size of the moving window, that is, the number of observations used for calculating the average.

  4. HistoricalStats: Object of HistoricalStats, loaded with the required historical stats file. Its historicalStatsDf attribute is used to get the stats for the required parameters.

  5. zScoreThresholdForRVF: z-score above which values are considered significant.

  6. maxTimeDifferenceBetweenSignificantChange: Maximum duration between two significant changes for them to be considered a single episode.

  7. minSignChange: Minimum number of sign changes for an episode to be considered a rapid value fluctuation.

  8. initDurationForFirstEpisode: Dictionary {columnName: duration}, where the key is the column name and the duration is numeric - Duration of an ongoing episode that continues from before the beginning of the input. It is used only when an episode starts at the first row. If this value is 10, the duration column starts from 11 (when incrementalEpisodes = True), which denotes that the rapid value fluctuation episode is in its 11th minute, as 10 minutes of the episode are carried forward. If initDurationForFirstEpisode = 0, or for any episode that does not start at the first row, the duration starts from 1, denoting the first minute of the episode.

  9. incrementalEpisodes: bool, default False. Whether the duration is incremented on every row of the episode, or the total duration of the episode is reported on every row.


Note: Input data should be continuous in time, which means the difference between consecutive rows should be constant.
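
The roles of maxTimeDifferenceBetweenSignificantChange and minSignChange can be illustrated with a small plain-Python sketch. It assumes that significant changes are identified by row position, that changes no more than max_gap rows apart belong to the same episode, and that sign changes are counted on the signed z-scores within an episode. The helper names group_episodes and count_sign_changes are hypothetical.

```python
def group_episodes(positions, max_gap=5):
    """Group sorted row positions; a gap larger than max_gap starts a new episode."""
    episodes = []
    for pos in positions:
        if episodes and pos - episodes[-1][-1] <= max_gap:
            episodes[-1].append(pos)
        else:
            episodes.append([pos])
    return episodes

def count_sign_changes(values):
    """Count how often consecutive non-zero values switch sign."""
    signs = [1 if v > 0 else -1 for v in values if v != 0]
    return sum(1 for a, b in zip(signs, signs[1:]) if a != b)

# Hypothetical signed z-scores of significant changes, keyed by row position.
significant = {2: 3.5, 4: -4.0, 6: 3.2, 20: 3.8}
episodes = group_episodes(sorted(significant), max_gap=5)
sign_changes = [count_sign_changes([significant[p] for p in ep]) for ep in episodes]
print(episodes)
print(sign_changes)
```

Under these assumptions the rows 2, 4, and 6 form one episode with two sign changes, which meets minSignChange = 2, while row 20 forms a separate episode with no sign change and would instead fall under the maxSignChange = 1 limit of SuddenValueChange.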

valueOutOfRange

valueOutOfRange(inputDataFrame, subset=None, Threshold=None, initDurationForFirstEpisode=None, incrementalEpisodes=False)

Provides 3 new columns per input column: episode flag, threshold level of the out-of-range value, and duration of the episode.

Returns:

DataFrame or None, with 3 new columns for each input column.

  1. <ColumnName>_VOR_Eps: Whether the value is out of range or not.

  2. <ColumnName>_VOR_Mag: Level of the out-of-range value, as defined in the threshold file.

  3. <ColumnName>_VOR_Dur: Duration of the episode, calculated as the number of rows in which the value was out of range within an episode.

Input Parameters:

  1. inputDataFrame - data frame

  2. subset: sequence of columns,
    Only the listed columns are checked for out-of-range values; by default all of the columns are used.

  3. Threshold: Object of the Threshold class, loaded with the proper threshold file. Its getThresholdLevel method is used to get the threshold level for the parameters required to calculate valueOutOfRange. If the subset is present, values of these parameters are converted into

  4. initDurationForFirstEpisode: Dictionary {columnName: duration}, where the key is the column name and the duration is numeric - Duration of an ongoing episode that continues from before the beginning of the input. It is used only when an episode starts at the first row. If this value is 10, <ColumnName>_VOR_Dur starts from 11 (when incrementalEpisodes = True), which denotes that the out-of-range episode is in its 11th minute, as 10 minutes of the episode are carried forward. If initDurationForFirstEpisode = 0, or for any episode that does not start at the first row, the duration starts from 1, denoting the first minute of the episode.

  5. incrementalEpisodes: bool, default False. Whether the duration is incremented on every row of the episode, or the total duration of the episode is reported on every row.

Note: Input data should be continuous in time, which means the difference between consecutive rows should be constant.

DivergencePathValue

DivergencePathValue(inputDataFrame, subset=None, rollingWIndowMins=3, historyStatFile=None,zScoreDeviationForDivergence=2, zScoreDeviationTimeLimit = 10, initDurationForFirstEpisode=None, incrementalEpisodes=False)

Provides 3 new columns per input column: episode flag, magnitude, and duration of the episode.

Returns:

DataFrame or None, with 3 new columns for each input column.

  1. <ColumnName>_DPV_Eps: Divergence of two columns.

  2. <ColumnName>_DPV_Mag: Magnitude of the divergence.

  3. <ColumnName>_DPV_Dur: Duration of the episode, calculated as the number of rows in which the value was significantly higher within an episode.

Input Parameters:

  1. inputDataFrame - data frame

  2. subset: sequence of columns,
    Only the listed columns are checked for divergence; by default all of the columns are used.

  3. rollingWIndowMins: numeric - Size of the moving window, that is, the number of observations used for calculating the average.

  4. historyStatFile: File of historical stats in which the mean and standard deviation of the required columns are available.

  5. zScoreDeviationForDivergence: z-score above which values are considered significant.

  6. zScoreDeviationTimeLimit: Minimum duration of an episode for it to qualify as divergence. This filters out noise that consists of only a few minutes of divergence.

  7. initDurationForFirstEpisode: Dictionary {columnName: duration}, where the key is the column name and the duration is numeric - Duration of an ongoing episode that continues from before the beginning of the input. It is used only when an episode starts at the first row. If this value is 10, <ColumnName>_DPV_Dur starts from 11 (when incrementalEpisodes = True), which denotes that the divergence episode is in its 11th minute, as 10 minutes of the episode are carried forward. If initDurationForFirstEpisode = 0, or for any episode that does not start at the first row, the duration starts from 1, denoting the first minute of the episode.

  8. incrementalEpisodes: bool, default False. Whether the duration is incremented on every row of the episode, or the total duration of the episode is reported on every row.

Note: Input data should be continuous in time, which means the difference between consecutive rows should be constant.
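
The effect of zScoreDeviationTimeLimit, dropping divergence runs that are too short, can be sketched as follows. The input is assumed to be a boolean Series marking rows whose z-score deviation already exceeds zScoreDeviationForDivergence. The function name filter_short_runs is hypothetical, and whether the SDK treats the limit as inclusive is an assumption.

```python
import pandas as pd

def filter_short_runs(flags, min_duration=10):
    """Keep only runs of True that last at least min_duration rows."""
    flags = flags.astype(bool)
    # A new run starts whenever the flag changes between rows.
    run_id = flags.astype(int).diff().ne(0).cumsum()
    # Length of the run each row belongs to.
    run_length = run_id.map(run_id.value_counts())
    return flags & (run_length >= min_duration)

deviating = pd.Series([True, True, False, True, True, True, True, False])
episodes = filter_short_runs(deviating, min_duration=3)
print(episodes.tolist())
```

With min_duration=3 the initial two-row run is treated as noise and removed, while the four-row run is kept as a divergence episode.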