Title: | Data Science for Wind Energy |
---|---|
Description: | Data science methods used in wind energy applications. Current functionalities include creating a multi-dimensional power curve model, performing power curve function comparison, covariate matching, and energy decomposition. Relevant works for the developed functions are: funGP() - Prakash et al. (2022) <doi:10.1080/00401706.2021.1905073>, AMK() - Lee et al. (2015) <doi:10.1080/01621459.2014.977385>, tempGP() - Prakash et al. (2022) <doi:10.1080/00401706.2022.2069158>, ComparePCurve() - Ding et al. (2021) <doi:10.1016/j.renene.2021.02.136>, deltaEnergy() - Latiffianti et al. (2022) <doi:10.1002/we.2722>, syncSize() - Latiffianti et al. (2022) <doi:10.1002/we.2722>, imptPower() - Latiffianti et al. (2022) <doi:10.1002/we.2722>, All other functions - Ding (2019, ISBN:9780429956508). |
Authors: | Nitesh Kumar [aut], Abhinav Prakash [aut], Yu Ding [aut, cre], Effi Latiffianti [ctb, cph], Ahmadreza Chokhachian [ctb, cph] |
Maintainer: | Yu Ding <[email protected]> |
License: | MIT + file LICENSE |
Version: | 1.8.2 |
Built: | 2024-12-14 06:46:56 UTC |
Source: | CRAN |
An additive multiplicative kernel regression based on Lee et al. (2015).
AMK( trainX, trainY, testX, bw = "dpi_gap", nMultiCov = 3, fixedCov = c(1, 2), cirCov = NA )
trainX |
a matrix or dataframe of input variable values in the training dataset. |
trainY |
a numeric vector for response values in the training dataset. |
testX |
a matrix or dataframe of test input variable values to compute predictions. |
bw |
a numeric vector or a character input for bandwidth. If character, bandwidth computed internally; the input should be either |
nMultiCov |
an integer or a character input specifying the number of multiplicative covariates in each additive term. Default is 3 (same as Lee et al., 2015). The character inputs can be: |
fixedCov |
an integer vector specifying the fixed covariates column number(s), default value is |
cirCov |
an integer vector specifying the circular covariates column number(s) in |
This function is based on Lee et al. (2015). Main features are:
Flexible number of multiplicative covariates in each additive term, which can be set using nMultiCov.
Flexible number and columns for fixed covariates, which can be set using fixedCov. The default option c(1,2) sets the first two columns as fixed covariates in each additive term.
Handling the data with gaps when the direct plug-in estimator used in Lee et al. fails to return a finite bandwidth. This is set using the option bw = 'dpi_gap' for bandwidth estimation.
a numeric vector for predictions at the data points in testX.
Lee, Ding, Genton, and Xie, 2015, “Power curve estimation with multivariate environmental factors for inland and offshore wind farms,” Journal of the American Statistical Association, Vol. 110, pp. 56-67, doi:10.1080/01621459.2014.977385.
data = data1
trainX = as.matrix(data[c(1:100),2])
trainY = data[c(1:100),7]
testX = as.matrix(data[c(101:110),2])
AMK_prediction = AMK(trainX, trainY, testX, bw = 'dpi_gap', cirCov = NA)
Power curve comparison
ComparePCurve( data, xCol, xCol.circ = NULL, yCol, testCol, testSet = NULL, thrs = 0.2, conflevel = 0.95, gridSize = c(50, 50), powerbins = 15, baseline = 1, limitMemory = TRUE, opt_method = "nlminb", sampleSize = list(optimSize = 500, bandSize = 5000), rngSeed = 1 )
data |
A list of data sets to be compared, the difference in the mean function is always computed as (f(data2) - f(data1)) |
xCol |
A numeric or vector stating column number of covariates |
xCol.circ |
A numeric or vector stating column number of circular covariates |
yCol |
A numeric value stating the column number of the response |
testCol |
A numeric/vector stating the column number(s) of the covariates to be used in generating the test set. A maximum of two columns can be used. |
testSet |
A matrix or dataframe consisting of test points, default value NULL, if NULL computes test points internally using testCol variables. If not NULL, total number of test points must be less than or equal to 2500. |
thrs |
A numeric or vector representing threshold for each covariates |
conflevel |
A numeric between (0,1) representing the statistical significance level for constructing the band |
gridSize |
A numeric / vector to be used in constructing test set, should be provided when testSet is NULL, else it is ignored. Default is |
powerbins |
A numeric stating the number of power bins for computing the scaled difference, default is 15. |
baseline |
An integer between 0 to 2, where 1 indicates to use power curve of first dataset as the base for metric calculation, 2 indicates to use the power curve of second dataset as the base, and 0 indicates to use the average of both power curves as the base. Default is set to 1. |
limitMemory |
A boolean (True/False) indicating whether to limit the memory use or not. Default is true. If set to true, 5000 datapoints are randomly sampled from each dataset under comparison for inference |
opt_method |
A string specifying the optimization method to be used for hyperparameter estimation. Current options are: |
sampleSize |
A named list of two integer items: |
rngSeed |
Random seed for sampling data when |
a list containing :
weightedDiff - a numeric, % difference between the functions weighted using the density of the covariates
weightedStatDiff - a numeric, % statistically significant difference between the functions weighted using the density of the covariates
scaledDiff - a numeric, % difference between the functions scaled to the original data
scaledStatDiff - a numeric, % statistically significant difference between the functions scaled to the original data
unweightedDiff - a numeric, % difference between the functions unweighted
unweightedStatDiff - a numeric, % statistically significant difference between the functions unweighted
reductionRatio - a list consisting of shrinkage ratio of features used in testSet
mu1 - a vector of prediction on testset using the first data set
mu2 - a vector of prediction on testset using the second data set
muDiff - a vector of the difference in prediction (mu2 - mu1) for each test point
band - a vector for the confidence band at all the test points for the two functions to be the same at a given confidence level.
confLevel - a numeric representing the statistical significance level for constructing the band
testSet - a vector/matrix of the test points either provided by user, or generated internally
estimatedParams - a list of estimated hyperparameters for the Gaussian process model
matchedData - a list of two matched datasets as generated by covariate matching
For details, see Ding et al. (2021) available at doi:10.1016/j.renene.2021.02.136.
data1 = data1[1:100, ]
data2 = data2[1:100, ]
data = list(data1, data2)
xCol = 2
xCol.circ = NULL
yCol = 7
testCol = 2
testSet = NULL
thrs = 0.2
confLevel = 0.95
gridSize = 20
function_comparison = ComparePCurve(data, xCol, xCol.circ, yCol, testCol, testSet, thrs, confLevel, gridSize)
Computes percentage weighted difference between power curves based on user provided weights instead of the weights computed from the data. Please see details for more information.
ComputeWeightedDifference( muDiff, weights, base, statDiff = FALSE, confBand = NULL )
muDiff |
a vector of pointwise difference between two power curves on a testset as obtained from |
weights |
a vector of user specified weights for each element of |
base |
a vector of predictions from a power curve; to be used as the denominator in computing the percentage difference. It can be either |
statDiff |
a boolean specifying whether to compute the statistical significant difference or not. Default is set to |
confBand |
a vector of pointwise confidence band for all the points in the testset as obtained from |
The function is a modification to the percentage weighted difference defined in Ding et al. (2021). It computes a weighted difference between power curves on a testset, where the weights have to be provided by the user based on any probability distribution of their choice rather than the weights being computed from the data. The weights must sum to 1 to be valid.
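The weight requirement above can be illustrated with a small base R sketch. It normalizes Weibull density values into weights that sum to 1 and then forms a percentage difference as the weighted difference divided by the weighted baseline. That ratio is an assumed form chosen for illustration, not necessarily the exact formula used inside the package, and `mu1` and `mu_diff` are synthetic stand-ins for the `mu1` and `muDiff` outputs of ComparePCurve.

```r
# Test points (wind speeds) and user-chosen Weibull weights, normalized to sum to 1.
ws_test <- seq(4.5, 8.5, length.out = 10)
user_weights <- dweibull(ws_test, shape = 2.25, scale = 6.5)
user_weights <- user_weights / sum(user_weights)

# Synthetic stand-ins (assumption): a baseline curve and a uniform 2% difference.
mu1 <- 10 * ws_test
mu_diff <- 0.02 * mu1

# Assumed form: weighted difference over weighted baseline, as a percentage.
weighted_pct_diff <- 100 * sum(user_weights * mu_diff) / sum(user_weights * mu1)
```

Because the synthetic difference is 2% of the baseline at every test point, any valid set of weights gives 2 here; with real output the result depends on how the weights distribute over the test points.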
a numeric percentage weighted difference or statistically significant percentage weighted difference, depending on whether statDiff is set to FALSE or TRUE.
For details, see Ding et al. (2021) available at doi:10.1016/j.renene.2021.02.136.
ws_test = as.matrix(seq(4.5,8.5,length.out = 10))
userweights = dweibull(ws_test, shape = 2.25, scale = 6.5)
userweights = userweights/sum(userweights)
data1 = data1[1:100, ]
data2 = data2[1:100, ]
datalist = list(data1, data2)
xCol = 2
xCol.circ = NULL
yCol = 7
testCol = 2
output = ComparePCurve(data = datalist, xCol = xCol, yCol = yCol, testCol = testCol, testSet = ws_test)
weightedDiff = ComputeWeightedDifference(output$muDiff, userweights, output$mu1)
weightedStatDiff = ComputeWeightedDifference(output$muDiff, userweights, output$mu1, statDiff = TRUE, confBand = output$band)
The function takes a list of two data sets and returns the matched data sets, using user-specified covariates and thresholds.
CovMatch(data, xCol, xCol.circ, thrs, priority)
data |
a list, consisting of data sets to match, also each of the individual data set can be dataframe or a matrix |
xCol |
a vector stating the column position of covariates used |
xCol.circ |
a vector stating the column position of circular variables |
thrs |
a numeric value or a vector of threshold values for each covariate, against which matching happens. It should be a single value or a vector of values representing the threshold for each covariate |
priority |
a boolean, default value False, otherwise computes the sequence of matching |
a list containing :
originalData - The data sets provided for matching
matchedData - The data sets after matching
MinMaxOriginal - The minimum and maximum value in original data for each covariate used in matching
MinMaxMatched - The minimum and maximum value in matched data for each covariates used in matching
Ding, Y. (2019). Data Science for Wind Energy. Chapman & Hall, Boca Raton, FL.
data1 = data1[1:100, ]
data2 = data2[1:100, ]
data = list(data1, data2)
xCol = 2
xCol.circ = NULL
thrs = 0.1
priority = FALSE
matched_data = CovMatch(data, xCol, xCol.circ, thrs, priority)
A dataset containing the power produced and other attributes of 47,542 records.
data(data1)
A data frame with 47,542 rows and 7 variables
Data.point - sequence of integers displaying each record
V - wind speed
D - wind direction
air.density - air density
I - turbulence intensity
S_b - wind shear
Y - wind power
A dataset containing the power produced and other attributes of 48,068 records.
data(data2)
A data frame with 48,068 rows and 7 variables
Data.point - sequence of integers displaying each record
V - wind speed
D - wind direction
air.density - air density
I - turbulence intensity
S_b - wind shear
Y - wind power
Energy decomposition compares energy production from two datasets and separates it into turbine effects (deltaE.turb) and weather/environment effects (deltaE.weather).
deltaEnergy( data, powercol, timecol = 0, xcol, sync.method = "minimum power", imput = TRUE, vcol = NULL, vrange = NULL, rated.power = NULL, sample = TRUE, size = 2500, timestamp.min = 10 )
data |
A list of two data sets to be compared. A difference is always computed as (data2 - data1). |
powercol |
A numeric stating the column number of power production. |
timecol |
A numeric stating the column number of data time stamp. Default value is zero. A value other than zero should be provided when |
xcol |
A numeric or vector stating the column number(s) of power curve input covariates/features (environmental or weather variables are recommended). |
sync.method |
A string specifying data synchronization method. Default value |
imput |
A boolean (TRUE/FALSE) indicating whether power imputation should be performed before calculating energy decomposition. The recommended and default value is TRUE. Change to FALSE when data have been preprocessed or imputed before. |
vcol |
A numeric stating the column number of wind speed. |
vrange |
A vector of cut-in, rated, and cut-out wind speed. Values should be provided when |
rated.power |
A numerical value stating the wind turbine rated power. |
sample |
A boolean (TRUE/FALSE) indicating whether to use sample or the whole data sets to train the power curve to be used for power imputation. Default value is TRUE. It is only used when |
size |
A numeric stating the size of sample when |
timestamp.min |
A numerical value stating the resolution of the datasets in minutes. It is the difference between two consecutive time stamps at which data were recorded. Default value is 10. |
a list containing :
deltaE.turb - A numeric,
deltaE.weather - A numeric,
deltaE.hat - A numeric,
deltaE.obs - A numeric,
estimated.energy - A numeric vector of the total energy calculated from each of f1(x2), f1(x1), f2(x2), f1(x2). If power is in kW, these values will be in kWh.
data - A list of two datasets used to calculate energy decomposition, i.e. synchronized. When imput = TRUE, the power column is the result from imputation.
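The unit remark for estimated.energy (power in kW gives energy in kWh) follows from the time resolution of the records. As a minimal sketch, assuming each record holds the average power over one interval of `timestamp_min` minutes, total energy is the sum of the power values multiplied by the interval length in hours. The power values are synthetic and the variable names are illustrative; the package's internal computation may differ in detail.

```r
# Synthetic 10-minute average power readings in kW (illustrative values only).
power_kw <- c(50, 60, 0, 80, 75, 65)
timestamp_min <- 10   # data resolution in minutes

# Each record covers timestamp_min / 60 hours, so energy in kWh is:
energy_kwh <- sum(power_kw) * timestamp_min / 60
```

Six 10-minute records cover one hour, so the 330 kW total of the six averages corresponds to 55 kWh.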
Latiffianti, E, Ding, Y, Sheng, S, Williams, L, Morshedizadeh, M, Rodgers, M (2022). "Analysis of leading edge protection application on wind turbine performance through energy and power decomposition approaches". Wind Energy. 2022; 1-19. doi:10.1002/we.2722.
data = list(data1[1:50,], data2[1:60,])
powercol = 7
timecol = 1
xcol = c(2:6)
sync.method = 'time'
imput = TRUE
vcol = 2
vrange = c(5,12,25)
rated.power = 100
sample = FALSE
Decomposition = deltaEnergy(data, powercol, timecol, xcol, sync.method, imput, vcol, vrange, rated.power, sample)
Function comparison using Gaussian Process and Hypothesis testing
funGP( datalist, xCol, yCol, confLevel = 0.95, testset, limitMemory = TRUE, opt_method = "nlminb", sampleSize = list(optimSize = 500, bandSize = 5000), rngSeed = 1 )
datalist |
A list of data sets to compute a function for each of them |
xCol |
A numeric or vector stating the column number of covariates |
yCol |
A numeric value stating the column number of target |
confLevel |
A single value representing the statistical significance level for constructing the band |
testset |
Test points at which the functions will be compared |
limitMemory |
A boolean (True/False) indicating whether to limit the memory use or not. Default is true. If set to true, 5000 datapoints are randomly sampled from each dataset under comparison for inference. |
opt_method |
A string specifying the optimization method to be used for hyperparameter estimation. Current options are: |
sampleSize |
A named list of two integer items: |
rngSeed |
Random seed for sampling data when |
a list containing :
muDiff - A vector of pointwise difference between the predictions from the two datasets (mu2 - mu1)
mu1 - A vector of test prediction for first data set
mu2 - A vector of test prediction for second data set
band - A vector of the allowed statistical difference between functions at testpoints in testset
confLevel - A numeric representing the statistical significance level for constructing the band
testset - A matrix of test points to compare the functions
estimatedParams - A list of estimated hyperparameters for GP
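One way to read the returned muDiff and band together is to flag the test points where the estimated difference falls outside the band. The sketch below assumes the band is a pointwise half-width around zero, so a difference is treated as statistically significant when its absolute value exceeds the band. The numbers are synthetic and the comparison rule is an interpretation of the descriptions above, not code taken from the package.

```r
# Synthetic stand-ins for function_diff$muDiff and function_diff$band.
mu_diff <- c(0.5, -2.1, 3.4, 0.1)
band <- c(1.0, 1.5, 2.0, 1.0)

# Assumed rule: a pointwise difference is significant when it leaves the band.
significant <- abs(mu_diff) > band

# Indices of the test points where the two functions appear to differ.
which(significant)
```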
Prakash, A., Tuo, R., & Ding, Y. (2022). "Gaussian process aided function comparison using noisy scattered data," Technometrics, Vol. 64, No. 1, pp. 92-102, doi:10.1080/00401706.2021.1905073.
datalist = list(data1[1:50,], data2[1:50, ])
xCol = 2
yCol = 7
confLevel = 0.95
testset = seq(4,10,length.out = 10)
function_diff = funGP(datalist, xCol, yCol, confLevel, testset)
Good power curve modeling requires valid power values in the region between cut-in and cut-out wind speed. However, when the turbine is not operating, the power production is recorded as zero or negative. This function replaces those values with predicted values obtained from the estimated tempGP power curve model using one input variable, the wind speed.
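The imputation idea can be sketched in base R under simplifying assumptions. Here a smoothing spline (`smooth.spline`) is swapped in for the tempGP model, the data are synthetic, and the variable names are illustrative. Records with non-positive power between cut-in and cut-out are replaced by the fitted curve, clipped to the range from zero to rated power.

```r
# Synthetic data: a smooth S-shaped curve stands in for a real power curve.
v <- seq(3, 20, length.out = 200)           # wind speed (m/s)
rated_power <- 100
power <- rated_power / (1 + exp(-(v - 9)))  # strictly positive synthetic power
power[seq(10, 200, by = 10)] <- 0           # simulate records logged while not operating

vrange <- c(5, 12, 25)                      # cut-in, rated, cut-out wind speed
in_range <- v >= vrange[1] & v <= vrange[3]
invalid <- in_range & power <= 0            # records to impute

# Fit a smoother on the valid in-range records (swap-in for the tempGP model).
fit <- smooth.spline(v[in_range & !invalid], power[in_range & !invalid])

imputed <- power
pred <- predict(fit, v[invalid])$y
imputed[invalid] <- pmin(pmax(pred, 0), rated_power)  # keep within [0, rated power]
```

Zero readings below the cut-in speed are left untouched in this sketch, since only the region between cut-in and cut-out is considered.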
imptPower( data, powercol, vcol, vrange, rated.power = NULL, sample = TRUE, size = 2500 )
data |
A list of two data sets that require imputation. |
powercol |
A numeric stating the column number of power production. |
vcol |
A numeric stating the column number of wind speed. |
vrange |
A vector of cut-in, rated, and cut-out wind speed. |
rated.power |
A numerical value stating the wind turbine rated power. |
sample |
A boolean (TRUE/FALSE) indicating whether to use sample or the whole data sets to train the power curve. |
size |
A numeric stating the size of sample when |
a list containing datasets with the imputed power.
Latiffianti, E, Ding, Y, Sheng, S, Williams, L, Morshedizadeh, M, Rodgers, M (2022). "Analysis of leading edge protection application on wind turbine performance through energy and power decomposition approaches". Wind Energy. 2022; 1-19. doi:10.1002/we.2722.
data = list(data1[1:100,], data2[1:120, ])
powercol = 7
vcol = 2
vrange = c(5,12,25)
rated.power = 100
sample = FALSE
imputed.dat = imptPower(data, powercol, vcol, vrange, rated.power, sample)
The function models the power curve using KNN, based on the supplied arguments.
KnnPCFit(data, xCol, yCol, subsetSelection = FALSE)
data |
a dataframe or a matrix, to be used in modelling |
xCol |
a vector or numeric values stating the column number of features |
yCol |
a numerical or a vector value stating the column number of target |
subsetSelection |
a boolean, default value is FALSE, if TRUE returns the best feature column number as xCol |
a list containing :
data - The data set provided by user
xCol - The column number of features provided by user or the best subset column number
yCol - The column number of target provided by user
bestK - The best k nearest neighbor calculated using the function
RMSE - The RMSE calculated using the function for provided data using user defined features and best obtained K
MAE - The MAE calculated using the function for provided data using user defined features and best obtained K
data = data1[c(1:100),]
xCol = 2
yCol = 7
subsetSelection = FALSE
knn_model = KnnPCFit(data, xCol, yCol, subsetSelection)
The function can be used to make prediction on test data using trained model
KnnPredict(knnMdl, testData)
knnMdl |
a list containing:
|
testData |
a data frame or matrix, to compute the predictions |
a numeric / vector with predictions on the test data using the model generated by KnnPCFit
data = data1[c(1:100),]
xCol = 2
yCol = 7
subsetSelection = FALSE
knn_model = KnnPCFit(data, xCol, yCol, subsetSelection)
testData = data1[c(101:110), ]
prediction = KnnPredict(knn_model, testData)
The function can be used to update KNN model when new data is provided
KnnUpdate(knnMdl, newData)
knnMdl |
a list containing:
|
newData |
a dataframe or a matrix, to be used for updating the model |
a list containing :
data - The updated data using old data set and new data
xCol - The column number of features provided by user or the best subset column number
yCol - The column number of target provided by user
bestK - The best k nearest neighbor calculated for the new data using user specified features and target
data = data1[c(1:100),]
xCol = 2
yCol = 7
subsetSelection = FALSE
knn_model = KnnPCFit(data, xCol, yCol, subsetSelection)
newData = data1[c(101:110), ]
knn_newmodel = KnnUpdate(knn_model, newData)
predict function for tempGP objects. This function computes the prediction f(x) or f(x) + g(t) depending on the temporal distance between training and test points and whether the time indices for the test points are provided.
## S3 method for class 'tempGP'
predict(object, testX, testT = NULL, trainT = NULL, ...)
object |
An object of class tempGP. |
testX |
A matrix with each column corresponding to one input variable. |
testT |
A vector of time indices of the test points. When |
trainT |
Optional argument to override the existing trainT indices of the |
... |
additional arguments for future development |
A vector of predictions at the test points in testX.
data = DSWE::data1
trainindex = 1:50 #using the first 50 data points to train the model
traindata = data[trainindex,]
xCol = 2 #input variable columns
yCol = 7 #response column
trainX = as.matrix(traindata[,xCol])
trainY = as.numeric(traindata[,yCol])
tempGPObject = tempGP(trainX, trainY)
testdata = DSWE::data1[101:110,] # defining test data
testX = as.matrix(testdata[,xCol, drop = FALSE])
predF = predict(tempGPObject, testX)
Smoothing spline ANOVA method
SplinePCFit(data, xCol, yCol, testX, modelFormula = NULL)
data |
a matrix or dataframe to be used in modelling |
xCol |
a numeric or vector stating the column number of feature covariates |
yCol |
a numeric value stating the column number of target |
testX |
a matrix or dataframe, to be used in computing the predictions |
modelFormula |
default is NULL, else a model formula specifying target and features. Please refer to the 'gss' package documentation for more details |
a vector or numeric predictions on user provided test data
data = data1[c(1:100),]
xCol = 2
yCol = 7
testX = data1[c(101:110), ]
Spline_prediction = SplinePCFit(data, xCol, yCol, testX)
SVM based power curve modelling
SvmPCFit(trainX, trainY, testX, kernel = "radial")
trainX |
a matrix or dataframe to be used in modelling |
trainY |
a numeric or vector as a target |
testX |
a matrix or dataframe, to be used in computing the predictions |
kernel |
default is 'radial' else can be 'linear', 'polynomial' and 'sigmoid' |
a vector or numeric predictions on user provided test data
data = data1
trainX = as.matrix(data[c(1:100),2])
trainY = data[c(1:100),7]
testX = as.matrix(data[c(101:110),2])
Svm_prediction = SvmPCFit(trainX, trainY, testX)
Data synchronization gives a pair of data sets the same size. It is performed by removing some data points from the larger dataset. This step is important when comparing energy production between two data sets because energy production is time-based.
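As a rough illustration of size synchronization, the sketch below trims two synthetic data frames to the size of the smaller one by keeping a random subset of rows. This mirrors the idea behind a random synchronization method under the assumption that rows are dropped uniformly at random; the package's own methods (for example, by time stamp or by power) apply different selection rules, and the column names here are made up.

```r
set.seed(42)  # make the random subset reproducible

# Synthetic data sets of unequal size (illustrative columns only).
d1 <- data.frame(V = runif(200, 3, 20), Y = runif(200, 0, 100))
d2 <- data.frame(V = runif(180, 3, 20), Y = runif(180, 0, 100))

# Target size is the size of the smaller data set.
n_min <- min(nrow(d1), nrow(d2))

# Keep n_min randomly chosen rows from each data set, preserving row order.
sync <- lapply(list(d1, d2), function(d) {
  d[sort(sample.int(nrow(d), n_min)), , drop = FALSE]
})

sapply(sync, nrow)
```

The smaller data set is returned unchanged because all of its rows are selected; only the larger one loses records.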
syncSize(data, powercol, timecol = 0, xcol, method = "minimum power")
data |
A list of two data sets to be synchronized. |
powercol |
A numeric stating the column number of power production. |
timecol |
A numeric stating the column number of data time stamp. Default value is zero. A value other than zero should be provided when |
xcol |
A numeric or vector stating the column number(s) of power curve input covariates/features (to be used for energy decomposition). |
method |
A string specifying data synchronization method. Default value |
a list containing the synchronized datasets.
Latiffianti, E, Ding, Y, Sheng, S, Williams, L, Morshedizadeh, M, Rodgers, M (2022). "Analysis of leading edge protection application on wind turbine performance through energy and power decomposition approaches". Wind Energy. 2022; 1-19. doi:10.1002/we.2722.
data = list(data1[1:200,], data2[1:180, ])
powercol = 7
timecol = 1
xcol = c(2:6)
method = 'random'
sync.dat = syncSize(data, powercol, timecol, xcol, method)

data = list(data1[500:700,], data2[600:750, ])
powercol = 7
timecol = 1
xcol = c(2:6)
method = 'time'
sync.dat = syncSize(data, powercol, timecol, xcol, method)
A Gaussian process based power curve model which explicitly models the temporal aspect of the power curve. The model consists of two parts: f(x) and g(t).
tempGP( trainX, trainY, trainT = NULL, fast_computation = TRUE, limit_memory = 5000L, max_thinning_number = 20L, vecchia = TRUE, optim_control = list(batch_size = 100L, learn_rate = 0.05, max_iter = 5000L, tol = 1e-06, beta1 = 0.9, beta2 = 0.999, epsilon = 1e-08, logfile = NULL) )
trainX |
A matrix with each column corresponding to one input variable. |
trainY |
A vector with each element corresponding to the output at the corresponding row of |
trainT |
A vector for time indices of the data points. By default, the function assigns natural numbers starting from 1 as the time indices. |
fast_computation |
A Boolean that specifies whether to do exact inference or fast approximation. Default is |
limit_memory |
An integer or |
max_thinning_number |
An integer specifying the max lag to compute the thinning number. If the PACF does not become insignificant till |
vecchia |
A Boolean that specifies whether to do exact inference or vecchia approximation. Default is |
optim_control |
A list parameters passed to the Adam optimizer when
|
An object of class tempGP with the following attributes:
trainX - same as the input matrix trainX.
trainY - same as the input vector trainY.
thinningNumber - the thinning number computed by the algorithm.
modelF - A list containing the details of the model for predicting function f(x):
X - The input variable matrix for computing the cross-covariance for predictions, same as trainX unless the model is updated. See updateData.tempGP method for details on updating the model.
y - The response vector, again same as trainY unless the model is updated.
weightedY - The weighted response, that is, the response left multiplied by the inverse of the covariance matrix.
modelG - A list containing the details of the model for predicting function g(t):
residuals - The residuals after subtracting function f(x) from the response. Used to predict g(t). See updateData.tempGP method for updating the residuals.
time_index - The time indices of the residuals, same as trainT.
estimatedParams - Estimated hyperparameters for function f(x).
llval - log-likelihood value of the hyperparameter optimization for f(x).
gradval - gradient vector at the optimal log-likelihood value.
Prakash, A., Tuo, R., & Ding, Y. (2022). "The temporal overfitting problem with applications in wind power curve modeling." Technometrics. doi:10.1080/00401706.2022.2069158.
Katzfuss, M., & Guinness, J. (2021). "A General Framework for Vecchia Approximations of Gaussian Processes." Statistical Science. doi:10.1214/19-STS755.
Guinness, J. (2018). "Permutation and Grouping Methods for Sharpening Gaussian Process Approximations." Technometrics. doi:10.1080/00401706.2018.1437476.
predict.tempGP for computing predictions and updateData.tempGP for updating data in a tempGP object.
data = DSWE::data1
trainindex = 1:50 #using the first 50 data points to train the model
traindata = data[trainindex,]
xCol = 2 #input variable columns
yCol = 7 #response column
trainX = as.matrix(traindata[,xCol])
trainY = as.numeric(traindata[,yCol])
tempGPObject = tempGP(trainX, trainY)
updateData is a generic function to update data in a model.
updateData(object, ...)
object |
A model object |
... |
additional arguments for passing to specific methods |
The returned value depends on the class of its argument object.
This function updates trainX, trainY, and trainT in a tempGP object. By default, if the new data has m data points, the function removes the top m data points from the tempGP object and appends the new data at the bottom, thus keeping the total number of data points the same. This can be overridden by setting replace = FALSE to keep all the data points (old and new). The method also updates modelG by computing and updating residuals at the new data points. modelF can also be updated by setting the argument updateModelF to TRUE, though this is generally not required (see comments in the Arguments section).
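The replace behaviour described above amounts to a sliding window over the training data. The following base R sketch shows that bookkeeping on a plain vector; `update_window` is a hypothetical helper written for illustration and is not part of the package, which additionally updates residuals and, optionally, the model for f(x).

```r
# Hypothetical helper: sliding-window update of a data vector.
update_window <- function(old, new, replace = TRUE) {
  if (!replace) {
    return(c(old, new))                 # keep all old points and append the new ones
  }
  m <- length(new)
  n_keep <- max(length(old) - m, 0)     # number of most recent old points to keep
  c(utils::tail(old, n_keep), new)      # drop the top m old points, append the new data
}

train_y <- 1:10
new_y <- 101:103

replaced <- update_window(train_y, new_y)                   # size stays at 10
appended <- update_window(train_y, new_y, replace = FALSE)  # size grows to 13
```

With replacement the three oldest values are dropped, so the window holds 4 through 10 followed by the three new values. How the package handles a new batch larger than the stored data is not stated above; this sketch simply keeps the new batch in that case.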
## S3 method for class 'tempGP'
updateData( object, newX, newY, newT = NULL, replace = TRUE, updateModelF = FALSE, ... )
object |
An object of class tempGP. |
newX |
A matrix with each column corresponding to one input variable. |
newY |
A vector with each element corresponding to the output at the corresponding row of |
newT |
A vector with time indices of the new datapoints. If |
replace |
A boolean to specify whether to replace the old data with the new one, or to add the new data while still keeping all the old data. Default is TRUE, which replaces the top |
updateModelF |
A boolean to specify whether to update |
... |
additional arguments for future development |
An updated object of class tempGP.
data = DSWE::data1
trainindex = 1:50 #using the first 50 data points to train the model
traindata = data[trainindex,]
xCol = 2 #input variable columns
yCol = 7 #response column
trainX = as.matrix(traindata[,xCol])
trainY = as.numeric(traindata[,yCol])
tempGPObject = tempGP(trainX, trainY)
newdata = DSWE::data1[101:110,] # defining new data
newX = as.matrix(newdata[,xCol, drop = FALSE])
newY = as.numeric(newdata[,yCol])
tempGPupdated = updateData(tempGPObject, newX, newY)
xgboost based power curve modelling
XgbPCFit( trainX, trainY, testX, max.depth = 8, eta = 0.25, nthread = 2, nrounds = 5 )
trainX |
a matrix or dataframe to be used in modelling |
trainY |
a numeric or vector as a target |
testX |
a matrix or dataframe, to be used in computing the predictions |
max.depth |
maximum depth of a tree |
eta |
learning rate |
nthread |
This parameter specifies the number of CPU threads that XGBoost |
nrounds |
number of boosting rounds or trees to build |
a vector or numeric predictions on user provided test data
Chen, T., & Guestrin, C. (2016). "XGBoost: A Scalable Tree Boosting System." Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 785-794. doi:10.1145/2939672.2939785.
data = data1
trainX = as.matrix(data[c(1:100),2])
trainY = data[c(1:100),7]
testX = as.matrix(data[c(101:110),2])
Xgb_prediction = XgbPCFit(trainX, trainY, testX)