Tips for Production

Manage Dependencies

Finn is updated regularly, so a best practice is to pin a specific version of Finn for your production forecast. On your local machine you can do this with the renv package; in the cloud you can run Finn inside Docker containers.
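As a minimal sketch of the renv approach, the helper below installs one specific Finn version into an renv project and records it in the lockfile. The function name pin_finn_version is hypothetical, and the sketch assumes that the project was already initialised with renv::init() and that the package is installed under the name finnts.

```r
# Hypothetical helper: pin one Finn (finnts) version in an existing renv project.
# Assumes renv::init() has already been run in the project directory.
pin_finn_version <- function(version) {
  stopifnot(is.character(version), length(version) == 1, nzchar(version))

  # renv accepts "package@version" to install one specific release.
  renv::install(paste0("finnts@", version))

  # Record the exact package versions now in use in renv.lock.
  renv::snapshot(prompt = FALSE)

  invisible(version)
}
```

Call the helper with a version you have validated for your forecast. On another machine, or inside a container build, renv::restore() reinstalls the versions recorded in renv.lock.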

Azure ML Pipelines

Finn was built to run at scale in Azure, using Spark as the parallel back end. See the parallel processing vignette to learn how to get Finn running on Azure services like Databricks. The best way to run Finn in production is through Azure Machine Learning, specifically Azure ML Pipelines.

Below are a few tips for using Azure ML Pipelines:

  • Leverage the Azure ML CLI v2 interface to submit R scripts in a pipeline.
  • Use the sub components of Finn instead of forecast_time_series() so that each step of the Finn forecast process can run as its own pipeline step.
  • If you use the sub components of Finn, set add_unique_id to FALSE within set_run_info(). That lets you call the exact same run info in each separate pipeline step. Give set_run_info() your own unique run_name: one that differs from previous Finn runs but is identical across every pipeline step of the current run.
  • Connect your Azure ML Pipeline to a Spark compute cluster, through either Azure Databricks or Azure Synapse, and mount an Azure Data Lake Storage connection. Also consider using different Spark cluster configurations depending on the Finn forecast sub component you are running in each pipeline step.
    • For prep_data() and prep_models(), use the default Spark cluster settings, where each Spark task gets sent to a specific core on an executor.
    • For all other functions, such as train_models(), ensemble_models(), and final_models(), consider setting “spark.executor.cores” to 1 on the Spark cluster and setting inner_parallel to TRUE within each function. That way only a single task (one time series) gets sent to each Spark executor node, and all cores within that node can be used during the modeling process for that time series. Use num_cores within each function to control how many cores on the executor to use; the default is all available cores minus one. This can significantly speed up run time, but if you have many time series and want to run all of the models within Finn, ensure that your Spark cluster can scale to many VMs.
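The tips above can be sketched as a set of R functions, one per pipeline step. This is an illustration under stated assumptions, not a definitive implementation: the helper names (get_run_info, run_prep_step, run_model_steps), the experiment name, and the column names are all hypothetical, and each step is assumed to run in its own R script that has already opened a Spark connection as described in the parallel processing vignette.

```r
# Hypothetical helpers: in a real pipeline, each step function would be called
# from its own R script, submitted as one Azure ML pipeline step.

# Shared run info. add_unique_id = FALSE keeps the run name unchanged, so every
# pipeline step that passes the same run_name refers to the same Finn run.
get_run_info <- function(run_name, path) {
  finnts::set_run_info(
    experiment_name = "finn_production", # assumed experiment name
    run_name = run_name,                 # new per run, same in every step
    path = path,                         # e.g. a mounted data lake folder
    add_unique_id = FALSE
  )
}

# Step 1: prepare data and models on a cluster with default Spark settings.
# input_data is assumed to be a data frame already available to Spark, with
# hypothetical columns "id" and "value" next to the date column.
run_prep_step <- function(run_name, path, input_data) {
  run_info <- get_run_info(run_name, path)

  finnts::prep_data(
    run_info = run_info,
    input_data = input_data,
    combo_variables = c("id"),
    target_variable = "value",
    date_type = "month",
    forecast_horizon = 3,
    parallel_processing = "spark"
  )

  finnts::prep_models(run_info = run_info)

  invisible(run_info)
}

# Step 2: train, ensemble, and select final models. Assumes the cluster was
# created with spark.executor.cores set to 1, so each executor handles one
# time series and inner_parallel = TRUE spreads the work across its cores.
run_model_steps <- function(run_name, path, num_cores = NULL) {
  run_info <- get_run_info(run_name, path)

  finnts::train_models(
    run_info = run_info,
    parallel_processing = "spark",
    inner_parallel = TRUE,
    num_cores = num_cores # NULL keeps the default (all cores minus one)
  )

  finnts::ensemble_models(
    run_info = run_info,
    parallel_processing = "spark",
    inner_parallel = TRUE,
    num_cores = num_cores
  )

  finnts::final_models(
    run_info = run_info,
    parallel_processing = "spark",
    inner_parallel = TRUE,
    num_cores = num_cores
  )

  invisible(run_info)
}
```

Splitting the work this way lets the prep step and the modeling steps attach to differently configured Spark clusters while still sharing one run, because both rebuild the same run info from the same run_name and path.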