Add-DatabricksSparkSubmitJob
NAME Add-DatabricksSparkSubmitJob
SYNOPSIS
Creates a Spark-Submit job in Databricks. The script uses the Databricks API 2.0 create job query:
https://docs.azuredatabricks.net/api/la ... tml#create
SYNTAX
Add-DatabricksSparkSubmitJob [[-BearerToken] <String>] [[-Region] <String>] [-JobName] <String> [-SparkVersion]
<String> [-NodeType] <String> [[-DriverNodeType] <String>] [-MinNumberOfWorkers] <Int32> [-MaxNumberOfWorkers]
<Int32> [[-Timeout] <Int32>] [[-MaxRetries] <Int32>] [[-ScheduleCronExpression] <String>] [[-Timezone] <String>]
[[-SparkSubmitParameters] <String[]>] [[-PythonVersion] <String>] [[-Spark_conf] <Hashtable>] [[-CustomTags]
<Hashtable>] [[-InitScripts] <String[]>] [[-SparkEnvVars] <Hashtable>] [[-ClusterLogPath] <String>]
[[-InstancePoolId] <String>] [<CommonParameters>]
DESCRIPTION
Creates a Spark-Submit job in Databricks. The script uses the Databricks API 2.0 create job query:
https://docs.azuredatabricks.net/api/la ... tml#create
If the job name exists it will be updated instead of creating a new job.
Spark-Submit does not support including libraries on the cluster. Instead, use --jars in the SparkSubmitParameters.
Spark-Submit does not support using existing clusters.
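As a rough sketch of the --jars approach described above, the dependency can be passed as part of the spark-submit arguments rather than as a cluster library. The jar paths and the main class below are hypothetical placeholders, and $BearerToken and $Region are assumed to be set already.

```powershell
# Hypothetical paths and class name, shown only to illustrate the argument order.
$sparkSubmitParameters = @(
    "--jars", "dbfs:/mylibraries/test.jar",   # dependency jar handed to spark-submit
    "--class", "com.example.Main",            # hypothetical main class
    "dbfs:/myapp/app.jar"                     # hypothetical application jar
)

# Spark-Submit jobs always run on a new cluster, so the cluster settings are required.
Add-DatabricksSparkSubmitJob -BearerToken $BearerToken -Region $Region -JobName "Job1" `
    -SparkVersion "5.3.x-scala2.11" -NodeType "Standard_D3_v2" `
    -MinNumberOfWorkers 2 -MaxNumberOfWorkers 2 `
    -SparkSubmitParameters $sparkSubmitParameters
```

Each element of the array is passed as a separate argument, so a flag and its value are listed as two consecutive strings.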
PARAMETERS
-BearerToken <String>
Your Databricks Bearer token to authenticate to your workspace (see User Settings in the Databricks web UI)
Required? false
Position? 1
Default value
Accept pipeline input? false
Accept wildcard characters? false
-Region <String>
Azure Region - must match the URL of your Databricks workspace, example: northeurope
Required? false
Position? 2
Default value
Accept pipeline input? false
Accept wildcard characters? false
-JobName <String>
Name of the job that will appear in the Job list. If a job with this name exists
it will be updated.
Required? true
Position? 3
Default value
Accept pipeline input? false
Accept wildcard characters? false
-SparkVersion <String>
Spark version for cluster that will run the job. Example: 5.3.x-scala2.11
Required? true
Position? 4
Default value
Accept pipeline input? false
Accept wildcard characters? false
-NodeType <String>
Type of worker for cluster that will run the job. Example: Standard_D3_v2.
Required? true
Position? 5
Default value
Accept pipeline input? false
Accept wildcard characters? false
-DriverNodeType <String>
Type of driver for cluster that will run the job. Example: Standard_D3_v2.
If not provided the NodeType will be used.
Required? false
Position? 6
Default value
Accept pipeline input? false
Accept wildcard characters? false
-MinNumberOfWorkers <Int32>
Minimum number of workers for the cluster that will run the job.
Note: If the minimum and maximum number of workers are the same, autoscale is disabled.
Required? true
Position? 7
Default value 0
Accept pipeline input? false
Accept wildcard characters? false
-MaxNumberOfWorkers <Int32>
Maximum number of workers for the cluster that will run the job.
Note: If the minimum and maximum number of workers are the same, autoscale is disabled.
Required? true
Position? 8
Default value 0
Accept pipeline input? false
Accept wildcard characters? false
-Timeout <Int32>
Timeout, in seconds, applied to each run of the job. If not set, there will be no timeout.
Required? false
Position? 9
Default value 0
Accept pipeline input? false
Accept wildcard characters? false
-MaxRetries <Int32>
An optional maximum number of times to retry an unsuccessful run. A run is considered to be unsuccessful if it
completes with a FAILED result_state or INTERNAL_ERROR life_cycle_state. The value -1 means to retry
indefinitely and the value 0 means to never retry. If not set, the default behavior will be never retry.
Required? false
Position? 10
Default value 0
Accept pipeline input? false
Accept wildcard characters? false
-ScheduleCronExpression <String>
By default, the job runs only when triggered from the Jobs UI or by an API request. You can provide a cron
schedule expression to run the job periodically. How to compose a cron schedule expression:
http://www.quartz-scheduler.org/documen ... on-06.html
Required? false
Position? 11
Default value
Accept pipeline input? false
Accept wildcard characters? false
-Timezone <String>
Timezone for Cron Schedule Expression. Required if ScheduleCronExpression provided. See here for all possible
timezones: http://joda-time.sourceforge.net/timezones.html
Example: UTC
Required? false
Position? 12
Default value
Accept pipeline input? false
Accept wildcard characters? false
-SparkSubmitParameters <String[]>
Array of parameters for the job, for example "--pyFiles", "dbfs:/myscript.py", "myparam"
Required? false
Position? 13
Default value
Accept pipeline input? false
Accept wildcard characters? false
-PythonVersion <String>
Python version to use: 2 or 3. Defaults to 3.
Required? false
Position? 14
Default value 3
Accept pipeline input? false
Accept wildcard characters? false
-Spark_conf <Hashtable>
Hashtable of Spark configuration key-value pairs.
Example: @{"spark.speculation"=$true; "spark.streaming.ui.retainedBatches"=5}
Required? false
Position? 15
Default value
Accept pipeline input? false
Accept wildcard characters? false
-CustomTags <Hashtable>
Custom Tags to set, provide hash table of tags. Example: @{CreatedBy="SimonDM";NumOfNodes=2;CanDelete=$true}
Required? false
Position? 16
Default value
Accept pipeline input? false
Accept wildcard characters? false
-InitScripts <String[]>
Init scripts to run post creation. Example: "dbfs:/script/script1", "dbfs:/script/script2"
Required? false
Position? 17
Default value
Accept pipeline input? false
Accept wildcard characters? false
-SparkEnvVars <Hashtable>
An object containing a set of optional, user-specified environment variable key-value pairs. Key-value pairs
of the form (X,Y) are exported as is (i.e., export X='Y') while launching the driver and workers.
Example: @{SPARK_WORKER_MEMORY="29000m";SPARK_LOCAL_DIRS="/local_disk0"}
Required? false
Position? 18
Default value
Accept pipeline input? false
Accept wildcard characters? false
-ClusterLogPath <String>
Required? false
Position? 19
Default value
Accept pipeline input? false
Accept wildcard characters? false
-InstancePoolId <String>
Required? false
Position? 20
Default value
Accept pipeline input? false
Accept wildcard characters? false
<CommonParameters>
This cmdlet supports the common parameters: Verbose, Debug,
ErrorAction, ErrorVariable, WarningAction, WarningVariable,
OutBuffer, PipelineVariable, and OutVariable. For more information, see
about_CommonParameters (https://go.microsoft.com/fwlink/?LinkID=113216).
INPUTS
OUTPUTS
NOTES
Author: Simon D'Morias / Data Thirst Ltd
-------------------------- EXAMPLE 1 --------------------------
PS C:\>Add-DatabricksSparkSubmitJob -BearerToken $BearerToken -Region $Region -JobName "Job1" -SparkVersion
"5.3.x-scala2.11" -NodeType "Standard_D3_v2" -MinNumberOfWorkers 2 -MaxNumberOfWorkers 2 -Timeout 100 -MaxRetries
3 -ScheduleCronExpression "0 15 22 ? * *" -Timezone "UTC" -SparkSubmitParameters "--pyFiles", "dbfs:/myscript.py",
"myparam"
The above example creates a job on a new cluster.
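With this many parameters, the call may be easier to read when the arguments are collected in a hashtable and splatted. The following is only a sketch: it reuses the sample values from the parameter descriptions above, and the variable name $jobParams is an arbitrary choice. $BearerToken is assumed to be set already.

```powershell
# Keys are the parameter names from SYNTAX; values are the samples used in this help text.
$jobParams = @{
    BearerToken            = $BearerToken
    Region                 = "northeurope"
    JobName                = "Job1"
    SparkVersion           = "5.3.x-scala2.11"
    NodeType               = "Standard_D3_v2"
    MinNumberOfWorkers     = 2
    MaxNumberOfWorkers     = 4                  # differs from the minimum; per the parameter note, autoscale is only disabled when both match
    ScheduleCronExpression = "0 15 22 ? * *"    # Quartz syntax: 22:15:00 every day
    Timezone               = "UTC"              # required because a cron expression is given
    Spark_conf             = @{"spark.speculation" = $true; "spark.streaming.ui.retainedBatches" = 5}
    CustomTags             = @{CreatedBy = "SimonDM"; NumOfNodes = 2; CanDelete = $true}
    SparkEnvVars           = @{SPARK_WORKER_MEMORY = "29000m"; SPARK_LOCAL_DIRS = "/local_disk0"}
    InitScripts            = "dbfs:/script/script1", "dbfs:/script/script2"
    SparkSubmitParameters  = "--pyFiles", "dbfs:/myscript.py", "myparam"
}

# Splatting (@ instead of $) binds each key to the parameter of the same name.
Add-DatabricksSparkSubmitJob @jobParams
```

Because a job with the same name is updated rather than duplicated, running the same splatted call again should update "Job1" in place.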
RELATED LINKS