Sat Jan 11, 2020 9:51 am

NAME Add-DatabricksSparkSubmitJob

SYNOPSIS

Creates a Spark-Submit job in Databricks. The script uses the Databricks API 2.0 create job request:

https://docs.azuredatabricks.net/api/la ... tml#create

SYNTAX

Add-DatabricksSparkSubmitJob [[-BearerToken] <String>] [[-Region] <String>] [-JobName] <String>
[-SparkVersion] <String> [-NodeType] <String> [[-DriverNodeType] <String>]
[-MinNumberOfWorkers] <Int32> [-MaxNumberOfWorkers] <Int32> [[-Timeout] <Int32>]
[[-MaxRetries] <Int32>] [[-ScheduleCronExpression] <String>] [[-Timezone] <String>]
[[-SparkSubmitParameters] <String[]>] [[-PythonVersion] <String>] [[-Spark_conf] <Hashtable>]
[[-CustomTags] <Hashtable>] [[-InitScripts] <String[]>] [[-SparkEnvVars] <Hashtable>]
[[-ClusterLogPath] <String>] [[-InstancePoolId] <String>] [<CommonParameters>]

DESCRIPTION

Creates a Spark-Submit job in Databricks. The script uses the Databricks API 2.0 create job request:

https://docs.azuredatabricks.net/api/la ... tml#create

If the job name exists it will be updated instead of creating a new job.

Spark-Submit does not support including libraries on the cluster. Instead, use --jars in the SparkSubmitParameters.

Spark-Submit does not support using existing clusters.
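
As a minimal sketch of the --jars note above, the parameter array might be built as follows. The JAR paths, class name, and trailing argument are hypothetical placeholders chosen for illustration; they do not come from the module.

```powershell
# Hypothetical values for illustration only.
$DependencyJar = "dbfs:/mylibraries/dependency.jar"
$AppJar        = "dbfs:/myapp/app.jar"
$MainClass     = "com.example.Main"

# Libraries cannot be attached to a Spark-Submit job, so dependencies
# are handed to spark-submit itself through its --jars option.
$SparkSubmitParameters = @(
    "--jars",  $DependencyJar,
    "--class", $MainClass,
    $AppJar,
    "myparam"
)

# Show the resulting argument list as a single line.
$SparkSubmitParameters -join " "
```

The array would then be passed as -SparkSubmitParameters $SparkSubmitParameters.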

PARAMETERS

-BearerToken <String>

Your Databricks Bearer token to authenticate to your workspace (see User Settings in the Databricks web UI)

Required? false

Position? 1

Default value

Accept pipeline input? false

Accept wildcard characters? false

-Region <String>

Azure Region - must match the URL of your Databricks workspace, example: northeurope

Required? false

Position? 2

Default value

Accept pipeline input? false

Accept wildcard characters? false

-JobName <String>

Name of the job that will appear in the Job list. If a job with this name exists

it will be updated.

Required? true

Position? 3

Default value

Accept pipeline input? false

Accept wildcard characters? false

-SparkVersion <String>

Spark version for cluster that will run the job. Example: 5.3.x-scala2.11

Required? true

Position? 4

Default value

Accept pipeline input? false

Accept wildcard characters? false

-NodeType <String>

Type of worker for cluster that will run the job. Example: Standard_D3_v2.

Required? true

Position? 5

Default value

Accept pipeline input? false

Accept wildcard characters? false

-DriverNodeType <String>

Type of driver for cluster that will run the job. Example: Standard_D3_v2.

If not provided the NodeType will be used.

Required? false

Position? 6

Default value

Accept pipeline input? false

Accept wildcard characters? false

-MinNumberOfWorkers <Int32>

Minimum number of workers for the cluster that will run the job.

Note: If Min & Max Workers are the same autoscale is disabled.

Required? true

Position? 7

Default value 0

Accept pipeline input? false

Accept wildcard characters? false

-MaxNumberOfWorkers <Int32>

Maximum number of workers for the cluster that will run the job.

Note: If Min & Max Workers are the same autoscale is disabled.

Required? true

Position? 8

Default value 0

Accept pipeline input? false

Accept wildcard characters? false
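
The autoscale note for the two worker-count parameters can be sketched as below. This is an assumption about how the values could map onto the cluster specification, not the module's actual code: Get-WorkerSpec is a hypothetical helper, while num_workers and autoscale (min_workers, max_workers) are the field names used by the Databricks API 2.0 cluster specification.

```powershell
# Hypothetical helper, for illustration only.
function Get-WorkerSpec {
    param(
        [Parameter(Mandatory = $true)][int]$MinNumberOfWorkers,
        [Parameter(Mandatory = $true)][int]$MaxNumberOfWorkers
    )

    if ($MinNumberOfWorkers -eq $MaxNumberOfWorkers) {
        # Same value for both bounds: fixed-size cluster, autoscale disabled.
        return @{ num_workers = $MinNumberOfWorkers }
    }

    # Different bounds: the cluster autoscales between them.
    return @{
        autoscale = @{
            min_workers = $MinNumberOfWorkers
            max_workers = $MaxNumberOfWorkers
        }
    }
}

Get-WorkerSpec -MinNumberOfWorkers 2 -MaxNumberOfWorkers 2 | ConvertTo-Json
Get-WorkerSpec -MinNumberOfWorkers 2 -MaxNumberOfWorkers 8 | ConvertTo-Json
```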

-Timeout <Int32>

Timeout, in seconds, applied to each run of the job. If not set, there will be no timeout.

Required? false

Position? 9

Default value 0

Accept pipeline input? false

Accept wildcard characters? false

-MaxRetries <Int32>

An optional maximum number of times to retry an unsuccessful run. A run is considered to be unsuccessful if it

completes with a FAILED result_state or INTERNAL_ERROR life_cycle_state. The value -1 means to retry

indefinitely and the value 0 means to never retry. If not set, the default behavior will be never retry.

Required? false

Position? 10

Default value 0

Accept pipeline input? false

Accept wildcard characters? false

-ScheduleCronExpression <String>

By default, the job runs only when triggered from the Jobs UI or by an API run request. To run the job

periodically, provide a cron schedule expression. For how to compose a cron schedule expression, see:

http://www.quartz-scheduler.org/documen ... on-06.html

Required? false

Position? 11

Default value

Accept pipeline input? false

Accept wildcard characters? false

-Timezone <String>

Timezone for Cron Schedule Expression. Required if ScheduleCronExpression provided. See here for all possible

timezones: http://joda-time.sourceforge.net/timezones.html

Example: UTC

Required? false

Position? 12

Default value

Accept pipeline input? false

Accept wildcard characters? false
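
The pairing of ScheduleCronExpression and Timezone might look like this when assembled into job settings. This is a sketch under assumptions, not the module's implementation: the $JobSettings variable and the validation message are hypothetical, while quartz_cron_expression and timezone_id are the schedule field names in the Databricks Jobs API 2.0.

```powershell
# Example values taken from this help text.
$ScheduleCronExpression = "0 15 22 ? * *"   # Quartz syntax: every day at 22:15:00
$Timezone               = "UTC"

# Hypothetical settings object, for illustration only.
$JobSettings = @{ name = "Job1" }

if ($ScheduleCronExpression) {
    # A schedule without a timezone is incomplete, so fail early.
    if (-not $Timezone) {
        throw "Timezone is required when ScheduleCronExpression is provided."
    }
    $JobSettings.schedule = @{
        quartz_cron_expression = $ScheduleCronExpression
        timezone_id            = $Timezone
    }
}

$JobSettings | ConvertTo-Json -Depth 3
```

When no cron expression is given, the schedule key is simply absent and the job runs only on demand.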

-SparkSubmitParameters <String[]>

Array for parameters for job, for example "--pyFiles", "dbfs:/myscript.py", "myparam"

Required? false

Position? 13

Default value

Accept pipeline input? false

Accept wildcard characters? false

-PythonVersion <String>

2 or 3. Defaults to 3.

Required? false

Position? 14

Default value 3

Accept pipeline input? false

Accept wildcard characters? false

-Spark_conf <Hashtable>

Hashtable of Spark configuration key-value pairs.

Example @{"spark.speculation"=$true; "spark.streaming.ui.retainedBatches"= 5}

Required? false

Position? 15

Default value

Accept pipeline input? false

Accept wildcard characters? false

-CustomTags <Hashtable>

Custom Tags to set, provide hash table of tags. Example: @{CreatedBy="SimonDM";NumOfNodes=2;CanDelete=$true}

Required? false

Position? 16

Default value

Accept pipeline input? false

Accept wildcard characters? false

-InitScripts <String[]>

Init scripts to run post creation. Example: "dbfs:/script/script1", "dbfs:/script/script2"

Required? false

Position? 17

Default value

Accept pipeline input? false

Accept wildcard characters? false

-SparkEnvVars <Hashtable>

An object containing a set of optional, user-specified environment variable key-value pairs. Key-value pairs

of the form (X,Y) are exported as is (i.e., export X='Y') while launching the driver and workers.

Example: @{SPARK_WORKER_MEMORY="29000m";SPARK_LOCAL_DIRS="/local_disk0"}

Required? false

Position? 18

Default value

Accept pipeline input? false

Accept wildcard characters? false
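
To make the (X,Y) to export X='Y' description of SparkEnvVars concrete, the following sketch renders the example hashtable in that form. It only illustrates the mapping described above; the actual export happens on the cluster, and the $ExportLines variable is a name introduced here for illustration.

```powershell
# Example hashtable from the parameter description.
$SparkEnvVars = @{
    SPARK_WORKER_MEMORY = "29000m"
    SPARK_LOCAL_DIRS    = "/local_disk0"
}

# Render each key-value pair as export X='Y'.
# Sorting by key keeps the output order stable, since hashtables are unordered.
$ExportLines = $SparkEnvVars.GetEnumerator() |
    Sort-Object Key |
    ForEach-Object { "export {0}='{1}'" -f $_.Key, $_.Value }

$ExportLines
```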

-ClusterLogPath <String>

Required? false

Position? 19

Default value

Accept pipeline input? false

Accept wildcard characters? false

-InstancePoolId <String>

Required? false

Position? 20

Default value

Accept pipeline input? false

Accept wildcard characters? false

<CommonParameters>

This cmdlet supports the common parameters: Verbose, Debug,

ErrorAction, ErrorVariable, WarningAction, WarningVariable,

OutBuffer, PipelineVariable, and OutVariable. For more information, see

about_CommonParameters (https://go.microsoft.com/fwlink/?LinkID=113216).

INPUTS

OUTPUTS

NOTES

Author: Simon D'Morias / Data Thirst Ltd

-------------------------- EXAMPLE 1 --------------------------

PS C:\>Add-DatabricksSparkSubmitJob -BearerToken $BearerToken -Region $Region -JobName "Job1" -SparkVersion
"5.3.x-scala2.11" -NodeType "Standard_D3_v2" -MinNumberOfWorkers 2 -MaxNumberOfWorkers 2 -Timeout 100 -MaxRetries
3 -ScheduleCronExpression "0 15 22 ? * *" -Timezone "UTC" -SparkSubmitParameters "--pyFiles", "dbfs:/myscript.py",
"myparam"

The above example creates a Spark-Submit job that runs on a new cluster.

RELATED LINKS

Add-DatabricksSparkSubmitJob
