
spark-submit job.py can't use sparkConf to pass configuration parameters #24

Open
bizarte opened this issue Apr 3, 2017 · 1 comment

bizarte commented Apr 3, 2017

I'm submitting my Python job using the bin/spark-submit script. I want to set configuration parameters in the Python code through the SparkConf and SparkContext classes. I set appName, master, spark.cores.max, and spark.scheduler.mode on a SparkConf object and passed it as the conf argument to the SparkContext. It turned out that the job was not sent to the master of the standalone cluster I set up (it just ran locally). The print statements below clearly show that the SparkConf I passed to the SparkContext has all four settings, but the conf reported by the SparkContext object contains none of my updates, only the defaults.

As alternatives, I tried using conf/spark-defaults.conf, and also --properties-file my-spark.conf --master spark://spark-master:7077. Both work.
I also tried setting the master separately via the master parameter, sc = pyspark.SparkContext(master="spark://spark-master:7077", conf=conf), and that worked as well!
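
For reference, a minimal sketch of that last workaround, with the remaining settings still supplied through SparkConf:

import pyspark

# Master passed directly to SparkContext instead of via SparkConf.setMaster().
conf = pyspark.SparkConf()
conf.setAppName("word-count")
conf.set("spark.scheduler.mode", "FAIR")
conf.set("spark.cores.max", "4")
sc = pyspark.SparkContext(master="spark://spark-master:7077", conf=conf)
print(sc.getConf().get("spark.master"))  # reports spark://spark-master:7077 here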

So it seems that only the conf parameter is not picked up by the SparkContext correctly.

In that working case, you can see that the master did get updated:

SparkContextConf: [('spark.app.name', 'test.py'), ('spark.rdd.compress', 'True'), ('spark.driver.port', '41147'), ('spark.app.id', 'app-20170403234627-0001'), ('spark.master', 'spark://spark-master:7077'), ('spark.serializer.objectStreamReset', '100'), ('spark.executor.id', 'driver'), ('spark.submit.deployMode', 'client'), ('spark.files', 'file:/mnt/scratch/yidi/docker-volume/test.py'), ('spark.driver.host', '10.0.0.5')]

Here is my Python job code:

import operator
import pyspark

def main():
    '''Program entry point'''

    import socket
    print(socket.gethostbyname(socket.gethostname()))
    print('!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!')
    num_cores = 4
    conf = pyspark.SparkConf()
    conf.setAppName("word-count").setMaster("spark://spark-master:7077")  # setExecutorEnv("CLASSPATH", path)
    conf.set("spark.scheduler.mode", "FAIR")
    conf.set("spark.cores.max", num_cores)
    print("Conf: {}".format(conf.getAll()))
    sc = pyspark.SparkContext(conf=conf)
    print("SparkContextConf: {}".format(sc.getConf().getAll()))

    # Initialize a Spark context
    # with pyspark.SparkContext(conf=conf) as sc:
    #     # Get an RDD containing lines from this text file
    #     lines = sc.textFile("/mnt/scratch/yidi/docker-volume/war_and_peace.txt")
    #     # Split each line into words and assign a frequency of 1 to each word
    #     words = lines.flatMap(lambda line: line.split(" ")).map(lambda word: (word, 1))
    #     # Count the frequency of each word
    #     counts = words.reduceByKey(operator.add)
    #     # Sort the counts in descending order based on word frequency
    #     sorted_counts = counts.sortBy(lambda x: x[1], False)
    #     # Get an iterator over the counts to print each word and its frequency
    #     for word, count in sorted_counts.toLocalIterator():
    #         print("{} --> {}".format(word.encode('utf-8'), count))
    # print("Spark conf: {}".format(conf.getAll()))
if __name__ == "__main__":
    main()

Console output from running the job above:

root@bdd1ba99bf4f:/usr/spark-2.1.0# bin/spark-submit /mnt/scratch/yidi/docker-volume/test.py 
10.0.0.5
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
Conf: dict_items([('spark.scheduler.mode', 'FAIR'), ('spark.app.name', 'word-count'), ('spark.master', 'spark://spark-master:7077'), ('spark.cores.max', '4')])
17/04/03 23:24:27 INFO spark.SparkContext: Running Spark version 2.1.0
17/04/03 23:24:27 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/04/03 23:24:27 INFO spark.SecurityManager: Changing view acls to: root
17/04/03 23:24:27 INFO spark.SecurityManager: Changing modify acls to: root
17/04/03 23:24:27 INFO spark.SecurityManager: Changing view acls groups to: 
17/04/03 23:24:27 INFO spark.SecurityManager: Changing modify acls groups to: 
17/04/03 23:24:27 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(root); groups with view permissions: Set(); users  with modify permissions: Set(root); groups with modify permissions: Set()
17/04/03 23:24:27 INFO util.Utils: Successfully started service 'sparkDriver' on port 38193.
17/04/03 23:24:27 INFO spark.SparkEnv: Registering MapOutputTracker
17/04/03 23:24:27 INFO spark.SparkEnv: Registering BlockManagerMaster
17/04/03 23:24:27 INFO storage.BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
17/04/03 23:24:27 INFO storage.BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
17/04/03 23:24:27 INFO storage.DiskBlockManager: Created local directory at /tmp/blockmgr-7b545c86-9344-4f73-a3db-e891c562714d
17/04/03 23:24:27 INFO memory.MemoryStore: MemoryStore started with capacity 366.3 MB
17/04/03 23:24:27 INFO spark.SparkEnv: Registering OutputCommitCoordinator
17/04/03 23:24:27 INFO util.log: Logging initialized @1090ms
17/04/03 23:24:27 INFO server.Server: jetty-9.2.z-SNAPSHOT
17/04/03 23:24:27 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@3950ddbb{/jobs,null,AVAILABLE}
17/04/03 23:24:27 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@286430d5{/jobs/json,null,AVAILABLE}
17/04/03 23:24:27 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@404eb034{/jobs/job,null,AVAILABLE}
17/04/03 23:24:27 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@79e3feec{/jobs/job/json,null,AVAILABLE}
17/04/03 23:24:27 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@4661596e{/stages,null,AVAILABLE}
17/04/03 23:24:27 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@4f8a3bef{/stages/json,null,AVAILABLE}
17/04/03 23:24:27 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@7a70fd3a{/stages/stage,null,AVAILABLE}
17/04/03 23:24:27 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@1c826806{/stages/stage/json,null,AVAILABLE}
17/04/03 23:24:27 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@50e4e8d1{/stages/pool,null,AVAILABLE}
17/04/03 23:24:27 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@4e2fe461{/stages/pool/json,null,AVAILABLE}
17/04/03 23:24:27 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@33cb59b3{/storage,null,AVAILABLE}
17/04/03 23:24:27 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@3c06c594{/storage/json,null,AVAILABLE}
17/04/03 23:24:27 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@4b5300a5{/storage/rdd,null,AVAILABLE}
17/04/03 23:24:27 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@7a6ef942{/storage/rdd/json,null,AVAILABLE}
17/04/03 23:24:27 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@1301217d{/environment,null,AVAILABLE}
17/04/03 23:24:27 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@19217cec{/environment/json,null,AVAILABLE}
17/04/03 23:24:27 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@4a241145{/executors,null,AVAILABLE}
17/04/03 23:24:27 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@470d45aa{/executors/json,null,AVAILABLE}
17/04/03 23:24:27 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@5d9d8eff{/executors/threadDump,null,AVAILABLE}
17/04/03 23:24:27 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@4fc94fbc{/executors/threadDump/json,null,AVAILABLE}
17/04/03 23:24:27 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@258dd139{/static,null,AVAILABLE}
17/04/03 23:24:27 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@880f037{/,null,AVAILABLE}
17/04/03 23:24:27 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@395b7dae{/api,null,AVAILABLE}
17/04/03 23:24:27 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@3cea7196{/jobs/job/kill,null,AVAILABLE}
17/04/03 23:24:27 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@77257b2b{/stages/stage/kill,null,AVAILABLE}
17/04/03 23:24:27 INFO server.ServerConnector: Started ServerConnector@788dbc45{HTTP/1.1}{0.0.0.0:4040}
17/04/03 23:24:27 INFO server.Server: Started @1154ms
17/04/03 23:24:27 INFO util.Utils: Successfully started service 'SparkUI' on port 4040.
17/04/03 23:24:27 INFO ui.SparkUI: Bound SparkUI to 0.0.0.0, and started at http://localhost:4040
17/04/03 23:24:27 INFO spark.SparkContext: Added file file:/mnt/scratch/yidi/docker-volume/test.py at file:/mnt/scratch/yidi/docker-volume/test.py with timestamp 1491261867734
17/04/03 23:24:27 INFO util.Utils: Copying /mnt/scratch/yidi/docker-volume/test.py to /tmp/spark-5bb22b93-172d-40d6-8a37-6e3da5dc15a6/userFiles-ed0a6692-b96a-4b3d-be16-215a21cf836b/test.py
17/04/03 23:24:27 INFO executor.Executor: Starting executor ID driver on host localhost
17/04/03 23:24:27 INFO util.Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 44223.
17/04/03 23:24:27 INFO netty.NettyBlockTransferService: Server created on 10.0.0.5:44223
17/04/03 23:24:27 INFO storage.BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
17/04/03 23:24:27 INFO storage.BlockManagerMaster: Registering BlockManager BlockManagerId(driver, 10.0.0.5, 44223, None)
17/04/03 23:24:27 INFO storage.BlockManagerMasterEndpoint: Registering block manager 10.0.0.5:44223 with 366.3 MB RAM, BlockManagerId(driver, 10.0.0.5, 44223, None)
17/04/03 23:24:27 INFO storage.BlockManagerMaster: Registered BlockManager BlockManagerId(driver, 10.0.0.5, 44223, None)
17/04/03 23:24:27 INFO storage.BlockManager: Initialized BlockManager: BlockManagerId(driver, 10.0.0.5, 44223, None)
17/04/03 23:24:27 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@c06133a{/metrics/json,null,AVAILABLE}
SparkContextConf: [('spark.app.name', 'test.py'), ('spark.rdd.compress', 'True'), ('spark.serializer.objectStreamReset', '100'), ('spark.master', 'local[*]'), ('spark.executor.id', 'driver'), ('spark.submit.deployMode', 'client'), ('spark.files', 'file:/mnt/scratch/yidi/docker-volume/test.py'), ('spark.driver.host', '10.0.0.5'), ('spark.app.id', 'local-1491261867756'), ('spark.driver.port', '38193')]
17/04/03 23:24:27 INFO spark.SparkContext: Invoking stop() from shutdown hook
17/04/03 23:24:27 INFO server.ServerConnector: Stopped ServerConnector@788dbc45{HTTP/1.1}{0.0.0.0:4040}
17/04/03 23:24:27 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@77257b2b{/stages/stage/kill,null,UNAVAILABLE}
17/04/03 23:24:27 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@3cea7196{/jobs/job/kill,null,UNAVAILABLE}
17/04/03 23:24:27 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@395b7dae{/api,null,UNAVAILABLE}
17/04/03 23:24:27 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@880f037{/,null,UNAVAILABLE}
17/04/03 23:24:27 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@258dd139{/static,null,UNAVAILABLE}
17/04/03 23:24:27 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@4fc94fbc{/executors/threadDump/json,null,UNAVAILABLE}
17/04/03 23:24:27 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@5d9d8eff{/executors/threadDump,null,UNAVAILABLE}
17/04/03 23:24:27 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@470d45aa{/executors/json,null,UNAVAILABLE}
17/04/03 23:24:27 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@4a241145{/executors,null,UNAVAILABLE}
17/04/03 23:24:27 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@19217cec{/environment/json,null,UNAVAILABLE}
17/04/03 23:24:27 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@1301217d{/environment,null,UNAVAILABLE}
17/04/03 23:24:27 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@7a6ef942{/storage/rdd/json,null,UNAVAILABLE}
17/04/03 23:24:27 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@4b5300a5{/storage/rdd,null,UNAVAILABLE}
17/04/03 23:24:27 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@3c06c594{/storage/json,null,UNAVAILABLE}
17/04/03 23:24:27 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@33cb59b3{/storage,null,UNAVAILABLE}
17/04/03 23:24:27 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@4e2fe461{/stages/pool/json,null,UNAVAILABLE}
17/04/03 23:24:27 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@50e4e8d1{/stages/pool,null,UNAVAILABLE}
17/04/03 23:24:27 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@1c826806{/stages/stage/json,null,UNAVAILABLE}
17/04/03 23:24:27 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@7a70fd3a{/stages/stage,null,UNAVAILABLE}
17/04/03 23:24:27 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@4f8a3bef{/stages/json,null,UNAVAILABLE}
17/04/03 23:24:27 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@4661596e{/stages,null,UNAVAILABLE}
17/04/03 23:24:27 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@79e3feec{/jobs/job/json,null,UNAVAILABLE}
17/04/03 23:24:27 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@404eb034{/jobs/job,null,UNAVAILABLE}
17/04/03 23:24:27 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@286430d5{/jobs/json,null,UNAVAILABLE}
17/04/03 23:24:27 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@3950ddbb{/jobs,null,UNAVAILABLE}
17/04/03 23:24:27 INFO ui.SparkUI: Stopped Spark web UI at http://localhost:4040
17/04/03 23:24:27 INFO spark.MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
17/04/03 23:24:27 INFO memory.MemoryStore: MemoryStore cleared
17/04/03 23:24:27 INFO storage.BlockManager: BlockManager stopped
17/04/03 23:24:27 INFO storage.BlockManagerMaster: BlockManagerMaster stopped
17/04/03 23:24:27 INFO scheduler.OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
17/04/03 23:24:27 INFO spark.SparkContext: Successfully stopped SparkContext
17/04/03 23:24:27 INFO util.ShutdownHookManager: Shutdown hook called
17/04/03 23:24:27 INFO util.ShutdownHookManager: Deleting directory /tmp/spark-5bb22b93-172d-40d6-8a37-6e3da5dc15a6/pyspark-50d75419-36b8-4dd2-a7c6-7d500c7c5bca
17/04/03 23:24:27 INFO util.ShutdownHookManager: Deleting directory /tmp/spark-5bb22b93-172d-40d6-8a37-6e3da5dc15a6

OneCricketeer commented Jan 20, 2020

As of Spark 2, SparkSession is encouraged over SparkContext:

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('abc').getOrCreate()  # add .config() calls in the builder chain before getOrCreate()
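
For example, the settings from the script above would translate to something like this (a sketch using the standard builder API, not tested against this particular cluster):

from pyspark.sql import SparkSession

# Same settings as the original SparkConf, expressed through the builder.
spark = (SparkSession.builder
    .appName("word-count")
    .master("spark://spark-master:7077")
    .config("spark.scheduler.mode", "FAIR")
    .config("spark.cores.max", "4")
    .getOrCreate())

sc = spark.sparkContext  # the underlying SparkContext, if still needed
print(sc.getConf().getAll())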
