Spark Installation

  • local
  • on YARN (this mode hands resource management over to Hadoop's YARN; Spark itself only does computation and task scheduling)
  • standalone (Spark's own resource manager and scheduler); how a mode is selected is sketched below
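
For reference, the deployment mode is chosen by the URL passed via --master to spark-shell or spark-submit; a minimal sketch (paths assume you are inside the Spark install directory):

./bin/spark-shell --master local[*]                # local: all cores on this machine
./bin/spark-shell --master yarn                    # on YARN (requires HADOOP_CONF_DIR to be set)
./bin/spark-shell --master spark://server01:7077   # standalone cluster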

Standalone-mode installation

Build a Spark cluster made up of a Master plus Slaves; Spark runs on this cluster.
Every node must be installed with an identical environment; otherwise success is a matter of luck.
Environment — OS: CentOS 6.10, 3 virtual machines; Software: JDK 1.8 + Spark 2.4.4

| Host           | HostName | Master | Slave |
| -------------- | -------- | :----: | :---: |
| 100.80.128.34  | server01 | √      | √     |
| 100.80.128.253 | server02 |        | √     |
| 100.80.128.168 | server03 |        | √     |

Virtual machines:
Start the VirtualBox VMs headless (no window) so they can be reached over remote desktop.
The VMs are configured with an eth2 NIC, which does not obtain an IP automatically at boot.
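
A minimal sketch of both steps, assuming a VM named server01 (the VM name is hypothetical; substitute your own):

# on the host: start the VM without a window
VBoxManage startvm server01 --type headless
# inside the guest: request an IP on eth2 manually, since it is not acquired at boot
sudo dhclient eth2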

Cluster machine environment initialization

Cluster deployment needs passwordless trust between the nodes before the Workers on all of them can be started. "Trust" means you can log in to another node by hostname alone, with no password prompt. Walk through all 3 nodes and configure each one.

Create the hadoop user

Operate on the data as a dedicated hadoop user rather than root; it is safer.

# create the hadoop user
sudo adduser hadoop
sudo passwd hadoop
# add the user to the hadoop group (adduser already created the group)
sudo usermod -a -G hadoop hadoop
# grant the hadoop user sudo privileges (visudo is the safer way to edit this file)
sudo vim /etc/sudoers
# add this line:
hadoop ALL=(ALL) ALL
Change the hostname

Spark's configuration files refer to machines by hostname.

Change the hostname (temporary):
sudo hostname server01
Change the hostname (permanent):

[root@server01 ~]# cat /etc/sysconfig/network
NETWORKING=yes
HOSTNAME=server01
[root@server01 ~]# cat /etc/hosts
127.0.0.1 localhost server01 localhost4 localhost4.localdomain4
::1 localhost server01 localhost6 localhost6.localdomain6
Edit the /etc/hosts file

Spark looks up the slave nodes by hostname, so name resolution must work; configure the hosts file on each node separately.

100.80.128.253 server02
100.80.128.168 server03
100.80.128.34 server01
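
A quick check that resolution works, run from any node:

# each hostname should resolve to the address configured in /etc/hosts
ping -c 1 server02
ping -c 1 server03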
Install ssh, enable it at boot, allow remote root access
yum install openssh-server -y

vim /etc/ssh/sshd_config
# enable these two options:
RSAAuthentication yes
PubkeyAuthentication yes

service sshd restart
chkconfig sshd on  # start sshd at boot
Configure ssh passwordless trust
# generate a key pair
ssh-keygen -t rsa
# copy the public key to every node in turn, including this node itself
ssh-copy-id -i /home/hadoop/.ssh/id_rsa.pub hadoop@server01
ssh-copy-id -i /home/hadoop/.ssh/id_rsa.pub hadoop@server02
ssh-copy-id -i /home/hadoop/.ssh/id_rsa.pub hadoop@server03
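
To confirm the trust is in place (run as the hadoop user on each node):

# should print the remote hostname with no password prompt
ssh hadoop@server02 hostname
ssh hadoop@server03 hostname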
Disable the firewall

Starting Spark opens several ports (8080 for the master web UI, 4040 for the application UI), and remote access to the machines needs ssh (22), so disable the firewall.

# flush all chains in the default table
sudo iptables -F
# check firewall status
sudo service iptables status
# stop the firewall (CentOS 6)
sudo service iptables stop
# do not start it at boot (CentOS 6)
sudo chkconfig iptables off
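
If disabling the firewall entirely is not acceptable, a sketch of opening just the required ports instead (CentOS 6 iptables):

# master RPC, master web UI, application web UI, ssh
sudo iptables -I INPUT -p tcp --dport 7077 -j ACCEPT
sudo iptables -I INPUT -p tcp --dport 8080 -j ACCEPT
sudo iptables -I INPUT -p tcp --dport 4040 -j ACCEPT
sudo iptables -I INPUT -p tcp --dport 22 -j ACCEPT
sudo service iptables save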

Install JDK 8

On CentOS, install the JDK with yum.
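
A sketch using the OpenJDK 8 packages from the CentOS repositories (note the spark-env.sh below assumes a JDK under /java/jdk1.8.0_161, so point JAVA_HOME at wherever your JDK actually lands):

sudo yum install -y java-1.8.0-openjdk java-1.8.0-openjdk-devel
java -version   # verify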

Install Spark

Download address: https://archive.apache.org/dist/spark/spark-2.4.4/

Extract and copy
tar -zxvf spark-2.4.4-bin-hadoop2.7.tgz

cd spark-2.4.4-bin-hadoop2.7/conf
cp slaves.template slaves
cp spark-env.sh.template spark-env.sh
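
After editing the two configuration files below, the configured directory has to reach every node; a sketch of the copy step (the target path, the hadoop user's home, is an assumption):

scp -r spark-2.4.4-bin-hadoop2.7 hadoop@server02:~/
scp -r spark-2.4.4-bin-hadoop2.7 hadoop@server03:~/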
Edit the slaves configuration file

server01 will be the master, so only these two are listed:

server02
server03
Edit the spark-env.sh configuration file
# Java environment variable
export JAVA_HOME=/java/jdk1.8.0_161
# host of the Spark master process; use the real IP address, not the server01 name
# from /etc/hosts, otherwise the worker information never shows up in the web UI
export SPARK_MASTER_HOST=100.80.128.34
# port of the Spark master
export SPARK_MASTER_PORT=7077
# number of cores each worker may use
export SPARK_WORKER_CORES=3
# memory each worker may use
export SPARK_WORKER_MEMORY=1g


Start

Start the cluster

# run on server01; starts the master locally and the workers over ssh
./sbin/start-all.sh
# check the processes
ps -ef | grep spark
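
If everything came up, jps gives a quick sanity check (the expected output is sketched in the comments):

jps
# server01: Master
# server02 / server03: Worker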

Usage

Specify the master to run against the Spark cluster:
./bin/spark-shell --master spark://100.80.128.34:7077
Without it, spark-shell runs in local mode on this machine:
./bin/spark-shell

[hadoop@server02 spark-2.4.4-bin-hadoop2.7]$ ./bin/spark-shell
19/10/25 18:02:04 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://server02:4040
Spark context available as 'sc' (master = local[*], app id = local-1571997739840).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.4
      /_/

Using Scala version 2.11.12 (OpenJDK 64-Bit Server VM, Java 1.8.0_232)
Type in expressions to have them evaluated.
Type :help for more information.

scala> sc.textFile("LICENSE").flatMap(_.split(" ")).map((_,1)).reduceByKey(_+_).collect
res0: Array[(String, Int)] = Array((Unless,3), (agree,1), (reproduce,,1), ((or,3), (org.scalanlp:breeze-macros_2.11,1), (org.iq80.snappy:snappy,1), (MERCHANTABILITY,,1), (However,,1), (been,2), (2-Clause,1), (appropriateness,1), (com.squareup.okio:okio,1), (direct,,1), (com.fasterxml.jackson.module:jackson-module-paranamer,1), (https://github.com/jax-rs,1), (9.,1), (com.esotericsoftware:minlog,1), (CSS,1), (-------------,1), (file,6), (org.scala-lang:scala-library,1), (file.,1), (harmless,1), (------------------------------------------------------------------------------------,2), (are,7), (2.,1), (part,4), (reproduction,,3), (alone,1), (different,1), (grant,1), (org.jodd:jodd-core,1), (io.fabric8:kubernetes-model,1), ("[]",1), (FITNESS,1), (be,5), (distribution,,1), (org.apache.xbean:x...
scala>
