Hadoop-lab

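1. Install CentOS 7

1. Download the image

ISO: CentOS 7 x86_64

2. Install the system

3. Finish setting up the virtual machine

4. Install Docker

Docker requires a CentOS kernel version of 3.10 or later. Check whether the running kernel supports Docker: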

uname -r
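The comparison against 3.10 can also be scripted. The following is a minimal sketch, assuming GNU sort with the -V (version sort) option, as shipped with CentOS; the function name version_ge is hypothetical and not part of Docker or CentOS.

```shell
# version_ge A B: succeed when version A is greater than or equal to version B.
# sort -V orders version strings, so A >= B exactly when B sorts first.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Keep only the part before the first dash, e.g. 3.10.0-1127.el7.x86_64 -> 3.10.0
kernel="$(uname -r | cut -d- -f1)"
if version_ge "$kernel" "3.10"; then
    echo "kernel $kernel is new enough for Docker"
else
    echo "kernel $kernel is older than 3.10"
fi
```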

Make sure the yum packages are up to date:

sudo yum update

If Docker was installed before, remove the old version:

sudo yum remove docker docker-common docker-selinux docker-engine

Install the required packages. yum-utils provides yum-config-manager; the other two are dependencies of the devicemapper storage driver:

sudo yum install -y yum-utils device-mapper-persistent-data lvm2

Add the yum repository (using the Alibaba Cloud mirror):

sudo yum-config-manager --add-repo http://mirrors.aliyun.com/docker-ce/linux/centos/docker-ce.repo

List every Docker version available in the repositories, so that a specific version can be chosen for installation:

yum list docker-ce --showduplicates | sort -r

Install the latest Docker Engine - Community and containerd:

sudo yum install docker-ce docker-ce-cli containerd.io

Verify the installation:

docker version

Start Docker:

sudo systemctl start docker

Run Hello World:

sudo docker run hello-world

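2. Build the Hadoop cluster

1. Download and configure the JDK

Install the JDK:

yum install -y java-1.8.0-openjdk*

Locate the Java installation. The command prints the directory that contains the java binary (it ends in jre/bin); JAVA_HOME is the JDK directory above it:

dirname $(readlink $(readlink $(which java)))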
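Deriving JAVA_HOME from the resolved path of the java binary can be sketched as follows. This is an illustration under the assumption that the binary lives in either JAVA_HOME/bin or JAVA_HOME/jre/bin; the function name java_home_from_binary is hypothetical.

```shell
# java_home_from_binary PATH: print the JDK directory for a resolved java binary.
# Drops the trailing /bin/java, then an optional trailing /jre component.
java_home_from_binary() {
    printf '%s\n' "$1" | sed -e 's|/bin/java$||' -e 's|/jre$||'
}
```

On a machine where java is installed, java_home_from_binary "$(readlink -f "$(which java)")" prints a candidate value for JAVA_HOME.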

Configure the environment variables:

vi /etc/profile

export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.262.b10-0.el7_8.x86_64
export JRE_HOME=$JAVA_HOME/jre
export CLASSPATH=$JAVA_HOME/lib:$JRE_HOME/lib:$CLASSPATH
export PATH=$JAVA_HOME/bin:$JRE_HOME/bin:$PATH

Check JAVA_HOME:

source /etc/profile
echo $JAVA_HOME

Test:

java -version

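2. Configure the hosts list

Because the two virtual machines were cloned from the same virtual machine, regenerate the MAC address of the clone first.

Turn off the firewall on both master and slave.

Stop the firewall:

systemctl stop firewalld

Disable the firewall so that it does not start at boot:

systemctl disable firewalld

Change the hostname on both master and slave:

vi /etc/sysconfig/network

Confirm the change. (On CentOS 7 the persistent hostname is normally stored in /etc/hostname and set with hostnamectl set-hostname, so use the hostname command to check that the new name took effect.)

Run ifconfig on both master and slave to look up the IP addresses:

ifconfig

On both master and slave, add the IP address and hostname of each machine to /etc/hosts:

vi /etc/hosts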
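Each node's /etc/hosts needs one line per machine: an IP address followed by the hostname. The sketch below is one possible way to check that a name is present before moving on; the function name has_host_entry is hypothetical.

```shell
# has_host_entry FILE NAME: succeed when NAME appears as a hostname
# (any column after the IP address) on a non-comment line of FILE.
has_host_entry() {
    awk -v name="$2" '
        /^[[:space:]]*#/ { next }
        { for (i = 2; i <= NF; i++) if ($i == name) found = 1 }
        END { exit (found ? 0 : 1) }
    ' "$1"
}
```

For example, has_host_entry /etc/hosts master && has_host_entry /etc/hosts slave succeeds only when both names are listed. A matching hosts file would contain lines such as "192.168.117.141 master" and "192.168.117.142 slave"; the second address is a placeholder, so substitute the values reported by ifconfig.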

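Ping slave from master and master from slave to confirm that the names resolve.

3. Passwordless SSH login for the cluster

On master:

Generate a key pair on master:

ssh-keygen -t rsa

Append the public key to the authorized list:

cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

Change the permissions of the authorized_keys file:

chmod 600 ~/.ssh/authorized_keys

Copy the authorized_keys file to the slave node:

scp ~/.ssh/authorized_keys parak@slave:~/.ssh

Check the .ssh directory on slave.

Edit the SSH configuration on master and slave:

su root
vi /etc/ssh/sshd_config

Add the private key with ssh-add and restart the sshd service:

ssh-add ~/.ssh/id_rsa
service sshd restart

Test passwordless login.

SSH login still asks for a password, so look at the log file on master:

cat /var/log/secure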

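Cause

For security, sshd checks the permissions of the user's home directory, the .ssh directory, and the files inside it.

If the permissions are too permissive, passwordless SSH login does not take effect.

Fix

Set the .ssh directory to mode 755.

Set id_rsa.pub and authorized_keys to mode 644.

id_rsa must be mode 600.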

chmod 755 .ssh
cd .ssh
chmod 644 id_rsa.pub authorized_keys
chmod 600 id_rsa
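The modes above can be verified mechanically. The following is a minimal sketch, assuming GNU stat (as shipped with CentOS); check_mode and report_ssh_modes are hypothetical helper names, and the expected modes are simply the ones this section sets.

```shell
# check_mode PATH MODE: succeed when PATH has exactly the octal mode MODE.
check_mode() {
    [ "$(stat -c '%a' "$1")" = "$2" ]
}

# report_ssh_modes DIR: print one line per entry whose mode differs from the
# values used in this section (755 for the directory, 644 for id_rsa.pub and
# authorized_keys, 600 for id_rsa). Missing files are skipped.
report_ssh_modes() {
    dir="$1"
    check_mode "$dir" 755 || echo "unexpected mode: $dir"
    for name in id_rsa.pub authorized_keys; do
        [ -e "$dir/$name" ] || continue
        check_mode "$dir/$name" 644 || echo "unexpected mode: $dir/$name"
    done
    if [ -e "$dir/id_rsa" ]; then
        check_mode "$dir/id_rsa" 600 || echo "unexpected mode: $dir/id_rsa"
    fi
}
```

Running report_ssh_modes ~/.ssh on master and on slave prints nothing when every mode matches. Stricter modes (700 for .ssh and 600 for authorized_keys) also satisfy sshd; in that case adjust the expected values in the sketch.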

Test passwordless login again.

4. Download and configure Hadoop

Download hadoop-2.9.2. The Beijing Foreign Studies University (BFSU) mirror is very fast.

Extract the downloaded archive:

tar -zxvf hadoop-2.9.2.tar.gz

Change the file permissions (run in the etc/hadoop directory):

chmod 777 hadoop-env.sh core-site.xml hdfs-site.xml yarn-site.xml mapred-site.xml slaves

Configure the environment variables:

gedit /home/parak/hadoop/hadoop-2.9.2/etc/hadoop/hadoop-env.sh

Add:
export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.262.b10-0.el7_8.x86_64
export HADOOP_HOME=/home/parak/hadoop/hadoop-2.9.2
export PATH=${JAVA_HOME}/bin:${HADOOP_HOME}/bin:${HADOOP_HOME}/sbin:$PATH

Configure the core component, core-site.xml

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->

<!-- Put site-specific property overrides in this file. -->

<configuration>
	<property>
		<name>fs.defaultFS</name>
		<value>hdfs://master:9000</value>
	</property>
	<property>
		<name>hadoop.tmp.dir</name>
		<value>/home/parak/hadoopdata</value>
	</property>
</configuration>

Configure the file system, hdfs-site.xml

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->

<!-- Put site-specific property overrides in this file. -->

<configuration>
	<property>
		<name>dfs.replication</name>
		<value>1</value>
	</property>
</configuration>

Configure YARN, yarn-site.xml

<?xml version="1.0"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->
<configuration>

<!-- Site specific YARN configuration properties -->

	<property>
		<name>yarn.nodemanager.aux-services</name>
		<value>mapreduce_shuffle</value>
	</property>
	<property>
		<name>yarn.resourcemanager.address</name>
		<value>master:18040</value>
	</property>
	<property>
		<name>yarn.resourcemanager.scheduler.address</name>
		<value>master:18030</value>
	</property>
	<property>
		<name>yarn.resourcemanager.resource-tracker.address</name>
		<value>master:18025</value>
	</property>
	<property>
		<name>yarn.resourcemanager.admin.address</name>
		<value>master:18141</value>
	</property>
	<property>
		<name>yarn.resourcemanager.webapp.address</name>
		<value>master:18088</value>
	</property>

</configuration>

Configure the compute framework, mapred-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->

<!-- Put site-specific property overrides in this file. -->

<configuration>
	<property>
		<name>mapreduce.framework.name</name>
		<value>yarn</value>
	</property>
</configuration>

Configure the slaves file on the master node.

5. Start the Hadoop cluster

Switch to the user parak:

su parak

Configure the system environment variables used to start Hadoop:

cd
gedit ~/.bash_profile

Add:
#HADOOP
export HADOOP_HOME=/home/parak/hadoop/hadoop-2.9.2
export PATH=$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$PATH

Run:
source ~/.bash_profile

Verify:
echo ${HADOOP_HOME}

Create the data directory hadoopdata:

mkdir /home/parak/hadoopdata
chmod 777 /home/parak/hadoopdata

Format the file system:

hdfs namenode -format

Start Hadoop:

$HADOOP_HOME/sbin/start-all.sh

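Check the running processes:

jps

In Firefox, open http://master:50070/

Check that the NameNode and DataNode are healthy.

In Firefox, open http://master:18088/

Check that YARN is healthy.

Run the pi example to check that the Hadoop cluster was set up successfully (10 map tasks with 10 samples each):

cd ~/hadoop/hadoop-2.9.2/share/hadoop/mapreduce/
hadoop jar hadoop-mapreduce-examples-2.9.2.jar pi 10 10

Error: No route to host

Analysis: the firewall on slave was not turned off.

Check the firewall on slave:

systemctl status firewalld

Stop and disable the firewall:

systemctl stop firewalld
systemctl disable firewalld

Run the pi example again.

The result is PI = 3.20000000000000000000. The estimate is coarse because only 100 sample points are used.

In summary, the cluster starts and runs normally.

Shut down the Hadoop cluster.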

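3. Operations on the HDFS distributed file system

1. Interact with HDFS using shell commands

hadoop fs    // works with any supported file system, such as the local file system and HDFS
hadoop dfs   // works only with HDFS
hdfs dfs     // same effect as hadoop dfs; also works only with HDFS

Show the help for a single command, for example put:

hadoop fs -help put

(1) Directory operations

Create a user directory in HDFS for the user parak:

hadoop fs -mkdir -p /user/parak

List the contents of the /user directory in HDFS:

hadoop fs -ls /user

Create the /user/parak/input directory:

hadoop fs -mkdir -p /user/parak/input

Check that /user/parak/input was created:

hadoop fs -ls /user/parak

Create the /input directory:

hadoop fs -mkdir -p /input

Delete the /input directory:

hadoop fs -rm -r /input

Check that /input was deleted:

hadoop fs -ls /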

(2) File operations

In practice, files often need to be uploaded from the local file system to HDFS, or downloaded from HDFS to the local file system.

First, use the vim editor to create a file named myLocalFile.txt in the /home/parak/ directory of the local Linux file system, with the following content:

Hadoop
MapReduce
Spark
Khighness

Then upload /home/parak/myLocalFile.txt from the local file system to the input directory under the current user's directory in HDFS, that is, to /user/parak/input/:

hadoop fs -put /home/parak/myLocalFile.txt /user/parak/input

Check that the file was uploaded to HDFS:

hadoop fs -ls /user/parak/input

View the content of myLocalFile.txt in HDFS:

hadoop fs -cat /user/parak/input/myLocalFile.txt

Download myLocalFile.txt from HDFS to the /home/parak/下载/ (Downloads) directory of the local file system:

hadoop fs -get /user/parak/input/myLocalFile.txt /home/parak/下载

View the downloaded myLocalFile.txt locally:

cd 下载
ll
cat myLocalFile.txt

Copy /user/parak/input/myLocalFile.txt to another HDFS directory, /input:

hadoop fs -mkdir /input
hadoop fs -cp /user/parak/input/myLocalFile.txt /input

2. Manage HDFS through the web interface

On the host machine, open http://192.168.117.141:50070 in Chrome to see the HDFS web management interface.

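4. Programming practice on the HDFS distributed file system

1. Install Eclipse

(1) Download the installer from the official website

Open the download page in Firefox: eclipse

Or

download from Baidu Netdisk

Link: https://pan.baidu.com/s/1P4vDgBEj_eOSakabM93rew

Password: eujp

(2) Extract the installer

tar xzvf eclipse-inst-linux64.tar.gz
cd eclipse-installer/
ll

(3) Install Eclipse for Java EE

eclipse-inst.ini

(4) Create a desktop shortcut

1. Switch to root: su root

2. Go to the /usr/share/applications directory: cd /usr/share/applications

3. Create the eclipse.desktop file: touch eclipse.desktop

4. Open it with vi eclipse.desktop, enter the [Desktop Entry] content listed below, and save

5. Finally, copy the shortcut to the desktop and mark it as trusted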

[Desktop Entry]
Name=Eclipse
Exec=/home/parak/eclipse/eclipse-j2e/eclipse/eclipse
Type=Application
Icon=/home/parak/eclipse/eclipse-j2e/eclipse/icon.xpm
Terminal=false

2. Create the Eclipse project

Double-click the Eclipse desktop shortcut.

Enter the Eclipse IDE.

Click File -> Project -> Java Project

Set the JRE.

Add the JARs:

/home/parak/hadoop/hadoop-2.9.2/share/hadoop/common/: hadoop-common-2.9.2.jar and hadoop-nfs-2.9.2.jar
/home/parak/hadoop/hadoop-2.9.2/share/hadoop/common/: all JAR files in the lib directory
/home/parak/hadoop/hadoop-2.9.2/share/hadoop/hdfs/: hadoop-hdfs-2.9.2.jar and hadoop-hdfs-nfs-2.9.2.jar
/home/parak/hadoop/hadoop-2.9.2/share/hadoop/hdfs/: all JAR files in the lib directory

Click Finish, click Open Perspective, and check Remember my decision.

3. Write a Java application that checks whether a file exists in HDFS

(1) Write the Java program

Right-click HDFSExample -> New -> Class

Name = HDFSFileIfExist, then Finish

Write the code

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/**
 * @author parak
 * @date   2020-9-15 
 */
public class HDFSFileIfExist {
    public static void main(String[] args){
        try{
            String fileName = "/user/parak/input/myLocalFile.txt";
            Configuration conf = new Configuration();
            conf.set("fs.defaultFS", "hdfs://master:9000");
            conf.set("fs.hdfs.impl", "org.apache.hadoop.hdfs.DistributedFileSystem");
            FileSystem fs = FileSystem.get(conf);
            if(fs.exists(new Path(fileName))){
                System.out.println("File exists");
            }else{
                System.out.println("File does not exist");
            }
        }catch (Exception e){
            e.printStackTrace();
        }
    }
}

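(2) Run the program

Start the Hadoop cluster.

Run the program.

Run result

(3) Deploy and run on the Hadoop platform

Create a myapp directory under the Hadoop installation directory to hold Hadoop applications:

mkdir ~/hadoop/hadoop-2.9.2/myapp

Right-click HDFSExample -> Export -> Java -> Runnable JAR file -> Next

Launch configuration: HDFSFileIfExist - HDFSExample

Export destination: /home/parak/hadoop/hadoop-2.9.2/myapp/HDFSExample.jar

A warning appears; choose OK.

Check the HDFSExample.jar file generated in myapp.

Run the program:

java -jar ~/hadoop/hadoop-2.9.2/myapp/HDFSExample.jar

4. Write a Java application that reads and writes HDFS files

(1) Program that reads an HDFS file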

import org.apache.hadoop.fs.Path;
import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;

/**
 * @author parak
 * @date 2020-9-15
 */
public class HDFSReadFile {
	public static void main(String[] args) {
		try {
			Configuration conf = new Configuration();
			conf.set("fs.defaultFS", "hdfs://master:9000");
			conf.set("fs.hdfs.impl",
					"org.apache.hadoop.hdfs.DistributedFileSystem");
			FileSystem fs = FileSystem.get(conf);
			Path file = new Path("/user/parak/input/myLocalFile.txt");
			FSDataInputStream getIt = fs.open(file);
			BufferedReader d = new BufferedReader(new InputStreamReader(getIt));
			String content = null;
			while((content = d.readLine()) != null) {
				System.out.println(content);
			}
			d.close(); // close the file
			fs.close(); // close the HDFS connection
		} catch (Exception e) {
			e.printStackTrace();
		}
	}
}

Run

(2) Program that writes an HDFS file

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.Path;

/**
 * @author parak
 * @date 2020-9-15
 */
public class HDFSWriteFile {
	public static void main(String[] args) {
		try {
			Configuration conf = new Configuration();
			conf.set("fs.defaultFS", "hdfs://master:9000");
			conf.set("fs.hdfs.impl",
					"org.apache.hadoop.hdfs.DistributedFileSystem");
			FileSystem fs = FileSystem.get(conf);
			byte[] buff = "Hello, Khighness".getBytes(); // content to write
			String filename = "/user/parak/test.txt"; // name of the file to write
			FSDataOutputStream os = fs.create(new Path(filename));
			os.write(buff, 0, buff.length);
			System.out.println("Create:" + filename);
			os.close();
			fs.close();
		} catch (Exception e) {
			e.printStackTrace();
		}
	}
}

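Run

5. HDFS operations in Eclipse

1. Install Hadoop-Eclipse-Plugin

(1) Download the plugin: hadoop-eclipse-plugin

I downloaded it over a VPN and am sharing it here:

Full package

Netdisk link: https://pan.baidu.com/s/1Q72HyCSUnh-Q0JRQK03Jjg

Extraction code: kkkk

Plugin package

Netdisk link: https://pan.baidu.com/s/1dfBm7JB4ZXTR3bwF7PSaKA

Extraction code: kkkk

(2) After downloading, copy hadoop-eclipse-plugin-2.6.0.jar from the release directory into the plugins folder of the Eclipse installation directory

(3) Restart Eclipse with eclipse -clean

2. Configure Hadoop-Eclipse-Plugin

Start Hadoop before configuring.

(1) Switch to the "Map/Reduce" perspective

(2) Create a connection to the Hadoop cluster

(3) Fill in the Hadoop connection settings

3. Work with HDFS files in Eclipse

Click DFS Locations on the left to browse the HDFS file list.

6. Run the "WordCount" MapReduce program in Eclipse

1. Create the "WordCount" MapReduce project in Eclipse

Choose File -> New -> Project -> Map/Reduce Project

Project name = MyWordCount, then Configure Hadoop install directory -> Finish

Create a new Java file, name = WordCountTest

The code is as follows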

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
 
/**
 * @author parak
 * @date   2020-9-28
 */
public class WordCountTest {
    public WordCountTest() {
    }
 
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        String[] otherArgs = (new GenericOptionsParser(conf, args)).getRemainingArgs();
        if(otherArgs.length < 2) {
            System.err.println("Usage: wordcount <in> [<in>...] <out>");
            System.exit(2);
        }
 
        Job job = Job.getInstance(conf, "word count test");
        job.setJarByClass(WordCountTest.class);
        job.setMapperClass(WordCountTest.TokenizerMapper.class);
        job.setCombinerClass(WordCountTest.IntSumReducer.class);
        job.setReducerClass(WordCountTest.IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
 
        for(int i = 0; i < otherArgs.length - 1; ++i) {
            FileInputFormat.addInputPath(job, new Path(otherArgs[i]));
        }

        FileOutputFormat.setOutputPath(job, new Path(otherArgs[otherArgs.length - 1]));
        System.exit(job.waitForCompletion(true)?0:1);
    }
 
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();
 
        public IntSumReducer() {
        }
 
        public void reduce(Text key, Iterable<IntWritable> values, Reducer<Text, IntWritable, Text, IntWritable>.Context context) throws IOException, InterruptedException {
            int sum = 0;
 
            IntWritable val;
            for(Iterator itr = values.iterator(); itr.hasNext(); sum += val.get()) {
                val = (IntWritable)itr.next();
            }
 
            this.result.set(sum);
            context.write(key, this.result);
        }
    }
 
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable one = new IntWritable(1);
        private Text word = new Text();
 
        public TokenizerMapper() {
        }
 
        public void map(Object key, Text value, Mapper<Object, Text, Text, IntWritable>.Context context) throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
 
            while(itr.hasMoreTokens()) {
                this.word.set(itr.nextToken());
                context.write(this.word, one);
            }
 
        }
    }
}
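To sanity-check the job's output on a small file, the same computation can be sketched locally with standard shell tools. This is not how the MapReduce job runs; it only mirrors what TokenizerMapper (split on whitespace) and IntSumReducer (sum per word) compute. The function name word_count is hypothetical.

```shell
# word_count: read text on stdin and print "word<TAB>count", sorted by word.
# Steps: put one whitespace-separated token per line, drop empty lines,
# sort in byte order so identical words are adjacent, count each group,
# then swap the columns into word-first order.
word_count() {
    tr -s '[:space:]' '\n' |
        sed '/^$/d' |
        LC_ALL=C sort |
        uniq -c |
        awk '{ print $2 "\t" $1 }'
}
```

For example, word_count < /home/parak/myLocalFile.txt prints one line per distinct word, which can be compared with the part-r-00000 file that the job writes for the same input.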

2. Add a log4j.properties configuration file to the src directory

Create the file

Its content is as follows

log4j.rootLogger = INFO,KAG,CONSOLE

log4j.appender.KAG.Threshold=INFO
log4j.appender.KAG.encoding=UTF-8

log4j.appender.KAG = org.apache.log4j.DailyRollingFileAppender
log4j.appender.KAG.File=log/sHadoop.log
log4j.appender.KAG.ImmediateFlush=true
log4j.appender.KAG.DatePattern='_'yyyy-MM-dd
log4j.appender.KAG.layout=org.apache.log4j.PatternLayout
log4j.appender.KAG.layout.ConversionPattern=%-d{yyyy-MM-dd HH:mm:ss} KAG %-5p [%c] - %m%n

log4j.appender.CONSOLE=org.apache.log4j.ConsoleAppender
log4j.appender.CONSOLE.Threshold=INFO

log4j.appender.CONSOLE.Target=System.out
log4j.appender.CONSOLE.layout=org.apache.log4j.PatternLayout
log4j.appender.CONSOLE.layout.ConversionPattern=%-d{yyyy-MM-dd HH:mm:ss} KAG %-5p [%c] - %m%n

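3. Run the "MyWordCount" MapReduce project from Eclipse

Change the run configuration.

Run result

Check the /user/parak/output directory and the files in it.

Restart Eclipse and view /user/parak/output in the HDFS file system from Eclipse.

4. Deploy the WordCount program on the Hadoop platform

Right-click WordCountTest -> Export

Choose Java -> Runnable JAR File -> Next

Fill in:

Launch configuration: WordCountTest - MyWordCount

Export destination: /home/parak/hadoop/hadoop-2.9.2/myapp/WordCount.jar

Then click Finish.

Run the program (WordCountTest expects the input path or paths and the output path as arguments, and prints its usage message otherwise):

java -jar /home/parak/hadoop/hadoop-2.9.2/myapp/WordCount.jar

7. Count the number of products each buyer has favorited on an e-commerce site

Requirements

An e-commerce site has data on the products its users have favorited. The data records the ID of each favorited product and the date it was favorited, in a file named buyer_favorite1. buyer_favorite1 contains three fields: buyer ID, product ID, and favorite date. The fields are separated by "\t". Sample data and format:

Buyer ID  Product ID  Favorite date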
10181 1000481 2010-04-04 16:54:31
20001 1001597 2010-04-07 15:07:52
20001 1001560 2010-04-07 15:08:27
20042 1001368 2010-04-08 08:20:30
20067 1002061 2010-04-08 16:45:33
20056 1003289 2010-04-12 10:50:55
20056 1003290 2010-04-12 11:57:35
20056 1003292 2010-04-12 12:05:29
20054 1002420 2010-04-14 15:24:12
20055 1001679 2010-04-14 19:46:04
20054 1010675 2010-04-14 15:23:53
20054 1002429 2010-04-14 17:52:45
20076 1002427 2010-04-14 19:35:39
20054 1003326 2010-04-20 12:54:44
20056 1002420 2010-04-15 11:24:49
20064 1002422 2010-04-15 11:35:54
20056 1003066 2010-04-15 11:43:01
20056 1003055 2010-04-15 11:43:06
20056 1010183 2010-04-15 11:45:24
20056 1002422 2010-04-15 11:45:49
20056 1003100 2010-04-15 11:45:54
20056 1003094 2010-04-15 11:45:57
20056 1003064 2010-04-15 11:46:04
20056 1010178 2010-04-15 16:15:20
20076 1003101 2010-04-15 16:37:27
20076 1003103 2010-04-15 16:37:05
20076 1003100 2010-04-15 16:37:18
20076 1003066 2010-04-15 16:37:31
20054 1003103 2010-04-15 16:40:14
20054 1003100 2010-04-15 16:40:16

Write a MapReduce program that counts the number of products each buyer has favorited.

1. Create the file buyer_favourite1 in the Documents directory

Write the data

10181   1000481   2010-04-04 16:54:31
20001   1001597   2010-04-07 15:07:52
20001   1001560   2010-04-07 15:08:27
20042   1001368   2010-04-08 08:20:30
20067   1002061   2010-04-08 16:45:33
20056   1003289   2010-04-12 10:50:55
20056   1003290   2010-04-12 11:57:35
20056   1003292   2010-04-12 12:05:29
20054   1002420   2010-04-14 15:24:12
20055   1001679   2010-04-14 19:46:04
20054   1010675   2010-04-14 15:23:53
20054   1002429   2010-04-14 17:52:45
20076   1002427   2010-04-14 19:35:39
20054   1003326   2010-04-20 12:54:44
20056   1002420   2010-04-15 11:24:49
20064   1002422   2010-04-15 11:35:54
20056   1003066   2010-04-15 11:43:01
20056   1003055   2010-04-15 11:43:06
20056   1010183   2010-04-15 11:45:24
20056   1002422   2010-04-15 11:45:49
20056   1003100   2010-04-15 11:45:54
20056   1003094   2010-04-15 11:45:57
20056   1003064   2010-04-15 11:46:04
20056   1010178   2010-04-15 16:15:20
20076   1003101   2010-04-15 16:37:27
20076   1003103   2010-04-15 16:37:05
20076   1003100   2010-04-15 16:37:18
20076   1003066   2010-04-15 16:37:31
20054   1003103   2010-04-15 16:40:14
20054   1003100   2010-04-15 16:40:16

2. Upload buyer_favourite1 to the HDFS file system

Right-click input -> Upload files to DFS

Select buyer_favourite1 and confirm

3. Create the Java file ProductNumber

The code is as follows

package com.kag;

import java.io.IOException;  
import java.util.StringTokenizer;  
import org.apache.log4j.Logger;
import org.apache.hadoop.fs.Path;  
import org.apache.hadoop.io.IntWritable;  
import org.apache.hadoop.io.Text;  
import org.apache.hadoop.mapreduce.Job;  
import org.apache.hadoop.mapreduce.Mapper;  
import org.apache.hadoop.mapreduce.Reducer;  
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;  
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat; 

/**
 * @author parak
 * @date 2020-9-28
 */
public class ProductNumber {
	
	public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
		Job job = Job.getInstance();
		job.setJobName("ProductNumber");
		job.setJarByClass(ProductNumber.class);
		job.setMapperClass(MapperHandler.class);  
		job.setReducerClass(ReducerHandler.class);  
		job.setOutputKeyClass(Text.class);  
		job.setOutputValueClass(IntWritable.class);  
		Path inputPath = new Path("hdfs://master:9000/user/parak/input/buyer_favourite1");
		Path outputPath = new Path("hdfs://master:9000/user/parak/output/buyer_favourite1_analysis_result");
		FileInputFormat.addInputPath(job, inputPath);
		FileOutputFormat.setOutputPath(job, outputPath);
		boolean flag = job.waitForCompletion(true);
		Logger log = Logger.getLogger(ProductNumber.class);
		log.info("Flag = " + flag);
		System.exit(flag ? 0 : 1);
	}
	
	public static class MapperHandler extends Mapper<Object, Text, Text, IntWritable> {
		public static final IntWritable intWritable = new IntWritable(1);
		public static Text word = new Text();
		@Override
		protected void map(Object key, Text value, Context context) throws IOException, InterruptedException {
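			// Default delimiters split on spaces and tabs, matching the "\t"-separated input
			StringTokenizer tokenizer = new StringTokenizer(value.toString());
			if (!tokenizer.hasMoreTokens()) {
				return; // skip blank lines
			}
			word.set(tokenizer.nextToken());
			context.write(word, intWritable);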
		}
	}
	
	public static class ReducerHandler extends Reducer<Text, IntWritable, Text, IntWritable> {
		private IntWritable intWritable = new IntWritable(1);
		@Override
		protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
			int sum = 0;
			for (IntWritable value : values) {
				sum += value.get();
			}
			intWritable.set(sum);
			context.write(key, intWritable);
		}
	}

}

Run result

The HDFS file system after the run

The content of the part-r-00000 file is the result of the count.
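The result can be cross-checked against a local copy of the input file. The sketch below counts rows per buyer ID with awk, which should produce the same numbers as the job; the function name count_by_buyer is hypothetical, and the file is assumed to have the buyer ID as its first whitespace-separated field.

```shell
# count_by_buyer FILE: print "buyer_id<TAB>count", sorted by buyer id.
# Blank lines are ignored; fields may be separated by tabs or spaces.
count_by_buyer() {
    awk 'NF > 0 { count[$1]++ }
         END { for (id in count) print id "\t" count[id] }' "$1" |
        LC_ALL=C sort
}
```

For the 30 sample rows listed above, this should report 12 for buyer 20056, 6 for 20054, 5 for 20076, 2 for 20001, and 1 for each remaining buyer.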

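8. Install and deploy HBase

HBase and Hadoop version compatibility

1. Download and install HBase-1.6.0

Mirror: https://mirrors.bfsu.edu.cn/apache/hbase/

Choose 1.6.0 and download the tar.gz archive.

Extract:

tar xzvf hbase-1.6.0-bin.tar.gz

Inspect the result.

2. Configure HBase

Go to the conf directory under the HBase installation directory, then edit the configuration files:

cd conf/

(1) Edit the configuration file hbase-env.sh

gedit hbase-env.sh

Set JAVA_HOME:
export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.262.b10-0.el7_8.x86_64

(2) Edit the configuration file hbase-site.xml

Change hbase-site.xml to: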

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
/**
 *
 * Licensed to the Apache Software Foundation (ASF) under one
 * or more contributor license agreements.  See the NOTICE file
 * distributed with this work for additional information
 * regarding copyright ownership.  The ASF licenses this file
 * to you under the Apache License, Version 2.0 (the
 * "License"); you may not use this file except in compliance
 * with the License.  You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
-->
<configuration>
	<property>
		<name>hbase.cluster.distributed</name>
		<value>true</value>
	</property>
	<property>
		<name>hbase.rootdir</name>
		<value>hdfs://master:9000/hbase</value>
	</property>
	<property>
		<name>hbase.zookeeper.quorum</name>
		<value>master</value>
	</property>
	<property>
		<name>hbase.master.info.port</name>
		<value>60010</value>
	</property>
</configuration>

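(3) Set regionservers

In the regionservers file, change localhost to: slave

(4) Set the environment variables

gedit ~/.bash_profile

Append the following to the end of the file:
#HBase
export HBASE_HOME=/home/parak/HBase/hbase-1.6.0
export PATH=$HBASE_HOME/bin:$PATH
export HADOOP_CLASSPATH=$HBASE_HOME/lib/*

(5) Copy HBase to the slave node

scp -r /home/parak/HBase/hbase-1.6.0 slave:/home/parak/HBase/

Check the HBase folder on the slave node.

3. Verify and start HBase

Start Hadoop first: start-all.sh

Then start HBase: start-hbase.sh

Starting HBase produced an error.

I concluded that this was a version compatibility problem between Hadoop 2.9.2 and HBase 1.6.0, so I downloaded HBase 2.2.6 instead.

4. Download and configure HBase-2.2.6 instead

Download

Extract: tar xzvf hbase-2.2.6-bin.tar.gz

Edit the configuration file hbase-env.sh

Edit the configuration file hbase-site.xml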

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
/*
 * Licensed to the Apache Software Foundation (ASF) under one
 * or more contributor license agreements.  See the NOTICE file
 * distributed with this work for additional information
 * regarding copyright ownership.  The ASF licenses this file
 * to you under the Apache License, Version 2.0 (the
 * "License"); you may not use this file except in compliance
 * with the License.  You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
-->
<configuration>
  <!--
    The following properties are set for running HBase as a single process on a
    developer workstation. With this configuration, HBase is running in
    "stand-alone" mode and without a distributed file system. In this mode, and
    without further configuration, HBase and ZooKeeper data are stored on the
    local filesystem, in a path under the value configured for `hbase.tmp.dir`.
    This value is overridden from its default value of `/tmp` because many
    systems clean `/tmp` on a regular basis. Instead, it points to a path within
    this HBase installation directory.

    Running against the `LocalFileSystem`, as opposed to a distributed
    filesystem, runs the risk of data integrity issues and data loss. Normally
    HBase will refuse to run in such an environment. Setting
    `hbase.unsafe.stream.capability.enforce` to `false` overrides this behavior,
    permitting operation. This configuration is for the developer workstation
    only and __should not be used in production!__

    See also https://hbase.apache.org/book.html#standalone_dist
  -->
	<property>
		<name>hbase.cluster.distributed</name>
		<value>true</value>
	</property>
	<property>
		<name>hbase.tmp.dir</name>
		<value>./tmp</value>
	</property>
	<property>
		<name>hbase.unsafe.stream.capability.enforce</name>
		<value>false</value>
	</property>
	<property>
		<name>hbase.rootdir</name>
		<value>hdfs://master:9000/hbase</value>
	</property>
	<property>
		<name>hbase.zookeeper.quorum</name>
		<value>master</value>
	</property>
	<property>
		<name>hbase.master.info.port</name>
		<value>60010</value>
	</property>
</configuration>

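Edit the regionservers file.

Copy HBase to the slave node:

scp -r /home/parak/HBase/hbase-2.2.6 slave:/home/parak/HBase

Start HBase again.

Open Firefox and go to http://master:60010

The page loads, which shows that HBase started successfully.

5. Shut down HBase

Stop HBase first: stop-hbase.sh

Then stop Hadoop: stop-all.sh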

9. HBase Shell commands

1. Start the HBase Shell

(1) Start the HBase Shell

hbase shell

(2) The help command of the HBase Shell

help

Show the usage of create, the command that creates tables:

help "create"

2. Create an HBase table

Use the create command to create a table in HBase:

create 'student', 'Sname', 'Ssex', 'Sage', 'Sdept', 'course'

Show the properties of the 'student' table:

describe 'student'

3. Basic HBase database operations

(1) Add data

The put command adds data. Each call writes a single value to one column of one row of one table:

put 'student', '95001', 'Sname', 'Li Ying'
put 'student', '95001', 'Sdept', 'CS'
put 'student', '95001', 'course:math', '81'
put 'student', '95001', 'course:english', '85'
put 'student', '95002', 'course:math', '83'

(2) View data

View the data of one row:

get 'student', '95001'

View all data in the table:

scan 'student'

(3) Delete data

Use delete to remove a single cell:

delete 'student', '95001', 'Ssex'

Use deleteall to remove all data in a row (delete requires a column):

deleteall 'student', '95001'

(4) Drop the table

disable 'student'
drop 'student'

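4. Query historical data in an HBase table

Specify the number of versions to keep when creating the table:

create 'teacher', {NAME=>'username', VERSIONS=>5}

Insert data and then update it several times so that historical versions are produced: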

put 'teacher', '91001', 'username', 'Mary'
put 'teacher', '91001', 'username', 'Mary1'
put 'teacher', '91001', 'username', 'Mary2'
put 'teacher', '91001', 'username', 'Mary3'
put 'teacher', '91001', 'username', 'Mary4'
put 'teacher', '91001', 'username', 'Mary5'
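With several versions stored, the history can be read back using get with the VERSIONS option. The sketch below only builds the HBase shell command as text, so that it can be reviewed or piped into the HBase shell; the helper name versions_query is hypothetical.

```shell
# versions_query TABLE ROW COLUMN N: print an HBase shell get command that
# asks for up to N stored versions of COLUMN in ROW of TABLE.
versions_query() {
    printf "get '%s', '%s', {COLUMN => '%s', VERSIONS => %s}\n" "$1" "$2" "$3" "$4"
}
```

Assuming hbase is on the PATH and the cluster is running, versions_query teacher 91001 username 5 | hbase shell -n runs the query non-interactively. Because the table keeps at most five versions, it should return the five most recent values, Mary5 down to Mary1.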