Implementing Collaborative Filtering with Mahout

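Collaborative Filtering

Put simply, collaborative filtering recommends information a user is likely to find interesting, based on the preferences of a group with similar tastes and shared experience. Individuals cooperate by giving feedback (such as ratings) on items. That feedback is recorded and used to filter information, which in turn helps other people sift through it. Feedback need not be limited to things a user particularly likes: records of strong dislike are just as important.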

Mahout

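Mahout is an open-source project of the Apache Software Foundation (ASF). It provides scalable implementations of classic machine learning algorithms, with the aim of helping developers build intelligent applications more quickly and easily. Its implementations cover clustering, classification, recommendation (collaborative filtering), and frequent itemset mining. In addition, by building on the Apache Hadoop libraries, Mahout can scale out effectively in the cloud.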

Installing Mahout

$ git clone https://github.com/apache/mahout.git mahout

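What if git clone is slow?

Configure an ss (Shadowsocks) proxy. Assuming you already have an ss proxy server, install the shadowsocks client locally and start it. The steps below install and start the command-line client: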

$ yum install python-pip
$ pip install shadowsocks

Edit the configuration file /path/to/shadowsocks.json:

{
"server":"my_server_ip",
"server_port":8388,
"local_port":1080,
"password":"barfoo!",
"timeout":600,
"method":"chacha20-ietf-poly1305"
}

Start the ss client:

$ sslocal -c /path/to/shadowsocks.json -d start

Assuming the local ss port is 1080, set a SOCKS5 proxy for git (git applies http.proxy to both HTTP and HTTPS remotes):

$ git config --global http.proxy socks5://127.0.0.1:1080
$ git config --global https.proxy socks5://127.0.0.1:1080

These settings are stored in ~/.gitconfig.

To remove the proxy, run the following commands (or edit ~/.gitconfig directly):

$ git config --global --unset http.proxy
$ git config --global --unset https.proxy

Setting Up the Environment

Edit ~/.bashrc or ~/.bash_profile (for example with vim) and add:

export MAHOUT_HOME=/path/to/mahout
export MAHOUT_LOCAL=true # for running standalone on your dev machine,
# unset MAHOUT_LOCAL for running on a cluster

With MAHOUT_LOCAL=true, Mahout runs on the local machine. When MAHOUT_LOCAL is unset, it runs on the Hadoop cluster.

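Building

Running any application that uses Mahout requires installing either the binary or the source distribution and setting up the environment.
To build from source: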

$ mvn -DskipTests clean install

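Note: building the source with JDK 9 fails because tools.jar has been removed from the JDK. For details, see: Removed rt.jar and tools.jar
There are two workarounds:

  • Use an older JDK.
  • Or create a tools.jar symlink: ln -s $JAVA_HOME/lib/jrt-fs.jar $JAVA_HOME/lib/tools.jar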

Installing Hadoop

Install Hadoop, then set HADOOP_HOME to the installation directory:

export HADOOP_HOME=/path/to/hadoop

Collaborative Filtering with Mahout (Command Line)

mahout recommenditembased --input score-item-based3.csv --output ./result --tempDir ./temp --numRecommendations 300 -s SIMILARITY_PEARSON_CORRELATION

The similarityClassname argument (the -s option) can be one of:

  • SIMILARITY_COOCCURRENCE => co-occurrence similarity
  • SIMILARITY_LOGLIKELIHOOD => log-likelihood ratio similarity
  • SIMILARITY_TANIMOTO_COEFFICIENT => Tanimoto coefficient similarity
  • SIMILARITY_CITY_BLOCK => Manhattan (city-block) distance similarity
  • SIMILARITY_COSINE => cosine similarity
  • SIMILARITY_PEARSON_CORRELATION => Pearson correlation similarity
  • SIMILARITY_EUCLIDEAN_DISTANCE => Euclidean distance similarity
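
To make a few of these measures more concrete, here is a minimal, self-contained sketch of cosine, Pearson, and Tanimoto similarity in plain Java. The class and method names are hypothetical, and this is not Mahout's own code: Mahout's implementations differ in details such as how missing ratings are handled.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Illustrative sketch only; not Mahout's implementation.
class SimilaritySketch {

    // Cosine similarity of two equal-length rating vectors.
    static double cosine(double[] x, double[] y) {
        double dot = 0, normX = 0, normY = 0;
        for (int i = 0; i < x.length; i++) {
            dot += x[i] * y[i];
            normX += x[i] * x[i];
            normY += y[i] * y[i];
        }
        return dot / (Math.sqrt(normX) * Math.sqrt(normY));
    }

    // Pearson correlation: cosine similarity of the mean-centered vectors.
    static double pearson(double[] x, double[] y) {
        double meanX = Arrays.stream(x).average().orElse(0);
        double meanY = Arrays.stream(y).average().orElse(0);
        double num = 0, devX = 0, devY = 0;
        for (int i = 0; i < x.length; i++) {
            double a = x[i] - meanX;
            double b = y[i] - meanY;
            num += a * b;
            devX += a * a;
            devY += b * b;
        }
        return num / Math.sqrt(devX * devY);
    }

    // Tanimoto coefficient: |A ∩ B| / |A ∪ B|, where each set holds
    // the IDs of the users who rated an item (rating values are ignored).
    static double tanimoto(Set<Long> a, Set<Long> b) {
        Set<Long> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        int union = a.size() + b.size() - intersection.size();
        return union == 0 ? 0.0 : (double) intersection.size() / union;
    }

    public static void main(String[] args) {
        double[] x = {1, 2, 3};
        double[] y = {2, 4, 6};
        System.out.println("cosine  = " + cosine(x, y));
        System.out.println("pearson = " + pearson(x, y));
        Set<Long> ratedA = new HashSet<>(Arrays.asList(1L, 2L, 3L));
        Set<Long> ratedB = new HashSet<>(Arrays.asList(2L, 3L, 4L));
        System.out.println("tanimoto = " + tanimoto(ratedA, ratedB));
    }
}
```

Cosine and Pearson use the rating values, while Tanimoto only looks at who rated what, which makes it a reasonable fit for data without explicit scores.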

Collaborative Filtering with Mahout (Code)

item.csv:

1,101,4.0
1,102,5.0
1,103,3.5
2,101,3.0
2,102,3.5
2,103,4.0
2,104,3.0
3,101,1.5
3,104,4.0
3,105,4.5
3,107,5.0
4,101,4.0
4,103,3.0
4,104,4.5
4,106,3.0
5,101,4.0
5,102,3.0
5,103,5.0
5,104,4.0
5,105,2.5
5,106,4.0
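
Each line of item.csv is a userID,itemID,preference triple, which is the layout FileDataModel reads. As a rough illustration of that structure, the sketch below parses such lines into a nested map in plain Java. The class and method names are hypothetical, and FileDataModel itself does considerably more than this.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only; FileDataModel does far more than this.
class RatingsParserSketch {

    // Parses "userID,itemID,rating" lines into userID -> (itemID -> rating).
    static Map<Long, Map<Long, Double>> parse(List<String> lines) {
        Map<Long, Map<Long, Double>> ratings = new LinkedHashMap<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue; // skip blank lines
            }
            String[] parts = trimmed.split(",");
            long userId = Long.parseLong(parts[0]);
            long itemId = Long.parseLong(parts[1]);
            double rating = Double.parseDouble(parts[2]);
            ratings.computeIfAbsent(userId, k -> new LinkedHashMap<>())
                   .put(itemId, rating);
        }
        return ratings;
    }

    public static void main(String[] args) {
        // The first lines of item.csv
        List<String> sample = Arrays.asList("1,101,4.0", "1,102,5.0", "2,101,3.0");
        System.out.println(parse(sample));
    }
}
```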

UserCF:

package org.conan.mymahout.recommendation;

import java.io.File;
import java.io.IOException;
import java.util.List;

import org.apache.mahout.cf.taste.common.TasteException;
import org.apache.mahout.cf.taste.impl.common.LongPrimitiveIterator;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.EuclideanDistanceSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class UserCF {

    // Number of nearest neighbors considered for each user
    final static int NEIGHBORHOOD_NUM = 2;
    // Maximum number of recommendations per user
    final static int RECOMMENDER_NUM = 3;

    public static void main(String[] args) throws IOException, TasteException {
        // Load the userID,itemID,preference triples
        String file = "datafile/item.csv";
        DataModel model = new FileDataModel(new File(file));
        // User-to-user similarity based on Euclidean distance
        UserSimilarity user = new EuclideanDistanceSimilarity(model);
        NearestNUserNeighborhood neighbor = new NearestNUserNeighborhood(NEIGHBORHOOD_NUM, user, model);
        Recommender r = new GenericUserBasedRecommender(model, neighbor, user);
        LongPrimitiveIterator iter = model.getUserIDs();

        // Print the recommendations for every user as (itemID,estimatedPreference)
        while (iter.hasNext()) {
            long uid = iter.nextLong();
            List<RecommendedItem> list = r.recommend(uid, RECOMMENDER_NUM);
            System.out.printf("uid:%s", uid);
            for (RecommendedItem ritem : list) {
                System.out.printf("(%s,%f)", ritem.getItemID(), ritem.getValue());
            }
            System.out.println();
        }
    }
}
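
To give a feel for what the neighborhood step does, the sketch below computes a distance-based similarity between users over the items both have rated, then picks the nearest neighbors. It is plain Java with hypothetical names. The 1 / (1 + d) mapping from distance to similarity is a common textbook form and an assumption here: Mahout's EuclideanDistanceSimilarity may normalize differently.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only; not Mahout's implementation.
class UserNeighborhoodSketch {

    // Similarity in (0, 1]: 1 / (1 + Euclidean distance over co-rated items).
    // Returns 0 when the two users have no item in common.
    static double similarity(Map<Long, Double> a, Map<Long, Double> b) {
        double sumSquares = 0;
        int shared = 0;
        for (Map.Entry<Long, Double> e : a.entrySet()) {
            Double other = b.get(e.getKey());
            if (other != null) {
                double diff = e.getValue() - other;
                sumSquares += diff * diff;
                shared++;
            }
        }
        return shared == 0 ? 0.0 : 1.0 / (1.0 + Math.sqrt(sumSquares));
    }

    // IDs of the n users most similar to the target user, best first.
    static List<Long> nearest(long target, int n, Map<Long, Map<Long, Double>> ratings) {
        Map<Long, Double> mine = ratings.get(target);
        List<Long> others = new ArrayList<>(ratings.keySet());
        others.remove(Long.valueOf(target));
        others.sort(Comparator.comparingDouble((Long u) -> -similarity(mine, ratings.get(u))));
        return new ArrayList<>(others.subList(0, Math.min(n, others.size())));
    }

    // Ratings of users 1, 2 and 5, copied from item.csv.
    static Map<Long, Map<Long, Double>> sampleRatings() {
        Map<Long, Map<Long, Double>> ratings = new HashMap<>();
        Map<Long, Double> u1 = new HashMap<>();
        u1.put(101L, 4.0); u1.put(102L, 5.0); u1.put(103L, 3.5);
        Map<Long, Double> u2 = new HashMap<>();
        u2.put(101L, 3.0); u2.put(102L, 3.5); u2.put(103L, 4.0); u2.put(104L, 3.0);
        Map<Long, Double> u5 = new HashMap<>();
        u5.put(101L, 4.0); u5.put(102L, 3.0); u5.put(103L, 5.0);
        u5.put(104L, 4.0); u5.put(105L, 2.5); u5.put(106L, 4.0);
        ratings.put(1L, u1);
        ratings.put(2L, u2);
        ratings.put(5L, u5);
        return ratings;
    }

    public static void main(String[] args) {
        Map<Long, Map<Long, Double>> ratings = sampleRatings();
        System.out.println("nearest neighbor of user 1: " + nearest(1L, 1, ratings));
    }
}
```

A user-based recommender then estimates preferences for items the target user has not rated from the ratings of these neighbors.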

ItemCF:

package org.conan.mymahout.recommendation;

import org.apache.mahout.cf.taste.common.TasteException;
import org.apache.mahout.cf.taste.impl.common.LongPrimitiveIterator;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.recommender.GenericItemBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.ItemSimilarity;

import java.io.File;
import java.io.IOException;
import java.util.List;

public class ItemCF {
    // Maximum number of recommendations per user
    final static int RECOMMENDER_NUM = 50;

    public static void main(String[] args) throws IOException, TasteException {
        // Load the userID,itemID,preference triples
        String file = "datafile/item.csv";
        DataModel model = new FileDataModel(new File(file));
        // Item-to-item similarity based on Pearson correlation
        ItemSimilarity similarity = new PearsonCorrelationSimilarity(model);
        Recommender recommender = new GenericItemBasedRecommender(model, similarity);

        // Print the recommendations for every user as (itemID,estimatedPreference)
        LongPrimitiveIterator iter = model.getUserIDs();
        while (iter.hasNext()) {
            long uid = iter.nextLong();
            List<RecommendedItem> list = recommender.recommend(uid, RECOMMENDER_NUM);
            System.out.printf("uid:%s", uid);
            for (RecommendedItem ritem : list) {
                System.out.printf("(%s,%f)", ritem.getItemID(), ritem.getValue());
            }
            System.out.println();
        }
    }
}
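
Item-based recommenders commonly estimate a user's rating for an unseen item as a similarity-weighted average of that user's ratings on other items. The sketch below shows that idea in plain Java. The names are hypothetical, the similarity values are made up for illustration, and the exact weighting used by GenericItemBasedRecommender may differ (for example in how negative similarities are treated).

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only; not Mahout's implementation.
class ItemBasedPredictionSketch {

    // Estimates a rating for a target item as the average of the user's
    // existing ratings, weighted by each rated item's similarity to the target.
    // simToTarget maps itemID -> similarity to the target item.
    // Items with no similarity entry or a non-positive similarity are skipped.
    static double predict(Map<Long, Double> userRatings, Map<Long, Double> simToTarget) {
        double weightedSum = 0;
        double totalWeight = 0;
        for (Map.Entry<Long, Double> e : userRatings.entrySet()) {
            Double sim = simToTarget.get(e.getKey());
            if (sim == null || sim <= 0) {
                continue;
            }
            weightedSum += sim * e.getValue();
            totalWeight += sim;
        }
        // NaN signals that no estimate is possible
        return totalWeight == 0 ? Double.NaN : weightedSum / totalWeight;
    }

    public static void main(String[] args) {
        // Ratings of user 1 from item.csv
        Map<Long, Double> user1 = new HashMap<>();
        user1.put(101L, 4.0);
        user1.put(102L, 5.0);
        user1.put(103L, 3.5);

        // Made-up similarities between item 104 and the items user 1 rated
        Map<Long, Double> simTo104 = new HashMap<>();
        simTo104.put(101L, 0.5);
        simTo104.put(102L, 0.25);
        simTo104.put(103L, 0.25);

        System.out.println("estimated rating for item 104: " + predict(user1, simTo104));
    }
}
```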

References