# Using HanLP for Word Segmentation in Spark

Upload HanLP's data directory (which contains the dictionaries and models) to HDFS, then set the root path in the project's hanlp.properties configuration file, for example: root=hdfs://localhost:9000/tmp/
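To make the effect of the root setting concrete, here is a minimal sketch of how a relative data path could resolve against it. The class name `RootPathSketch`, the helper `resolve`, and the relative path are hypothetical illustrations, and plain concatenation of root and relative path is an assumption about how the setting is applied.

```java
import java.net.URI;

public class RootPathSketch {

    // Joins the configured root with a data path relative to it.
    // Plain concatenation is an assumption made for this illustration.
    static String resolve(String root, String relativePath) {
        String prefix = root.endsWith("/") ? root : root + "/";
        return prefix + relativePath;
    }

    public static void main(String[] args) {
        String root = "hdfs://localhost:9000/tmp/";                    // the example root value
        String relative = "data/dictionary/CoreNatureDictionary.txt";  // illustrative relative path
        URI uri = URI.create(resolve(root, relative));

        // The scheme and authority identify the file system that must serve the file.
        System.out.println(uri.getScheme());     // hdfs
        System.out.println(uri.getAuthority());  // localhost:9000
        System.out.println(uri.getPath());       // /tmp/data/dictionary/CoreNatureDictionary.txt
    }
}
```

Because every resolved path carries the `hdfs` scheme, ordinary local file access cannot open it, which is why a custom I/O adapter is needed.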

Implement the com.hankcs.hanlp.corpus.io.IIOAdapter interface:

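    import java.io.IOException;
    import java.io.InputStream;
    import java.io.OutputStream;
    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    import com.hankcs.hanlp.corpus.io.IIOAdapter;

    // Declared as a nested class of the Spark job class (hence "static").
    public static class HadoopFileIoAdapter implements IIOAdapter {

        // Opens the file at the given path for reading; the URI scheme selects the file system.
        @Override
        public InputStream open(String path) throws IOException {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(URI.create(path), conf);
            return fs.open(new Path(path));
        }

        // Creates (or overwrites) the file at the given path for writing.
        @Override
        public OutputStream create(String path) throws IOException {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(URI.create(path), conf);
            return fs.create(new Path(path));
        }
    }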


Set the IOAdapter and create the segmenter:
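```
import com.hankcs.hanlp.HanLP;
import com.hankcs.hanlp.seg.Segment;
import com.hankcs.hanlp.seg.CRF.CRFSegment;

private static Segment segment;

static {
    // Route HanLP's file I/O through HDFS before the segmenter loads its model.
    HanLP.Config.IOAdapter = new HadoopFileIoAdapter();
    segment = new CRFSegment();
}
```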
After that, segment can be used for word segmentation inside Spark operations.
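Static fields are not serialized along with a task closure, so each executor JVM builds its own segmenter the first time the class is used. The following sketch illustrates that usage pattern in plain Java, without Spark or HanLP on the classpath. `Tokenizer` is a hypothetical stand-in for HanLP's `Segment` that merely splits on whitespace, and `segmentPartition` only imitates the iterator-in, iterator-out shape of a per-partition function; none of these names come from the original article.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class PartitionSegmentSketch {

    /** Hypothetical stand-in for HanLP's Segment; it only splits on whitespace. */
    interface Tokenizer {
        List<String> seg(String text);
    }

    /** Holds a single tokenizer per JVM, similar in spirit to a static segment field. */
    static final class Holder {
        static int created = 0;              // counts how many times the tokenizer is built
        private static Tokenizer tokenizer;

        static synchronized Tokenizer get() {
            if (tokenizer == null) {
                created++;
                tokenizer = text -> Arrays.asList(text.trim().split("\\s+"));
            }
            return tokenizer;
        }
    }

    /** Shaped like a per-partition function: an iterator of lines in, an iterator of token lists out. */
    static Iterator<List<String>> segmentPartition(Iterator<String> lines) {
        Tokenizer tokenizer = Holder.get();  // looked up once per partition, not once per record
        List<List<String>> out = new ArrayList<>();
        while (lines.hasNext()) {
            out.add(tokenizer.seg(lines.next()));
        }
        return out.iterator();
    }

    public static void main(String[] args) {
        // Two "partitions" processed in the same JVM share one tokenizer instance.
        Iterator<List<String>> first = segmentPartition(Arrays.asList("a b", "c d e").iterator());
        Iterator<List<String>> second = segmentPartition(Arrays.asList("f g").iterator());
        first.forEachRemaining(System.out::println);
        second.forEachRemaining(System.out::println);
        System.out.println("tokenizers created: " + Holder.created);
    }
}
```

In a real job the body of `segmentPartition` would call the static `segment` instead of the stand-in, so the model is loaded from HDFS once per executor rather than once per record.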

Original article: https://blog.csdn.net/l294265421/article/details/72932042

Last updated: 2023/03/10, 16:49:38
