K - the type of the document id

public class ClusterAnalyzer<K> extends Object
| Modifier and Type | Field and Description |
|---|---|
| protected HashMap<K,Document<K>> | documents_ |
| protected Segment | segment |
| protected MutableDoubleArrayTrieInteger | vocabulary |

| Constructor and Description |
|---|
| ClusterAnalyzer() |

| Modifier and Type | Method and Description |
|---|---|
| Document<K> | addDocument(K id, List<String> document)<br>Adds a document. |
| Document<K> | addDocument(K id, String document)<br>Adds a document. |
| static double | evaluate(String folderPath, String algorithm)<br>Trains the model. |
| protected int | id(String word) |
| List<Set<K>> | kmeans(int nclusters)<br>k-means clustering. |
| protected List<String> | preprocess(String document)<br>Override this method to implement your own preprocessing logic (preprocessing, word segmentation, stop-word removal). |
| List<Set<K>> | repeatedBisection(double limit_eval)<br>Repeated bisection clustering. |
| List<Set<K>> | repeatedBisection(int nclusters)<br>Repeated bisection clustering. |
| List<Set<K>> | repeatedBisection(int nclusters, double limit_eval)<br>Repeated bisection clustering. |
| int | size()<br>The number of documents added to this cluster analyzer so far. |
| protected SparseVector | toVector(List<String> wordList) |

protected Segment segment

protected MutableDoubleArrayTrieInteger vocabulary

protected int id(String word)

protected List<String> preprocess(String document)
document - the document

protected SparseVector toVector(List<String> wordList)
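
The page does not describe how `id(String word)` and `toVector(List<String> wordList)` work internally. As a rough illustration of the general idea of mapping words to integer ids and counting them into a sparse vector, here is a minimal sketch. It assumes plain term-frequency weights and uses a `HashMap` and a `TreeMap` as hypothetical stand-ins for `MutableDoubleArrayTrieInteger` and `SparseVector`; the class name `ToVectorSketch` is invented for this example, and the library's actual weighting may differ.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class ToVectorSketch {
    // Hypothetical stand-in for the `vocabulary` field: word -> integer id.
    private final Map<String, Integer> vocabulary = new HashMap<>();

    // Returns the id of a word, assigning the next free id on first sight.
    public int id(String word) {
        Integer existing = vocabulary.get(word);
        if (existing != null) {
            return existing;
        }
        int next = vocabulary.size();
        vocabulary.put(word, next);
        return next;
    }

    // Builds a sparse term-frequency vector (word id -> count).
    // Assumption: raw counts; the real class may weight or normalize differently.
    public Map<Integer, Double> toVector(List<String> wordList) {
        Map<Integer, Double> vector = new TreeMap<>();
        for (String word : wordList) {
            vector.merge(id(word), 1.0, Double::sum);
        }
        return vector;
    }

    public static void main(String[] args) {
        ToVectorSketch sketch = new ToVectorSketch();
        // prints {0=2.0, 1=1.0}
        System.out.println(sketch.toVector(Arrays.asList("a", "b", "a")));
    }
}
```

Only the words that occur in a document take up space in the vector, which is the usual reason for a sparse representation when the vocabulary is large.
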

public Document<K> addDocument(K id, String document)
id - the document id
document - the document content

public Document<K> addDocument(K id, List<String> document)
id - the document id
document - the document content

public List<Set<K>> kmeans(int nclusters)
nclusters - the number of clusters

public int size()
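
To make the `List<Set<K>>` return shape of `kmeans(int nclusters)` concrete, the following is a minimal, self-contained sketch of textbook k-means that groups document ids into sets. It is not the library's implementation: it assumes dense `double[]` vectors, squared Euclidean distance, and deterministic seeding from the first `nclusters` documents, whereas the class above works on `SparseVector` and may use a different similarity measure and initialization. The class name `KMeansSketch` and the `maxIterations` parameter are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class KMeansSketch {
    // Clusters documents into nclusters groups and returns the ids in each group.
    public static <K> List<Set<K>> kmeans(Map<K, double[]> docs, int nclusters, int maxIterations) {
        List<K> ids = new ArrayList<>(docs.keySet());
        if (nclusters < 1 || nclusters > ids.size() || maxIterations < 1) {
            throw new IllegalArgumentException("need 1 <= nclusters <= docs.size() and maxIterations >= 1");
        }
        int dim = docs.get(ids.get(0)).length;
        // Assumption: seed centroids with the first nclusters documents (deterministic).
        double[][] centroids = new double[nclusters][];
        for (int c = 0; c < nclusters; c++) {
            centroids[c] = docs.get(ids.get(c)).clone();
        }
        int[] assignment = new int[ids.size()];
        Arrays.fill(assignment, -1);
        for (int iter = 0; iter < maxIterations; iter++) {
            // Assignment step: move each document to its nearest centroid.
            boolean changed = false;
            for (int i = 0; i < ids.size(); i++) {
                int best = nearest(docs.get(ids.get(i)), centroids);
                if (best != assignment[i]) {
                    assignment[i] = best;
                    changed = true;
                }
            }
            if (!changed) {
                break; // converged
            }
            // Update step: recompute each centroid as the mean of its members.
            double[][] sums = new double[nclusters][dim];
            int[] counts = new int[nclusters];
            for (int i = 0; i < ids.size(); i++) {
                double[] v = docs.get(ids.get(i));
                counts[assignment[i]]++;
                for (int d = 0; d < dim; d++) {
                    sums[assignment[i]][d] += v[d];
                }
            }
            for (int c = 0; c < nclusters; c++) {
                if (counts[c] == 0) {
                    continue; // keep the old centroid for an empty cluster
                }
                for (int d = 0; d < dim; d++) {
                    centroids[c][d] = sums[c][d] / counts[c];
                }
            }
        }
        List<Set<K>> clusters = new ArrayList<>();
        for (int c = 0; c < nclusters; c++) {
            clusters.add(new LinkedHashSet<>());
        }
        for (int i = 0; i < ids.size(); i++) {
            clusters.get(assignment[i]).add(ids.get(i));
        }
        return clusters;
    }

    // Index of the centroid with the smallest squared Euclidean distance to v.
    static int nearest(double[] v, double[][] centroids) {
        int best = 0;
        double bestDist = Double.MAX_VALUE;
        for (int c = 0; c < centroids.length; c++) {
            double dist = 0;
            for (int d = 0; d < v.length; d++) {
                double diff = v[d] - centroids[c][d];
                dist += diff * diff;
            }
            if (dist < bestDist) {
                bestDist = dist;
                best = c;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        Map<String, double[]> docs = new LinkedHashMap<>();
        docs.put("a", new double[]{0, 0});
        docs.put("b", new double[]{10, 10});
        docs.put("c", new double[]{0, 1});
        docs.put("d", new double[]{10, 11});
        // prints [[a, c], [b, d]]
        System.out.println(kmeans(docs, 2, 100));
    }
}
```

Each inner set holds the ids of one cluster, so the outer list has exactly `nclusters` entries in this sketch.
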

public List<Set<K>> repeatedBisection(int nclusters)
nclusters - the number of clusters

public List<Set<K>> repeatedBisection(double limit_eval)
limit_eval - threshold on the increase of the criterion function

public List<Set<K>> repeatedBisection(int nclusters, double limit_eval)
nclusters - the number of clusters
limit_eval - threshold on the increase of the criterion function

public static double evaluate(String folderPath, String algorithm)
folderPath - the root directory of the classification corpus. The directory must have the following structure:
algorithm - kmeans or repeated bisection
IOException - any possible IO exception

Copyright © 2014–2021 码农场. All rights reserved.