How to bypass the Stanford Parser's built-in POS tagging and use your own POS tags

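Building sentence parse trees with the Stanford Parser and the OpenNLP Shallow Parser
A recent project required parsing the sentences in a given text: using POS tags and constituent information to find the dependencies between words and phrases, then building a parse tree for each sentence from those dependencies. This needs both the Stanford Parser and the Shallow Parser from OpenNLP. Both parsers are implemented in Java, are called through an API, and can output a syntactic parse tree for a sentence. Below is a summary of what each parser does and how to call it from a Java program.
1 Shallow Parser
A shallow parser identifies the phrases in a sentence, including noun phrases (NP), verb phrases (VP), adjective phrases (ADJP), adverb phrases (ADVP), and so on. A sample program follows.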
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.StringReader;
import java.util.HashMap;

import opennlp.tools.chunker.ChunkerME;
import opennlp.tools.chunker.ChunkerModel;
import opennlp.tools.cmdline.PerformanceMonitor;
import opennlp.tools.cmdline.postag.POSModelLoader;
import opennlp.tools.postag.POSModel;
import opennlp.tools.postag.POSSample;
import opennlp.tools.postag.POSTaggerME;
import opennlp.tools.tokenize.WhitespaceTokenizer;
import opennlp.tools.util.InvalidFormatException;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;

/** A shallow parser based on OpenNLP
 * @author yangliu
 * @blog http://blog.csdn.net/yangliuy
 * @mail yang.liu@pku.edu.cn
 */
public class ShallowParser {
    private static ShallowParser instance = null;
    private static POSModel model;
    private static ChunkerModel cModel;

    // Singleton pattern: the models are loaded only once
    public static ShallowParser getInstance() throws InvalidFormatException, IOException {
        if (ShallowParser.instance == null) {
            POSModel model = new POSModelLoader().load(new File("en-pos-maxent.bin"));
            InputStream is = new FileInputStream("en-chunker.bin");
            ChunkerModel cModel = new ChunkerModel(is);
            ShallowParser.instance = new ShallowParser(model, cModel);
        }
        return ShallowParser.instance;
    }

    public ShallowParser(POSModel model, ChunkerModel cModel) {
        ShallowParser.model = model;
        ShallowParser.cModel = cModel;
    }

    /** A shallow parser: chunks a sentence and returns a map from word
     * index to phrase label, <wordIndex, phraseLabel>.
     * Notice: There should be " " BEFORE and after ",", " ","(",")" etc.
     * @param input The input sentence
     * @return map from 1-based word index to numbered phrase label
     */
    public HashMap<Integer, String> chunk(String input) throws IOException {
        PerformanceMonitor perfMon = new PerformanceMonitor(System.err, "sent");
        POSTaggerME tagger = new POSTaggerME(model);
        ObjectStream<String> lineStream = new PlainTextByLineStream(
                new StringReader(input));
        perfMon.start();
        String line;
        String whitespaceTokenizerLine[] = null;
        String[] tags = null;
        while ((line = lineStream.read()) != null) {
            whitespaceTokenizerLine = WhitespaceTokenizer.INSTANCE.tokenize(line);
            tags = tagger.tag(whitespaceTokenizerLine);
            POSSample posTags = new POSSample(whitespaceTokenizerLine, tags);
            System.out.println(posTags.toString());
            perfMon.incrementCounter();
        }
        perfMon.stopAndPrintFinalResult();

        // chunker
        ChunkerME chunkerME = new ChunkerME(cModel);
        String result[] = chunkerME.chunk(whitespaceTokenizerLine, tags);
        HashMap<Integer, String> phraseLablesMap = new HashMap<Integer, String>();
        Integer wordCount = 1;
        Integer phLableCount = 0;
        for (String phLable : result) {
            if (phLable.equals("O")) phLable += "-Punctuation"; // The phLable of the last word is OP
            if (phLable.split("-")[0].equals("B")) phLableCount++; // a B- tag starts a new phrase
            phLable = phLable.split("-")[1] + phLableCount;
            //if(phLable.equals("ADJP")) phLable = "NP"; //Notice: ADJP included in NP
            //if(phLable.equals("ADVP")) phLable = "VP"; //Notice: ADVP included in VP
            System.out.println(wordCount + ":" + phLable);
            phraseLablesMap.put(wordCount, phLable);
            wordCount++;
        }
        //Span[] span = chunkerME.chunkAsSpans(whitespaceTokenizerLine, tags);
        //for (Span phLable : span)
        return phraseLablesMap;
    }

    /** Just for testing */
    public static void main(String[] args) throws IOException {
        //Notice: There should be " " BEFORE and after ",", " ","(",")" etc.
        String input = "We really enjoyed using the Canon PowerShot SD500 .";
        //String input = "Bell , based in Los Angeles , makes and distributes electronic , computer and building products .";
        ShallowParser swParser = ShallowParser.getInstance();
        swParser.chunk(input);
    }
}
Make sure the paths to the POS model and the chunker model are configured correctly. Both model files can be downloaded from the OpenNLP website.
Loading POS Tagger model ... done (1.563s)
Average: 9.3 sent/s
Total: 1 sent
Runtime: 0.107s
We_PRP really_RB enjoyed_VBD using_VBG the_DT Canon_NNP PowerShot_NNP SD500_NNP ._.
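The output shows that the shallow parser first prints the POS tags and then identifies two noun phrases (NP1 and NP4), one verb phrase (VP3), and one adverb phrase (ADVP2) in the sentence.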
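The phrase numbering can be illustrated without loading any OpenNLP model. The sketch below repeats only the labeling loop on a hand-written tag array. The B-/I-/O tags are an assumption about what the chunker might return for the sample sentence, and the class and method names are hypothetical, chosen for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class ChunkLabelSketch {
    /** Maps 1-based word positions to numbered phrase labels such as NP1 or VP3. */
    static Map<Integer, String> numberPhrases(String[] chunkTags) {
        Map<Integer, String> labels = new LinkedHashMap<>();
        int phraseCount = 0;
        for (int i = 0; i < chunkTags.length; i++) {
            String tag = chunkTags[i];
            if (tag.equals("O")) {
                // Tokens outside any chunk (typically punctuation) reuse the current counter.
                labels.put(i + 1, "Punctuation" + phraseCount);
                continue;
            }
            String[] parts = tag.split("-", 2); // "B-NP" -> ["B", "NP"]
            if (parts[0].equals("B")) {
                phraseCount++; // a B- tag opens a new phrase
            }
            labels.put(i + 1, parts[1] + phraseCount);
        }
        return labels;
    }

    public static void main(String[] args) {
        // Assumed chunk tags for "We really enjoyed using the Canon PowerShot SD500 ."
        String[] tags = {"B-NP", "B-ADVP", "B-VP", "I-VP", "B-NP", "I-NP", "I-NP", "I-NP", "O"};
        numberPhrases(tags).forEach((pos, label) -> System.out.println(pos + ":" + label));
    }
}
```

With these assumed tags, the words fall into NP1, ADVP2, VP3, and NP4, which are the phrase labels reported for the sample sentence.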
2 Stanford Parser
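The Stanford Parser finds the dependencies between the words of a sentence and outputs them in Stanford Dependencies format, in forms such as a directed graph or a tree. Sample code follows.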
package edu.pku.yangliu.nlp.
import java.io.IOException;
import java.io.StringReader;
import java.util.HashMap;
import java.util.List;

import opennlp.tools.util.InvalidFormatException;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.ling.HasWord;
import edu.stanford.nlp.objectbank.TokenizerFactory;
import edu.stanford.nlp.parser.lexparser.LexicalizedParser;
import edu.stanford.nlp.process.CoreLabelTokenFactory;
import edu.stanford.nlp.process.DocumentPreprocessor;
import edu.stanford.nlp.process.PTBTokenizer;
import edu.stanford.nlp.trees.GrammaticalStructure;
import edu.stanford.nlp.trees.GrammaticalStructureFactory;
import edu.stanford.nlp.trees.PennTreebankLanguagePack;
import edu.stanford.nlp.trees.Tree;
import edu.stanford.nlp.trees.TreebankLanguagePack;
import edu.stanford.nlp.trees.TypedDependency;

/** Parse sentences based on the Stanford Parser
 * @author yangliu
 * @blog http://blog.csdn.net/yangliuy
 * @mail yang.liu@pku.edu.cn
 */
public class StanfordParser {
    private static StanfordParser instance = null;
    private static LexicalizedParser lp;

    // Singleton pattern: the grammar is loaded only once
    public static StanfordParser getInstance() {
        if (StanfordParser.instance == null) {
            LexicalizedParser lp = LexicalizedParser.loadModel("edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz", "-retainTmpSubcategories");
            StanfordParser.instance = new StanfordParser(lp);
        }
        return StanfordParser.instance;
    }

    public StanfordParser(LexicalizedParser lp) {
        StanfordParser.lp = lp;
    }

    /** Parse sentences in a file
     * @param SentFilename The input file
     */
    public void DPFromFile(String SentFilename) {
        TreebankLanguagePack tlp = new PennTreebankLanguagePack();
        GrammaticalStructureFactory gsf = tlp.grammaticalStructureFactory();
        for (List<HasWord> sentence : new DocumentPreprocessor(SentFilename)) {
            Tree parse = lp.apply(sentence);
            GrammaticalStructure gs = gsf.newGrammaticalStructure(parse);
            List<TypedDependency> tdl = (List<TypedDependency>) gs.typedDependenciesCollapsedTree();
            System.out.println(tdl);
        }
    }

    /** Parse sentences from a String
     * @param sent The input sentence
     * @return List<TypedDependency> The list of typed dependencies
     */
    public List<TypedDependency> DPFromString(String sent) {
        TokenizerFactory<CoreLabel> tokenizerFactory =
                PTBTokenizer.factory(new CoreLabelTokenFactory(), "");
        List<CoreLabel> rawWords =
                tokenizerFactory.getTokenizer(new StringReader(sent)).tokenize();
        Tree parse = lp.apply(rawWords);
        TreebankLanguagePack tlp = new PennTreebankLanguagePack();
        GrammaticalStructureFactory gsf = tlp.grammaticalStructureFactory();
        GrammaticalStructure gs = gsf.newGrammaticalStructure(parse);
        // Choose typedDependenciesCollapsedTree so that dependencies
        // which do not preserve the tree structure are omitted
        return (List<TypedDependency>) gs.typedDependenciesCollapsedTree();
    }

    /** Just for testing
     * @param args
     * @throws IOException
     * @throws InvalidFormatException
     */
    public static void main(String[] args) throws InvalidFormatException, IOException {
        //Notice: There should be " " BEFORE and after ",", " ","(",")" etc.
        String sent = "We really enjoyed using the Canon PowerShot SD500 .";
        //String sent = "Bell , based in Los Angeles , makes and distributes electronic , computer and building products .";
        //String sent = "It has an exterior design that combines form and function more elegantly than any point-and-shoot we've ever tested . ";
        //String sent = "A Digic II-powered image-processing system enables the SD500 to snap a limitless stream of 7-megapixel photos at a respectable clip , its start-up time is tops in its class , and it delivers decent photos when compared to its competition . ";
        //String sent = "I've had it for about a month and it is simply the best point-and-shoot your money can buy . ";
        StanfordParser sdPaser = StanfordParser.getInstance();
        List<TypedDependency> tdl = sdPaser.DPFromString(sent);
        for (TypedDependency oneTdl : tdl) {
            System.out.println(oneTdl);
        }
        ShallowParser swParser = ShallowParser.getInstance();
        HashMap<Integer, String> phraseLablesMap = swParser.chunk(sent);
        // Combine the dependencies and the phrase labels into a tree (WDTree is a project class)
        WDTree wdtree = new WDTree();
        WDTreeNode root = wdtree.bulidWDTreeFromList(tdl, phraseLablesMap);
    }
}
The dependencies between words, the POS tag information, and the sentence parse tree are output as follows.
Loading parser from serialized file edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz ... done [2.1 sec].
nsubj(enjoyed-3, We-1)
advmod(enjoyed-3, really-2)
root(ROOT-0, enjoyed-3)
xcomp(enjoyed-3, using-4)
det(SD500-8, the-5)
nn(SD500-8, Canon-6)
nn(SD500-8, PowerShot-7)
dobj(using-4, SD500-8)
Loading POS Tagger model ... done (1.492s)
We_PRP really_RB enjoyed_VBD using_VBG the_DT Canon_NNP PowerShot_NNP SD500_NNP ._.
Average: 200.0 sent/s
Total: 1 sent
Runtime: 0.0050s
children of ROOT-0_ (phLable:null):
rel:root phLable:VP3
children of enjoyed-3_ (phLable:VP3):
rel:nsubj phLable:NP1
rel:advmod phLable:ADVP2
rel:xcomp phLable:VP3
children of using-4_ (phLable:VP3):
rel:dobj phLable:NP4
children of SD500-8_ (phLable:NP4):
rel:det phLable:NP4
rel:nn phLable:NP4
rel:nn phLable:NP4
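To show how a tree can be recovered from the flat dependency list, here is a minimal sketch that parses lines in the printed `rel(governor-i, dependent-j)` form and groups dependents under their governor. It works on the printed strings only and is not the WDTree implementation; the class name, the method name, and the regular expression are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class DependencyTreeSketch {
    // Matches lines such as "nsubj(enjoyed-3, We-1)".
    private static final Pattern DEP =
            Pattern.compile("(\\w+)\\((.+)-(\\d+), (.+)-(\\d+)\\)");

    /** Groups dependent word indices under the index of their governor. */
    static Map<Integer, List<Integer>> childrenByGovernor(List<String> lines) {
        Map<Integer, List<Integer>> children = new TreeMap<>();
        for (String line : lines) {
            Matcher m = DEP.matcher(line.trim());
            if (!m.matches()) {
                continue; // skip anything that is not a dependency line
            }
            int governor = Integer.parseInt(m.group(3));
            int dependent = Integer.parseInt(m.group(5));
            children.computeIfAbsent(governor, k -> new ArrayList<>()).add(dependent);
        }
        return children;
    }

    public static void main(String[] args) {
        List<String> deps = Arrays.asList(
                "nsubj(enjoyed-3, We-1)", "advmod(enjoyed-3, really-2)",
                "root(ROOT-0, enjoyed-3)", "xcomp(enjoyed-3, using-4)",
                "det(SD500-8, the-5)", "nn(SD500-8, Canon-6)",
                "nn(SD500-8, PowerShot-7)", "dobj(using-4, SD500-8)");
        childrenByGovernor(deps).forEach((gov, kids) ->
                System.out.println("children of word " + gov + ": " + kids));
    }
}
```

Index 0 stands for ROOT. Starting there and following the child lists visits enjoyed, then its three dependents, then using and SD500, which is the nesting shown in the tree output.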
Detailed usage of the Stanford Parser for lexical and syntactic analysis
Date: 10:04:07
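2. In Eclipse, create a new Java project and add the two JARs from the root of the unpacked directory, stanford-parser.jar and stanford-parser-3.*.*-models.jar, to the project's referenced libraries.
2. To read the input from a text file instead of the String[] sent array:
Eclipse > Run > Run Configurations > Arguments > Program arguments:
Enter: edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz C:\Users\minglan\Desktop\test2.txt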
The screen is really big, but the price
is too expensive!
The price is expensive, students don't buy it usually.
The screen is beautiful, but the price is not!
The screen is big and beautiful!
String[] sent = { "这", "是", "第一个", "测试", "句子", "。" };
String grammar = args.length > 0 ? args[0] : "edu/stanford/nlp/models/lexparser/chinesePCFG.ser.gz";
TreebankLanguagePack tlp = new ChineseTreebankLanguagePack();
String[] options = {"-maxLength", "80"};
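The argument handling can be tried on its own, without the parser on the classpath. The sketch below assumes the first program argument is the grammar path and the second, if present, is the text file to parse; the class and method names are hypothetical.

```java
import java.util.Optional;

final class ParserArgsSketch {
    // Default taken from the fragment above; used when no argument is given.
    static final String DEFAULT_GRAMMAR =
            "edu/stanford/nlp/models/lexparser/chinesePCFG.ser.gz";

    /** First program argument, or the default grammar path. */
    static String grammar(String[] args) {
        return args.length > 0 ? args[0] : DEFAULT_GRAMMAR;
    }

    /** Second program argument (the file to parse), if one was given. */
    static Optional<String> inputFile(String[] args) {
        return args.length > 1 ? Optional.of(args[1]) : Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println("grammar: " + grammar(args));
        System.out.println("input:   " + inputFile(args).orElse("(built-in sentence array)"));
    }
}
```

Run with no arguments, it falls back to the Chinese grammar and the built-in sentence; run with two arguments, it picks up both values from the Eclipse program arguments field.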
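II. The Stanford Parser's built-in graphical interface
Click "Load File" to import a file to parse, or type the text to parse directly into the large input box at the top.
Click "Load Parser" to load a model file and wait a moment (loading may take a few seconds). When the progress bar completes, the "Parser" button becomes enabled; click it to parse the highlighted content of the input box. The resulting tree is displayed in the lower pane.
III. The Stanford Parser can also be run from scripts: lexparser-gui.bat (Windows) and lexparser.sh (Linux). See the official documentation (FAQ) for details.
IV. An online demo of the Stanford Parser is available at http://nlp.stanford.edu:8080/parser/index.jsp
Parsing with the Stanford Parser can use a lot of memory, so the JVM heap size in Eclipse should be adjusted: under Run > Run Configurations > Arguments > VM arguments, enter -Xms256M -Xmx800M. The right sizes depend on the workload and on the machine.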
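One quick way to confirm that the VM arguments were picked up is to print the heap limit from inside the program. This is a small sketch using only the standard `Runtime` API; the class name is hypothetical.

```java
final class HeapCheckSketch {
    /** Maximum heap the JVM will attempt to use, in megabytes. */
    static long maxHeapMb() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        // With -Xmx800M this should report a value close to 800.
        System.out.println("Max heap available to the JVM: " + maxHeapMb() + " MB");
    }
}
```

If the reported value does not change after editing the run configuration, the arguments were probably entered under program arguments rather than VM arguments.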
With long sentences, the parser may report the error "FactoredParser: exceeded MAX_ITEMS work limit [200000 items]; aborting." ...
String[] options = { "-maxLength", "140", "-MAX_ITEMS","500000"};
lp = LexicalizedParser.loadModel("edu/stanford/nlp/models/lexparser/", options);
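The options array is just alternating flag names and values, so it can be built from numbers rather than typed by hand. The sketch below assumes only the two flags shown above, `-maxLength` and `-MAX_ITEMS`; the helper names are hypothetical, and the whitespace token count is a rough stand-in for the parser's own tokenization.

```java
final class ParserOptionsSketch {
    /** Builds the flag/value array in the form passed to loadModel above. */
    static String[] options(int maxLength, int maxItems) {
        return new String[] {
            "-maxLength", String.valueOf(maxLength),
            "-MAX_ITEMS", String.valueOf(maxItems)
        };
    }

    /** Rough pre-check: does the sentence have at most maxLength whitespace-separated tokens? */
    static boolean withinLimit(String sentence, int maxLength) {
        String trimmed = sentence.trim();
        if (trimmed.isEmpty()) {
            return true;
        }
        return trimmed.split("\\s+").length <= maxLength;
    }

    public static void main(String[] args) {
        System.out.println(String.join(" ", options(140, 500000)));
        System.out.println(withinLimit("We really enjoyed using the Canon PowerShot SD500 .", 140));
    }
}
```

A pre-check like `withinLimit` makes it possible to log or split over-long sentences before they reach the parser.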
CC: conjunction, coordinating
CD: numeral, cardinal
DT: determiner
EX: existential there
FW: foreign word
IN: preposition or conjunction, subordinating
JJ: adjective or numeral, ordinal
JJR: adjective, comparative
JJS: adjective, superlative
LS: list item marker
MD: modal auxiliary
NN: noun, common, singular or mass
NNS: noun, common, plural
NNP: noun, proper, singular
NNPS: noun, proper, plural
PDT: pre-determiner
POS: genitive marker
PRP: pronoun, personal
PRP$: pronoun, possessive
RB: adverb
RBR: adverb, comparative
RBS: adverb, superlative
RP: particle
SYM: symbol
TO: "to" as preposition or infinitive marker
UH: interjection
VB: verb, base form
VBD: verb, past tense
VBG: verb, present participle or gerund
VBN: verb, past participle
VBP: verb, present tense, not 3rd person singular
VBZ: verb, present tense, 3rd person singular
WDT: WH-determiner
WP: WH-pronoun
WP$: WH-pronoun, possessive
WRB: WH-adverb
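Tagger output in the `word_TAG` form shown earlier can be split back into words and tags, and the tag prefixes in the list above make it easy to fold tags into coarse classes. The sketch below is one way to do that; the class name, the method names, and the choice of four coarse classes are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.List;

final class TaggedTokenSketch {
    /** Splits "We_PRP really_RB ..." into {word, tag} pairs. */
    static List<String[]> parse(String tagged) {
        List<String[]> pairs = new ArrayList<>();
        for (String token : tagged.trim().split("\\s+")) {
            int cut = token.lastIndexOf('_'); // the tag follows the last underscore
            if (cut < 0) {
                continue; // not a word_TAG token
            }
            pairs.add(new String[] {token.substring(0, cut), token.substring(cut + 1)});
        }
        return pairs;
    }

    /** Folds a Penn Treebank tag into a coarse word class by its prefix. */
    static String coarse(String tag) {
        if (tag.startsWith("NN")) return "noun";      // NN, NNS, NNP, NNPS
        if (tag.startsWith("VB")) return "verb";      // VB, VBD, VBG, VBN, VBP, VBZ
        if (tag.startsWith("JJ")) return "adjective"; // JJ, JJR, JJS
        if (tag.startsWith("RB")) return "adverb";    // RB, RBR, RBS
        return "other";
    }

    public static void main(String[] args) {
        String line = "We_PRP really_RB enjoyed_VBD using_VBG the_DT Canon_NNP PowerShot_NNP SD500_NNP ._.";
        for (String[] pair : parse(line)) {
            System.out.println(pair[0] + "\t" + pair[1] + "\t" + coarse(pair[1]));
        }
    }
}
```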
ROOT: the sentence being processed
IP: simple clause
NP: noun phrase
VP: verb phrase
PU: sentence-ending punctuation, usually a period, question mark, or exclamation mark
LCP: localizer phrase
PP: prepositional phrase
CP: phrase formed with 的 that expresses a modifying relation
DNP: phrase formed with 的 that expresses possession
ADVP: adverb phrase
ADJP: adjective phrase
DP: determiner phrase
QP: quantifier phrase
NN: common noun
NR: proper noun
NT: temporal noun
PN: pronoun
VV: verb
VC: 是 (copula)
CC: conjunction
VE: 有 (existential "have")
VA: predicative adjective
AS: aspect marker (e.g. 了)
VRD: verb-resultative compound
Relations:
abbrev: abbreviation modifier
acomp: adjectival complement
advcl: adverbial clause modifier
advmod: adverbial modifier
agent: agent; usually appears when "by" is present
amod: adjectival modifier
appos: appositional modifier
attr: attributive
aux: auxiliary; non-main verbs and auxiliaries such as be, have, should, could
auxpass: passive auxiliary
cc: coordination; usually attached to the first conjunct
ccomp: clausal complement
complm: complementizer; the word that introduces a complement clause
conj: conjunct; links two coordinated words
cop: copula (be, seem, appear, etc.), the link between the subject and the predicate
csubj: clausal subject
csubjpass: clausal passive subject
dep: dependent
det: determiner, such as an article
dobj: direct object
expl: expletive; mainly captures existential "there"
infmod: infinitival modifier
iobj: indirect object
mark: marker; mainly appears with "that", "whether", "because", "when"
mwe: multi-word expression
neg: negation modifier
nn: noun compound modifier
npadvmod: noun phrase as adverbial modifier
nsubj: nominal subject
nsubjpass: passive nominal subject
num: numeric modifier
number: element of compound number
parataxis: parataxis
partmod: participial modifier
pcomp: prepositional complement
pobj: object of a preposition
poss: possession modifier
possessive: possessive modifier; the relation between the possessor and its 's
preconj: preconjunct; often appears with "either", "both", "neither"
predet: predeterminer; often a word such as "all"
prep: prepositional modifier
prepc: prepositional clausal modifier
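As a small illustration of reading these relations, the sketch below collects the nn (noun compound modifier) dependents of a head word and joins them with the head in word order. It works on the printed `rel(governor-i, dependent-j)` strings; the class name, the method name, and the regular expression are assumptions for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class CompoundNounSketch {
    // Matches lines such as "nn(SD500-8, Canon-6)".
    private static final Pattern DEP =
            Pattern.compile("(\\w+)\\((.+)-(\\d+), (.+)-(\\d+)\\)");

    /** Joins the head word at headIndex with its nn modifiers, in sentence order. */
    static String compound(List<String> deps, int headIndex) {
        TreeMap<Integer, String> words = new TreeMap<>(); // word index -> word
        for (String dep : deps) {
            Matcher m = DEP.matcher(dep.trim());
            if (!m.matches() || Integer.parseInt(m.group(3)) != headIndex) {
                continue; // only relations governed by the requested head
            }
            words.put(headIndex, m.group(2)); // the head itself
            if (m.group(1).equals("nn")) {
                words.put(Integer.parseInt(m.group(5)), m.group(4));
            }
        }
        return String.join(" ", words.values());
    }

    public static void main(String[] args) {
        List<String> deps = Arrays.asList(
                "det(SD500-8, the-5)", "nn(SD500-8, Canon-6)",
                "nn(SD500-8, PowerShot-7)", "dobj(using-4, SD500-8)");
        System.out.println(compound(deps, 8)); // prints "Canon PowerShot SD500"
    }
}
```

The result is empty when the given index governs no relation at all, so callers that need the bare head word in that case have to look it up separately.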
