Creating the Files Used by SUFARY, for Ver.2.1

YAMASITA Tatuo (tatuo-y@is.aist-nara.ac.jp)
last update 991014

SUFARY(http://cl.aist-nara.ac.jp/lab/nlt/ss/)


* Notation
 $SUFARY : the directory in which the SUFARY package was unpacked


* Introduction

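SUFARY supports the following two kinds of search over a target text.

- String search

  Finds the position in the text at which a given string occurs.

- Text-area search

  A text area is a substring enclosed by some pair of tags, for
  example an "article" enclosed by <article></article> or a "title"
  enclosed by <title></title>. A text-area search finds the text
  areas that contain a given string.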

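Whichever search you run, SUFARY needs a file called the array file
in addition to the target text.

A text-area search additionally needs a file called the DocID file.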

+--------------+------------------------+----------------+
|File          |Purpose                 |Created by      |
+--------------+------------------------+----------------+
|array file    |required for any search |mkary           |
|DocID file    |text-area search        |mkdid           |
+--------------+------------------------+----------------+

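Up to SUFARY version 2.1b3, the byte order of the array file and the
DocID file depended on the environment (OS). From version 2.1 on, the
byte order is big endian regardless of the environment.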
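
To illustrate the byte-order point, the sketch below writes and reads a
list of offsets as big-endian integers, so that the bytes are the same
on every machine. The 4-byte unsigned width and the helper names
pack_offsets and unpack_offsets are assumptions made for this example;
the actual on-disk layout of the SUFARY files is not described here.

```python
import struct

def pack_offsets(offsets):
    """Encode offsets as big-endian 4-byte unsigned integers (assumed width)."""
    return struct.pack(">%dI" % len(offsets), *offsets)

def unpack_offsets(data):
    """Decode a byte string produced by pack_offsets."""
    count = len(data) // 4
    return list(struct.unpack(">%dI" % count, data))

data = pack_offsets([0, 8, 19])
print(data.hex())
print(unpack_offsets(data))
```

The ">" prefix forces big-endian order whatever the host CPU uses, which
is what makes a file written on one machine readable on another.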

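* Creating the array file

Which kind of array file to build depends on the unit in which you
want to search.

For example, a character-level array file lets you search for any
substring contained in the text. With a character-level array file
for samp1.txt you can of course find "YAMASITA" and "Tatuo", but also
odd fragments such as "ASITA T" or "st-na".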

------ samp1.txt
YAMASITA Tatuo
tatuo-y@cl.aist-nara.ac.jp
http://cl.aist-nara.ac.jp/~tatuo-y/
------
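
Why a character-level index can find any substring is easiest to see
with a small suffix array. The following is a minimal sketch of the
general technique, not mkary's implementation: build_suffix_array and
find_all are hypothetical names, and the naive sort is only suitable
for tiny texts such as samp1.txt.

```python
def build_suffix_array(text):
    """Return every suffix start position, sorted by the suffix it begins."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def find_all(text, sa, pattern):
    """Return the sorted positions where pattern occurs, by binary search."""
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:                      # first suffix whose prefix >= pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    hi = len(sa)
    while lo < hi:                      # first suffix whose prefix > pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])

SAMP1 = (
    "YAMASITA Tatuo\n"
    "tatuo-y@cl.aist-nara.ac.jp\n"
    "http://cl.aist-nara.ac.jp/~tatuo-y/\n"
)
sa = build_suffix_array(SAMP1)
print(find_all(SAMP1, sa, "ASITA T"))
print(find_all(SAMP1, sa, "st-na"))
```

Because every position of the text is indexed, a pattern that starts
in the middle of a word is found just as easily as a whole word.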

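A line-level array file lets you search for strings that start at the
beginning of a line (this is called prefix search). With a line-level
array file for samp1.txt, "YAMASITA", "YAM", and "http" are found, but
"aist" and "Tatuo" are not, because they do not start at the beginning
of a line. Line-level array files suit dictionary lookup on files such
as samp2.txt. The array file is also smaller than a character-level
one.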

------ samp2.txt
fish 魚
boy 男の子
girl 女の子
------
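
The line-level case can be sketched the same way by indexing only the
positions at which a line starts. This is an illustration under
assumptions, not mkary's code: build_line_array and prefix_search are
hypothetical names, and the key list is rebuilt on every query for
brevity.

```python
import bisect

def build_line_array(text):
    """Return the start offsets of all lines, sorted by the text that follows."""
    starts = [0] + [i + 1 for i, ch in enumerate(text) if ch == "\n"]
    starts = [s for s in starts if s < len(text)]   # drop the empty tail
    return sorted(starts, key=lambda i: text[i:])

def prefix_search(text, line_array, pattern):
    """Return the sorted offsets of the lines that begin with pattern."""
    m = len(pattern)
    keys = [text[s:s + m] for s in line_array]      # truncated keys stay sorted
    lo = bisect.bisect_left(keys, pattern)
    hi = bisect.bisect_right(keys, pattern)
    return sorted(line_array[lo:hi])

SAMP1 = (
    "YAMASITA Tatuo\n"
    "tatuo-y@cl.aist-nara.ac.jp\n"
    "http://cl.aist-nara.ac.jp/~tatuo-y/\n"
)
lines = build_line_array(SAMP1)
print(len(lines), "index entries for", len(SAMP1), "characters")
print(prefix_search(SAMP1, lines, "YAM"))
print(prefix_search(SAMP1, lines, "aist"))
```

The index holds one entry per line instead of one per character, which
is why a line-level array is smaller and why strings in the middle of
a line cannot be found.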

The program that actually builds an array file is
$SUFARY/mkary/mkary.

Example run

------ building a character-level array file
% mkary /home/tatuo-y/data/ecoli
Save to "/home/tatuo-y/data/ecoli.ary"
Reading text file "/home/tatuo-y/data/ecoli"
++++++++++++++++++++ 1M
++++++++++++++++++++ 2M
++++++++++++++++++++ 3M
++++++++++++++++++++ 4M
++++++++++++
Sorting...
Saving...
Done.
------

------ building a line-level array file
% mkary -l samp2.txt
Save to "samp2.txt.ary"
Reading text file "samp2.txt"
 
Sorting...
Saving...
Done.
------

Let us search with sass ($SUFARY/tools/sass), a simple string-search
program that prints its results line by line.
------
% sass girl samp2.txt
19:0:girl 女の子
% sass boy samp2.txt
8:0:boy 男の子
------



mkary [-c|w|l] [-#] [-q] [-ns] [-so] [-8] [-J] [-o ARRAY_FILE]
	[-b NUM] TEXT_FILE

TEXT_FILE names the text file. By default, an array file named
TEXT_FILE.ary is created.

Options

[-o ARRAY_FILE]

Sets the name of the array file. The default is TEXT_FILE.ary.

[-c]

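Builds a character-level array file. Any string in the text can be
searched for. A Japanese character (EUC) is treated as one character
of two bytes. This is the default.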

[-l]

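Builds a line-level array file. Strings that start at the beginning
of a line can be searched for (prefix search). Suited to dictionary
lookup.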

[-w]

Builds a word-level array file. A word here means a string delimited
by whitespace characters such as spaces or tabs.

[-J]

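When building a character-level array file, ignores everything except
Japanese characters (EUC) and '<'. Strings that start with anything
other than a Japanese (EUC) character can no longer be searched for,
except tags starting with '<', but the array file becomes smaller.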

[-q]

Prints no messages while running.

[-ns]

No Sort: builds an array file that has not been sorted (it cannot be
used for searching).

[-so]

Sort Only: sorts an existing array file, producing an array file that
can be used for searching.

[-#]

Ignores lines that start with "#". Effective only with the [-l] option.

[-8]

Disables Japanese (EUC) character handling. Effective only with the
[-c] option.

[-b NUM]

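Splits the text into blocks, sorts each block, and merges the results
at the end. NUM is the number of blocks. Required when there is not
enough memory.

Example: mkary -b 10 sample.txt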
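
The split, sort, and merge idea behind [-b NUM] can be sketched as
follows. How mkary implements it internally is not described here, so
this only shows the general technique; sort_in_blocks is a hypothetical
name, and a real tool would write each sorted block to disk instead of
keeping all of them in memory.

```python
import heapq

def sort_in_blocks(text, num_blocks):
    """Sort suffix positions block by block, then merge the sorted blocks."""
    n = len(text)
    size = max(1, -(-n // num_blocks))            # ceiling division
    runs = []
    for begin in range(0, n, size):
        block = range(begin, min(begin + size, n))
        runs.append(sorted(block, key=lambda i: text[i:]))
    return list(heapq.merge(*runs, key=lambda i: text[i:]))

text = "YAMASITA Tatuo\n"
print(sort_in_blocks(text, 3))
```

Each block is small enough to sort on its own, and the final merge
yields the same order as sorting everything at once.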


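* Creating the DocID file

Suppose we have a collection such as samp3.txt that contains several
articles (text areas enclosed by <ARTICLE></ARTICLE>). Consider the
search "given a string, find and extract the articles that contain it".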

------ samp3.txt
<ARTICLE>
ǥƥ䥡٤ϣüǳȯ줿ά
ե꡼եȤȤƸάˡ
</ARTICLE>
<ARTICLE>
Ƭμƥ೫ȯؤβˤꡢưʤ
¤졦άˡԤμФʤ줬άˡ
ɲ桹ʹ֤ˤ鲿ؤǤʤȤȤ´롣
</ARTICLE>
------

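With the array file you can find where a given string occurs. Which
article contains it, however, cannot easily be determined from the
array file alone. SUFARY therefore uses a file called the DocID file.
The DocID file stores the positions of the start and end tags of each
article (here <ARTICLE> and </ARTICLE>), which makes text-area search
efficient. For details, see the documents on article extraction on the
SUFARY home page (http://cl.aist-nara.ac.jp/lab/nlt/ss/).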
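
The role of the tag positions can be sketched as follows: once the
start and end offset of every text area is known, the area containing
a match offset is found by binary search. This is an illustration
under assumptions, not the DocID file format; find_areas and area_of
are hypothetical names, and the English sample text stands in for the
articles.

```python
import bisect

def find_areas(text, start_tag, end_tag):
    """Return (start, end) offset pairs of every tagged text area."""
    areas, pos = [], 0
    while True:
        s = text.find(start_tag, pos)
        if s < 0:
            break
        e = text.find(end_tag, s + len(start_tag))
        if e < 0:
            break
        e += len(end_tag)                 # end offset is exclusive
        areas.append((s, e))
        pos = e
    return areas

def area_of(areas, offset):
    """Return the index of the area containing offset, or None."""
    starts = [s for s, _ in areas]
    k = bisect.bisect_right(starts, offset) - 1
    if k >= 0 and offset < areas[k][1]:
        return k
    return None

text = ("<ARTICLE>\nalpha beta\n</ARTICLE>\n"
        "<ARTICLE>\ngamma delta\n</ARTICLE>\n")
areas = find_areas(text, "<ARTICLE>", "</ARTICLE>")
hit = text.index("gamma")                 # offset a string search would return
k = area_of(areas, hit)
print(text[areas[k][0]:areas[k][1]])
```

A string search yields offsets; the stored tag positions turn each
offset into the article that encloses it without rescanning the text.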

Example run

Let us actually build a DocID file. The program that builds it is
$SUFARY/mkdid/mkdid.

First of all, an array file is required, so build one with mkary.
------
% mkary samp3.txt
Save to "samp3.txt.ary"
Reading text file "samp3.txt"

Sorting...
Saving...
Done.
------

Specify the tags that delimit the target text areas (articles) and
build the DocID file. By default the DocID file is named samp3.txt.did.

------
% mkdid '<ARTICLE>' '</ARTICLE>' samp3.txt
Number of Documents = 2
sorting...
writting...
done.
------

Let us search with af ($SUFARY/tools/af), a simple program that
performs text-area search.
------
% af '' samp3.txt samp3.txt.did
FOUND 1
<ARTICLE>
Ƭμƥ೫ȯؤβˤꡢưʤ
¤졦άˡԤμФʤ줬άˡ
ɲ桹ʹ֤ˤ鲿ؤǤʤȤȤ´롣
</ARTICLE>
------

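In practice, there are many formats in which a single marker serves as
both the end of one text area and the start of the next. For example,
the following format: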
------ samp4.txt
#ID-001
ǥƥ䥡٤ϣüǳȯ줿ά
ե꡼եȤȤƸάˡ
#ID-002
Ƭμƥ೫ȯؤβˤꡢưʤ
¤졦άˡɲ桹ʹ֤ˤ鲿ؤǤʤȤ
Ȥ´롣
#ID-003
΢ΤΤ餻壳άˡդä
滲ò
------

In that case, specifying just one tag is enough.
------
% mkdid '#ID-' samp4.txt
Number of Documents = 3
sorting...
writting...
done.
------
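
The single-tag case can be sketched as follows, on the assumption that
each text area runs from one occurrence of the tag to the next, or to
the end of the text for the last one. How mkdid handles an omitted
END_TAG internally is not stated here; find_areas_single_tag is a
hypothetical name and the sample text is an English stand-in.

```python
def find_areas_single_tag(text, tag):
    """Return (start, end) pairs; each area ends where the next one begins."""
    starts = []
    pos = text.find(tag)
    while pos >= 0:
        starts.append(pos)
        pos = text.find(tag, pos + len(tag))
    ends = starts[1:] + [len(text)]
    return list(zip(starts, ends))

text = "#ID-001\nfirst\n#ID-002\nsecond\n#ID-003\nthird\n"
areas = find_areas_single_tag(text, "#ID-")
print("Number of areas =", len(areas))
```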



mkdid [-q] [-o DOCID_FILE] START_TAG [END_TAG] TEXT_FILE 

TEXT_FILE names the text file. An array file named TEXT_FILE.ary is
required. START_TAG and END_TAG specify the tags that enclose a text
area. END_TAG may be omitted.

Options

[-o DOCID_FILE]

Sets the name of the DocID file. The default is TEXT_FILE.did.

[-q]

Prints no messages while running.

