Andante: 2017

Saturday, September 16, 2017

Getting English-to-Japanese translation working with OpenNMT

Install OpenNMT and get English-to-Japanese machine translation running.

Environment

Python 3.5.4
Linux

Steps

$ git clone https://github.com/OpenNMT/OpenNMT-py.git

$ cd OpenNMT-py

$ pip3 install -r requirements.txt

Trying to run it fails with a complaint that torch is missing...

$ python translate.py -model demo-model_epochX_PPL.pt -src data/src-test.txt -output pred.txt -replace_unk -verbose

Traceback (most recent call last):

  File "translate.py", line 7, in <module>

    import torch

ImportError: No module named torch

Installing torch

Take a look at the PyTorch site:

http://pytorch.org

With Python 3.6.*, for example, the install works if you change cp35 in the wheel file name to cp36 yourself.

$ sudo pip3 install http://download.pytorch.org/whl/cu80/torch-0.2.0.post3-cp35-cp35m-manylinux1_x86_64.whl

$ sudo pip3 install torchvision
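
The wheel file name has to match the running interpreter, which is why cp35 becomes cp36 under Python 3.6. As a minimal sketch (the helper name cpython_wheel_tag is hypothetical and not part of pip or PyTorch), the tag for the current interpreter could be derived like this:

```python
import sys


def cpython_wheel_tag(version_info=None):
    """Return the CPython tag used in wheel file names, e.g. 'cp35'."""
    if version_info is None:
        version_info = sys.version_info
    major, minor = version_info[0], version_info[1]
    return "cp{}{}".format(major, minor)


if __name__ == "__main__":
    # Prints the tag for the interpreter running this script.
    print(cpython_wheel_tag())
```

Running it with the same python3 that pip3 belongs to shows which tag to look for in the wheel URL.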

Running the program

The following command runs, but the bundled data is for English-to-German translation.

$ python3 preprocess.py -train_src data/src-train.txt -train_tgt data/tgt-train.txt -valid_src data/src-val.txt -valid_tgt data/tgt-val.txt -save_data data/demo

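Since I want English-to-Japanese, I created a new directory and placed an English-Japanese parallel corpus under it.

I matched the line counts of the English-German data, but note that this is a small amount of data.

*-train.txt has 10,000 lines and *-val.txt has 3,000 lines.

The Japanese corpus was processed with mecab so that words are separated by spaces.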
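
Because the source and target files must line up sentence by sentence, it may be worth checking the corpus before preprocessing. The following is only a sketch; check_parallel is a hypothetical helper, not part of OpenNMT.

```python
def check_parallel(src_lines, tgt_lines):
    """Return the number of sentence pairs, or raise ValueError if misaligned."""
    src = [line.strip() for line in src_lines]
    tgt = [line.strip() for line in tgt_lines]
    if len(src) != len(tgt):
        raise ValueError(
            "line count mismatch: {} vs {}".format(len(src), len(tgt)))
    for number, (s, t) in enumerate(zip(src, tgt), start=1):
        # An empty line on either side usually means the files drifted apart.
        if not s or not t:
            raise ValueError("empty sentence on line {}".format(number))
    return len(src)
```

It accepts any iterable of lines, so two open file objects (for example the src-train.txt and tgt-train.txt in the new directory) can be passed directly.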

Training the model

The following command runs training with the default settings.

$ python3 train.py -data data/demo -save_model demo-model

Training takes a long time, so it is a good idea to run it under nohup.

$ nohup python3 -u train.py -data data/demo -save_model demo-model > log.txt &

Running translation

For the model file, I specified the one that was written last.

You could also write your own src-test.txt.

$ python3 translate.py -model demo-model_acc_26.58_ppl_194.39_e13.pt -src data/src-test.txt -output pred.txt -replace_unk -verbose

Example:
SENT 1: please, don't forget me.
PRED 1: 「それはどうしたんですか？」
PRED SCORE: -16.1014
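
The checkpoint file name used above encodes accuracy, perplexity and epoch, so picking the model that was written last can be automated. This is a sketch that assumes every checkpoint follows the same naming pattern as demo-model_acc_26.58_ppl_194.39_e13.pt; latest_checkpoint is a hypothetical helper.

```python
import re

# Matches names such as demo-model_acc_26.58_ppl_194.39_e13.pt
CHECKPOINT_RE = re.compile(
    r"_acc_(?P<acc>[\d.]+)_ppl_(?P<ppl>[\d.]+)_e(?P<epoch>\d+)\.pt$")


def latest_checkpoint(filenames):
    """Return the file name with the highest epoch number, or None."""
    best_name, best_epoch = None, -1
    for name in filenames:
        match = CHECKPOINT_RE.search(name)
        if match is None:
            continue  # not a checkpoint file
        epoch = int(match.group("epoch"))  # compare numerically: e13 > e2
        if epoch > best_epoch:
            best_name, best_epoch = name, epoch
    return best_name
```

Passing it the result of os.listdir(".") would return the name to give to the -model option.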

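The result is nothing but terrible.
The GitHub repository also says that training on small data gives dreadful results, so next I will try a larger corpus.

Quite a few people seem to be reading this, so there is a follow-up article.
That one tries a larger corpus.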

Friday, August 4, 2017

Setting up Swagger + Node.js

When you build and maintain an API, writing the documentation is a chore.
In that case, Swagger makes life easier.

Installing the packages

$ mkdir swagger_sample

$ cd swagger_sample

$ npm init  # fill in the prompts to create package.json

$ npm install express --save

$ npm install swagger-express --save-dev

Swagger-ui

$ cd ../    # leave the directory created above

$ git clone https://github.com/swagger-api/swagger-ui.git

$ cd swagger-ui

$ mv dist ../swagger_sample

You can use the swagger-ui code AS-IS! No need to build or recompile--just clone this repo and use the pre-built files in the dist folder. If you like swagger-ui as-is, stop here.

That is what the swagger-ui repository says, so I use dist as-is.

Code

app.js

var express = require('express');
var app = express();
var swagger = require('swagger-express');

app.use(swagger.init(app, {
  apiVersion: '1.0',
  swaggerVersion: '1.0',
  swaggerURL: '/docs',           // path of the Swagger page
  swaggerJSON: '/api-docs',      // endpoint that serves the data Swagger displays
  swaggerUI: './dist',           // path where swagger-ui is placed
  basePath: 'http://localhost:3000',
  apis: ['./api.js'],            // file containing the API documentation
  middleware: function(req, res){}
}));

app.listen(3000);

Rewriting the URL

On line 78 of dist/index.html, replace the URL with the path specified above, http://localhost:3000/api-docs.

Wednesday, July 26, 2017

[Preparation]
// Download and extract hts_engine_API
$ wget http://downloads.sourceforge.net/hts-engine/hts_engine_API-1.10.tar.gz
$ tar -zxvf hts_engine_API-1.10.tar.gz

// Install
$ cd hts_engine_API-1.10
$ ./configure
$ make
$ sudo make install
$ cd ..

// Download and extract Sinsy
$ wget http://downloads.sourceforge.net/sinsy/sinsy-0.92.tar.gz
$ tar -zxvf sinsy-0.92.tar.gz

// Install (adjust the paths as needed)
$ cd sinsy-0.92
$ ./configure \
--with-hts-engine-header-path=/usr/local/include \
--with-hts-engine-library-path=/usr/local/lib
$ make
$ sudo make install

[Usage]
// Download the XML
$ wget http://sinsy.sp.nitech.ac.jp/sample/song070_f001_063.xml

// Prepare the htsvoice
Download the HTS voice binary from the Sinsy page.

// Run
$ sinsy -x /usr/local/dic/ -m nitech_jp_song070_f001.htsvoice -o out.wav song070_f001_063.xml

The htsvoice file has to be one trained on singing voice; otherwise the output comes out wildly wrong.

Tuesday, June 20, 2017

Deep learning and natural language processing

Articles that look useful for studying machine translation with deep learning.

Saturday, May 20, 2017

Operation not permitted

When I tried to install the auth module for Python, I got the following error.
error: could not create '/System/Library/Frameworks/Python.framework/Versions/2.7/include/python2.7/greenlet': Operation not permitted

Various things seem to have changed since Mac OS X 10.11.
The cause is System Integrity Protection (SIP), which stops even root from writing under /System.

$ sudo pip install --user auth

solved it.
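
The --user option helps because it installs into the per-user site-packages directory instead of the protected system location. As a rough sketch, the check below uses a simplified, assumed list of SIP-protected prefixes (the real list is longer), and is_sip_protected is a hypothetical helper.

```python
import site

# Simplified, assumed list of SIP-protected prefixes; /usr/local is exempt.
SIP_PROTECTED = ("/System/", "/usr/", "/bin/", "/sbin/")
SIP_EXEMPT = ("/usr/local/",)


def is_sip_protected(path):
    """Return True if path falls under one of the assumed protected prefixes."""
    if path.startswith(SIP_EXEMPT):
        return False
    return path.startswith(SIP_PROTECTED)


if __name__ == "__main__":
    # Where `pip install --user` puts packages for this interpreter.
    user_site = site.getusersitepackages()
    print(user_site, is_sip_protected(user_site))
```

The path from the error message above is reported as protected, while the per-user directory normally is not.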

Saturday, April 29, 2017

input-buffer overflow. The line is split. use -b #SIZE option.

Trying to word-segment a large file such as jawiki produces an error like the following:
input-buffer overflow. The line is split. use -b #SIZE option.

Adding the -b option resolves it.
$ mecab -d /usr/local/lib/mecab/dic/mecab-ipadic-neologd/ -Owakati -b 781361152 jawiki.txt > jawiki_wakachi.txt

I checked the size to specify with the wc command (the third number is the byte count of the file).
$ wc jawiki.txt
2351310 3832483 781361152 jawiki.txt
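
As far as I understand, the -b buffer only has to hold the longest single line, so the whole-file byte count is a generous upper bound. A minimal sketch for measuring the longest line instead (longest_line_bytes is a hypothetical helper):

```python
def longest_line_bytes(lines):
    """Return the length in bytes of the longest line, newline included."""
    return max((len(line) for line in lines), default=0)


if __name__ == "__main__":
    # Open in binary mode so that lengths are counted in bytes, not characters.
    with open("jawiki.txt", "rb") as f:
        print(longest_line_bytes(f))
```

A -b value somewhat larger than the printed number should then be enough, under that assumption.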

Thursday, January 26, 2017

Building a Japanese Wikipedia corpus

Download the Wikipedia dump
$ curl https://dumps.wikimedia.org/jawiki/latest/jawiki-latest-pages-articles.xml.bz2 -o jawiki-latest-pages-articles.xml.bz2

The dump is in XML format, so the text has to be extracted.
Someone has kindly provided a program for that, so I use it.
$ git clone https://github.com/attardi/wikiextractor

Run it
$ python wikiextractor/WikiExtractor.py jawiki-latest-pages-articles.xml.bz2

The extracted text is split across folders, so use cat to merge it into a single file.
$ cat text/*/* > jawiki_org.txt

The text contains <doc ...> tags and blank lines, so trim them.
$ sed -e 's/<[^>]*>//g' -e '/^ *$/d' ./jawiki_org.txt > ./jawiki.txt
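
The same trimming can be sketched in Python. This version assumes tags never span multiple lines and removes one tag at a time with the pattern <[^>]*>, so text between two tags on the same line is kept; clean_lines is a hypothetical helper.

```python
import re

# One tag at a time: '<', then anything except '>', then '>'.
TAG_RE = re.compile(r"<[^>]*>")


def clean_lines(lines):
    """Yield lines with tags removed, skipping lines that end up blank."""
    for line in lines:
        text = TAG_RE.sub("", line.rstrip("\n"))
        if text.strip(" ") == "":
            continue  # empty or spaces only, like sed '/^ *$/d'
        yield text
```

To process the merged file, iterate over an open file object and write each yielded line followed by a newline.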

If needed, segment the text into words with mecab.
$ mecab -d /usr/local/lib/mecab/dic/mecab-ipadic-neologd/ -Owakati jawiki.txt > jawiki_wakachi.txt

That completes the process.

Sunday, January 22, 2017

ImportError: No module named queue

An error I hit when trying to use word2vec from Python.

$ python word2vec.py 

Traceback (most recent call last):

  File "word2vec.py", line 3, in <module>

    from gensim.models import word2vec

  File "/Library/Python/2.7/site-packages/gensim/__init__.py", line 6, in <module>

    from gensim import parsing, matutils, interfaces, corpora, models, similarities, summarization

  File "/Library/Python/2.7/site-packages/gensim/summarization/__init__.py", line 4, in <module>

    from .keywords import keywords

  File "/Library/Python/2.7/site-packages/gensim/summarization/keywords.py", line 13, in <module>

    from six.moves.queue import Queue as _Queue

ImportError: No module named queue

The cause was that the six files were installed twice, in the following two locations:

1. /System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/

2. /Library/Python/2.7/site-packages

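The underlying cause was that the six version was too old.
Upgrading with pip updates the files under path 2.
But when the program actually runs, it looks in path 1.
So it loaded the old file, which produced the error.

As a workaround, I deleted six.py and six.pyc from path 1, and it started working.
Changing the search path or something similar would really be better than deleting files...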
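
To see which copy of a module the interpreter would actually load, its file location can be inspected. The sketch below uses importlib.util.find_spec, which needs Python 3.4 or later; on the Python 2.7 from this post the equivalent would be importing six and looking at six.__file__. module_origin is a hypothetical helper.

```python
import importlib.util


def module_origin(name):
    """Return the file a top-level module would be loaded from, or None."""
    spec = importlib.util.find_spec(name)
    return None if spec is None else spec.origin


if __name__ == "__main__":
    # Shows which of the duplicate locations wins on sys.path.
    print(module_origin("six"))
```

If the printed path points at the older copy, reordering sys.path (or PYTHONPATH) is a less destructive option than deleting files.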

ValueError: numpy.dtype has the wrong size

An error that appeared when trying to use word2vec from Python.

$ python word2vec.py 

Traceback (most recent call last):

  File "word2vec.py", line 3, in <module>

    from gensim.models import word2vec

  File "/Library/Python/2.7/site-packages/gensim/__init__.py", line 6, in <module>

    from gensim import parsing, matutils, interfaces, corpora, models, similarities, summarization

  File "/Library/Python/2.7/site-packages/gensim/models/__init__.py", line 14, in <module>

    from .word2vec import Word2Vec

  File "/Library/Python/2.7/site-packages/gensim/models/word2vec.py", line 108, in <module>

    from gensim.models.word2vec_inner import train_batch_sg, train_batch_cbow

  File "__init__.pxd", line 155, in init gensim.models.word2vec_inner (./gensim/models/word2vec_inner.c:10913)

ValueError: numpy.dtype has the wrong size, try recompiling. Expected 88, got 96

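The message means gensim's compiled extension was built against a different numpy binary layout than the numpy being loaded.
Apparently this happens when numpy was installed with pip.
Reinstalling it with easy_install made it work.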

$ pip uninstall numpy

$ easy_install numpy

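Installing mecab-ipadic-neologd

mecab is the morphological analyzer that everyone knows.
With mecab-ipadic-neologd, it can also analyze new words collected from the web.
This assumes mecab is already installed and starts from registering the dictionary.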

$ git clone --depth 1 https://github.com/neologd/mecab-ipadic-neologd.git

$ cd mecab-ipadic-neologd/

$ ./bin/install-mecab-ipadic-neologd

[install-mecab-ipadic-NEologd] : Start..

[install-mecab-ipadic-NEologd] : Check the existance of libraries

[install-mecab-ipadic-NEologd] :     find => ok

[install-mecab-ipadic-NEologd] :     sort => ok

[install-mecab-ipadic-NEologd] :     head => ok

[install-mecab-ipadic-NEologd] :     cut => ok

[install-mecab-ipadic-NEologd] :     egrep => ok

[install-mecab-ipadic-NEologd] :     mecab => ok

[install-mecab-ipadic-NEologd] :     mecab-config => ok

[install-mecab-ipadic-NEologd] :     make => ok

[install-mecab-ipadic-NEologd] :     curl => ok

[install-mecab-ipadic-NEologd] :     sed => ok

[install-mecab-ipadic-NEologd] :     cat => ok

[install-mecab-ipadic-NEologd] :     diff => ok

[install-mecab-ipadic-NEologd] :     tar => ok

[install-mecab-ipadic-NEologd] :     unxz is not found.

If it reports not found, install xz.

$ brew install xz

$ ./bin/install-mecab-ipadic-neologd

When it runs properly, it asks yes or no partway through.

[install-mecab-ipadic-NEologd] : Do you want to install mecab-ipadic-NEologd? Type yes or no.

yes

After answering yes, enter your password.

default system dictionary   | mecab-ipadic-NEologd

エン ケン      | エンケン 

マツコ 会議      | マツコ会議 

ど根性 ガエル の 娘     | ど根性ガエル の 娘 

スマ ステ      | スマステ 

山下 美月      | 山下美月 

蒼 井 翔太      | 蒼井翔太 

ワッキー 貝山      | ワッキー貝山 

亜 人 ちゃん      | 亜人 ちゃん 

男 水       | 男水 

ハミルトン 島      | ハミルトン島 

島田 八 段      | 島田 八段 

天守 物語      | 天守物語 

ひだ まり スケッチ     | ひだまりスケッチ 

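Output like the above is printed as an example.
Installation complete.

To actually use it, run the following command with the dictionary path.
The word-segmentation option works fine too.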

$ mecab -d /usr/local/lib/mecab/dic/mecab-ipadic-neologd/ -Owakati
