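Last year, in the entry linked below,
I wrote a plugin that imports data retrieved with a SPARQL query.
This time I wrote a plugin for the opposite direction of data flow:
it outputs data for a SPARQL endpoint from data you already have on hand.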

Introducing embulk-input-sparql, an Embulk plugin that takes data retrieved via SPARQL as input
https://takemikami.com/2020/10/17/SPARQLEmbulkembulkinputsparql.html

The embulk-formatter-turtle plugin introduced in this entry
outputs the data that Embulk loads from a source in Turtle format.
By loading the resulting Turtle-format RDF data into a server such as Fuseki,
you can publish it as LOD (Linked Open Data).
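
For the loading step, one option is to send the file to the server's SPARQL Graph Store Protocol endpoint. The following is a minimal Python sketch under several assumptions that are not from this entry: a Fuseki instance running locally on port 3030, a dataset named ds, and Turtle files named output*.ttl (the names produced by the configuration shown later in this entry). The function name build_upload_request is hypothetical.

```python
import glob
import urllib.request


def build_upload_request(turtle_bytes: bytes, graph_store_url: str) -> urllib.request.Request:
    """Build an HTTP POST that adds Turtle data to a graph store endpoint."""
    return urllib.request.Request(
        graph_store_url,
        data=turtle_bytes,
        headers={"Content-Type": "text/turtle; charset=utf-8"},
        method="POST",
    )


if __name__ == "__main__":
    # Assumption: local Fuseki, dataset "ds", default graph as the target.
    url = "http://localhost:3030/ds/data?default"
    for path in sorted(glob.glob("output*.ttl")):
        with open(path, "rb") as f:
            request = build_upload_request(f.read(), url)
        with urllib.request.urlopen(request) as response:
            print(path, response.status)
```

Separating request construction from sending keeps the sketch easy to check without a running server. Adjust the URL to match your own server and dataset.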

embulk-formatter-turtle | GitHub
https://github.com/takemikami/embulk-formatter-turtle

I think the benefit of using Embulk with this plugin is that
the flow of publishing data stored in various places and formats
as LOD can be automated as a batch process.

Usage

Let's walk through how to use embulk-formatter-turtle.

This entry covers the steps up to
outputting Embulk's example data in Turtle format.

Setting up Embulk

First, set up Embulk
by following the QuickStart on embulk.org ( https://www.embulk.org/ ).

curl --create-dirs -o ~/.embulk/bin/embulk -L "https://dl.embulk.org/embulk-latest.jar"
chmod +x ~/.embulk/bin/embulk
echo 'export PATH="$HOME/.embulk/bin:$PATH"' >> ~/.bashrc
source ~/.bashrc

As of this writing (December 2021), the latest stable release (Embulk 0.9.23)
has to run on Java 1.8, so set JAVA_HOME to a Java 1.8 installation.

Running the Embulk example

This step also follows the QuickStart on embulk.org.
Run the example as follows.

embulk example ./try1
embulk guess ./try1/seed.yml -o config.yml
embulk preview config.yml
embulk run config.yml

When it runs, data like the following is written to standard output.

1,32864,2015-01-27 19:23:49,20150127,embulk
2,14824,2015-01-27 19:01:23,20150127,embulk jruby
3,27559,2015-01-28 02:20:02,20150128,Embulk "csv" parser plugin
4,11270,2015-01-29 11:54:36,20150129,

Once you have confirmed this,
the next step is to change the configuration so that this data is
output in Turtle format using embulk-formatter-turtle.

Installing embulk-formatter-turtle

Install the embulk-formatter-turtle plugin as follows.

embulk gem install embulk-formatter-turtle

Output in Turtle format

In the example's config.yml,
the output destination is standard output, as shown below.

Excerpt from config.yml

out: {type: stdout}

Edit config.yml so that
the output destination is a file and the format is turtle, as follows.

Excerpt from config.yml (after the change)

out: 
  type: file
  path_prefix: "./output"
  file_ext: ttl
  formatter:
    type: turtle
    base: http://example.com/ttl/
    subject_column: 'id'
    columns:
    - {name: 'account', predicate: 'http://example.com/ttl/type#account'}
    - {name: 'time', predicate: 'http://example.com/ttl/type#time'}
    - {name: 'purchase', predicate: 'http://example.com/ttl/type#purchase'}
    - {name: 'comment', predicate: 'http://example.com/ttl/type#comment'}

After making the change, run Embulk again.

embulk run config.yml

After the run, you can confirm as follows
that a Turtle-format file has been output.

$ cat output*.ttl
@base          <http://example.com/ttl/> .
<1>     <type#account>  "32864" ;
        <type#comment>  "embulk" ;
        <type#purchase>  "2015-01-27 00:00:00 UTC" ;
        <type#time>  "2015-01-27 19:23:49 UTC" .

<2>     <type#account>  "14824" ;
        <type#comment>  "embulk jruby" ;
        <type#purchase>  "2015-01-27 00:00:00 UTC" ;
        <type#time>  "2015-01-27 19:01:23 UTC" .

<3>     <type#account>  "27559" ;
        <type#comment>  "Embulk \"csv\" parser plugin" ;
        <type#purchase>  "2015-01-28 00:00:00 UTC" ;
        <type#time>  "2015-01-28 02:20:02 UTC" .

<4>     <type#account>  "11270" ;
        <type#purchase>  "2015-01-29 00:00:00 UTC" ;
        <type#time>  "2015-01-29 11:54:36 UTC" .

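The example data is tabular, and the plugin converts it into RDF data in which
the value of the id column is the subject, and
the values of the account, time, purchase, and comment columns are the objects.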
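
The row-to-triple mapping described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the plugin's actual implementation: the function names rows_to_turtle and escape_literal are hypothetical, and the handling of empty cells and the whitespace layout are assumptions modeled on the sample output.

```python
def escape_literal(value: str) -> str:
    """Escape characters that cannot appear raw in a short Turtle string literal."""
    return (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def rows_to_turtle(rows, base, subject_column, columns):
    """Render each row as one subject with one predicate/object pair per non-empty column."""
    lines = [f"@base <{base}> ."]
    for row in rows:
        pairs = []
        for col in columns:
            value = row.get(col["name"])
            if value is None or value == "":
                continue  # assumption: empty cells produce no triple
            predicate = col["predicate"]
            if predicate.startswith(base):
                predicate = predicate[len(base):]  # write it relative to @base
            pairs.append(f'<{predicate}> "{escape_literal(str(value))}"')
        if pairs:
            lines.append(f"<{row[subject_column]}> " + " ;\n    ".join(pairs) + " .")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    sample_rows = [
        {"id": 3, "account": 27559, "comment": 'Embulk "csv" parser plugin'},
        {"id": 4, "account": 11270, "comment": ""},
    ]
    sample_columns = [
        {"name": "account", "predicate": "http://example.com/ttl/type#account"},
        {"name": "comment", "predicate": "http://example.com/ttl/type#comment"},
    ]
    print(rows_to_turtle(sample_rows, "http://example.com/ttl/", "id", sample_columns))
```

Each row yields a single subject, and every configured column with a value contributes one predicate/object pair to it.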

The formatter options are as follows.
subject_column specifies the column to use as the subject.
columns specifies the columns to use as objects and the predicate for each.
base specifies the base URI of this RDF.
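
To see how base ties the full predicate URIs in config.yml to the relative IRIs in the output file, the relative forms can be resolved against the base URI. The sketch below uses Python's standard urljoin for this; the helper name resolve is hypothetical, and the base value is the one from the config excerpt above.

```python
from urllib.parse import urljoin

BASE = "http://example.com/ttl/"  # value of the formatter's base option


def resolve(relative_iri: str, base: str = BASE) -> str:
    """Resolve a relative IRI from the Turtle output against the @base IRI."""
    return urljoin(base, relative_iri)


if __name__ == "__main__":
    for iri in ["1", "type#account", "type#time"]:
        print(f"<{iri}> -> <{resolve(iri)}>")
```

Resolved this way, the subject of the first row is http://example.com/ttl/1 and its account predicate is http://example.com/ttl/type#account, which matches the predicate given in the columns option.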