Virtuoso

We run instances of virtuoso for our graph databases. Virtuoso is remarkable software and runs some really large databases, including UniProt. It can sometimes feel old and clunky, but we still prefer it to the shiny new alternatives because it is the only large graph database not written in Java, and Java packages are almost impossible to package in Guix.

Running virtuoso

Running virtuoso in a guix system container

We have a Guix virtuoso service in the guix-bioinformatics channel. The easiest way to run virtuoso is to use this service to run it in a guix system container. The only downside of this method is that guix system containers require root privileges to start up, so you will need root privileges on the machine you are running this on.

Here is a basic guix system configuration that runs virtuoso listening on port 8891, and with its HTTP server listening on port 8892. Among other things, the HTTP server provides a SPARQL endpoint to interact with.

(use-modules (gnu)
             (gn services databases))

(operating-system
  (host-name "virtuoso")
  (timezone "UTC")
  (locale "en_US.utf8")
  (bootloader (bootloader-configuration
               (bootloader grub-bootloader)
               ;; It doesn't matter what "/dev/sdX" is
               (targets (list "/dev/sdX"))))
  (file-systems %base-file-systems)
  (users %base-user-accounts)
  (packages %base-packages)
  (services (cons (service virtuoso-service-type
                           (virtuoso-configuration
                            (server-port 8891)
                            (http-server-port 8892)))
                  %base-services)))

You can write the above configuration to a file, say virtuoso-os.scm, build a container with it, and run it with the command below. Everything inside the container is ephemeral and vanishes when the container is stopped. In order to persist the database, we mount a host directory /tmp/virtuoso-state at /var/lib/virtuoso in the container. /var/lib/virtuoso is the default state directory used by the Guix virtuoso service.

sudo $(guix system container --network --share=/tmp/virtuoso-state=/var/lib/virtuoso virtuoso-os.scm)

When running the above command, you will be given the container's PID. Should you want to inspect the container, you can run:

sudo nsenter -at PID /run/current-system/profile/bin/bash

If you have only one shepherd process running on your system, you may use the following quick hack to get the PID.

sudo nsenter -at $(pgrep shepherd) /run/current-system/profile/bin/bash

Also note that, in this set-up, the Conductor web interface is not supported by the Guix service that is part of guix-bioinformatics. It isn't required for using virtuoso as a SPARQL server and only adds to the confusion.

Running virtuoso by invoking it on the command line

You may also choose to run virtuoso the traditional way by invoking it on the command line. Managing long-running instances started from the command line is messy. So, this method works best for temporary instances.

First, we create a new directory for virtuoso and change into it. We will run virtuoso from this directory, and virtuoso will store all its state in this directory.

mkdir virtuoso
cd virtuoso

Then, we create a configuration file, virtuoso.ini. A basic configuration need only specify the ports to listen on. Here we specify port 8891 for the virtuoso server and port 8892 for the HTTP server that includes the SPARQL endpoint.

[Parameters]
ServerPort = localhost:8891

[HTTPServer]
ServerPort = localhost:8892

Finally, we start virtuoso.

virtuoso-t +foreground +configfile virtuoso.ini
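If you start temporary instances often, the minimal configuration file above can be generated rather than typed by hand. The following is only a sketch in Python: the function name make_virtuoso_ini is hypothetical, and it writes nothing beyond the two ServerPort settings shown above.

```python
import configparser
import io


def make_virtuoso_ini(server_port, http_port, host="localhost"):
    """Return the text of a minimal virtuoso.ini with the two listening ports."""
    config = configparser.ConfigParser()
    # Keep key names such as ServerPort exactly as written; configparser
    # lowercases option names by default.
    config.optionxform = str
    config["Parameters"] = {"ServerPort": f"{host}:{server_port}"}
    config["HTTPServer"] = {"ServerPort": f"{host}:{http_port}"}
    buffer = io.StringIO()
    config.write(buffer)
    return buffer.getvalue()


if __name__ == "__main__":
    # Redirect the output to virtuoso.ini in the directory created earlier.
    print(make_virtuoso_ini(8891, 8892), end="")
```

For a real deployment you will want more settings than these two, so treat this as a starting point only.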

Detailed documentation of the virtuoso configuration file format is at

In particular, consider setting NumberOfBuffers and MaxDirtyBuffers as described at

For a working configuration file, you can also look at /export/virtuoso/var/lib/virtuoso/db/virtuoso.ini on penguin2.

Running SPARQL Queries using isql

The most straightforward way of running SPARQL queries is through the web interface:

To use a CLI tool, you can utilise isql by running:

guix shell virtuoso-ose -- isql -U dba -P password <server-port>

Queries within isql look like:

SQL> SPARQL SELECT * WHERE {?s ?p ?o};
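Queries can also be sent to the SPARQL endpoint over HTTP, without isql. The sketch below uses only the Python standard library. It assumes the endpoint is reachable at http://localhost:8892/sparql (the HTTP server port from the examples above plus the /sparql path) and that it honours a request for JSON results. The function names are hypothetical.

```python
import json
import urllib.parse
import urllib.request


def sparql_request(endpoint, query):
    """Build a GET request for a SPARQL SELECT query, asking for JSON results."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    return urllib.request.Request(
        url, headers={"Accept": "application/sparql-results+json"})


def run_select(endpoint, query):
    """Send the query and return the list of result bindings."""
    with urllib.request.urlopen(sparql_request(endpoint, query)) as response:
        return json.load(response)["results"]["bindings"]


if __name__ == "__main__":
    # Assumed endpoint URL; adjust the host, port and path to your instance.
    for row in run_select("http://localhost:8892/sparql",
                          "SELECT * WHERE {?s ?p ?o} LIMIT 5"):
        print(row)
```

Note that the query is passed without the leading SPARQL keyword, which is only needed inside isql.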

Set passwords for virtuoso users

After running virtuoso, you will want to change the default password of the `dba` user. The default password of the `dba` user is `dba`. You can change passwords using the isql command-line client. See

In a typical production virtuoso installation, you will want to change the password of the dba user and disable the dav user. Here are the commands to do so. Pay attention to the single versus double quoting.

SQL> set password "dba" "new-password";
SQL> UPDATE ws.ws.sys_dav_user SET u_account_disabled=1 WHERE u_name='dav';
SQL> CHECKPOINT;

Loading data into virtuoso

Virtuoso supports at least three different ways to load RDF.

Bulk loading using the isql command-line client

Bulk loading using the isql command-line client is usually the fastest. But, it requires correct handling of file system permissions, and cannot work on remote servers.

SPARQL 1.1 Update

The standard SPARQL protocol also allows updating RDF.

SPARQL 1.1 Graph Store HTTP Protocol

For ease of implementation, SPARQL 1.1 also specifies an additional REST-like API to update data.

The virtuoso documentation shows examples of using this protocol with cURL.

We recap the same here.

When uploading data, the virtuoso server often does not report errors properly. It simply freezes up. So, it is very helpful to validate your RDF before uploading. For this, use rapper from the raptor2 package. To validate data.ttl, a turtle file, run

rapper --input turtle --count data.ttl
rapper: Parsing URI file: data.ttl with parser turtle
rapper: Parsing returned 652395 triples
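When validation is part of a script, the triple count can be picked out of rapper's output automatically. The following Python sketch wraps the command shown above. The function names are hypothetical, and it assumes that rapper exits with a non-zero status when parsing fails. Both output streams are searched, since the summary line may be written to either.

```python
import re
import subprocess


def parse_triple_count(rapper_output):
    """Extract N from a 'Parsing returned N triples' line, or return None."""
    match = re.search(r"Parsing returned (\d+) triples?", rapper_output)
    return int(match.group(1)) if match else None


def validate_turtle(path):
    """Run rapper on a turtle file; return the triple count, or None on failure."""
    result = subprocess.run(
        ["rapper", "--input", "turtle", "--count", path],
        capture_output=True, text=True)
    if result.returncode != 0:
        return None
    return parse_triple_count(result.stdout + result.stderr)


if __name__ == "__main__":
    count = validate_turtle("data.ttl")
    print("invalid or unreadable" if count is None else f"{count} triples")
```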

Then, upload it to a virtuoso SPARQL endpoint running at port 8892

curl -v -X PUT --digest -u 'dba:password' -T data.ttl -G http://localhost:8892/sparql-graph-crud-auth --data-urlencode graph=http://genenetwork.org

where http://genenetwork.org is the name of the graph. Single quoting the 'dba:password' argument keeps the shell from interpreting any special characters in the password.

The PUT method deletes the existing data in the graph before loading the new data, so there is usually no need to delete old data manually first. To add to the existing data in the graph instead of replacing it, use the POST method. However, virtuoso is slow at deleting millions of triples, resulting in an apparent freeze-up. For large graphs, it is therefore preferable to delete the old data manually using a lower-level SQL statement issued via the isql client.

Start isql with something like

guix shell --expose=verified-data=/var/lib/data virtuoso-ose -- isql -U dba -P password 8981

To delete a graph:

$ isql
SQL> DELETE FROM rdf_quad WHERE g = iri_to_id('http://genenetwork.org');

To add ttl files through isql:

ld_dir('/dir', '*.ttl', 'http://genenetwork.org');
rdf_loader_run();
checkpoint;

When virtuoso has just been started up with a clean state (that is, the virtuoso state directory was empty before virtuoso started), uploading large amounts of data using the SPARQL 1.1 Graph Store HTTP Protocol fails the first time. It succeeds only the second time. It is not clear why. I can only recommend retrying as in this commit:

Retry uploading to virtuoso (commit from dump-genenetwork-database repo) formerly (https://git.genenetwork.org/arunisaac/dump-genenetwork-database/commit/8f60fde7f5499e5ffe352d7ae98a2de34a91b89f)
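The retry idea can be sketched in a few lines of Python. This is not the code from the commit above, only an illustration of the same approach using the standard library. The function names retry and put_graph are hypothetical, the endpoint URL is taken from the cURL example, and the text/turtle content type and the default of two attempts are assumptions.

```python
import time
import urllib.parse
import urllib.request


def retry(action, attempts=2, delay=1.0):
    """Call action() until it succeeds, at most `attempts` times."""
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except OSError:  # urllib's URLError and HTTPError are OSErrors
            if attempt == attempts:
                raise
            time.sleep(delay)


def put_graph(path, graph, user, password,
              endpoint="http://localhost:8892/sparql-graph-crud-auth"):
    """PUT a turtle file into the named graph using HTTP digest authentication."""
    url = endpoint + "?" + urllib.parse.urlencode({"graph": graph})
    passwords = urllib.request.HTTPPasswordMgrWithDefaultRealm()
    passwords.add_password(None, endpoint, user, password)
    opener = urllib.request.build_opener(
        urllib.request.HTTPDigestAuthHandler(passwords))
    with open(path, "rb") as turtle_file:
        data = turtle_file.read()
    request = urllib.request.Request(
        url, data=data, method="PUT",
        headers={"Content-Type": "text/turtle"})  # assumed media type
    with opener.open(request) as response:
        return response.status


if __name__ == "__main__":
    status = retry(lambda: put_graph("data.ttl", "http://genenetwork.org",
                                     "dba", "password"))
    print("upload finished with HTTP status", status)
```

Because the upload uses PUT, repeating it replaces the graph rather than duplicating triples, which is what makes a blind retry safe here.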

Using load-rdf.scm script

You can use the following script to upload RDF data.

This script first clears the database before uploading data. To run it:

guix shell -N virtuoso-ose -m manifest.scm -- ./pre-inst-env ./load-rdf.scm conn.scm dump.ttl

Bulk Loading Data

Virtuoso has access to the directory /export/data/genenetwork-virtuoso/, so place all the turtle files for bulk uploads there. To bulk load data:

First make sure that all the data is deleted:

$ isql
SQL> DELETE FROM rdf_quad WHERE g = iri_to_id('http://genenetwork.org');

Also, make sure that the load list is empty before registering your turtle files.

DELETE FROM DB.DBA.load_list;

Note that the directory may be mapped to a different location by the service. On tux02 it is `/export/data/genenetwork-virtuoso/`.

Use isql to register all the turtle files:

SQL> ld_dir('/var/lib/data', '*.ttl', 'http://genenetwork.org');

Note, for the prior step, you can specify a specific file instead of adding all the files using the wildcard "*". Here's an example of doing this:

SQL> ld_dir('/var/lib/data', 'species.ttl', 'http://genenetwork.org');

Check the table DB.DBA.load_list to see the list of registered files that will be loaded:

SQL> SELECT * FROM DB.DBA.load_list;

Complete the actual bulk load of all data by running:

SQL> rdf_loader_run();

Commit the bulk loaded data to the Virtuoso database file by running:

checkpoint;
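Since these steps are always run in the same order, it can help to generate them as one script. The Python sketch below only builds the text of the statements listed above; it does not talk to virtuoso. The function names are hypothetical. The resulting text can be pasted into an isql session, or fed to isql on standard input if your isql accepts that.

```python
def sql_string(value):
    """Quote a Python string as an SQL string literal, doubling single quotes."""
    return "'" + value.replace("'", "''") + "'"


def bulk_load_script(directory, pattern, graph):
    """Return the isql statements for the bulk load steps, one per line."""
    return "\n".join([
        # Clear the old graph and any stale entries in the load list.
        f"DELETE FROM rdf_quad WHERE g = iri_to_id({sql_string(graph)});",
        "DELETE FROM DB.DBA.load_list;",
        # Register the files, load them, and commit to the database file.
        f"ld_dir({sql_string(directory)}, {sql_string(pattern)}, "
        f"{sql_string(graph)});",
        "rdf_loader_run();",
        "checkpoint;",
    ]) + "\n"


if __name__ == "__main__":
    print(bulk_load_script("/var/lib/data", "*.ttl", "http://genenetwork.org"),
          end="")
```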

Run a query to make sure that the data has indeed been loaded. For example:

SPARQL
PREFIX gn: <http://genenetwork.org/id/>

SELECT * FROM <http://genenetwork.org> WHERE {
gn:Mus_musculus ?p ?o.
};

In case you want to get a list of all graphs:

SPARQL
SELECT  DISTINCT ?g
   WHERE  { GRAPH ?g {?s ?p ?o} }
ORDER BY ?g;

Other resources:

Dumping to RDF from the GeneNetwork MySQL database

See also

To dump data into a ttl file, first make sure that you are in the guix environment of the "dump-genenetwork-database" repository.

See the README for instructions.
