Sunday 1 December 2013

Final reporting for ANDS and technical reflections -- Part 2.

Continuing our technical reflections on this project, I will wrap up with some speculation on how we think the infrastructure will evolve over the next 12 months.

One of the main features of the project has been how it has extended the pre-existing Founders and Survivors (FAS) data model with genealogical relationships and with a hierarchical way of documenting sources. Yggdrasil's hierarchical sources model dovetails perfectly with XML modes of representation, and we were able to leverage this in generating XML data-driven workflows for populating the "branch" levels of Yggdrasil source trees (a sketch of the idea follows below). However, Yggdrasil simply provides a blob of text, to be used as required, for each source. Our general experience with the research domains to which AP20 has been applied (Convicts, Diggers, Koori Health) strongly leads us to believe that we need to do everything possible to get away from raw "text", either in web forms or spreadsheet cells, as a mode of data capture. This is of course an exceedingly difficult problem to solve without a lot of custom programming.
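To make the XML data-driven workflow idea concrete, here is a minimal sketch (not the actual AP20 code) of walking a hypothetical hierarchical source description to generate the "branch" levels of a source tree; the element names and the output format are illustrative assumptions only:

    <?php
    // A minimal sketch, not the actual AP20 code: walk a hypothetical
    // hierarchical source description and emit one entry per "branch"
    // level of a Yggdrasil-style source tree. Element names are invented.
    $sourceTree = new SimpleXMLElement(
        '<series title="Convict Conduct Registers" repository="TAHO">
           <volume title="CON31/1/1">
             <piece title="pp. 1-20"/>
             <piece title="pp. 21-40"/>
           </volume>
         </series>'
    );

    function emitBranches(SimpleXMLElement $node, array $path = []) {
        $path[] = (string) $node['title'];
        // In a real workflow this would become an insert into the sources
        // hierarchy; here we simply print the branch path.
        echo implode(' > ', $path), "\n";
        foreach ($node->children() as $child) {
            emitBranches($child, $path);
        }
    }

    emitBranches($sourceTree);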

Towards the end of the project we were able to attend the International Semantic Web Conference in Sydney:

   http://iswc2013.semanticweb.org/content/program-friday

A number of papers and posters at that conference were extremely relevant to providing practical ways forward for solving this and other problems. Of particular note for our needs were:
  • ActiveRaul, which automatically generates a web-based editing interface from an ontology: http://iswc2013.semanticweb.org/content/demos/30
  • PROV-O, an ontology for describing provenance (http://www.w3.org/TR/prov-o/). Provenance is information about the entities, activities, and people involved in producing a piece of data or thing, which can be used to form assessments about its quality, reliability or trustworthiness. PROV-DM is the conceptual data model that forms the basis for the W3C provenance (PROV) family of specifications. See explanatory material here: https://wiki.duraspace.org/display/VIVO/Prov-O+Ontology
  • the collaborative redevelopment of ICD-11 (the International Classification of Diseases, 11th revision) using WebProtégé: http://link.springer.com/chapter/10.1007%2F978-3-642-16438-5_6#page-1
We are hopeful that ActiveRaul could offer a workable way to provide editing services for ontology-based data fragments, such as the specific research data capture needs currently handled by the existing Drupal data entry form of Prof. McCalman's ships research project.

At the conference we also encountered many successful domain-specific examples where semantic technologies had been used to interlink, search and build innovative services across disparate sources of data. We believe this approach is a fertile way forward for solving a specific problem of ours: better sharing and exchanging of data with collaborators such as the Tasmanian Archives and Heritage Office and the Female Convicts Research Collective in Hobart. It is entirely feasible to see how an overarching ontology for prosopography, customised for the convict system, would enable each group to publish RDF in accordance with that ontology and to expose a SPARQL endpoint, enabling federated search across the multiple databases (a sketch of the idea follows below). We believe this approach can help us solve problems of collaborative matching and data exchange whilst enabling each party to continue with the data management practices which best suit their own needs.
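As a purely illustrative sketch of how such federated access could look from our PHP layer, the fragment below sends a SPARQL query to a hypothetical collaborator endpoint and reads the standard SPARQL XML results format; the endpoint URL, the "conv:" prefix and the property names are assumptions, not an agreed vocabulary:

    <?php
    // Illustrative only: query a hypothetical collaborator's SPARQL endpoint
    // and read the standard SPARQL XML results format. The endpoint URL,
    // "conv:" prefix and property names are assumptions, not a real vocabulary.
    $endpoint = 'https://example.org/convicts/sparql';
    $query = 'PREFIX conv: <http://example.org/ns/convict#>
              SELECT ?person ?name ?ship WHERE {
                ?person conv:name ?name ;
                        conv:arrivedOn ?ship .
              } LIMIT 10';

    $ch = curl_init($endpoint . '?query=' . urlencode($query));
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_HTTPHEADER, array('Accept: application/sparql-results+xml'));
    $resultXml = curl_exec($ch);
    curl_close($ch);

    // Walk the result bindings with DOMXPath (namespace-aware).
    $doc = new DOMDocument();
    $doc->loadXML($resultXml);
    $xp = new DOMXPath($doc);
    $xp->registerNamespace('sr', 'http://www.w3.org/2005/sparql-results#');
    foreach ($xp->query('//sr:result') as $result) {
        $parts = array();
        foreach ($xp->query('sr:binding', $result) as $binding) {
            $parts[] = $binding->getAttribute('name') . ' = ' . trim($binding->textContent);
        }
        echo implode('; ', $parts), "\n";
    }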

The data integration and portal-like capabilities we have developed in Yggdrasil, together with its existing deployment in Nectar and Amazon Web Services cloud environments, mean it is well placed to evolve into a user interface supporting this kind of capability. As we proceed with the Convicts and Diggers domain we will try to evolve a suitable ontology to help us move in this direction.

Thursday 28 November 2013

(7) Final reporting for ANDS and technical reflections -- Part 1.

This post offers some technical reflections now that we have reached the end of our ANDS-funded development. Our code base is now up on GitHub in several repositories:

    https://github.com/foundersandsurvivors

Note that the main PHP application repository is "ap20-ands", which will continue to undergo development. Please use the "dev" branch to get the latest version. We are taking this opportunity to review all AP20 code, and much of the Founders and Survivors (FAS) code base it was built upon, to refactor useful functionality into the repository. One difficulty in doing so is that Git manages an entire directory hierarchy but has no support for preserving ownership and permissions. Web applications involve a widely dispersed set of code fragments, from configuration data in /etc to PHP code inside the /var/www web hierarchy and support files outside the web server. This necessitates either git hook scripts to install/deploy code across the file system or some similar customised installer. We chose the latter because it allows a repository to be updated without automatically overwriting an operational system. Our repositories therefore include a "bin" and a "src" directory: the former holds installation/deployment scripts, the latter the operational code. We make use of environment variables to enable repository users to customise the locations of code to suit their own requirements. The installation infrastructure reads what the repository provides in src/etc/environment and checks that the corresponding system environment variable is set; if not, a suggested value is given. Differences between the repository version and the deployed version are reported, enabling the repository code to be tested before the operational versions are replaced.
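A minimal sketch of those two checks (illustrative only, not the actual ap20-ands installer; the AP20_REPO variable name and the file paths are assumptions):

    <?php
    // A minimal sketch (not the actual ap20-ands installer) of the two checks
    // described above. The AP20_REPO variable and file paths are hypothetical.
    $repoRoot = getenv('AP20_REPO') ?: __DIR__;

    // (1) Each line of src/etc/environment is assumed to look like NAME=suggested_value;
    //     warn when the corresponding system environment variable is unset.
    foreach (file("$repoRoot/src/etc/environment", FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES) as $line) {
        if (strpos($line, '=') === false) {
            continue;
        }
        list($name, $suggested) = explode('=', $line, 2);
        if (getenv($name) === false) {
            echo "WARNING: $name is not set; suggested value: $suggested\n";
        }
    }

    // (2) Report differences between a repo file and its deployed counterpart,
    //     so the operational copy is never silently overwritten.
    function reportDiff($repoFile, $deployedFile) {
        if (!file_exists($deployedFile)) {
            echo "NEW: $deployedFile would be created from $repoFile\n";
        } elseif (md5_file($repoFile) !== md5_file($deployedFile)) {
            echo "DIFFERS: $deployedFile differs from repo copy $repoFile\n";
        }
    }

    reportDiff("$repoRoot/src/var/www/ap20/index.php", '/var/www/ap20/index.php');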

One of the major and most interesting technical challenges we faced in this project was integrating relational and XML databases. We were impressed with the general flexibility and performance of the open source BaseX database, particularly after we relocated a mirror instance to a Nectar virtual machine with 24 GB of RAM and assigned it a large JVM heap. For digital humanities projects of our relatively modest scale -- distinct complex records in the order of tens of millions, less than 100 gigabytes in total size, but with a high degree of semantic complexity -- the capability provided by BaseX and its XQuery 3.0 implementation is most impressive. Saxon was also an indispensable part of our toolkit and came to the fore when manipulating very large files (the paid Saxon-EE version provides excellent streaming capabilities). Most importantly, the PostgreSQL JDBC driver enables BaseX XQuery to act as a dynamic and very powerful aggregator/integrator of diverse XML documents and relational data. Using a judicious combination of servlet mappings and Apache2 access controls (IP-number and ticket based, using auth-pubtkt and LDAP), we have been able to implement a small, very flexible, federated hybrid database across multiple Nectar and vSphere VMs, with seamless access via RESTful services. This has enabled us to focus on using the Yggdrasil relational database for its core capability of evidence-based genealogical relationships, and to integrate supporting XML data, or indeed any REST-based service, on demand. We developed some low-level functions using the PHP cURL library for RESTful services, together with PHP's built-in XML/XPath capabilities, to enable this (a sketch of the pattern follows below). JSON is also an option but, quite frankly, we found it no more difficult to deal with XHTML (XML embedded in HTML) in client-side JavaScript libraries than with JSON, and we regarded XML attributes (awkward to deal with in JSON) as worth retaining.
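For the curious, here is a minimal sketch of that low-level pattern: fetching an XML fragment from a BaseX REST endpoint with cURL and interrogating it with PHP's built-in XPath support. The host, port, database name, credentials and element names are hypothetical:

    <?php
    // A minimal sketch of the low-level pattern: fetch an XML fragment from a
    // BaseX REST endpoint with cURL, then interrogate it with SimpleXML/XPath.
    // The host, port, database name, credentials and element names are hypothetical.
    function fetchXml($url, $user, $pass) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_USERPWD, "$user:$pass");   // BaseX REST uses HTTP basic auth
        $body = curl_exec($ch);
        $error = curl_error($ch);
        curl_close($ch);
        if ($body === false) {
            throw new RuntimeException("request failed: $error");
        }
        return new SimpleXMLElement($body);
    }

    // Run an ad hoc XQuery against a hypothetical "fas" database, wrapping the
    // results in a single root element so the response is well-formed XML.
    $xquery = urlencode('<convicts>{ //convict[@ship = "Lady Nelson"] }</convicts>');
    $doc = fetchXml("http://basex.example.org:8984/rest/fas?query=$xquery", 'reader', 'secret');

    foreach ($doc->xpath('//convict/name') as $name) {
        echo (string) $name, "\n";
    }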

A data domain where this capability to use XML/XQuery as a data aggregation "Swiss army knife" really came to the fore was Professor McCalman's Tasmanian Convict Ships research work with volunteers. This involved multiple sources of quite complex data being referred to and collected via both Google Spreadsheets and Drupal (using CCK). The Google spreadsheets were developed as templates entirely by Prof. McCalman to enable volunteers to analyse the available data and make codified judgements on a wide range of socio-economic and familial evidence. A daily workflow ensured that all ship spreadsheets were created according to the templates; that they were populated with the base convict population (the source of which varied by ship depending on past data collections) from the back-end XML FAS database, including known record identifiers (so that data collection was pre-linked, preventing any matching issues); and that appropriate access permissions were set both on the spreadsheets (for editors and staff) and on web-based reports of progress. Whilst an amazingly functional API is available, Google Spreadsheets proved to be quite problematic in terms of unexpected and seemingly random errors. Perl eval statements were required to wrap every single API call, together with decisions about whether to proceed with or abort the processing of a ship (the pattern is sketched below). Nevertheless, we can say that overall the very complex Google Spreadsheets workflow worked pretty well, using a combination of Perl code (with Net::Google::Spreadsheets), Saxon/BaseX XQuery, and simple bash scripts stringing the steps of the workflow together. The AP20 infrastructure will eventually take over the Drupal data entry component of this workflow. The Drupal data capture form grew into a complex structure, and using CCK meant a table for each data item. This meant the daily export job ran extremely slowly and consumed more and more memory. On several occasions we encountered low-level system limitations on this process which were difficult to debug and rectify. Whilst the flexibility of Drupal/CCK is to be admired, we would not recommend it for large, complex structures where the volume of records runs into the thousands. We counted ourselves fortunate to get through to the end of the ships research project with the Drupal data capture component holding together.
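The actual workflow wrapped its Net::Google::Spreadsheets calls in Perl eval blocks; purely to illustrate that defensive pattern, here is a minimal PHP-flavoured sketch of the same idea (retry each remote call a few times, and abort the ship if it still fails). The helper function is a stand-in, not a real API:

    <?php
    // The real workflow used Perl eval around every Net::Google::Spreadsheets
    // call; this is only a PHP-flavoured sketch of the same defensive pattern:
    // retry each remote call a few times, then abort the whole ship if it
    // still fails.
    function guarded(callable $call, $description, $attempts = 3) {
        for ($i = 1; $i <= $attempts; $i++) {
            try {
                return $call();
            } catch (Exception $e) {
                error_log("attempt $i/$attempts failed for $description: " . $e->getMessage());
                sleep(2);   // back off before retrying
            }
        }
        throw new RuntimeException("giving up on: $description");   // abort this ship
    }

    // Hypothetical stand-in for a flaky spreadsheet API call.
    function fetchWorksheetRows($ship) {
        if (mt_rand(0, 1)) {
            throw new Exception("random API error for $ship");
        }
        return array("row data for $ship");
    }

    try {
        $rows = guarded(function () {
            return fetchWorksheetRows('Lady Nelson');
        }, 'fetch worksheet for Lady Nelson');
        // ... continue with the rest of the ship's daily workflow ...
    } catch (RuntimeException $e) {
        error_log('aborting ship: ' . $e->getMessage());
    }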

More to come in Part 2.
