Thursday 28 November 2013

(7) Final reporting for ANDS and technical reflections -- Part 1.

This post offers some technical reflections now that we have reached the end of our ANDS-funded development. Our code base is now up on GitHub in several repositories:

    https://github.com/foundersandsurvivors

Note that the main PHP application repository is "ap20-ands", which will continue to undergo development. Please use the "dev" branch to get the latest version. We are taking this opportunity to review all ap20 code, and much of the Founders and Survivors (FAS) code base it was built upon, and to refactor useful functionality into the repository.

One difficulty in doing so is that Git manages an entire directory hierarchy but has no support for preserving ownership and permissions. Web applications involve a widely dispersed set of code fragments, from configuration data in /etc to PHP code inside the /var/www web hierarchy and support files outside the web server. This necessitates either git hook scripts to install/deploy code across the file system or some similar customised installer. We chose the latter approach because it allows a repository to be updated without automatically overwriting an operational system. Our repos therefore include a "bin" and a "src" directory: the former holds installation/deployment scripts, the latter the operational code. We use environment variables so that repository users can customise the locations of code to suit their own requirements. The installation infrastructure reads what the repo provides in src/etc/environment and checks that the corresponding system environment variable is set; if not, a suggested value is given. Differences between the repo version and the deployed version are reported, enabling the repo code to be tested before the operational versions are replaced.
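Purely by way of illustration, the kind of check the deployment scripts perform looks something like this (written here in PHP; the file names, environment variable names and paths are examples only, not the actual repo contents):

    <?php
    // Illustrative sketch only: read suggested settings from the repo,
    // warn about unset environment variables, and report whether a
    // deployed file differs from the repo copy before replacing it.
    // Paths and variable names are examples, not the real ap20-ands layout.

    $suggested = parse_ini_file('src/etc/environment') ?: array(); // e.g. AP20_WEB_ROOT=/var/www/ap20
    foreach ($suggested as $name => $value) {
        if (getenv($name) === false) {
            echo "WARNING: $name is not set; suggested value: $value\n";
        }
    }

    $webRoot      = getenv('AP20_WEB_ROOT') ?: '/var/www/ap20';
    $repoFile     = 'src/var/www/ap20/index.php'; // copy held in the repo
    $deployedFile = "$webRoot/index.php";         // operational copy

    if (!is_file($deployedFile)) {
        echo "Not yet deployed: $deployedFile\n";
    } elseif (md5_file($repoFile) !== md5_file($deployedFile)) {
        echo "Differs from repo -- test before replacing: $deployedFile\n";
    } else {
        echo "Up to date: $deployedFile\n";
    }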

One of the major and most interesting technical challenges we faced in this project was integrating relational and XML databases. We were impressed with the general flexibility and performance of the open source BaseX database, particularly after we relocated a mirror instance to a Nectar virtual machine with 24GB of RAM and assigned it a large JVM. For digital humanities projects of our relatively modest scale -- complex records numbering in the tens of millions across distinct record types, less than 100 gigabytes in total size, but with a high degree of semantic complexity -- the capability provided by BaseX and its XQuery 3.0 implementation is most impressive. Saxon-EE was also an indispensable part of our toolkit and came to the fore when manipulating very large files (the paid EE version provides excellent streaming capabilities). Most importantly, the PostgreSQL JDBC driver enables BaseX XQuery to act as a dynamic and very powerful aggregator/integrator of diverse XML documents and relational data.

Using a judicious combination of servlet mappings and Apache2 access controls (IP-number and ticket based, using auth-pubtkt and LDAP), we have been able to implement a small, very flexible federated hybrid database across multiple Nectar and vSphere VMs, with seamless access via RESTful services. This has enabled us to focus on using the Yggdrasil relational database for its core capability of evidence-based genealogical relationships, and to integrate supporting and additional XML data, or indeed any REST-based service, on demand. We developed some low-level functions using the PHP Curl library for RESTful services, together with PHP's inbuilt XML/XPath capabilities, to enable this. JSON is also an option but, quite frankly, we found XHTML (XML embedded in HTML) no more difficult to deal with in client-side JavaScript libraries than JSON, and regarded XML attributes (which are awkward to deal with in JSON) as worth retaining.
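The low-level pattern on the PHP side looks roughly like this (a sketch only: the endpoint URL, credentials and element names are invented for the example, and the real services sit behind the access controls described above):

    <?php
    // Sketch of the low-level pattern: fetch an XML fragment from a BaseX
    // RESTful endpoint with Curl, then interrogate it with PHP's DOM/XPath.
    // The URL, environment variables and element names are illustrative.

    function fetchXml($url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_USERPWD, getenv('BASEX_USER') . ':' . getenv('BASEX_PASS'));
        $body = curl_exec($ch);
        $err  = curl_error($ch);
        curl_close($ch);
        if ($body === false) {
            throw new RuntimeException("Curl error: $err");
        }
        $doc = new DOMDocument();
        if (!$doc->loadXML($body)) {
            throw new RuntimeException("Response was not well-formed XML");
        }
        return $doc;
    }

    // Example: list record identifiers from one (hypothetical) ship document.
    $doc   = fetchXml('https://basex.example.edu/rest/fas/ships/ship123.xml');
    $xpath = new DOMXPath($doc);
    foreach ($xpath->query('//convict/@id') as $id) {
        echo $id->value, "\n";
    }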

A data domain where this capability to use XML/XQuery as a data aggregation "Swiss army knife" really came to the fore was Professor McCalman's Tasmanian Convict Ships research work with volunteers. This involved multiple sources of quite complex data being referred to and collected via both Google spreadsheets and Drupal (using CCK). The Google spreadsheets were developed as templates entirely by Prof. McCalman to enable volunteers to analyse the available data and make codified judgements on a wide range of socio-economic and familial evidence. A daily workflow ensured that all ship spreadsheets were created according to the templates; populated from the back-end XML FAS database with the base convict population (the source of which varied by ship depending on past data collections), including known record identifiers so that data collection was pre-linked and matching issues were avoided; and given appropriate access permissions, both to the spreadsheets (for editors and staff) and to web-based reports of progress. Whilst an amazingly functional API is available, Google spreadsheets proved quite problematic in terms of unexpected and seemingly random errors. Perl eval statements were required to wrap every single API call, together with decisions about whether to proceed with or abort the processing of a ship (a sketch of this pattern appears below). Nevertheless, we can say that overall this very complex Google spreadsheets workflow worked pretty well, using a combination of Perl code (Net::Google::Spreadsheets), Saxon/BaseX XQuery, and simple bash scripts stringing the steps of the workflow together.

The AP20 infrastructure will eventually take over the Drupal data entry component of this workflow. The Drupal data capture form grew into a complex structure, and using CCK meant a table for each data item. This meant the daily export job ran extremely slowly and took up more and more memory. On several occasions we encountered low-level system limitations on this process which were difficult to debug and rectify. Whilst the flexibility of Drupal/CCK is to be admired, we would not recommend it for large, complex structures where the volume of data is in the thousands. We counted ourselves fortunate to get through to the end of the ships research project with the Drupal data capture component holding together.
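For the curious, the defensive wrapping mentioned above amounted to something like the following. The production code did this in Perl, wrapping each Net::Google::Spreadsheets call in eval; this sketch uses PHP (to match the examples above) and invented function names purely to illustrate the retry-or-abort pattern:

    <?php
    // Illustrative retry-or-abort wrapper around a flaky remote API call.
    // The real workflow used Perl eval around Net::Google::Spreadsheets calls;
    // the function names and retry policy here are invented for the example.

    function callWithRetry(callable $call, $attempts = 3) {
        for ($i = 1; $i <= $attempts; $i++) {
            try {
                return $call();              // success: hand back the result
            } catch (Exception $e) {
                error_log("Attempt $i failed: " . $e->getMessage());
                sleep($i);                   // brief back-off before retrying
            }
        }
        return false;                        // give up; caller decides whether to abort
    }

    // Stand-in for a spreadsheet API call that sometimes fails at random.
    function updateShipRow(array $row) {
        if (rand(0, 3) === 0) {
            throw new Exception('random API error');
        }
        return true;
    }

    // Abort the whole ship if any row update cannot be completed.
    $rows = array(array('id' => 1), array('id' => 2), array('id' => 3));
    foreach ($rows as $row) {
        if (callWithRetry(function () use ($row) { return updateShipRow($row); }) === false) {
            error_log('Aborting this ship; it will be retried in the next daily run.');
            break;
        }
    }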

More to come in Part 2.
