Fasconn: ANDS AP20 Development Blog

Sunday, 1 December 2013

Final reporting for ANDS and technical reflections -- Part 2.

Continuing our technical reflections on this project, I will wrap up with some speculation on how we think the infrastructure will evolve over the next 12 months.

One of the main features of the project has been how it has extended the preexisting Founders and Survivors (FAS) data model with genealogical relationships and with a hierarchical way of documenting sources. Yggdrasil's hierarchical sources model dovetails perfectly into XML modes of representation, and we were able to leverage this in generating XML data-driven work flows for populating the "branch" levels of Yggdrasil source trees. However, Yggdrasil simply provides a blob of text to be used as required for each source. Our general experience with the research domains to which AP20 has been applied (Convicts, Diggers, Koori Health) strongly leads us to believe that we need to do everything possible to get away from raw "text", whether in web forms or spreadsheet cells, as a mode of data capture. This is of course an exceedingly difficult problem to solve without lots of custom programming.

Towards the end of the project we were able to attend the International Semantic Web Conference in Sydney:

http://iswc2013.semanticweb.org/content/program-friday

A number of papers and posters at that conference were extremely relevant to providing practical ways forward for solving this and other problems. Of particular note for our needs were:

ActiveRaul which automatically generates a web-based editing interface from an ontology http://iswc2013.semanticweb.org/content/demos/30
PROV-O, an ontology for describing provenance: http://www.w3.org/TR/prov-o/. Provenance is information about entities, activities, and people involved in producing a piece of data or thing, which can be used to form assessments about its quality, reliability or trustworthiness. PROV-DM is the conceptual data model that forms a basis for the W3C provenance (PROV) family of specifications. See explanatory material here: https://wiki.duraspace.org/display/VIVO/Prov-O+Ontology
the collaborative redevelopment of ICD-11 (the International Classification of Diseases, used for cause of death coding) using WebProtégé: http://link.springer.com/chapter/10.1007%2F978-3-642-16438-5_6#page-1

We are hopeful that ActiveRaul could provide a workable approach to providing editing services for ontology based data fragments such as specific research data capture needs associated with the existing Drupal data entry form of Prof McCalman's ships research project.

At the conference we also encountered many successful domain specific examples where semantic technologies had been used to interlink, search and build innovative services across disparate sources of data. We believe this approach is a fertile way forward for solving a specific problem: better sharing and exchanging of data with our collaborators such as the Tasmanian Archives and Heritage Office and the Female Convicts Research Collective in Hobart. It is entirely feasible that an overarching ontology for prosopography, customised for the convict system, would enable each group to publish some RDF in accordance with that ontology and to provide a SPARQL endpoint, enabling federated search across the multiple databases. We believe this approach can help us solve problems of collaborative matching and data exchange whilst enabling each party to continue with the data management practices which best suit their own needs.
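As a rough illustration of the federated search idea, the sketch below builds a SPARQL 1.1 query with one SERVICE block per collaborator endpoint. The endpoint URLs, the `pros:` namespace and its property names are hypothetical placeholders; no such ontology has been published by the project.

```python
# Hypothetical namespace for a convict prosopography ontology (an assumption).
PROS_NS = "http://example.org/prosopography#"


def build_federated_query(endpoints, surname):
    """Return a SPARQL 1.1 query that unions one SERVICE block per endpoint."""
    if not endpoints:
        raise ValueError("at least one endpoint is required")
    # Escape backslashes and quotes so the surname is a safe string literal.
    literal = surname.replace("\\", "\\\\").replace('"', '\\"')
    blocks = []
    for url in endpoints:
        blocks.append(
            "  {\n"
            f"    SERVICE <{url}> {{\n"
            "      ?person a pros:Person ;\n"
            f'              pros:surname "{literal}" ;\n'
            "              pros:shipName ?ship .\n"
            "    }\n"
            # Record which endpoint each match came from.
            f"    BIND(<{url}> AS ?source)\n"
            "  }"
        )
    body = "\n  UNION\n".join(blocks)
    return (
        f"PREFIX pros: <{PROS_NS}>\n"
        "SELECT ?person ?ship ?source WHERE {\n"
        f"{body}\n"
        "}"
    )
```

Each party keeps its own database and only needs to expose RDF that follows the shared ontology; the query text is the only thing that knows about all of the endpoints.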

The data integration and portal-like capabilities we have developed in Yggdrasil, and its existing deployment in Nectar/Amazon Web Services cloud environments, mean it is well placed to evolve as a user interface to support this kind of capability. As we proceed with the Convicts and Diggers domain we will try to evolve a suitable ontology to help us move in this direction.

Thursday, 28 November 2013

(7) Final reporting for ANDS and technical reflections -- Part 1.

This post will make some technical reflections now we have reached the end of our ANDS funded development. Our code base is now up on github here in several repos:

https://github.com/foundersandsurvivors

Note the main PHP application repository is "ap20-ands", which will continue to undergo development. Please use the "dev" branch there to get the latest version. We are taking this opportunity to review all ap20 code, and much of the Founders and Survivors (FAS) code base it was built upon, in order to refactor useful functionality into the repository. One difficulty in doing so is that Git manages an entire directory hierarchy but does not preserve file ownership or permissions (beyond the executable bit). Web applications involve a widely dispersed set of code fragments, from configuration data in /etc to PHP code inside the /var/www web hierarchy and support files outside of the web server. This necessitates either Git hook scripts to install/deploy code across the file system or some similar customised installer. We chose the latter because it lets one update a repo without automatically overwriting an operational system. Our repos therefore include a "bin" and a "src" directory. The former is for installation/deployment scripts, the latter for the operational code. We make use of environment variables to enable repository users to customise the locations of code to suit their own requirements. The installation infrastructure reads what the repo provides in src/etc/environment and checks that each system environment variable is set; if not, a suggested value is given. Differences between the repo version and the deployed version are reported, enabling the repo code to be tested prior to the operational versions being replaced.
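The checks described above can be sketched roughly as follows. This is not the installer from the repository: the KEY=value file format, the variable names used in the usage note, and the function names are all assumptions made for illustration.

```python
import os


def parse_environment_file(text):
    """Parse KEY=value lines, skipping blanks and # comments (assumed format)."""
    suggested = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        suggested[key.strip()] = value.strip().strip('"')
    return suggested


def check_environment(suggested, actual=None):
    """Return one message per variable that is not set, with its suggestion."""
    actual = os.environ if actual is None else actual
    messages = []
    for key, value in sorted(suggested.items()):
        if key not in actual:
            messages.append(f"{key} is not set; suggested value: {value}")
    return messages


def report_differences(repo_files, deployed_files):
    """Compare repo and deployed contents, given as dicts of path -> text."""
    report = {}
    for path, repo_text in repo_files.items():
        if path not in deployed_files:
            report[path] = "not deployed"
        elif deployed_files[path] != repo_text:
            report[path] = "differs"
        else:
            report[path] = "same"
    return report
```

For example, `check_environment(parse_environment_file('AP20_WEB_ROOT="/var/www/ap20"'), actual={})` reports that the (hypothetical) variable `AP20_WEB_ROOT` is unset and suggests the repo's value. Nothing is overwritten; the report only tells the operator what would change.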

One of the major and most interesting technical challenges we faced in this project was integrating relational and XML databases. We were impressed with the general flexibility and performance of the open source BaseX database, particularly after we relocated a mirror instance to a Nectar Virtual Machine with 24GB RAM and assigned a large JVM. For digital humanities projects of our relatively modest scale -- distinct complex record types in the order of tens of millions, and less than 100 gigabytes in total size, but a high degree of semantic complexity -- the capability provided by BaseX and its XQuery 3.0 implementation is most impressive. Saxon-EE was also an indispensable part of our toolkit and came to the fore when manipulating very large files (the paid EE version provides excellent streaming capabilities). Most importantly, the use of the PostgreSQL JDBC driver enables BaseX XQuery to act as a dynamic and very powerful aggregator/integrator of diverse XML documents and relational data. Using a judicious combination of servlet mappings and apache2 access controls (IP number and ticket based, using auth-pubtkt and LDAP) we have been able to implement a small, very flexible, federated hybrid database across multiple Nectar and VSphere VMs with seamless access using restful services. This has enabled us to focus on using the Yggdrasil relational database for its core capability of evidence based genealogical relationships and to integrate supporting and additional XML data, or in fact any rest-based service, on demand. We developed some low level functions using the PHP Curl library for restful services and PHP's inbuilt xml/xpath capabilities to enable this. JSON is also an option but, quite frankly, we found it no more difficult to deal with XHTML (xml embedded in HTML) in client side Javascript libraries than with JSON, and regarded XML attributes (awkward to deal with in JSON) as worth retaining.
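The project's own helpers are written in PHP (Curl plus the built-in xml/xpath support). The same pattern, fetch XML from a restful service and pick out what is needed with a path expression, looks roughly like this in Python. The `<person>` and `<surname>` element names and the `id` attribute are hypothetical; they are not the FAS schema.

```python
import urllib.request
import xml.etree.ElementTree as ET


def fetch_xml(url, timeout=30):
    """GET a restful resource and parse the response body as XML."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return ET.fromstring(response.read())


def extract_people(root):
    """Pull the id attribute and surname out of every <person> element."""
    people = []
    for person in root.findall(".//person"):
        people.append({
            "id": person.get("id"),  # XML attributes survive untouched
            "surname": person.findtext("surname", default=""),
        })
    return people
```

Keeping fetching and extraction separate means the extraction step can be exercised against a saved XML document without touching the network.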

A data domain where this capability to use XML/XQuery as a data aggregation "swiss army knife" really came to the fore was Professor McCalman's Tasmanian Convict Ships research work with volunteers. This involved multiple sources of quite complex data being referred to and collected via both Google spreadsheets and Drupal (using CCK). The Google spreadsheets were developed as templates entirely by Prof. McCalman to enable volunteers to analyse the available data and make codified judgements on a wide range of socio-economic and familial evidence. A daily work flow would ensure all ships' spreadsheets were created according to the templates and populated with the base convict population (the source of which varied by ship depending on past data collections) from the back end XML FAS database, including known record identifiers (ensuring data collection was pre-linked, preventing any matching issues). It would also ensure appropriate access permissions to both the spreadsheets (for editors and staff) and to web based reports of progress. Whilst an amazingly functional API is available, Google spreadsheets proved to be quite problematic in terms of unexpected and seemingly random errors. Perl eval statements were required to wrap every single function call, along with decisions about whether to proceed or abort the processing of a ship. Nevertheless, we can say that overall the very complex Google spreadsheets work flow worked pretty well using the combination of Perl code using Net::Google::Spreadsheets and Saxon/BaseX XQuery, with simple bash scripts stringing the steps of the work flow together. The AP20 infrastructure will eventually take over the Drupal data entry component of this work flow. The Drupal data capture form grew into a complex structure, and using CCK meant a table for each data item. This meant the daily export job ran extremely slowly and would take up more and more memory. On several occasions we encountered low level system limitations on this process which were difficult to debug and rectify. Whilst the flexibility of Drupal/CCK is to be admired, we would not recommend it for large complex structures where the volume of data is in the thousands. We counted ourselves fortunate in being able to get through to the end of the ships research project with the Drupal data capture component holding together.
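The "wrap every call and decide whether to proceed or abort a ship" pattern is sketched below in Python rather than Perl. The retry count is an assumption added for illustration (the original work flow is only described as wrapping calls in eval), and the step functions stand in for whatever spreadsheet operations are needed.

```python
import time


def call_with_retry(func, *args, attempts=3, delay=0.0, **kwargs):
    """Call func, retrying on any exception; return (ok, result_or_error)."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return True, func(*args, **kwargs)
        except Exception as error:  # errors were unpredictable, so catch broadly
            last_error = error
            if attempt < attempts:
                time.sleep(delay)
    return False, last_error


def process_ships(ships, steps, attempts=3):
    """Run every step for every ship; abandon a ship when a step keeps failing."""
    outcome = {}
    for ship in ships:
        outcome[ship] = "done"
        for step in steps:
            ok, result = call_with_retry(step, ship, attempts=attempts)
            if not ok:
                # Abort this ship only; later ships are still processed.
                outcome[ship] = f"aborted at {step.__name__}: {result}"
                break
    return outcome
```

The point of the design is that one misbehaving spreadsheet costs a single ship's run, not the whole daily job.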

More in Part 2 coming.

Tuesday, 28 August 2012

(6) Project Outputs & Our Primary Product

This project will produce two operational genealogy and social history database websites, one for each of the research domains used to extend Yggdrasil:

the KHRD (Koori Health Research Database) and
the Convicts and Diggers database.

These sites will allow authorised users to search for individuals, view their genealogies and life course data in a variety of ways, manage population sub-groups and cohorts, access routine reports, export data, and grow the database's scope and accuracy as further sources come to light, are entered, and interpretations improve.

The code used to create these websites will be available for re-use under a BSD licence. All components created by the project team will be available from a github site (see the "Sourcecode" link on this blog) along with documentation for users and developers. The user reference manual will address operational day to day usage procedures and will explain how to go about using a new database instance and providing access to authorised users. Where possible, documentation will also be built into online contextual help. The developer manual will explain how to set up your own instance of the site: how to install the software, and the required prerequisites, dependencies and configuration. All dependencies will be documented and configuration instructions for those dependencies will be provided. Detailed information will be given along with the code used to initially load both the KHRD and the Convicts and Diggers databases. The intention is that other parties should be able to adapt the load procedures used for these website instances to their own requirements, if they are not starting a new database from scratch.

(5) Features and Technology

A Norse term for "tree of life", Yggdrasil is an open source genealogy software program developed by Leif Biberg Kristensen. A Postgresql database using extensive stored procedures and a very thin layer of PHP as its web interface, it is well suited to recording basic genealogies of whole populations. Because it is an open source web based database application whose design is not tied to the ageing GEDCOM "standard" it will help solve problems arising in large academic human population study projects requiring collaboration and multiple users.

Yggdrasil (a "vanilla" LAMP application) is by far the most important technology being used in this project and we are extremely grateful to Leif for generously allowing us to take over the ongoing development of this open source application. With one exception (Temporal Earth), all related technologies are web-based, offer restful interfaces, can exchange data using XML and/or JSON, and are therefore relatively straightforward to integrate in a web application. Building on experiences with Founders and Survivors we will use both VSphere and OpenStack (Research Cloud) virtualisations to deploy database instances. Ubuntu 12.04 LTS server edition is our OS of choice.

The essence of the AP20 "FASCONN" project is to enhance and extend the existing Yggdrasil software application with 10 new or improved capabilities:

extend the database for new features, improve web usability and provide adequate user and developer documentation;
allow geocoding of place names (interacting with Geoscience Australia's new WFS-G service as a web client), which will in turn support geographical and map based visualisations of all events where the place is known (considering PostGIS and GeoDjango);
extend sources and facts captured with prosopography XML markup inspired by TEIp5 Names, Dates, People and Places and integrating data already available in the Founders and Survivors BaseX XML database and XRX techniques;
improve person matching and record linkage (integrating concepts of John Bass's LKT C independent link management program reimplemented using the Neo4J Graph database and possibly Gremlin for graph traversal);
enable populations, groups, and cohorts to be identified and managed (enhance/extend Yggdrasil's tables);
improve searching and selection (considering using Apache SOLR and a custom Data Import Request Handler);
generate animated visualisations of population events in space and time (using Matt Coller's Temporal Earth);
generate large-scale genealogy diagrams using either Pajek or Neo4J (James Rose who has done similar work to advise);
generate narrative prosopographies (adapt and extend the XML/XSLT 1.0 Founders and Survivors public search interface; possibly adding SIMILE timelines and more interactivity by using client side XSLT 2.0 using Saxon-CE);
generate documented export (flat) files to support researchers' own analysis in their tools of choice e.g. SPSS, R, Excel etc.

The AP20 Software Capabilities Diagram shows these 10 core capabilities pictorially. Building out from the database at its core (C1), each capability is relatively self-contained. C1, the database centre, is the extended Yggdrasil.

For further technical details about the project, please read the initial AP20 technical overview proposed to ANDS, or see our short AP20 video presentation at the ANDS Community Day in Sydney in May. You can also view our experimental RIFCS entries which document the services and collections involved in the project.

Loading existing data collections into their Yggdrasil instances will require bespoke work flows using bash shell scripts, Perl, and Saxon-EE (a commercial licence is available through the Founders and Survivors project and its servers). These will be adapted from an earlier proof of concept load of KHRD data into Yggdrasil.

Given the technical context above, a number of questions require clarification:

a) What is the most important part of technology being used?

Yggdrasil i.e. Postgresql, PHP, Apache, LDAP (for user authentication and authorisation, leveraging an existing Founders and Survivors directory server). Standard web technologies such as CSS, some XQuery and XSLT for Founders and Survivors BaseX database interaction.

b) What will require the most development effort and why?

Capability 3 - the ability to associate Yggdrasil events and people with extended, flexible "factoids" will require the most development effort because of the complexity involved in enabling a data-driven (ie. user configurable) data model which is scalable, and which can provide easy to use web-interfaces for data capture and validation, without assistance from programmers. To mitigate this risk we intend to, initially, provide a basic capability which leverages existing programmed interfaces to Google Spreadsheets. This will enable a researcher to use an external Google spreadsheet for data capture using a Google Form of their own design. Our academic research champions have proven on other Founders and Survivors data capture exercises they are perfectly capable of designing and building their own Google Forms and spreadsheets.

Building a simple work flow manager into Yggdrasil is most certainly doable. In it, a population can be defined and a data capture exercise for that population can be associated with a particular Google Form. A hidden "key" field provides the join between the Yggdrasil database person and the Google spreadsheet row containing the researcher designed form data, while the work flow manager serves up the required HTML and keeps track of to whom each entry has been assigned. This is likely to be more acceptable to users than entering valid TEI XML markup or using unsatisfactory web based XML or JSON editors. This approach is of course somewhat less "integrated". However, we believe it will be a useful first step and will enable us to return to the more thorny problem when solid progress has been made across the other 9 capabilities.
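A minimal sketch of the hidden key join, assuming database persons and spreadsheet rows have already been read into plain dictionaries. The field names `person_id`, `person_key` and `assigned_to` are illustrative only and are not taken from Yggdrasil's schema.

```python
def join_form_rows(persons, form_rows, key_field="person_key"):
    """Attach each spreadsheet row to its database person via the hidden key."""
    by_key = {}
    for row in form_rows:
        by_key.setdefault(row[key_field], []).append(row)
    joined = []
    for person in persons:
        joined.append({
            "person_id": person["person_id"],
            "assigned_to": person.get("assigned_to"),
            "rows": by_key.get(person["person_id"], []),
        })
    return joined


def outstanding(joined):
    """List person ids that have been assigned but have no form data yet."""
    return [j["person_id"] for j in joined if j["assigned_to"] and not j["rows"]]
```

Because the key is written into the form before the volunteer sees it, no name matching is needed when the data comes back; `outstanding` gives the work flow manager its list of entries still awaiting capture.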

c) What features are the most important to gain customer satisfaction and buy-in?

Multiple users, ease of access, ease of sharing data with others, ease of extending data, and ease of keeping data accurate and growing, in a single place, are vital. In short, the user base will be delighted to use a shared, readily accessible web database. Also vital is one-click access to highly useful and relevant outputs which directly support research (Capabilities 6 to 10). Users are tolerant of less than glorious and beautiful interfaces and are more concerned about accurate and usable content.

d) Non-functional requirements.

The following requirements apply to AP20 as a whole system and should be adequate to guide developments in regard to technical requirements across each specific capability.

i) Authentication and authorisation

Access to all data updating functions is to be authenticated and authorised to a research domain.

Authentication method (http basic) provider to be configurable as:

.htaccess and local http password files
LDAP e.g. the Founders and Survivors LDAP service at http://founders-and-survivors.org.
other methods as deemed useful and feasible within time/budget constraints (e.g. MAMS enabling the application is highly desirable but may not be achievable).

Authenticated access levels/roles are "administrator" (data custodian and application configuration), "staff" (data maintenance), and "visitor" (read only).
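One possible way to express these roles in code is a small permission table checked per research domain. The three role names come from the requirement above; the action names and the function are assumptions for illustration.

```python
# Role names are from the requirement; the action names are assumptions.
PERMISSIONS = {
    "administrator": {"read", "update", "configure"},
    "staff": {"read", "update"},
    "visitor": {"read"},
}


def is_allowed(role, action, user_domain, target_domain):
    """Allow an action only within the user's own research domain."""
    if user_domain != target_domain:
        return False
    return action in PERMISSIONS.get(role, set())
```

An unknown role gets an empty permission set, so the check fails closed.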

ii) Data privacy (legal and ethical)

Where authorised researchers choose to make exported data or particular functions available to the public, any identifying data concerning living persons is to be presented anonymously.
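A sketch of how such an export filter might work. The rule used to presume a person is living (no recorded death and born less than 100 years ago, with unknown birth dates treated as living) is an assumption made for this example, not a rule stated by the project, and the field names are hypothetical.

```python
from datetime import date


def is_presumed_living(person, today, max_age_years=100):
    """Assumed rule: no recorded death and born under max_age_years ago."""
    if person.get("death_year") is not None:
        return False
    birth_year = person.get("birth_year")
    if birth_year is None:
        return True  # unknown dates are treated cautiously as living
    return today.year - birth_year < max_age_years


def anonymise_for_export(persons, today=None, identifying=("name", "address")):
    """Return copies of the records with identifying fields withheld."""
    today = today or date.today()
    exported = []
    for person in persons:
        row = dict(person)  # copy, so the source records are never altered
        if is_presumed_living(row, today):
            for field in identifying:
                if field in row:
                    row[field] = "WITHHELD"
        exported.append(row)
    return exported
```

The actual threshold and the list of identifying fields would need to be set by each domain's data custodians.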

iii) Performance

Interactive web response times should be reasonable to the task e.g. 2-8 seconds for small interactive tasks. Batch tasks such as data loading or quality assurance work should be backgrounded/queued so that users are not delayed, and are informed when output is available.

iv) Scalability

Each instance (i.e. each research domain) should be capable of supporting up to 10 simultaneous logged in users. The database structures and processing should be capable of supporting the entire FAS data domain (75,000 convicts) and descendants of the survivors for up to 6 generations, i.e. up to a million persons is theoretically possible. In such cases the amount of memory available to the host needs to be appropriate to the database's requirements.

v) Security

No data encryption is required but data access is to be restricted to authorised users unless otherwise determined by a research domain's data custodians/administrators.

vi) Maintainability

As a research application (not business transactions), maintainability will be enhanced by using tools and technologies where skills are relatively accessible in the Research IT space. Yggdrasil consists of "vanilla" web technology PHP/Postgresql. It may be enhanced by use of or integration with popular frameworks. All code developed is to be under version control and publicly available at github.

vii) Usability

Users are expected to be competent in their research domain and capable of reading reasonable instructions as to how to use the application. Usability requirements are thus moderate.

viii) Multi-lingual support

Yggdrasil supports Norwegian and English.

ix) Auditing and Logging

A simple change log showing who did what when is required. Changes which add/merge/delete or link persons need to be explicitly logged. A simple text and/or XML file will suffice for this purpose. The logs will be retained and rolled over weekly or after each 1000 transactions, whichever is sooner.
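The rollover rule (weekly or after 1000 transactions, whichever is sooner) and a "who did what when" entry can be sketched as below. The tab-separated entry layout and the function names are assumptions; only the two thresholds come from the requirement.

```python
from datetime import datetime, timedelta

MAX_TRANSACTIONS = 1000          # from the requirement
MAX_AGE = timedelta(weeks=1)     # from the requirement


def should_roll_over(log_started, transaction_count, now):
    """True when either limit is reached, i.e. whichever comes sooner."""
    return transaction_count >= MAX_TRANSACTIONS or now - log_started >= MAX_AGE


def format_entry(when, user, action, person_ids):
    """One tab-separated line: when, who, what, and the persons affected."""
    ids = ",".join(str(i) for i in person_ids)
    return f"{when.isoformat()}\t{user}\t{action}\t{ids}"
```

A plain text line per change keeps the log readable with standard command line tools.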

x) Availability

24/7 or "four nines" availability is not required.

Implementation and operations will be simplified if a weekly maintenance window of one hour (for small domains e.g. KHRD) and up to three hours (for large domains e.g. Convicts and Diggers) is routinely scheduled to enable domain data quality assurance and database backups to be conducted without user updates.

A deployment model where research cloud instances may be invoked, configured, loaded, used, and saved/snapshotted may be a perfectly acceptable mode of operations for both KHRD and TasDig domains, and would greatly mitigate internet security risks posed by 24/7 web operations. It is also important, in that case, that the procedure for researchers to invoke their instance is not onerous. If the database and associated files exceed 8GB then research cloud snapshots will need to be supplemented with scripted save/restore to S3 containers and/or other storage. These issues should be clearer after the tech lead attends an OpenStack workshop in September 2012.

e) A high-level architectural diagram.

Data load workflows:

diagram of the workflow used to load the KHRD database as an Yggdrasil database instance
diagram for Convicts and Diggers Yggdrasil load (to be completed following KHRD load)

f) Source code repository

Available at https://github.com/foundersandsurvivors/ap20-ands.

We may add separate repositories for code associated with loading the KHRD and Convicts and Diggers instances.

(4) How will we know if it works?

It is important to note that each group will have their own distinct instance of an AP20 genealogy web application database to gain access to the genealogies of their target population. Both groups of customers will want to:

locate individuals and/or groups of individuals and explore all known data about them in a variety of ways:

narrative prosopography (life course events sequenced in time),
inter-generational relationships e.g. spouses, ancestors and descendants (listed and/or visualised as family trees/network charts),
visualise events placed geographically on maps and animated across time,
reports such as survival analysis and infant mortality,
export flattened versions of data and reports for further adhoc analysis in statistical packages with adequate meta data describing the provenance of the data and relevant citation information.

add new sources of evidence such as birth, death, and marriage certificates, and other user-specified sources of interest;
locate or create individuals whose existence is evidenced by or inferred from the sources;
assert from sources a variety of user specified events indicating when, where, who, and user defined "factoids" of interest;
be assisted in accessing related web-based information in external databases e.g. search in Trove;
be assisted in matching, merging, and linking internal and external data records to individuals in the population;
be assisted in organising data capture and targeted collecting of research data using controlled "crowd sourcing" methods;
execute the above easily, efficiently, flexibly, and transparently (changes need to be logged);
ensure data is secured, private and accessible to authorised users as determined by the customers themselves in line with relevant legislative, privacy and ethical requirements.

Initially the KHRD database will be loaded and our user interface analyst, Nick Knight, will test the system, document system usage conventions, coordinate and assist other authorised KHRD users, assure adequate data quality, and liaise with the research champion and the technical lead as to software enhancements required to meet the KHRD user group's needs.

Once this initial version of the KHRD is operational, attention will turn to the load of the Diggers and Convicts population in a second instance of the application.

(3) Who is the project serving?

One of our key customer groups comprises academic historians, demographers, members of the public who are genealogists, and those interested in coordinating and enabling access to Koori health data such as the Onemda VicHealth Koori Health Unit. This group of users will use the Koori Health Research Database (KHRD). The "research champion" for this group of users is historian Professor Janet McCalman from the University of Melbourne's School of Population Health. The existing dataset which is the core of this research domain has had a particularly difficult and fractured provenance. This group will benefit greatly from easy access (via the world wide web) to a single accurate shared online non-proprietary database recording the lives and relationships of this population. Where determined by the long term data custodians (Onemda), members of the Koori community will be given extracts which provide private access to their descendants' records from the database, thus supporting their sense of community and identity.

Our second key customer group comprises academic historians, demographers, and members of the public who are genealogists, all involved in the project Convicts and Diggers: a demography of life courses, families and generations (ARC Discovery 2011-2013). Based on convict records from the Founders and Survivors project, birth, death and marriage registrations, World War One service records, and other historical data, this project explores long-term demographic outcomes of individuals, families and lineages. The project draws on the expertise of family historians to trace individuals and their descendants for 'Australia's biggest family history'. Some volunteers (members of the public interested in family history or their convict ancestors) may be involved in some data gathering and research tasks. The "research champion" for this group of users is Dr. Rebecca Kippen from the University of Melbourne's School of Population Health.

(2) Why do this project?

The goals of the FASCONN project are to provide a sound basis for various collaborative academic research projects concerning the genealogy and family history of large groups of individuals with a particular focus on human health and survival. Two distinct populations will form the research domains for this software development:

An existing Koori Health Research Database covering over 10,00 indigenous people and their descendants who lived in Victoria and NSW in the 19th and 20th Century, with a particular focus on recording and understanding health outcomes e.g. causes of death.
Building on the Founders and Survivors project, a database of over 15,000 Tasmanian born WWI AIF enlistees who are descended from convicts transported to Tasmania in the 19th Century, with a particular focus on understanding the historical determinants of responses to stress.

A well documented open-source web-based multiple user database application designed to address the general area of large-scale historical demography has many benefits. It will enable many collaborators, starting from scratch, to formally record and transcribe their evidence and sources, and to assert events, multiple persons, and human relationships with formal citations from those sources. This will enable family and social historians to focus on collecting or linking to, and interpreting, the relevant historical sources - the core of their research - efficiently and effectively. They will no longer need to risk being locked into single-user, desktop bound proprietary software oriented to a single individual's family tree, nor be distracted by having to build their own technological solutions to what is a common problem: analysing the experiences of groups of people, the events in place and time which form their life histories, and the inter-generational relationships between them. They will be in control of their own research data and will be able to extend their databases and share data with other researchers over a long period of time.