Fasconn: ANDS AP20 Development Blog

(5) Features and Technology

A Norse term for "tree of life", Yggdrasil is an open source genealogy software program developed by Leif Biberg Kristensen. Built as a PostgreSQL database with extensive stored procedures and a very thin layer of PHP as its web interface, it is well suited to recording basic genealogies of whole populations. Because it is an open source, web-based database application whose design is not tied to the ageing GEDCOM "standard", it will help solve problems arising in large academic human population study projects that require collaboration and multiple users.

Yggdrasil (a "vanilla" LAMP application) is by far the most important technology being used in this project, and we are extremely grateful to Leif for generously allowing us to take over the ongoing development of this open source application. With one exception (Temporal Earth), all related technologies are web-based, offer RESTful interfaces, can exchange data using XML and/or JSON, and are therefore relatively straightforward to integrate in a web application. Building on experience with Founders and Survivors, we will use both vSphere and OpenStack (Research Cloud) virtualisation to deploy database instances. Ubuntu 12.04 LTS server edition is our OS of choice.

The essence of the AP20 "FASCONN" project is to enhance and extend the existing Yggdrasil software application with 10 new or improved capabilities:

1) extend the database for new features, improve web usability and provide adequate user and developer documentation;
2) allow geocoding of place names (interacting with Geoscience Australia's new WFS-G service as a web client), which will in turn support geographical and map-based visualisations of all events where the place is known (considering PostGIS and GeoDjango);
3) extend sources and facts captured with prosopography XML markup inspired by TEI P5 Names, Dates, People and Places, integrating data already available in the Founders and Survivors BaseX XML database and XRX techniques;
4) improve person matching and record linkage (integrating concepts of John Bass's LKT C independent link management program, reimplemented using the Neo4j graph database and possibly Gremlin for graph traversal);
5) enable populations, groups, and cohorts to be identified and managed (enhance/extend Yggdrasil's tables);
6) improve searching and selection (considering using Apache Solr and a custom Data Import Request Handler);
7) generate animated visualisations of population events in space and time (using Matt Coller's Temporal Earth);
8) generate large-scale genealogy diagrams using either Pajek or Neo4j (James Rose, who has done similar work, to advise);
9) generate narrative prosopographies (adapt and extend the XML/XSLT 1.0 Founders and Survivors public search interface, possibly adding SIMILE timelines and more interactivity through client-side XSLT 2.0 with Saxon-CE);
10) generate documented export (flat) files to support researchers' own analysis in their tools of choice, e.g. SPSS, R, Excel etc.

The AP20 Software Capabilities Diagram shows these 10 core capabilities pictorially. Building out from the database at its core (C1), each capability is relatively self-contained. C1, the database centre, is the extended Yggdrasil.

For further technical details about the project, please read the initial AP20 technical overview proposed to ANDS, or see our short AP20 video presentation given at the ANDS Community day in Sydney in May. You can also view our experimental RIF-CS entries, which document the services and collections involved in the project.

Loading existing data collections into their Yggdrasil instances will require bespoke workflows using bash shell scripts, Perl, and Saxon-EE (a commercial licence is available within the Founders and Survivors project and its servers). These will be adapted from an earlier proof-of-concept load of KHRD data into Yggdrasil.

Given the technical context above, a number of questions require clarification:

a) What is the most important part of technology being used?

Yggdrasil, i.e. PostgreSQL, PHP, Apache and LDAP (for user authentication and authorisation, leveraging an existing Founders and Survivors directory server). Standard web technologies such as CSS, plus some XQuery and XSLT for interaction with the Founders and Survivors BaseX database, are also used.

b) What will require the most development effort and why?

Capability 3, the ability to associate Yggdrasil events and people with extended, flexible "factoids", will require the most development effort. The complexity lies in enabling a data-driven (i.e. user-configurable) data model which is scalable and which can provide easy-to-use web interfaces for data capture and validation without assistance from programmers. To mitigate this risk we intend, initially, to provide a basic capability which leverages existing programmed interfaces to Google Spreadsheets. This will enable a researcher to use an external Google spreadsheet for data capture via a Google Form of their own design. Our academic research champions have proven in other Founders and Survivors data capture exercises that they are perfectly capable of designing and building their own Google Forms and spreadsheets.

Building a simple workflow manager into Yggdrasil is readily doable. In it, a population can be defined and a data capture exercise for that population can be associated with a particular Google Form. A hidden "key" field provides the join between the Yggdrasil database person and the Google spreadsheet row containing the researcher-designed form data, while the manager serves up the required HTML and keeps track of to whom each entry has been assigned. This is likely to be more acceptable to users than entering valid TEI XML markup or using unsatisfactory web-based XML or JSON editors. This approach is of course somewhat less "integrated". However, we believe it will be a useful first step and will enable us to return to the more thorny problem when solid progress has been made across the other 9 capabilities.
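
The join on the hidden key can be sketched in a few lines of Python. This is an illustration only, under stated assumptions: the field names `person_id` and `ygg_key`, and the in-memory lists standing in for the Yggdrasil persons table and the Google spreadsheet rows, are hypothetical and are not taken from the Yggdrasil schema or from any Google API.

```python
# Sketch only: field names and sample values are hypothetical.
def join_form_rows(persons, form_rows, key_field="ygg_key"):
    """Attach each spreadsheet row to the person whose id matches its hidden key.

    Returns (matched, orphans): matched maps person_id -> list of rows,
    orphans lists the rows whose key matches no known person.
    """
    known_ids = {p["person_id"] for p in persons}
    matched = {pid: [] for pid in known_ids}
    orphans = []
    for row in form_rows:
        key = row.get(key_field)
        if key in known_ids:
            matched[key].append(row)
        else:
            orphans.append(row)  # keep for manual review rather than dropping
    return matched, orphans


persons = [
    {"person_id": 101, "name": "Person A"},
    {"person_id": 102, "name": "Person B"},
]
form_rows = [
    {"ygg_key": 101, "occupation": "labourer"},
    {"ygg_key": 999, "occupation": "unknown"},
]

matched, orphans = join_form_rows(persons, form_rows)
```

Rows whose key matches no person are collected separately, so a mistyped or stale key surfaces for review instead of being silently lost.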

c) What features are the most important to gain customer satisfaction and buy-in?

Multiple users, ease of access, ease of sharing data with others, ease of extending data, and ease of keeping data accurate and growing in a single place are vital. In short, the user base will be delighted to use a shared, readily accessible web database. Also vital is one-click access to highly useful and relevant outputs which directly support research (Capabilities 6 to 10). Users are tolerant of less than beautiful interfaces and are more concerned about accurate and usable content.

d) Non-functional requirements.

The following requirements apply to AP20 as a whole system and should be adequate to guide developments in regard to technical requirements across each specific capability.

i) Authentication and authorisation

Access to all data-updating functions is to be authenticated and authorised to a research domain.

The authentication provider (HTTP basic) is to be configurable as:

.htaccess and local http password files
LDAP e.g. the Founders and Survivors LDAP service at http://founders-and-survivors.org.
other methods as deemed useful and feasible within time/budget constraints (e.g. MAMS enabling the application is highly desirable but may not be achievable).

Authenticated access levels/roles are "administrator" (data custodian and application configuration), "staff" (data maintenance), and "visitor" (read only).
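
As a rough sketch of how these three roles might map onto permitted actions, the following Python fragment may help. The action names (`read`, `update`, `configure`) are hypothetical, and the assumption that an administrator can also do everything staff can do is ours rather than something stated above.

```python
# Sketch only: action names are hypothetical; administrators are assumed
# to inherit the staff and visitor permissions.
ROLE_PERMISSIONS = {
    "administrator": {"read", "update", "configure"},
    "staff": {"read", "update"},
    "visitor": {"read"},
}


def is_allowed(role, action):
    """Return True if the role may perform the action; unknown roles get nothing."""
    return action in ROLE_PERMISSIONS.get(role, set())
```

For example, `is_allowed("visitor", "update")` is false, while `is_allowed("staff", "update")` is true.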

ii) Data privacy (legal and ethical)

Where authorised researchers choose to make exported data or particular functions available to the public, any identifying data concerning living persons is to be presented anonymously.
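
One possible shape for this rule is sketched below in Python. The requirement does not define how a living person is recognised, so the test used here (no recorded death and born within the last 100 years, or birth unknown) is purely an assumption for illustration, as are the field names.

```python
# Sketch only: the "possibly living" rule and all field names are assumptions.
IDENTIFYING_FIELDS = ("name", "birth_year", "birth_place")


def is_possibly_living(person, current_year, max_age=100):
    """Treat a person as possibly living unless a death is recorded
    or the birth year is more than max_age years ago."""
    if person.get("death_year") is not None:
        return False
    birth_year = person.get("birth_year")
    if birth_year is None:
        return True  # nothing known: err on the side of privacy
    return current_year - birth_year <= max_age


def anonymise_for_public(person, current_year):
    """Return a copy of the record that is safe for public export."""
    if not is_possibly_living(person, current_year):
        return dict(person)
    public = dict(person)
    for field in IDENTIFYING_FIELDS:
        if field in public:
            public[field] = None
    public["name"] = "Living person"
    return public
```

Passing `current_year` explicitly keeps the function easy to test and avoids a hidden dependency on the system clock.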

iii) Performance

Interactive web response times should be reasonable for the task, e.g. 2-8 seconds for small interactive tasks. Batch tasks such as data loading or quality assurance work should be backgrounded/queued so that users are not delayed, and users should be informed when output is available.
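
The idea of queueing batch work and notifying the user afterwards can be sketched with the Python standard library. This is a minimal in-process illustration, not a proposal for how Yggdrasil should implement it; the user name, the task label, and the use of a second queue for notices are all hypothetical.

```python
import queue
import threading


def run_batch_worker(tasks, notifications):
    """Run queued (user, label, func) tasks one at a time and post a notice for each."""
    while True:
        item = tasks.get()
        if item is None:  # sentinel: stop the worker
            tasks.task_done()
            break
        user, label, func = item
        try:
            func()
            notifications.put((user, f"{label}: output available"))
        except Exception as exc:
            notifications.put((user, f"{label}: failed ({exc})"))
        finally:
            tasks.task_done()


tasks = queue.Queue()
notifications = queue.Queue()
worker = threading.Thread(
    target=run_batch_worker, args=(tasks, notifications), daemon=True
)
worker.start()

# A web request would only enqueue the job and return immediately.
tasks.put(("staff_user", "data load", lambda: sum(range(1000))))
tasks.put(None)

tasks.join()  # for this demonstration only: wait until the queue is drained
notice = notifications.get_nowait()
```

In a real deployment the queue would need to survive restarts, so a database table or an external job queue would be a more likely choice than an in-memory one.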

iv) Scalability

Each instance (i.e. each research domain) should be capable of supporting up to 10 simultaneous logged-in users. The database structures and processing should be capable of supporting the entire FAS data domain (75,000 convicts) and descendants of the survivors for up to 6 generations, i.e. up to a million persons is theoretically possible. In such cases the amount of memory available to the host needs to be appropriate to the database's requirements.

v) Security

No data encryption is required, but data access is to be restricted to authorised users unless otherwise determined by a research domain's data custodians/administrators.

vi) Maintainability

As a research application (not business transactions), maintainability will be enhanced by using tools and technologies for which skills are relatively accessible in the Research IT space. Yggdrasil consists of "vanilla" web technology (PHP/PostgreSQL). It may be enhanced by use of, or integration with, popular frameworks. All code developed is to be under version control and publicly available on GitHub.

vii) Usability

Users are expected to be competent in their research domain and capable of reading reasonable instructions as to how to use the application. Usability requirements are thus moderate.

viii) Multi-lingual support

Yggdrasil supports Norwegian and English.

ix) Auditing and Logging

A simple change log showing who did what and when is required. Changes which add, merge, delete or link persons need to be explicitly logged. A simple text and/or XML file will suffice for this purpose. The logs will be retained and rolled over weekly or after each 1000 transactions, whichever is sooner.
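
The rollover rule can be expressed as a small check, sketched here in Python. The tab-separated entry layout and the function names are hypothetical; only the weekly and 1000-transaction limits come from the requirement itself.

```python
from datetime import datetime, timedelta

MAX_TRANSACTIONS = 1000          # from the requirement above
MAX_AGE = timedelta(weeks=1)     # from the requirement above


def should_roll_over(log_started, entries_written, now):
    """True once the log is a week old or holds 1000 entries, whichever is sooner."""
    return entries_written >= MAX_TRANSACTIONS or now - log_started >= MAX_AGE


def format_entry(when, user, action, detail):
    """One tab-separated line recording who did what, and when (hypothetical layout)."""
    return "\t".join([when.isoformat(timespec="seconds"), user, action, detail])
```

Calling `should_roll_over` before each write means either limit triggers the rollover without a separate scheduled job.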

x) Availability

24/7 or "four nines" availability is not required.

Implementation and operations will be simplified if a weekly maintenance window of one hour (for small domains e.g. KHRD) and up to three hours (for large domains e.g. Convicts and Diggers) is routinely scheduled, so that domain data quality assurance and database backups can be conducted without user updates.

A deployment model where research cloud instances may be invoked, configured, loaded, used, and saved/snapshotted may be a perfectly acceptable mode of operation for both the KHRD and TasDig domains, and would greatly mitigate the internet security risks posed by 24/7 web operations. It is also important, in that case, that the procedure for researchers to invoke their instance is not onerous. If the database and associated files exceed 8GB then research cloud snapshots will need to be supplemented with scripted save/restore to S3 containers and/or other storage. These issues should be clearer after the tech lead attends an OpenStack workshop in September 2012.
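
A scripted save step could begin by checking whether an instance's data fits under the snapshot limit. The Python sketch below assumes that "8GB" means 8 GiB and that the database and associated files live under a single directory; both are assumptions for illustration.

```python
import os

SNAPSHOT_LIMIT_BYTES = 8 * 1024 ** 3  # assumes "8GB" means 8 GiB


def directory_size(path):
    """Total size in bytes of all regular files under path."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            if os.path.isfile(full):
                total += os.path.getsize(full)
    return total


def needs_external_storage(total_bytes, limit=SNAPSHOT_LIMIT_BYTES):
    """True when the data is too large to rely on a snapshot alone."""
    return total_bytes > limit
```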

e) A high-level architectural diagram.

Data load workflows:

diagram of the workflow used to load the KHRD database as an Yggdrasil database instance
diagram for Convicts and Diggers Yggdrasil load (to be completed following KHRD load)

f) Source code repository

Available at https://github.com/foundersandsurvivors/ap20-ands.

We may add separate repositories for code associated with loading the KHRD and Convicts and Diggers instances.