Tuesday, 28 August 2012

(6) Project Outputs & Our Primary Product


This project will produce two operational genealogy and social history database websites, one for each of the research domains used to extend Yggdrasil:

  1. the KHRD (Koori Health Research Database) and 
  2. the Convicts and Diggers database.
These sites will allow authorised users to search for individuals, view their genealogies and life course data in a variety of ways, manage population sub-groups and cohorts, access routine reports, export data, and grow the database's scope and accuracy as further sources come to light, are entered, and interpretations improve.

The code used to create these websites will be available for re-use under a BSD licence. All components created by the project team will be available from a GitHub site (see the "Sourcecode" link on this blog), along with documentation for users and developers. The user reference manual will cover day-to-day operational procedures and will explain how to work with a new database instance and provide access to authorised users; where possible, documentation will also be built into online contextual help. The developer manual will explain how to install the software and set up your own instance of the site, and will document all pre-requisites and dependencies together with the configuration instructions each requires. Detailed information will be given along with the code used to initially load both the KHRD and the Convicts and Diggers databases. The intention is that other parties who are not starting a new database from scratch should be able to adapt the load procedures used for these website instances to their own requirements.

(5) Features and Technology


Named after the Norse "tree of life", Yggdrasil is an open-source genealogy software program developed by Leif Biberg Kristensen. A PostgreSQL database making extensive use of stored procedures, with a very thin layer of PHP as its web interface, it is well suited to recording basic genealogies of whole populations. Because it is an open-source, web-based database application whose design is not tied to the ageing GEDCOM "standard", it will help solve problems arising in large academic human population study projects requiring collaboration and multiple users.

Yggdrasil (a "vanilla" LAMP application) is by far the most important technology being used in this project, and we are extremely grateful to Leif for generously allowing us to take over the ongoing development of this open-source application. With one exception (Temporal Earth), all related technologies are web-based, offer RESTful interfaces, and can exchange data using XML and/or JSON, and are therefore relatively straightforward to integrate into a web application. Building on experience with Founders and Survivors, we will use both VMware vSphere and OpenStack (Research Cloud) virtualisation to deploy database instances. Ubuntu 12.04 LTS server edition is our OS of choice.

The essence of the AP20 "FASCONN" project is to enhance and extend the existing Yggdrasil software application with 10 new or improved capabilities:
  1. extend the database for new features, improve web usability and provide adequate user and developer documentation;
  2. allow geocoding of place names (interacting with Geosciences Australia new WFS-G service as a web client), which will in turn support geographical and map based visualisations of all events where the place is known (considering PostGIS and GeoDjango);
  3. extend sources and facts captured with prosopography XML markup inspired by TEIp5 Names, Dates, People and Places and integrating data already available in the Founders and Survivors BaseX XML database and XRX techniques;
  4. improve person matching and record linkage (integrating concepts from John Bass's LKT C independent link management program, reimplemented using the Neo4j graph database and possibly Gremlin for graph traversal);
  5. enable populations, groups, and cohorts to be identified and managed (enhance/extend Yggdrasil's tables);
  6. improve searching and selection (considering using Apache Solr and a custom Data Import Request Handler);
  7. generate animated visualisations of population events in space and time (using Matt Coller's Temporal Earth);
  8. generate large-scale genealogy diagrams using either Pajek or Neo4J (James Rose who has done similar work to advise);
  9. generate narrative prosopographies (adapt and extend the XML/XSLT 1.0 Founders and Survivors public search interface; possibly adding SIMILE timelines and more interactivity via client-side XSLT 2.0 using Saxon-CE);
  10. generate documented export (flat) files to support researchers' own analysis in their tools of choice e.g. SPSS, R, Excel etc.
The AP20 Software Capabilities Diagram shows these 10 core capabilities pictorially. Building out from the database at its core (C1), each capability is relatively self-contained. C1, the database centre, is the extended Yggdrasil. 
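Capability 10 (documented flat-file exports) is simple enough to sketch. The following Python fragment is illustrative only: the field names and the provenance-comment convention are assumptions, not Yggdrasil's actual schema or the project's export format.

```python
import csv
import io

# Hypothetical flattened person records; field names are illustrative,
# not Yggdrasil's real schema.
PERSONS = [
    {"person_id": 1, "name": "Jane Doe", "birth_year": 1843, "death_year": 1911},
    {"person_id": 2, "name": "John Roe", "birth_year": 1850, "death_year": 1923},
]

def export_flat_file(records, source_note):
    """Write records as CSV, preceded by '#'-prefixed provenance metadata."""
    buf = io.StringIO()
    buf.write(f"# source: {source_note}\n")
    buf.write("# licence: see project documentation for citation requirements\n")
    writer = csv.DictWriter(buf, fieldnames=list(records[0].keys()))
    writer.writeheader()
    writer.writerows(records)
    return buf.getvalue()

output = export_flat_file(PERSONS, "AP20 demo instance, exported 2012-08-28")
```

A file produced this way can be opened directly in SPSS, R, or Excel, with the comment lines carrying the provenance and citation information researchers need.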

For further technical details about the project, please read the initial AP20 technical overview proposed to ANDS, or see our short AP20 video presentation from the ANDS Community Day in Sydney in May. You can also view our experimental RIF-CS entries which document the services and collections involved in the project.

Loading existing data collections into their Yggdrasil instances will require bespoke workflows using bash shell scripts, Perl, and Saxon-EE (a commercial licence is available within the Founders and Survivors project and servers). These will be adapted from an earlier proof-of-concept load of KHRD data into Yggdrasil.
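The core of such a load workflow is transforming source extracts into database inserts. This is a minimal sketch in Python (the real workflows use bash/Perl/Saxon); the table and column names are invented for illustration.

```python
import csv
import io

# A toy source extract; real loads are driven by bash/Perl/Saxon workflows,
# and these column names are illustrative only.
SOURCE_CSV = """given,surname,birth_year
Mary,Smith,1861
Thomas,Jones,1858
"""

def rows_to_inserts(csv_text, table="persons"):
    """Turn each source row into a parameterised INSERT plus its values tuple."""
    reader = csv.DictReader(io.StringIO(csv_text))
    statements = []
    for row in reader:
        cols = ", ".join(row.keys())
        placeholders = ", ".join(["%s"] * len(row))
        statements.append(
            (f"INSERT INTO {table} ({cols}) VALUES ({placeholders})",
             tuple(row.values())))
    return statements

stmts = rows_to_inserts(SOURCE_CSV)
```

Generating parameterised statements (rather than interpolating values into SQL text) keeps the load safe against quoting errors in messy historical source data.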

Given the technical context above, a number of questions require clarification:

a) What is the most important part of technology being used?


Yggdrasil itself, i.e. PostgreSQL, PHP, Apache, and LDAP (for user authentication and authorisation, leveraging an existing Founders and Survivors directory server), plus standard web technologies such as CSS, and some XQuery and XSLT for interaction with the Founders and Survivors BaseX database.

b) What will require the most development effort and why? 


Capability 3 - the ability to associate Yggdrasil events and people with extended, flexible "factoids" - will require the most development effort because of the complexity of enabling a data-driven (i.e. user-configurable) data model which is scalable and which can provide easy-to-use web interfaces for data capture and validation, without assistance from programmers. To mitigate this risk we intend, initially, to provide a basic capability which leverages existing programmed interfaces to Google Spreadsheets. This will enable a researcher to use an external Google spreadsheet for data capture using a Google Form of their own design. Our academic research champions have proven on other Founders and Survivors data capture exercises that they are perfectly capable of designing and building their own Google Forms and spreadsheets.

Building a simple workflow manager into Yggdrasil is readily achievable: a population can be defined, a data capture exercise for that population can be associated with a particular Google Form, and a hidden "key" field provides the join between the Yggdrasil database person and the Google spreadsheet row containing the researcher-designed form data, while Yggdrasil serves up the required HTML and keeps track of whom each entry has been assigned to. This is likely to be more acceptable to users than entering valid TEI XML markup or using unsatisfactory web-based XML or JSON editors. This approach is of course somewhat less "integrated". However, we believe it will be a useful first step and will enable us to return to the more thorny problem once solid progress has been made across the other 9 capabilities.
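The hidden-key join described above can be sketched in a few lines. Everything here is hypothetical stand-in data: the `yggdrasil_key` field name and the in-memory structures are not the project's real schema, only an illustration of the mechanism.

```python
# Stand-ins for the Yggdrasil person table and a downloaded Google
# spreadsheet; field names are invented for illustration.
persons = {
    "P001": {"name": "Mary Smith"},
    "P002": {"name": "Thomas Jones"},
}

# Rows as exported from a researcher-designed Google Form; 'yggdrasil_key'
# is the hidden field carried through the form.
form_rows = [
    {"yggdrasil_key": "P001", "cause_of_death": "influenza"},
    {"yggdrasil_key": "P002", "cause_of_death": "unknown"},
    {"yggdrasil_key": "P999", "cause_of_death": "typo"},  # no matching person
]

def join_form_data(persons, rows, key_field="yggdrasil_key"):
    """Attach each form row to its person; collect rows with no match."""
    unmatched = []
    for row in rows:
        key = row.get(key_field)
        if key in persons:
            persons[key].setdefault("factoids", []).append(
                {k: v for k, v in row.items() if k != key_field})
        else:
            unmatched.append(row)
    return unmatched

orphans = join_form_data(persons, form_rows)
```

Rows whose key does not resolve to a person are reported back rather than silently dropped, which matters when volunteers are doing the data entry.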

c) What features are the most important to gain customer satisfaction and buy-in?

Multiple users, ease of access, ease of sharing data with others, ease of extending data, and ease of keeping data accurate and growing, in a single place, are vital. In short, the user base will be delighted to use a shared readily accessible web database. Also vital are one click access to highly useful and relevant outputs which directly support research (Capabilities 6 to 10). Users are tolerant of less than glorious and beautiful interfaces and are more concerned about accurate and usable content.

d) Non-functional requirements.


The following requirements apply to AP20 as a whole system and should be adequate to guide developments in regard to technical requirements across each specific capability.


i) Authentication and authorisation


Access to all data-updating functions is to be authenticated and authorised to a research domain.

The authentication (HTTP Basic) provider is to be configurable as:


  • .htaccess and local http password files
  • LDAP e.g. the Founders and Survivors LDAP service at http://founders-and-survivors.org.
  • other methods as deemed useful and feasible within time/budget constraints (e.g. MAMS enabling the application is highly desirable but may not be achievable).
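For concreteness, the first two options might be combined in Apache roughly as follows. This is a hypothetical sketch only: the paths, the LDAP URL, and the location are placeholders, not the project's real settings, and the exact directives depend on the Apache version and modules deployed.

```apache
# Hypothetical sketch: HTTP Basic auth trying LDAP first, then falling
# back to a local password file. All values are placeholders.
<Location /app>
    AuthType Basic
    AuthName "AP20 research domain"
    AuthBasicProvider ldap file
    AuthLDAPURL "ldap://founders-and-survivors.org/ou=people,dc=example?uid"
    AuthUserFile /etc/apache2/ap20.htpasswd
    Require valid-user
</Location>
```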

Authenticated access levels/roles are "administrator" (data custodian and application configuration), "staff" (data maintenance), and "visitor" (read only).

ii) Data privacy (legal and ethical)


Where authorised researchers choose to make exported data or particular functions available to the public, any identifying data concerning living persons is to be presented anonymously.
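One simple way to enforce this rule at export time is to suppress identifying fields for anyone not known to be deceased. The sketch below is illustrative; the field names and the masking convention are assumptions, not the project's actual rules.

```python
# Illustrative record masking: a person without a death year is treated as
# potentially living and has identifying fields suppressed before any
# public presentation. Field names are hypothetical.
def anonymise(record):
    """Return a copy of the record safe for public presentation."""
    out = dict(record)
    if out.get("death_year") is None:
        out["name"] = "[living person]"
        out.pop("birth_date", None)
    return out

public = anonymise({"name": "Jane Citizen", "birth_date": "1970-01-01",
                    "death_year": None})
```

Deceased persons' records pass through unchanged, so historical analysis is unaffected.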

iii) Performance


Interactive web response times should be reasonable for the task, e.g. 2-8 seconds for small interactive tasks. Batch tasks such as data loading or quality assurance work should be backgrounded/queued so that users are not delayed, and are informed when output is available.

iv) Scalability


Each instance (i.e. each research domain) should be capable of supporting up to 10 simultaneously logged-in users. The database structures and processing should be capable of supporting the entire FAS data domain (75,000 convicts) and the descendants of the survivors for up to 6 generations, i.e. up to a million persons is theoretically possible. In such cases the memory available to the host needs to be appropriate to the database's requirements.



v) Security


No data encryption is required, but data access is to be restricted to authorised users unless otherwise determined by a research domain's data custodians/administrators.

vi) Maintainability


As a research application (not a business transaction system), maintainability will be enhanced by using tools and technologies for which skills are relatively accessible in the Research IT space. Yggdrasil consists of "vanilla" web technology: PHP/PostgreSQL. It may be enhanced by use of, or integration with, popular frameworks. All code developed is to be under version control and publicly available at GitHub.

vii) Usability


Users are expected to be competent in their research domain and capable of reading reasonable instructions as to how to use the application. Usability requirements are thus moderate.

viii) Multi-lingual support


Yggdrasil supports Norwegian and English.

ix) Auditing and Logging


A simple change log showing who did what and when is required. Changes which add, merge, delete, or link persons need to be explicitly logged. A simple text and/or XML file will suffice for this purpose. The logs will be retained and rolled over weekly or after each 1,000 transactions, whichever is sooner.
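The change log above could be as small as the following sketch. The file name, line format, and class are all hypothetical (the weekly rollover would be handled separately, e.g. by cron); only the "roll over after N entries" behaviour is from the requirement.

```python
import os

# Minimal sketch of the change log: tab-separated "when / who / what" lines,
# rolled over after a configurable number of entries. File name and format
# are illustrative, not the project's actual conventions.
class ChangeLog:
    def __init__(self, path, max_entries=1000):
        self.path = path
        self.max_entries = max_entries
        self.count = 0

    def record(self, user, action, timestamp):
        with open(self.path, "a") as f:
            f.write(f"{timestamp}\t{user}\t{action}\n")
        self.count += 1
        if self.count >= self.max_entries:
            self.rollover()

    def rollover(self):
        """Rename the current log aside and start a fresh one."""
        os.replace(self.path, self.path + ".1")
        self.count = 0

log = ChangeLog("/tmp/ap20-changes.log", max_entries=2)
log.record("sandra", "merge persons P1,P2", "2012-08-28T10:00:00")
log.record("nick", "add person P3", "2012-08-28T10:05:00")  # triggers rollover
```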


x) Availability


24/7 or "four nines" availability is not required.

Implementation and operations will be simplified if weekly maintenance windows of one hour (for small domains, e.g. KHRD) and up to three hours (for large domains, e.g. Convicts and Diggers) are routinely scheduled to enable domain data quality assurance and database backups to be conducted without user updates.

A deployment model where research cloud instances may be invoked, configured, loaded, used, and saved/snapshotted may be a perfectly acceptable mode of operations for both the KHRD and TasDig domains, and would greatly mitigate the internet security risks posed by 24/7 web operations. It is also important, in that case, that the procedure for researchers to invoke their instance is not onerous. If the database and associated files exceed 8GB then research cloud snapshots will need to be supplemented with scripted save/restore to S3 containers and/or other storage. These issues should be clearer after the tech lead attends an OpenStack workshop in September 2012.



e) A high-level architectural diagram.


Data load workflows:



f) Source code repository


Available at https://github.com/foundersandsurvivors/ap20-ands.

We may add separate repositories for code associated with loading the KHRD and Convicts and Diggers instances.



(4) How will we know if it works?

It is important to note that each group will have their own distinct instance of an AP20 genealogy web application database to gain access to the genealogies of their target population. Both groups of customers will want to:
  • locate individuals and/or groups of individuals and explore all known data about them in a variety of ways:
    • narrative prosopography (life course events sequenced in time), 
    • inter-generational relationships e.g. spouses, ancestors and descendants (listed and/or visualised as family trees/network charts),
    • visualise events placed geographically on maps and animated across time,
    • reports such as survival analysis and infant mortality,
    • export flattened versions of data and reports for further ad hoc analysis in statistical packages, with adequate metadata describing the provenance of the data and relevant citation information.
  • add new sources of evidence such as birth, death, and marriage certificates, and other user-specified sources of interest;
  • locate or create individuals whose existence is evidenced by or inferred from the sources;
  • assert from sources a variety of user specified events indicating when, where, who, and user defined "factoids" of interest;
  • be assisted in accessing related web-based information in external databases e.g. search in Trove;
  • be assisted in matching, merging, and linking internal and external data records to individuals in the population;
  • be assisted in organising data capture and targeted collecting of research data using controlled "crowd sourcing" methods;
  • execute the above easily, efficiently, flexibly, and transparently (changes need to be logged);  
  • ensure data is secured, private and accessible to authorised users as determined by the customers themselves in line with relevant legislative, privacy and ethical requirements.
Initially the KHRD database will be loaded and our user interface analyst, Nick Knight, will test the system, document system usage conventions, coordinate and assist other authorised KHRD users, assure adequate data quality, and liaise with the research champion and the technical lead as to software enhancements required to meet the KHRD user group's needs. 

Once this initial version of the KHRD is operational, attention will turn to the load of the Diggers and Convicts population in a second instance of the application.

(3) Who is the project serving?

One of our key customer groups is academic historians, demographers, members of the public who are genealogists, and those interested in coordinating and enabling access to Koori health data, such as the Onemda VicHealth Koori Health Unit. This group of users will use the Koori Health Research Database (KHRD). The "research champion" for this group of users is historian Professor Janet McCalman from the University of Melbourne's School of Population Health. The existing dataset which is the core of this research domain has a particularly difficult and fractured provenance. This group will benefit greatly from easy access (via the world wide web) to a single, accurate, shared, online, non-proprietary database recording the lives and relationships of this population. Where determined by the long-term data custodians (Onemda), members of the Koori community will be given extracts which provide private access to their descendants' records from the database, thus supporting their sense of community and identity.

Our second key customer group is academic historians, demographers, and members of the public who are genealogists, involved in Convicts and Diggers: a demography of life courses, families and generations (ARC Discovery 2011-2013). Based on convict records from the Founders and Survivors project, birth, death and marriage registrations, World War One service records, and other historical data, this project explores long-term demographic outcomes of individuals, families and lineages. The project draws on the expertise of family historians to trace individuals and their descendants for 'Australia's biggest family history'. Some volunteers (members of the public interested in family history or their convict ancestors) may be involved in some data gathering and research tasks. The "research champion" for this group of users is Dr. Rebecca Kippen from the University of Melbourne's School of Population Health.

(2) Why do this project?

The goals of the FASCONN project are to provide a sound basis for various collaborative academic research projects concerning the genealogy and family history of large groups of individuals with a particular focus on human health and survival. Two distinct populations will form the research domains for this software development:

  1. An existing Koori Health Research Database covering over 10,000 Indigenous people and their descendants who lived in Victoria and NSW in the 19th and 20th centuries, with a particular focus on recording and understanding health outcomes, e.g. causes of death.
  2. Building on the Founders and Survivors project, a database of over 15,000 Tasmanian-born WWI AIF enlistees who are descended from convicts transported to Tasmania in the 19th century, with a particular focus on understanding the historical determinants of responses to stress.
A well-documented, open-source, web-based, multi-user database application designed to address the general area of large-scale historical demography has many benefits. It will enable many collaborators, starting from scratch, to formally record and transcribe their evidence and sources, and to assert events, multiple persons, and human relationships with formal citations from those sources. This will enable family and social historians to focus on collecting or linking to, and interpreting, the relevant historical sources - the core of their research - efficiently and effectively. They will no longer need to risk being locked into single-user, desktop-bound, proprietary software oriented to a single individual's family tree, nor be distracted by having to build their own technological solutions to what is a common problem: analysing the experiences of groups of people, the events in place and time which form their life histories, and the inter-generational relationships between them. They will be in control of their own research data and will be able to extend their databases and share data with other researchers over a long period of time.

Monday, 27 August 2012

(1) FASCONN: The Team

Hello and welcome to the development blog of the Australian National Data Service (ANDS) Application Partner Project AP20: Founders and Survivors Genealogical Connections project, known as FASCONN. This first post introduces our technical team members.

Len Smith (Project Manager)

TO BE ADDED (bio collection with Nick Knight)

Sandra Silcot (Analyst/Programmer and Tech Lead)

Sandra has a Bachelor of Applied Science (EDP) and a Bachelor of Arts (History & Social Theory) and has worked mainly in applications software development since the late 1970s. Her career started at the State Electricity Commission of Victoria as a trainee analyst/programmer and later as a database analyst using IBM's IMS. After a four-year stint as a trade union official working mainly on technological change and women's issues for the Municipal Officers Association, she completed her Arts Honours degree at the University of Melbourne in the early 1990s. Since that time she has worked in a variety of IT roles in the higher education sector, all of which have been related to world wide web technologies and SGML/XML encoding systems. After a ten-year stint developing and supporting the University of Melbourne's web-based teaching and learning system, Sandra has been involved with research work in the digital humanities field, in particular as the lead developer in the Founders and Survivors project. Sandra is comfortable in a wide range of activities from feasibility assessment through to deployment and support. She uses a variety of tools and technologies such as web servers (Apache), relational databases, XML/SGML, XML databases, XQuery, XSLT, Perl, Javascript, directory services (LDAP), general Linux systems administration and virtualisation (vSphere and OpenStack), and is a skilled systems integrator. She most enjoys working directly with academic end users to apply IT in solving problems which are important to them.

Matt Coller (Spatial Visualisation consultant and developer)


Matthew Coller is a specialist in multimedia authoring and dynamic visualisation. Working at the University of Melbourne from 1998 to 2001 he co-authored the interactive ChemCAL modules that still accompany chemistry teaching at a number of universities, and which have recently been adapted for inclusion in the VCE textbooks Heinemann Chemistry 1 & 2. After completing a Master of Multimedia degree at Monash University in 2004, Matthew embarked on a PhD investigating historical visualisation. His initial prototype, SahulTime, demonstrates ways of visualising Australia's past across a variety of timescales (historical, archaeological and geological) and has won awards at two major archaeological conferences. Matthew is now expanding this vision to build Temporal Earth, an online system that draws on crowd-sourced data to present an interactive visualisation of world history across all timescales. Within this ANDS project, Matthew will develop the functionality to present Yggdrasil's search results as maps, timelines, or spatio-temporal representations of processes over time, which can then be combined with other content visualisations from the Temporal Earth project.

John Bass (Data linkage consultant and developer)

TO BE ADDED (bio collection with Nick Knight)

James Rose (Genealogy database consultant)

TO BE ADDED (bio collection with Nick Knight)

Nick Knight (User interface and domain expert)

TO BE ADDED (bio collection with Nick Knight)

Others (to be determined)

We will engage other programmers to do short self-contained units of work to assist the core team focus on the more difficult high-risk elements of the project.
