Science and Industry

Wired Magazine: Petabyte Age Issue

The latest issue of Wired is devoted entirely to massive data and data mining applications: everything from astronomy, environmental and medical applications, through to legal discovery, tracking airfare prices, and pollsters identifying voter intentions.

It's a fascinating range of 13 articles that should have something to interest most readers of this blog - all available from the online issue linked above.

Just Published: Research Data Preservation Costs Report

I posted two previous entries to the blog, in March and January, detailing progress with the JISC-funded research data preservation costs study. I am pleased to report that the online executive summary and full report (PDF file), titled “Keeping Research Data Safe: a cost model and guidance for UK Universities”, are now published and can be downloaded from the JISC website.

It has been a very intensive piece of work over four months and I am extremely grateful to the many colleagues who contributed and made this possible. We have uncovered a lot of valuable data and approaches and hope these can be built on by future studies, implementation and testing. We have attempted to “show our workings” as far as possible to facilitate this, so the text of the report is accompanied by extensive appendices.

We have made 10 recommendations on future work and implementation. For further information see the Executive Summary online.

The report itself has chapters covering the Introduction, Methodology, Benefits of Research Data Preservation, Describing the Cost Framework and its Use, Key Cost Variables and Units, the Activity Model and Resources Template, Overviews of the Case Studies, Issues Universities Need to Consider, Different Service Models and Structures, Conclusions and Recommendations. There are also four detailed case studies covering the University of Cambridge, King’s College London, the University of Southampton, and the Archaeology Data Service (University of York).

Although focused on the UK and UK universities in particular, it should be of interest to anyone involved with research data or interested generally in the costs of digital preservation.

 

Comments and feedback welcome!

Research Data and the Computing Cloud: NSF/Google and IBM

Research in the Cloud: Providing Cutting Edge Computational Resources to Scientists is an interesting recent post to the Google Research Blog. It provides Google’s take on its participation in the National Science Foundation/Google/IBM collaboration within The Cluster Exploratory Program (CluE).

The NSF solicitation for proposals was released last week. To quote from the call:
“In addition to the widespread societal impact of data-intensive computing, this computational paradigm also promises significant opportunities to stimulate advances in science and engineering research, where large digital data collections are increasingly prevalent. Well-known examples include the Sloan Digital Sky Survey, the Visible Human, the IRIS Seismology Data Base, the Protein Data Bank and the Linguistic Data Consortium, however other valuable data collections or federations of data collections are being assembled on an ongoing basis. In many fields, it is now possible to pose hypotheses and test them by looking in databases of already collected information.   Further, the possibility of significant discovery by interconnecting different data sources is extraordinarily appealing. In data-intensive computing, the sheer volume of data is the dominant performance parameter.  Storage and computation are co-located, enabling large-scale parallelism over terabytes of data. This scale of computing supports applications specified in high-level programming primitives, where the run-time system manages parallelism and data access. Supporting architectures must be extremely fault-tolerant and exhibit high degrees of reliability and availability.
The Cluster Exploratory (CluE) program has been designed to provide academic researchers with access to massively-scaled, highly-distributed computing resources supported by Google and IBM.  While the main focus of the program is the stimulation of research advances in computing, the potential to stimulate simultaneous advances in other fields of science and engineering is also recognized and encouraged.”

It should be interesting to see how this collaboration evolves and which datasets it includes. For more information see the Cluster Exploratory (CluE) program call text.

OR2008 - Presentations available

 

The Open Repositories conference (OR2008) repository is available at http://pubs.or08.ecs.soton.ac.uk/ as a permanent record of the conference activities.

The repository contains papers, presentations and poster artwork for 144 different conference contributions from the main conference sessions (Interoperability, Legal, Models, Architectures & Frameworks, National Perspectives, Scientific Repositories, Social Networking, Sustainability, Usage, Web 2.0), the Poster session, User Group sessions (DSpace, EPrints, Fedora), Birds of a Feather sessions, the Repository Managers session and the ORE Information day.

My PowerPoint presentation from the Plenary keynote for the Fedora International Users’ Meeting is also available there. Titled “Keeping alert: issues to know today for long-term digital preservation with repositories”, it focussed on research data and sustainability. It drew heavily on the forthcoming JISC Research Data Preservation Costs study and the draft final report titled “Keeping Research Data Safe: A Cost Model and Guidance for UK Universities”, and concluded by outlining tentative findings and implications for repositories from that report.

Digital Preservation Cost Models

I blogged back in January on the JISC Research Data Preservation Costs study and promised an update at the end of March. Well, the draft final report titled “Keeping Research Data Safe: A Cost Model and Guidance for UK Universities” is now with JISC and being peer-reviewed.

It’s been a significant effort and I think it should be a major contribution to thinking on digital preservation cost models and costs in general – hopefully the final report will be out later this Spring. In short, we have produced:

• A cost framework consisting of:

o A list of key cost variables divided into economic adjustments (inflation/deflation, depreciation, and costs of capital), and service adjustments (volume and number of deposits, user services, etc);

o An activity model divided into pre-archive, archive, and support services;

o A resources template including major cost categories in TRAC (a methodology for Full Economic Costing used by UK universities), divided into the major phases from our activity model and by duration of activity.

Typically the activity model will help identify resources required or expended, the economic adjustments help spread and maintain these over time, and the service adjustments help identify and adjust resources to specific requirements. The resources template provides a framework to draw these elements together so that they can be implemented in a TRAC-based cost model. Normally the cost model will implement these as a spreadsheet, populated with data and adjustments agreed by the institution.
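To make the relationship between these parts a little more concrete, here is a minimal sketch in Python. It is purely illustrative and is not the spreadsheet or the model from the study: the class and function names, the single inflation rate, and the single volume multiplier standing in for the service adjustments are all assumptions made for the example. Only the three phase names come from the activity model described above.

```python
from dataclasses import dataclass

# Phase names taken from the activity model; everything else is hypothetical.
PHASES = ("pre-archive", "archive", "support services")


@dataclass
class Activity:
    phase: str          # one of PHASES
    annual_cost: float  # cost per year, expressed in year-0 prices
    years: int          # duration of the activity in years


def inflated_cost(activity: Activity, inflation_rate: float) -> float:
    """Economic adjustment (assumed form): compound inflation per year."""
    return sum(
        activity.annual_cost * (1 + inflation_rate) ** year
        for year in range(activity.years)
    )


def cost_by_phase(activities, inflation_rate, volume_factor=1.0):
    """Total cost per phase, scaled by an assumed service (volume) multiplier."""
    totals = {phase: 0.0 for phase in PHASES}
    for activity in activities:
        if activity.phase not in totals:
            raise ValueError(f"unknown phase: {activity.phase}")
        totals[activity.phase] += inflated_cost(activity, inflation_rate) * volume_factor
    return totals


if __name__ == "__main__":
    # Made-up figures, for illustration only.
    plan = [
        Activity("pre-archive", 5000.0, 1),
        Activity("archive", 2000.0, 5),
        Activity("support services", 1000.0, 5),
    ]
    for phase, total in cost_by_phase(plan, inflation_rate=0.03).items():
        print(f"{phase}: {total:.2f}")
```

In a real TRAC-based model the adjustments would be far richer (depreciation, cost of capital, number of deposits, user services) and agreed by the institution; the sketch only shows the general shape of activities grouped by phase with adjustments applied over time.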

The three parts of the cost framework can be used in this way to develop and apply local cost models. The exact application may depend on the purpose of the costing which might include: identifying current costs; identifying former or future costs; or comparing costs across different collections and institutions which have used different variables. These are progressively more difficult. The model may also be used to develop a charging policy or appropriate archiving costs to be charged to projects.

In addition to the cost framework there are:

• A series of case studies from Cambridge University, King’s College London, Southampton University, and the Archaeology Data Service at York University, illustrating different aspects of costs for research data within HEIs;

• A cost spreadsheet based on the study, developed by the Centre for e-Research, King’s College London, for its own forward planning and provided as a confidential supplement to its case study in the report;

• Recommendations for future work and use/adaptation of software costing tools to assist implementation.

Watch this space (well, blog) for a future announcement of the final report and the URL for the download.

UK Budget: Pricing Public Sector Information report

Buried deep in the small print of the UK Government budget statement today was the following interesting item:

“The Office of Fair Trading’s (OFT) market study into the commercial use of public information highlighted important issues around access to public sector information for commercial or other re-use. The Government commissioned Cambridge University to analyse the pricing of this information. This analysis is published alongside Budget 2008. The Government will look closely at public sector information held by trading funds to distinguish more clearly what is required by Government for public tasks and ensure that this information [...] next Spending Review the Government will ensure that information collected for public purposes is priced so that the need for access is balanced with ensuring that customers pay a fair contribution to the cost of collecting this information in the long term. These issues will be considered in conjunction with the assessment of trading funds.”

This report, with the rather catchy title “Models of Public Sector Information Provision via Trading Funds”, by Prof David Newbery, Prof Lionel Bently, and Rufus Pollock from Cambridge University, was published today and can be downloaded at http://www.berr.gov.uk/files/file45136.pdf.

For those interested in the context, there has been a long-running debate over the pricing of data from some government agencies. The Guardian hosts a “Free Our Data” campaign blog which has a commentary on the Cambridge report and associated issues.

New UK National Nuclear Archive to be established

Colleagues may have missed the announcement that the UK Nuclear Decommissioning Authority (NDA) will invest £8 million in plans to create the UK’s National Nuclear Archive (NNA) in Caithness, Scotland. The money will be invested over three years and will help get the £20 million project off the ground.

For those interested in the digital preservation issues involved in the NNA, I would refer you to an informative presentation by Simon Tucker, Information Manager at the NDA, given to the “Nuclear Information over the Millennia Workshop” held in November 2006.

The NNA will potentially hold between 20 and 30 million digital, paper and photographic records primarily concerning the history, development and decommissioning of the UK’s civil nuclear industry since the 1940s. Around 20 specialist jobs will be created by the project. The archive will take about four years to build and many more to establish as an exemplar in its field. Land near the airport, currently owned by the local authority, has been earmarked as a potential site.

The development will undoubtedly be an important one, and it is a good reminder both of the long-term value, over centuries, of some electronic records and of the digital preservation issues facing key industries.