Back in September last year I wrote a post about the National Library of Australia's (NLA) newspaper digitisation project. At the time I thought this a very worthy and exciting initiative, which aimed to make available to the general public scanned and OCRed [Optical Character Recognition] versions of all Australian out-of-copyright newspaper editions. I've certainly made good use of it by re-printing a number of articles and poems here. Given that it's now been over six months since that initial description, I thought it about time to follow up and see how the project is progressing.
In mid- to late October last year the project stopped adding new pages to the publicly available repository. I was a bit surprised when this happened, as I hadn't actually read enough of the supplementary pages on the project's website to get a firm understanding of the project's timeline. Basically, the initial website was only a pilot. Its aim was to set up the digitisation mechanism and repository, and to engage the public to help fix the scanned pages.
You may recall that I wrote last year about the OCR process, which converts the scanned pages into editable text. This is a bit dodgy with old newspapers and you get a lot of misread characters that need to be proof-read and edited. The NLA decided to invite members of the public to correct the pages they read, with the aim of gradually improving the final product. I'm not sure how many editors they expected to pick up, but I suspect they were quite happy to end up with about 1300 registered users who have corrected 2 million lines of text in 100,000 newspaper articles. The final number of articles made available under the pilot was 367,651 (on approximately 300,000 pages), so something over a quarter have been edited in some way or other. There is no way of telling if all the 100,000 articles have been completely fixed, but at least that number has been looked at and amended, however slightly.
So what of the future? The pilot project has been a success, volunteers have been engaged and are working away, so where to from here? Well, it appears that it was never the intention to tie up NLA personnel on this project indefinitely, so a Request For Tender process has been instituted which will result in the appointment of a panel of contractors to undertake the actual work of scanning and digitising. I suspect the editing will be left with the volunteers. That panel should be announced sometime soon, and hopefully the project will "re-boot" and the number of pages available will rapidly increase. The current plan is for the project to have 1.5 million newspaper pages available by the end of 2009, and 4 million pages by the end of 2010. I'll be very pleased if they can reach anything like those numbers.