David, at the "Sarsaparilla" weblog, has alerted me to the Australian National Library's newspaper digitisation project. According to the website:
The National Library of Australia, in collaboration with the Australian State and Territory libraries, has commenced a program to digitise out of copyright newspapers. We are creating a free online service that will enable full-text searching of newspaper articles. This will include newspapers published in each state and territory from the 1800s to the mid-1950s, when copyright applies. The first Australian newspaper, published in Sydney in 1803, is included in the Program.
As someone who looks at a lot of very old newspapers, I can only applaud this initiative, as it will certainly make my personal projects a lot easier in the future. At present, only a very small number of newspaper issues have been digitised but that number will continue to grow. So far it looks like the project has picked one or two newspapers in each Australian state, and chosen only a few contiguous years.
Part of the difficulty here concerns the availability of the material and whether or not it is out of copyright. The paper chosen from Victoria is "The Argus" and thus far the project has digitised each edition from 1915-1925 and from 1945. Given that this paper published 6 editions a week - with the possible exceptions being Good Friday, ANZAC Day and Christmas Day - there are approximately 310 editions a year. The early years of the paper, which was printed in broadsheet format, contained 8 pages per weekday and 16 pages on a Saturday. Not a lot by today's standards, but you have to see the material to understand how much text they were able to squeeze into those pages: advertisements were nearly all of the "classified/textual" variety and pictures were almost non-existent.
Most of the originals of these old papers in libraries are bound into large ledger-style volumes, so scanning the central gutter - the part of the paper that is closest to the spine of the books - is fairly difficult without breaking open the books and laying the pages flat. Some modern photocopiers scan an opened volume by tilting the book during the process to get full access to the pages. But this presupposes that the central gutter is wide enough to allow for this. Modern books are formatted with the foreknowledge that the pages will be bound between covers; newspapers had no such knowledge and the gutter margin, in many cases, is very narrow. By the look of the "Argus" pages here I suspect they have utilised microfilm copies of the papers rather than the original sheets.
There is both good and bad in that approach. Good because you can actually get the scanner to see all of the page, and bad if the only film you have available is one that has seen a lot of use. Microfilm readers are notorious for scratching the film, which, when copied using any form of photograph or photocopying process, leaves long black streak marks across the final image. This is merely a nuisance when it comes to reading that image, but a hindrance when the digitised image is optically scanned and run through a character recognition process as it is here. For that is the final aim of this whole project: not only to make photo images of the newspaper pages available to the world at large, but also to convert the embedded text into editable files. This is a wonderful idea of course, because it makes available the full text of this material, not just a graphic image.
I've transcribed a number of pieces from old newspapers and magazines over the years. Most of it poetry but, more recently, a number of prose pieces that I've posted here. This has involved a complete re-typing of the material because I found out, fairly early, that basic Optical Character Recognition (OCR) run through a basic scanner was - well - pretty crap. I seemed to spend longer fixing the material than I would have if I had typed it out straight from the start. The NLA's OCR results tend to be of a different breed altogether. And of interest is the fact that you can register as an editor on the NLA website and actually correct the scanned result of the text yourself. For example, on Tuesday 21st November 1916, "The Argus" printed the following:
Mr. C..J. Dennis, author of "Tile Senti- mental Bloke" and "The Moods of Ginger Mick," has resigned from the position of secretary to Senator Russell, .Assistant Minister in the Federal Cabinet. ' He in- tends to retire to the country for 1 time to give his undivided attention to the produc- tion of another book.
Or so the scanned version showed. This was pretty easy to convert to a corrected version:
Mr. C.J. Dennis, author of "The Sentimental Bloke" and "The Moods of Ginger Mick," has resigned from the position of secretary to Senator Russell, Assistant Minister in the Federal Cabinet. He intends to retire to the country for a time to give his undivided attention to the production of another book.
You could almost do that without the original text being handy.
And then sometimes you get something like this:
Sir Herbert Wallen, professor of poetry at Oxfoul, read an inlet c-ting papct on "Oversea« Poetrv" nt the Jtovnl Colonial Institute on Wcdnesdiy He said that Aus- tralian poetr} was still I irgely "open jur," anti consisted of poems ot men, action, nnd movement Sir Hcibcit Warien pud n tribute to the woik ot ]'--cx Hvuns, Arthur Adams, anil John Sandes amongst Hie younger generation Ile jaitl that Ml jCvuns's "Commonwcilth Ode" was a "laureate piece" worthv to live Although he lind been startled bv the slang used by U J. Dennis, his woik was leal pootiy.
Which has a sort of poetry all of its own.
A friend told me recently that New Zealand is way ahead of Australia in terms of digitising its newspapers, so it's good to see us starting to catch up. I'll be using this site as a major resource over the coming years, and you'll start to see some of the results of that here quite soon.