robinturner: Giving a tutorial, c. 2000 (tutorial)
[personal profile] robinturner
Some years ago, when we were the rector's pet project, the provost (who was standing in for him while he was in hospital) asked us if there was anything we needed. Somewhat to his surprise, we came out with a list of office improvements (new carpets, paint, pictures on the walls, no rats) rather than the grad assistants he was thinking of giving us. We declined the offer of slaves assistants because we couldn't think of anything for them to do (other than the traditional roles of washing coffee cups and supplying sexual favours, neither of which one is supposed to mention at such meetings).

Now, after a day of scanning, I really wish we'd accepted the offer. In my previous courses, I'd managed to compile my course book largely from online sources, thus minimising the amount of scanning and typing I had to do. My Matrix course book was, not surprisingly, compiled entirely from online sources (with the exception of a page from Baudrillard I typed out). My warriors course only required one chapter of a book to be scanned, and since it was an expensive book with clean white pages, it wasn't too hard. But here I am preparing the book for "Monsters Among Us" with a load of good material that is unfortunately trapped in paper. It is very frustrating to scan a chapter of a book, getting cramp from holding the thing flat on the screen, then find it turns out like this:
as the real seducer/predator lurks behind the scenes. Dracula emerges to
away his wives and'insist that the victim belongs to him.
.\ second key scene of transgressive eroticism is Dracula)s rape of Lucy on a
by the sea in Whitby; this occurs rougWy one-third of the way through ilie
1, in chapter.B. This scene) told from Mina)s point of view) first registers her
eness of Dracula)s presence.l3 Awakened in the night) Mina.has the strong
that some~ing is wrong. Suspense builds and time is almost halted as Mina
tempts to rush t4rough the dark landscape) until finally she glimpses an ap+

Date: 2005-08-31 01:11 am (UTC)
From: [identity profile] trochee.livejournal.com
grr...

seems like there should be an automatic way to clean this up.

for example, a mixed word-backoff and spelling-correction system. Surely the correction from t4rough to through should be automatable...

Date: 2005-08-31 01:23 am (UTC)
From: [identity profile] trochee.livejournal.com
seems like this paper would have a lot to offer to your system.

I'm tempted to build something for you while I'm bored at my internship.

Date: 2005-08-31 02:33 pm (UTC)
From: [identity profile] cassielsander.livejournal.com
The problem with automatic correction systems is that they can conceal mistakes without correcting them; a quick scan through the above makes it relatively clear to a human where the problems are and how to fix them, whereas a computer working automatically might not make the thing make any more sense while simultaneously making the original meaning less clear.

For instance take this OCR'd article about the Marvel Comics History of Atlantis, which looks fine at first glance but includes correctly spelled but incorrect information such as that Atlantis was damaged by "explosive debt charges". Obviously the Atlanteans need to transfer their balances.

Also I must say I found the moment when Mina glimpsed her first ap+ to be extremely transgressive.

Date: 2005-08-31 03:22 pm (UTC)
From: [identity profile] trochee.livejournal.com
I dunno. From reading it over it looks like this stuff was hand-corrected by a human with a spell-checker and no sense. This page was written by the owner but has similar sorts of mistakes.

But even with using a word-processor spellchecker (and I wouldn't recommend that) the text would be cleaner than what solri posted above.

I was suggesting something more sophisticated: using the context in a statistical chooser environment. Train a backed-off n-gram window over a megaword or so and then I'm sure that "roughly" will get *much* better probability measures than "rougWy". One might set up a blended model that combines (weighted?) edit distance with word probability and see how small you could set the threshold for improving the overall model probability. If the threshold was high enough, you'd prevent any "false corrections" and still get a much better transcript in the end.
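A minimal sketch of that blended model, under loud assumptions: the training text is a toy stand-in for the "megaword or so", the backoff is a crude fixed penalty rather than a properly normalised scheme, and `BackoffBigram`, `correct`, `dist_weight` and `threshold` are names made up for illustration. A token is replaced only when the gain in log-probability, minus a penalty per edit, clears the threshold.

```python
import math
from collections import Counter

def edit_distance(a, b):
    """Plain Levenshtein distance (unit cost per insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute or match
        prev = cur
    return prev[-1]

class BackoffBigram:
    """Bigram scores with a fixed-penalty backoff to add-one unigrams.

    The scores are not normalised probabilities; this is an assumption made
    to keep the sketch short.
    """
    def __init__(self, tokens, backoff=0.4):
        self.uni = Counter(tokens)
        self.bi = Counter(zip(tokens, tokens[1:]))
        self.total = len(tokens)
        self.backoff = backoff

    def logprob(self, prev, word):
        if self.bi[(prev, word)] > 0:
            return math.log(self.bi[(prev, word)] / self.uni[prev])
        # Unseen bigram: back off to a smoothed unigram estimate.
        p = (self.uni[word] + 1) / (self.total + len(self.uni) + 1)
        return math.log(self.backoff * p)

def correct(tokens, model, max_dist=2, dist_weight=1.0, threshold=1.0):
    """Replace a token only if the blended gain exceeds the threshold."""
    out = []
    for tok in tokens:
        prev = out[-1] if out else "<s>"
        base = model.logprob(prev, tok)
        best, best_gain = tok, 0.0
        for cand in model.uni:
            d = edit_distance(tok, cand)
            if 0 < d <= max_dist:
                gain = model.logprob(prev, cand) - base - dist_weight * d
                if gain > best_gain:
                    best, best_gain = cand, gain
        out.append(best if best_gain > threshold else tok)
    return out

# Toy training text, standing in for a much larger corpus.
train = ("mina attempts to rush through the dark landscape until she "
         "glimpses the figure roughly one third of the way through "
         "the novel").split()
model = BackoffBigram(train)
print(correct("rush t4rough the dark".split(), model))
# ['rush', 'through', 'the', 'dark']
```

With so little training text only the obvious case clears the bar, which is the conservative behaviour a high threshold is meant to buy: a word the model has never seen, such as "whitby", is left alone rather than forced onto its nearest neighbour.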

Date: 2005-08-31 10:34 pm (UTC)
From: [identity profile] solri.livejournal.com
Sounds good. If you get two dictionary words matching a given mis-scan, you can always ask the user, just like any good spellchecker would do. It would also be worth getting it to cross-check within the document and weight accordingly, to avoid classing unusual words as errors.
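Those two ideas might look something like the sketch below. Everything here is an assumption for illustration: the function names, the tiny dictionary, the one-edit candidate rule and the `min_doc_count` cut-off. A token that recurs within the document is treated as intentional, and anything else that sits one edit away from dictionary words is returned with all its candidates so the user can choose.

```python
from collections import Counter

def within_one_edit(a, b):
    """True if a and b differ by exactly one insert, delete, or substitution."""
    if a == b or abs(len(a) - len(b)) > 1:
        return False
    if len(a) > len(b):
        a, b = b, a                      # make a the shorter (or equal) string
    i = 0
    while i < len(a) and a[i] == b[i]:
        i += 1                           # skip the common prefix
    if len(a) == len(b):
        return a[i + 1:] == b[i + 1:]    # one substitution
    return a[i:] == b[i + 1:]            # one extra character in b

def review(tokens, dictionary, min_doc_count=2):
    """Return (token, candidates) pairs that the user should be asked about."""
    doc_counts = Counter(tokens)
    flagged = []
    for tok in dict.fromkeys(tokens):    # unique tokens, in document order
        if tok in dictionary or doc_counts[tok] >= min_doc_count:
            continue   # known word, or recurs often enough to look intentional
        cands = sorted(w for w in dictionary if within_one_edit(tok, w))
        if cands:
            flagged.append((tok, cands))
    return flagged

# Hypothetical dictionary and text, for illustration only.
dictionary = {"the", "myth", "says", "had", "head", "spread"}
text = "the lycanthropy myth says lycanthropy hcad spread".split()
print(review(text, dictionary))
# [('hcad', ['had', 'head'])]
```

The obvious weakness is that a mis-scan repeated consistently through the document would also pass the within-document check, so the cut-off needs choosing with care.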

Date: 2005-09-01 10:40 pm (UTC)
From: [identity profile] solri.livejournal.com
Here's a nice one I've just acquired (from Adam Douglas' The Beast Within: a history of the werewolf):

"modem man had been hunting on his own for Some 40,000 years or so"

Yeah - if he'd had ADSL, it wouldn't have taken half as long.

Date: 2005-08-31 05:48 pm (UTC)
From: [identity profile] vret.livejournal.com
That's what happens when you try to scan a vampire.

We stayed in Whitby last week. Didn't see a mirror the whole time we were there.

Date: 2005-09-01 04:43 am (UTC)
From: [identity profile] redngold.livejournal.com
It's the Solri Bus Co.! Nothing for weeks, then 4 come along within 5 minutes. Funny what the new academic year can do to a lad...

Date: 2005-09-01 10:54 pm (UTC)
From: [identity profile] solri.livejournal.com
That is because I indulge in LJ largely as a method of procrastination, so I need something to procrastinate from.

Date: 2005-09-01 04:45 am (UTC)
From: [identity profile] redngold.livejournal.com
Oh - and to return to the topic, this looks like a lousy OCR app. What are you using?

Date: 2005-09-01 10:37 pm (UTC)
From: [identity profile] solri.livejournal.com
I'm using the stuff that came with the HP printer, as there is still no really good OCR software for Linux. However, I think the problem is with the scanning, not the character recognition. Books with nice bleached-white environmentally unfriendly paper scan OK.

Date: 2005-09-01 05:11 am (UTC)
From: [identity profile] redngold.livejournal.com
ps - you could borrow my RA, but then you'd have to feed him.

Date: 2005-09-02 03:43 pm (UTC)
From: [identity profile] cf.livejournal.com
why don't you just buy one? sure, you're at a richy rich school, but maybe you could just go out and find that one scholarship student who needs to make some money. Give herim efes money, and see what happens... (grad school recommendation letters?)

Date: 2005-09-02 05:02 pm (UTC)
From: [identity profile] solri.livejournal.com
Buy? You think they pay me that much?
