October 11 2006

Video searching by sight and sound. Researchers have developed a system that uses a combination of face recognition, closed-captioning information, and original television scripts to automatically name the faces that appear on screen, making episodes of the TV show Buffy the Vampire Slayer searchable.

I wish I could recognise people 80% of the time. The folks to cash in on this technology first will be the purveyors of porn, mark my words.
I think I'm psychic, I see a "Cease and Desist" letter coming in that YouTube vidder's future. :(

*does search for Spike and Angel kiss, Spike and Angel hold hands, Spike and Angel push each other up against a wall*

What?! That all happened.
The Buffy project. I just love the sound of that.
This is amazing technology.

Interesting post--thanks Spikeylover.

And I agree with Simon, The Buffy Project sounds great. Although it could also be the name of a garage band. Or maybe a secret government mind-control lab.
heee... I can brag, or really bore people to death (probably more the latter), but as a graduate student (master's... taking too damn long) working in Computer Vision, I can certainly weigh in here somewhat. Face recognition is not my subdivision, but I have done some assignments on it and learned about it in my courses.

Face recognition (and I'm not talking about the kind in action/sci-fi movies) has been a research field for quite a while (I think one of the big papers is the Eigenfaces one by Turk and Pentland in '91, but there is research dating back to the '60s). It is a data-intensive task, and by itself it does not yield optimal results. The basic method in that paper is to set up a database of all known faces and put it into a space where the amount of data representing each face is low, then take a new face, bring it into the same space, and search the database for the closest match using the low-dimensional data. That is data intensive and can be quite buggy. Plus, this type of thing requires the perfect pose, the face has to be a certain size in the picture (not too big and not too small), and the lighting on the face has to be close to optimal. This would probably be good for passport photos but not much beyond that. So using landmarks like the nose, eyes and mouth is quite good, but still problematic.
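
To make the "low-dimensional space plus closest match" idea concrete, here is a minimal pure-Python sketch. It is not the Eigenfaces paper's implementation and not what the Buffy researchers did: the six-pixel "faces", the names, and the helper functions `top_components`, `fit` and `recognize` are all made up for illustration, and the principal directions are approximated with power iteration instead of a proper eigen-solver.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def top_components(centered, k, iters=100, seed=0):
    """Approximate the k strongest directions of variation (power iteration)."""
    rng = random.Random(seed)
    dim = len(centered[0])
    components = []
    for _ in range(k):
        v = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
        for _ in range(iters):
            # Remove the directions that were already found (deflation).
            for c in components:
                overlap = dot(v, c)
                v = [x - overlap * y for x, y in zip(v, c)]
            # Multiply by the covariance (X^T X) without building the matrix.
            scores = [dot(row, v) for row in centered]
            w = [sum(s * row[j] for s, row in zip(scores, centered))
                 for j in range(dim)]
            norm = math.sqrt(dot(w, w))
            if norm < 1e-12:
                raise ValueError("fewer than k independent directions in the data")
            v = [x / norm for x in w]
        components.append(v)
    return components

def fit(gallery, k):
    """Build the face database: mean face, components, and a short code per face."""
    names = list(gallery)
    faces = [gallery[n] for n in names]
    mean = [sum(col) / len(faces) for col in zip(*faces)]
    centered = [[p - m for p, m in zip(face, mean)] for face in faces]
    components = top_components(centered, k)
    codes = {n: [dot(row, c) for c in components]
             for n, row in zip(names, centered)}
    return mean, components, codes

def recognize(face, mean, components, codes):
    """Bring a new face into the same space and return the closest known name."""
    centered = [p - m for p, m in zip(face, mean)]
    code = [dot(centered, c) for c in components]
    return min(codes, key=lambda n: math.dist(code, codes[n]))

# Toy "images": six made-up grayscale pixel values per face (assumption).
gallery = {
    "buffy": [200, 180, 90, 60, 200, 180],
    "spike": [60, 70, 220, 210, 50, 80],
    "angel": [120, 130, 120, 110, 30, 240],
}
mean, components, codes = fit(gallery, k=2)
probe = [195, 175, 100, 65, 190, 185]  # a slightly noisy copy of the first face
match = recognize(probe, mean, components, codes)
print(match)
```

Each face is stored as only two numbers instead of six, which is the whole point of the method. With real images, pose and lighting changes move a face around in that small space, which is why the approach is so fragile outside passport-style photos.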

In terms of a television show (or movie), besides the problem of pose (which direction the face is turned) and the problem of occlusion (whether the face is totally hidden or blocked by some other object), there is the problem of lighting. A big problem. Numerous light sources, whether it's daytime or nighttime, and the shadows cast on the face all cause a lot of problems for the computer in recognizing that the nose of the same person is the same under two different lighting conditions. One can try to take into account the geometry of the face, but again that is difficult. Then there is the problem of finding the exact face (or the landmarks of the face) in the static picture, and tracking it through the frames of the video. And on top of all of those things, there is the inherent noise in the visual data, so the algorithms used have to be robust against noise etc.

The fact that they use other data besides the video to help narrow down the recognition of faces is quite cool, but obviously it might not work for videos in general because there aren't any scripts or subtitles (like a lot of the videos on YouTube).

There is also a movement in computer vision to do video-based indexing, search and retrieval, which I am sure Google would probably be interested in. It uses shot boundaries (you know when you see one shot, then there is a pan but it's still the same shot, then all of a sudden it's a different shot? That change is a shot boundary) to mark off segments of video data. So just as there are paragraphs and chapters in a paper or a book, one can automatically create an index of "chapters and paragraphs" for visual media. Again there are various problems as stated above, but it's quite interesting and challenging.
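
One common textbook way to find shot boundaries is to compare brightness histograms of consecutive frames and flag a cut when they differ sharply. The sketch below assumes that approach; the frames are tiny made-up lists of grayscale pixels, and the function names, bin count and threshold are illustrative choices, not anything from the article.

```python
def histogram(frame, bins=8, max_value=256):
    """Normalized brightness histogram of one frame (a flat list of 0-255 pixels)."""
    counts = [0] * bins
    for pixel in frame:
        counts[pixel * bins // max_value] += 1
    return [c / len(frame) for c in counts]

def histogram_distance(h1, h2):
    """0.0 for identical histograms, 1.0 for completely disjoint ones."""
    return 0.5 * sum(abs(a - b) for a, b in zip(h1, h2))

def shot_boundaries(frames, threshold=0.5):
    """Indices i where frame i starts a new shot (large jump from frame i-1)."""
    hists = [histogram(f) for f in frames]
    return [i for i in range(1, len(hists))
            if histogram_distance(hists[i - 1], hists[i]) > threshold]

# Made-up frames: a dark shot, the same shot after a pan, then a bright shot.
dark = [10, 20, 30, 25, 15, 12, 22, 28]
dark_pan = [12, 22, 28, 10, 20, 30, 25, 15]
bright = [200, 210, 220, 230, 240, 250, 205, 215]
bright_pan = [215, 205, 250, 240, 230, 220, 210, 200]

boundaries = shot_boundaries([dark, dark_pan, bright, bright_pan])
print(boundaries)  # → [2]
```

The pan shuffles pixels around but barely changes the histogram, so it is not flagged; the cut to the bright shot is. Fades, flashes and fast motion would fool a threshold this simple.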

Ok... enough boring you guys, back to your regularly scheduled program.

kurya, we love it when you talk like that. I find the whole notion fascinating.
Yep, great post kurya. I've read a few articles about face recognition from the perspective of biometric passports and even then (with standard lighting, proportions etc.) it seems quite hit and miss so when I read this it sounded like quite a breakthrough (i.e. I'll believe it when I see it in action ;).

As you say, most YouTube videos are extracts which have been remixed so scripts would be very hard to come by (you'd have to know which episode and which segment of which episode the part belonged to) and storing information about every frame sounds very data intensive.

Anyone who knows a bit about MPEG care to comment on whether the heavy compression would affect the frame by frame analysis as well ?

And though accuracy usually improves given time, 80% is not great (imagine Google returning pages totally unrelated to your search 2 out of every 10 times). Also, does that mean 20% false positives, 20% false negatives, 20% that don't match at all or a mixture of all three ?
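
To see why that question matters, here is a small sketch with invented counts (the article does not break its 80% figure down this way, so all numbers below are assumptions). Two hypothetical systems both get 80 out of 100 faces right, but one errs only with false positives and the other only with false negatives.

```python
def rates(tp, fp, fn, tn):
    """Accuracy, precision and recall from a count of each kind of outcome."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,   # share of all decisions that were right
        "precision": tp / (tp + fp),     # of the faces labeled X, how many were X
        "recall": tp / (tp + fn),        # of the real X faces, how many were found
    }

# Hypothetical system A: every mistake is a false alarm (wrong name attached).
mostly_false_alarms = rates(tp=40, fp=20, fn=0, tn=40)

# Hypothetical system B: every mistake is a miss (a face left unnamed).
mostly_misses = rates(tp=40, fp=0, fn=20, tn=40)

for label, result in [("A", mostly_false_alarms), ("B", mostly_misses)]:
    print(label, {name: round(value, 3) for name, value in result.items()})
```

Both come out at 0.8 accuracy, yet a search on system A returns wrong clips a third of the time while system B never returns a wrong clip and simply misses some.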

(as an aside, I'd love to see it deal with Stargate: SG-1 from season 9 onwards, Michael Shanks and Ben Browder are dead ringers for each other ;)

Very interesting technology though (the MPAA must be salivating at the litigation possibilities).
Hmmm, in terms of MPEG, it does introduce some artifacts I think (I don't know it well at all, actually), BUT a student who did a project-style master's for my prof used MPEG compression in his algorithm. Basically it takes an MPEG stream from cameras (traffic cameras for his application) and would use the coefficients and values of the MPEG as inputs to his movement detection algorithm. The purpose of his project was to determine the average velocity of the traffic flow on traffic cameras. This would allow automatic detection of traffic congestion and could be used for various purposes.
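
As a rough idea of how compressed-stream values could feed a speed estimate: MPEG stores a motion vector for many blocks of each predicted frame, so one plausible input is the average length of those vectors. This is only a guess at the flavor of such a project, not the student's actual algorithm; the vectors are hard-coded here instead of parsed from a stream, and `pixels_per_meter`, `fps` and the congestion threshold are made-up calibration values.

```python
import math

def mean_speed(motion_vectors, pixels_per_meter, fps):
    """Average speed in m/s of the blocks that moved between two frames.

    motion_vectors: list of (dx, dy) block displacements in pixels per frame.
    """
    lengths = [math.hypot(dx, dy) for dx, dy in motion_vectors if (dx, dy) != (0, 0)]
    if not lengths:
        return 0.0
    pixels_per_frame = sum(lengths) / len(lengths)
    return pixels_per_frame * fps / pixels_per_meter

def is_congested(speed_m_per_s, threshold_m_per_s=5.0):
    """Hypothetical rule: slow average flow counts as congestion."""
    return speed_m_per_s < threshold_m_per_s

# Made-up vectors for one frame: two moving blocks and one static background block.
vectors = [(3, 4), (0, 0), (6, 8)]
speed = mean_speed(vectors, pixels_per_meter=10.0, fps=25.0)
print(speed, is_congested(speed))  # → 18.75 False
```

The appeal is that the encoder has already done the expensive motion search, so nothing has to be decompressed to full frames first.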

Although I am guessing that in this case they just use the results and do not use the MPEG encoding to their benefit (one would have to study their algorithm to see how they do it). I can imagine it being problematic, but it would be nice to know if MPEG compression makes a big difference in this algorithm. Some algorithms that use video data need it in totally uncompressed form, so it takes up quite a lot of resources.
Neat. Confusing, but neat.
Ohhh and I just had a look at the article and saw the researchers' site linked. I'm reading the paper on these researchers' work and it is so cool to see Buffy being used as examples of image data. For anyone who is curious and has the stomach for the mathematics, the original paper is available in PDF format HERE. Man, the things I do for Joss.

ETA: OH and it's sooo cool, one of the authors of that paper, Zisserman, is the author of a very useful Computer Vision book. It is sooo weird to see the Verse and my grad studies colliding.

Whoa! You *rock*, kurya! :-)
thanx billz hehe, and jaynelovesvera and Saje. I thought I would be the "thread killa" in this case with my uber long post. It's rare that I can get excited about grad studies work. (I am sooooo NOT doing a PhD, don't want to get permanent head damage)
I have it. It's not so bad.
"thread killa" status or phd or permanent head damage?
Two out of three, no PhD. Check some old threads for the last man standing, and you'll likely see my Joanie loves Chachi derivative. And the doc told me I likely lost 30 IQ points in my possible attempted murder, but clearly successful head-cracking experience. I still have a ravine down my forehead that rivals any so-called canal on Mars. You write very cleanly, BTW, kurya.
Thanx jaynelovesvera for the compliment. I am sure I have somewhat of a thread killa status... and some head damage from studies... cells committing seppuku from the overload in undergrad. If only I could put those writing skills toward my thesis, which has been delayed too much because of a procrastination problem.

