Capturing, Structuring, and Representing Ubiquitous Audio

DEBBY HINDUS, CHRIS SCHMANDT, and CHRIS HORNER
MIT Media Lab
Although talking is an integral part of collaboration, there has been little computer support for capturing and accessing the contents of conversations. Our approach is based on ubiquitous audio, or the unobtrusive capture of speech interactions in everyday work environments. Speech recognition technology cannot yet transcribe fluent conversational speech, so the words themselves are not available for structuring the stored audio. Instead, structure must be derived from acoustical information inherent in the captured interactions and from user interaction during or after capture. An important aspect of this work is choosing appropriate visual representations and mechanisms for later retrieval of stored speech. This article describes the evolution of a family of applications for capturing and structuring office discussions and telephone calls, and of the representations used to retrieve these stored interactions. Finally, this work is placed within the broader context of desktop audio, mobile audio devices, and the use of speech as a data type across a range of applications, and the social implications of ubiquitous audio capture are considered.
Categories and Subject Descriptors: C.3 [Computer Systems Organization]: Special-Purpose and Application-Based Systems; H.3.1 [Information Storage and Retrieval]: Content Analysis and Indexing; H.4.3 [Information Systems Applications]: Communications Applications; H.5.1 [Information Interfaces and Presentation]: Multimedia Information Systems—audio input/output; H.5.2 [Information Interfaces and Presentation]: User Interfaces—interaction styles; H.5.3 [Information Interfaces and Presentation]: Group and Organization Interfaces—asynchronous interaction; synchronous interaction

General Terms: Design, Human Factors

Additional Key Words and Phrases: Audio data, collaborative work, multimedia workstation software, semi-structured interactions, stored speech, telephony, ubiquitous computing

1. INTRODUCTION
People spend much of their workday talking. In Reder and Schwab's [1990] study of professionals, talking comprised 25-50% of the workday; the majority of this time was spent in face-to-face meetings, and phone calls accounted for about an additional 20% of the workday.

Authors' addresses: D. Hindus, Interval Research Corporation, 1801 Page Mill Road, Building C, Palo Alto, CA 94304; email: hindus@interval.com; C. Schmandt and C. Horner, MIT Media Lab, 20 Ames Street, Cambridge, MA 02139; email: geek@media.mit.edu; horner@media.mit.edu. This work was supported by Sun Microsystems, Inc.
Yet the contents of this talk have remained to a great extent out of reach of computer technology. The loss of speech contents is all the more striking given the dominance of the audio medium in influencing the outcomes of communication [Ochsman and Chapanis 1974]. Furthermore, speech fulfills different communicative purposes than text and other visual media: speech "is especially valuable for the more complex, controversial, and social aspects of a collaborative task" [Chalfonte et al. 1991, p. 21]. Nevertheless, speech is an underutilized resource for CSCW. Recorded audio has been used in a limited way, in applications that require little information about the organization of the audio data. A number of CSCW systems have focused on synchronous video and audio communication for conferencing and informal communication, summarized in Egido [1990], including RAVE from EuroPARC [Gaver et al. 1992], Cruiser from Bellcore [Fish et al. 1993], CAVECAT from the University of Toronto [Mantei et al. 1991], PARC's media spaces [Bly et al. 1993], Mermaid from NEC [Watabe et al. 1991], Team Workstation from NTT [Ishii 1990], and COCO from Sun [Isaacs and Tang 1993]. However, the potential to capture these collaborative interactions as a source of data has rarely been exploited to date. Capturing conversations enables repeated hearing of interesting utterances, sharing of conversations with colleagues not present for the original discussion, and collating of conversations with other kinds of communications, such as electronic mail and shared documents, that are already stored on computers.
The Activity Information Retrieval (AIR) project at Rank Xerox EuroPARC illustrates how capture can provide people with access to information about their own previously inaccessible day-to-day activities. Lamming and Newman [1992] make use of EuroPARC's RAVE system that continually videotapes lab members, and they have made the stored video retrievable by using situational information, such as where the person was at a particular time (obtained from the "active badges" [Want et al. 1992] developed by Olivetti Research, Limited, and worn by lab members), and by using timestamped annotations made on pen-based devices during meetings. AIR is, we believe, an example of the progression from support of synchronous interactions to storage and retrieval of the contents of interactions.

This article describes various means of capturing speech interactions in everyday work environments; we call this ubiquitous audio. Common workday activities other than formal meetings provide a starting point for exploring ubiquitous audio, in terms of both user interfaces and audio processing.
Ubiquitous audio can come from a number of sources and through a variety of physical input devices. Ubiquitous computing refers to the eventual replacement of explicit computer interactions by specialized smart devices that are unobtrusively present in day-to-day pursuits [Weiser 1991]. The ubiquitous computing approach will eventually lead to sizable but not unmanageable quantities of stored information. For example, assuming four hours of conversation per workday (of which 30% is silence) and 10:1 compression of telephone-quality speech, a year of office speech for one person would require approximately 2 gigabytes of storage.
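To make the arithmetic concrete, the estimate can be checked with a few lines of Python. This is a sketch under stated assumptions: the 8 kHz, 8-bit telephone-quality sampling rate and the 250-day work year are ours, while the four hours, 30% silence, and 10:1 compression figures come from the text.

SAMPLE_RATE_HZ = 8000      # telephone-quality sampling (assumed)
BYTES_PER_SAMPLE = 1       # 8-bit samples (assumed)
HOURS_PER_DAY = 4          # conversation per workday (from the text)
SILENCE_FRACTION = 0.30    # portion of that time that is silence
COMPRESSION_RATIO = 10     # 10:1 compression
WORKDAYS_PER_YEAR = 250    # assumed work year

speech_s_per_day = HOURS_PER_DAY * 3600 * (1 - SILENCE_FRACTION)
raw_bytes_per_day = speech_s_per_day * SAMPLE_RATE_HZ * BYTES_PER_SAMPLE
stored_per_year = raw_bytes_per_day / COMPRESSION_RATIO * WORKDAYS_PER_YEAR
print(f"{stored_per_year / 1e9:.1f} GB per person-year")   # prints 2.0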
In many ways, storing and retrieving communication for later review is much more demanding than merely establishing the synchronous channel.
Both audio and video are time-dependent
media with few sharp boundaries
to
exploit
for indexing
or classification.
If it were practicable, speech recognition and natural language processing techniques could convert speech to text. Such processing will not be as reliable, and perhaps not as fast, as human perception of speech for decades [Zue 1991], and in any case it would cause the loss of nuances carried by the audio signal but not in the words themselves. In the meantime, extending audio technology to spontaneous conversation will require automatic derivation of structure without understanding the spoken words.
Malone et al. [1987] introduced the term "semi-structured messages" in their work on electronic mail. Such messages contain a known set of fields, but some of the fields contain unstructured text or other information. Information Lens users can fill in these fields when writing messages and can write rules to route and sort received messages based on these attributes [Mackay et al. 1989]. We use the term semi-structured with respect to audio recordings to indicate that some information about the recordings is known (e.g., date, time, who is involved in the conversation, and when someone was speaking), but the actual spoken words in the recordings are not known. The semi-structured approach defines a framework for making these quantities of audio usable by incorporating acoustical cues, situational data, and user-supplied structure. Acoustical structure includes speech and silence detection and the association of portions of the audio signal with the correct talker. Semi-structure aids in providing flexible access to the data without relying on explicit creation of structure; users can create additional structure as they see fit, but they are not required to do so for the audio to be manageable and accessible.
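The semi-structured representation is easy to picture as a record type whose fields are known even though the speech inside them is not. A minimal Python sketch follows; the field names are illustrative, not the authors' actual data format.

from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class TalkerSegment:
    talker: str        # association of audio with the correct talker
    start_s: float     # offset into the recording, in seconds
    end_s: float

@dataclass
class SemiStructuredAudio:
    # Known, searchable structure: situational data and acoustical cues...
    recorded_at: datetime
    participants: list[str]
    segments: list[TalkerSegment] = field(default_factory=list)
    user_tags: list[str] = field(default_factory=list)   # optional, user-supplied
    # ...while the recording itself stays an opaque, untranscribed reference.
    audio_file: str = ""

rec = SemiStructuredAudio(datetime(1993, 10, 4, 14, 30), ["debby", "bob"],
                          audio_file="chat0042.au")
rec.segments.append(TalkerSegment("debby", 0.0, 2.4))
rec.user_tags.append("ice cream")

Rules and searches can then be written against the known fields, in the spirit of Information Lens, without ever inspecting the audio itself.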
In the following sections, we describe applications for capturing speech in work situations and how these applications derive structure for the audio data during or after capture. We describe user interfaces for later retrieval of these stored speech interactions. Our emphasis is on choosing visual representations and mechanisms for interacting with speech segments that range from seconds-long snippets to hour-long recordings. Finally, we discuss the technological and social contexts of audio capture. These contexts include mobile computing devices and the use of speech as a data type across a variety of applications.

2. RELATED WORK
Audio in most CSCW applications
is only a medium
for synchronous
communication
and not yet a source of data. Short pieces of recorded
speech—speech
snippets—are
the main use of speech as data in current
CSCW applications.
These snippets
are used in message
systems,
such as voice mail,
and in
multiuser
editing
systems. Speech can also be used to annotate
text, a facility
demonstrated
in Quilt
[Fish
et al. 1988] and now common
in commercial
software.
The stored speech is not itself structured
in these applications;
the
recorded
speech is treated
as a single unbroken
entity,
and the application
maintains an external reference to the sound, such as a message or a position within the text. This simple approach is suitable only for snippets and is not very informative for our work.

The Etherphone system at Xerox PARC addressed many aspects of providing a functional interface for stored speech, although the primary application in Etherphone was annotations of documents (see Zellweger et al. [1988] for an overview of the Etherphone system). This sophisticated and innovative work included a moving indicator during playback, a sound-and-silence display, segmentation at phrase boundaries, editing, cut and paste of pieces of audio, markers, text annotations, an elegant storage system, and encryption [Ades and Swinehart 1986]. We have replicated many of these features in our work and extended them to explicitly support dynamic displays of conversation and spontaneous capture. We also support lengthy recordings and speech as a data type that can be cut and pasted across a range of applications.

PhoneSlave and HyperVoice are examples of using a semi-structured approach to enrich telephone interactions with respect to messages. PhoneSlave [Schmandt and Arons 1985] used conversational techniques to take a telephone message, asking callers a series of questions and recording the answers.
These speech segments
could be highly
correlated
with structured
information
about the call. For example,
the response
to, “At what number
can you be reached?”
contained
the phone number.
Structured data capture has been applied by Resnick [1992] in HyperVoice, an application generator for telephone-based bulletin boards. HyperVoice applications provide a speech- and touchtone-driven interface that uses the form-entry metaphor.
While
recording
their
messages,
contributors
to the bulletin
board
are asked to
fill in some specific fields using appropriate
mechanisms.
For instance,
the
headline field is filled in with a brief recording, whereas expiration dates are given by touchtones so that validity checks can be performed. HyperVoice also supports Resnick's Skip and Scan retrieval mechanism for easily navigating among fields and messages [Resnick and Virzi 1992].
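The form-entry metaphor pairs each field with the capture mechanism that suits it; only the touchtone fields can be validated. The following is an illustrative Python sketch of this idea, in which the field set, prompts, and helper functions are hypothetical stand-ins, not HyperVoice's implementation.

from datetime import datetime

def record_audio(prompt: str) -> str:
    print(prompt)                  # stand-in for recording a speech snippet
    return "headline-0001.au"      # reference to the stored audio

def read_touchtones(prompt: str) -> str:
    print(prompt)                  # stand-in for collecting touchtone digits
    return "1231"

def capture_bulletin_entry() -> dict:
    # The headline is a recording: quick to give, but unverifiable.
    entry = {"headline": record_audio("Record a one-sentence headline.")}
    # The expiration date arrives as touchtones, so it can be validity checked.
    while True:
        digits = read_touchtones("Enter the expiration date as MMDD.")
        try:
            entry["expires"] = datetime(1993, int(digits[:2]), int(digits[2:]))
            return entry
        except ValueError:
            print("Invalid date, please try again.")

print(capture_bulletin_entry())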
PhoneSlave and HyperVoice demonstrate the value of even simple structuring, and we have taken a similarly simple approach to structure in our applications that capture conversations. A contrasting approach can be seen in hypermedia documents. These embody considerable structure, which is
explicitly
supplied
during
the authoring
process. Muller
and Daniel
[1990]
implemented
HyperPhone,
a software
environment
for accessing voice documents in a conversational
fashion.
HyperPhone’s
voice documents
are text
items that have been structured
with links to facilitate
access when spoken
by a synthesizer.
One reported
conclusion
is that the items must be short and
very highly connected for the user interactions to be successful. Arons [1991] describes HyperSpeech, a speech-only hypermedia system utilizing speech recognition for navigation. HyperSpeech nodes contain recorded speech from a series of interviews, and
HyperSpeech
users
can move
between
topics
or
between
speakers.
Hundreds
of manually
constructed
links
exist in this
system. These examples
illustrate
the amount
of structure
needed to make
quantities
of audio useful.
Our work on structuring
emphasizes
automatic
derivation rather than explicit authoring.
As we have
extended
our
visual
displays
and interactions
to lengthy
recordings,
we have had to add features
so that applications
can support
multiple
levels of structure.
None of the above applications
addresses
our primary
interest
area, spontaneous
collaboration.
The SoundBrowser project at Apple is the most closely related recent work. A portable prototype for capturing spontaneous user-structured audio has been developed by Degen et al. [1992]. They modified a handheld tape recorder so that users could mark interesting portions of meetings or demarcate items in personal memos. A key point is that these annotations could be made in real time, as the sound was being recorded. The SoundBrowser itself is a Macintosh application for reviewing the stored audio, and it supports innovative visual representations and user
interactions,
including
zooming
and scanning
operations
during
playback
of
the recordings.
Although
our visual
display
of recorded
audio is quite different from the SoundBrowser’s,
we have incorporated
into our display
bookmark-style
annotations,
zooming,
and
scanning.
We, too, recognized the importance of retrospective marking and invented a dynamic interactive display that continually shows the conversation's recent past.

3. THE "HOLY GRAIL": AUTOMATIC TRANSCRIPTION OF FORMAL MEETINGS
Conspicuously
absent from our discussion
so far is the notion of capturing
the
spoken
contents
of formal
group
meetings
without
human
transcription.
Given
the
importance
of meetings,
this
is an obvious
CSCW
application,
as
indicated
by the body of work on electronic
meeting
systems
[Dennis
et al.
1988; Mantei
1988]. Due to technological
issues, however,
it is very difficult
to automatically
structure
recordings
of meetings.
One issue is the association
of each utterance
with
a participant.
The
optimal
solution
is to record
each person’s
speech on a separate
audio
channel,
but it is quite difficult
to get each attendee’s
speech to be transmitted by only one microphone.
In fact, high-quality
recordings
of meetings
are
problematic
in general,
due to background
noise, room acoustics,
and poor
microphone
placement
with respect to some meeting
participants.
Using one
or more
wide-area
microphones
(such
as boundary
zone
microphones
often
used for teleconferences)
allows more flexibility
in seating
but compromises
audio
quality.
Highly
directional
microphones
can eliminate
some background noise and ambient
room noise, but they require
careful placement
and
restrict
the mobility
of meeting
participants.
The recording
may be intelligible. However,
the added noise and variable
speech amplitude
interfere
with
further
digital
signal
processing,
particularly
speech recognition,
which
is
quite sensitive
to microphone
type and placement.
Transcription
of the spoken words is the other issue. Speech recognition
of
fluent,
unconstrained
natural
language
is nowhere
near ready yet, even with
ideal acoustic
conditions.
Keyword
spotting,
a less ambitious
approach
that
could produce
partial
transcripts,
is very difficult
when applied
to spontaneous speech, especially
speech from multiple
talkers
[Soclof and Zue 1990].
However, word-spotting techniques need not be perfect to be useful. One such system incorporated keyword spotting into a graphical audio editing and indexing tool and used color luminance in the display to indicate word recognition confidence levels [Wilcox and Bush 1991]; an earlier application of this idea, the Intelligent Ear [Schmandt 1981], used recognized keywords to label a graphical display of recorded audio.

4. CAPTURING AND RETRIEVING OFFICE DISCUSSIONS

We have explored semi-structured audio through a variety of tools that support informal office discussions and meetings, telephone conversations, and personal audio notes. This section describes the issues inherent in capturing and structuring audio that arrives with no inherent structure, and it describes xcapture, an application that uses a digital tape loop to provide a short-term auditory memory of the ambient audio in an office.
4.1 Capturing Office Discussions

Xcapture, an early project in capturing ubiquitous audio, strives to support a typical office scenario: two authors are working on a collaborative paper, and one suggests a new wording for the next paragraph. A moment later, neither can remember exactly the words that were used. "That was just what we meant to say," one author says, but the flow of conversation has moved on and the words are already forgotten. Xcapture provides a short-term audio memory from which the forgotten words can be replayed.

Xcapture is a background application that records the ambient sound in an office, picked up by the microphone on the user's workstation, into a digital tape loop: a circular buffer that always stores the most recent minutes of audio, continually overwriting the oldest sound. Many workstations now in use are audio-equipped, so no additional recording resources are needed. In practice, buffers of about five minutes are used; lengths of 15 minutes or longer could be recorded but make retrieval impractical, as discussed in the next section.
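Conceptually, the tape loop is a fixed-size circular buffer that overwrites its oldest samples, which is what bounds storage to the most recent few minutes of sound. The following is a minimal sketch in Python with NumPy; it is our illustration of the idea, not the original implementation.

import numpy as np

class TapeLoop:
    """Keep only the most recent `seconds` of audio, as xcapture's buffer does."""

    def __init__(self, seconds: float, rate: int = 8000):
        self.buf = np.zeros(int(seconds * rate), dtype=np.int16)
        self.write_pos = 0
        self.filled = False

    def append(self, samples: np.ndarray) -> None:
        # Write incoming audio, wrapping around and overwriting the oldest data.
        n = len(self.buf)
        if len(samples) >= n:              # only the newest buffer-full survives
            self.buf[:] = samples[-n:]
            self.write_pos, self.filled = 0, True
            return
        end = self.write_pos + len(samples)
        if end <= n:
            self.buf[self.write_pos:end] = samples
        else:
            k = n - self.write_pos
            self.buf[self.write_pos:] = samples[:k]
            self.buf[:end - n] = samples[k:]
        if end >= n:
            self.filled = True
        self.write_pos = end % n

    def snapshot(self) -> np.ndarray:
        # Oldest-to-newest copy of the loop, ready to hand to a SoundViewer.
        if not self.filled:
            return self.buf[:self.write_pos].copy()
        return np.concatenate([self.buf[self.write_pos:], self.buf[:self.write_pos]])

loop = TapeLoop(seconds=300)                        # a five-minute memory
loop.append(np.zeros(8000 * 10, dtype=np.int16))    # ten seconds of audio
recent = loop.snapshot()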
Xcapture runs as a small window containing an animated icon; the animation is a reminder that audio is being recorded. When the user clicks in this window, recording temporarily halts, and the entire contents of the circular buffer are displayed in another window as a SoundViewer, illustrated in Figure 1. The SoundViewer widget, used extensively in our work, provides a direct manipulation interface to stored audio: when the sound plays, a cursor moves horizontally across the display to mark its progress, time is displayed as tick marks, and mouse clicks cause the cursor to jump to a new location in the sound, allowing at least a crude sort of random access. When the SoundViewer window is dismissed, xcapture resumes recording; the buffer fills with fresh audio data, and the previously recorded sound is eventually replaced and discarded unless the user has chosen to save it.

4.2 Retrieving Office Discussions
Fig. 1. An early version of xcapture after an office discussion; a five-minute segment is represented.

During playback, the cursor in the SoundViewer indicates the position reached within the recording, and the user can jump to any point. This sort of random access is not, by itself, adequate support for retrieval: a user who wants to review one point in a five-minute (or longer) recording cannot simply replay the entire sound at normal speed, so some mechanism for moving through the speech faster is required.

The SoundViewer therefore supports time-scaled playback. Recorded speech remains understandable when it is replayed at up to twice normal speed, but simply increasing the playback rate also raises the pitch, so that the speakers sound like cartoon characters' voices, and at even higher rates the speech is no longer understandable. Playback quality is improved by techniques that discard samples of the original sound and smooth the compression so that pitch is not affected. (For a survey of time-scaling techniques, see Arons [1992a].) Discarding periods of background sound, and scanning, in which short chunks of the recording are played in succession, also help a user move through a lengthy recording. Even with these techniques, though, comprehension of time-compressed speech degrades well before three times normal speed, and searching through ten minutes of audio for a forgotten utterance remains tedious.
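The two simplest of these mechanisms can be sketched directly. The Python fragment below is our illustration, not xcapture's code: it drops long pauses by gating on frame energy, then shortens what remains by plain decimation. Decimation is exactly what produces the cartoon-voice pitch shift; the pitch-preserving methods surveyed by Arons [1992a] avoid it by splicing short windows instead.

import numpy as np

def drop_pauses(x: np.ndarray, rate: int = 8000, frame_ms: int = 20,
                threshold: float = 200.0, keep_ms: int = 100) -> np.ndarray:
    """Remove silent stretches, keeping a short pause for naturalness."""
    frame = rate * frame_ms // 1000
    keep = keep_ms // frame_ms           # silent frames kept at each pause
    out, quiet_run = [], 0
    for i in range(0, len(x) - frame, frame):
        f = x[i:i + frame].astype(np.float64)
        if np.sqrt(np.mean(f * f)) >= threshold:    # RMS energy gate
            quiet_run = 0
            out.append(f)
        else:
            quiet_run += 1
            if quiet_run <= keep:
                out.append(f)
    return np.concatenate(out).astype(x.dtype) if out else x[:0]

def naive_speedup(x: np.ndarray, factor: float = 1.5) -> np.ndarray:
    # Decimation shortens the sound but raises its pitch by `factor`.
    idx = np.arange(0, len(x), factor).astype(int)
    return x[idx]

tone = (np.sin(np.linspace(0, 2 * np.pi * 440, 8000)) * 3000).astype(np.int16)
shorter = naive_speedup(drop_pauses(tone), 2.0)     # half as long, octave higher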
4.3 Discussion of Xcapture

Xcapture is successful as a short-term memory aid: it works well when the user wants to immediately replay an utterance from a familiar, continuing conversation, as in the collaborative writing scenario above. It is less successful at retrieval from the more distant past, because finding a forgotten point in a long recording is tedious even with time-scaling and scanning. A minor but consequential difficulty is that the microphones we used are battery powered, and users forgot to turn them on, defeating the purpose of continuous recording.

Our experience with xcapture led to several research directions to improve the capture and retrieval of spontaneous recordings. One direction considers how the structure implicit in a conversation can be exploited as part of the capture process. A second direction explores improvements to visual representations of stored speech and to audio-related interaction mechanisms. These directions are described in the following two sections, respectively.
5. DYNAMIC CAPTURE AND DISPLAY OF TELEPHONE CONVERSATIONS
This part of our work addresses the derivation of inherent conversational structure, interaction at the time of capture, and the visual presentation of audio during conversations. The inherent structure of a conversation is defined by the naturally occurring pauses at the end of phrases and sentences and by the alternation of utterances between speakers.

Telephone calls are a practical choice for demonstrating semi-structured audio: very little equipment beyond audio-capable workstations is required, and talker detection is possible because the two audio channels can be separated. Telephone calls are also typically brief; in a study of professional activities, calls lasted an average of 3-6 minutes [Reder and Schwab 1990]. The audio data can be automatically segmented into understandable pieces by the Listener, a telephone listening tool that allows users to identify and save relevant portions of calls as the conversation progresses.
Studies of audio interactions
and telephone
calls informed
the design of the
Listener’s
segmentation
strategy.
Beattie and Barnard [1979] focused on turntaking during inquiries to British directory operators. They found that turntaking pauses averaged 0.5 seconds long, although 34% of turns were accomplished within 0.2 seconds. A number of other studies quantify conversational parameters, such as turntaking, pausing, and interruptions, as summarized by Rutter [1987]. Turntaking in and of itself is insufficient for structuring a conversation, for three reasons: not all pauses are attributable to turntaking; turntaking pauses will not always be distinguishable by their length from the pauses between phrases of one speaker's utterance; and some turns happen without any detectable pause. The Listener therefore uses both turntaking and pausing to derive structure.

5.1 Capturing and Displaying Conversational Structure

The Listener captures structure from telephone calls with minimal assistance from the user, as described in the following scenario. You receive notification of a telephone call, and the Listener window pops up on your screen. You choose to record the call. While you are talking, a graphical representation of the conversation is constructed on your screen, showing the shifts in who is speaking and the relative length of each turn. You can click on a segment to indicate that it should be saved. At the end of the phone call, you can listen to segments and decide which ones to save, or just save all the marked segments.

Two microphones collect audio signals for the Listener. One is connected to the telephone handset and carries speech from both talkers. The second microphone sits near the telephone in the user's office and carries just that person's speech (assuming that a speakerphone is not in use). This second, single-talker audio stream enables the Listener to distinguish between the two talkers. The Listener receives audio data from
both microphones,
performs
pause detection
on each source, synchronizes
the
sources, and then locates changes of talker
between
pauses. This last step is
needed
because
turntaking
pauses
can be undetectable
with
just
pause
detection.
The new segment
is then added to the call display.
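Under simple stated assumptions (RMS energy gating over 20 ms frames; the Listener's actual detector is more elaborate), the two-source talker labeling can be sketched as follows. Because the handset carries both talkers while the desk microphone carries only the local user, comparing the two activity patterns labels each frame even when no pause separates the turns.

import numpy as np

RATE, FRAME = 8000, 160          # 8 kHz audio, 20 ms frames (assumed)

def active(frames: np.ndarray, threshold: float = 200.0) -> np.ndarray:
    """Per-frame speech/silence decision via RMS energy."""
    f = frames.astype(np.float64)
    return np.sqrt((f * f).mean(axis=1)) >= threshold

def label_talkers(handset: np.ndarray, desk_mic: np.ndarray) -> list[tuple[str, int]]:
    """Return (talker, frame_index) pairs at each change of talker."""
    n = min(len(handset), len(desk_mic)) // FRAME
    both = active(handset[:n * FRAME].reshape(n, FRAME))
    user = active(desk_mic[:n * FRAME].reshape(n, FRAME))
    changes, current = [], None
    for i in range(n):
        talker = "silence" if not both[i] else ("user" if user[i] else "far end")
        if talker != current:
            changes.append((talker, i))
            current = talker
    return changes

voice = (np.random.randn(RATE) * 1000).astype(np.int16)   # one second of "speech"
quiet = np.zeros(RATE, dtype=np.int16)
print(label_talkers(np.concatenate([voice, voice]),
                    np.concatenate([voice, quiet])))       # user, then far end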
The call display that appears during the conversation must be unobtrusive, so as not to interfere with the conversation. Also, short-term memory constraints imply that only the recent portions of the conversation are salient, so only the last 30 seconds or so of the conversation is displayed. Figure 2 shows the call display during a conversation. Each conversational turn, that is, each single-talker portion of the audio signal, is displayed as a segment within a SoundViewer, the same visual representation that is used throughout our applications. In this picture, each tick mark within a SoundViewer represents one second of audio. New segments appear at the right-hand side, and older segments scroll out of view to the left; utterances from each talker are distinguished by their relative positioning and by different colors.

The Listener provides a marking mechanism that allows users to mark those segments of the conversation that are interesting and may merit later rehearing. A user marks segments by clicking on them; marked segments are visually distinguished by their border, and marking is reversible. During the phone call, the user's attention is focused on the conversation, and not on interacting with the Listener program. Therefore, the only feasible user actions are clicking on segments of interest or on the automatic-marking toggle; at any time the user can toggle automatic marking, so that all new segments are marked.

5.2 Adding User-Supplied Structure

The Listener's display of conversational structure reinforces the user's memory of the conversation: individual turns and talkers are visually distinguishable, and significant segments can be marked retrospectively, shortly after they take place. Once the conversation is completed, the nature of the interaction changes from capture to review. The postcall Browser displays the entire conversation and provides additional editing functions. A user can replay all or part of the conversation, revise the choice of segments to store, and save the segments for later retrieval. Users can also provide a descriptive tag for each conversation, although tags are not required. Once these postcall revisions are made, only the marked segments are saved. Marked segments will typically occur in consecutive groups, and when the conversation is retrieved in the future these groups are visually distinct, as shown in Figure 3.
5.3 Retrieving Stored Telephone Conversations

Stored conversations may be retrieved long after the phone call took place. Situational and supplemental structure can provide memory cues to the content of a stored conversation.
D: Hello, this is Debby Hindus speaking.
B: Hi Deb, it's Bob. I'm just getting out of work, I figured I'd call and see how late you're going to stay tonight.
D: Well, I think it'll take me about another hour, hour and a half, to finish up the things I'm doing now.
B: OK, I'm just going to head on home, I'll probably do a little shopping on the way.
D: Well, if you think of it, maybe you could get some of that good ice cream that you got last week.
B: OK. By the way, somebody mentioned an article you might be able to use in your tutorial.
D: Oh really? [Debby's very short turn is ignored.]
B: Yeah, it's by Graeme Hirst, in the June '91 Computational Linguistics.

Fig. 2. Sequence of segments during a phone call, with transcriptions.
Fig. 3. Browsing the content of a stored conversation.
The Listener collects and stores situational data, like the time and date of the call and the other party's name and phone number, if known. The user's choice of which segments to save is one form of supplemental structure, and textual tags are another. The representation of a saved telephone conversation includes the audio data of the marked segments, along with the situational data, the supplemental text, and indices into the corresponding audio; this representation is stored in a file and referred to as a chat.

A stored conversation thus carries three kinds of structure: conversational, situational, and supplemental. The speech itself cannot be searched for information, so retrieval depends on this structure. There are two groups of chat retrieval issues: one is finding the desired audio segments within
a chat, and the other is locating
a particular
chat from
among numerous
stored chats. Our work has been narrowly
focused on capturing
and retrieving
segments
within
a single conversation.
Future
efforts
will need to address mechanisms
for navigating
among many chats, such as
making
use of situational
data to locate a chat in a fashion akin to locating
an
electronic
mail message.
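Because retrieval can lean only on these fields, locating a chat looks much like filtering electronic mail headers. A minimal sketch follows; the field names are illustrative, not the Listener's file format.

from dataclasses import dataclass
from datetime import datetime

@dataclass
class Chat:
    when: datetime
    other_party: str       # name, if known
    number: str            # phone number, if known
    tag: str = ""          # optional user-supplied description
    audio_file: str = ""   # the marked segments; the words inside are opaque

def find_chats(chats: list[Chat], party: str = "",
               after: datetime | None = None, text: str = "") -> list[Chat]:
    """Filter on situational and supplemental structure, never on the audio."""
    hits = [c for c in chats
            if (not party or party.lower() in c.other_party.lower())
            and (after is None or c.when >= after)
            and (not text or text.lower() in c.tag.lower())]
    return sorted(hits, key=lambda c: c.when, reverse=True)

chats = [Chat(datetime(1993, 6, 1, 17, 5), "Bob", "555-0142", "tutorial article")]
print(find_chats(chats, party="bob"))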
5.4 Discussion of the Listener
We have used the applications
ourselves
enough
to be confident
that the
underlying
concepts are viable and worthy
of additional
research.
We have,
for example,
used the Listener
while
collaborating
long-distance
on papers
and mailed xcapture recordings of impromptu office discussions to other group members.
Although
the Listener’s
day-to-day
usage was limited
to one
of the authors
for technical
reasons,
that
author
experienced
consistent
success in marking
segments
of interest
or adjoining
segments.
Furthermore,
Ubiquitous
the
minimal
interactions
in conversations,
by
casual
interactive
the Listener
with
was engaging
static
highlights
involving
into
such
distinguishable
display
several
her
and
was
aspects
conversation.
and
One
participation
comprehended
considerably
of building
less
real-time
continuing
segment
length
of two
Additionally,
problem
by the
aspect
aspect
seconds,
As
Listener
is how well
is how
determination
segments
or sequentially.
calculated
between
utterances.
Another
significant
Another
Final
as a minimum
individually
are
imperfect.
segments.
segments.
played
boundaries
interfere
display
Browser’s
is awkward
signal
constraints,
when
noticeably
387
.
is
consistent
high-quality
audio from microphones
in offices. The
from telephones
is good, but using two microphones
to segment
on talker
audio
not
dynamic
The
applications
how to obtain
audio quality
the
the
observers.
understandable.
Experience
with
based
did
and
Audio
in
so that
the chosen
that
ensure
should
shown
to divide
must
sound
reflect
visually
complete
Figure
fall
they
visual
4, segment
the
in
pauses
presentation
works
during
and after
the conversational
the conversation.
The Listener’s
call display
does represent
structure
of speech, and it worked well as a dynamic
repre-
sentation
the conversation.
tation
and
during
for later
browsing.
innovation
when
with
informed
It was less successful
Clearly,
respect
there
to representation
by the cognitive
as a static
are opportunities
science
and
perspective.
represen-
for experimentation
interaction,
particularly
For example,
interacting
with a computer
program
while engaged in conversation
raises issues of task
and memory
workload,
and use of attentional
resources.
Finally,
privacy
issues received
only minimal
attention
in this prototype
implementation.
that
record
As we discuss
conversation
need
toward
outside
6. PRESENTING
AND INTERACTING
accommodate
material
for
with
speech
working
lengthy
of a small
with
research
that
well,
application-specific
of this
privacy
the
however,
and
limitations
by using
SoundViewers
arranging
them
the
applications
concerns
before
they
Listener
SoundViewer
to be too simple
structure.
avoided
for each segment
conversational
RECORDINGS
original
and it proved
or user-supplied
snippets,
article,
group.
WITH LENGTHY
xcapture
recordings
to represent
end
to accommodate
can be employed
We saw when
the
did
It worked
the
In
well
SoundViewer’s
of the conversation
structure.
not
for audio
this
section,
and
we
describe enhancements
we made to the SoundViewer
that enable it to directly
support
segmentation,
multiple
levels
of structure,
and presentation
of
lengthy
recordings.
These enhancements
include
the display of segmentation,
scaling and zooming
of long sounds, and the ability
to annotate
parts of the
sound with text or markers
that act as bookmarks.
Mechanisms
for navigating among segments
and for rapid searches were developed
as well.
6.1 Displaying Multiple Levels of Structure
The enhanced
SoundViewer
widget
supports
the optional
display
of several
levels of structure.
The most general
structuring
for speech is to distinguish
between speech and silence intervals.

Fig. 4. Silences are divided between segments.

Figure 5 shows the modified widget incorporated into a voice mail application; segments of speech are displayed as black bars, and silence is white, following Etherphone's example. The SoundViewer allows the user to jump forward and backward between speech segments during playback, by pressing the space bar or "b" key, respectively.
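Jumping between segments requires only the sorted list of segment start times and the current play position. The following is a minimal sketch of that logic, ours rather than the widget's implementation.

import bisect

class SegmentNavigator:
    """Move playback among speech segments, as the space bar and 'b' key do."""

    def __init__(self, segment_starts_s: list[float]):
        self.starts = sorted(segment_starts_s)   # start of each black bar
        self.position = 0.0                      # playback time, in seconds

    def next_segment(self) -> float:             # space bar: jump forward
        i = bisect.bisect_right(self.starts, self.position)
        if i < len(self.starts):
            self.position = self.starts[i]
        return self.position

    def previous_segment(self) -> float:         # 'b': jump backward
        i = bisect.bisect_left(self.starts, self.position)
        if i > 0:
            self.position = self.starts[i - 1]
        return self.position

nav = SegmentNavigator([0.0, 2.6, 7.1, 11.8])
nav.position = 5.0
assert nav.next_segment() == 7.1
assert nav.previous_segment() == 2.6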
Applications may require segmentation at a level of semantic and structural information higher than speech and silence, such as distinguishing between two speakers or between speech and music. To present application-specific structure, an application may specify its own layer of structure, containing another level of black and white bars below the speech and silence bars, as shown in Figure 6. The interpretation of this content layer is up to the application, which can also associate a visual icon with the SoundViewer to convey its content. For example, musical interludes within a radio broadcast recording can be distinguished from speech, giving the user a way to skip from one musical selection to another.

The third level of structure is user-supplied: points of interest within a recording are indicated by adding arrow-shaped markers below the content bar to denote bookmarks. We followed the example of the SoundBrowser [Degen et al. 1992]; users can place markers within a SoundViewer by pressing the caret key. Text annotations are also supported: an optional text label can be set by the application. This label can display the caller's name in a telephone message, for example, or the date of a recording.

6.2 Displaying and Interacting with Lengthy Recordings

The SoundViewer emphasizes interactive playback controls together with the temporal aspect of audio, and it has required improvement to accommodate recordings longer than a few minutes. Like other graphical interfaces to time-varying media, the SoundViewer uses a mapping from time to space (length) to represent sound. Showing the total duration of a recording, along with a continually updated position indication during playback, is important for navigation and user comfort [Myers 1985].
The SoundViewer
initially
used tick marks of varying
size to convey a time
scale, and longer sounds were shown with closer spacing between
tick marks.
But these visual
cues were inadequate
for indicating
total
duration
or
positioning
within
long recordings.
Because
tick marks
failed
to present
absolute
duration,
text labels were introduced into the SoundViewer to display sound duration; for example, "1 min 30 sec" labels a 1.5-minute recording. Tick marks did provide navigational cues, however, and we are evaluating tick marks and speech-and-silence displays for navigation.
Fig. 5. The pmail voice mail application, incorporating the modified SoundViewer (mouse bindings: LEFT: view; RIGHT: delete).