Rockstar Consortium US LP et al v. Google Inc
Filing
33
RESPONSE to Motion re 18 MOTION to Change Venue filed by NetStar Technologies LLC, Rockstar Consortium US LP. (Attachments: # 1 Declaration of Donald Powers, # 2 Declaration of Mark Hearn, # 3 Declaration of Kristin Malone, # 4 Exhibit 1 to Decl. of K. Malone, # 5 Exhibit 2 to Decl. of K. Malone, # 6 Exhibit 3 to Decl. of K. Malone, # 7 Exhibit 4 to Decl. of K. Malone, # 8 Exhibit 5 to Decl. of K. Malone, # 9 Exhibit 6 to Decl. of K. Malone, # 10 Exhibit 7 to Decl. of K. Malone, # 11 Exhibit 8 to Decl. of K. Malone, # 12 Exhibit 9 to Decl. of K. Malone, # 13 Exhibit 10 to Decl. of K. Malone, # 14 Exhibit 11 to Decl. of K. Malone, # 15 Exhibit 12 to Decl. of K. Malone, # 16 Exhibit 13 to Decl. of K. Malone, # 17 Exhibit 14 to Decl. of K. Malone, # 18 Exhibit 15 to Decl. of K. Malone, # 19 Text of Proposed Order)(Nelson, Justin)
EXHIBIT 4
High Scalability - Google's Colossus Makes Search Real-time by Dumping MapReduce
Google's Colossus Makes Search RealTime By Dumping MapReduce
SATURDAY, SEPTEMBER 11, 2010 AT 5:50AM
Google are the kings of scaling, so when they change their search infrastructure over to do something completely different, it's news. In Google search index splits with MapReduce, an exclusive interview by Cade Metz with Eisar Lipkovitz, a senior director of engineering at Google, we learn a bit more of the secret scaling sauce behind Google Instant, Google's new faster, real-time search system.
The challenge for Google has been how to support a real-time world when the core of their
search technology, the famous MapReduce, is batch oriented. Simple, they got rid of
MapReduce. At least they got rid of MapReduce as the backbone for calculating search
indexes. MapReduce still excels as a general query mechanism against masses of data,
but real-time search requires a very specialized tool, and Google built one. Internally, the successor to Google's famed Google File System was code-named Colossus.
Details are slowly coming out about their new goals and approach:
Goal is to update the search index continuously, within seconds of content changing, without rebuilding the entire index from scratch using MapReduce.

The new system is a "database-driven, Big Table–variety indexing system."

When a page is crawled, changes are made incrementally to the web map stored in BigTable, using something like database triggers.

A compute framework executes code on top of BigTable.

The use of triggers is interesting because triggers are largely ignored in production systems. In a relational database, triggers are integrity checks that are executed on a record every time a record is written. The idea is that if the data is checked for problems before it is written into the database, then your data will always be correct and the database can be a pure source of validated facts. For example, an account balance could be checked for a negative number. If the balance was negative then the write would fail, the transaction would be aborted, and the database would maintain a correct state.
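The account-balance example can be made concrete. SQLite supports exactly this style of integrity-check trigger, so here is a minimal, self-contained sketch; the table and trigger names are illustrative, not anything from the article:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER NOT NULL);

-- Integrity-check trigger: reject any update that would leave a negative balance.
CREATE TRIGGER no_negative_balance
BEFORE UPDATE ON accounts
FOR EACH ROW WHEN NEW.balance < 0
BEGIN
    SELECT RAISE(ABORT, 'balance cannot go negative');
END;
""")
conn.execute("INSERT INTO accounts VALUES (1, 100)")

try:
    conn.execute("UPDATE accounts SET balance = -50 WHERE id = 1")
except sqlite3.IntegrityError as e:
    print(e)  # the trigger's message

# The failed write was aborted, so the stored balance is unchanged.
print(conn.execute("SELECT balance FROM accounts WHERE id = 1").fetchone()[0])
```

The check runs on every write, inside the database, which is precisely the behavior the post goes on to critique.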
http://highscalability.com/blog/2010/9/11/googles-colossus-makes-search-real-time-by-dumping-mapreduce.html[2/19/2014 9:52:39 AM]
In practice there are many problems with triggers:

Destroys performance. Since triggers happen on every write they slow down the write path to the extent that database performance is often killed. Triggers also take record locks, which makes contention even worse.

Checking is distributed. Not all integrity checks can be made from data inside the database, so checks that are database oriented are put in the triggers, and checks that, say, access a 3rd party system are kept in application code. So now we have checks in multiple places, which leads to the update anomalies in code that, ironically, relational databases are meant to prevent in data.

Checking logic is duplicated. Checking in the database is often too late. A user needs to know immediately when they are doing something wrong; they can't wait until a transaction is committed to the database. So what ends up happening is checking code is duplicated, which again leads to update anomalies.

Not scalable. The database CPU becomes dedicated to running triggers instead of "real" database operations, which slows down the entire database.

Application logic moves into triggers. Application code tends to move into triggers over time. Since triggers happen on writes (changes), in a single common place, they are a very natural place to do all the things that need to be done to data. For example, it's common to emit events about data changes and perform other change-sensitive operations. It's easy to imagine building secondary indexes from triggers and synchronizing to backup systems and other 3rd party systems. This makes the performance problems even worse. Instead of application code being executed in parallel on horizontally scalable nodes, it's all centralized on the database server, and the database becomes the bottleneck.
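One of the patterns named above, building a secondary index from a trigger, is easy to demonstrate. The SQLite sketch below uses illustrative table names and shows how maintenance of the derived table rides along on every write to the base table, which is exactly why the work ends up centralized on the database server:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE pages (url TEXT PRIMARY KEY, lang TEXT);
CREATE TABLE pages_by_lang (lang TEXT, url TEXT);

-- Secondary index maintained by a trigger: every insert into the base table
-- also writes the derived table, on the database's write path.
CREATE TRIGGER index_lang AFTER INSERT ON pages
BEGIN
    INSERT INTO pages_by_lang VALUES (NEW.lang, NEW.url);
END;
""")
conn.execute("INSERT INTO pages VALUES ('a.html', 'en')")
conn.execute("INSERT INTO pages VALUES ('b.html', 'de')")

# The derived table was populated without any application code running.
print(conn.execute("SELECT url FROM pages_by_lang WHERE lang='en'").fetchall())
```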
So, triggers tend not to be used in OLTP scenarios. Applications contain the same logic, and it's up to release policies and testing to make sure nothing falls through the cracks.

This isn't to say you don't want to put logic in triggers; you do. Triggers are ideal in an event-oriented world because the database is the one central place where changes are recorded. The key is to make triggers efficient.
It sounds like Google may have made a specialized database where triggers are efficient
by design. We know hundreds of thousands of pages are being crawled simultaneously.
When the pages are written to the database, that's the perfect opportunity to perform calculations and updates. The compute framework would make it efficient to perform operations in parallel on pages as they come in, so that processing isn't completely centralized on a single node.
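That idea — code that runs when a row is written, so computation happens incrementally at write time rather than in batch — can be sketched as a toy observer framework. Everything here (the names, the single-process table) is an illustrative assumption, not Google's actual compute framework:

```python
# Toy "compute framework on top of a table": observer functions registered
# below run whenever a row is written, so derived data is computed
# incrementally at write time instead of by a periodic batch job.

observers = []

def observer(fn):
    """Register a function to run on every table write."""
    observers.append(fn)
    return fn

table = {}

def write(row_key, value):
    table[row_key] = value
    for fn in observers:  # in a real system these would run distributed, in parallel
        fn(row_key, value)

@observer
def update_link_counts(row_key, page):
    # Derived column computed at write time: count outgoing links in the HTML.
    page["outlinks"] = page["html"].count("<a ")

write("example.com/", {"html": "<a href=x></a><a href=y></a>"})
print(table["example.com/"]["outlinks"])  # 2
```

The design choice being illustrated: the trigger-like hook is cheap and local to the write, so the index stays current without re-scanning the whole table.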
One can imagine that the in-memory data structures that existed on the MapReduce nodes
have been extracted in some form and reified within BigTable. What we have is a sort of
Internet DOM, analogous to the browser DOMs that have made it possible to have such
incredibly powerful browser UIs, but for the web. I imagine programming the web for
Google has become something like programming a browser DOM. I could be completely
wrong of course, but I explored this idea a while ago in All the world is a DOM.
There's an interesting quote at the end of the article: "We're in business of making searches useful, we're not in the business of selling infrastructure."

These could actually be the same goal. It's a bit silly to have everyone in the world crawl the same pages and build up yet another representation of the Web. Imagine if the Web was represented by an Internet DOM that you could program, that you could write to, and that you could add code to, just like Javascript is added to the browser DOM. Insert a node into the Internet DOM, like a div in an HTML page, and it would just show up on the Web. This Internet DOM could be shared, and all that backbone bandwidth and web site CPU reclaimed for something more useful. Google is pretty open, but they may not be that open.
Todd Hoff | 11 Comments | Permalink

Reader Comments (11)
If you are looking to build something similar, check out the open source elasticsearch. It was built with the mentioned architecture in mind (sorry for the "plug", I am the creator of elasticsearch).

September 11, 2010 | Shay Banon
Interesting. But Google Instant is independent of realtime search, isn't it? I thought Instant was a new UI that shows results before you hit the Search button, with the same back end search.

September 11, 2010 | Brian M
I suspect you're taking the statement "like triggers" too literally. Incrementally updating results is known to be very useful for maintaining real-time result sets. Look up "materialized views".

September 11, 2010 | Allan MacKinnon
Thanks Todd, great post!

I think the idea of the "internet DOM" is very powerful. Try to imagine the next step. Nodes don't only represent documents on the web, but objects in the world. Each object can be a part of the world DOM. In DOM you also have events, so you can listen to events that happen on objects in the DOM. You can listen for events and trigger new events. Perhaps your alarm clock will listen to changes in your work schedule, and your toaster and friends will listen to your alarm clock. Maybe an inventory management system will listen for price-change events and execute orders accordingly. The world object DOM with events is a very powerful computational network that works very similarly to a human brain.
This is far ahead of the curve, though. Google seems to finally have figured out how to navigate the "document" graph in realtime. The next difficulty level is the social graph, where we don't know yet who'll take the cake. Only then will the object graph play a role. Exciting :)

September 11, 2010 | Jonas Huckestein

Sounds closer to replication than triggers? They're just updating the copy in Bigtable?

September 12, 2010 | Greg Linden
Brian M:

It's more than just a new UI. If you think about it, instant requires a bunch of new back-end infrastructure. How do you log "views" on the server end if they're instant? You need to generate the candidate list, but you only know retrospectively whether the user hits the 3-second threshold to count as a view.

For things like sponsored search advertising, where an estimate of how likely someone is to click on an ad is very important for ensuring you show higher-quality ads, this is actually a very big deal.

September 13, 2010 | jon
Google document data is dimensionally simplistic and it should not be compared to a standard OLTP or even relational database. This is not meant to say they don't do an amazing job at what they do, but when an entire core data model consists of one table that happens to be extremely easy to partition...well...LUCKY!!!

September 14, 2010 | Jason
Yahoo's YQL presents the web like one big DOM :)
September 14, 2010 | mike503
Each call to the database has a certain overhead. You can gain performance if you can limit the number of calls to the database by using something like stored procs or triggers or MapReduce.

I think that Google has too much data to process it all in some kind of client or middle tier, so they process it in the database itself. I don't think I'm telling you anything new here.

I don't think that triggers lock records, at least not in Oracle, and the code in triggers does do real database work. And triggers are not fired when the transaction is committed but when the inserts, updates and deletes are happening. Or just before and after the inserts, updates and deletes are happening.

The big disadvantage of triggers is the lack of maintainability; it all becomes too cyclic, and the flow becomes difficult to understand when you have a lot of triggers. Oracle (Tom Kyte) doesn't advise the use of triggers but advises the use of stored procs or packages of stored procs. Oracle's parallelized pipelined functions offer the same functionality as MapReduce: http://www.dbms2.com/2010/02/11/googlemapreduce-patent/

I think that triggers are used more for journaling (storing audit information) than for business logic checks.

Using stuff like MapReduce, stored procs or triggers indeed becomes much harder (slower) when your data is distributed in all different kinds of stores. But if you write your stored procs in, for instance, Java you are not limited in your possibilities.

September 15, 2010 | RC
As a poster above mentioned, I think the difference is more likely to be the difference between re-executing a query over all of the source data when it changes, as opposed to incrementally updating a materialized view as the underlying data changes.

Incremental maintenance of materialized views: http://www.csd.uoc.gr/~hy562/Papers/maintenance.pdf

So instead of batching up and processing n web page changes through map-reduce and having multiple 'layers' of web index, a single index can be maintained as changes are detected:

onPageChange(old_page, new_page)
{
    Calculate index entry diff(old_page, new_page);
    Apply diff to index (remove old entries, change some existing entries, add new entries);
}

This could be a similar amount of work per page change as required in the batch approach, but doesn't require multiple layered generations of index and periodic re-index-the-web jobs. I imagine that the number of changed pages is << the number of static pages for any time unit, so the total computation could be less.

To be efficient it requires the ability to perform random access writes to the index, which is presumably where BigTable helps. Also to be worthwhile, the index changes need to be pushed out to all the replicas as the changes occur, rather than as a batch job.

Lower-latency maintenance of the search index is decoupled from higher throughput of searches as required for the Instant API. However, both put more pressure on the underlying index storage infrastructure.

September 15, 2010 | Frazer Clement
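The diff-and-apply step in the pseudocode above can be sketched concretely. The Python below maintains a toy inverted index; the tokenizer and index layout are illustrative assumptions, not anything Google-specific:

```python
# Diff the terms of the old and new versions of a page and patch the inverted
# index in place, rather than rebuilding it in batch.

def tokenize(text):
    """Toy tokenizer: the set of lowercase whitespace-separated terms."""
    return set(text.lower().split())

def on_page_change(index, url, old_text, new_text):
    old_terms, new_terms = tokenize(old_text), tokenize(new_text)
    for term in old_terms - new_terms:       # terms removed from the page
        index.get(term, set()).discard(url)
    for term in new_terms - old_terms:       # terms added to the page
        index.setdefault(term, set()).add(url)

# index maps term -> set of URLs containing it
index = {"colossus": {"a.html"}, "mapreduce": {"a.html", "b.html"}}
on_page_change(index, "a.html", "colossus mapreduce", "colossus bigtable")
print(sorted(index["mapreduce"]))  # ['b.html']
print(sorted(index["bigtable"]))   # ['a.html']
```

Only the changed page is touched; unchanged postings (like "colossus") are never recomputed, which is the whole point of the incremental approach.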
I wonder why they haven't built a system yet to allow website owners to notify Google directly of content changes. That would remove all that outdated crawling (polling), replacing it with notifications. Then make that free with some daily quotas so that small websites would still be real-time indexed, and add payment options to raise those quotas for large commercial websites. In the end we'll have a realtime web search funded by websites themselves.

September 19, 2010 | Max
COPYRIGHT © 2011, POSSIBILITY OUTPOST. ALL RIGHTS RESERVED.
Disclaimer: Justia Dockets & Filings provides public litigation records from the federal appellate and district courts. These filings and docket sheets should not be considered findings of fact or liability, nor do they necessarily reflect the view of Justia.