Facebook connects its 500 million users using an array of open source software to enable social networking as well as data intelligence. Facebook’s open source Web serving infrastructure has a lot more than just the traditional LAMP (Linux/Apache/MySQL/PHP) stack behind it.
During a keynote session at the OSCON open source conference, David Recordon, the senior open programs manager at Facebook, detailed the infrastructure in use today at Facebook.
At the language level of the stack, Recordan noted that Facebook is using PHP by way of its own HipHop PHP runtime project. Facebook officially announced HipHop earlier this year as a way to speed up PHP operations, improve efficiency and decrease CPU utilization.
At the database tier, Recordan said Facebook primarily stores user data in the MySQL database. He said that Facebook runs thousands of MySQL nodes, though he added that Facebook doesn’t care that MySQL is a relational database.
“We generally don’t use it (MySQL) for Joins
Recordan said that Facebook has three different layers for data. At the first layer is the database tier, which is the primary data store and where MySQL sits. On top of that, Facebook uses Memcached caching technology, then a Web server on top of that to serve the data.
“We’re actually using our Web server to combine the data to do joins and that’s where HipHop is so important,” Recordan said. “Our Web server code is fairly CPU-intensive because we’re doing all these different sorts of things with data.”
In addition to MySQL, Facebook leverages a pair of NoSQL-type databases as well including Cassandra and HBase, which is part of the Apache Hadoop project.
“While we store the majority of our user data inside of MySQL, we have about 150 terabytes of data inside of Cassandra, which we use for inbox search on the site and over 36 petabytes of uncompressed data in Hadoop overall.”
Recordan said that Facebook’s Hadoop cluster has a little over 2,200 servers in it, running a total of 23,000 CPU cores inside of them. He added that by the end of the year, Facebook expects to be storing over 50 petabytes worth of information.
The Hadoop components help to enable Facebook to use the data it has to understand how people are using the site. Recordan said that Facebook uses data analysis for all sorts of product decisions including how Facebook sends e-mails and how it ranks news feeds.
In order to help enable the data analysis, Facebook uses an open source technology called Scribe.
“Scribe takes the data from our Web servers and funnels it into HDFS (Hadoop Distributed File System) and into our Hadoop warehouses,” Recordan said. The problem that we originally ran into was too many Web servers trying to send data to one place, so Scribe breaks it up into a series of funnels for collecting data over time.”
Recordan said that Facebook’s Hadoop cluster is vital to the business and the system is highly monitored and maintained. Facebook has what it calls a Platinum Hadoop cluster, plus a second cluster called the Silver Hadoop cluster where data from the Platinum cluster is replicated.
Additionally Facebook uses the Apache Hive technology, which provides a SQL
“A large part of our infrastructure is open source and we really think that it’s important in terms of being able to allow developers that are building with the Facebook platform to scale using the same pieces of infrastructure that we use,” Recordan said. “Fundamentally we’re all running into the same sets of challenges.”