Google Research Blog
The latest news from Research at Google
DeepMind moves to TensorFlow
Friday, April 29, 2016
Posted by Koray Kavukcuoglu, Research Scientist, Google DeepMind
At DeepMind, we conduct state-of-the-art research on a wide range of algorithms, from deep learning and reinforcement learning to systems neuroscience, towards the goal of building Artificial General Intelligence. A key factor in facilitating rapid progress is the software environment used for research. For nearly four years, the open source Torch7 machine learning library has served as our primary research platform, combining excellent flexibility with very fast runtime execution, enabling rapid prototyping. Our team has been proud to contribute to the open source project in capacities ranging from occasional bug fixes to being core maintainers of several crucial components.
With Google’s recent open source release of TensorFlow, we initiated a project to test its suitability for our research environment. Over the last six months, we have re-implemented more than a dozen different projects in TensorFlow to develop a deeper understanding of its potential use cases and the tradeoffs for research. Today we are excited to announce that DeepMind will start using TensorFlow for all our future research. We believe that TensorFlow will enable us to execute our ambitious research goals at much larger scale and an even faster pace, providing us with a unique opportunity to further accelerate our research programme.
As one of the core contributors of Torch7, I have had the pleasure of working closely with an excellent community of developers and researchers, and it has been amazing to see all the great work that has been built on top of the platform and the impact this has had on the field. Torch7 is currently being used by Facebook, Twitter, and many start-ups and academic labs as well as DeepMind, and I’m proud of the significant contribution it has made to a large community in both research and industry. Our transition to TensorFlow represents a new chapter, and I feel very excited about the prospect of DeepMind contributing heavily to another great open source machine learning platform that everyone can use to advance the state-of-the-art.
Computer Science Education for All Students
Tuesday, April 26, 2016
Posted by Maggie Johnson, Director of Education and University Relations
(Cross-posted on the Google for Education Blog)
Computer science education is a pathway to innovation, to creativity, and to exciting career prospects. No longer considered an optional skill, CS is quickly becoming a “new basic”, foundational for learning. In order for our students to be equipped for the world of tomorrow, we need to provide them with access to computer science education today.
At Google, we believe that all students deserve these opportunities. Today we joined some of America’s leading companies, governors, and educators to support an open letter to Congress, asking for funding to provide every student in every school the opportunity to learn computer science. Google has long been committed to
developing programs, resources, tools and community partnerships
that make computer science engaging and accessible for all students.
We are strengthening that commitment today by announcing an additional investment of $10 million towards computer science education for 2017, along with the funding that we have allocated for 2016. This funding will allow us to build more resources, scale our programs, and provide additional support to our partners, with a goal of reaching an additional 5 million students.
With Congress’ help, we can ensure that every child has access to computer science education. Please join us by signing our online petition.
Helping webmasters re-secure their sites
Monday, April 18, 2016
Posted by Kurt Thomas and Yuan Niu, Spam & Abuse Research
Every week, over
10 million users encounter harmful websites
that deliver malware and scams. Many of these sites are compromised personal blogs or small business pages that have fallen victim due to a weak password or outdated software. Safe Browsing and Google Search protect visitors from dangerous content by displaying browser warnings and labeling search results with
‘this site may harm your computer’
. While this helps keep users safe in the moment, the compromised site remains a problem that needs to be fixed.
Unfortunately, many webmasters for compromised sites are unaware anything is amiss. Worse yet, even when they learn of an incident, they may lack the security expertise to take action and address the root cause of compromise. Quoting one webmaster from a survey we conducted, “our daily and weekly backups were both infected” and even after seeking the help of a specialist, after “lots of wasted hours/days” the webmaster abandoned all attempts to restore the site and instead refocused his efforts on “rebuilding the site from scratch”.
In order to find the best way to help webmasters clean up after a compromise, we recently teamed up with the University of California, Berkeley to explore how to quickly contact webmasters and expedite recovery while minimizing the distress involved. We’ve summarized our key lessons below. The full study, which you can read here, was recently presented at the International World Wide Web Conference.
When Google works directly with webmasters during critical moments like security breaches, we can help 75% of webmasters re-secure their content. The whole process takes a median of 3 days. This is a better experience for webmasters and their audience.
How many sites get compromised?
Number of freshly compromised sites Google detects every week.
Over the last year Google detected nearly 800,000 compromised websites—roughly 16,500 new sites every week from around the globe. Visitors to these sites are exposed to low-quality scam content and malware. While browser and search warnings help protect visitors from harm, these warnings can at times feel punitive to webmasters who learn only after-the-fact that their site was compromised. To balance the safety of our users with the experience of webmasters, we set out to find the best approach to help webmasters recover from security breaches and ultimately reconnect websites with their audience.
Finding the most effective ways to aid webmasters
Getting in touch with webmasters:
One of the hardest steps on the road to recovery is first getting in contact with webmasters. We tried three notification channels: email, browser warnings, and search warnings. For webmasters who proactively registered their site with Search Console, we found that email communication led to 75% of webmasters re-securing their pages. When we didn’t know a webmaster’s email address, browser warnings and search warnings helped 54% and 43% of sites clean up respectively.
Providing tips on cleaning up harmful content:
Attackers rely on hidden files, easy-to-miss redirects, and remote inclusions to serve scams and malware. This makes clean-up increasingly tricky. When we emailed webmasters, we included tips and samples of exactly which pages contained harmful content. This, combined with expedited notification, helped webmasters clean up 62% faster compared to no tips—usually within 3 days.
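As a rough illustration of one of the checks involved, here is a minimal sketch that flags remote script inclusions pointing outside a site’s own trusted domains. The domains, regex, and function names are hypothetical, purely for illustration; a real scanner inspects far more than script tags (hidden files, redirects, obfuscated payloads).

```python
import re

# Hosts the site owner actually trusts; anything else gets flagged
# for manual review. Illustrative values, not a real allowlist.
TRUSTED_DOMAINS = {"example.com", "ajax.googleapis.com"}

# Capture the host portion of any remote <script src="..."> tag.
SCRIPT_SRC = re.compile(r'<script[^>]+src=["\']https?://([^/"\']+)', re.IGNORECASE)

def flag_remote_inclusions(html):
    """Return the set of external script hosts not on the trusted list."""
    hosts = set(SCRIPT_SRC.findall(html))
    return {h for h in hosts if h not in TRUSTED_DOMAINS}

page = '''
<script src="https://ajax.googleapis.com/jquery.js"></script>
<script src="http://evil-redirector.biz/payload.js"></script>
'''
print(flag_remote_inclusions(page))  # {'evil-redirector.biz'}
```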
Making sure sites stay clean:
Once a site is no longer serving harmful content, it’s important to make sure attackers don’t reassert control. We monitored recently cleaned websites and found 12% were compromised again in 30 days. This illustrates the challenge involved in identifying the root cause of a breach versus dealing with the side-effects.
Making security issues less painful for webmasters—and everyone
We hope that webmasters never have to deal with a security incident. If you are a webmaster, there are some quick steps you can take to reduce your risk. We’ve made it
easier to receive security notifications through Google Analytics
as well as through Search Console. Make sure to register for both services. Also, we have laid out helpful tips for
updating your site’s software and adding additional authentication that will make your site safer.
If you’re a hosting provider or building a service that needs to notify victims of compromise, understand that the entire process is distressing for users. Establish a reliable communication channel before a security incident occurs, make sure to provide victims with clear recovery steps, and promptly reply to inquiries so the process feels helpful, not punitive.
As we work to make the web a safer place, we think it’s critical to empower webmasters and users to make good security decisions. It’s easy for the security community to be pessimistic about incident response being ‘too complex’ for victims, but as our findings demonstrate, even just starting a dialogue can significantly expedite recovery.
Announcing TensorFlow 0.8 – now with distributed computing support!
Wednesday, April 13, 2016
Posted by Derek Murray, Software Engineer
Google uses machine learning across a wide range of its products. In order to continually improve our models, it's crucial that the training process be as fast as possible. One way to do this is to run distributed training across hundreds of machines, which shortens the training process for some models from weeks to hours, and allows us to experiment with models of increasing size and sophistication. Ever since we released TensorFlow as an open-source project, distributed training support has been one of the most requested features. Now the wait is over.
Today, we're excited to release TensorFlow 0.8 with distributed computing support, including everything you need to train distributed models on your own infrastructure. Distributed TensorFlow is powered by the high-performance gRPC library, which supports training on hundreds of machines in parallel. It complements our recent announcement of
Google Cloud Machine Learning
, which enables you to train and serve your TensorFlow models using the power of the Google Cloud Platform.
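In distributed TensorFlow, a cluster is described by a `tf.train.ClusterSpec` that maps job names to network addresses, and each process runs a `tf.train.Server` for one task. A minimal sketch, with placeholder hostnames standing in for real machines:

```python
import tensorflow as tf

# Describe a cluster with one parameter-server task and two worker
# tasks. The host:port strings are placeholders, not real machines.
cluster = tf.train.ClusterSpec({
    "ps": ["ps0.example.com:2222"],
    "worker": ["worker0.example.com:2222", "worker1.example.com:2222"],
})

# Each process in the cluster would start a server for its own task:
#   server = tf.train.Server(cluster, job_name="worker", task_index=0)
#   server.join()  # blocks, serving graph execution requests over gRPC
print(cluster.num_tasks("ps"), cluster.num_tasks("worker"))
```

The same `ClusterSpec` is shared by every process, so each server knows how to reach its peers.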
To coincide with the TensorFlow 0.8 release, we have published a distributed trainer for the Inception image classification neural network in the TensorFlow models repository. Using the distributed trainer, we trained the Inception network to 78% accuracy in less than 65 hours using 100 GPUs. Even small clusters—or a couple of machines under your desk—can benefit from distributed TensorFlow, since adding more GPUs improves the overall throughput, and produces accurate results sooner.
TensorFlow can speed up Inception training by a factor of 56, using 100 GPUs.
The distributed trainer also enables you to scale out training using a cluster management system like Kubernetes. Furthermore, once you have trained your model, you can deploy to production and speed up inference using TensorFlow Serving on Kubernetes.
Beyond distributed Inception, the 0.8 release includes libraries for defining your own distributed models. TensorFlow's distributed architecture permits a great deal of flexibility in defining your model, because every process in the cluster can perform general-purpose computation. Our previous system, DistBelief (like many systems that have followed it), used special "parameter servers" to manage the shared model parameters; the parameter servers had a simple read/write interface for fetching and updating shared parameters. In TensorFlow, all computation—including parameter management—is represented in the dataflow graph, and the system maps the graph onto heterogeneous devices (like multi-core CPUs, general-purpose GPUs, and mobile processors) in the available processes. To make TensorFlow easier to use, we have included Python libraries that make it easy to write a model that runs on a single process and scales to use multiple replicas for training.
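To make the contrast concrete, the simple read/write interface of a classic parameter server can be sketched in plain Python. This is an illustrative toy, not DistBelief or TensorFlow code:

```python
class ParameterServer:
    """Toy parameter server: workers fetch shared parameters, compute
    gradients locally, and push updates back. Illustrative only."""

    def __init__(self, params):
        self.params = dict(params)  # the shared model parameters

    def fetch(self, keys):
        """Read interface: return the current parameter values."""
        return {k: self.params[k] for k in keys}

    def update(self, grads, lr=0.1):
        """Write interface: apply a gradient step to shared parameters."""
        for k, g in grads.items():
            self.params[k] -= lr * g

ps = ParameterServer({"w": 1.0, "b": 0.0})
snapshot = ps.fetch(["w"])   # a worker pulls the latest weights
ps.update({"w": 0.5})        # a worker pushes a gradient
print(ps.params["w"])        # 1.0 - 0.1 * 0.5 = 0.95
```

In TensorFlow, by contrast, this fetch/update traffic is just ordinary edges in the dataflow graph, so the runtime can place and optimize it like any other computation.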
This architecture makes it easier to scale a single-process job up to use a cluster, and also to experiment with novel architectures for distributed training. As an example, my colleagues have recently shown that
synchronous SGD with backup workers
, implemented in the TensorFlow graph, achieves improved time-to-accuracy for image model training.
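The backup-worker idea can be illustrated with a small simulation (toy numbers; the real implementation lives in the TensorFlow graph): with N workers and b backups, each step averages only the first N - b gradients to arrive, so stragglers never stall the update.

```python
def sync_sgd_step(gradients_by_arrival, num_backup):
    """Average the gradients from the first N - num_backup workers
    to finish; the remaining straggler gradients are simply dropped."""
    needed = len(gradients_by_arrival) - num_backup
    first = gradients_by_arrival[:needed]
    return sum(first) / len(first)

# 5 workers, 1 backup: the slowest worker's gradient (arriving last)
# never delays the step. Toy scalar gradients for illustration.
grads_in_arrival_order = [0.9, 1.1, 1.0, 1.2, 5.0]  # 5.0 = straggler
print(sync_sgd_step(grads_in_arrival_order, num_backup=1))  # 1.05
```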
The current version of distributed computing support in TensorFlow is just the start. We are continuing to research ways of improving the performance of distributed training—both through engineering and algorithmic improvements—and will share these improvements with the community. Getting to this point would not have been possible without help from the following people:
TensorFlow training libraries
- Jianmin Chen, Matthieu Devin, Sherry Moore and Sergio Guadarrama
- Zhifeng Chen, Manjunath Kudlur and Vijay Vasudevan
- Shanqing Cai
Inception model architecture
- Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, Jonathon Shlens and Zbigniew Wojna
- Amy McDonald Sandjideh
- Jeff Dean and Rajat Monga
All of Google’s CS Education Programs and Tools in One Place
Tuesday, April 12, 2016
Posted by Chris Stephenson, Head of Computer Science Education Programs
(Cross-posted on the Google for Education Blog)
Interest in computer science education is growing rapidly; even the President of the United States has spoken of the importance of
giving every student an opportunity to learn computer science
. Google has been a supportive partner in these efforts by developing high-quality learning programs, educational tools and resources to advance new approaches in computer science education. To make it easier for all students and educators to access this information, today we’re launching a
CS EDU website
that specifically outlines our initiatives in CS education.
The President’s call to action is grounded in economic realities coupled with a lack of access and ongoing system inequities. There is an increasing need for computer science skills in the workforce, with the
Bureau of Labor Statistics
estimating that there will be more than 1.3 million job openings in computer and mathematical occupations by 2022. The majority of these jobs will require at least a Bachelor’s degree in Computer Science or in Information Technology, yet the U.S. is only producing 16,000 CS undergraduates per year.
One of the reasons there are so few computer science graduates is that too few students have the opportunity to study computer science in high school.
Research shows that only 25% of U.S. schools currently offer CS with programming or coding, despite the fact that 91% of parents want their children to learn computer science. In addition, schools with higher percentages of students living in households below the poverty line are even less likely to offer rigorous computer science courses.
Increasing access to computer science for all learners requires tremendous commitment from a wide range of stakeholders, and we strive to be a strong supportive partner of these efforts. Our new CS EDU website shows all the ways Google is working to address the need for improved access to high quality computer science learning in formal and informal education. Some current programs you’ll find there include:
CS First: providing more than 360,000 middle school students with an opportunity to create technology through free computer science clubs
Exploring Computational Thinking
: sharing more than 130 lesson plans aligned to international standards for students aged 8 to 18
: offering support and mentoring to address the retention problem in diverse student populations at the undergraduate level in more than 40 universities and counting
Blockly and other programming tools powering Code.org’s Hour of Code (2 million users)
Google’s Made with Code: a movement that inspires millions of girls to learn to code and to see it as a means to pursue their dream careers (more than 10 million unique visitors)
...and many more!
Computer science education is a pathway to innovation, to creativity and to exciting career opportunities, and Google believes that all students deserve these opportunities. That is why we are committed to developing programs, resources, tools and community partnerships that make computer science engaging and accessible for all students. With the launch of our
CS EDU website
, all of these programs are at your fingertips.
Genomic Data Processing on Google Cloud Platform
Tuesday, April 05, 2016
Posted by Dr. Stacey Gabriel, Director of the Genomics Platform at the Broad Institute of MIT and Harvard
Today we hear from Broad Institute of MIT and Harvard about how their researchers and software engineers are collaborating closely with the Google Genomics team on large-scale genomic data analysis. They’ve already reduced the time and cost for whole genome processing by several fold, helping researchers think even bigger. Broad’s open source tools, developed in close
collaboration with Google Genomics
, will also be made available to the wider research community.
– Jonathan Bingham, Product Manager, Google Genomics
Dr. Stacey Gabriel, Director of the
Genomics Platform at the Broad Institute
As one of the largest genome sequencing centers in the world, the
of MIT and Harvard generates a lot of data. Our DNA sequencers produce more than 20 Terabytes (TB) of genomic data per day, and they run 365 days a year. Moreover, our rate of data generation is not only growing, but accelerating – our output increased more than two-fold last year, and nearly two-fold the previous year. We are not alone in facing this embarrassment of riches; across the whole genomics community, the rate of data production is doubling about every eight months with no end in sight.
Here at Broad, our team of software engineers and methods developers has spent the last year working to re-architect our production sequencing environment for the cloud. This has been no small feat, especially as we had to build the plane while we flew it! It required an entirely new system for developing and deploying pipelines, as well as a new framework for wet lab quality control that uncouples data generation from data processing.
Courtesy: Broad Institute of MIT and Harvard
Last summer Broad and Google
announced a collaboration
to develop a safe, secure and scalable cloud computing infrastructure capable of storing and processing enormous datasets. We also set out to build cloud-supported tools to analyze such data and unravel long-standing mysteries about human health. Our engineers collaborate closely; we teach them about genomic data science and genomic data engineering, and they teach us about cloud computing and distributed systems. To us, this is a wonderful model for how a basic research institute can productively collaborate with industry to advance science and medicine. Both groups move faster and go further by working together.
As of today, the largest and most important of our production pipelines, the
Whole Genome Sequencing Pipeline
, has been completely ported to the
Google Cloud Platform
(GCP). We are now beginning to run production jobs on GCP and will be switching over entirely this month. This switch has proved to be a very cost-effective decision. While the conventional wisdom is that public clouds can be more expensive, our experience is that cloud is dramatically cheaper. Consider the curve below, which my colleague Kristian Cibulskis recently presented.
Out of the box, the cost of running the
Genome Analysis Toolkit
(GATK) best practices pipeline on a 30X-coverage whole genome was roughly the same as the cost of our on-premise infrastructure. Over a period of a few months, however, we developed techniques that allowed us to reduce costs: we learned how to parallelize the computationally intensive steps like aligning DNA sequences against a reference genome, and we optimized for GCP’s infrastructure to lower costs by using features such as Preemptible VMs. After doing these optimizations, our production whole genome pipeline was about 20% the cost of where we were when we started, saving our researchers millions of dollars, all while reducing processing turnaround time eight-fold.
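The parallelization described above follows a scatter-gather pattern: split the input into shards, process the shards concurrently, then merge the results. A toy Python sketch of the pattern, where `align_shard` is a hypothetical stand-in, not GATK or a real aligner:

```python
from concurrent.futures import ThreadPoolExecutor

def align_shard(shard):
    """Stand-in for aligning one shard of reads against the reference
    genome; a real pipeline would run an actual aligner per shard."""
    return [read.upper() for read in shard]  # pretend "alignment"

def parallel_align(reads, num_shards=4):
    """Scatter reads into shards, align the shards concurrently,
    then gather the per-shard results back together."""
    shards = [reads[i::num_shards] for i in range(num_shards)]
    with ThreadPoolExecutor(max_workers=num_shards) as pool:
        aligned = pool.map(align_shard, shards)
    return [read for shard in aligned for read in shard]

reads = ["acgt", "ttga", "ccat", "gatc"]
print(sorted(parallel_align(reads)))  # ['ACGT', 'CCAT', 'GATC', 'TTGA']
```

Because the shards are independent, the same pattern scales from threads on one machine to thousands of cloud workers, which is what makes the step a good fit for GCP.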
There is a similar story to be told on storage of the input and output data.
Google Cloud Storage Nearline
is a medium for storing DNA sequence alignments and raw data. Like most people in genomics, we access genetic variants data every day, but raw DNA sequences only a few times per year, such as when there is a new algorithm that requires raw data or a new assembly of the human genome. Nearline’s price/performance tradeoff is well-suited to data that’s infrequently accessed. By using Nearline, along with some compression tricks, we were able to reduce our storage costs by more than 50%.
Altogether, we estimate that, by using GCP services for both compute and storage, we will be able to lower the total cost of ownership for storing and processing genomic data significantly relative to our on-premise costs. Looking forward, we also see advantages for data sharing, particularly for large multi-group genome projects. An environment where the data can be securely stored and analyzed will solve problems of multiple groups copying and paying for transmission and storage of the same data.
Porting the GATK whole genome pipeline to the cloud is just the starting point. During the coming year, we plan to migrate the bulk of our production pipelines to the cloud, including tools for arrays, exomes, cancer genomes, and RNA-seq. Moreover, our non-exclusive relationship with Google is founded on the principle that our groups can leverage complementary skills to make products that can not only serve the needs of Broad, but also help serve the needs of researchers around the world. Therefore, as we migrate each of our pipelines to the cloud to meet our own needs, we also plan to make them available to the greater genomics community through a Software-as-a-Service model.
This is an exciting time for us at Broad. For more than a decade we have served the genomics community by acting as a hub for data generation; now, we are extending this mission to encompass not only sequencing services, but also data services. We believe that expanding access to our tools and optimizing our pipelines for the cloud will enable the community to benefit from the enormous effort we have invested. We look forward to expanding the scope of this mission in the years to come.
Lessons learned while protecting Gmail
Tuesday, March 29, 2016
Posted by Elie Bursztein - anti-abuse & security research, Nicolas Lidzborski - Gmail security engineering, and Vijay Eranti - Gmail anti-abuse engineering
Earlier this year in San Francisco, USENIX hosted their inaugural Enigma conference, which focused on security, privacy and electronic crime through the lens of emerging threats and novel attacks. We were excited to help make this conference happen and to participate in it.
At the conference, we heard from a variety of terrific speakers including:
Ron Rivest, Professor at MIT and inventor of RSA, who spoke about
the consequences of backdooring encryption
Rob Joyce, Chief of the NSA Tailored Access Operations organization, who spoke about
defending against state attackers
George “Geohot” Hotz
, Hacker extraordinaire, who discussed
state of the art software debugging
In addition, we were able to
share the lessons we’ve learned
about protecting Gmail users since it was launched over a decade ago. Those lessons are summarized in the infographic below (the talk slides are also available).
We were proud to sponsor this year's inaugural Enigma conference, and it is our hope that the core lessons that we have learned over the years can benefit other online products and services. We're looking forward to participating again next year when
Enigma returns in 2017
. We hope to see you there!