Welcome!

@DevOpsSummit Authors: Elizabeth White, Liz McMillan, Yeshim Deniz, Pat Romanski, Mehdi Daoudi

Related Topics: Artificial Intelligence, Machine Learning , @CloudExpo

Artificial Intelligence: Article

AWS Broke the Internet Again or, Better, a Typo | @CloudExpo #AI #ML #DL

An AI-defined infrastructure can help to avoid service disruptions

Amazon Web Services (AWS) broke the Internet again or better "a typo". On February 28, 2017, an Amazon S3 service disruption in AWS' oldest region US-EAST-1 shuts down several major websites and services like Slack, Trello, Quora, Business Insider, Coursera and Time Inc. Other users were reporting that they were also unable to control devices which were connected via the Internet of Things since IFTTT was also down. Those kinds of disruptions are becoming more and more business critical for today's digital economy. To prevent these situations, cloud users should always consider the shared responsibility model in the public cloud. However, there are also ways where Artificial Intelligence (AI) can help. This article describes that an AI-defined Infrastructure respectively an AI-powered IT management system can help to avoid service disruptions of public cloud providers.

Amazon S3 Service Disruption - What has happened
After every service disruption AWS writes a summary of what was going on during an incident. This is what happened on the morning of February 28.

"The Amazon Simple Storage Service (S3) team was debugging an issue causing the S3 billing system to progress more slowly than expected. At 9:37AM PST, an authorized S3 team member using an established playbook executed a command which was intended to remove a small number of servers for one of the S3 subsystems that is used by the S3 billing process. Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended. The servers that were inadvertently removed supported two other S3 subsystems.  One of these subsystems, the index subsystem, manages the metadata and location information of all S3 objects in the region. This subsystem is necessary to serve all GET, LIST, PUT, and DELETE requests. The second subsystem, the placement subsystem, manages allocation of new storage and requires the index subsystem to be functioning properly to correctly operate. The placement subsystem is used during PUT requests to allocate storage for new objects. Removing a significant portion of the capacity caused each of these systems to require a full restart. While these subsystems were being restarted, S3 was unable to service requests. Other AWS services in the US-EAST-1 Region that rely on S3 for storage, including the S3 console, Amazon Elastic Compute Cloud (EC2) new instance launches, Amazon Elastic Block Store (EBS) volumes (when data was needed from a S3 snapshot), and AWS Lambda were also impacted while the S3 APIs were unavailable."

Read more under "Summary of the Amazon S3 Service Disruption in the Northern Virginia (US-EAST-1) Region".

Bottom line, a typo crashed the AWS powered Internet! AWS outages already have a long history and the more AWS customers running their web infrastructure on the cloud giant, the more issues end customers will experience in the future. According to SimilarTech only Amazon S3 is already used by 152,123 websites and 124,577 unique domains.

However, following the philosophy of "Everything fails all the time (Werner Vogels, CTO Amazon.com)" means if you are using AWS you must "Design for Failure".  Something cloud role model and video on demand provider Netflix is doing in perfection. In doing so, Netflix has developed its Simian Army an open source toolset everyone can use to run a cloud infrastructure on AWS high-available.

Netflix "simply" uses the two levels of redundancy AWS offers. Multiple regions and multiple availability zones (AZ). Multiple regions are the masterclass of using AWS, very complex and sophisticated since you must build and manage entire separated infrastructure environments within AWS' worldwide distributed cloud infrastructure. Multiple AZs are the preferred and "easiest" way for high availability (HA) on AWS. In this case, the infrastructure is built within more than one data center (AZ). In doing so, a single region HA architecture is deployed in at least two or more AZs - a load balancer in front of it is controlling the data traffic.

However, even if "typos" shouldn't happen the recent accident shows, that human error is still the biggest issue running IT systems. In addition, you can blame AWS only to a certain extend since the public cloud is about shared responsibility.

Shared Responsibility in the Public Cloud
An important public cloud detail is the self-service. Depending on its DNA the providers are only taking responsibility for specific areas. The customer is responsible for the rest. In the public cloud, it is about sharing responsibilities - this model is called Shared Responsibility. The provider and its customers divide the field of duties among themselves. In doing so, the customer's self-responsibility plays a major role. In the context of IaaS utilization, the provider is responsible for the operations and security of the physical environment. He is taking care of:

  • Set up and maintenance of the entire data center infrastructure.
  • Deployment of compute power, storage, network and managed services (like databases) and other micro services.
  • Provisioning the virtualization layer customers are using to demand virtual resources at any time.
  • Deployment of services and tools customers can use to manage their areas of responsibility.

The customer is responsible for the operations and security of the logical environment. This includes:

  • Set up of the virtual infrastructure.
  • Installation of operating systems.
  • Configuration of networks and firewall settings.
  • Operations of own applications and self-developed (micro) services.

Thus, the customer is responsible for the operations and security of his own infrastructure environment and the systems, applications, services, as well as stored data on top of it. However, providers like Amazon Web Services or Microsoft Azure provide comprehensive tools and services customers can use e.g. to encrypt their data as well as ensure identity and access controls. In addition, enablement services (micro services) exist that customers can adopt to develop own applications more quickly and easily.

In doing so, the customer is all alone in its area of responsibility and thus must take self-responsibility. However, this part of the shared responsibility can be done by an AI-defined IT management system respectively an AI-defined Infrastructure.

An AI-defined Infrastructure can help to avoid Service Disruptions
An AI-defined Infrastructure can help to avoid service disruptions in the public cloud. However, the basis of this kind of infrastructure is a General AI that combines three major human abilities that enable enterprises to tackle IT and business process challenges.

  • Understanding: By creating a semantic data map the General AI understands the world of the company in which its IT and business exists.
  • Learning: By creating Knowledge Items the General AI learns best practices and reasoning from experts. Knowledge is taught in atomic pieces of information (Knowledge Items) that represent separate steps of a process.
  • Solving: With machine reasoning problems are solved in ambiguous and changing environments. The General AI dynamically reacts to the ever-changing context, selecting the best course of action. Based on machine learning the results are optimized through experiments.

To put this into the context of an AWS service disruption:

  • Understanding: The General AI creates a semantic map of the AWS environment as part of the world in which the company exists.
  • Learning: IT experts create Knowledge Items while they are configuring and working with AWS from what the General AI learns best practices. Thus, the experts teach the General AI contextual knowledge that includes what, when, where and why something needs to be done - for example when a specific AWS service is not responding.
  • Solving: The General AI dynamically reacts to incidents based on the learned knowledge. Thus, the AI (probably) knows what to do at this very moment - even if no high availability setup was considered from the beginning.

Frankly speaking, everything described above is no magic. Like every new born organism an AI-defined Infrastructure needs to be trained but afterwards can work autonomously as well as can detect anomalies as well as service disruptions in the public cloud and solve them. Therefore, you need the knowledge of experts who have a deep understanding of AWS and how the cloud works in general. These experts need to teach the General AI with their contextual knowledge that includes not only what, when and where but also why. They have to teach the AI with atomic pieces (Knowledge Items, KI) that can be indexed and prioritized by the AI. Context and indexing enable this KIs to be combined to form many solutions.

KIs created by various IT experts create pooled expertise that is further optimized by machine selection of best knowledge combinations for problem resolution. This type of collaborative learning improves process time task by task. However, the number of possible permutations grows exponentially with added knowledge. Connected to a knowledge core, the General AI continuously optimizes performance by eliminating unnecessary steps and even changing routes based on other contextual learning. And the bigger the semantic graph and knowledge core gets, the better and more dynamically the infrastructure can act in terms of service disruptions.

On a final note, do not underestimate the "power of we"! Our research at Arago revealed that with an overlap of 33 percent in basic knowledge, this knowledge can and is used outside a specific organizational environment, i.e. across different client environments. The reuse of knowledge within a client is up to 80 percent. Thus, exchanging basic knowledge within a community becomes imperative from an efficiency perspective and improve the abilities of the General AI.

More Stories By Rene Buest

Rene Buest is Director of Market Research & Technology Evangelism at Arago. Prior to that he was Senior Analyst and Cloud Practice Lead at Crisp Research, Principal Analyst at New Age Disruption and member of the worldwide Gigaom Research Analyst Network. At this time he was considered a top cloud computing analyst in Germany and one of the worldwide top analysts in this area. In addition, he was one of the world’s top cloud computing influencers and belongs to the top 100 cloud computing experts on Twitter and Google+. Since the mid-90s he is focused on the strategic use of information technology in businesses and the IT impact on our society as well as disruptive technologies.

Rene Buest is the author of numerous professional technology articles. He regularly writes for well-known IT publications like Computerwoche, CIO Magazin, LANline as well as Silicon.de and is cited in German and international media – including New York Times, Forbes Magazin, Handelsblatt, Frankfurter Allgemeine Zeitung, Wirtschaftswoche, Computerwoche, CIO, Manager Magazin and Harvard Business Manager. Furthermore he is speaker and participant of experts rounds. He is founder of CloudUser.de and writes about cloud computing, IT infrastructure, technologies, management and strategies. He holds a diploma in computer engineering from the Hochschule Bremen (Dipl.-Informatiker (FH)) as well as a M.Sc. in IT-Management and Information Systems from the FHDW Paderborn.

@DevOpsSummit Stories
"We are an IT services solution provider and we sell software to support those solutions. Our focus and key areas are around security, enterprise monitoring, and continuous delivery optimization," noted John Balsavage, President of A&I Solutions, in this SYS-CON.tv interview at 20th Cloud Expo, held June 6-8, 2017, at the Javits Center in New York City, NY.
DevOps at Cloud Expo, taking place October 31 - November 2, 2017, at the Santa Clara Convention Center in Santa Clara, CA, is co-located with 21st Cloud Expo and will feature technical sessions from a rock star conference faculty and the leading industry players in the world. The widespread success of cloud computing is driving the DevOps revolution in enterprise IT. Now as never before, development teams must communicate and collaborate in a dynamic, 24/7/365 environment. There is no time to wait for long development cycles that produce software that is obsolete at launch. DevOps may be disruptive, but it is essential.
DX World EXPO, LLC., a Lighthouse Point, Florida-based startup trade show producer and the creator of "DXWorldEXPO® - Digital Transformation Conference & Expo" has announced its executive management team. The team is headed by Levent Selamoglu, who has been named CEO. "Now is the time for a truly global DX event, to bring together the leading minds from the technology world in a conversation about Digital Transformation," he said in making the announcement.
"At the keynote this morning we spoke about the value proposition of Nutanix, of having a DevOps culture and a mindset, and the business outcomes of achieving agility and scale, which everybody here is trying to accomplish," noted Mark Lavi, DevOps Solution Architect at Nutanix, in this SYS-CON.tv interview at @DevOpsSummit at 20th Cloud Expo, held June 6-8, 2017, at the Javits Center in New York City, NY.
Internet of @ThingsExpo, taking place October 31 - November 2, 2017, at the Santa Clara Convention Center in Santa Clara, CA, is co-located with 21st Cloud Expo and will feature technical sessions from a rock star conference faculty and the leading industry players in the world. The Internet of Things (IoT) is the most profound change in personal and enterprise IT since the creation of the Worldwide Web more than 20 years ago. All major researchers estimate there will be tens of billions devices - computers, smartphones, tablets, and sensors - connected to the Internet by 2020. This number will continue to grow at a rapid pace for the next several decades. With major technology companies and startups seriously embracing IoT strategies, now is the perfect time to attend @ThingsExpo in Silicon Valley. Learn what is going on, contribute to the discussions, and ensure that your enterprise...
SYS-CON Events announced today that Calligo, an innovative cloud service provider offering mid-sized companies the highest levels of data privacy and security, has been named "Bronze Sponsor" of SYS-CON's 21st International Cloud Expo ®, which will take place on Oct 31 - Nov 2, 2017, at the Santa Clara Convention Center in Santa Clara, CA. Calligo offers unparalleled application performance guarantees, commercial flexibility and a personalised support service from its globally located cloud platforms. Through its four pillars of focus, Calligo delivers a platform that businesses can trust to deliver the high level of service and protection they expect and is lacking in many cloud offerings.
"With Digital Experience Monitoring what used to be a simple visit to a web page has exploded into app on phones, data from social media feeds, competitive benchmarking - these are all components that are only available because of some type of digital asset," explained Leo Vasiliou, Director of Web Performance Engineering at Catchpoint Systems, in this SYS-CON.tv interview at DevOps Summit at 20th Cloud Expo, held June 6-8, 2017, at the Javits Center in New York City, NY.
SYS-CON Events announced today that Datera, that offers a radically new data management architecture, has been named "Exhibitor" of SYS-CON's 21st International Cloud Expo ®, which will take place on Oct 31 - Nov 2, 2017, at the Santa Clara Convention Center in Santa Clara, CA. Datera is transforming the traditional datacenter model through modern cloud simplicity. The technology industry is at another major inflection point. The rise of mobile, the Internet of Things, data storage and Big Data are challenging the existing design patterns of the Data Center. The increase in complexity of managing legacy systems alongside new systems is beyond the ability of most IT departments. This leads to multiple tiers of storage and high economic costs, during a time in which IT is expected to do more with less.
SYS-CON Events announced today that DXWorldExpo has been named “Global Sponsor” of SYS-CON's 21st International Cloud Expo, which will take place on Oct 31 – Nov 2, 2017, at the Santa Clara Convention Center in Santa Clara, CA. Digital Transformation is the key issue driving the global enterprise IT business. Digital Transformation is most prominent among Global 2000 enterprises and government institutions.
Kubernetes is an open source system for automating deployment, scaling, and management of containerized applications. Kubernetes was originally built by Google, leveraging years of experience with managing container workloads, and is now a Cloud Native Compute Foundation (CNCF) project. Kubernetes has been widely adopted by the community, supported on all major public and private cloud providers, and is gaining rapid adoption in enterprises. However, Kubernetes may seem intimidating and complex to learn. This is because Kubernetes is more of a toolset than a ready solution. Hence it’s essential to know when and how to apply the appropriate Kubernetes constructs.
"I think DevOps is now a rambunctious teenager – it’s starting to get a mind of its own, wanting to get its own things but it still needs some adult supervision," explained Thomas Hooker, VP of marketing at CollabNet, in this SYS-CON.tv interview at DevOps Summit at 20th Cloud Expo, held June 6-8, 2017, at the Javits Center in New York City, NY.
"I will be talking about ChatOps and ChatOps as a way to solve some problems in the DevOps space," explained Himanshu Chhetri, CTO of Addteq, in this SYS-CON.tv interview at @DevOpsSummit at 20th Cloud Expo, held June 6-8, 2017, at the Javits Center in New York City, NY.
Your homes and cars can be automated and self-serviced. Why can't your storage? From simply asking questions to analyze and troubleshoot your infrastructure, to provisioning storage with snapshots, recovery and replication, your wildest sci-fi dream has come true. In his session at @DevOpsSummit at 20th Cloud Expo, Dan Florea, Director of Product Management at Tintri, provided a ChatOps demo where you can talk to your storage and manage it from anywhere, through Slack and similar services with Tintri's web services architecture and APIs. Impress your DevOps team with smart and autonomous infrastructure.
As enterprise cloud becomes the norm, businesses and government programs must address compounded regulatory compliance related to data privacy and information protection. The most recent, Controlled Unclassified Information and the EU’s GDPR have board level implications and companies still struggle with demonstrating due diligence. Developers and DevOps leaders, as part of the pre-planning process and the associated supply chain, could benefit from updating their code libraries and design by incorporating changes.
"I'm here to leverage my secret sauce, which is using outsourced development and the company that I utilize is delaPlex Software and they've basically allowed me to win Fortune 500 companies," noted Justin Witz, CTO of FRA and PlanTools, in this SYS-CON.tv interview at 20th Cloud Expo, held June 6-8, 2017, at the Javits Center in New York City, NY.
SYS-CON Events announced today that DXWorldExpo has been named “Global Sponsor” of SYS-CON's 21st International Cloud Expo, which will take place on Oct 31 – Nov 2, 2017, at the Santa Clara Convention Center in Santa Clara, CA. Digital Transformation is the key issue driving the global enterprise IT business. Digital Transformation is most prominent among Global 2000 enterprises and government institutions.
In his opening keynote at 20th Cloud Expo, Michael Maximilien, Research Scientist, Architect, and Engineer at IBM, discussed the full potential of the cloud and social data requires artificial intelligence. By mixing Cloud Foundry and the rich set of Watson services, IBM's Bluemix is the best cloud operating system for enterprises today, providing rapid development and deployment of applications that can take advantage of the rich catalog of Watson services to help drive insights from the vast trove of private and public data available to enterprises.
SYS-CON Events announced today that EnterpriseTech has been named “Media Sponsor” of SYS-CON's 21st International Cloud Expo, which will take place on Oct 31 – Nov 2, 2017, at the Santa Clara Convention Center in Santa Clara, CA. EnterpriseTech is a professional resource for news and intelligence covering the migration of high-end technologies into the enterprise and business-IT industry, with a special focus on high-tech solutions in new product development, workload management, increased efficiency, and maximizing competitive edge.
"We began as LinuxAcademy.com about five years ago as a very small outfit. Since then we've transitioned into more of a DevOps training company - the technologies and the tooling around DevOps," explained Doug Vanderweide, an instructor at Linux Academy, in this SYS-CON.tv interview at 20th Cloud Expo, held June 6-8, 2017, at the Javits Center in New York City, NY.
DevOps is often described as a combination of technology and culture. Without both, DevOps isn't complete. However, applying the culture to outdated technology is a recipe for disaster; as response times grow and connections between teams are delayed by technology, the culture will die. A Nutanix Enterprise Cloud has many benefits that provide the needed base for a true DevOps paradigm. In their Day 3 Keynote at 20th Cloud Expo, Chris Brown, a Solutions Marketing Manager at Nutanix, and Mark Lavi, a Nutanix DevOps Solution Architect, explored the ways that Nutanix technologies empower teams to react faster than ever before and connect teams in ways that were either too complex or simply impossible with traditional infrastructures.
"NetApp's vision is how we help organizations manage data - delivering the right data in the right place, in the right time, to the people who need it, and doing it agnostic to what the platform is," explained Josh Atwell, Developer Advocate for NetApp, in this SYS-CON.tv interview at 20th Cloud Expo, held June 6-8, 2017, at the Javits Center in New York City, NY.
SYS-CON Events announced today that Massive Networks, that helps your business operate seamlessly with fast, reliable, and secure internet and network solutions, has been named "Exhibitor" of SYS-CON's 21st International Cloud Expo ®, which will take place on Oct 31 - Nov 2, 2017, at the Santa Clara Convention Center in Santa Clara, CA. As a premier telecommunications provider, Massive Networks is headquartered out of Louisville, Colorado. With years of experience under their belt, their team of engineers can navigate the Carrier Ecosystem for your IT team acting as an extension of your business, producing a hassle-free experience.
SYS-CON Events announced today that Cloud Academy named "Bronze Sponsor" of 21st International Cloud Expo which will take place October 31 - November 2, 2017 at the Santa Clara Convention Center in Santa Clara, CA. Cloud Academy is the industry’s most innovative, vendor-neutral cloud technology training platform. Cloud Academy provides continuous learning solutions for individuals and enterprise teams for Amazon Web Services, Microsoft Azure, Google Cloud Platform, and the most popular cloud computing technologies. Get certified, manage the full lifecycle of your cloud-based resources, and build your knowledge based using Cloud Academy’s expert-created content, comprehensive Learning Paths, and innovative Hands-on Labs.
SYS-CON Events announced today that Cloudistics, an on-premises cloud computing company, has been named “Bronze Sponsor” of SYS-CON's 21st International Cloud Expo, which will take place on Oct 31 - Nov 2, 2017, at the Santa Clara Convention Center in Santa Clara, CA. Cloudistics delivers a complete public cloud experience with composable on-premises infrastructures to medium and large enterprises. Its software-defined technology natively converges network, storage, compute, virtualization, and management into a single platform to drive unprecedented simplicity in the data center. Customers can start with a base infrastructure and scale to multi-site and multi-geo infrastructures with predictable economics and performance.
SYS-CON Events announced today that CHEETAH Training & Innovation will exhibit at SYS-CON's 21st International Cloud Expo®, which will take place on Oct. 31 – Nov 2, 2017, at the Santa Clara Convention Center in Santa Clara, CA. CHEETAH Training & Innovation is a cloud consulting and IT training firm specializing in improving clients cloud strategies and infrastructures for medium to large companies.