Welcome!

@DevOpsSummit Authors: Pat Romanski, Carmen Gonzalez, Mehdi Daoudi, XebiaLabs Blog, Elizabeth White

Related Topics: FinTech Journal, @CloudExpo, @BigDataExpo, @ThingsExpo

FinTech Journal: Blog Post

Data Unification at Scale | @CloudExpo #BigData #DataLake #AI #Analytics

This term Data Unification is new in the Big Data lexicon, pushed by varieties of companies

This term Data Unification is new in the Big Data lexicon, pushed by varieties of companies such as Talend, 1010Data, and TamR. Data unification deals with the domain known as ETL (Extraction, Transformation, Loading), initiated during the 1990s when Data Warehousing was gaining relevance. ETL refers to the process of extracting data from inside or outside sources (multiple applications typically developed and supported by different vendors or hosted on separate hardware), transform it to fit operational needs (based on business rules), and load it into end target databases, more specifically, an operational data store, data mart, or a data warehouse. These are read-only databases for analytics. Initially the analytics was mostly retroactive (e.g. how many shoppers between age 25-35 bought this item between May and July?). This was like driving a car looking at the rear-view mirror. Then forward-looking analysis (called data mining) started to appear. Now business also demands "predictive analytics" and "streaming analytics".

During my IBM and Oracle days, the ETL in the first phase was left for outside companies to address. This was unglamorous work and key vendors were not that interested to solve this. This gave rise to many new players such as Informatica, Datastage, Talend and it became quite a thriving business. We also see many open-source ETL companies.

The ETL methodology consisted of: constructing a global schema in advance, for each local data source write a program to understand the source and map to the global schema, then write a script to transform, clean (homonym and synonym issues) and dedup (get rid of duplicates) it. Programs were set up to build the ETL pipeline. This process has matured over 20 years and is used today for data unification problems. The term MDM (Master Data Management) points to a master representation of all enterprise objects, to which everybody agrees to confirm.

In the world of Big Data, this approach is very inadequate. Why?

  • Data unification at scale is a very big deal. The schema-first approach works fine with retail data (sales transactions, not many data sources,..), but gets extremely hard with sources that can be hundreds or even thousands. This gets worse when you want to unify public data from the web with enterprise data.
  • Human labor to map each source to a master schema gets to be costly and excessive. Here machine learning is required and domain experts should be asked to augment where needed.
  • Real-time data unification of streaming data and analysis can not be handled by these solutions.

Another solution called "data lake" where you store disparate data in their native format, seems to address the "ingest" problem only. It tries to change the order of ETL to ELT (first load then transform). However it does not address the scale issues. The new world needs bottoms-up data unification (schema-last) in real-time or near real-time.

The typical data unification cycle can go like this - start with a few sources, try enriching the data with say X, see if it works, if you fail then loop back and try again. Use enrichment to improve and do everything automatically using machine learning and statistics. But iterate furiously. Ask for help when needed from domain experts. Otherwise the current approach of ETL or ELT can get very expensive.

  • LikeData Unification at scale
  • Comment
  • ShareShare Data Unification at scale



Read the original blog entry...

More Stories By Jnan Dash

Jnan Dash is Senior Advisor at EZShield Inc., Advisor at ScaleDB and Board Member at Compassites Software Solutions. He has lived in Silicon Valley since 1979. Formerly he was the Chief Strategy Officer (Consulting) at Curl Inc., before which he spent ten years at Oracle Corporation and was the Group Vice President, Systems Architecture and Technology till 2002. He was responsible for setting Oracle's core database and application server product directions and interacted with customers worldwide in translating future needs to product plans. Before that he spent 16 years at IBM. He blogs at http://jnandash.ulitzer.com.

@DevOpsSummit Stories
SYS-CON Events announced today that Juniper Networks (NYSE: JNPR), an industry leader in automated, scalable and secure networks, will exhibit at SYS-CON's 20th International Cloud Expo®, which will take place on June 6-8, 2017, at the Javits Center in New York City, NY. Juniper Networks challenges the status quo with products, solutions and services that transform the economics of networking. The company co-innovates with customers and partners to deliver automated, scalable and secure networks with agility, performance and value.
The 20th International Cloud Expo has announced that its Call for Papers is open. Cloud Expo, to be held June 6-8, 2017, at the Javits Center in New York City, brings together Cloud Computing, Big Data, Internet of Things, DevOps, Containers, Microservices and WebRTC to one location. With cloud computing driving a higher percentage of enterprise IT budgets every year, it becomes increasingly important to plant your flag in this fast-expanding business opportunity. Submit your speaking proposal today!
SYS-CON Events announced today that Hitachi, the leading provider the Internet of Things and Digital Transformation, will exhibit at SYS-CON's 20th International Cloud Expo®, which will take place on June 6-8, 2017, at the Javits Center in New York City, NY. Hitachi Data Systems, a wholly owned subsidiary of Hitachi, Ltd., offers an integrated portfolio of services and solutions that enable digital transformation through enhanced data management, governance, mobility and analytics. We help global organizations open new revenue streams, increase efficiencies, improve customer experience and ensure rapid time to market in the digital age. Only Hitachi Data Systems powers the digital enterprise by integrating the best information technology and operational technology from across the Hitachi family of companies. We combine this experience with Hitachi expertise in the internet of things to d...
The goal of Continuous Testing is to shift testing left to find defects earlier and release software faster. This can be achieved by integrating a set of open source functional and performance testing tools in the early stages of your software delivery lifecycle. There is one process that binds all application delivery stages together into one well-orchestrated machine: Continuous Testing. Continuous Testing is the conveyor belt between the Software Factory and production stages. Artifacts are moved from one stage to the next only after they have been tested and approved to continue. New code submitted to the repository is tested upon commit. When tests fail, the code is rejected. Subsystems are approved as part of periodic builds on their way to the delivery stage, where the system is being tested as production ready. The release process stops when tests fail. The key is to shift test ...
Back in February of 2017, Andrew Clay Schafer of Pivotal tweeted the following: “seriously tho, the whole software industry is stuck on deployment when we desperately need architecture and telemetry.” Intrigue in a 140 characters. For me, I hear Andrew saying, “we’re jumping to step 5 before we’ve successfully completed steps 1-4.”
SYS-CON Events announced today that Hitachi, the leading provider the Internet of Things and Digital Transformation, will exhibit at SYS-CON's 20th International Cloud Expo®, which will take place on June 6-8, 2017, at the Javits Center in New York City, NY. Hitachi Data Systems, a wholly owned subsidiary of Hitachi, Ltd., offers an integrated portfolio of services and solutions that enable digital transformation through enhanced data management, governance, mobility and analytics. We help global organizations open new revenue streams, increase efficiencies, improve customer experience and ensure rapid time to market in the digital age. Only Hitachi Data Systems powers the digital enterprise by integrating the best information technology and operational technology from across the Hitachi family of companies. We combine this experience with Hitachi expertise in the internet of things to d...
SYS-CON Events announced today that T-Mobile will exhibit at SYS-CON's 20th International Cloud Expo®, which will take place on June 6-8, 2017, at the Javits Center in New York City, NY. As America's Un-carrier, T-Mobile US, Inc., is redefining the way consumers and businesses buy wireless services through leading product and service innovation. The Company's advanced nationwide 4G LTE network delivers outstanding wireless experiences to 67.4 million customers who are unwilling to compromise on quality and value.
“DevOps is really about the business. The business is under pressure today, competitively in the marketplace to respond to the expectations of the customer. The business is driving IT and the problem is that IT isn't responding fast enough," explained Mark Levy, Senior Product Marketing Manager at Serena Software, in this SYS-CON.tv interview at DevOps Summit, held Nov 4–6, 2014, at the Santa Clara Convention Center in Santa Clara, CA.
20th Cloud Expo, taking place June 6-8, 2017, at the Javits Center in New York City, NY, will feature technical sessions from a rock star conference faculty and the leading industry players in the world. Cloud computing is now being embraced by a majority of enterprises of all sizes. Yesterday's debate about public vs. private has transformed into the reality of hybrid cloud: a recent survey shows that 74% of enterprises have a hybrid cloud strategy.
SYS-CON Events announced today that Grape Up will exhibit at SYS-CON's 21st International Cloud Expo®, which will take place on Oct. 31 – Nov 2, 2017, at the Santa Clara Convention Center in Santa Clara, CA. Grape Up is a software company specializing in cloud native application development and professional services related to Cloud Foundry PaaS. With five expert teams that operate in various sectors of the market across the U.S. and Europe, Grape Up works with a variety of customers from emerging startups to Fortune 1000 companies.
SYS-CON Events announced today that Super Micro Computer, Inc., a global leader in compute, storage and networking technologies, will exhibit at SYS-CON's 20th International Cloud Expo®, which will take place on June 6-8, 2017, at the Javits Center in New York City, NY. Supermicro (NASDAQ: SMCI), the leading innovator in high-performance, high-efficiency server technology, is a premier provider of advanced server Building Block Solutions® for Data Center, Cloud Computing, Enterprise IT, Hadoop/Big Data, HPC and Embedded Systems worldwide. Supermicro is committed to protecting the environment through its “We Keep IT Green®” initiative and provides customers with the most energy-efficient, environmentally friendly solutions available on the market.
Everyone wants to use containers, but monitoring containers is hard. New ephemeral architecture introduces new challenges in how monitoring tools need to monitor and visualize containers, so your team can make sense of everything. In his session at @DevOpsSummit, David Gildeh, co-founder and CEO of Outlyer, will go through the challenges and show there is light at the end of the tunnel if you use the right tools and understand what you need to be monitoring to successfully use containers in your environments.
Developers want to create better apps faster. Static clouds are giving way to scalable systems, with dynamic resource allocation and application monitoring. You won't hear that chant from users on any picket line, but helping developers to create better apps faster is the mission of Lee Atchison, principal cloud architect and advocate at New Relic Inc., based in San Francisco. His singular job is to understand and drive the industry in the areas of cloud architecture, microservices, scalability and availability. In a keynote presentation, he spoke to a standing-room-only crowd at New York's Cloud Expo about how highly available, highly scalable systems can help developers attain the goal of better apps faster.
Grape Up is a software company, specialized in cloud native application development and professional services related to Cloud Foundry PaaS. With five expert teams that operate in various sectors of the market across the USA and Europe, we work with a variety of customers from emerging startups to Fortune 1000 companies.
SYS-CON Events announced today that Interoute, owner-operator of one of Europe's largest networks and a global cloud services platform, has been named “Bronze Sponsor” of SYS-CON's 20th Cloud Expo, which will take place on June 6-8, 2017 at the Javits Center in New York, New York. Interoute is the owner-operator of one of Europe's largest networks and a global cloud services platform which encompasses 12 data centers, 14 virtual data centers and 31 colocation centers, with connections to 195 additional third-party data centers across Europe. Its full-service Unified ICT platform serves international enterprises and many of the world’s leading service providers, as well as governments and universities.
@DevOpsSummit at Cloud taking place June 6-8, 2017, at Javits Center, New York City, is co-located with the 20th International Cloud Expo and will feature technical sessions from a rock star conference faculty and the leading industry players in the world. The widespread success of cloud computing is driving the DevOps revolution in enterprise IT. Now as never before, development teams must communicate and collaborate in a dynamic, 24/7/365 environment. There is no time to wait for long development cycles that produce software that is obsolete at launch. DevOps may be disruptive, but it is essential.
In recent years, containers have taken the world by storm. Companies of all sizes and industries have realized the massive benefits of containers, such as unprecedented mobility, higher hardware utilization, and increased flexibility and agility; however, many containers today are non-persistent. Containers without persistence miss out on many benefits, and in many cases simply pass the responsibility of persistence onto other infrastructure, adding additional complexity.
Cloud Expo, Inc. has announced today that Aruna Ravichandran, vice president of DevOps Product and Solutions Marketing at CA Technologies, has been named co-conference chair of DevOps at Cloud Expo 2017. The @DevOpsSummit at Cloud Expo New York will take place on June 6-8, 2017, at the Javits Center in New York City, New York, and @DevOpsSummit at Cloud Expo Silicon Valley will take place Oct. 31-Nov. 2, 2017, at the Santa Clara Convention Center in Santa Clara, CA.
Most companies are adopting or evaluating container technology - Docker in particular - to speed up application deployment, drive down cost, ease management and make application delivery more flexible overall. As with most new architectures, this dream takes a lot of work to become a reality. Even when you do get your application componentized enough and packaged properly, there are still challenges for DevOps teams to making the shift to continuous delivery and achieving that reduction in cost and increase in speed.
Translating agile methodology into real-world best practices within the modern software factory has driven widespread DevOps adoption, yet much work remains to expand workflows and tooling across the enterprise. As models evolve from pockets of experimentation into wholescale organizational reinvention, practitioners find themselves challenged to incorporate the culture and architecture necessary to support DevOps at scale. In his session at @DevOpsSummit at 20th Cloud Expo, Anand Akela, Senior Director of DevOps Solutions at CA Technologies, will discuss how existing adopters are employing unified agile and DevOps techniques to engage functional processes and toolchains that deliver increased software quality, faster time-to-market and measurably improved customer experience.
SYS-CON Events announced today that Twistlock, the leading provider of cloud container security solutions, will exhibit at SYS-CON's 20th International Cloud Expo®, which will take place on June 6-8, 2017, at the Javits Center in New York City, NY. Twistlock is the industry's first enterprise security suite for container security. Twistlock's technology addresses risks on the host and within the application of the container, enabling enterprises to consistently enforce security policies, monitor and audit activity and identify and isolate threats in a container or cluster of containers.
In his keynote at 19th Cloud Expo, Sheng Liang, co-founder and CEO of Rancher Labs, discussed the technological advances and new business opportunities created by the rapid adoption of containers. With the success of Amazon Web Services (AWS) and various open source technologies used to build private clouds, cloud computing has become an essential component of IT strategy. However, users continue to face challenges in implementing clouds, as older technologies evolve and newer ones like Docker containers gain prominence. He explored these challenges and how to address them, while considering how containers will influence the direction of cloud computing.
SYS-CON Events announced today that Juniper Networks (NYSE: JNPR), an industry leader in automated, scalable and secure networks, will exhibit at SYS-CON's 20th International Cloud Expo®, which will take place on June 6-8, 2017, at the Javits Center in New York City, NY. Juniper Networks challenges the status quo with products, solutions and services that transform the economics of networking. The company co-innovates with customers and partners to deliver automated, scalable and secure networks with agility, performance and value.
SYS-CON Events announced today that CA Technologies has been named "Platinum Sponsor" of SYS-CON's 20th International Cloud Expo®, which will take place on June 6-8, 2017, at the Javits Center in New York City, New York, and 21st International Cloud Expo, which will take place in November in Silicon Valley, California.
DevOps is being widely accepted (if not fully adopted) as essential in enterprise IT. But as Enterprise DevOps gains maturity, expands scope, and increases velocity, the need for data-driven decisions across teams becomes more acute. DevOps teams in any modern business must wrangle the ‘digital exhaust’ from the delivery toolchain, "pervasive" and "cognitive" computing, APIs and services, mobile devices and applications, the Internet of Things, and now even blockchain.