Companion Website for: Strategic Prototyping for Developing Big Data Systems


Contents:
Case 1
Case 2
Case 3
Case 4
Case 5
Case 6
Case 7
Case 8
Case 9
Flowchart



 

Case 1: Network Security, Intrusion Prevention

 

Business Goals

Big Data

Technologies

Prototyping / Architecture Analysis

• Provide the ability for security analysts to improve intrusion detection techniques
• Observe traffic behavior and make infrastructure adjustments
• Adjust company security policies
• Machine-generated data: 7.5 billion event records per day collected from Intrusion Prevention System devices
• Near real-time reporting touching billions of rows in < 1 min
• ETL: Talend
• Storage/DW: InfoBright EE, HP Vertica
• OLAP: Pentaho Mondrian
• BI: JasperServer Pro

Architecture analysis (addressed risk – support system components performance through selecting proper technologies):

-       Analysis of open source ETL and ROLAP tools (using open source was a project constraint) (1 week)

-       As a result, Talend was selected as the ETL tool and Pentaho Mondrian as the ROLAP engine

 

Prototyping (addressed risk – achieving system performance of selected technologies as an integrated stack):

-       Vertical evolutionary prototype using the selected open source technologies and the InfoBright column-oriented database. Verified performance of data ingestion and queries (1.5 months)

-       Throwaway prototype with HP Vertica to validate it as a replacement for InfoBright (2 weeks)

 

In this project system performance was a major risk. The velocity requirement of data collection was about 7.5 billion event records per day (i.e., > 85,000 per second) from IPSs (Intrusion Prevention Systems). Analytics reports (which “touch” billions of rows) had to be generated in < 1 min.
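The velocity figure can be sanity-checked with simple arithmetic; a minimal sketch of the conversion:

```python
# Back-of-the-envelope check of the Case 1 velocity requirement:
# 7.5 billion event records per day, converted to a per-second ingestion rate.

EVENTS_PER_DAY = 7_500_000_000
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

events_per_second = EVENTS_PER_DAY / SECONDS_PER_DAY

print(f"Sustained ingestion rate: {events_per_second:,.0f} events/sec")
# A steady stream of 7.5B events/day works out to roughly 86,800 events/sec,
# consistent with the "> 85,000 per second" figure quoted above.
```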

This project began with a vertical evolutionary prototype. The main reason for this was that the system was to be deployed as a Virtual Appliance (i.e., a Virtual Machine). It was important to emulate major performance scenarios using all stack technologies (ETL, DWH, OLAP, Reports) and monitor system utilization to make sure that all components stayed within their limits of CPU and memory usage. A throwaway prototype for any particular component could not adequately address this concern: it could test the performance of a single technology, but not an orchestrated system. As a result of this prototyping an issue with InfoBright (the datawarehouse) was detected related to deletion of existing data, but it was possible to work around it without compromising performance.
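The kind of utilization check run during such a prototype can be sketched as follows; the component names and CPU/memory budgets are hypothetical, not the project's actual figures:

```python
# Hypothetical sketch of a utilization check for a vertical prototype: each
# stack component has a CPU/memory budget within the single virtual appliance,
# and sampled usage is compared against it. Component names and budget
# numbers are illustrative, not from the project.

BUDGETS = {           # component: (max CPU %, max memory GB)
    "etl":     (25, 8),
    "dwh":     (40, 24),
    "olap":    (20, 8),
    "reports": (15, 4),
}

def over_budget(samples):
    """Return the components whose sampled usage exceeds their budget.

    samples: {component: (cpu_percent, memory_gb)}
    """
    violations = []
    for name, (cpu, mem) in samples.items():
        max_cpu, max_mem = BUDGETS[name]
        if cpu > max_cpu or mem > max_mem:
            violations.append(name)
    return violations

print(over_budget({"etl": (18, 6), "dwh": (52, 20), "olap": (10, 3), "reports": (5, 1)}))
# here the DWH component exceeds its CPU budget
```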

However, over time, when the project was about 75% complete, data capacity and throughput requirements became stricter. The unsatisfactory performance of InfoBright became a bottleneck for project success. A decision was made to select an alternative datawarehouse technology. HP Vertica was chosen primarily due to its favorable performance reports and licensing preferences.

Since the development team did not have previous experience with this technology, an isolated throwaway prototype was implemented (to save time and minimize overall project impact in case the technology could not show significant performance improvements compared with InfoBright). The result was positive, and so the team decided to integrate HP Vertica into the main project branch.

A takeaway of this case study is that even when a vertical evolutionary prototype successfully minimizes technical risks, requirements may still change and another round of prototyping becomes vital. A throwaway prototype can help minimize the time needed to decide on substituting a technology.

 

 



Case 2: Online Coupon Web Analytics


Business Goals

Big Data

Technologies

Prototyping / Architecture Analysis

• In-house web analytics platform for "Conversion Funnel Analysis", marketing campaign optimization, and user behavior analytics
• Clickstream analytics, platform feature usage analysis
• 500 million visits a year
• 25 TB+ HP Vertica datawarehouse
• 50 TB+ Hadoop cluster
• Near-real-time analytics (15 minutes is supported for clickstream data)
• Data Lake: Amazon EMR / Hive / Hue / MapReduce / Flume / Spark
• DW: HP Vertica, MySQL
• ETL/Data Integration: custom, using Python
• BI: R, Mahout, Tableau

Prototyping (addressed risk – improving performance of data science team when querying data):

-       Throwaway prototype to find replacement for Hive (1 month)

-       Spark, Impala and Hive Stinger were evaluated (mostly performance scenarios) and as a result Spark was selected


This case study concerns an in-house web analytics platform with more than 50 TB of raw semi-structured data (logs) and more than 25 TB of refined structured data.
The platform was built and evolved iteratively, based on agile practices. After 2.5 years the volume of collected data and its complexity became a significant hurdle for data exploration scenarios. Performance of Hive ad-hoc analytics queries over the Data Lake (50+ TB of logs) was unsatisfactory, typically taking several minutes and, for some queries, approaching one-hour response times.

As a result, an architecture analysis was started to help the data science team, which was the primary user of the Data Lake. A set of distributed query processing technologies was analyzed to find a replacement for Hive: Hive Stinger (Hive on Tez), Spark, and Impala. The analysis took 1 week, primarily involving product documentation and other public reference materials, and aimed to identify the optimal technology. This investigation did not reveal a clear favorite, since the candidate technologies were immature at that moment, and the lack of industry case studies and benchmarks made selection difficult.

A throwaway prototype for each candidate technology was chosen as the only effective option to move forward. Within a month Spark, Impala, and Hive Stinger were evaluated based on the most common queries and production data. Since Spark demonstrated the best results during the tests, it was selected as the technology to replace Hive for exploratory analysis by the data science team.
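The evaluation essentially amounted to timing the same representative queries on each candidate engine; a stripped-down sketch of such a harness (the "engines" here are stand-in functions, not real Spark, Impala, or Hive Stinger clients):

```python
import time

# Minimal sketch of the benchmark loop behind a query-engine evaluation:
# run the same representative queries against each candidate engine and
# rank the engines by total elapsed time. In the real evaluation the
# runners submitted queries to Spark, Impala, and Hive Stinger.

def benchmark(engines, queries):
    """engines: {name: callable running one query}. Returns (winner, totals)."""
    totals = {}
    for name, run_query in engines.items():
        start = time.perf_counter()
        for q in queries:
            run_query(q)
        totals[name] = time.perf_counter() - start
    return min(totals, key=totals.get), totals

# Stand-in engines: one instantaneous, one artificially slower.
engines = {
    "spark": lambda q: None,
    "hive":  lambda q: time.sleep(0.01),
}
winner, totals = benchmark(engines, queries=["q1", "q2", "q3"])
print(winner)
```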

A takeaway of this case study is that an architecture analysis alone may be insufficient to make technology selection decisions. This is especially true for new technologies (or new versions of existing technologies). The sooner the analysis discovers a lack of trustworthy information, the sooner prototypes can start and provide results. However, the analysis can serve as an effective technique to narrow down a list of candidate technologies and hence save prototyping time.

 



Case 3: Cloud-based Mobile App Development Platform


Business Goals

Big Data

Technologies

Prototyping / Architecture Analysis

• Provide a visual environment for building custom mobile applications
• Charge customers by usage
• Analysis of platform feature usage by end-users and platform optimization
• Data volume: > 10 TB
• Sources: JSON
• Data throughput: > 10K events/sec
• Analytics: self-service, pre-defined reports, ad-hoc
• Data latency: < 2 min
• Middleware: RabbitMQ, Amazon SQS, Celery
• DB: Amazon Redshift, RDS, S3
• BI: Jaspersoft
• Elastic Beanstalk
• Integration: Python

 

Architecture analysis (addressed risk – selection of cost effective technology for DWH):

-       Analysis of technologies for datawarehouse - Amazon Redshift, Amazon RDS, and HP Vertica (1 week)

-       Selected Amazon Redshift - the criteria for selection were features and cost

 

Prototyping (addressed risk – integration between technologies and performance of entire system):

-       Vertical evolutionary prototype based on Redshift and Jaspersoft (1 month)

-       Verified integration and performance scenarios


 

This project began as an analytics subsystem for a cloud-based mobile application development platform. Since the platform itself was already hosted on AWS, it was decided to design the analytics with AWS as the target hosting platform and as an architectural constraint. The major big data drivers were data volume (> 10 TB) and velocity (near-real-time ad-hoc and static reports, < 2 minutes).

An architecture analysis began with the selection of a cost-effective technology for the datawarehouse. Since one of the requirements was to minimize hosting operational costs (the system had to store data for about 1,000 customers), the datawarehouse had to offer high data compression without significant performance degradation.

Several datawarehouse technologies were compared during a one-week period by analyzing industry performance benchmarks and price comparisons between Amazon Redshift, HP Vertica, and Amazon RDS. Amazon Redshift was selected as the optimal option. Although the development team already had experience with HP Vertica and had used it for similar scenarios, it was more expensive than Amazon Redshift. Amazon RDS didn't provide the required scale, as it was based on traditional (small data) technologies, whereas Amazon Redshift and HP Vertica use an MPP (Massively Parallel Processing) approach, which is crucial for big data processing.
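The cost comparison hinges on compression: the effective storage bill is the raw volume divided by the compression ratio, times the price per stored TB. A hypothetical sketch (engine names, compression ratios, and prices are illustrative placeholders, not actual vendor figures):

```python
# Hypothetical storage-cost estimate of the kind used when comparing
# datawarehouse options: effective cost depends on raw data volume, the
# engine's compression ratio, and its price per stored TB per month.
# All figures below are illustrative placeholders, not vendor prices.

def monthly_storage_cost(raw_tb, compression_ratio, price_per_tb_month):
    compressed_tb = raw_tb / compression_ratio
    return compressed_tb * price_per_tb_month

raw_tb = 10  # > 10 TB of raw data, as in the requirements above
options = {
    "engine_a": monthly_storage_cost(raw_tb, compression_ratio=4.0, price_per_tb_month=1000),
    "engine_b": monthly_storage_cost(raw_tb, compression_ratio=3.0, price_per_tb_month=2500),
}
cheapest = min(options, key=options.get)
print(cheapest, options[cheapest])  # engine_a 2500.0
```

A better compression ratio can outweigh a lower sticker price per TB, which is why compression was called out as a requirement above.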

After the architectural analysis selected the primary candidate technology for a datawarehouse, it became important to mitigate integration risks and measure overall system performance (end-to-end). A vertical evolutionary prototype started by integrating Redshift and Jaspersoft. (In fact, the team was an early adopter of Amazon Redshift and Jaspersoft BI hosted at Amazon, which meant that there were many risks.)  After two sprints (4 weeks) the prototype demonstrated performance results which were entirely satisfactory.

A takeaway of this case study is that an architecture analysis complements vertical evolutionary prototyping by selecting a primary technology candidate. Prototyping then validates that the technology selection was correct. In the worst-case scenario, when the primary technology does not satisfy the requirements, it is still possible to fall back to a secondary option (e.g., HP Vertica in this case), albeit with some trade-off (cost, in this case), in a controlled and manageable way.

 



Case 4: Telecom E-tailing platform



Business Goals

Big Data

Technologies

Prototyping / Architecture Analysis

• Build an omni-channel platform to improve sales and operations
• Analyze all enterprise data from multiple sources for real-time recommendations and cross-/up-selling
• Analytics on 90+ TB (30+ TB structured, 60+ TB unstructured and semi-structured data)
• Elasticity through SDE principles
• Hadoop (HDFS, Hive, HBase)
• Cassandra
• HP Vertica / Teradata
• MicroStrategy / Tableau

 

Architecture analysis (addressed risk – selection of technologies that support target scalability, elasticity and cost criteria):

-       Evaluated OpenStack distributions and Databases for analytics (Teradata, Microsoft PDW and HP Vertica) (2 weeks)

-       Selected RedHat OpenStack and Teradata

-       Criteria for selection were features, maturity of a technology and cost

 

This project started with a Discovery phase, which was a pure architecture analysis. The main goal of the Discovery phase was to support a business case from an architecture standpoint and receive funding from executive sponsors. There was no time or budget for prototyping activities. From the data analytics perspective the platform was required to support more than 90 TB of data (30+ TB structured, 60+ TB unstructured/semi-structured), which made traditional (small data) designs irrelevant.

The architecture analysis identified key elements of the solution platform, such as a Data Lake (with Hadoop as the primary candidate), a datawarehouse, and a private cloud-computing software platform (with OpenStack as the primary candidate). After these major elements were identified, the focus of the analysis shifted to selecting a primary technology for the datawarehouse and an OpenStack distribution (as there are a number of them on the market).

The selection of a datawarehouse was driven primarily by scalability and cost criteria. Approximately 2 weeks were spent comparing Teradata, Microsoft PDW, and HP Vertica and negotiating price options with vendors. As a result, Teradata was chosen, although the other technologies scored very close to the winner, meaning that a change in technologies would be feasible at a later date if necessary.
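A selection like this is often formalized as a weighted scoring matrix over the stated criteria; a sketch of the idea (the weights and per-product scores are hypothetical illustrations, not the project's actual numbers):

```python
# Sketch of a weighted scoring matrix of the kind used for a datawarehouse
# selection. Criteria weights and per-product scores (1-5 scale) are
# hypothetical illustrations, not the project's actual evaluation.

WEIGHTS = {"scalability": 0.4, "features": 0.3, "cost": 0.3}

SCORES = {
    "Teradata":      {"scalability": 5, "features": 5, "cost": 3},
    "Microsoft PDW": {"scalability": 4, "features": 4, "cost": 4},
    "HP Vertica":    {"scalability": 5, "features": 4, "cost": 3},
}

def weighted_total(scores):
    return sum(WEIGHTS[criterion] * value for criterion, value in scores.items())

ranking = sorted(SCORES, key=lambda p: weighted_total(SCORES[p]), reverse=True)
for product in ranking:
    print(f"{product}: {weighted_total(SCORES[product]):.2f}")
```

Close totals, as in this case, signal that the runner-up remains a viable fallback if the winner later disappoints.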

The selection of an OpenStack distribution was driven primarily by hardware vendor constraints (the distribution needed to have been certified by a hardware vendor) and other criteria such as market adoption and community support.  The RedHat distribution scored best on these criteria and was selected as the primary candidate.

A takeaway of this case study is that an architecture analysis may be the only feasible option to make early design decisions and to support building a business case. It is important to keep in mind however that decisions made at this stage are not carved in stone and may be reconsidered based on prototyping when a project receives funding.

 



Case 5: Web Analytics and Marketing Optimization


Business Goals

Big Data

Technologies

Prototyping / Architecture Analysis

• Optimization of all web, mobile, and social channels
• Optimization of recommendations for each visitor
• High return on online marketing investments
• Data volume: > 1 PB
• 5-10 GB per customer per day
• Data sources: clickstream data, web server logs

 

 

• Hadoop (HDFS, MapReduce, Oozie, ZooKeeper)
• HBase
• Oracle
• Java / Flex / JavaScript

Prototyping (addressed risk – total cost of ownership):

- Spike prototype to replace relational database with a low cost alternative (1 month)

-  HBase was selected as a NoSQL storage

 

An initial architecture for this system was designed in 2005 and was based on relational technologies, using an Oracle RDBMS. However, over time new requirements arose and the existing architecture became problematic in terms of performance. The new architectural drivers were more than 1 PB of data volume and support for a near-real-time operational dashboard.

An architecture analysis began with the goal of finding a highly scalable and cost-effective solution to replace expensive RDBMS licenses. Within a week a team with big data expertise analyzed the architecture drivers and open source NoSQL databases to select the optimal one, given the driving scenarios. HBase (a column-family implementation) was selected because it works well for implementing materialized views and static reports, which were identified as key scenarios. In addition, HBase is natively integrated into the Hadoop ecosystem, and Hadoop had already been adopted by the client company to collect raw data.
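Column-family stores such as HBase serve materialized views well when the row key encodes the report's access path; a common pattern is an entity prefix plus a reversed timestamp, so the newest rows come first in a key scan. A hypothetical sketch of such a key scheme (not the project's actual schema):

```python
# Sketch of an HBase-style row key for a materialized report view.
# Column-family stores retrieve rows by key range, so the key encodes the
# report's access path: report id, dimension value, and a reversed
# timestamp (newest periods sort first in a scan). Purely illustrative.

MAX_TS = 9_999_999_999_999  # largest 13-digit value; assumes epoch-millisecond timestamps

def report_row_key(report_id, dimension, ts_millis):
    reversed_ts = MAX_TS - ts_millis  # newer timestamp -> lexicographically smaller key
    return f"{report_id}|{dimension}|{reversed_ts:013d}"

k_new = report_row_key("daily_visits", "us", 1_400_000_000_000)
k_old = report_row_key("daily_visits", "us", 1_300_000_000_000)
print(k_new < k_old)  # True: the newer row sorts first
```

This kind of key design is what makes precomputed views and static reports cheap in HBase, and also why the same store struggled later with ad-hoc queries that do not follow a predesigned key.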

After receiving the technology recommendation from the expert team, the client development team (with little prior knowledge of NoSQL) began a spike prototype using HBase and demonstrated good results. The Total Cost of Ownership was reduced compared with the Oracle technologies, and performance was deemed satisfactory.

Inspired by the success of the prototype, the development team continued HBase adoption and put it into production. After some time, the team tried to leverage HBase for ad-hoc reporting scenarios, and here they encountered problems.  Several months were spent attempting to optimize performance with no significant progress.

Eventually, a new architecture analysis and prototyping iteration was conducted with the expert team. The new analysis round considered a set of technology candidates designed for ad-hoc queries (e.g., Impala or Spark), and a prototype helped to ensure that the selected technology would be effective. The lesson learned here was to avoid the “hammer looking for a nail” syndrome (trying to apply the same technology to tackle very different problems) and to pay attention to early warning signs (such as optimization efforts without much progress).

 



Case 6: Healthcare Insurance Operational Intelligence


Business Goals

Big Data

Technologies

Prototyping / Architecture Analysis

• Operational cost optimization for 3.4 million members
• Track anomaly cases (e.g., control of Schedule 1 and 2 drugs, refill status control)
• Collaboration tool for 65,000 providers (delegation, messaging, reassignment)
• Velocity: 10K+ events per second
• Complex event processing: pattern detection, enrichment, projection, aggregation, join
• High scalability, high availability, fault tolerance

• AWS VPC
• Apache Mesos, Apache Marathon, Chronos
• Cassandra
• Apache Storm
• ELK (Elasticsearch, Logstash, Kibana)
• Netflix Exhibitor
• Chef
Architecture analysis (addressed risk – selection of technology to achieve target performance):

-       Analysis and selection of stream processing framework among Apache Storm, Spark Streaming and Samza (1.5 weeks)

-       As a result selected Apache Storm for its sub-second latency capabilities

 

Prototyping (addressed risk – validation of performance of the selected technology with project specific scenarios):

-       Throwaway prototype on Apache Storm to validate the technology selection for the project specific use cases (1 month)

-       Implementation of an MVP for demonstrating the business idea to stakeholders (4 months)

 

This case study is about a project which was started as an R&D activity. There was no similar solution in the market and requirements didn’t permit the use of traditional (small data) technologies: data velocity > 10,000 events per second, complex event processing (pattern detection, enrichment, projection, aggregation, join), cloud-based elasticity, high-availability and fault-tolerance.

Stream processing was identified as the most critical component. Analysis of existing stream processing frameworks took about 10 days: Apache Storm, Spark Streaming, and Samza were evaluated based on publicly available documentation. As a result, Apache Storm was selected as the candidate technology due to its sub-second latency support, which was a critical project requirement. Next, a throwaway prototype was started to implement key scenarios using Apache Storm. The prototype revealed a limitation in dynamic reconfiguration, which resulted in immediate architectural feedback to business stakeholders. The product owner then modified the requirements to accommodate the discovered limitation, as the cost of overcoming it was disproportionately higher than the potential benefits. After a month of prototyping, the throwaway prototype was considered a success--it demonstrated that the critical scenarios could be met, and the project turned into an MVP. Some code snippets from the throwaway prototype were reused in the MVP development, but the most important outcome was know-how: the practical knowledge gained by the prototyping.
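The kind of pattern detection prototyped on Storm can be illustrated independently of it. A hypothetical sketch of one anomaly rule, flagging a member whose refill events for the same drug exceed a limit within a sliding time window (the event shape and thresholds are illustrative, not the project's actual rules):

```python
from collections import defaultdict, deque

# Storm-independent sketch of one CEP pattern from this case: flag a member
# when more than `limit` refill events for the same drug arrive within a
# sliding time window. Event shape and thresholds are hypothetical.

def detect_refill_anomalies(events, window_secs=86_400, limit=2):
    """events: iterable of (timestamp, member_id, drug_id), in time order.
    Yields (timestamp, member_id, drug_id) for each event that breaches
    the limit within the window."""
    windows = defaultdict(deque)
    for ts, member, drug in events:
        win = windows[(member, drug)]
        win.append(ts)
        while win and win[0] <= ts - window_secs:  # drop events outside the window
            win.popleft()
        if len(win) > limit:
            yield ts, member, drug

events = [
    (0,      "m1", "d1"),
    (3_600,  "m1", "d1"),
    (7_200,  "m1", "d1"),   # third refill within 24h -> anomaly
    (90_000, "m2", "d1"),
]
print(list(detect_refill_anomalies(events)))
# [(7200, 'm1', 'd1')]
```

In the real system a rule like this would run inside a Storm bolt over the event stream; the windowing logic itself is framework-agnostic.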

In this case study the throwaway prototype was crucial to demonstrate feasibility of the business idea.  Once the technology limitation was discovered, it was communicated back to business stakeholders to make a further tradeoff: whether to spend hundreds of thousands of dollars to implement the missing feature or not. In the end this feature was removed from the product backlog, as the benefits did not outweigh its costs.

 


 

Case 7: Ultra-large scale travel site

 

Business Goals

Big Data

Technologies

Prototyping / Architecture Analysis

• Build a modern solution for IT operational intelligence (monitoring, troubleshooting, and data exploration) to accommodate company scale
• Collect and analyze clickstream logs from visitors (200M+ unique visitors per month)
• Conduct exploratory analysis over collected data to improve operational efficiency, services, and support the engineering team
• Data sources: 300+ web servers
• Throughput: 15,000 events/sec
• Volume: 110+ TB of raw and aggregated data

 

• Hadoop (HDFS, Oozie, Hive, Impala)
• Elasticsearch
• Apache Flume
• Tableau
• Kibana

 

Architecture analysis (addressed risk – achieving scalability and cost efficiency):

- Analysis and selection of the Lambda Architecture and optimal technologies for this reference architecture

 

 

 

Nowadays even relatively young, fast-growing companies face legacy software challenges. Such companies, as start-ups, quickly build solutions to satisfy their current needs under high time-to-market pressure and then, after some time, are faced with scalability and TCO issues. This case study is about a company which built an in-house clickstream analytics solution using traditional RDBMS technologies. When a small number of web servers grew to several hundred, log generation velocity approached 15,000 events/sec, and data volume exceeded 110 TB, it became clear that a new system was necessary to handle these Big Data requirements.

The main architecture drivers for the new system were scalability and cost efficiency.  An architecture analysis showed that the optimal reference architecture for this case is the Lambda Architecture. This selection was dictated primarily by two reasons:

1.         The Lambda Architecture supports both historical and real-time data analytics. The target solution had to support historical data exploration (several years of history) as well as real-time dashboards (updates in less than a minute).

2.         The Lambda Architecture provides good scalability and a wide choice of modern open-source technology options to lower TCO (e.g., Hadoop, NoSQL, data stream processing).

In addition to the selection of the reference architecture, the architecture analysis also produced a selection of candidate technologies, deployment topologies, and a data modeling approach. The chosen set of primary candidate technologies was: Hadoop (HDFS, Hive), Impala, Elasticsearch, Apache Flume, Tableau, and Kibana.
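The defining mechanism of the Lambda Architecture is the serving-layer merge: a query answer combines the precomputed batch view with recent increments from the speed layer. A minimal sketch, using per-page counters as a stand-in for the clickstream metrics in this case:

```python
# Minimal sketch of the Lambda Architecture's serving-layer merge: a query
# answer combines the precomputed batch view (complete up to the last batch
# run) with real-time increments from the speed layer. Per-page counters
# and the numbers below are illustrative stand-ins.

batch_view = {"/home": 1_000_000, "/search": 640_000}   # as of the last batch run
speed_view = {"/home": 1_250, "/checkout": 90}          # events since then

def merged_count(page):
    return batch_view.get(page, 0) + speed_view.get(page, 0)

print(merged_count("/home"))      # 1001250
print(merged_count("/checkout"))  # 90
```

This split is what lets the same system serve multi-year historical exploration (batch layer) and sub-minute dashboard updates (speed layer) at once.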

However, the architectural analysis also discovered a number of limitations related to immaturity of Big Data technologies at the beginning of 2014:

• Elasticsearch did not support complex aggregation functions (such as percentiles);

• Impala had limitations on executing queries that touched large datasets (resulting in out-of-memory exceptions) and did not support nested types;

• Kibana did not support charts with multiple data series.

For some of these limitations it was possible to find feasible workarounds. For others, the only option was to reconsider these functional requirements.

This case study highlights the value of architecture analysis in supporting executive decision-making, specifically in analyzing whether to embark on a new system development and the costs and risks associated with it. There was no time or budget for prototyping activities. Despite this, the architecture analysis alone revealed a set of technological constraints that directly influenced the functional requirements of the new system.

 



Case 8: Operations Intelligence platform for Content Delivery Network

 

Business Goals

Big Data

Technologies

Prototyping / Architecture Analysis

• Build an operations intelligence dashboard for customers and internal IT (SLA, security, and root cause analysis)
• Collect and analyze clickstream logs from end-users (1 billion events per day)

• Throughput: 10,000-15,000 events/sec
• Volume: 300+ TB of raw and aggregated data

 
• Hadoop (HDFS, Hive, Impala)
• Elasticsearch
• Apache Kafka
• Kibana
• Amazon S3
Architecture analysis (addressed risk - achieving scalability and cost efficiency)

-       Analysis and selection of optimal technologies for Lambda Architecture

 

Prototyping (addressed risk – validation of performance of the selected technology with project specific scenarios):

-       Throwaway prototype on Elasticsearch to validate the technology selection for project-specific use cases (2 weeks)

-       Implementation of an MVP for getting early feedback from customers (3 months)

 

 

This project began with a Discovery phase to analyze an existing architecture and create a new architecture that would support an online analytics dashboard for customers and internal IT staff. The requirements were to collect and analyze clickstream logs from end-users (1 billion events/day) with potential volumes of up to 300 TB of raw and aggregated data. An additional requirement was real-time update of the dashboard with new data (with less than 1 minute latency). This project was to be developed in an agile manner, receiving early feedback from customers and updating a product backlog accordingly.

The expert team that conducted the Discovery phase had past experience with similar systems and could quickly propose a reference architecture--in this case based on the Lambda Architecture--and provide a list of primary candidate technologies based on a big data design concept catalog. However, prior to this, the team analyzed the existing system and made certain discoveries and decisions based on this analysis. For these purposes a lightweight variant of the ATAM (Architecture Tradeoff Analysis Method) was used. The ATAM helped to generate and prioritize quality attribute scenarios and revealed gaps in the existing architecture. The resulting new architecture augmented the existing system to satisfy newly identified business and system requirements. From the existing system, Flume and Kafka were re-used to collect logs, which then suggested the use of Elasticsearch, Hadoop, and Spark to analyze the historical and real-time data.

The architecture analysis was relatively quick and finished within 1.5 weeks. However, it discovered a risk related to Elasticsearch performance on large data sets, and the potentially high cost of hosting it on Amazon Web Services. To mitigate this risk a throwaway prototype was executed on test data, using different Elasticsearch configurations (number of shards, size of indices, number of nodes, etc.) and different AWS instance types (such as compute optimized, storage optimized, etc.). The prototype lasted two weeks and allowed the team to select the best configuration for the system scenarios from performance and cost perspectives. After this, the new system development started with an MVP (lasting about 3 months) to get early customer feedback and come up with an updated feature list for the first release version.
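Structurally, the prototype was a grid search over Elasticsearch and AWS parameters; a sketch of that structure (the candidate values and the "measurement" function are stand-ins for the real load tests on test data):

```python
from itertools import product

# Sketch of the configuration sweep behind an Elasticsearch sizing
# prototype: enumerate combinations of shard count and AWS instance type,
# "measure" each (the real prototype ran load tests on test data), and
# pick the cheapest configuration that meets the latency target.
# Candidate values and the measurement function are stand-ins.

SHARDS = [3, 6, 12]
INSTANCE_TYPES = ["compute_optimized", "storage_optimized"]

def measure(shards, instance):
    # Stand-in for a real load test: returns (p95 latency in seconds,
    # monthly cost in dollars) for the configuration.
    latency = 120 / shards + (0.5 if instance == "storage_optimized" else 0.0)
    cost = shards * (400 if instance == "compute_optimized" else 300)
    return latency, cost

def best_config(latency_target_secs=30):
    viable = []
    for shards, instance in product(SHARDS, INSTANCE_TYPES):
        latency, cost = measure(shards, instance)
        if latency <= latency_target_secs:
            viable.append((cost, shards, instance))
    return min(viable) if viable else None

print(best_config())  # cheapest configuration meeting the 30 s target
```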

This case study shows that an architecture analysis, throwaway prototype(s), and an MVP form an effective combination in agile big data product development. An architecture analysis helps create a technical vision, throwaway prototypes help to mitigate technical risks, and an MVP helps reduce time to market by providing customers with only essential features, and helps to elicit feedback and prioritize future development.


 

Case 9: Big Data as a Service (BDaaS) platform

 
 

Business Goals

Big Data

Technologies

Prototyping / Architecture Analysis

• BDaaS as a part of the vendor overall cloud strategy, which is to build the platform for the Internet of Everything
• Scalability and elasticity, simplicity of access, ease of use for developers
• Cross-cloud deployment (public and private cloud)
• One click deployment with automated VM provisioning and bootstrapping microservices for large scale big data solutions
• Fault-tolerance and automatic switch-over
• Centralized log collection
• Mesos
• Marathon
• Terraform
• Ansible
• Hadoop (Cloudera, Hortonworks, MapR)
• Apache Spark
• Apache Cassandra
• Apache Storm
• Apache Kafka

Prototyping (addressed risks – technology integration and validating architecture concept)

- An MVP prototype and a number of throwaway prototypes in an agile, R&D-like development process, to be able to quickly detect technology limitations and propose appropriate mitigations



A goal of this project was to build a BDaaS (Big Data-as-a-Service) platform for the Internet of Everything. The top requirements were scalability and elasticity, ease of development and cross-cloud deployment (public and private clouds). More granular requirements included one-click deployment with automated provisioning of Virtual Machines, bootstrapping microservices, resource management, monitoring and troubleshooting capabilities.

One of the major challenges of this project was to design a system that could satisfy future requirements that were unknown at the beginning. The system had to support multiple big data architectures, alternative technologies, and a virtually unlimited number of deployment topologies. An initial architecture analysis could only produce a very basic concept and a selection of technologies that had to be integrated and thoroughly tested prior to use in production (such as Terraform, Mesos, Kubernetes, Consul, Marathon, Docker, Vault).
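One-click deployment of arbitrary topologies implies a declarative blueprint that can be validated before any VM is provisioned. A hypothetical sketch of such a pre-flight check (field names, known services, and quotas are illustrative, not the platform's actual schema):

```python
# Hypothetical sketch of a pre-provisioning check for a declarative
# deployment blueprint, in the spirit of the platform's one-click
# deployment: every service must name a known technology and request
# resources that fit the cluster quota. Purely illustrative.

KNOWN_SERVICES = {"hadoop", "spark", "cassandra", "storm", "kafka"}
CLUSTER_QUOTA = {"cpus": 64, "memory_gb": 256}

def validate_blueprint(blueprint):
    """Return a list of validation errors (an empty list means deployable)."""
    errors = []
    totals = {"cpus": 0, "memory_gb": 0}
    for svc in blueprint.get("services", []):
        if svc["type"] not in KNOWN_SERVICES:
            errors.append(f"unknown service type: {svc['type']}")
        for res in totals:
            totals[res] += svc.get(res, 0) * svc.get("instances", 1)
    for res, used in totals.items():
        if used > CLUSTER_QUOTA[res]:
            errors.append(f"{res} over quota: {used} > {CLUSTER_QUOTA[res]}")
    return errors

blueprint = {"services": [
    {"type": "kafka", "instances": 3, "cpus": 4, "memory_gb": 16},
    {"type": "spark", "instances": 5, "cpus": 8, "memory_gb": 32},
]}
print(validate_blueprint(blueprint))
# cpus: 12 + 40 = 52 (ok), memory: 48 + 160 = 208 (ok) -> []
```

Catching such errors at blueprint level, before Terraform or Mesos is invoked, is what keeps a one-click deployment from failing halfway through provisioning.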

Thus, after the initial architecture analysis the project continued as a long-term agile R&D activity, with multiple spikes to address needs for additional analysis or throwaway prototypes. The main project branch included an evolutionary prototype based on open source technologies, which was tested by early (alpha) customers and updated with the results of the spikes.

During the research and development, the architecture underwent multiple changes, which resulted in incompatibilities with previous versions. This required more effort from adopters and the professional services team to keep up with the changes, but at the same time it let the R&D team achieve significant improvements and offer customers the latest features of newly released technologies.

 


  Strategic Prototyping Decision Flowchart

Flowchart