As always with AWS re:Invent, there were a ton of product announcements - new products and services, new functionality for existing services, various service improvements, and new functional and operational aspects of the AWS platform as a whole. With so much new information revealed during re:Invent, it's tough to determine which announcements might be important, or even game changing, for one's specific project, solution, or technical strategy. Having attended re:Invent 2021, we've picked out some of the announcements that we found particularly intriguing and that we think could have a large impact across the technology landscape.
While serverless technologies such as AWS Lambda have been around for a little while now, there were some important announcements about existing services gaining a serverless option for hosting and deployment. This is important because, in many scenarios, serverless has by far the lowest operational overhead of any hosting and compute model, and is also one of the most scalable ways to host an application, data store, or even an entire software platform. It is also inherently fault-tolerant and reliable. These reasons alone make going serverless a compelling choice for large enterprises, small start-ups, and everything in between. Even if you ultimately go with another type of solution (dedicated clusters, Kubernetes, etc.), serverless solutions should always be carefully considered. With that said, let's take a look at some of the important new serverless options that AWS announced at re:Invent.
As many know, Redshift is AWS's cloud-based data warehousing platform. One of the most important things to get right with Redshift is the creation and management of the cluster it will be deployed on. Not only are there various instance types available for hosting, but you also need to know how many instances of each you need for appropriate compute capacity. These two aspects contribute more to the overall health and performance of your cluster than any other. Unfortunately, it's not always easy to get them right, for a variety of reasons:
- You might not have the necessary knowledge to determine workload and compute capacity
- You might not know what kind of analytics or query patterns will be needed
- Workloads, reporting, and other analytics might change patterns over time, so that the currently deployed cluster instances are no longer the best fit
In addition to the above, a dedicated Redshift cluster can become costly to operate, because its nodes are dedicated cloud resources that are not provisioned in a transient manner. (This can be especially true of larger clusters.) There is also operational overhead with cluster management. Compared to other data warehousing platforms, Redshift mitigates a lot of this overhead; however, it's still there and needs to be managed.
With the introduction of Redshift Serverless, AWS has eliminated or greatly mitigated many of the above issues. The biggest change (and consequent benefit) is that there is no need to set up and manage any clusters. Eliminating the cluster from the equation also eliminates the largest ramp-up step in getting started with Redshift, and it removes quite a bit of the cost of running a cluster, as you don't pay for compute when the warehouse is idle. Another great benefit is that serverless Redshift will automatically scale your warehouse as your needs evolve or grow, provisioning the right amount of compute resources based on usage and workload patterns.
In addition to autoscaling based on workloads, serverless Redshift also lets you specify a base warehouse size, for additional control over how your data is handled and a sound foundation to scale from. This can also help in situations where there are SLAs to meet.
Together, the above benefits drastically decrease the total cost of ownership and the operational overhead of running a Redshift-based data warehouse.
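To make the idle-cost argument a little more concrete, here is a rough back-of-the-envelope sketch in Python. The hourly rates and the number of active hours are made-up placeholders, not AWS prices, and the model ignores storage and other charges; it only illustrates why paying for compute solely while queries run can beat an always-on cluster for intermittent workloads.

```python
HOURS_IN_MONTH = 730.0  # common approximation for one month


def provisioned_monthly_cost(hourly_rate: float, hours: float = HOURS_IN_MONTH) -> float:
    """An always-on cluster is billed for every hour, busy or idle."""
    return hourly_rate * hours


def serverless_monthly_cost(active_hourly_rate: float, active_hours: float) -> float:
    """A pay-per-use warehouse is billed only for hours with running workloads."""
    return active_hourly_rate * active_hours


def break_even_active_hours(provisioned_rate: float, serverless_rate: float,
                            hours: float = HOURS_IN_MONTH) -> float:
    """Active hours per month at which both options cost the same."""
    return provisioned_rate * hours / serverless_rate


# Hypothetical figures: the serverless rate is higher per hour,
# but the warehouse is only busy about 120 hours a month.
always_on = provisioned_monthly_cost(4.0)
pay_per_use = serverless_monthly_cost(6.0, 120.0)
threshold = break_even_active_hours(4.0, 6.0)

print(f"always-on cluster: {always_on:.2f}")
print(f"pay-per-use:       {pay_per_use:.2f}")
print(f"break-even at about {threshold:.0f} active hours per month")
```

Under these assumed numbers the pay-per-use option is cheaper as long as the warehouse is active for fewer hours than the break-even point; a warehouse that is busy around the clock would tip the comparison the other way.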
EMR is AWS's big data processing framework - essentially a cloud-based managed platform for Spark and other tools. As with Redshift, EMR is cluster based: before you begin any processing, you need to create and define a cluster of nodes that will be used to process your data. Your usual options are either creating a dedicated always-on cluster (which is not particularly cost-effective), or creating a "transient" cluster - a cluster that is provisioned solely for the purpose of running one job, and that shuts itself down and terminates after the job completes. The latter is a much more cost-effective option than a dedicated cluster. Either way though, you're still responsible for the creation and management of the cluster resources. Depending on team or individual experience, it's also not always easy to determine the appropriate cluster resources upfront, especially instance types and number of instances. In addition, once a job is running, it's difficult to dynamically scale an EMR cluster.
Like serverless Redshift, serverless EMR eliminates all the hassles and overhead of managing a cluster of resources for your data processing. You simply specify what software you want to use (EMR is a collection of Spark and Hadoop based tools), and then submit your job. Jobs can be submitted via API, EMR Studio, or JDBC/ODBC clients. Serverless EMR not only removes cluster management but also handles scaling: it will automatically determine the compute and memory resources needed to process incoming jobs and scale appropriately. It can also adjust resources while a job is in progress, scaling out or in based on changing requirements. As with all serverless resources, you only pay for what you use while the resources are actually running.
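As a rough illustration of what submitting a job via the API can look like, the sketch below builds the arguments for the EMR Serverless `StartJobRun` operation as exposed by boto3's `emr-serverless` client. The helper name `build_spark_job_request` is our own, and the application ID, role ARN, and S3 paths are hypothetical placeholders. Treat it as a sketch under those assumptions rather than a complete deployment recipe; the actual AWS call is left commented out so the snippet runs without credentials.

```python
from typing import Any, Dict, List, Optional


def build_spark_job_request(
    application_id: str,
    execution_role_arn: str,
    entry_point: str,
    arguments: Optional[List[str]] = None,
    spark_params: Optional[str] = None,
) -> Dict[str, Any]:
    """Build keyword arguments for an EMR Serverless StartJobRun call.

    Note what is absent: no instance types and no node counts.
    """
    spark_submit: Dict[str, Any] = {"entryPoint": entry_point}
    if arguments:
        spark_submit["entryPointArguments"] = list(arguments)
    if spark_params:
        spark_submit["sparkSubmitParameters"] = spark_params
    return {
        "applicationId": application_id,
        "executionRoleArn": execution_role_arn,
        "jobDriver": {"sparkSubmit": spark_submit},
    }


# Hypothetical identifiers, for illustration only.
request = build_spark_job_request(
    application_id="00example123",
    execution_role_arn="arn:aws:iam::123456789012:role/example-emr-job-role",
    entry_point="s3://example-bucket/jobs/word_count.py",
    arguments=["s3://example-bucket/input/", "s3://example-bucket/output/"],
)
print(request)

# With AWS credentials configured, the request could be submitted like this:
# import boto3
# response = boto3.client("emr-serverless").start_job_run(**request)
# print(response["jobRunId"])
```

The point of the sketch is the shape of the request: you name the application, the IAM role, and the script to run, and sizing is left to the service.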
MSK Serverless/Kinesis Data Streams On-Demand
While not quite as big an announcement as serverless Redshift and EMR, serverless MSK and serverless Kinesis are still fantastic additions to the original MSK and Kinesis services. MSK is AWS's managed Apache Kafka service and allows a pure AWS cloud-native approach to Kafka adoption; Kinesis is AWS's native streaming platform. Again, as with Redshift and EMR, to use either of these services you previously needed to manually provision compute and (for MSK) cluster resources. With both Kafka and Kinesis, it's very important to know what your capacity requirements will be prior to provisioning your resources, so that you can allocate the proper capacity and prevent scaling issues. Miscalculating your capacity requirements can drastically affect how well either of these services scales and performs.
With the serverless editions, you can now set up and provision either of these services with no upfront capacity requirements. Both MSK Serverless and Kinesis On-Demand will automatically scale compute and storage as needed. As your I/O capacity and usage grow, they will provision the necessary resources to handle this growth and make sure your application scales appropriately. Of course, as with serverless Redshift and EMR, you pay based on actual usage rather than on pre-provisioned capacity, so there is some very real potential for savings and cost optimization.
While there were many announcements regarding AWS's SageMaker service, two in particular were highly interesting: the new ability to run EMR and Spark jobs through SageMaker Studio, and the new no-code ML tool, SageMaker Canvas.
For some time now, SageMaker Studio has been able to integrate with EMR; however, you needed to run both in the same account (which wasn't always realistic), and you had to leave your development environment in order to manage and operate your EMR cluster. In practice, this can really hurt productivity and workflow. Depending on skill sets, it might also be unrealistic. With this new functionality, data scientists can operate and manage EMR clusters right within their SageMaker Studio environment, without having to leave and break their flow, or learn another tool. In addition, you can monitor and debug Spark jobs from within your SageMaker notebooks. The key takeaway here is that by incorporating EMR into SageMaker Studio as a first-class experience, you can now work with much larger data sets than before, and you're not constrained by your local machine resources.
With the introduction of SageMaker Data Wrangler, AWS entered the low-code/no-code ML tool space. With the introduction of SageMaker Canvas at re:Invent, they've taken that much, much further. Canvas is a no-code ML solution intended for use by business analysts and other non-ML specialists. It lets users generate machine learning models through a simple click-based visual UI, and doesn't require any machine learning experience or coding knowledge.
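For Kinesis, the difference between the two capacity modes shows up directly in the `CreateStream` request. The sketch below builds the arguments that boto3's `kinesis` client accepts for `create_stream`. The helper `build_create_stream_request` and the stream names are our own illustrative assumptions, and the call itself is commented out so the snippet runs without AWS credentials.

```python
from typing import Any, Dict, Optional


def build_create_stream_request(
    stream_name: str, shard_count: Optional[int] = None
) -> Dict[str, Any]:
    """Build keyword arguments for a Kinesis CreateStream call.

    Without a shard count the stream is requested in on-demand mode;
    with one, it is requested in provisioned mode with that many shards.
    """
    if shard_count is None:
        return {
            "StreamName": stream_name,
            "StreamModeDetails": {"StreamMode": "ON_DEMAND"},
        }
    if shard_count < 1:
        raise ValueError("shard_count must be at least 1")
    return {
        "StreamName": stream_name,
        "ShardCount": shard_count,
        "StreamModeDetails": {"StreamMode": "PROVISIONED"},
    }


# Hypothetical stream names, for illustration only.
on_demand = build_create_stream_request("example-clickstream")
provisioned = build_create_stream_request("example-clickstream-fixed", shard_count=4)
print(on_demand)
print(provisioned)

# With AWS credentials configured, either request could be sent like this:
# import boto3
# boto3.client("kinesis").create_stream(**on_demand)
```

In provisioned mode the shard count is exactly the upfront capacity estimate described above; in on-demand mode that parameter simply disappears from the request.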
Out of the box, Canvas enables the following:
- Data import and dataset generation: Canvas can import data from separate data sources and allow you to combine and unify various data sets into a consolidated set for training and modeling purposes.
- Data preparation: Canvas can automatically detect errors and help scrub, clean, and prepare your data for ML modeling, all without user input.
- One-click model creation: Canvas will analyze your dataset and automatically determine the most appropriate ML model to use on your data.
- Prediction generation: Canvas can generate predictions either in bulk or for single records, and provides explainability to help you understand them.
So, with the above, you get everything needed to create an ML solution right out of the box with no code. In addition, Canvas integrates with SageMaker Studio, so any model created in Canvas can then be shared with data scientists for further enhancement and refinement. With that said, Canvas can be a real game changer in terms of extending ML reach throughout the enterprise. It gives non-specialists access to the power and insights that machine-learning-based models and analytics provide.
Once again, AWS re:Invent did not disappoint, with a plethora of new announcements, products, and services for the new year. Having attended re:Invent and reviewed the related announcements, we feel these few are definitely worth taking a look at and diving into.