Designing the Jit Analytics Architecture for Scale and Reuse

Key Takeaways

  • With analytics becoming a key service in SaaS products – for both vendors and users – we can often leverage the same architecture for multiple use cases.
  • When using a serverless, event-driven architecture, the available AWS building blocks make it easy to design a robust and scalable analytics solution with existing data.
  • Leveraging cost-effective AWS services like AWS Kinesis Data Firehose, EventBridge, and Timestream, it is possible to quickly ramp up analytics for both internal and user consumption.
  • Different serverless building blocks have different limitations and capabilities; working around these to avoid bottlenecks is critical during the design phase of your architecture.
  • Be sure to consider the relevant data schemas, filtering, dimensions, and other data engineering aspects needed to extract accurate data and ensure reliable, high-quality metrics.

Analytics has become a core feature when building SaaS applications over event-driven architecture, as it is much easier to monitor usage patterns and present the data visually. Therefore, it isn’t surprising that this quickly became a feature request from our clients inside our own SaaS platform.

This brought about a Eureka! moment, where we understood that at the same time we set out to build this new feature for our clients, we could also better understand how our clients use our systems through internal analytics dashboards.

At Jit, we’re a security startup that helps development organizations quickly identify and easily resolve security issues in their applications. Our product has reached a certain level of maturity, where it is important for us to enable our users to have a visual understanding of their progress on their security journey. At the same time, we want to understand which product features are the most valuable to our clients and users.

This got us thinking about the most efficient way to architect an analytics solution that ingests data from the same source but presents that data to a few separate targets.

The first target is a customer metrics lake: essentially an over-time, tenant-separated store. The other targets are third-party visualization and research tools, used for better product analysis, that leverage the same data ingestion architecture.

At the time of writing, these tools are Mixpanel and HubSpot, both used by our go-to-market and product teams. This allows the aforementioned teams to collect valuable data on both individual tenant’s usage and general usage trends in the product.


If you’ve ever encountered a similar engineering challenge, we’re happy to dive into how we built this from the ground up using a serverless architecture.

As a serverless-based application, we use DynamoDB as our primary data store; however, we quickly understood that it does not have the time series capabilities we would require to aggregate and present the analytics data. Implementing this with our existing tooling would take much longer and would require significant investment for each new metric we’d like to monitor, measure, and present to our clients. So we set out to create something from scratch that we could build quickly with AWS building blocks and that would provide the dual functionality we were looking to achieve.

To create individualized graphs for each client, we recognized the necessity of processing data in a time series manner. Additionally, maintaining robust tenant isolation – ensuring each client can only access their unique data and thus preventing any potential data leakage – was a key design principle in this architecture. This took us on a journey to find the right tools for the job with the lowest management overhead and cost. We’ll walk through the technical considerations and implementation of building new analytics dashboards for internal and external consumption.

Designing the Analytics Architecture

The architecture begins with the data source from which the data is ingested – events written by Jit’s many microservices. These events represent every little occurrence that happens across the system, such as a newly discovered security finding, a security finding that was fixed, and more. Our goal is to listen to all of these events and be able to eventually query them in a time series manner and present graphs that are based on them to our users.
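As a purely illustrative sketch (the bus name, source, and payload fields here are assumptions based on the rule shown later, not our actual service code), a microservice might publish such an event to its event bus with boto3 like this:

import json
from datetime import datetime, timezone

import boto3

events_client = boto3.client("events")


def publish_findings_uploaded(tenant_id: str, new_count: int, existing_count: int) -> None:
    """Publish a 'findings-uploaded' event to the findings service event bus."""
    detail = {
        "tenant_id": tenant_id,
        "event_id": "generated-event-id",  # illustrative placeholder
        "new_findings_count": new_count,
        "existing_findings_count": existing_count,
        "created_at": datetime.now(timezone.utc).isoformat(),
    }
    events_client.put_events(
        Entries=[
            {
                "EventBusName": "findings-service-bus",
                "Source": "findings",                 # matched by the EventBridge rule's source
                "DetailType": "findings-uploaded",    # matched by the rule's detail-type
                "Detail": json.dumps(detail),
            }
        ]
    )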


Into the AWS EventBridge

These events are then fed into AWS EventBridge, where they are processed and transformed according to predefined criteria into a unified format consisting of data, metadata, and a metric name. This is achieved using EventBridge Rules. Since our architecture is already event-driven and all of these events are already written to different event buses, we only needed to add EventBridge Rules where we wanted to funnel the “KPI-interesting” data into the analytics feed, which was easy to do programmatically.

Once the data and relevant events are transformed as part of the EventBridge Rule, they are sent to Amazon Kinesis Firehose. This is achieved with the EventBridge Rule’s Target feature, which can send the transformed events to various targets.

The events that are transformed into the unified schema must contain the following parameters in order not to be filtered out:

  1. A metric_name field, which maps to the metric being measured over time.
  2. A metadata dictionary, which contains all of the metadata on the event; each table (providing the tenant isolation) is eventually created based on its tenant_id parameter.
  3. A data dictionary, which must contain event_time – the actual time the event arrived – as the analytics and metrics will always need to be measured and visualized over a period of time.

Schema structure:


 "metric_name": "relevant_metric_name",
 "metadata": 
   "tenant_id": "relevant_tenant_id",
   "other_metadata_fields": "metadata_fields",
   ...
 ,
 "data": 
   "event_time": <time_of_event_in_UTC>,
   "other_data_fields": <other_data_fields>,
   ...
 

AWS Kinesis Firehose

AWS Kinesis Data Firehose (Firehose for short) is the service that aggregates multiple events for the analytics engine and sends them to our target S3 bucket.


Once a threshold is exceeded (which can be a size or a period of time), the events are sent as a batch to an S3 bucket, where they await being written to the time series database, as well as to any other event subscribers, such as our unified system that needs to receive all tenant events.

Firehose’s job here is a vital part of the architecture. Because it waits for a threshold and then sends the events as a small batch, we know that when our code kicks in and begins processing the events, we’ll work with a small batch of events with a predictable size. This allows us to avoid memory errors and unforeseen issues.

Once one of the thresholds is passed, Kinesis performs a final validation on the data being sent, verifies that the data strictly complies with the required schema format, and discards anything that does not comply.

Invoking a lambda that runs inside Firehose allows us to discard the non-compliant events and perform additional transformation and enrichment, such as adding the tenant name. This involves querying an external system and enriching the data with information about the environment it’s running on. These properties are critical for the next phase, which creates one table per tenant in our time series database. (A rough sketch of such a transformation lambda appears after the configuration below.)

In the code section below, we can see:

  • A batching window is defined – in our case, 60 seconds or 5MB, whichever is reached first.
  • The data transformation lambda that validates and transforms all arriving events, ensuring reliable, unified, and valid events downstream.

The lambda that handles the data transformation is called enrich-lambda. Note that Serverless Framework transforms its name into a lambda resource called EnrichDashdataLambdaFunction, so pay attention to this gotcha if you are also using Serverless Framework.

MetricsDLQ:
 Type: AWS::SQS::Queue
 Properties:
   QueueName: MetricsDLQ
KinesisFirehouseDeliveryStream:
 Type: AWS::KinesisFirehose::DeliveryStream
 Properties:
   DeliveryStreamName: metrics-firehose
   DeliveryStreamType: DirectPut
   ExtendedS3DestinationConfiguration:
     Prefix: "Data/" # This prefix is the actual one that later lambdas listen upon new file events
     ErrorOutputPrefix: "Error/"
     BucketARN: !GetAtt MetricsBucket.Arn # Bucket to save the data
     BufferingHints:
       IntervalInSeconds: 60
       SizeInMBs: 5
     CompressionFormat: ZIP
     RoleARN: !GetAtt FirehoseRole.Arn
     ProcessingConfiguration:
       Enabled: true
       Processors:
         - Parameters:
             - ParameterName: LambdaArn
               ParameterValue: !GetAtt EnrichDashdataLambdaFunction.Arn
           Type: Lambda # Enrichment lambda
EventBusRoleForFirehosePut:
 Type: AWS::IAM::Role
 Properties:
   AssumeRolePolicyDocument:
     Version: '2012-10-17'
     Statement:
       - Effect: Allow
         Principal:
           Service:
             - events.amazonaws.com
         Action:
           - sts:AssumeRole
   Policies:
     - PolicyName: FirehosePut
       PolicyDocument:
         Statement:
           - Effect: Allow
             Action:
               - firehose:PutRecord
               - firehose:PutRecordBatch
             Resource:
               - !GetAtt KinesisFirehouseDeliveryStream.Arn
     - PolicyName: DLQSendMessage
       PolicyDocument:
         Statement:
           - Effect: Allow
             Action:
               - sqs:SendMessage
             Resource:
               - !GetAtt MetricsDLQ.Arn
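
For completeness, below is a rough, purely illustrative sketch of what a data-transformation lambda like the one referenced above could look like in Python. The enrichment lookup and field names are assumptions; the recordId/result/data response contract, however, is the one Firehose expects from transformation lambdas:

import base64
import json

REQUIRED_TOP_LEVEL_KEYS = {"metric_name", "metadata", "data"}


def lookup_tenant_details(tenant_id):
    """Hypothetical call to an external system that resolves tenant name and environment."""
    return {"tenant_name": f"tenant-{tenant_id}", "environment": "production"}


def handler(event, context):
    output_records = []
    for record in event["records"]:
        payload = json.loads(base64.b64decode(record["data"]))

        # Drop anything that does not comply with the unified schema.
        if (not REQUIRED_TOP_LEVEL_KEYS.issubset(payload)
                or "tenant_id" not in payload["metadata"]
                or "event_time" not in payload["data"]):
            output_records.append(
                {"recordId": record["recordId"], "result": "Dropped", "data": record["data"]}
            )
            continue

        # Enrich the event with tenant name and environment before it reaches S3 / Timestream.
        payload["metadata"].update(lookup_tenant_details(payload["metadata"]["tenant_id"]))

        output_records.append({
            "recordId": record["recordId"],
            "result": "Ok",
            "data": base64.b64encode(json.dumps(payload).encode()).decode(),
        })
    return {"records": output_records}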

Below is the code for the EventBridge rule that maps Jit events in the system to the unified structure. This rule sends the data to Firehose (below is the serverless.yml snippet).

A code example of our event mappings:

FindingsUploadedRule:
 Type: AWS::Events::Rule
 Properties:
   Description: "When we finished uploading findings we send this notification."
   State: "ENABLED"
   EventBusName: findings-service-bus
   EventPattern:
     source:
       - "findings"
     detail-type:
       - "findings-uploaded"
   Targets:
     - Arn: !GetAtt KinesisFirehouseDeliveryStream.Arn
       Id: findings-complete-id
       RoleArn: !GetAtt EventBusRoleForFirehosePut.Arn
       DeadLetterConfig:
         Arn: !GetAtt MetricsDLQ.Arn
       InputTransformer:
         InputPathsMap:
           tenant_id: "$.detail.tenant_id"
           event_id: "$.detail.event_id"
           new_findings_count: "$.detail.new_findings_count"
           existing_findings_count: "$.detail.existing_findings_count"
           time: "$.detail.created_at"
          InputTemplate: >
            {
              "metric_name": "findings_upload_completed",
              "metadata": {
                "tenant_id": <tenant_id>,
                "event_id": <event_id>
              },
              "data": {
                "new_findings_count": <new_findings_count>,
                "existing_findings_count": <existing_findings_count>,
                "event_time": <time>
              }
            }

Here we transform an event named “findings-uploaded” that already exists in the system (and that other services listen to) into a unified event, ready to be ingested by the metric service.

Timestream – Time Series Database

While, as a practice, you should always try to make do with the technologies you’re already using in-house and extend them to the required use case if possible (to reduce complexity), in Jit’s case, DynamoDB simply wasn’t the right fit for the purpose.

To be able to handle time series data on AWS (and perform diverse queries) while maintaining a reasonable total cost of ownership (TCO) for this service, new options needed to be explored. This data would later be represented in a custom dashboard per client, where time series capabilities were required (with the required strict format described above). After comparing possible solutions, we decided on the fully managed and low-cost database with SQL-like querying capabilities called Timestream as the core of the architecture.

Below is a sample piece of code that demonstrates what this looks like in practice:

SELECT * FROM "Metrics"."b271c41c-0e62-48d2-940e-d8c80b1fe242" 
WHERE time BETWEEN ago(1d) and now()

While other technologies were explored, such as Elasticsearch, we realized that they’d either be harder to manage and implement correctly as time series databases (for example, there would be greater difficulty with rolling out indexes and performing tenant separation and isolation) or would be much more costly. With Timestream, by contrast, a table per tenant is simple, and it is by far more economical, as it is priced solely by use – writing, querying, and storage. This may seem like a lot at first glance, but our comparison showed that with our predictable usage and the “peace of mind” it provides (given that it’s a serverless Amazon service with practically no management overhead), it is the more economically viable solution.

There are three core attributes for data in Timestream that optimize it for this use case (you can learn more about each in their docs):

  • Dimensions
  • Measures
  • Time

The dimensions are essentially what describe the data, such as unique identifiers per client (taken from the event’s metadata) and, in our case, the environment. The tenant_id is stripped out of the event and used as the Timestream table name, which is how tenant isolation is achieved. The remaining fields enable partitioning, which makes the data great for querying later. The more dimensions we utilize, the less data needs to be scanned during queries, because the data is partitioned based on these dimensions, effectively creating an index. This, in turn, enhances query performance and provides greater economies of scale.

Measures are essentially anything you require for incrementation or enumeration (such as temperature or weight). In our case, these are values we measure in different events, which works well for aggregating data.

Time is pretty straightforward; it is the timestamp of the event (when it was written to the database), which is also a critical function in analytics, as most queries and measurements are based on a certain time frame or window to evaluate success or improvement.
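To make that mapping concrete, here is a hedged sketch (database, table, and field names are illustrative, and we assume event_time has already been converted to epoch milliseconds) of how one such record might be written with boto3:

import boto3

timestream_write = boto3.client("timestream-write")


def write_metric_record(tenant_id: str, event: dict) -> None:
    """Write one unified metric event to the tenant's Timestream table."""
    record = {
        # Dimensions: fields that describe and partition the data.
        "Dimensions": [
            {"Name": "metric_name", "Value": event["metric_name"]},
            {"Name": "environment", "Value": event["metadata"].get("environment", "unknown")},
        ],
        # Measure: the value being aggregated over time.
        "MeasureName": "new_findings_count",
        "MeasureValue": str(event["data"]["new_findings_count"]),
        "MeasureValueType": "BIGINT",
        # Time: the event time, assumed here to be epoch milliseconds.
        "Time": str(event["data"]["event_time"]),
        "TimeUnit": "MILLISECONDS",
    }
    timestream_write.write_records(
        DatabaseName="Metrics",
        TableName=tenant_id,  # one table per tenant provides the tenant isolation
        Records=[record],
    )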

Visualizing the Data with Mixpanel and Segment

Once the ingestion, transformation, batch writing, and querying technology were defined, the dashboarding was easy. We explored the option of using popular open-source tools like Grafana and Kibana that integrate pretty seamlessly with Timestream; however, we wanted to provide maximum customizability for our clients inside their UI, so we decided to go with homegrown, embeddable graphs.

Once Firehose has written the data to S3 in the desired format, a dedicated Lambda reads it, transforms it into Timestream records, and writes them (as noted above, as a table per tenant, using the `tenant_id` from the metadata field). Another lambda then sends this pre-formatted data to Segment and Mixpanel, providing a bird’s-eye view of all the tenant data for both internal ingestion and external user consumption. This is where it gets fun.
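As a rough, hypothetical sketch of that second lambda (the write key and identity mapping are placeholders, not our actual setup), forwarding the same unified event to Segment – which can in turn route it to Mixpanel or HubSpot – might look like:

import segment.analytics as analytics

analytics.write_key = "YOUR_SEGMENT_WRITE_KEY"  # placeholder


def forward_metric(event: dict) -> None:
    """Forward a unified metric event to Segment for product analytics."""
    analytics.track(
        user_id=event["metadata"]["tenant_id"],  # tenant-level identity, illustrative
        event=event["metric_name"],
        properties={**event["metadata"], **event["data"]},
    )
    analytics.flush()  # ensure queued events are sent before the lambda exits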

While the Mixpanel and Segment data is leveraged internally, the UI for our clients is built on top of an API that performs queries against Timestream (which is tenant separated by IAM permissions), making it possible for each client to visualize only their own data.
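A minimal sketch of such a tenant-scoped query, assuming boto3 and the table-per-tenant layout described above (the metric name, time window, and column types are illustrative):

import boto3

timestream_query = boto3.client("timestream-query")


def get_tenant_metric(tenant_id: str, metric_name: str, window: str = "30d") -> list:
    """Run a time series query scoped to a single tenant's table."""
    # tenant_id and metric_name are assumed to be validated upstream.
    query = (
        "SELECT time, measure_value::bigint AS value "
        f'FROM "Metrics"."{tenant_id}" '
        f"WHERE metric_name = '{metric_name}' AND time BETWEEN ago({window}) AND now() "
        "ORDER BY time ASC"
    )
    # For large results, the response can be paginated via NextToken.
    result = timestream_query.query(QueryString=query)
    return result["Rows"]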


This enabled leveraging Mixpanel and Segment as the analytics backbone, giving our clients Lego-like building blocks for the graphs they consume.

Leveraging tools like Mixpanel and Segment gives us cross-tenant and cross-customer insights from our graphs, helping us optimize our features and products for our users.

Important Caveats and Considerations

Deciding to go with Timestream and a fully serverless implementation does come with cost considerations and scale limitations. We spoke about the Timestream attributes above; for each one there is a threshold that cannot be exceeded, and it’s important to be aware of these. For example, there is a limit of 128 dimensions and a maximum of 1,024 measures per table, so you have to ensure you architect your system not to exceed these thresholds.

When it comes to storage, there are two primary tiers: memory and magnetic (i.e., long term; note that “magnetic” here refers to Timestream’s cost-effective long-term storage, not magnetic tape). Memory storage is priced higher and comes with faster querying, but with a limited retention window (two weeks in our case). You can feasibly store up to 200 years of data on magnetic, but everything has cost implications (we chose one year, as we felt that was sufficient – and it can be increased dynamically as needed). The great thing about AWS-managed services is that much of the heavy lifting is done automatically, such as data being migrated from the memory tier to the magnetic tier as it ages.
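For illustration, and assuming boto3 (the retention values below mirror our choices, but yours may differ), creating a per-tenant table with these two tiers might look like:

import boto3

timestream_write = boto3.client("timestream-write")


def create_tenant_table(tenant_id: str) -> None:
    """Create a tenant's metrics table with memory and magnetic retention tiers."""
    timestream_write.create_table(
        DatabaseName="Metrics",
        TableName=tenant_id,
        RetentionProperties={
            "MemoryStoreRetentionPeriodInHours": 14 * 24,  # two weeks of fast in-memory queries
            "MagneticStoreRetentionPeriodInDays": 365,     # one year of cheaper long-term storage
        },
    )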

Other limitations include the number of tables per account (a 50K threshold), a 10MB minimum charge per query, and query latency of around a second – which might not be as fast as other engines, but the cost advantage was significant enough for us to compromise on query speed. Therefore, you should be aware of the TCO and optimize queries so they make the most of the 10MB minimum (and scan even more where possible), while also reducing latency for clients. A good method to combat this is to cache data rather than run a full query in real time, and to consolidate multiple metrics into a single query through unions.
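As a hedged illustration of that consolidation idea (the metric names and time window are made up), several per-metric queries against the same tenant table can be folded into one UNION ALL statement, so a single 10MB query minimum is incurred instead of several:

import boto3

timestream_query = boto3.client("timestream-query")


def query_metrics_consolidated(tenant_id: str, metric_names: list) -> dict:
    """Fetch several metrics for one tenant with a single consolidated query."""
    parts = [
        f"SELECT '{name}' AS metric, time, measure_value::bigint AS value "
        f'FROM "Metrics"."{tenant_id}" '
        f"WHERE metric_name = '{name}' AND time BETWEEN ago(30d) AND now()"
        for name in metric_names
    ]
    result = timestream_query.query(QueryString=" UNION ALL ".join(parts))

    # Group the flat result rows back by metric name for the caller.
    grouped = {name: [] for name in metric_names}
    for row in result["Rows"]:
        metric, time_value, value = (col.get("ScalarValue") for col in row["Data"])
        grouped[metric].append((time_value, value))
    return grouped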

Serverless Lego FTW!

By leveraging existing AWS services over a serverless architecture, we were able to ramp up our analytics capabilities quite quickly, with little management and maintenance overhead, and with a low-cost, pay-per-use model that keeps us cost effective. The greatest part of this scalable and flexible system is that it also makes it easy to add new metrics as our clients’ needs evolve.

Since all the events already exist in the system and are parsed through event buses, adding a new, relevant metric is an easy addition to the existing framework: create the relevant transformation, and you have a new metric in the system that you can query nearly instantaneously.

Through this framework, it is also easy to add “consumers” in the future, leveraging the same aggregated data. By building upon serverless building blocks, like Lego, it was possible to develop a scalable solution that supports a large and growing number of metrics in parallel, while future-proofing the architecture as business and technology requirements continuously evolve.


