Uprise SIEM Strategy

Root cause analysis

 

·      Why did the application go down?

·      Why are transactions failing?

·      Why is performance slow?

 

 

Any of these issues could be caused by system crashes, bugs in the code, a timeout in a third-party service, network problems, or other faults.

 

Steps:

1.     Look at Loggly aggregated log metrics to see if any problems “jump out,” for example a spike in error rates or a spike in traffic from a few IP addresses.

2.     If you have the user’s email or other unique identifier, use it to trace the transaction through the stack.

3.     When you have some ideas of what to look for, search your logs to pinpoint the root cause.
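The first check above, looking for a spike in error rates or in traffic from a few IP addresses, can be sketched in Python. This is only an illustration: it assumes log events have already been exported as dictionaries, and the field names `timestamp`, `status`, and `ip` are hypothetical rather than Loggly's schema.

```python
from collections import Counter
from datetime import datetime


def error_rate_by_minute(events):
    """Return {minute: fraction of events in that minute with status >= 500}."""
    totals, errors = Counter(), Counter()
    for event in events:
        minute = datetime.fromisoformat(event["timestamp"]).strftime("%Y-%m-%d %H:%M")
        totals[minute] += 1
        if event["status"] >= 500:
            errors[minute] += 1
    return {minute: errors[minute] / totals[minute] for minute in totals}


def top_ips(events, n=3):
    """Return the n IP addresses that sent the most requests, with counts."""
    return Counter(event["ip"] for event in events).most_common(n)


# Hypothetical sample events, for illustration only.
events = [
    {"timestamp": "2024-01-01T10:00:05", "status": 200, "ip": "10.0.0.1"},
    {"timestamp": "2024-01-01T10:00:40", "status": 500, "ip": "10.0.0.2"},
    {"timestamp": "2024-01-01T10:01:10", "status": 502, "ip": "10.0.0.2"},
    {"timestamp": "2024-01-01T10:01:30", "status": 503, "ip": "10.0.0.2"},
]

print(error_rate_by_minute(events))
print(top_ips(events, n=1))
```

A minute whose error rate is far above its neighbours, or a single IP that dominates the request count, is a candidate for the closer search in the later steps.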

 

Error and exception reporting

 

·      How can I see the top errors or exceptions affecting the service of Uprise?

·      How will I know when there is a big spike in errors?

 

Errors and exceptions do happen in the normal course of business. But if there are more than usual, something is probably wrong with the application.

 

Steps:

 

1.     Search all logs and filter the data for errors and exceptions. For example, filter on 4xx and 5xx status codes from the server (nginx).

 

2.     Create saved searches that aggregate errors for the key components of the stack.

 

3.     Track these searches by building a custom dashboard in Loggly. Monitor the error and exception dashboards every time you push new code.

 

4.     Create alerts to let you know of major spikes in errors.

 

5.     Set alerting thresholds based on the “normal” number of errors and the impact that the error has on user success. For example, an error in the booking process is critical, so its threshold should be low.
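As a rough sketch of the filtering and alerting steps above, the following Python counts 4xx and 5xx responses in nginx access log lines and compares a count against a baseline. It assumes nginx's default "combined" log format; the sample lines, the `baseline` value, and the `factor` multiplier are hypothetical and would need tuning to what is normal for each component.

```python
import re
from collections import Counter

# Matches the request and status fields of nginx's default "combined" format.
LINE_RE = re.compile(r'"(?P<method>[A-Z]+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) ')


def count_errors(lines):
    """Count 4xx and 5xx responses, keyed by (status, path)."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if match and match.group("status")[0] in "45":
            counts[(int(match.group("status")), match.group("path"))] += 1
    return counts


def should_alert(error_count, baseline, factor=3.0):
    """Alert when errors exceed `factor` times the normal baseline (assumed rule)."""
    return error_count > baseline * factor


# Hypothetical access log lines, for illustration only.
lines = [
    '203.0.113.5 - - [10/Oct/2023:13:55:36 +0000] "GET /booking HTTP/1.1" 500 162 "-" "curl/8.0"',
    '203.0.113.5 - - [10/Oct/2023:13:55:37 +0000] "GET /booking HTTP/1.1" 500 162 "-" "curl/8.0"',
    '203.0.113.9 - - [10/Oct/2023:13:55:38 +0000] "GET /missing HTTP/1.1" 404 153 "-" "curl/8.0"',
    '203.0.113.9 - - [10/Oct/2023:13:55:39 +0000] "GET / HTTP/1.1" 200 612 "-" "curl/8.0"',
]

for (status, path), count in count_errors(lines).most_common():
    print(status, path, count)
```

A critical path such as the booking process could be given a lower `factor` than a path where occasional errors are expected.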

 

Performance monitoring

 

·      How is my technology stack performing?

·      Is my site any slower than expected?

·      Are transactions completing successfully?

 

With the right logs, you have all the data you need to monitor performance and pinpoint problems at all levels in your stack.

 

Of course, you need to act quickly to resolve problems in order to keep your users happy and your business KPIs on track.

 

1.     Pull in AWS CloudWatch data that characterizes the performance of the application.

 

For example:

·      web server logs

·      database slow query logs

·      application-specific performance metrics logged as JSON

·      response times for third-party services

 

2.     Visualize performance by charting this data in Loggly.

 

3.     Determine parameters for unacceptable performance in key components of the application and set alerts to trigger when these parameters are exceeded.
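One way to express "parameters for unacceptable performance" is a latency percentile limit per component. The sketch below is an assumption about how that might look, not a Loggly or CloudWatch feature: the component names, sample values, and limits are all hypothetical, and the percentile uses the simple nearest-rank method.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list, with pct in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def breaches(samples_by_component, limits_ms, pct=95):
    """Return the components whose pct-th percentile latency exceeds its limit."""
    return sorted(
        name
        for name, samples in samples_by_component.items()
        if percentile(samples, pct) > limits_ms[name]
    )


# Hypothetical response times in milliseconds, for illustration only.
samples = {
    "web": [120, 130, 110, 900, 125],
    "database": [20, 25, 22, 21, 24],
}
limits = {"web": 500, "database": 100}

print(breaches(samples, limits))
```

Using a percentile rather than an average keeps a handful of very slow requests visible instead of letting them be smoothed away.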

 

Transaction and request tracing

 

·      Why did a high-value transaction fail?

·      Which step of the transaction had a problem?

 

If the logs include any of the following:

·      a unique user identifier (such as a globally unique identifier (GUID) or an email address)

·      a unique number

·      an API key

·      a session ID

then the progress of a user can be traced through the entire application.

 

This type of analysis can be very useful in spotting elusive technical problems.

 

Steps:

 

1. Incorporate a unique user identifier into all log events.

2. Search for the unique identifier to trace logs across the stack.

3. Follow the user through the transaction process.
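The three steps above can be sketched as follows. This is a minimal illustration that assumes each component writes JSON log lines; the field names `ts`, `trace_id`, `component`, and `message` and the sample identifiers are hypothetical.

```python
import json


def make_event(trace_id, component, message, ts):
    """Serialize one log event as JSON, always including the trace identifier."""
    return json.dumps(
        {"ts": ts, "trace_id": trace_id, "component": component, "message": message}
    )


def trace(log_lines, trace_id):
    """Return (component, message) pairs for one identifier, in time order."""
    events = [json.loads(line) for line in log_lines]
    matching = sorted(
        (e for e in events if e["trace_id"] == trace_id), key=lambda e: e["ts"]
    )
    return [(e["component"], e["message"]) for e in matching]


# Hypothetical log lines from several components, arriving out of order.
log_lines = [
    make_event("abc-123", "payments", "card declined", 3),
    make_event("abc-123", "nginx", "POST /booking", 1),
    make_event("zzz-999", "nginx", "GET /", 2),
    make_event("abc-123", "api", "booking started", 2),
]

for component, message in trace(log_lines, "abc-123"):
    print(component, message)
```

Because every event carries the same identifier, a single search reconstructs the path of the transaction and shows which component reported the failure.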

 

Trend analysis and planning

 

·      What are the biggest opportunities to improve the success of the application?

·      How is the capacity usage changing as Uprise scales?

·      Is Uprise getting the most out of our investments in AWS?

·      How should we prioritize our development efforts?

 

Taken as a whole, log data gives us a bird’s-eye view of what is happening with the application now, and these insights can inform where to go next.

 

Use SIEM to optimize the application’s performance, prioritize development projects, and understand future capacity needs.

 

Steps:

 

1.     Build a metrics API for every component in the technology stack. The API provides the stats for the components in the service. For example, it could provide input bytes received per second (bps), processed bytes per second, transactions per second, or a list of the top 10, 20, or 50 users based on received bps.

2.     Log these metrics and send the data to the log management system.

3.     Analyse historical behaviour to understand how the app is scaling and what changes will have the most impact on the application.
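A per-component metrics API of the kind described above might look like the following sketch. The class name `ComponentMetrics`, its methods, and the output keys are hypothetical, chosen only to illustrate the stats mentioned (received bytes per second, transactions per second, top users); the snapshot is a plain dictionary so it can be logged as JSON and sent to the log management system.

```python
import json
from collections import Counter


class ComponentMetrics:
    """Collects simple throughput stats for one component (illustrative only)."""

    def __init__(self, name):
        self.name = name
        self.bytes_by_user = Counter()
        self.transactions = 0

    def record(self, user, num_bytes):
        """Record one transaction of num_bytes received from user."""
        self.bytes_by_user[user] += num_bytes
        self.transactions += 1

    def snapshot(self, window_seconds, top_n=10):
        """Return stats for a window of the given length, ready to be logged."""
        total = sum(self.bytes_by_user.values())
        return {
            "component": self.name,
            "received_bytes_per_sec": total / window_seconds,
            "transactions_per_sec": self.transactions / window_seconds,
            "top_users": self.bytes_by_user.most_common(top_n),
        }


# Hypothetical usage for a component named "ingest".
metrics = ComponentMetrics("ingest")
metrics.record("alice", 600)
metrics.record("bob", 300)
metrics.record("alice", 300)

print(json.dumps(metrics.snapshot(window_seconds=60, top_n=1)))
```

Emitting one such snapshot per component per interval gives the historical series that the trend analysis step depends on.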

 

Tracking unusual activity

 

·      Is this pattern normal?

 

Suspicious events can be tracked in logs where login attempts and other system activities are recorded.

 

Proactive, automated detection of unusual activity is a requirement.

 

It may not be possible to know every potential attack pattern in advance.

 

Multiple failed login attempts by a user might be considered normal, but hundreds or thousands of failed login attempts might point to a brute-force or dictionary attack.

Steps:

Use the SIEM tool to detect, for example:

 

·      excessive traffic from a single IP address

·      bot activity

·      unusual logins to AWS services

·      any behaviour patterns that users wouldn’t exhibit in real life
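The failed-login case can be sketched as a simple threshold rule. This is an assumption about one possible detection rule, not a feature of any particular SIEM tool: the event fields `source`, `action`, and `success` are hypothetical, and the `max_failures` value of 100 is arbitrary and would need to be set from what is normal for the application.

```python
from collections import Counter


def suspicious_sources(events, max_failures=100):
    """Return {source: count} for sources with more than max_failures failed logins."""
    failures = Counter(
        e["source"] for e in events if e["action"] == "login" and not e["success"]
    )
    return {source: count for source, count in failures.items() if count > max_failures}


# Hypothetical events: one source failing 250 times, another failing 3 times.
events = (
    [{"source": "198.51.100.7", "action": "login", "success": False}] * 250
    + [{"source": "192.0.2.10", "action": "login", "success": False}] * 3
    + [{"source": "192.0.2.10", "action": "login", "success": True}]
)

print(suspicious_sources(events))
```

A fixed threshold only catches patterns that were anticipated, so it complements automated detection of unusual activity and does not replace it.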

 

 

SIEM evolution

 

Reaching a mature SIEM configuration takes time, and maintaining it is an ongoing process.

 

Alerts need to be constantly maintained and updated as the environment changes and grows.

Jay Spence