By Eran Levy — May 6, 2023

Making sense of the security data lake

Can new offerings solve old problems?

Executive Summary:

Security data lakes are touted as a way to store and query petabytes of security logs in an efficient, scalable, and cost-effective way.
Vendors and proponents claim they improve upon SIEMs by reducing costs, increasing flexibility, enabling enhanced rule implementation, and speeding up processing.
Amazon and Matano have introduced new security data lake offerings in recent months.
Criticisms of security data lakes highlight potential project failures, messy data, and lack of value for security use cases.
The overall trend in data infrastructure is towards hybrid solutions such as lakehouses and next-gen data warehouses, rather than ‘pure’ data lake approaches.
Organizations should approach these offerings cautiously and iteratively, and consider shifting certain workloads to a security lake rather than completely overhauling their existing stack.

What is a security data lake?

A security data lake is an architectural pattern meant to provide an efficient, scalable, and cost-effective way to store and analyze security logs, making it easier for security teams to detect threats and send alerts.

The security data lake has emerged in recent years to address the ever-growing volumes of security data and the complexity of modern cyber threats. The idea is to have a centralized repository for storing and analyzing large volumes of security logs from various sources (network, host, cloud, and SaaS audit logs). Logs are normalized into a common structure, processed into an optimized columnar format (e.g., Apache Parquet), and stored in object storage such as Amazon S3.
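
To make the normalization step concrete, here is a minimal Python sketch. The field names, the two source formats, and the common schema are all hypothetical and chosen only for illustration. A real pipeline would more likely adopt a published schema and then write the normalized records to a columnar format such as Parquet using a dedicated library, a step this sketch leaves out.

```python
from datetime import datetime, timezone

# Hypothetical common schema, invented for this example.
COMMON_FIELDS = ("timestamp", "source", "actor", "action", "outcome")


def normalize_cloud_audit(raw: dict) -> dict:
    """Map a made-up cloud audit record onto the common schema."""
    return {
        "timestamp": raw["time"],
        "source": "cloud_audit",
        "actor": raw["user"],
        "action": raw["operation"],
        "outcome": "failure" if raw.get("error") else "success",
    }


def normalize_host_auth(raw: dict) -> dict:
    """Map a made-up host login record (epoch seconds) onto the common schema."""
    ts = datetime.fromtimestamp(raw["epoch"], tz=timezone.utc)
    return {
        "timestamp": ts.strftime("%Y-%m-%dT%H:%M:%SZ"),
        "source": "host_auth",
        "actor": raw["user"],
        "action": "login",
        "outcome": "success" if raw["ok"] else "failure",
    }


# One normalizer per log source; adding a source means adding one entry.
NORMALIZERS = {
    "cloud_audit": normalize_cloud_audit,
    "host_auth": normalize_host_auth,
}


def normalize(kind: str, raw: dict) -> dict:
    """Normalize a raw record of the given kind, checking the output shape."""
    record = NORMALIZERS[kind](raw)
    if tuple(record) != COMMON_FIELDS:
        raise ValueError(f"normalizer for {kind!r} broke the common schema")
    return record


if __name__ == "__main__":
    print(normalize("host_auth", {"epoch": 0, "user": "alice", "ok": False}))
```

Once every source produces the same five fields, downstream queries and detection rules can be written once instead of once per log format, which is the main point of normalizing before storage.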

This type of implementation is meant to help security engineers handle vast amounts of unstructured security data in real time, and in some cases to analyze this data directly from object storage. Doing so bypasses the need for databases and further reduces the costs and complexity associated with analyzing massive amounts of security data.

Definitions vary. As with many concepts in the data and cybersecurity space, definitions will vary according to the influencer or vendor you’re hearing from. According to some, security data lakes are not meant to replace SIEMs entirely. In this version, the security lake is a place to collect, transform, and normalize security logs, but it does not provide search or analytics functions. Others will add these capabilities to provide a more ‘full-stack’ solution, in addition to the storage and ingestion layer.

Claimed advantages of security data lake over SIEM

Proponents of security data lakes argue that they offer several key advantages over traditional security information and event management (SIEM) systems, such as Splunk. We’ve summarized these claims below. Needless to say, even in the best case, you can’t say for sure that all of these will pan out in a particular implementation:

Scalability: Traditional SIEMs are often built on top of databases like Elasticsearch, which can make scaling expensive or difficult. In contrast, security data lakes store log data in cost-effective object storage solutions like Amazon S3, providing a more efficient way to store and analyze vast amounts of security data.
Lower costs: Enterprise SIEM vendors typically have expensive ingest-based licenses, making it challenging for organizations to manage costs at larger scales. Decoupled cloud storage is cheaper.
Flexibility and vendor lock-in: SIEM platforms often store data in proprietary formats, which can make it difficult to use the data outside of their ecosystem or switch to another analytics stack. The security lake can use an open format such as Parquet to improve interoperability.
Enhanced rule implementation and detection capabilities: SIEM systems may lack flexibility in implementing and managing complex detection rules, limiting analysts' ability to respond to evolving threats. Security data lakes give DIY-oriented organizations more flexibility in modeling complex attacker behaviors. If done correctly, this can reduce the number of false-positive alerts and improve overall detection capabilities.
Faster processing and response times: Some traditional SIEMs can take a long time to process and detect errors, which may not be acceptable for security purposes. A well-implemented security data lake can give analysts access to events closer to their time of occurrence, which could improve metrics like MTTD / MTTR.
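
As an illustration of what modeling attacker behavior in code can look like, the sketch below flags a successful login that follows a burst of failed attempts by the same user. It is a hypothetical, stand-alone example and not the rule API of any particular product. The event fields, the threshold of three failures, and the five-minute window are all assumptions made for the example.

```python
from collections import defaultdict, deque
from datetime import datetime, timedelta


def parse_ts(value: str) -> datetime:
    """Parse timestamps of the (assumed) form 2023-05-06T00:00:00Z."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")


def detect_bruteforce_then_success(events, threshold=3, window=timedelta(minutes=5)):
    """Yield an alert when an actor logs in successfully after at least
    `threshold` failed logins within `window`.

    `events` must be sorted by timestamp. Each event is a dict with the
    hypothetical keys: timestamp, actor, action, outcome.
    """
    recent_failures = defaultdict(deque)  # actor -> timestamps of recent failures
    for event in events:
        if event["action"] != "login":
            continue
        ts = parse_ts(event["timestamp"])
        failures = recent_failures[event["actor"]]
        # Drop failures that have aged out of the window.
        while failures and ts - failures[0] > window:
            failures.popleft()
        if event["outcome"] == "failure":
            failures.append(ts)
            continue
        # Successful login: alert if it follows enough recent failures.
        if len(failures) >= threshold:
            yield {
                "rule": "bruteforce_then_success",
                "actor": event["actor"],
                "failed_attempts": len(failures),
                "timestamp": event["timestamp"],
            }
        failures.clear()  # any successful login resets the count


if __name__ == "__main__":
    def login(ts, actor, outcome):
        return {"timestamp": ts, "actor": actor, "action": "login", "outcome": outcome}

    sample = [
        login("2023-05-06T00:00:00Z", "alice", "failure"),
        login("2023-05-06T00:00:10Z", "alice", "failure"),
        login("2023-05-06T00:00:20Z", "alice", "failure"),
        login("2023-05-06T00:00:30Z", "alice", "success"),
    ]
    for alert in detect_bruteforce_then_success(sample):
        print(alert)
```

A rule like this correlates several events per user over time, which a single keyword or threshold match cannot express. Whether it actually reduces false positives depends on tuning the threshold and window against real traffic.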

New offerings in this space

In recent months we’ve learned of two new offerings in the security data lake space, from Amazon and Matano. Both solutions promise similar scalability and efficiency gains, although the implementation details are quite different.

Amazon Security Lake, unveiled at the AWS re:Invent 2022 event, centralizes an organization's cloud and on-premises security data into a single location and simplifies its analysis at petabyte scale. Amazon claims that its offering provides centralized data visibility, open standard data normalization, improved and managed security data, and support for custom analytics and preferred tools.

Matano's security data lake offering is an open-source, cloud-native platform that deploys a vendor-agnostic data lake to your AWS account. According to the vendor, key features include ingesting petabytes of security and log data, storing and querying data in an open data lake, creating Python detections as code for real-time alerting, normalizing unstructured security logs, supporting native integrations, and avoiding vendor lock-in.

The difference: As best as we can tell, the AWS offering is not very different from their ‘normal’ data lake stack, which combines S3 storage with multiple Amazon services to operationalize the data. ‘All your logs are belong to us’, basically, with promises of cost savings down the road. Matano offers more security-native features, and of course is going for the open-source, vendor-agnostic angle.

Critical takes on the security lake

Not everyone is enthusiastic about the security lake as a paradigm, or about some of its specific implementations. Critics have pointed out that projects can fail due to dirty data, trouble with collecting and accessing data, lack of value beyond collection and keyword search, lack of threat detection value, and challenges in conceptualizing and defining security analytics use cases.

Other potential pitfalls include the usual challenges with unstructured data stores and essentially building a data platform from scratch — mostly in terms of creating a significant engineering burden, which often spirals out of control and far beyond initial estimations. This is especially problematic for organizations that suffer from immature data models and processes. The centralized nature of security lakes also means a single point of failure.

Amazon's Security Lake offering has also faced criticisms and concerns from users, such as skepticism about additional costs, the perception of being "half-baked," and concerns about its current state and future development.

Our take

The conversation around the security lake has many parallels with the general conversation about data lakes. But while other meandering, overdue data lake projects might only result in annoyed stakeholders not seeing their dashboards update on time, in cybersecurity, mess-ups are more problematic.

In general, the industry appears to be moving away from 'pure' data lake approaches towards hybrid solutions such as lakehouses and 'next-gen' data warehouses like Snowflake and BigQuery. Both of these approaches introduce a centralized, SQL-based layer on top of the raw data. This shift may offer improved manageability, accessibility, and analytics capabilities, addressing some of the challenges faced by traditional data lakes.

Finally, the skills shortage that affects engineering departments across the board, particularly in relation to cloud technologies, will not skip security data lake projects. Things that sound great on paper might look more grim when you realize you need to hire a new engineering group, all of whom are currently on a $300K contract with another company.

While there may be potential savings on infrastructure, it is essential to consider whether these savings could be offset by the additional costs associated with engineering hours. Lake-based approaches will always be more complicated; technology tends to reduce these overheads on the margin, but not remove them completely.

Security data lake projects may be best suited for the more ambitious and engineering-heavy companies. Even there, you should think about hybrid solutions - combining SIEM, the lake, and data warehouses for different workloads, and making changes to your architecture incrementally rather than all at once.
