Can data lakes solve cloud security challenges?

"Data Lake" is a proprietary term. "We have built a series of big data platforms that enable clients to inject any type of data and to secure access to individual elements of data inside the platform. We call that architecture the data lake," says Peter Guerra, Principal, Booze, Allen, Hamilton. Yet, these methods are not exclusive to Booze, Allen, Hamilton.

"I have read what's available about it," says Dr. Stefan Deutscher, Principal, IT Practice, Boston Consulting Group, speaking of the data lake; "I don't see what's new. To me, it seems like re-vetting available security concepts with a name that is more appealing." Still, the approach is gaining exposure under that name.

In fact, enterprises are showing enough interest that vendors are slapping the moniker on competing solutions. Such is the case with the Capgemini / Pivotal collaboration on the "business data lake," where the vendors use the name to distinguish their offering from competing data lake approaches.

This enterprise curiosity stems from real big data ills that need equally genuine cures. Enterprises from government agencies to large corporations and on down use big data inside public multitenant cloud environments. All the risks of multitenancy apply in these scenarios, including the vulnerabilities that come with another tenant's weaker security, potential access by users of an adjacent tenant, PII/PHI exposure, and the regulatory non-compliance that follows. Data lakes could protect big data from these perils of the public cloud.

But while Defense agencies need the protection data lakes offer for each individual data element, the typical enterprise does not. Nor can most enterprises afford the performance hit that comes with using data lakes in this way. That's why some vendors use data lakes to protect big data as a whole rather than piece by piece, avoiding the performance lag of the element-level approach. Enterprises in the market for solutions to the security challenges of the public cloud should consider one or both data lake approaches.

Securing data elements

"The overarching concept is the ability to pull in different types of data, tag that data, and enable users and administrators to secure the individual data elements within the data lake," says Guerra. Rather than deidentifying PII/PHI and providing data privacy on the whole, this data lake approach determines what pieces of data are sensitive and what pieces are not and works from there.

"We like to bring all the data into the data lake in its rawest format," says Guerra; "we don't do any extraction or transformation of data ahead of time." Instead, this approach tags each data element with a set of metadata tags and attributes that describes the data and how the IAM systems that access it should handle it.

According to Guerra, the IAM system enforces the security of individual data elements using rules based on XACML (the eXtensible Access Control Markup Language). An administrator or system writes rules in the IAM system, which enforces them when a user authenticates. The system passes the user's security authorizations to the big data architecture. "The big data architecture then matches the individual security authorizations with the XACML rules and returns only the appropriate data," says Guerra.
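
In practice the rules are XACML documents evaluated by a policy decision point; the toy Python below, continuing the sketch above, imitates only the matching step Guerra describes, in which a user's authorizations are compared against each element's tags and only permitted data comes back. The clearance ordering and category logic are assumptions for illustration.

```python
def authorized(user_attrs: dict, element_tags: dict) -> bool:
    """Toy stand-in for a policy decision point: permit only if the user's clearance
    covers the element's classification and the user holds one of its categories."""
    levels = ["public", "pii", "phi"]
    cleared = levels.index(user_attrs.get("clearance", "public"))
    required = levels.index(element_tags["classification"])
    shared = set(user_attrs.get("categories", ())) & element_tags["categories"]
    return cleared >= required and bool(shared)

def query(elements, user_attrs):
    """Return only the element values the authenticated user's attributes permit."""
    return [e.value for e in elements if authorized(user_attrs, e.tags)]

# An HR analyst cleared for PII sees the record; an uncleared user gets nothing.
print(query([element], {"clearance": "pii", "categories": {"hr"}}))      # ['123-45-6789']
print(query([element], {"clearance": "public", "categories": {"hr"}}))   # []
```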

Pros and cons

Data lakes still require role-based access, policies, and policy enforcement. "You use PKI to ensure the person is who they say they are and to bind their attributes to the platform that stores the individual data attributes to ensure that security is complete," says Guerra. The system needs policies and policy enforcement to permit or deny access based on the metadata tags and attributes, and it relies on a brokering layer that mediates data access requests in order to enforce the security policy.

"It's very difficult to implement those systems and attribute enforcements throughout the data lake platform stack," says Guerra. But Guerra has worked closely with clients to define policies, he says.

With this kind of system, a data assailant would have to break through the perimeter security around the data lake and through the security protecting the individual data elements in order to retrieve anything. The system uses PKI to cryptographically sign and enforce security tags for the data elements. "You can't change them nor can you break them. An attacker would have to break each tag in order to gain access to all data elements," says Guerra.
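
A rough sketch of that idea, with an HMAC standing in for a full PKI signature: binding a signature to each element and its tags makes any tampering detectable. The key and tag names here are hypothetical.

```python
import hashlib
import hmac
import json

# Hypothetical key; a real system would use keys and certificates issued by its PKI.
SIGNING_KEY = b"key-issued-by-your-pki"

def sign_tags(value: str, tags: dict) -> str:
    """Bind the security tags to the element so neither can be altered unnoticed."""
    payload = json.dumps({"value": value, "tags": tags}, sort_keys=True).encode()
    return hmac.new(SIGNING_KEY, payload, hashlib.sha256).hexdigest()

def verify_tags(value: str, tags: dict, signature: str) -> bool:
    """Reject the element if its value or tags changed after signing."""
    return hmac.compare_digest(sign_tags(value, tags), signature)

sig = sign_tags("123-45-6789", {"classification": "pii"})
print(verify_tags("123-45-6789", {"classification": "pii"}, sig))      # True
print(verify_tags("123-45-6789", {"classification": "public"}, sig))   # False: tampered tag
```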

However, this kind of approach requires an IAM system with attribute-based access control (ABAC). There are a number of ABAC vendors in the market, but system scalability and performance remain concerns with ABAC systems, according to NIST Special Publication 800-162, "Guide to Attribute Based Access Control (ABAC) Definition and Considerations" (January 2014).

But ABAC IAM systems in an unstructured data lake work differently from existing structured systems and legacy security solutions, says Jerry Irvine, CIO of Prescient Solutions and a member of the National Cyber Security Task Force. "Access and authorization controls within the data lake are distributed across multiple categories of service and systems," says Irvine. Distributing the controls this way offsets the potential for these IAM systems to experience load and performance issues at a single point of failure.

How data lakes identify and tag data from legacy platforms is another concern. "Most applications don't provide sufficient meta-information about data they generate," says Dr. Deutscher. This can make it difficult for data lakes to know how to tag data elements with attributes.

"We've handled that a couple of ways," says Guerra. One method is to query legacy systems and apply tagged attributes to the results. Another way is to classify legacy systems as a whole. A small subset of people can read an older financial transaction system, for example. "We integrate the output from that legacy system and pull it into the data lake," says Guerra. The data becomes part of the lake while retaining access rights for the appropriate people.

Ultimately, data lakes enable enterprises to ingest variegated data types swiftly and make them easier to process and exploit. "Because all the data is stored as unaltered, queries provide a more accurate report with a greater depth of information reported about the data," says Irvine. Data lakes give executive management higher levels of information, revealing correlations in the data that they may have overlooked and allowing them to make more intelligent decisions, Irvine notes.

Securing only the lake

"Data lakes can act as repositories of log file information, user information, and behavioral and transactional information about the user," says Steve Jones, Strategy Director, Big Data & Analytics, Capgemini. Enterprises can use massive amounts of data to establish a robust baseline of expected user behavior. With a fine grain model of normal behavior, data lakes can quickly and precisely detect anomalous behavior, intrusions, IP theft, and data leakage.

This data lake approach avoids the costs and performance lags of the element-level approach, which stem from enriching every single piece of data with the right metadata and from validating every query and hit on every piece of information against the security policy, Jones explains.

While the level of security detail in the other data lake approach is laudable, says Jones, it is probably too expensive for most enterprises. "The raw data that data lakes can store is, however, useful in securing a cloud approach by performing threat, intrusion, and anomalous behavior analysis," says Jones.

CSOs need to know what they are trying to achieve. "Is it fine-grained security in the defense sector, or simply a better way to create a 360-degree view of internal and external threats?" asks Jones. "Understanding the real business challenge will help them undertake the right approach."

For many, the simpler solution is the right one.
