Much as law came to the old west later than the original settlers did, security is coming late to big data. It’s an ironic situation, since so many of the big data projects being developed and experimented with in recent years are themselves intended as security solutions.

In the paper ‘A System Architecture for the Detection of Insider Attacks in Big Data Systems’, IEEE researchers Santosh Aditham and Nagarajan Ranganathan explore a new system architecture, run over Hadoop and Spark instances, which addresses big data security via, amongst other things, the same kind of hash-based profiling that defines the blockchain. Data unit replication is used as a signature against undesirable changes from those who already have access to the system, providing accountability and change history.

[Figure: Proposed system architecture for detecting insider attacks in big data systems]

But the centre of the proposed system is an attack detection algorithm. It’s here that the traditional trade-off between security processes and general performance takes place, since the detection routines run in a sandboxed secondary node similar to the heuristic sandboxes in consumer security software. However, the system adheres to Hadoop’s stipulation that security processes should not affect operations by more than 3%.

The detection routines run in secure space and perform hash matching against known and hashed modifications, seeking undocumented commits.
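As a rough illustration of the idea, and not the researchers' actual algorithm, the sketch below flags any data unit whose current hash does not appear among the hashes of its documented versions. The choice of SHA-256, the function names and the sample block names are all assumptions made for this example.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of data as a hex string."""
    return hashlib.sha256(data).hexdigest()


def find_undocumented_changes(observed_units, known_hashes):
    """Return the names of units whose current hash is not a documented one.

    observed_units: maps unit name -> current content (bytes).
    known_hashes:   maps unit name -> set of hashes of documented versions.
    Both structures are hypothetical, chosen only for this illustration.
    """
    return sorted(
        name
        for name, content in observed_units.items()
        if sha256_hex(content) not in known_hashes.get(name, set())
    )


# Documented history: block-1 has two known versions, block-2 has one.
known = {
    "block-1": {sha256_hex(b"v1"), sha256_hex(b"v2")},
    "block-2": {sha256_hex(b"alpha")},
}

# Current state: block-2 has been altered without a documented commit.
observed = {"block-1": b"v2", "block-2": b"alpha-tampered"}

suspicious = find_undocumented_changes(observed, known)
print(suspicious)
```

Here only the altered unit is reported, while a unit that matches any of its documented versions passes silently.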

The individual hash matches performed during security runs ultimately contribute to a total program hash, effectively a checksum that will be altered by a change in any of the sub-hashes. The use of secure protocols, hashing, and the overhead needed for decryption at runtime make the system’s adherence to the recommended maximum delay an apparently impressive achievement.
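One way to picture a total program hash is as a hash computed over the sub-hashes themselves. The sketch below is a minimal illustration under that assumption; the paper's actual construction may differ, and the SHA-256 choice, the function names and the sample units are hypothetical.

```python
import hashlib


def unit_hash(data: bytes) -> str:
    """Hash a single unit (a sub-hash in the article's terms)."""
    return hashlib.sha256(data).hexdigest()


def program_hash(units: dict) -> str:
    """Fold every sub-hash into one digest for the whole program.

    Units are visited in sorted name order so the result does not depend
    on dictionary insertion order.
    """
    outer = hashlib.sha256()
    for name in sorted(units):
        outer.update(name.encode("utf-8"))
        outer.update(b"\0")  # separator between name and sub-hash
        outer.update(unit_hash(units[name]).encode("ascii"))
        outer.update(b"\n")  # separator between entries
    return outer.hexdigest()


original = {"map": b"map-step", "reduce": b"reduce-step"}
tampered = {"map": b"map-step", "reduce": b"reduce-step!"}

# A change to any single unit changes the total program hash.
print(program_hash(original) == program_hash(tampered))
```

Because the outer digest covers every sub-hash, comparing one value is enough to tell whether anything changed, although locating the change still requires comparing the individual sub-hashes.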

The ultimate practical application of the system requires that the security modules operate in isolation from the central big data workflow, and to this end, the researchers have devised a proposition involving bespoke security hardware chips. This decouples the security system from the big data platform itself and thus has no effect on scalability – within the limits of the security system to handle throughput as it scales up.

However, it’s not exactly a ‘bolt-on’ solution, since the modules will operate at best efficiency according to their proximity to the main processor node. The nearer the placement, the more efficient the process and the lower the cost.

In testing the system, the researchers made use of AWS-supported Hadoop and Spark clusters. The team reported minimal network performance for the Hadoop cluster (a five-node build over Amazon EC2 and EBS). The four-node Spark cluster used generic m1.large nodes on Amazon EC2 and EBS, each with 2 vCPUs and 7.5 GB of memory, and the researchers report moderate network performance with this configuration.

Commenting on the developing lag between security research and functionality exploration in big data applications, the paper observes that there is ‘an immediate need to address architectural loopholes in order to provide better security. For instance, current big data security platforms focus on providing fine-grained security through extensive analysis of stored data. But such models indirectly facilitate the abuse of user data in the hands of the provider.’