Managing big data

by Mark Rowe

Dr James Kent, Head of Investigations at Nuix, looks at the challenges presented by investigations involving huge volumes of complex digital evidence, and explains how a content-based forensic triage approach can greatly increase efficiency for time- and resource-limited investigators.

The Big Data era has pushed digital forensics and investigations to crisis point. Growing volumes of digital storage are making the traditional methods for evaluating electronic evidence unsustainable. The list of devices involved in investigations that contain electronic storage, and potential evidence, grows every year. Digital investigators regularly encounter evidence stored in laptop and desktop computers, smartphones, tablets, digital cameras, flash memory devices and cloud storage services, to name just a few.

Despite these changes, many investigators steadfastly stick to the traditional method of analysing each data repository individually using forensic tools, then manually correlating the evidence they have uncovered. This approach has become immensely time-consuming, leading to large backlogs of unsolved cases.

In recent years, we have seen law enforcement and corporate investigators take a different approach, which achieves results as good as or better than traditional methods, but much faster. Content-based forensic triage involves collecting all available data in a single storage location, then using a combination of data management, analytical and forensic techniques to understand the content and context of digital evidence. This makes it possible to focus rapidly on the most relevant evidence sources until the key facts emerge.

Data growth

According to technology analyst firm Gartner, the often-used term ‘Big Data’ does not simply refer to large volumes of information. Rather, it means “high-volume, -velocity and -variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making.”

In my experience as a forensic investigator, I have seen this data – ever growing, moving, changing and becoming more complex – stretch most investigators to capacity.

Increasing volume and variety

As a rule of thumb, the number of devices containing data involved in a typical investigation doubles every two years, and the volume of data grows even faster.

Information technology analyst firm IDC estimated that in 2012, the average adult in the developed world generated 1.8 terabytes of data annually. However, the total “ambient information” in the digital universe about each person extended to 4.1 terabytes.

As for variety, the United Kingdom Association of Chief Police Officers’ most recent Good Practice Guide For Computer-Based Electronic Evidence recommends officers at a crime scene should seize devices including PC or laptop main units, external hard drives, dongles, modems, wireless network cards, routers, digital cameras, floppy disks, backup tapes, Jaz/Zip cartridges, CD-ROMs, DVD-ROMs, PCMCIA cards, memory sticks, memory cards and all USB or FireWire-connected devices.

Growing complexity

As well as increasing in volume and variety, we are seeing digital evidence become more complex. For investigators and regulators working in corporate environments, evidence can be stored in file shares, email databases, email archives, collaboration and document management systems, among others.

These repositories have intricate ways of storing and embedding data multiple levels deep. They often use closed, proprietary formats that typically require a vendor-supplied software interface to read the information within them.

Traditional forensic tools

When handling electronic evidence, most investigators continue to apply traditional forensic tools and methodologies. For each evidence repository, they typically:

• Plug the device into a write blocker
• Acquire a forensic image of the entire device
• Make a copy of the forensic image and verify it against the original (see the hashing sketch after this list)
• Analyse the data stored on the forensic image copy
• Write a report on the results of this analysis.
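
In practice, the copy is verified with cryptographic hashes: if the working copy hashes to the same value as the original image, it is a bit-for-bit duplicate. Below is a minimal Python sketch of that check, assuming hypothetical image paths; real casework relies on validated forensic tooling rather than ad hoc scripts.

```python
import hashlib
from pathlib import Path

def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file through SHA-256 so multi-gigabyte images never sit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Hypothetical paths: the acquired image and the analyst's working copy.
original = Path("evidence/laptop01.dd")
working_copy = Path("work/laptop01.dd")

if sha256_of(original) == sha256_of(working_copy):
    print("Hashes match: analyse the working copy.")
else:
    print("Hash mismatch: the copy cannot be trusted.")
```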

An investigator would then repeat this process for each device related to the case. Only once it was complete for every device could investigators use human brainpower to find connections and correlations between the data sources.

I have seen this approach encounter a number of limitations, particularly in light of the growing number of devices and volume of data investigators must examine.

Traditional forensic tools can only effectively analyse one repository at a time, and cannot thoroughly analyse complex information stores such as Lotus Notes and Microsoft Exchange. They struggle to process large volumes of data in a reasonable timeframe. Finally, they do not automatically identify and organise important intelligence such as names, email addresses, phone numbers and credit card numbers, so investigators have to know what they are looking for.

Content-based forensic triage

In recent years, we have seen law enforcement and corporate investigators take a different approach. As outlined above, content-based forensic triage collects all available data in a single storage location, then applies a combination of data management, analytical and forensic techniques to focus on the most critical evidence sources until the key facts emerge. In my opinion, it achieves results as good as or better than traditional forensic methods, but much faster and more efficiently. The content-based forensic triage process follows seven logical steps:

1) Ingest all data: The first stage of this process requires ingesting all data sources into a single repository.

2) Conduct a light metadata scan: A light metadata scan tabulates information such as the owner or sender, size, format, file name or subject line and relevant dates for each file, email message and attachment in the evidence store. It does not extract the full text of each item, and is therefore much faster than full indexing.
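
At the filesystem level, a light scan can be as simple as walking the evidence tree and recording attributes without ever opening a file. A minimal Python sketch follows, assuming a hypothetical read-only mount of the evidence; real triage tools also reach inside container formats such as email stores.

```python
import csv
from datetime import datetime, timezone
from pathlib import Path

def light_metadata_scan(evidence_root: str, out_csv: str) -> None:
    """Tabulate name, size, type and modified time without reading file contents."""
    with open(out_csv, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["path", "size_bytes", "extension", "modified_utc"])
        for path in Path(evidence_root).rglob("*"):
            if path.is_file():
                info = path.stat()
                writer.writerow([
                    str(path),
                    info.st_size,
                    path.suffix.lower(),
                    datetime.fromtimestamp(info.st_mtime, tz=timezone.utc).isoformat(),
                ])

light_metadata_scan("evidence/", "metadata_scan.csv")  # hypothetical paths
```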

3) Analyse relationships between people and evidence: Using techniques such as network diagrams and timelines, investigators can see connections and flows of information between suspects or custodians. This can help quickly narrow down dates, data sources and people to examine in greater depth.
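
As a toy illustration of the idea, the sketch below weights sender-to-recipient connections by message count using hypothetical email metadata records; a production tool would draw this as an interactive network diagram rather than print a list.

```python
from collections import Counter

# Hypothetical records from a metadata scan of an email store:
# (sender, recipient, date) tuples -- no message bodies needed at this stage.
messages = [
    ("alice@example.com", "bob@example.com", "2013-02-01"),
    ("alice@example.com", "carol@example.com", "2013-02-03"),
    ("bob@example.com", "alice@example.com", "2013-02-04"),
    ("alice@example.com", "bob@example.com", "2013-02-10"),
]

# Weight each directed edge by message count; the heaviest edges point to
# the relationships most worth examining in depth.
edges = Counter((sender, recipient) for sender, recipient, _ in messages)
for (sender, recipient), count in edges.most_common():
    print(f"{sender} -> {recipient}: {count} message(s)")
```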

4) Deeply index relevant data sources: Having identified the most likely evidence sources, investigators now extract full text and metadata from them. At this stage, the speed and thoroughness of the indexing tool are critical.
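
The result of deep indexing is essentially an inverted index: a map from each term to the items that contain it. A toy Python version follows, built over a hypothetical directory of already-extracted plain text; real tools must first pull that text out of proprietary formats.

```python
import re
from collections import defaultdict
from pathlib import Path

def build_index(text_dir: str) -> dict[str, set[str]]:
    """Map each lower-cased token to the set of documents that contain it."""
    index: dict[str, set[str]] = defaultdict(set)
    for path in Path(text_dir).glob("*.txt"):
        for token in re.findall(r"[a-z0-9']+", path.read_text(errors="ignore").lower()):
            index[token].add(path.name)
    return index

index = build_index("extracted_text/")  # hypothetical directory
```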

5) Search and investigate: A complete index of data and metadata enables investigators to conduct their standard workflows to search for evidence across all sources at the same time. They can also use a range of sophisticated searching and analysis techniques, some of which originated in complementary fields such as legal discovery and information governance.
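
Once such an index exists, searching every source at once reduces to set operations over it. Continuing the previous sketch, with hypothetical query terms:

```python
def search_all(index: dict[str, set[str]], *terms: str) -> set[str]:
    """Return the documents containing every query term (a simple AND search)."""
    hits = [index.get(term.lower(), set()) for term in terms]
    return set.intersection(*hits) if hits else set()

# Hypothetical query against the index built in the previous sketch.
print(sorted(search_all(index, "invoice", "offshore")))
```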

6) Cross-reference intelligence: Advanced investigative tools can automatically extract and highlight intelligence items including names, email addresses, IP addresses, credit card numbers, bank account numbers and amounts of money.

Cross-referencing this intelligence across all available evidence can rapidly reveal relationships between people and entities, deliver points to prove and also offer broader intelligence. It brings to light connections that human investigators might miss.
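
To illustrate both halves of this step, the sketch below extracts two intelligence types from hypothetical extracted text using deliberately simplified regular expressions (production recognisers are far more robust, and validate card numbers with a Luhn checksum), then flags items appearing in more than one source.

```python
import re
from collections import defaultdict
from pathlib import Path

# Simplified patterns for illustration only.
PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "card":  re.compile(r"\b(?:\d[ -]?){12,15}\d\b"),
}

def cross_reference(text_dir: str) -> dict[tuple[str, str], set[str]]:
    """Map each (type, value) intelligence item to the documents it appears in."""
    hits: dict[tuple[str, str], set[str]] = defaultdict(set)
    for path in Path(text_dir).glob("*.txt"):
        text = path.read_text(errors="ignore")
        for kind, pattern in PATTERNS.items():
            for value in pattern.findall(text):
                hits[(kind, value)].add(path.name)
    return hits

# Items seen in more than one source are candidate links between evidence sets.
for item, sources in cross_reference("extracted_text/").items():  # hypothetical path
    if len(sources) > 1:
        print(item, "->", sorted(sources))
```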

7) Forensically examine only the most relevant data sources: In most cases, this process will already have located the critical facts of the case. If not, it will almost certainly have provided clues as to where such information is hidden. Investigators can then use their digital forensics skills to dig deep into the likeliest evidence sources. In this way, they avoid wasting countless hours forensically analysing irrelevant material.

Benefits

I know today’s investigators, however brilliant, can’t hope to consistently and accurately cross-reference and find correlations across millions of data points. It is easy to miss connections, particularly without an automated way to identify intelligence items. As a result, investigating digital media for ‘points to prove’ or ‘elements of the crime’ is the norm. Investigators rarely have the luxury of higher-level analysis.

Content-based forensic triage gives investigators access to data stored in complex corporate repositories and cloud-based services, and automatically cross-references the intelligence within them, revealing connections that may not be immediately obvious. Additionally, advanced techniques such as word clusters, which deliver more relevant results and fewer false positives than basic keyword searches, together with visual analysis of data, make it much easier to detect trends and isolate outliers across massive volumes of evidence.

Content-based forensic triage reduces repetitive tasks by automating workflows, saving human brainpower and person-hours and providing a powerful way to address investigative backlogs and unsolved cases.
