SEC595: Applied Data Science and AI/Machine Learning for Cybersecurity Professionals

GIAC Machine Learning Engineer (GMLE)
  • In Person (6 days)
  • Online
36 CPEs

SEC595 provides students with a crash-course introduction to practical data science, statistics, probability, machine learning, and AI. The course is structured as a series of short discussions with extensive hands-on labs that help students develop a useful, intuitive understanding of how these concepts relate and how they can be used to solve real-world problems. The best analogy is that we are using an apprenticeship approach to bring you from beginner to journeyman in AI and related fields. If you've never done anything with data science or machine learning but want to use these AI techniques, this is definitely the course for you!

30 Hands-on Labs

What You Will Learn

Harness Data Science and AI for Advanced Cybersecurity Threat Hunting Solutions

Data Science, Artificial Intelligence, and Machine Learning aren't just the current buzzwords; they are fast becoming some of the primary tools in our information security arsenal. The problem is that, unless you have a degree in mathematics or data science, you're likely at the mercy of the vendors. This course demystifies machine learning and data science. More than 70% of class time is spent solving machine learning and data science problems hands-on rather than just talking about them. You will leave the class understanding not only how these tools and techniques work, but also how to think about your data and shape it into something to which you can apply machine learning and AI techniques.

Unlike other courses in this space, this course is squarely centered on solving information security problems; in other words, it is applied rather than theoretical. Where other courses tend toward the extremes, teaching almost all theory or solving trivial problems that don't translate into the real world, this course strikes a balance. We cover only the mathematical theory and fundamentals you absolutely must know, and only so that you can understand and apply the machine learning tools and techniques effectively. We show you how the math works but don't expect you to do it. The course progressively introduces and applies various statistical, probabilistic, and mathematical tools (in their applied form), so that you leave able to use those tools and to troubleshoot your results, having developed strong intuitions about the underlying mathematics. The hands-on projects were selected to give you a broad base from which to build your own machine learning solutions. If you want or need to know how AI tools like ChatGPT really work so that you can intelligently discuss their potential uses in your organization, and you want to know how to build effective solutions to real cybersecurity problems using machine learning and AI today, this is the class you need to take. Check out the extensive course description below for a detailed rundown of course content.

NOTE: All the concepts in this course are discussed using Python examples. You should have an intermediate understanding of the Python language! There is no need to be a Python expert. If you have successfully written at least a handful of Python scripts, your Python knowledge is likely sufficient. We will review key Python data structures in class in the first section of the course. If you need assistance determining if your Python knowledge is sufficient, please contact us for more information.

This course is for cybersecurity professionals who are seeking to add machine learning, data science, and artificial intelligence skills to their repertoire. It is also very useful for individuals with a data science background who are seeking to understand how to use cybersecurity data in meaningful ways for threat hunting, anomaly detection, and monitoring. Intermediate Python fluency is important. Pre-calculus mathematics skills are helpful, but not required.

"The course content's design is superb in my opinion. It begins by covering the fundamentals of data extraction from diverse sources using Python, followed by a dive into the basics of statistics. From there, it delves into ML models and DNNs. I appreciate the thoughtfulness behind this progression." -Viswanath Chirravuri, Thales

What Is Machine Learning?

Machine Learning is a branch of artificial intelligence that enables systems to learn and improve from experience without being explicitly programmed. It involves the development of algorithms that can analyze and make predictions or decisions based on data. This technology is fundamental in creating applications that adapt and become more accurate over time, revolutionizing industries by automating complex tasks and unlocking new insights from data.

Business Takeaways

This course will help your organization:

  • Generate useful visualization dashboards
  • Solve problems with neural networks
  • Improve the effectiveness, efficiency, and success of cybersecurity initiatives
  • Build custom machine learning solutions for your organization's specific needs
  • Prepare staff for the GMLE certification

Skills Learned

  • Apply statistical models to real-world problems in meaningful ways
  • Generate visualizations of your data
  • Perform mathematics-based threat hunting on your network
  • Convert the data you have into representations to which ML/AI techniques can be applied
  • Understand and apply unsupervised learning/clustering methods
  • Build Deep Learning Neural Networks
  • Build and understand Convolutional Neural Networks
  • Understand how to build representative synthetic data
  • Understand and build Genetic Search Algorithms
  • Understand the fundamentals of containerized deployment

Major Topics Covered Include

  • Data acquisition from SQL, NoSQL document stores, web scraping, and other common sources
  • Data exploration and visualization
  • Descriptive statistics
  • Inferential statistics and probability
  • Bayesian inference
  • Unsupervised learning and clustering
  • Deep learning neural networks
  • Autoencoders
  • Anomaly detection with neural networks
  • Loss functions
  • Convolutional networks
  • Embedding layers
  • Practical containerized deployment

Hands-On Machine Learning Training

The hands-on portion of SEC595 is especially suited to students with a data science background who are seeking to understand how to use cybersecurity data in meaningful ways for threat hunting, anomaly detection, and monitoring. The course includes 30 hands-on labs, and over 70% of class time is spent solving machine learning and data science problems hands-on.

  • Section 1: Python Refresher; Accessing, Manipulating, and Retrieving SQL Data; Accessing, Manipulating, and Retrieving NoSQL data: MongoDB; Webscraping for data acquisition
  • Section 2: Statistics Fundamentals: Medians and Means; Statistics Fundamentals: Variance, Deviations, and Robust Measures; Applications of Statistics to Data Identification; Probability, Bayes, and Phishing; Threat Hunting through Signals Analysis
  • Section 3: K-Means/KNN; Elbow Functions and PCA; DBSCAN for Clustering; Support Vector Classifiers; Support Vector Machines; Decision Trees; Random Forests
  • Section 4: Polyfit Regressions; Hello, World! Sentiment Analysis; Ham vs. Spam via Deep Learning; Identifying Protocols; Protocol Anomaly Detection
  • Section 5: Predictive Malware Identification -- Finding Zero Days; Ham vs. Spam, CNN Style; Multi-class text classifications via CNNs; Log Anomaly Detection using Autoencoders; Real-time Network Anomalies
  • Section 6: Solving CAPTCHAs: POC; Solving CAPTCHAs: Functional API; Solving Algorithms

"Labs and exercises have been very helpful, going over them a second time is helping to reinforce what I've learned this week, and to put it all in better context." - Blake Hickson

"The labs gave me the opportunity to use theory that we were taught during the training and gain some hands on experience." - Vasiliki Politopoulou

"SANS SEC595 emphasizes practical, hands-on learning and allows participants to work with Python scripts and tools to automate various aspects of information security. This approach ensures that students can apply what they learn immediately in their work." - Louis Valencia, US Government

Syllabus Summary

  • Section 1: Data Acquisition, Cleaning, and Manipulation
  • Section 2: Data Exploration and Statistics
  • Section 3: Essentials of Machine Learning: Trees, Forests, & K-Means
  • Section 4: Essentials of Machine Learning: Deep Learning
  • Section 5: Essentials of Machine Learning: Autoencoders
  • Section 6: Essentials of Machine Learning: Functional Models and Deployment

Additional Free Resources

  • Anaconda
  • TensorFlow (and supporting libraries)
  • Matplotlib
  • VMware Workstation/Player/Fusion

What You Will Receive

  • Jupyter notebooks of all labs and complete solutions
  • Sample data for real-world cybersecurity problems

Syllabus (36 CPEs)

  • Section 1: Data Acquisition, Cleaning, and Manipulation

    This section introduces some of the terminology in the data science and machine learning fields, in addition to introducing a number of the technologies that are used as data sources. Since the first step in any data science or machine learning project is to acquire data, the balance of the day is focused on hands-on exercises to prepare the student for these tasks.

    The first necessary skill is the use of Python, our chosen language for this course. The only course prerequisite is a fundamental understanding of Python. If you've written even one line of Python, you are probably knowledgeable enough to get started! We will cover lists, arrays, tuples, dictionaries, and comprehensions, and then begin introducing the NumPy variants.

    Following the Python refresher, the course provides some theory followed immediately by hands-on exercises to give you just enough knowledge of SQL, MongoDB, and web scraping to get real work done.
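
As a minimal sketch of the kind of SQL data acquisition described above (not taken from the course labs), the following uses Python's built-in sqlite3 module with an in-memory database and a dictionary comprehension. The table name auth_log, its columns, the function name, and the sample rows are all hypothetical.

```python
import sqlite3


def failed_logins_by_user(rows):
    """Load (user, src_ip, success) rows into SQLite and count failures per user."""
    conn = sqlite3.connect(":memory:")  # throwaway in-memory database
    conn.execute("CREATE TABLE auth_log (user TEXT, src_ip TEXT, success INTEGER)")
    conn.executemany("INSERT INTO auth_log VALUES (?, ?, ?)", rows)
    cursor = conn.execute(
        "SELECT user, COUNT(*) FROM auth_log "
        "WHERE success = 0 GROUP BY user ORDER BY COUNT(*) DESC, user"
    )
    # Dictionary comprehension: turn result rows into {user: failure_count}
    counts = {user: n for user, n in cursor}
    conn.close()
    return counts


# Hypothetical sample data: (user, source IP, 1 = success / 0 = failure)
SAMPLE_ROWS = [
    ("alice", "10.0.0.5", 1),
    ("bob", "10.0.0.9", 0),
    ("bob", "10.0.0.9", 0),
    ("mallory", "203.0.113.7", 0),
    ("bob", "10.0.0.9", 1),
]

if __name__ == "__main__":
    print(failed_logins_by_user(SAMPLE_ROWS))
```

The same query pattern should carry over to a server-backed database by swapping the connection object; only the driver import and connection call would change.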

    Exercises
    • Python Refresher
    • Accessing, Manipulating, and Retrieving SQL data
    • Accessing, Manipulating, and Retrieving NoSQL data: MongoDB
    • Webscraping for data acquisition
    Topics
    • Data Science
    • Python
    • SQL
    • NoSQL
    • Webscraping
  • Section 2: Data Exploration and Statistics

    This section begins with the fundamentals of statistics that matter for data science and machine learning. Following this introduction and hands-on exercises that provide practical uses for these techniques against real-world data, the course transitions to probability theory.
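
To illustrate why robust measures matter for security data, here is a small sketch (an illustration, not course material) that compares the mean with the median and flags outliers using the median absolute deviation (MAD). The function name, the 3.5 cutoff, and the byte counts are assumptions chosen for the example; the 0.6745 constant is the usual scaling that makes the MAD comparable to a standard deviation for normally distributed data.

```python
import statistics


def mad_outliers(values, threshold=3.5):
    """Return the values whose MAD-based modified z-score exceeds the threshold."""
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    if mad == 0:
        return []  # no spread around the median, so nothing can be scored
    return [v for v in values if abs(0.6745 * (v - med) / mad) > threshold]


# Hypothetical bytes-per-connection values with one exfiltration-sized transfer
TRANSFER_SIZES = [500, 520, 480, 510, 495, 505, 50000]

if __name__ == "__main__":
    print("mean:  ", statistics.mean(TRANSFER_SIZES))    # dragged upward by the outlier
    print("median:", statistics.median(TRANSFER_SIZES))  # barely affected by it
    print("flagged:", mad_outliers(TRANSFER_SIZES))
```

A single extreme value moves the mean far from the typical transfer size, while the median and MAD stay anchored to the bulk of the data, which is what makes them useful for spotting that value.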

    Probability theory is an extensive field of its own. Following the introduction of some fundamentals, the course works directly toward deriving Bayes' theorem. Building on this introduction, students then engage in a hands-on lab that builds a useful Bayesian analysis tool, upon which students will improve later in the course.
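
As a hedged illustration of the idea (not the lab's actual tool), Bayes' theorem gives the probability that a message is phishing given that it contains a particular word: P(phish | word) = P(word | phish) P(phish) / P(word), where P(word) is the total probability of seeing the word in either class. The function name and all probabilities below are made-up values for the example.

```python
def posterior_phish(p_word_given_phish, p_word_given_ham, p_phish):
    """Apply Bayes' theorem to get P(phish | word) from the three input rates."""
    p_ham = 1.0 - p_phish
    # Law of total probability: how often the word appears in any message
    evidence = p_word_given_phish * p_phish + p_word_given_ham * p_ham
    if evidence == 0:
        raise ValueError("the word never occurs, so the posterior is undefined")
    return p_word_given_phish * p_phish / evidence


if __name__ == "__main__":
    # Assumed rates: the word appears in 60% of phishing and 5% of legitimate
    # mail, and 10% of all mail is phishing.
    print(posterior_phish(0.60, 0.05, 0.10))
```

With these assumed rates, a word that is twelve times more common in phishing than in legitimate mail lifts a 10% prior to a posterior of a little over one half, which shows how strongly the base rate tempers a single piece of evidence.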

    The remainder of this section translates the statistical knowledge gained into the field of signals analysis. After a discussion of the derivation and applications of the Fourier series, the Fast Fourier Transform, and the Discrete Fourier Transform, students use these tools in a real-world threat hunting activity.
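
One way signals analysis can support threat hunting is by exposing periodic "beaconing" in connection counts. The sketch below is an illustration under stated assumptions, not the course's lab: it uses a naive O(n^2) Discrete Fourier Transform written with the standard library for clarity (in practice a library FFT such as NumPy's numpy.fft would be the usual choice), and the function names, tie tolerance, and synthetic traffic are hypothetical.

```python
import cmath


def dft_magnitudes(samples):
    """Return {k: |X_k|} for frequency bins k = 1 .. n // 2 of a naive DFT."""
    n = len(samples)
    mean = sum(samples) / n
    centered = [s - mean for s in samples]  # remove the constant (DC) component
    magnitudes = {}
    for k in range(1, n // 2 + 1):
        total = sum(
            x * cmath.exp(-2j * cmath.pi * k * t / n)
            for t, x in enumerate(centered)
        )
        magnitudes[k] = abs(total)
    return magnitudes


def dominant_period(samples, tie_tolerance=0.99):
    """Estimate the strongest repeating period, measured in samples."""
    magnitudes = dft_magnitudes(samples)
    peak = max(magnitudes.values())
    # A spike train puts equal energy in its harmonics, so take the lowest
    # frequency bin that is (nearly) as strong as the peak.
    k = min(k for k, m in magnitudes.items() if m >= tie_tolerance * peak)
    return len(samples) / k


# Synthetic data: one outbound connection every 5 minutes for an hour
BEACON = [1 if minute % 5 == 0 else 0 for minute in range(60)]

if __name__ == "__main__":
    print("estimated beacon period (minutes):", dominant_period(BEACON))
```

Real traffic would add jitter and background noise, so the peak would be broader and a threshold on peak strength relative to the other bins would likely be needed before raising an alert.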

    Exercises
    • Statistics Fundamentals: Medians and Means
    • Statistics Fundamentals: Variance, Deviations, and Robust Measures
    • Applications of Statistics to Data Identification
    • Probability, Bayes, and Phishing
    • Threat Hunting through Signals Analysis
    Topics
    • Statistics
    • Robust Measures
    • Probability
    • Bayes' Theorem and Inference
    • Fourier Series and Related Derivations
  • Section 3: Essentials of Machine Learning: Trees, Forests, & K-Means

    The remaining 18+ contact hours of this course are spent learning about and immediately applying various machine learning models. After each topic is introduced and discussed, students engage in lengthy hands-on labs to develop an intuitive understanding and apply the technique to real problems.

    The section begins with various clustering approaches and unsupervised machine learning. The exploration begins with Support Vector Classifiers, kernel functions, and Support Vector Machines. Following this discussion and exercises, we continue the clustering theme by considering the K-Means and KNN approaches. After working through examples in just two or three dimensions, we turn our attention to methods for determining the ideal number of clusters. With this done, we finally explore high-dimensional applications and dimensionality reduction through Principal Component Analysis. The DBSCAN algorithm is covered in some depth, with application made to threat hunting and efficient SOC analysis of large-scale data.
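
The following sketch shows, under simplifying assumptions, how K-Means and the elbow method fit together: run the clustering for several values of k and watch where the within-cluster sum of squares (inertia) stops dropping sharply. It is a plain-Python illustration rather than the course's lab code; a library such as scikit-learn would normally be used. The deterministic farthest-point initialization, the function names, and the toy points are assumptions made so the example is reproducible.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def init_centroids(points, k):
    """Farthest-point initialization (assumed here; random init is more common)."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        )
    return centroids


def kmeans(points, k, max_iter=100):
    """Lloyd's algorithm; returns (centroids, inertia)."""
    centroids = init_centroids(points, k)
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: sq_dist(p, centroids[i]))
            clusters[nearest].append(p)
        new_centroids = [
            tuple(sum(dim) / len(cluster) for dim in zip(*cluster))
            if cluster else centroids[i]  # keep the old centroid if a cluster empties
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break  # assignments have stabilized
        centroids = new_centroids
    inertia = sum(min(sq_dist(p, c) for c in centroids) for p in points)
    return centroids, inertia


# Toy 2-D data with three visually obvious groups
POINTS = [
    (1, 1), (1, 2), (2, 1), (2, 2),
    (10, 10), (10, 11), (11, 10), (11, 11),
    (1, 10), (2, 11), (1, 11), (2, 10),
]

if __name__ == "__main__":
    for k in range(1, 5):
        print(k, kmeans(POINTS, k)[1])
```

On this toy data the inertia falls steeply up to k = 3 and only marginally afterward, which is the "elbow" that suggests three clusters.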

    The balance of this section is spent discussing Decision Trees. After a hands-on activity and discussion of the limitations of Decision Trees, we expand into Random Forests and explore hands-on how these provide better inferences in most cases. The section wraps up with a cluster-based approach to finding anomalies in user activity on a network.

    Exercises
    • K-Means / KNN
    • Elbow Functions and PCA
    • DBSCAN for Clustering
    • Support Vector Classifiers
    • Support Vector Machines
    • Decision Trees
    • Random Forests
    Topics
    • Support Vector Classifiers
    • Support Vector Machines
    • Kernel Functions
    • Principal Component Analysis
    • DBSCAN
    • K-Means
    • KNN
    • Elbow Functions
    • Decision Trees
    • Random Forests
    • Anomaly Detection
  • Section 4: Essentials of Machine Learning: Deep Learning

    The entire focus of this section is on the theory, development, and use of supervised learning approaches in the field of information security. Building on the mathematics and statistics covered in section 2, this section begins with linear regressions and ends with the application of deep learning neural networks to multi-class classification problems involving real-time network data.

    The material is focused on using supervised machine learning and mathematics to create predictive models. The initial discussion and exercises center around forecasting and trend analysis for anomaly detection. Following this, the majority of the material focuses on classification problems.
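
A simple way to picture trend-based anomaly detection is to fit a line to a series and flag the points that sit far from it. The sketch below is an assumption-laden illustration, not the course's polyfit lab: it hand-codes a degree-1 least-squares fit in plain Python, and the function names, the two-standard-deviation cutoff, and the daily counts are hypothetical.

```python
import statistics


def fit_line(ys):
    """Ordinary least-squares fit of y = intercept + slope * x for x = 0, 1, 2, ..."""
    n = len(ys)
    xs = range(n)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_mean - slope * x_mean, slope


def trend_anomalies(ys, threshold=2.0):
    """Return indices whose residual from the fitted trend exceeds threshold * spread."""
    intercept, slope = fit_line(ys)
    residuals = [y - (intercept + slope * x) for x, y in enumerate(ys)]
    spread = statistics.pstdev(residuals)
    if spread == 0:
        return []  # every point lies exactly on the line
    return [i for i, r in enumerate(residuals) if abs(r) > threshold * spread]


# Hypothetical daily alert counts: steady growth of 10 per day, with a spike on day 6
DAILY_COUNTS = [100, 110, 120, 130, 140, 150, 260, 170, 180, 190]

if __name__ == "__main__":
    print("anomalous days:", trend_anomalies(DAILY_COUNTS))
```

Because the spike itself pulls on the fitted line, a more robust variant might fit on a trailing window that excludes the point being tested; that refinement is left out here to keep the sketch short.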

    Building on the Bayes approach used in section 2, this section introduces deep learning neural networks and fully connected dense networks through the development of a far more accurate phishing detection network. Following this, the course explores visualization and measurement of neural network training performance, in addition to discussing overfitting, overtraining, and how to identify (and avoid!) them.
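
Dense networks are built from many simple units, and the perceptron listed in this section's topics is the simplest of them. The sketch below trains a single perceptron with the classic error-correction rule; it is an illustration only, far simpler than the dense phishing-detection network described above. The two binary features and their labels are hypothetical, chosen so that a message is flagged only when both indicators are present.

```python
def predict(weights, bias, features):
    """Output 1 when the weighted sum of the inputs exceeds zero, else 0."""
    activation = sum(w * x for w, x in zip(weights, features)) + bias
    return 1 if activation > 0 else 0


def train_perceptron(samples, labels, learning_rate=1.0, max_epochs=100):
    """Perceptron learning rule: nudge the weights toward each misclassified sample."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for features, label in zip(samples, labels):
            error = label - predict(weights, bias, features)
            if error:
                mistakes += 1
                weights = [
                    w + learning_rate * error * x
                    for w, x in zip(weights, features)
                ]
                bias += learning_rate * error
        if mistakes == 0:
            break  # a full pass with no errors: the data is separated
    return weights, bias


# Hypothetical features: (has_urgent_wording, has_mismatched_link)
SAMPLES = [(0, 0), (0, 1), (1, 0), (1, 1)]
LABELS = [0, 0, 0, 1]

if __name__ == "__main__":
    w, b = train_perceptron(SAMPLES, LABELS)
    print([predict(w, b, s) for s in SAMPLES])
```

A single perceptron can only learn linearly separable rules such as this one, which is one motivation for stacking units into the multi-layer dense networks this section goes on to build.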

    The next portion of this section turns to categorical problems, during which students will build a real-time network protocol classification system. More importantly, students will implement anomaly detection in this classification system, a task typically reserved for unsupervised approaches.

    Exercises
    • Polyfit Regressions
    • Hello, World! Sentiment Analysis
    • Ham vs. Spam via Deep Learning
    • Identifying Protocols
    • Protocol Anomaly Detection
    Topics
    • Regression and fitting
    • Loss and Error functions
    • Vectors, Matrices, and Tensors
    • Fundamentals of the Perceptron
    • Dense Networks
  • Section 5: Essentials of Machine Learning: Autoencoders

    This section of the course is dedicated to expanding students' knowledge of deep learning solutions. The first half of the section is focused entirely on convolutional networks (CNNs). The class explores the application of CNNs to text classification problems, but also to predictive identification of zero-day malware.

    The second half of this section of the course focuses on autoencoders. The class examines what autoencoders do, why they work, how to select a latent representation, and how reconstruction loss functions work. This knowledge is then applied to creating an automatic log anomaly detection solution that does not use any signatures or human intervention to identify anomalies. Building on this, students work on the building blocks for a large-scale ensemble autoencoder for detecting network threats.

    Exercises
    • Predictive Malware Identification - Finding Zero Days
    • Ham vs. Spam, CNN Style
    • Multi-class text classification via CNNs
    • Log Anomaly Detection using Autoencoders
    • Real-time Network Anomalies
    Topics
    • Convolutional Neural Networks
    • Embedding Layers
    • Applying CNNs to text problems
    • Autoencoders
    • Reconstruction loss measurements
    • Creating ensemble autoencoders
  • Section 6: Essentials of Machine Learning: Functional Models and Deployment

    The final section of this course continues discussing Convolutional Neural Networks and the application of CNNs and fully connected networks for solving regression problems. The major focus of this section is on the creation of a deep neural network using TensorFlow's functional pattern, allowing you to build networks with complex structures, multiple inputs, and multiple outputs. The main task used to learn about these techniques will be using neural networks for both testing the quality of and solving CAPTCHAs. Whether you are on a red, blue, or purple team, you will learn how to think through and use machine learning to solve what amounts to a computer vision problem, and to solve it with greater than 95% accuracy! Along the way you will also learn the key concepts behind the creation of representative synthetic data, how to build synthetic data with generators, and how things can go wrong. You will also learn how to make use of data augmentation layers.

    Following this project, the class covers the use of genetic techniques for hyperparameter optimization. Students are provided with a starting point for genetic optimization for use on their own after class.
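
To make the idea of genetic hyperparameter optimization concrete, here is a minimal sketch under stated assumptions; it is not the starting point provided in class. The search space, the toy_score function (a stand-in for training a model and measuring validation performance), and all function names are hypothetical. The loop uses truncation selection, uniform crossover, Gaussian and integer mutation, and elitism.

```python
import random

# Hypothetical search space: log10 of the learning rate and a hidden-unit count
BOUNDS = {"log10_lr": (-5.0, -1.0), "units": (8, 256)}


def toy_score(genome):
    """Stand-in fitness that peaks at lr = 1e-3 and 64 units.
    A real run would train and evaluate a model here."""
    log_lr, units = genome
    return -((log_lr + 3.0) ** 2) - ((units - 64) / 64.0) ** 2


def random_genome(rng):
    return (rng.uniform(*BOUNDS["log10_lr"]), rng.randint(*BOUNDS["units"]))


def mutate(genome, rng):
    """Perturb both genes slightly, clamping each to its bounds."""
    log_lr, units = genome
    lo, hi = BOUNDS["log10_lr"]
    log_lr = min(max(log_lr + rng.gauss(0, 0.3), lo), hi)
    lo, hi = BOUNDS["units"]
    units = min(max(units + rng.randint(-16, 16), lo), hi)
    return (log_lr, units)


def crossover(a, b, rng):
    """Uniform crossover: each gene is copied from one parent at random."""
    return tuple(rng.choice(pair) for pair in zip(a, b))


def evolve(fitness, generations=30, pop_size=20, seed=0):
    """Return (best genome found, best genome of the initial population)."""
    rng = random.Random(seed)
    population = [random_genome(rng) for _ in range(pop_size)]
    initial_best = max(population, key=fitness)
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        parents = ranked[: pop_size // 2]  # truncation selection
        children = [ranked[0]]             # elitism: the best genome survives unchanged
        while len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            children.append(mutate(crossover(a, b, rng), rng))
        population = children
    return max(population, key=fitness), initial_best


if __name__ == "__main__":
    best, start = evolve(toy_score)
    print("initial best:", start, toy_score(start))
    print("evolved best:", best, toy_score(best))
```

Elitism guarantees the best score never gets worse from one generation to the next, which matters when each fitness evaluation is as expensive as a full training run.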

    The final discussion and demonstration in the course covers practical deployment approaches, including stand-alone deployments for time-critical, real-time applications and, for less time-critical applications, the more common containerized approaches that can be used with Docker, Rancher, or Kubernetes.

    Exercises
    • Solving CAPTCHAs: POC
    • Solving CAPTCHAs: Functional API
    • Solving CAPTCHAs: Split model
    Topics
    • Convolutional Neural Networks and Regressions
    • Functional definition of Neural Networks
    • Deep Learning Networks with Multiple Outputs
    • Thinking about Machine Learning Problems
    • Genetic Algorithms
    • Deployment using Containers

GIAC Machine Learning Engineer

The GIAC Machine Learning Engineer (GMLE) certification validates a practitioner’s knowledge of practical data science, statistics, probability, and machine learning. GMLE certification holders have demonstrated that they are qualified to solve real-world cybersecurity problems using machine learning.

  • Anomaly detection and optimization
  • Convolutional neural networks
  • Data acquisition
  • Data exploration and visualization
  • Data manipulation and analysis
  • Deep learning neural networks
  • Inferential statistics and probability
  • Loss functions
  • Probability and inference
  • Python scripting
  • Supervised and unsupervised learning

Laptop Requirements

Important! Bring your own system configured according to these instructions!

A properly configured system is required to fully participate in this course. If you do not carefully read and follow these instructions, you will likely leave the class unsatisfied because you will not be able to participate in hands-on exercises that are essential to this course. Therefore, we strongly urge you to arrive with a system meeting all the requirements specified for the course.

It is critical that you back up your system before class. It is also strongly advised that you do not bring a system storing any sensitive data. Your system should meet these requirements:

  • Modern 64-bit processor (ARM/AMD/Intel) running Linux (Ubuntu or similar recommended, Linux kernel version 6 or higher), Windows 10 or later, or macOS 11.x or later
  • A minimum of 16 GB RAM
  • 80 GB Free Hard Drive Space
  • Your account must have the necessary rights to install Anaconda or Anaconda must be preinstalled.

Your course media will be delivered via download. The media file for class is large, more than 50GB. You need to allow plenty of time for the download to complete. Internet connections and speed vary greatly and are dependent on many different factors. Therefore, it is not possible to give an estimate of the length of time it will take to download your materials. Please start your course media downloads as soon as you get the link. You will need your course media immediately on the first day of class. Waiting until the night before the class starts to begin your download has a high probability of failure.

SANS has begun providing printed materials in PDF form. Additionally, certain classes are using an electronic workbook in addition to the PDFs. The number of classes using eWorkbooks will grow quickly. In this new environment, we have found that a second monitor and/or a tablet device can be useful by keeping the class materials visible while the instructor is presenting or while you are working on lab exercises.

If you have additional questions about the laptop specifications, please contact support.

Author Statement

"AI and Machine Learning are everywhere. How do the vendor solutions work? Is this really black magic? I wrote this course to fill an enormous knowledge gap in our field. I believe that if you are going to use a tool, you should understand how that tool works. If you don't, you don't really know what the results mean or why you are getting them. This course provides you a crash-course in statistics, mathematics, Python, and machine learning, taking you from zero to...I'm reluctant to promise 'Hero...' Let's say competent-person-who-can-solve-real-problems-today!"

- David Hoelzer

"I can think of no one else who could explain the material better. His deep understanding of the technology and his ability to present it in such a way that allowed those not as proficient to understand it was great." - Thomas L, US Military

Reviews

"I really like that this is pulling from experience rather than a textbook. The added anecdotes about the history behind various topics really helped pull it together for me." - Brian Morris, City of Austin

"AI/ML for cybersecurity is poorly understood & misrepresented too often. This course provides that balance between what management needs to know in order to grow understanding of the technologies and hands-on experience." - Thomas L, US Military

"This course covers a wide breadth with great depth. I am excited to apply everything after the course." - Denise Berger, MITRE
