Machine Learning for Cyber Security: 01 Introduction

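Terminology

What is a machine learning algorithm for cyber security?

There is no algorithm specific to cyber security. Machine learning is a toolkit of algorithms that can be applied in many domains; the important difference between domains lies in the dataset.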

Data that comes from cyber security

Packets
Network packets
Log files
The files themselves, etc.

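Different kinds of information have different formats and sources, so they need to be converted into one specific form: the vector space model.

Once the content has been converted into a vector space model, machine learning can be applied to it.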
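
As a rough illustration of that conversion, the sketch below turns one connection record into a numeric vector. The field names, the category vocabularies, and the `to_vector` helper are assumptions made for this example; they are not defined in the course material.

```python
# Hypothetical categorical vocabularies; a real dataset defines its own.
PROTOCOLS = ["tcp", "udp", "icmp"]
SERVICES = ["http", "ftp", "smtp"]


def one_hot(value, vocabulary):
    """Return a list with 1.0 at the position of `value` and 0.0 elsewhere."""
    return [1.0 if value == item else 0.0 for item in vocabulary]


def to_vector(record):
    """Map a connection record (a dict) to a flat list of floats."""
    vector = [float(record["duration"])]
    vector += one_hot(record["protocol"], PROTOCOLS)
    vector += one_hot(record["service"], SERVICES)
    vector += [float(record["src_bytes"]), float(record["dst_bytes"])]
    return vector


record = {"duration": 0, "protocol": "tcp", "service": "http",
          "src_bytes": 162, "dst_bytes": 4528}
vector = to_vector(record)
print(vector)
# [0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 162.0, 4528.0]
```

Categorical fields become one-hot positions and numeric fields are copied through, so every record ends up as a point with the same number of dimensions.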

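This course focuses on modeling this information rather than on the machine learning algorithms themselves, because you will probably reuse almost the same code while running into different datasets.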

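Tools

WEKA: a Java-based GUI environment for machine learning
Python
- NumPy and Pandas for data preprocessing, data cleaning, and so on
- scikit-learn (sklearn) for traditional machine learning
- TensorFlow for deep learning
Hadoop: clusters
Spark: distributes data across systems
AWS
GPUs, CPUs, TPUs, cognitive processors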

What is machine learning

Methods for predicting, detecting, or grouping data samples based on a model
The model must be learned with data
Methods can be geometrical (or not); geometrical models are based on distance metrics or linear boundaries

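Machine learning is really a kind of geometry. When we think about the vector space model, we are usually also thinking about points, and we construct curves from those points.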
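
The geometric view can be made concrete with a distance metric. The sketch below is a minimal illustration with made-up points, not a method taken from the course: it computes the Euclidean distance between vectors and finds the stored point nearest to a query.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Made-up 2-D points standing in for vectorized samples.
points = {"p1": [0.0, 0.0], "p2": [3.0, 4.0], "p3": [10.0, 10.0]}
query = [2.0, 3.0]

# The nearest stored point is the one with the smallest distance to the query.
nearest = min(points, key=lambda name: euclidean(points[name], query))
print(euclidean(points["p1"], points["p2"]))  # 5.0
print(nearest)  # p2
```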

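Why apply machine learning to cyber security

Cyber security usually relies on domain-specific expertise, and it can take one person a very long time to analyze the data.

Too much data
Building models by hand is labor intensive
Machine learning can also learn models, which helps you understand the patterns in the data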

What is big data

Lots of data
- Terabytes
So much data that a single computer with 8 GB of RAM and the latest CPU cannot do the work
Instead, a more powerful computer is needed
Better yet, several computers working in parallel
Two main approaches:
- Parallel CPUs
- GPUs (1 or several also in parallel)

Machine learning terminology

Supervised learning

There are samples and labels, for example network traffic:

Data	Labels
[network]	[0]	Bad
[network]	[1]	Good

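The dataset is split into a training set and a test set. A classifier is chosen and fitted on the training set to build the model. The model is then applied to the test set to produce the Y-predicted values, which are compared with the true test labels (Y-test).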
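
The split, fit, predict, and compare steps can be sketched as follows. This is only an illustration under stated assumptions: the data is made up, and the nearest-centroid classifier is a dependency-free stand-in chosen for this example, where a library classifier (for instance from scikit-learn) would normally be used.

```python
def fit_centroids(X_train, y_train):
    """Learn one mean vector (centroid) per label from the training set."""
    sums, counts = {}, {}
    for x, label in zip(X_train, y_train):
        acc = sums.setdefault(label, [0.0] * len(x))
        for i, value in enumerate(x):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}


def predict(centroids, X):
    """Assign each sample the label of its closest centroid."""
    def sq_dist(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))
    return [min(centroids, key=lambda lab: sq_dist(centroids[lab], x)) for x in X]


# Made-up feature vectors; label 0 = bad traffic, 1 = good traffic.
X = [[9.0, 8.0], [8.0, 9.0], [9.5, 9.0], [1.0, 2.0], [2.0, 1.0], [1.5, 1.0]]
y = [0, 0, 0, 1, 1, 1]

# Hold out one sample of each class as the test set.
X_train, y_train = X[:2] + X[3:5], y[:2] + y[3:5]
X_test, y_test = [X[2], X[5]], [y[2], y[5]]

model = fit_centroids(X_train, y_train)
y_predicted = predict(model, X_test)

# Compare the predictions with the true test labels.
accuracy = sum(p == t for p, t in zip(y_predicted, y_test)) / len(y_test)
print(y_predicted, accuracy)  # [0, 1] 1.0
```

The test samples are never seen during fitting, which is what makes the comparison between Y-predicted and Y-test meaningful.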

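Unsupervised learning

There are no labels, so we do not know whether a given sample is good or bad. The relationship between samples and labels therefore cannot be learned, but groupings can be learned (that is, clustering).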
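
Learning groups without labels can be sketched with k-means, one common clustering method. The notes do not name a specific clustering algorithm, so k-means, the made-up points, and the simplified seeding (the first k points) are all assumptions of this example.

```python
def kmeans(points, k, iterations=10):
    """Plain k-means; the first k points seed the centroids (a simplification)."""
    centroids = [list(p) for p in points[:k]]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its closest centroid.
        for i, p in enumerate(points):
            assignment[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids


# Made-up unlabeled samples forming two visible groups.
points = [[1.0, 1.0], [9.0, 9.0], [1.2, 0.8], [8.8, 9.1], [0.9, 1.1], [9.2, 8.9]]
assignment, centroids = kmeans(points, k=2)
print(assignment)  # [0, 1, 0, 1, 0, 1]
```

The cluster numbers are arbitrary identifiers: the algorithm finds the groups but cannot say which group is good and which is bad.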

Other terms

Features
Data sets
Data pre-processing
Performance metrics
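
As an example of performance metrics, the sketch below computes accuracy, precision, and recall from true and predicted labels. The labels are made up, `binary_metrics` is a hypothetical helper, and treating label 0 (bad traffic) as the positive class is an assumption of this example.

```python
def binary_metrics(y_true, y_pred, positive):
    """Return accuracy, precision, and recall for the given positive label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # true positives
    fp = sum(t != positive and p == positive for t, p in pairs)  # false positives
    fn = sum(t == positive and p != positive for t, p in pairs)  # false negatives
    correct = sum(t == p for t, p in pairs)
    return {
        "accuracy": correct / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


# Made-up labels: 0 = bad traffic (the class we want to detect), 1 = good.
y_test = [0, 0, 0, 1, 1, 1, 1, 1]
y_predicted = [0, 0, 1, 1, 1, 1, 0, 1]
metrics = binary_metrics(y_test, y_predicted, positive=0)
print(metrics)
```

Here 6 of 8 predictions are correct (accuracy 0.75), while precision and recall for the bad class are both 2/3, which shows why accuracy alone can hide errors on the class of interest.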

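Understand how to obtain the raw data, convert it into a vector space model, and then feed it to a machine learning model. A large share of the time goes into finding features.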

Common machine learning algorithms

Naïve Bayes
Decision trees
Random forest
K-nearest neighbors (KNN)
Linear regression
Logistic regression
Neural networks
Support Vector Machines
Deep neural networks

What is deep learning

Neural networks with more layers between the input and output layers
Batch processing for big data
The matrix multiplication operation takes advantage of GPUs
Have outperformed other approaches on many tasks since around 2012
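
To show why matrix multiplication is the core operation, the sketch below pushes a small batch through two layers in plain Python. The weights and inputs are made up, biases are omitted for brevity, and ReLU is an assumed choice of activation; a real framework such as TensorFlow would run the same products on a GPU.

```python
def matmul(A, B):
    """Multiply an (n x m) matrix by an (m x p) matrix given as nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def relu(M):
    """Element-wise max(0, x)."""
    return [[max(0.0, v) for v in row] for row in M]


# A batch of two made-up input vectors with three features each.
batch = [[1.0, 2.0, 3.0],
         [0.0, -1.0, 1.0]]
# Made-up weights: 3 inputs -> 2 hidden units -> 1 output.
W1 = [[1.0, -1.0],
      [0.0, 1.0],
      [1.0, 0.0]]
W2 = [[1.0],
      [2.0]]

hidden = relu(matmul(batch, W1))  # one matrix product per layer
output = matmul(hidden, W2)
print(output)  # [[6.0], [1.0]]
```

The whole batch passes through each layer as a single matrix product, which is the kind of work GPUs parallelize well.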

Basic machine learning workflow

Dataset formats

.csv
- 0,tcp,http,SF,162,4528,0,0,0,1, … ,normal.
.libsvm
- [label] [index 1]:[value 1] [index 2]:[value 2] …
.arff (used by Weka)
- Weka's data storage format
- @attribute duration numeric
  @attribute protocol_type {tcp,udp,icmp}
  …
  @data
- 0,tcp,http,SF,162,4528,0,0,0,1, … ,normal
etc…
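
As a sketch of how these formats relate, the snippet below writes one numeric sample in the libsvm layout shown above. The `to_libsvm` helper and the sample values are assumptions for illustration; categorical fields such as `tcp` would first need a numeric encoding.

```python
def to_libsvm(label, features):
    """Format one sample as '<label> <index>:<value> ...' with 1-based indices.

    Zero-valued features are skipped, which the sparse format permits.
    """
    parts = [str(label)]
    for index, value in enumerate(features, start=1):
        if value != 0:
            parts.append(f"{index}:{value}")
    return " ".join(parts)


# Made-up, already numeric sample.
line = to_libsvm(1, [0, 162, 4528, 0, 0.5])
print(line)  # 1 2:162 3:4528 5:0.5
```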

Datasets

NSL-KDD network intrusion: intrusion detection (IDS)
UNSW big data networking (IoT)
Iris: the standard Iris dataset
Phishing: phishing dataset
Honeypot unsupervised
Denial of service
Malware: malware dataset
Ransomware
Biometrics