
Some issues about reproducing the project #7

Open
HideLakitu opened this issue Oct 12, 2024 · 4 comments

Comments


HideLakitu commented Oct 12, 2024

Hello,
I recently wanted to run a realistic experiment and came across this repo; it looks appealing for its lightness: a small codebase and a small dataset, so the pipeline seems easy to follow. I plan to run FLAD on Google Colab with Drive, since I'm on Windows rather than Linux (I've only practiced a few basic commands in a virtual machine before).

So, about the traffic pre-processing part: you give the instruction python3 lucid_dataset_parser.py --dataset_type DOS2019 --dataset_folder /path_to/dataset_folder/ --packets_per_flow 10 --dataset_id DOS2019 --traffic_type all --time_window 10, which means I should also download the previously released lucid repo, right? Then use the generated files to train with FLAD's functions.

I also wonder whether it is possible to run everything on Colab rather than locally.

@doriguzzi
Owner

Hi,
yes, you need the LUCID parser to convert the traffic traces of the dataset into traffic samples. Once that is done, you won't need LUCID's code anymore.
I think both LUCID and FLAD can be executed on Colab, but I've never tried it myself.

All the best,
Roberto

@HideLakitu
Author

HideLakitu commented Nov 7, 2024


Thanks for the reply! I did manage to start training with those subfolders, but now I'm stuck on preprocessing the dataset. When I try to use my own customized data, I run python3 lucid_dataset_parser.py --dataset_type DOS2019 --dataset_folder /path_to/dataset_folder/ --packets_per_flow 10 --dataset_id DOS2019 --traffic_type all --time_window 10 and then python3 lucid_dataset_parser.py --preprocess_folder /path_to/dataset_folder/, but only a handful of samples end up in the train, val and test .hdf5 files; in other words, there is a significant mismatch between the scale of the input and the output.

Specifically, after downloading part of CIC-DDoS2019, the raw files are named like SAT-03-11-2018_03, so I added the .pcap suffix, opened them in Wireshark, and exported a portion of the data.

For example, below is the output of preprocessing 80,000 randomly chosen packets; you can see that only a single-digit number of samples was generated: Train/Val/Test sizes: (9,1,2). Meanwhile the intermediate .data file seems normal, since its size is proportional to the original traffic data.

So how should I deal with this? Which part of lucid_dataset_parser.py should I modify? In normal cases I'd expect 80,000 packets to yield at least ~2,000 training samples (just an estimate).

[Screenshots: Colab notebook run (11-18-2024) and current preprocessing output]

@doriguzzi
Owner

Hi,
if you take a look at the output of the command python3 lucid_dataset_parser.py --dataset_type DOS2019 --dataset_folder /path_to/dataset_folder/ --packets_per_flow 10 --dataset_id DOS2019 --traffic_type all --time_window 10, you will notice that there are only 6 benign flows. Therefore, when you execute the second step, the balancing method reduces the number of DDoS flows from 38837 to 6 to create a balanced dataset.
You can see that the number of flows goes from (tot,ben,ddos) = (38843,6,38837) to (12,6,6).
To solve this issue, I suggest adding some pcaps with benign traffic in the same folder and restarting the whole process from scratch.
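The balancing step described above can be sketched as follows (a minimal illustration with a hypothetical balance_flows function, not LUCID's actual code):

```python
import random

def balance_flows(benign, ddos, seed=0):
    """Sketch of the balancing behaviour: the majority class is downsampled
    to the size of the minority class (illustration only, not LUCID's code)."""
    n = min(len(benign), len(ddos))
    rng = random.Random(seed)
    return rng.sample(benign, n) + rng.sample(ddos, n)

# With 6 benign and 38837 DDoS flows, only 12 flows survive in total,
# which is why the train/val/test .hdf5 files end up nearly empty.
balanced = balance_flows(["benign"] * 6, ["ddos"] * 38837)
print(len(balanced))  # 12
```

This is why adding benign pcaps (raising the minority-class count) is the fix, rather than changing the parser itself.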

@HideLakitu
Author

HideLakitu commented Dec 13, 2024


Thanks for the reply! The issue is solved: it was indeed caused by having almost no benign traffic in the pcap file. Preprocessing no longer yields only a handful of training samples, and the out-of-bounds error (AxisError) is gone too. BTW, I still can't quite figure out why that bug appears when the traffic consists almost entirely of DDoS packets.

I feel a little embarrassed to ask again, but in the segments of CIC-DDoS2019 I checked (some SAT-03-11-2018_023-xx files) there is very little benign traffic: only a few hundred benign flows out of hundreds of thousands per file. So can I just capture some random traffic in pcap format on my own? Traffic I capture myself in real time should be almost entirely benign, and I could then merge it with a file full of attack traffic.
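For the record, merging a self-captured benign trace with an attack trace is usually done with Wireshark's mergecap tool (mergecap -w merged.pcap benign.pcap attack.pcap). As an illustration of what such a merge involves, here is a minimal sketch for classic little-endian pcap files, using only the Python standard library; read_pcap and merge_pcaps are hypothetical helpers, not part of LUCID or FLAD:

```python
import struct

GHDR = struct.Struct("<IHHiIII")  # magic, ver_major, ver_minor, thiszone, sigfigs, snaplen, linktype
RHDR = struct.Struct("<IIII")     # ts_sec, ts_usec, incl_len, orig_len

def read_pcap(path):
    """Read a classic little-endian microsecond pcap.
    Returns (global header bytes, list of (ts_sec, ts_usec, raw record bytes))."""
    with open(path, "rb") as f:
        ghdr = f.read(GHDR.size)
        if struct.unpack("<I", ghdr[:4])[0] != 0xA1B2C3D4:
            raise ValueError("not a little-endian microsecond pcap")
        records = []
        while True:
            rhdr = f.read(RHDR.size)
            if len(rhdr) < RHDR.size:
                break
            ts_sec, ts_usec, incl_len, _ = RHDR.unpack(rhdr)
            records.append((ts_sec, ts_usec, rhdr + f.read(incl_len)))
    return ghdr, records

def merge_pcaps(out_path, *in_paths):
    """Concatenate the packets of several pcaps, sorted by timestamp
    (mergecap's default behaviour)."""
    ghdr, merged = None, []
    for path in in_paths:
        hdr, records = read_pcap(path)
        ghdr = ghdr or hdr
        merged.extend(records)
    merged.sort(key=lambda r: (r[0], r[1]))
    with open(out_path, "wb") as f:
        f.write(ghdr)
        for _, _, raw in merged:
            f.write(raw)
```

In practice mergecap is the safer choice, since it also handles pcapng and mixed link types.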

And if this makes sense, should I rewrite the source and destination addresses to the specific IP addresses you mention in LUCID?

DOS2019_FLOWS = {'attackers': ['172.16.0.5'], 'victims': ['192.168.50.1', '192.168.50.4']}
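Assuming LUCID labels a flow as DDoS when its endpoints match the attacker and victim lists above (an assumption about the parser's logic, not verified against its code), traffic you capture yourself would already be labelled benign without rewriting any addresses. A sketch of such a rule:

```python
# Endpoint lists from LUCID's parser for the DOS2019 dataset:
DOS2019_FLOWS = {'attackers': ['172.16.0.5'], 'victims': ['192.168.50.1', '192.168.50.4']}

def label_flow(src_ip, dst_ip, flows=DOS2019_FLOWS):
    """Hypothetical labelling rule (an assumption, not LUCID's verified code):
    a flow is DDoS when it runs between a known attacker and a known victim."""
    endpoints = {src_ip, dst_ip}
    if endpoints & set(flows['attackers']) and endpoints & set(flows['victims']):
        return 1  # ddos
    return 0      # benign

print(label_flow('172.16.0.5', '192.168.50.1'))  # 1: attacker -> victim
print(label_flow('10.0.0.2', '8.8.8.8'))         # 0: self-captured traffic stays benign
```

Under this assumption, rewriting the IPs in your own capture would only be needed if you wanted those flows labelled as attack traffic.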

Here are two screenshots: the first shows the benign file from the sample-dataset in FLAD; the second is my own capture, made by just pressing the capture button in Wireshark (on the WLAN interface) and exporting as .pcap. I didn't apply any filters (protocol, length limits, etc.).

[Screenshots: benign file from FLAD's sample-dataset; own Wireshark capture]
