Add code examples #52
Could an example (or a few) be added for those new to programming for GPUs? I've done a bit of OpenCL work adding an algorithm to the Hashcat project, but that allowed more freedom compared to how ArrayFire seems to restrict you to working with arrays.

```rust
// FNV-1a (32-bit) hashing algorithm
fn main() {
    let key = [0x68, 0x65, 0x6c, 0x6c, 0x6f]; // "hello"
    let mut hash: u32 = 2166136261; // Hash init value defined by the algorithm
    for byte in key.iter() {
        hash ^= *byte as u32;
        // wrapping_mul: the multiply is expected to overflow u32
        hash = hash.wrapping_mul(16777619); // Another magic number defined by the algorithm
    }
    println!("Hash is: {:x}", hash); // 4f9f2cab
}
```

Really simple code: take a string key as bytes, then iteratively XOR each byte into the hash and multiply by a fixed value. With a large set of keys, how do you approach this in ArrayFire? Do you treat each key as a 1D array (column or row?) with rows of keys? If so, does that mean I can only work with fixed-length string keys? This example makes me think I should have an array for each letter position (5 in this case, plus 1 for the hash)?

I'm also curious about optimizing/setting up the array and data right; no idea if the 3rd/4th dimensions are helpful for this, or what row/column sizes are optimal for performance. My actual algorithm does about 32 billion hashes a second with OpenCL and Hashcat; I want to see how ArrayFire compares. I have a slightly more complicated hash algorithm to implement as well. I can share C/OpenCL/Rust implementations I've done, if it'd be worthwhile to show how a fairly straightforward algorithm with loops/conditionals/arithmetic diffs/ports to ArrayFire.
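An editorial sketch of one possible answer, in plain CPU-side Rust (no ArrayFire calls; the function name is illustrative): with fixed-length keys you can keep one array per byte position and update every key's hash in lockstep. That lockstep update over a whole batch is the data-parallel shape that maps onto an ArrayFire row/column operation.

```rust
// Data-parallel formulation of FNV-1a over a batch of fixed-length keys:
// columns[i][k] holds byte i of key k, so each outer step updates every
// key's hash at once (what a GPU would do across threads).
fn batched_fnv1a32(columns: &[Vec<u8>]) -> Vec<u32> {
    let num_keys = columns[0].len();
    let mut hashes = vec![2166136261u32; num_keys]; // FNV offset basis
    for col in columns {
        for (h, &byte) in hashes.iter_mut().zip(col) {
            *h ^= byte as u32;
            *h = h.wrapping_mul(16777619); // FNV prime, wraps on overflow
        }
    }
    hashes
}

fn main() {
    // Two 5-byte keys, "hello" and "world", laid out column-wise
    let keys: [&[u8; 5]; 2] = [b"hello", b"world"];
    let columns: Vec<Vec<u8>> = (0..5)
        .map(|i| keys.iter().map(|k| k[i]).collect())
        .collect();
    let hashes = batched_fnv1a32(&columns);
    println!("{:x} {:x}", hashes[0], hashes[1]); // "hello" hashes to 4f9f2cab
}
```

This only answers the layout question for a fixed key length; variable lengths would need padding or per-length batches.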
I've had a try at porting the simple algorithm; it wasn't too difficult with the one fixed key. I then tried to add some dynamic strings and ran into memory issues somewhere between 625 and 3125 elements spread over 5 arrays. Code and discussion can be found on this Rust reddit post. Any tips on what I'm doing wrong and could try? Or would ArrayFire not be the appropriate choice for this type of work on the GPU? I'm thinking I need to do the permutations via ArrayFire in GPU memory with some batching/looping?
I think the variable
A small clarification: I was referring to the variable `hashes` from the code.
```rust
let dims = Dim4::new(&[1, v0.len() as u64, 1, 1]);
// FNV32-1a defined starting hash value
let hash: [u32; 1] = [2166136261];
// Not sure, but I think I need to fill an Array of the same shape for these two to compute with
let mut hashes = Array::new(&hash, dims);
```

From what I've read, I thought I could only pass in Arrays/slices, not vectors? `dims` is defined like in any other example; I'm referencing the length of one vector since each vector has the same length.
Good catch :) I actually meant to populate the entire array with that value, but forgot to update it. Bit odd that it was happy to take 625-length slices everywhere else and not throw an error despite the 1 element here? I'll fix this and get back to you. Can you advise if it's a bad idea to generate the permutations of string keys on the host side? Should I be handling this in ArrayFire instead somehow? 36^4, for example, is roughly 1.7 million bytes. I'm wanting to target around 36^10 and beyond.
@polarathene My bad, what I meant was that the dims seem to indicate that the data is of vector shape (not Rust's vector type) - I have corrected the earlier message to reflect this. The data pointed to by the slices is forwarded to the C API as a pointer. So if you say the dims are of such-and-such length but actually pass in a slice that has a single element, it is going to cause a segmentation fault. If you intend to fill an Array of a given shape with the same value, use the `constant` function.

The common rule of thumb is to keep the data on the GPU as much as possible, to reduce the cost of data transfers between the host and the GPU. So, if you can generate the permutations on the GPU, that will eliminate the cost of the initial transfer of permutations to GPU memory. ArrayFire doesn't have any native functions that can permute data as of now. You may be able to implement it on top of ArrayFire, although I can't tell you how that may be done because I am not aware of the algorithms required to do so.
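A hypothetical plain-Rust sketch (the function name is mine, not an ArrayFire API) of why permutation generation can move to the GPU: the k-th key of the keyspace is just k written in base `charset.len()`, so every key is computed independently from its integer index with div/mod arithmetic - exactly the kind of element-wise operation that could be run over a whole array of indices on the device instead of transferring host-generated permutations.

```rust
// The k-th permutation of `charset` over `len` positions is k written in
// base charset.len(), most significant digit first. Each key depends only
// on its index, so keys can be generated independently (and in parallel).
fn nth_permutation(charset: &[u8], len: usize, mut index: u64) -> Vec<u8> {
    let radix = charset.len() as u64;
    let mut key = vec![0u8; len];
    for pos in (0..len).rev() {
        key[pos] = charset[(index % radix) as usize];
        index /= radix;
    }
    key
}

fn main() {
    let charset = b"abcd";
    // Keyspace of 4^5 = 1024 five-character keys, generated on demand:
    assert_eq!(nth_permutation(charset, 5, 0), b"aaaaa".to_vec());
    assert_eq!(nth_permutation(charset, 5, 1), b"aaaab".to_vec());
    assert_eq!(nth_permutation(charset, 5, 1023), b"ddddd".to_vec());
}
```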
@9prady9 Is the approach with an array for each character in the string correct? Or should I be doing that all in one array somehow? The string data is just bytes/numbers; I should be able to iterate through those values with some loops, assuming I can tell AF to do so with a Rust loop?

How should I handle the hash comparison? After the calculations I'd want to compare all the computed values against another collection/array of hashes to look for matches. Should I have a 2-dimensional array filled with constants? E.g. 625 elements would be 625 constant-filled hash values * the number of hashes, then iterate through those to get boolean (0/1) results.

The last problem was how to return that data from AF to Rust. Regardless of whether I can filter it out, AF expects me to provide a fixed-size array for it to store the data in? In this example, do I need to manually create 625 elements to make the compiler happy?

If there are good examples already in the wild using Rust that show how these problems are tackled, I'd be happy to look through them. The example algorithm is very simple without AF; I'd love to know how to properly use it with AF at scale.
Having a single array per character is probably not an efficient way to do it. I would think of a way to handle a bunch of chars in one single array.

I didn't understand your confusion about loops. ArrayFire runs computation on GPUs, which work on data in parallel, thereby avoiding the need to iterate in loops. Maybe if you show a small code sample that illustrates what you are trying to explain, I can understand better.

The Rust wrapper has functions such as `eq` that let you compare data in two different Arrays.

I am a bit confused as to what you mean by ArrayFire expecting a fixed-size array - that is not the case, unless I'm misunderstanding what you are trying to explain. Sure, we are working on adding more examples.
TL;DR: See the first code example in Rust - how do I convert this simple algorithm to work in ArrayFire with 32 billion calculations? (That is the limit of the algorithm I'm actually trying to port, Bob Jenkins' hash, aka
I might not have been clear with what I'm trying to communicate. I'm exhausting the keyspace of a given charset (in the example code posted on the reddit link,
I understand that, but it's not clear to me how to translate code that uses loops/iterations to ArrayFire for GPU parallelization.
I did earlier; here it is again:

```rust
// FNV-1a (32-bit) hashing algorithm
fn main() {
    let key = [0x68, 0x65, 0x6c, 0x6c, 0x6f]; // "hello"
    let mut hash: u32 = 2166136261; // Hash init value defined by the algorithm
    for byte in key.iter() {
        hash ^= *byte as u32;
        // wrapping_mul: the multiply is expected to overflow u32
        hash = hash.wrapping_mul(16777619); // Another magic number defined by the algorithm
    }
    println!("Hash is: {:x}", hash); // 4f9f2cab
}
```

That's it; it's very simple. It gets more complicated when I want to process much more than a single string with AF, as well as compare whether a calculated hash matches any of the ones I am checking against (not shown). Both tasks are easy to do outside of ArrayFire. I would like an example that shows how to scale this code and benefit from ArrayFire on the GPU instead of writing an OpenCL kernel myself. All this code does is take string input as bytes and iterate through it with a fixed starting value
Yes, I used this in the Reddit example... I will share it on this issue as well. We discussed that the memory error might have been due to:

```rust
extern crate arrayfire as af;
use af::*;
extern crate permutate;
use permutate::Permutator;

fn main() {
    // Defining the keyspace: keys 5 characters long, with an unfortunately small charset?
    // Perhaps I should skip the crate and find a way to generate these through ArrayFire?
    let alphabet = ["a", "b", "c", "d"]; //,"e","f","g","h"];
    let lists = [
        &alphabet[..],
        &alphabet[..],
        &alphabet[..],
        &alphabet[..],
        &alphabet[..],
    ];
    let permutator = Permutator::new(&lists[..]);

    let mut v0: Vec<u8> = Vec::new();
    let mut v1: Vec<u8> = Vec::new();
    let mut v2: Vec<u8> = Vec::new();
    let mut v3: Vec<u8> = Vec::new();
    let mut v4: Vec<u8> = Vec::new();
    for permutation in permutator {
        let x = permutation.concat();
        // Not sure if I'm structuring the data right here,
        // or if the next two lines are the right approach?
        let mut y: [u8; 5] = [0, 0, 0, 0, 0];
        y.copy_from_slice(x.as_bytes());
        v0.push(y[0]);
        v1.push(y[1]);
        v2.push(y[2]);
        v3.push(y[3]);
        v4.push(y[4]);
    }

    // The way I understand it, for each letter position in a string
    // I should have a separate Array to calculate in parallel?
    let dims = Dim4::new(&[1, v0.len() as u64, 1, 1]);
    let af0 = Array::new(v0.as_slice(), dims);
    let af1 = Array::new(v1.as_slice(), dims);
    let af2 = Array::new(v2.as_slice(), dims);
    let af3 = Array::new(v3.as_slice(), dims);
    let af4 = Array::new(v4.as_slice(), dims);
    let key = vec![af0, af1, af2, af3, af4];

    // FNV32-1a defined starting hash value
    let hash: [u32; 1] = [2166136261];
    // Not sure, but I think I need to fill an Array of the same shape for these two to compute with
    let mut hashes = Array::new(&hash, dims);
    let hashmul = constant(16777619u32, dims);

    // The expected hash for "hello"; note it won't match due to the current charset.
    // Could fill with `constant()`; I intend to use a large array of hashes to compare calculations to.
    // Not quite sure how to approach that (iterate through a 2D array of constants()?)
    let hash_list = Array::new(&[0x4f9f2cab], dims);
    // Returns an array of 0's & 1's based on matches.
    let result = eq(&hashes, &hash_list, false);

    // Fixed array required to store the results into? Do I have to define 625 individual values?
    let mut r_data = [0u32; 625];
    result.host::<u32>(&mut r_data);
    println!("Hash is: {:?}", r_data);
}
```

AF does not need a fixed-size array to do its work; it's fine with slices. My problem is with the `host` method on the AF Array (see the bottom of the code example,
A question that could be included in a docs FAQ (not sure if the Rust API docs are the right place), for developers wanting to adopt ArrayFire who are not familiar with OpenCL/CUDA or programming with parallelism in mind: when processing large datasets where you could process millions/billions at a time (a large enough workload to keep the hardware fully busy), if the algorithm permitted working on 8 individual u8 values (bytes) vs 1 single u64 value, would it make a difference? I know that on a smaller dataset it would allow for more parallelism, but if that's not an issue due to the large size, perhaps there is no advantage, or it would perform better on u64 values. As the end of the dataset is reached, at some point switching to individual bytes might keep the speed/performance up where it would otherwise decrease? I think I read somewhere about my GPU having effectively 32 threads?

**Example from existing OpenCL algorithm**

I am looking at porting an OpenCL algorithm I wrote over to ArrayFire. The original C algorithm was for the CPU and processed each byte individually; I noticed that it effectively worked in chunks of 3 u64 values at a time and altered the algorithm assuming that was an optimization. Each string is processed 24 bytes at a time, with the remaining <24 bytes handled a little differently. Here are some snippets of the calculations that are done:

```c
// The function that does this calculation has 4 repetitions, but the integers at the end change.
// Each line and repetition depends on the prior calculations; I don't think this can be improved.
abc.x ^= (abc.x -= abc.y, abc.x -= abc.z, abc.z >> 43);
abc.y ^= (abc.y -= abc.x, abc.y -= abc.z, abc.x << 9);
abc.z ^= (abc.z -= abc.x, abc.z -= abc.y, abc.y >> 8);

// This is the same without the vector usage and a slightly different order.
// As far as I know they're effectively the same; a/b/c are equivalent to x/y/z above.
a -= b; a -= c; a ^= c >> 43;
b -= a; b -= c; b ^= a << 9;
c -= a; c -= b; c ^= b >> 8;

// Another way to express it that looks more readable; my notes say it did not perform as well.
// Perhaps ArrayFire would allow this readable version but still optimize it to perform better
// than my poor understanding of coding OpenCL for performance.
__constant u32 BIT_SHIFT[12] = {
    43,  9,  8,
    38, 23,  5,
    35, 49, 11,
    12, 18, 22
};

for (int i = 0; i < 12; i += 3)
{
    a = (a - (b + c)) ^ (c >> BIT_SHIFT[i    ]);
    b = (b - (a + c)) ^ (a << BIT_SHIFT[i + 1]);
    c = (c - (a + b)) ^ (b >> BIT_SHIFT[i + 2]);
};
```

The `abc` vector (or non-vector a/b/c) values were created from 24 bytes like this:

```c
// The program (hashcat) that I wrote this for broke string data into 32-bit values (4-letter chunks).
// With ArrayFire I have a bit more control and could use either u8 values with the correct offset
// (bitshift) or u64 values without combining two 32-bit values like below. No idea if it matters for
// parallelism when the number of strings to process is already high, so u8 may not benefit vs u64?
// w[] is an array of 32-bit values, W64 is a vector of 3 64-bit values
W64.x = w[offset    ] + (w[offset + 1] << 32);
W64.y = w[offset + 2] + (w[offset + 3] << 32);
W64.z = w[offset + 4] + (w[offset + 5] << 32);
```

The final <24 bytes were handled with conditionals:

```c
// The last <24 bytes were assigned like above, then added. The original was a much larger
// switch statement with fallthrough and lots of bitshifting.
abc.z += pw_len; // The first byte encodes the length of the original string
abc.z += (len > 16) ? (W64.z << 8) : 0;
abc.y += (len > 8) ? W64.y : 0;
abc.x += (len > 1) ? W64.x : 0;
```

I think my OpenCL port and improvements should port over well to ArrayFire, looking quite similar; the x/y/z vector I'm not sure what that would look like. Input would be rows (strings) with columns of bytes (the string a row represents). The computation on those bytes (x/y/z or a/b/c in the snippets) would be a separate array I guess, where 3 columns for each component might make sense; only the last one, z/c, is required as the final result.
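A rough Rust rendering of the readable loop form above (my own sketch, not the hashcat kernel - constants are copied from the snippet, the function name is mine). It uses `wrapping_sub` because these subtractions are meant to overflow, which would panic in a debug build with plain `-`:

```rust
// Shift table from the OpenCL snippet above, three shifts per mixing round.
const BIT_SHIFT: [u32; 12] = [
    43,  9,  8,
    38, 23,  5,
    35, 49, 11,
    12, 18, 22,
];

// One full mix of the three 64-bit lanes: four rounds of
// a = (a - (b + c)) ^ (c >> s0), etc., with wrapping arithmetic.
fn mix(mut a: u64, mut b: u64, mut c: u64) -> (u64, u64, u64) {
    for s in BIT_SHIFT.chunks(3) {
        a = a.wrapping_sub(b).wrapping_sub(c) ^ (c >> s[0]);
        b = b.wrapping_sub(a).wrapping_sub(c) ^ (a << s[1]);
        c = c.wrapping_sub(a).wrapping_sub(b) ^ (b >> s[2]);
    }
    (a, b, c)
}

fn main() {
    let (a, b, c) = mix(0xdead_beef, 0xcafe_babe, 0x8bad_f00d);
    println!("{:x} {:x} {:x}", a, b, c);
}
```

Each lane update reads the other two lanes' latest values, so the rounds are inherently sequential per input; only the batch dimension (many a/b/c triples at once) parallelizes.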
I am sorry it has been quite some time since the last comment was made here. Here are some updates w.r.t. the clarifications requested.
That is because
Created a separate issue for ^ #233
No, none of the JIT operations involve memory until their time of evaluation comes forth.
C++ syntax is uniquely positioned to give the appearance that individual elements are being indexed, whereas in the background even a single indexed element is still an `Array`.
Yes, a CUDA program's kernels can be profiled using nvprof, nsight-compute, etc. Unfortunately, the situation is not as good for OpenCL, as old AMD GPUs are not supported in CodeXL. Traditional tools can be used to profile the CPU backend; to accurately know each function's timing, one can run the program with the environment variable
There has been a recent addition to the Rust wrapper documentation - a tutorial book. It is not comprehensive, but we keep updating it with new info. The indexing tutorial does mention the performance issues associated with individual access of elements in a note.
As you have found out, the reduction functions without

The reason the returned scalar is a double is different, and is related to overflow issues with certain operations and types. To convert that result to a boolean on the Rust side, we can always do

Apart from the book and other updates to the docs, we have also created a separate repo that has information comparing APIs from different libraries w.r.t. ArrayFire, API call comparisons across wrappers of ArrayFire, etc.
@9prady9 awesome, thanks for taking the time to respond :) I don't know when I'll have the time, but I'd still like to port the CPU Rust code I have for the FNV1a-32 hashing function with the permutation generator to ArrayFire, with a good way to check the GPU computations for matches ("hello" input bytes producing "4f9f2cab", matched against the same "4f9f2cab" provided as a variable or array value prior to computation, in order to find the original string(s)).

Example to port (Rust Playground):

```rust
// Purpose: given the target hash value 0x4f9f2cab, find an input that creates it.
fn main() {
    const TARGET_KEY: u32 = 0x4f9f2cab;
    // Some loop logic that iterates through permutations/inputs until a
    // hash result matches the value being searched for
    let not_matched = hash_algorithm(vec![0xca, 0xfe, 0xba, 0xbe]);
    let matched = hash_algorithm(permutator());
    assert_ne!(not_matched, TARGET_KEY); // No match, keep going
    assert_eq!(matched, TARGET_KEY); // A match to the target is found, stop

    // At this point, the input bytes or "hello" string would need to be
    // matched to the computed hash, as we're interested in knowing that
    // "hello" (or its bytes) is the underlying value for the hash 0x4f9f2cab.
    // If batch processing an array of inputs to compute hashes, then an index
    // should be sufficient.

    // Inform the user about the matched result:
    println!("Matched hash: {:x} to value: {:?}", matched, b"hello");
}

// Generates permutations for a keyspace and length,
// e.g. charset "abc" len 3 -> aaa, aab, aac, aba, abb, abc, and so forth.
// Actual permutator logic not shown; this example just returns "hello" as a permutation.
fn permutator() -> Vec<u8> {
    // ['h' , 'e' , 'l' , 'l' , 'o' ]
    // [0x68, 0x65, 0x6c, 0x6c, 0x6f]
    // [104 , 101 , 108 , 108 , 111 ]
    let key: Vec<u8> = "hello".bytes().collect();
    key
}

// FNV-1a (32-bit) hashing algorithm
const FNV_OFFSET: u32 = 2166136261;
const FNV_PRIME: u32 = 16777619;

fn hash_algorithm(key: Vec<u8>) -> u32 {
    // Initialize hash with the seed value
    let mut hash = FNV_OFFSET;
    // Iterate through the bytes, folding each into the u32 hash value
    for byte in key.iter() {
        hash ^= *byte as u32;
        hash = hash.wrapping_mul(FNV_PRIME);
    }
    hash
}
```

Ignoring the permutator, this should serve as a good example of where bottlenecks can be hit when porting for GPU compute if not approached right, and how to handle such situations. If I recall correctly, the GPU could compute through a large set of inputs (or generate the inputs within the GPU instead of transferring them) far faster than the CPU; it was just a problem when you need to check for a match and, if one is found, stop processing and identify the input that created that target output value.
@polarathene FNV-1a is a classic for-loop algorithm where each iteration's input depends on the previous iteration's output, unless there is a parallelized version that I am unaware of. I doubt FNV-1a can be parallelized.
I take it you're referring to parallelization on a single input, instead of a larger array of inputs, each processed iteratively as the algorithm works.

```rust
let inputs = [
    "aaa",
    "aab",
    "aac",
    "aba",
    "abb",
    "abc",
    // ...
];
```

When I opened this issue years ago, I had several GB of inputs in GPU memory that I used ArrayFire to generate for a pass. It would then run each input through the algorithm individually (in parallel to one another), and then I could send the resulting hashes back to the CPU and check for any that match the one I am searching for, to know what value created that hash. I could then generate the next set of inputs to do another pass through the hashing algorithm of choice and compare, until I found a match (or all matches if searching for several hashes).

ArrayFire is capable of this; the slowness was the post-processing part for me. I can only recall performance suffering from having to do a device transfer of results back for the CPU to check. My ArrayFire program handled compute on the GPU well, just not conditional logic, if I remember right.
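The batch model described above can be sketched in plain Rust, with std threads standing in for GPU lanes (the names and the two-way split are illustrative, not ArrayFire API): every input hashes independently, and only the final membership check needs to see all the results together.

```rust
use std::thread;

const FNV_OFFSET: u32 = 2_166_136_261;
const FNV_PRIME: u32 = 16_777_619;

// CPU reference of FNV-1a (32-bit), applied per input
fn fnv1a32(key: &[u8]) -> u32 {
    key.iter()
        .fold(FNV_OFFSET, |h, &b| (h ^ b as u32).wrapping_mul(FNV_PRIME))
}

// Split the batch in two and scan the halves concurrently; returns an
// input (from either half) whose hash equals the target, if any.
fn find_match(inputs: Vec<Vec<u8>>, target: u32) -> Option<Vec<u8>> {
    let mid = inputs.len() / 2;
    let right: Vec<Vec<u8>> = inputs[mid..].to_vec();
    let left: Vec<Vec<u8>> = inputs[..mid].to_vec();
    let handle = thread::spawn(move || right.into_iter().find(|k| fnv1a32(k) == target));
    let found_left = left.into_iter().find(|k| fnv1a32(k) == target);
    let found_right = handle.join().unwrap();
    found_left.or(found_right)
}

fn main() {
    let inputs: Vec<Vec<u8>> = ["aaaaa", "abcde", "hello", "zzzzz"]
        .iter()
        .map(|s| s.as_bytes().to_vec())
        .collect();
    // 0x4f9f2cab is the FNV-1a hash of "hello"
    let found = find_match(inputs, 0x4f9f2cab);
    assert_eq!(found, Some(b"hello".to_vec()));
}
```

The early-stop question from the thread shows up here too: each worker scans its whole half before results are combined, mirroring the GPU situation where the match check happens after the batch completes.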
That is true; I was referring to parallelization of the algorithm w.r.t. a single input stream of bytes. Sure, hashing a batch of inputs is trivially parallel, no doubt about that. However, at the moment there is no efficient way to do an FNV-1a hash on even one input using ArrayFire. The reason I say that is that accessing individual elements from GPU memory using indexing is going to be very inefficient, and it would have to be individual element access because of the data dependency across iterations. Having said that, a custom kernel is a different approach altogether and can most likely be more efficient.

If you are requesting hashing support via a function in upstream ArrayFire, then the request has to be moved to an issue on the upstream repository. I am curious as to how you managed to do FNV-1a on a single input byte stream using ArrayFire efficiently. Is there a chance you have your old code archived somewhere?
If I remember right, I believe I did something like take my charset ("abc" in this case) and create the permutations of the charset against the length (keyspace), where each new byte/index is a column; then I could run the loop iterating over the columns. I should have the old code somewhere; I'll try to locate it for you.
No feature request afaik; I just wanted to know how best to handle this type of computation, where I need to keep processing/iterating through permutations until a result being searched for is found, and be able to identify that input. The hashing algorithm itself shouldn't matter. If I can find the code, it should make more sense :)
@9prady9 My old code was in a bit of a messy state and I had some trouble getting it to run, but I've pieced together this example that seems to work now. It's only processing 3 inputs instead of GB of permutations generated on the GPU, but it should show one of the ways I approached the hashing algorithm with ArrayFire:

```rust
use arrayfire as af;

fn main() {
    // af::set_backend(Backend::CUDA);//Backend::OPENCL;//Backend::DEFAULT);
    af::init();
    af::info();

    /////////////////////////////////
    // Generate string inputs to hash
    /////////////////////////////////
    // Inputs are collapsed into a sequence/stream of bytes, a single array dimension
    let test_strings = vec!["hello", "howdy", "hallo"];
    let test_bytes = test_strings.iter()
        .flat_map(|s| s.as_bytes())
        .cloned()
        .collect::<Vec<u8>>();
    // println!("test_strings as byte array: {:?}", test_bytes);
    // [104, 101, 108, 108, 111, 104, 111, 119, 100, 121, 104, 97, 108, 108, 111]

    ///////////////////////////
    // Initialize values for AF
    ///////////////////////////
    // Used to size AF array dimensions
    let input_len = test_strings[0].len() as u64;
    let num_inputs = test_strings.len() as u64;
    // Convert to an ArrayFire 2D array (column major):
    // 5x3, col (length of bytes for an input) x row (number of inputs)
    let inputs_dims = af::Dim4::new(&[input_len, num_inputs, 1, 1]); // [5, 3, 1, 1]
    let inputs = af::Array::new(&test_bytes, inputs_dims);
    // Each input now has its bytes in its own column
    // print(&inputs);
    // [5 3 1 1]
    //   104 104 104
    //   101 111  97
    //   108 119 108
    //   108 100 108
    //   111 121 111
    let fnv_dims = af::Dim4::new(&[1, num_inputs, 1, 1]); // [1, 3, 1, 1]

    ////////////////////////////
    // Compute hashes for inputs
    ////////////////////////////
    let hashes = fnv1a32_gpu(&inputs, fnv_dims);
    // print(&hashes);
    // [1 3 1 1]
    //   1335831723 3497709110 4182196071
    // println!("hashes: {:x} | {:x} | {:x}", 1_335_831_723_u32, 3_497_709_110_u32, 4_182_196_071_u32);
    // 4f9f2cab | d07ace36 | f9473f67

    ///////////////////////
    // Find matching hashes
    ///////////////////////
    // Creates an AF array filled with the same target value to match against
    let target_hashes: af::Array<u32> = af::constant(0x4f9f2cab as u32, fnv_dims);
    // let target_hashes: af::Array<u32> = af::Array::new(&[0xf9473f67, 0x4f9f2cab], fnv_dims);
    let matches = check_for_matches(hashes, target_hashes);
    // println!("matched: {:?}", match_data);
    // matched: [1335831723]
    for matched in matches {
        println!("Matched the hash: {:x}", &matched);
    };
    // Matched the hash: 4f9f2cab
}

// FNV-1a (32-bit) hashing algorithm
const FNV_OFFSET: u32 = 2_166_136_261;
const FNV_PRIME: u32 = 16_777_619;

pub fn fnv1a32_gpu(inputs: &af::Array<u8>, fnv_dims: af::Dim4) -> af::Array<u32> {
    let input_len = inputs.dims().get()[0];
    let mut hashes = af::constant(FNV_OFFSET, fnv_dims);
    let prime = af::constant(FNV_PRIME, fnv_dims);
    // Iterates through all inputs (columns) in parallel, one byte (row) at a time
    for row_index in 0..input_len {
        hashes = (hashes ^ af::row(&inputs, row_index)) * &prime;
    }
    hashes
}

fn filter_matches(array: &af::Array<u32>, bools: &af::Array<bool>) -> af::Array<u32> {
    let indices = &af::locate(bools);
    let mut idxr = af::Indexer::default();
    idxr.set_index(indices, 0, None);
    af::index_gen(array, idxr)
}

fn check_for_matches(hashes: af::Array<u32>, target_hashes: af::Array<u32>) -> Vec<u32> {
    // If we computed the same hash value, then `eq()` will let us know which values matched
    let is_matched: af::Array<bool> = af::eq(&hashes, &target_hashes, false);
    // Only keep results that were matched
    let result = filter_matches(&hashes, &is_matched);
    // Transfer the matches to the host CPU to access
    let length = result.elements() as usize;
    let mut match_data: Vec<u32> = vec![0; length];
    result.host::<u32>(&mut match_data);
    match_data
}
```

As the input data to process is finite and not going to take long, there's no logic for continuing/stopping the computation when a match is found. I can try to get my GPU permutator code working again and share that if helpful. I think for the long-duration processing, I wanted to create a streaming iterator and was blocked by GATs (Generic Associated Types), which were still not implemented in Rust. Perhaps the above could still be refined into a useful example for ArrayFire? I could provide a similar non-ArrayFire version if you see value in this as an example of how to think about/approach writing for the GPU with ArrayFire?
Thank you for digging up the old code. As I said earlier, you are doing individual accesses here in the below section of code:

```rust
// Iterates through all inputs (columns) in parallel, one byte (row) at a time
for row_index in 0..input_len {
    hashes = (hashes ^ af::row(&inputs, row_index)) * &prime;
}
```

Sure, all these operations are async, but the bytes of each input's column are accessed one at a time in each iteration of the loop, which is what I said earlier is going to be inefficient. As in, it can be done in a better way with a kernel written especially for this kind of hashing algorithm to run on a batch of inputs.
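To make the coalesced-layout suggestion concrete, here's a plain-Rust sketch (the helper name is mine, not an ArrayFire function) of the rearrangement being discussed: byte position i of every input becomes one contiguous run, so each step of the hash loop reads one contiguous slab instead of strided single bytes per input.

```rust
// Transpose a batch of equal-length keys from "each key's bytes contiguous"
// to "each byte position contiguous across all keys": out[pos * n + k] is
// byte `pos` of key `k`. GPU threads reading one position then touch
// adjacent addresses, which is what memory coalescing requires.
fn transpose_keys(keys: &[&[u8]], key_len: usize) -> Vec<u8> {
    let mut out = Vec::with_capacity(keys.len() * key_len);
    for pos in 0..key_len {
        for key in keys {
            out.push(key[pos]);
        }
    }
    out
}

fn main() {
    let keys: [&[u8]; 3] = [b"hello", b"howdy", b"hallo"];
    let slab = transpose_keys(&keys, 5);
    // First 3 bytes are byte 0 of every key: 'h', 'h', 'h'
    assert_eq!(&slab[..3], b"hhh");
    // Next 3 are byte 1 of every key: 'e', 'o', 'a'
    assert_eq!(&slab[3..6], b"eoa");
}
```

On the host this transpose changes nothing about the hash itself; it only reorders storage so that one row/position access maps to one contiguous region.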
Oh? I had thought that it was taking the whole row and computing against that, instead of each being a separate access. I could always rework the data to use columns instead of rows, e.g. 3x5? Or is that the same problem?

Again, this particular part (the hash algorithm) wasn't the performance concern I was having trouble with; it was identifying matches, which involved transferring data to the host each time to look up the matches and decide what to do next. In the above snippet, I am only informed there was a match, not what the input value was. The following approach instead retains an index before removing the unmatched elements:

```rust
fn check_for_matches(hashes: af::Array<u32>, target_hashes: af::Array<u32>) -> Vec<(usize, u32)> {
    // If we computed the same hash value, then `eq()` will let us know which values matched
    let is_matched: af::Array<bool> = af::eq(&hashes, &target_hashes, false);
    // Any unmatched value becomes 0
    let result = af::mul(&hashes, &is_matched, false);
    // Transfer the matches to the host CPU to access
    let length = result.elements() as usize;
    let mut match_data: Vec<u32> = vec![0; length];
    result.host::<u32>(&mut match_data);

    let results: Vec<(usize, u32)> = match_data.into_iter()
        .enumerate()
        .filter(|&(_, val)| val != 0)
        .collect();
    // println!("results: {:?}", results);
    // results: [(0, 1335831723)]
    results
}

// ... main()
let matches = check_for_matches(hashes, target_hashes);
for (index, value) in matches {
    println!(
        "Matched the hash: {:x}, original input was: {:?}",
        value,
        test_strings[index]
    );
};
// Hash is: 4f9f2cab, input: "hello"
```

According to my notes, multiplying by the bool array was faster than the `locate()` approach
How does one go about writing these kernels? I take it that's outside of ArrayFire? Is that direct OpenCL/CUDA code? One that would be of interest is AES-CBC 256-bit decryption; I guess that's something where a custom kernel would be useful too?
Well, it is running on the GPU, except that the memory access is not contiguous. Notice that the elements of a single input are being handled in a host-side for loop. Each such value for every input in every iteration is

Having said that, I still believe a custom kernel can be written for this algorithm that would be faster than my above suggestion.
Is there an example / documentation on approaching that?
I don't have specific timings logged unfortunately, just the old comment about it.
The input values would iterate through lengths; for a given length, all inputs were that fixed length. Typical lengths would be 6-12 if I recall correctly. For the Jenkins hash, which is more involved, it was being used on filepaths as inputs; those would be much longer, but permutations were often a similar 6-12 length for the filename, with the paths only permuting known directories. The actual size of the array of permutations would depend on GPU memory; the number of permutations could be very large and require many hours if not days to brute force. It was similar to a popular piece of software called Hashcat, commonly used for password recovery. In my code I was moving permutation generation to the GPU and referred to it as "tiling" the charset; it would fit as many of those tiles as needed to cover the keyspace in GPU memory and operate on them in as few passes/iterations of inputs as possible. I have an Nvidia GTX 1070 with 8GB of VRAM, and I think I used 6-8GB when GPU computing with high/full usage of the CUDA cores.
I timed most parts individually; there was a macro for it, possibly one for the CPU and another that was ArrayFire-specific. I would need to look through that code again. While I could probably run the code again with the timing logic added to compare for you, presently I'm only able to use the CPU backend, I think. My system is low on disk space and can't install CUDA as a result; unsure about OpenCL.
I would have to think it through. Off the top of my head, it would most likely start similar to your OpenCL kernel implementation, I believe. But one thing is certain: the data would have to be laid out so that GPU global memory accesses from all threads are coalesced as the input bytes for the different streams are read. That is the most obvious part. Then it would probably involve some loop unrolling for processing a single input byte stream, if we have prior knowledge of the minimum number of bytes, etc.
I wonder how `locate` is throttling the operation if each single input is so small and the actual number of such inputs is huge, given they run in parallel as independent operations. You should try rearranging the data so that access to GPU memory is coalesced, as I mentioned here.
If possible, please share the timing code. ArrayFire is better timed as an average of multiple runs, given that there is device warmup cost and kernel compilation for any functions used on their first call. Also, multiple JIT operations coalesce into a single kernel; if you are timing individual statements, you are likely doing more work than necessary - ArrayFire JIT operations are better timed as a whole functional unit.
Also, let us move this discussion to Slack or perhaps another issue, as we moved away from the issue's original intent - "code examples" - some time back :) Feel free to raise another issue with the relevant details, or we can discuss over Slack DM.
It would probably make more sense to rename this issue and recreate the "Add code examples" one as a new issue at this point 😅 I've created a new issue regarding what the bulk of this thread ended up focusing on; I'll try to find some time to meet your requests for the additional code and data layout change. Do you want me to continue sharing code fences in the issue thread comments, or would a repo be preferred?
I'm not a Slack user; I am fine with Discord, Facebook, email, or a forum if you have one. Otherwise the GitHub issue works well for me and might be helpful to someone else (although I guess this is a bit of a niche topic).
Perhaps :) I meant to keep this issue as a placeholder for continually adding examples, both standalone and documentation snippets - not specifically about any one in particular. So I basically don't expect this issue to be closed anytime soon - at least not until every API has at least one example in its documentation. Thanks for moving the discussion to a specific issue. A GitHub issue works fine for us; we just didn't want too much specific discussion in the placeholder issue.
I'd appreciate keeping the history of the discussion here, so if you want to hide the lengthy discussion, please consider marking it as outdated/off-topic rather than deleting it :)
👍 I am not deleting anything; I just wanted to move it to a separate issue so that this one doesn't get too populated with auxiliary/tangential conversation.