Integrate from fileinfo.com list of file extensions #112

Open
rpdelaney opened this issue Dec 3, 2018 · 24 comments

Comments

@rpdelaney
Collaborator

rpdelaney commented Dec 3, 2018

Should be some kind of metadata, but maybe not in the existing metadata group

Edit: This was originally about .vcf and .vcard extensions but it sprawled into something much more ambitious.

@trapd00r
Owner

I've added a few filetypes in my personal branch. About 2500 of them, so I didn't want to destroy the work you've done with sorting files in the master branch, and sorting all of these would be way too tedious. Just a heads up. :)

@rpdelaney
Collaborator Author

Showing 1 changed file with 3,038 additions and 651 deletions.

Holy ... what? Where did you get all this?

@trapd00r
Owner

I scraped data from the Wikipedia article on file extensions. Wanted to send you a PM about it but there's no such feature yet on GitHub, it seems. :)

@rpdelaney
Collaborator Author

Hahaha. Holy crap. You don't get any performance issues or anything with that? When you type env does your terminal explode?

It might be possible to organize these (relatively) rapidly using a script that prints Wikipedia's description of a filetype and gives you buttons to hit for which category to put it in. But that's still 3000 bloody mouse clicks.
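
A minimal sketch of such a keyboard-driven sorter, assuming a POSIX shell and an input file with one "EXTENSION<TAB>DESCRIPTION" pair per line. The function name `categorize`, the file layout, and the three categories are all hypothetical and only illustrate the idea:

```sh
# categorize FILE: FILE holds one "EXTENSION<TAB>DESCRIPTION" per line
# (an assumed layout). Prompts go to stderr; "EXTENSION<TAB>CATEGORY"
# results go to stdout. One key plus Enter is read from stdin per entry.
categorize() {
    tab=$(printf '\t')
    while IFS=$tab read -r ext desc <&3; do
        printf '%s: %s\n[a]udio [i]mage [d]ocument [s]kip? ' "$ext" "$desc" >&2
        read -r key || break
        case $key in
            a) category=audio ;;
            i) category=image ;;
            d) category=document ;;
            *) continue ;;          # "s" or anything else skips the entry
        esac
        printf '%s\t%s\n' "$ext" "$category"
    done 3< "$1"
}
```

Something like `categorize extensions.tsv > categorized.tsv` (both file names made up) would then walk the list one keypress at a time, with the input file read on a separate descriptor so the answers can still come from the terminal.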

In #109 I planned to make a CONTRIBUTING.md. Maybe we could stick them in there and invite people to add them to the appropriate category, but then we might as well just hyperlink to the Wikipedia article you got them from. I dunno what's best.

@trapd00r
Owner

env 0.00s user 0.00s system 51% cpu 0.007 total
I haven't noticed any performance drops whatsoever. :)

Yeah, I don't know the best way to approach this either. If the wiki description for each extension read something like audio/mp3, image/jpeg or whatever, it would be possible to do this programmatically...

However, I've made a somewhat clean dump of the extensions and descriptions if anyone's up for sorting all of these out somehow: https://github.com/trapd00r/LS_COLORS/blob/japh/wiki_fileext.txt

@trapd00r
Owner

I figured using libmagic would work wonders (file uses it: file foo.*). The issue, though, is that it guesses the filetype based on the first few bytes of a file, so you can't just touch all of these 3k file.extensions since they'll be empty. You'll have to actually create the files in question.

Here's the database that libmagic uses: https://github.com/threatstack/libmagic/tree/master/magic/Magdir
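
A small sketch of that limitation, assuming file(1) is installed and supports the `--mime-type` option. The helper name `mime_of` is made up. Both files carry the same extension, yet only the one with real content can be identified, because the detection looks at the bytes and not at the name:

```sh
# mime_of FILE: ask file(1), which is built on libmagic, for the MIME type.
mime_of() {
    file -b --mime-type -- "$1"
}

if command -v file >/dev/null 2>&1; then
    tmpdir=$(mktemp -d)
    : > "$tmpdir/empty.sh"                              # zero bytes, like touch
    printf '#!/bin/sh\necho hi\n' > "$tmpdir/real.sh"   # has actual content
    mime_of "$tmpdir/empty.sh"    # reported as an empty file
    mime_of "$tmpdir/real.sh"     # reported according to its content
    rm -rf "$tmpdir"
fi
```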

@trapd00r
Owner

ftftft

Give me a few days...

@rpdelaney
Collaborator Author

Why days? I could cross-reference that super fast in sqlite. If this is a lot of work for you, stand back! I got this.

Still not sure we want to do this though. env is going to fill up my terminal buffer...

@trapd00r
Owner

Sure, go ahead. I added everything in a dictionary: https://gist.github.com/trapd00r/554f03450ed114fee191e794c87b0215

I am not sure either, but there's no performance issues so why not, really. :D

@rpdelaney
Collaborator Author

rpdelaney commented Dec 14, 2018

Great, that will be super easy to parse.

Some of these are kind of giving me lulz though. '9.PNG' => "NinePatchDrawable Image", really? But I'll probably only include those that I can cross-reference with libmagic.

I am not sure either, but there's no performance issues so why not, really. :D

I use direnv and environment variables for various purposes so I often do env | grep -i foo. I'm not going to enjoy all the extra accidental collisions with LS_COLORS, especially since each false match will scroll everything off my terminal buffer. Might just have to write a wrapper of some kind that extracts what I need with some explicit exclusion of LS_COLORS so I don't ever accidentally hit it. Edit: Now that I think about it I bet there is something that could handle this for me. I'll look around.
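
A minimal sketch of such a wrapper, assuming a POSIX shell. The function name `envgrep` is hypothetical. It drops the LS_COLORS line before searching, so a huge LS_COLORS value can never produce a match:

```sh
# envgrep PATTERN: case-insensitive search of the environment,
# with the LS_COLORS entry explicitly excluded.
# The name "envgrep" is made up for this sketch.
envgrep() {
    env | grep -v '^LS_COLORS=' | grep -i -- "$1"
}
```

With this in an rc file, `envgrep foo` takes the place of `env | grep -i foo`.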

Anyway, the point is my use case is probably not the normal one, so if performance is really that much of a non-issue then there's little reason not to include these if we can automate the categorization.

@rpdelaney
Collaborator Author

rpdelaney commented Dec 14, 2018

Also, would it be a goal to automate the scraping / categorization? That seems horrendously over-engineered but people will be updating the list on Wikipedia ...

edit: a script to build the LS_COLORS out of some kind of database (dunno what format yet, probably simple json would do it) could be useful regardless. That would enable us to do things like have names/labels for the colors themselves and then associate extensions with the named labels, etc.
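
A rough sketch of such a generator, assuming awk is available. To avoid depending on a JSON parser, it swaps the JSON idea for a plain whitespace-separated file with two kinds of lines, "label NAME COLOR" and "ext EXTENSION NAME". That layout, the function name `build_ls_colors`, and the sample colors are assumptions made for illustration only:

```sh
# build_ls_colors FILE: print an LS_COLORS value built from FILE.
# The file is read twice: the first pass collects the named labels,
# the second pass emits one "*.EXT=COLOR:" entry per extension whose
# label is known. Extensions with an unknown label are dropped.
build_ls_colors() {
    awk '
        NR == FNR { if ($1 == "label") color[$2] = $3; next }
        $1 == "ext" && ($3 in color) { printf "*.%s=%s:", $2, color[$3] }
        END { print "" }
    ' "$1" "$1"
}

db=$(mktemp)
cat > "$db" <<'EOF'
label audio 38;5;137
label image 38;5;97
ext mp3 audio
ext flac audio
ext png image
EOF
build_ls_colors "$db"   # prints *.mp3=38;5;137:*.flac=38;5;137:*.png=38;5;97:
rm -f "$db"
```

Recoloring a whole category then means editing one label line instead of every extension that uses it.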

rpdelaney changed the title from "Set colors for .vcf and .vcard files" to "Integrate from Wikipedia's list of file extensions" on Dec 14, 2018

@trapd00r
Owner

Yeah, forgot to tell you, but the extensions in my dict above are scraped from fileinfo.com. Their descriptions were a lot better (and there are also more extensions). And yeah, some of them are pretty bonkers...

I'm all for automation. I'll tinker more with this tomorrow after a good night's sleep...

@trapd00r
Owner

trapd00r commented Dec 14, 2018

We could cheat and scrape from their already defined categories but not sure if every filetype is categorized. Maybe it's good enough anyway.

Edit: If you're going to scrape anything, do note that only 500 results are shown by default. You'll have to scroll down and click "view full list".

trapd00r changed the title from "Integrate from Wikipedia's list of file extensions" to "Integrate from fileinfo.com list of file extensions" on Dec 14, 2018

@trapd00r
Owner

trapd00r commented Dec 15, 2018

https://github.com/trapd00r/LS_COLORS/tree/motherofgod/bin/scrape_fileinfo

  • auto-scrape from fileinfo.com
  • every file extension categorized
  • every file extension commented
  • a valid LS_COLORS file generated on STDOUT, including folding markers for vim :)

@trapd00r
Owner

Btw, this wasn't an issue with the entries I scraped from Wikipedia (only 2,500+), but this is over 11k entries and, welp, we run into the 128 KiB limit per env var (32 pages of 4 KiB each).

MAX_ARG_STRLEN is a constant defined as PAGE_SIZE * 32 in /path/to/linux/headers/include/uapi/linux/binfmts.h. It cannot be changed without recompiling the kernel.

It's kind of a big deal because:

git⸢motherofgod」% eval $(dircolors -b ./auto.LS_COLORS)
% env
zsh: argument list too long: env
% perl -ehi
zsh: argument list too long: perl
% date
zsh: argument list too long: date
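
A sketch of a pre-flight check that could catch this before exporting, assuming a POSIX shell with getconf, wc and head available. All three function names are hypothetical. The check has to run while the candidate value is still an unexported shell variable, because once an oversized variable is exported no external command can be started, as the session above shows:

```sh
# Per-string limit described above: the page size times 32.
max_arg_strlen() {
    echo $(( $(getconf PAGESIZE) * 32 ))
}

# env_entry_size NAME VALUE: bytes taken by "NAME=VALUE" plus the
# terminating NUL byte.
env_entry_size() {
    n=$(printf '%s=%s' "$1" "$2" | wc -c)
    echo $(( n + 1 ))
}

# fits_in_env NAME VALUE: succeed if the entry stays within the limit.
fits_in_env() {
    [ "$(env_entry_size "$1" "$2")" -le "$(max_arg_strlen)" ]
}

candidate='*.mp3=38;5;137:*.png=38;5;97:'   # sample value, not yet exported
if fits_in_env LS_COLORS "$candidate"; then
    echo "LS_COLORS fits"
else
    echo "LS_COLORS is too long for a single environment string" >&2
fi
```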

@rpdelaney
Collaborator Author

Heh. Maybe some kind of shell extension could delegate highlighting to a subprocess. Zsh might be able to do that with a plugin, but I use bash. Regardless, even if it were workable to do that delegation, we'd actually really have to worry about performance now. Some directories have thousands of files in them. And only really weird people are going to want to install extensions like that just so that they can have a special color for LogonStudio Windows Vista Logon Screen. Speaking for myself, I may be weird, but I'm not likely to be among those weird people.[1]

What I'm saying is, I think we need two things:

  1. Figure out what the upper limit actually is, in concrete terms, for how many file types we can support in a cross-platform way (read: without doing anything radical like what I described above).
  2. Figure out a way to reduce these into a list of types that constitute the low hanging fruit, within that limit.

[1]: Speaking of which, most of these file types have no significance in a *nix environment, which is where 99% of users of LS_COLORS will be.

@trapd00r
Owner

Given that all extensions use the ECMA-48 spec notation and each extension has 5 chars, we could do roughly 9k (13 chars per entry). And agreed, a curated list would work better, but then this whole automation thing falls short.
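
The estimate is a division of the per-string limit by the entry length. A minimal sketch of that arithmetic, assuming 4 KiB pages (so that the page size times 32 is 131072 bytes) and assuming the 13 characters break down as "*." plus a 5-character extension, "=", a 4-character color code, and ":". The function name `entries_that_fit` is hypothetical:

```sh
# entries_that_fit LIMIT_BYTES EXT_LEN COLOR_LEN
# Each entry looks like "*.EXT=COLOR:", which takes
# 2 + EXT_LEN + 1 + COLOR_LEN + 1 bytes.
entries_that_fit() {
    entry_len=$(( 2 + $2 + 1 + $3 + 1 ))
    # Reserve 10 bytes for the "LS_COLORS=" prefix and 1 for the final NUL.
    echo $(( ($1 - 10 - 1) / entry_len ))
}

entries_that_fit $(( 4096 * 32 )) 5 4   # prints 10081
entries_that_fit $(( 4096 * 32 )) 5 8   # prints 7709 (colors like 38;5;137)
```

Under these assumptions the ceiling is on the order of 10k entries, and it drops noticeably once longer 256-color codes are used.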

@trapd00r
Owner

I might be able to trim the list quite effectively, I happened to write a thing while playing around with this...

https://github.com/trapd00r/File-Extension

@rpdelaney
Collaborator Author

That's cool. Let me look into this and get back to you.

Btw, are we concerned about who holds the copyright for the descriptions of the file types at fileinfo.com? I haven't looked into that at all.

@rpdelaney
Collaborator Author

Kind of small potatoes but if you have imagemagick installed, identify -list format is a pretty handy list of graphics extensions with descriptions.

@pagerc

pagerc commented Dec 13, 2019

Why days? I could cross-reference that super fast in sqlite. If this is a lot of work for you, stand back! I got this.

Still not sure we want to do this though. env is going to fill up my terminal buffer...

For folks concerned with blowing up their environment, try something like this:
Use dircolors to set the environment variable, but strip the export and eval the output.
Then set an alias that expands the current environment's LS_COLORS value. This should be in your rc file and not your profile so that it's executed on every interactive shell invocation.

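# Only do this when dircolors is available.
type dircolors >/dev/null 2>&1 && {
    # dircolors -b prints the LS_COLORS assignment followed by "export LS_COLORS";
    # sed '$d' drops that last line, so the variable is set but never exported.
    eval `{ dircolors -b ${XDG_CONFIG_HOME}/sh/dir_colors 2>/dev/null || dircolors -p | dircolors -b ; } | sed '$d'`
}
# Hand LS_COLORS to ls alone instead of to every process.
alias ls="LS_COLORS=\"\${LS_COLORS:-${LS_COLORS}}\" ls ${COLOR_OPTS} -h --time-style=long-iso"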

@trapd00r
Owner

On my way to Zanzibar right now but I stumbled upon this on hackernews:
http://fileformats.archiveteam.org/wiki/Category:File_formats_by_extension

Pretty comprehensive and with a lot of information on each type.

@nogweii

nogweii commented Mar 17, 2024

Would it make sense to compile this file list into a YAML file, à la vivid's config? This could be done as part of #195.

@rpdelaney
Collaborator Author

  • These are two big pieces of work with a lot of risk of breakage. They should be done separately, to make the work easier to perform and easier to roll back in the worst case.
  • The migration to vivid should be done first.
