Upgrade to CLD2 #9
Repository Root: http://cld2.googlecode.com/svn Repository UUID: b252ecd4-b096-bf77-eb8e-91563289f87e Revision: 161
- Use ExtDetectLanguageSummary
- Build shared object
- Update tests:
  - Chinese (Traditional) is now "zh-Hant"
  - "Unknown" is no longer "reliable", see https://code.google.com/p/cld2/issues/detail?id=1

- Makes the installed gem about 4% smaller.
- The shared object is installed to lib.

- Installs the shared object in lib.
- A test cycle now takes one Rake command: `rake clean compile spec`

- Defaults to the smaller version of CLD2.
- Allow cleanup of temporary files to be disabled.
wow, that is a very large gem! is there any way we can reduce this? 6mb was already too much.
I found that some of the CLD2 source files are not necessary to build the libraries. The gem is now 17 MiB and installed it uses 46 MiB. If we commit to just one of
The unavoidable fact is that the source contains large tables of pre-computed n-grams. `cld2_generated_quad0122.cc` is required to build
If CLD2 were to release an archive/tarball, we could ship zero source files and download it before compiling the extension using something like
I looked into downloading bare files from the project repository, but we either need to
Another option is to ship binary/pre-compiled gems. At first pass, it looks like the smaller gem would be less than 2 MiB and the larger would be less than 5 MiB. I don't have any experience releasing a binary gem.
Any chance there has been any progress or updates with this? I'd love to help out with this if possible.
I would also like to contribute. Let's solve this issue ASAP. This issue has been pending for more than a year just because of the size of CLD.
Here is a similar implementation in JavaScript. We can take cues from that: https://github.com/dachev/node-cld
@jtoy can we reconsider this? The gem did get larger, but so did the source library. I don't think there is a clean way to avoid this and still allow anyone to use the gem.
Any update on this?
@craig-day can you merge and release this?
I'll take a look hopefully tomorrow or Monday morning at the latest.
The CLD2 project has moved to https://github.com/CLD2Owners/cld2/
@cbandy is this still ready to go? I'd like to merge and release a new major version.
It has been a long time since I looked at this.
I still don't see a tarball; at least not one provided by GitHub tags/releases.
Maybe this is more reasonable now that it is hosted in Git? I forget how common it is for Gem installers to have
This appears to be the revision/commit that I imported in this PR: CLD2Owners/cld2@d076f5e
@cbandy can you update the readme link and pull in any new changes? I'm not sure if the tarball is a concern right now. I'd rather avoid a git dependency because not all places gems get installed have git (like production servers).
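For reference, a pinned revision can in principle be fetched without git, because GitHub serves an archive of any ref over HTTPS at `/archive/<ref>.tar.gz`. The following is only a rough sketch using Ruby's standard library. The helper names `cld2_archive_uri` and `download_cld2` are hypothetical and not part of this PR, and whether the abbreviated SHA above resolves in an archive URL is an assumption (a full SHA or a tag would be the safer choice).

```ruby
require "open-uri"
require "uri"

# Repository location taken from the thread above.
CLD2_REPO = "https://github.com/CLD2Owners/cld2"

# Hypothetical helper: build the archive URL for a commit, tag, or branch.
def cld2_archive_uri(ref)
  URI.parse("#{CLD2_REPO}/archive/#{ref}.tar.gz")
end

# Hypothetical helper: stream the archive to `destination` over HTTPS.
# open-uri follows GitHub's redirect to the download host.
# Only defined here, not called, because it needs network access.
def download_cld2(ref, destination)
  URI.open(cld2_archive_uri(ref).to_s) do |remote|
    File.open(destination, "wb") { |file| IO.copy_stream(remote, file) }
  end
  destination
end
```

Something like `download_cld2(ref, "cld2.tar.gz")` could then run before the extension is compiled. The archive would still need to be unpacked, and the install would need network access.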
As far as licensing goes, I think you can copy the Apache license from the CLD2 owners. It looks like our original license was just copied from them anyway.
@cbandy I don't think this project will be updated, I suggest you release your code as a new
See #8.
Before this is merged, we should update our licensing. The library has changed to the Apache license.
The size of the bundled library has grown significantly. The source itself is over 90 MiB. The gem is now 35 MiB (up from 6 MiB) and installed it uses 93 MiB (up from 17 MiB). If CLD2 ever releases a tarball, we can stop bundling it here and shrink the installed size to 2 MiB.
There are two possible CLD2 libraries to link against: `libcld2.so` and `libcld2_full.so`. The latter can detect twice as many languages and is 4 MiB larger. I arbitrarily chose the former, smaller library in this PR. Which would you prefer to be used by default? In either case, we can also make this configurable during `gem install`.
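A minimal sketch of how the library choice might be made configurable at install time. The flag name `--enable-cld2-full` and the helper `cld2_library_name` are assumptions for illustration, not options this PR defines. The linker names follow from the two shared objects named above.

```ruby
# Hypothetical helper for an extconf.rb: pick which CLD2 library to link.
# `args` is the list of build options passed through by `gem install`.
# The flag name "--enable-cld2-full" is an assumption.
def cld2_library_name(args)
  if args.include?("--enable-cld2-full")
    "cld2_full" # links libcld2_full.so, the larger build
  else
    "cld2"      # links libcld2.so, the smaller default chosen in this PR
  end
end

# In a real extconf.rb, mkmf's enable_config("cld2-full", false) could
# replace the manual check above, since it reads --enable-*/--disable-* flags.
```

With a flag like this, someone who wants the larger library could pass it after the `--` separator, for example `gem install GEM_NAME -- --enable-cld2-full`, and everyone else would get the smaller default.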