Merge branch 'develop' for MeTA v2.1.0
smassung committed Feb 13, 2016
2 parents 7d7540c + 8de3b89 commit b56ab0f
Showing 66 changed files with 2,994 additions and 527 deletions.
4 changes: 2 additions & 2 deletions .appveyor.yml
@@ -9,12 +9,12 @@ install:
- bash -lc "pacman --noconfirm --needed -Sy bash pacman pacman-mirrors msys2-runtime msys2-runtime-devel"
# we don't actually need ada, fortran, libgfortran, or objc, but in
# order to update gcc we need to also update those packages as well...
- bash -lc "pacman --noconfirm -S mingw-w64-x86_64-{gcc,gcc-ada,gcc-fortran,gcc-libgfortran,gcc-objc,cmake,make,icu,jemalloc}"
- bash -lc "pacman --noconfirm -S mingw-w64-x86_64-{gcc,gcc-ada,gcc-fortran,gcc-libgfortran,gcc-objc,cmake,make,icu,jemalloc,zlib}"
before_build:
- cd C:\projects\meta
- git submodule update --init --recursive
- bash -lc "export PATH=/mingw64/bin:$PATH && cd $APPVEYOR_BUILD_FOLDER && mkdir build && cd build && cmake .. -G \"MSYS Makefiles\""
build_script:
- bash -lc "export PATH=/mingw64/bin:$PATH && cd $APPVEYOR_BUILD_FOLDER/build && make"
test_script:
- bash -lc "export PATH=/mingw64/bin:$PATH && cd $APPVEYOR_BUILD_FOLDER/build && cp ../config.toml . && ctest --output-on-failure"
- bash -lc "export PATH=/mingw64/bin:$PATH && cd $APPVEYOR_BUILD_FOLDER/build && cp ../config.toml . && ./unit-test --reporter=spec"
4 changes: 2 additions & 2 deletions .travis.yml
@@ -96,5 +96,5 @@ before_script:

script:
- git submodule update --init --recursive
- ../travis/cmake.sh Debug && make && make clean
- rm -rf CMake* && ../travis/cmake.sh Release && make && ctest --output-on-failure
- ../travis/cmake.sh Debug && make -j2 && make clean
- rm -rf CMake* && ../travis/cmake.sh Release && make -j2 && ./unit-test --reporter=spec
40 changes: 39 additions & 1 deletion CHANGELOG.md
@@ -1,3 +1,40 @@
# [v2.1.0][2.1.0]
## New features
- Add the [GloVe algorithm](http://www-nlp.stanford.edu/pubs/glove.pdf) for
training word embeddings and a library class `word_embeddings` for loading and
querying trained embeddings. To facilitate returning word embeddings, a simple
`util::array_view` class was added.
- Add simple vector math library (and move `fastapprox` into the `math`
namespace).
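
As a rough illustration of why a non-owning view helps when returning embeddings, the sketch below hands back one row of a row-major matrix without copying it. The class and the `embedding_row` helper are hypothetical simplifications written for this note; they are not the actual `util::array_view` or `word_embeddings` interfaces.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical, simplified non-owning view over a contiguous range.
// The viewed storage must outlive the view.
template <class T>
class array_view
{
  public:
    array_view(T* start, std::size_t len) : start_{start}, len_{len}
    {
    }

    T& operator[](std::size_t idx) const
    {
        assert(idx < len_);
        return start_[idx];
    }

    T* begin() const
    {
        return start_;
    }

    T* end() const
    {
        return start_ + len_;
    }

    std::size_t size() const
    {
        return len_;
    }

  private:
    T* start_;
    std::size_t len_;
};

// Returns a view of row `row` of a row-major matrix whose rows each hold
// `dim` values. No elements are copied.
inline array_view<const double>
embedding_row(const std::vector<double>& matrix, std::size_t dim,
              std::size_t row)
{
    assert(dim > 0 && (row + 1) * dim <= matrix.size());
    return {matrix.data() + row * dim, dim};
}
```

A caller can iterate over the returned view with a range-based `for` loop, as long as the underlying vector stays alive.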

## Bug fixes
- Fix `probe_map::extract()` for `inline_key_value_storage` type; old
implementation forgot to delete all sentinel values before returning the
vector.
- Fix incorrect definition of `l1norm()` in `sgd_model`.
- Fix `gmap` calculation where an average precision of 0 was ignored.
- Fix progress output in `multiway_merge`.
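
To make the `gmap` item concrete: geometric mean average precision is the geometric mean of the per-query average precision values, so a single query with an average precision of 0 should pull the whole score to 0 instead of being skipped. The function below is a minimal sketch under that assumption; its name and signature are made up for this note and it is not MeTA's evaluation code.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Geometric mean of per-query average precision (AP) values, computed in
// log space. Assumption: a query with AP == 0 makes the result exactly 0
// rather than being dropped from the average.
inline double gmap_sketch(const std::vector<double>& avg_precisions)
{
    if (avg_precisions.empty())
        return 0.0;

    double log_sum = 0.0;
    for (double ap : avg_precisions)
    {
        assert(ap >= 0.0 && ap <= 1.0);
        if (ap == 0.0)
            return 0.0; // log(0) is -inf; the product is 0 anyway
        log_sum += std::log(ap);
    }
    return std::exp(log_sum / static_cast<double>(avg_precisions.size()));
}
```

Skipping the zero entry instead would report the geometric mean of the remaining queries, which overstates quality whenever any query fails completely.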

## Enhancements
- Improve performance of `printing::progress`. Before, `progress::operator()` in
tight loops could dramatically hurt performance, particularly due to frequent
calls to `std::chrono::steady_clock::now()`. Now, `progress::operator()`
simply sets an atomic iteration counter and a background thread periodically
wakes to update the progress output.
- Allow full text storage in index as metadata field. If `store-full-text =
true` (default false) in the corpus config, the string metadata field
"content" will be added. This is to simplify the creation of full text
metadata: the user doesn't have to duplicate their dataset in `metadata.dat`,
and `metadata.dat` will still be somewhat human-readable without large strings
of full text added.
- Allow `make_index` to take a user-supplied corpus object.
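
The `printing::progress` change can be pictured with the following sketch: the hot path only stores into an atomic counter, while a background thread wakes at a fixed interval to print. Everything here (the class name `progress_sketch`, the `stop()` and `last_reported()` members, the 100 ms default) is an assumption made for illustration and not the real MeTA implementation.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <cstdint>
#include <iostream>
#include <mutex>
#include <thread>

// Hypothetical sketch of "atomic counter + periodic background printer".
class progress_sketch
{
  public:
    explicit progress_sketch(std::uint64_t total,
                             std::chrono::milliseconds interval
                             = std::chrono::milliseconds{100})
        : total_{total}, interval_{interval}
    {
        assert(total > 0);
        worker_ = std::thread{[this]() { run(); }};
    }

    progress_sketch(const progress_sketch&) = delete;
    progress_sketch& operator=(const progress_sketch&) = delete;

    ~progress_sketch()
    {
        stop();
    }

    // Hot path: one relaxed atomic store; no clock reads and no I/O.
    void operator()(std::uint64_t iter)
    {
        iter_.store(iter, std::memory_order_relaxed);
    }

    // Wakes the worker, lets it print a final update, and joins it.
    void stop()
    {
        if (!worker_.joinable())
            return;
        {
            std::lock_guard<std::mutex> lock{mutex_};
            done_ = true;
        }
        cond_.notify_one();
        worker_.join();
    }

    // Last iteration value the worker thread printed.
    std::uint64_t last_reported() const
    {
        return reported_.load();
    }

  private:
    void run()
    {
        std::unique_lock<std::mutex> lock{mutex_};
        while (!done_)
        {
            // Sleep until the interval elapses or stop() notifies us.
            cond_.wait_for(lock, interval_);
            print();
        }
        print(); // final update so the last value is always shown
    }

    void print()
    {
        auto iter = iter_.load(std::memory_order_relaxed);
        reported_.store(iter);
        std::cerr << '\r' << (100 * iter / total_) << '%' << std::flush;
    }

    const std::uint64_t total_;
    const std::chrono::milliseconds interval_;
    std::atomic<std::uint64_t> iter_{0};
    std::atomic<std::uint64_t> reported_{0};
    std::mutex mutex_;
    std::condition_variable cond_;
    bool done_ = false;
    std::thread worker_; // declared last so it starts after the other members
};
```

The design choice is that the cost per loop iteration no longer depends on the clock or the terminal: the loop pays for one atomic store, and the printing rate is bounded by the wake-up interval.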

## Miscellaneous
- ZLIB is now a required dependency.
- Switch to just using the standalone `./unit-test` instead of `ctest`. With the
new unit test framework, CTest offers us few advantages at this point, so just
use our unit test executable directly.

# [v2.0.1][2.0.1]
## Bug fixes
- Fix issue where `metadata_parser` would not consume spaces in string
@@ -304,7 +341,8 @@
# [v1.0][1.0]
- Initial release.

[unreleased]: https://github.com/meta-toolkit/meta/compare/v2.0.1...develop
[unreleased]: https://github.com/meta-toolkit/meta/compare/v2.1.0...develop
[2.1.0]: https://github.com/meta-toolkit/meta/compare/v2.0.1...v2.1.0
[2.0.1]: https://github.com/meta-toolkit/meta/compare/v2.0.0...v2.0.1
[2.0.0]: https://github.com/meta-toolkit/meta/compare/v1.3.8...v2.0.0
[1.3.8]: https://github.com/meta-toolkit/meta/compare/v1.3.7...v1.3.8
9 changes: 2 additions & 7 deletions CMakeLists.txt
@@ -13,16 +13,15 @@ set(CMAKE_CXX_STANDARD 14)
set(CMAKE_CXX_STANDARD_REQUIRED ON)
set(CMAKE_CXX_EXTENSIONS OFF)

include(CTest)
include(CheckCXXCompilerFlag)
include(CheckCXXSourceCompiles)
include(CheckCXXSourceRuns)
include(CMakePushCheckState)
include(ExternalProject)
include(FindZLIB)
include(cmake/FindOrBuildICU.cmake)

find_package(Threads REQUIRED)
find_package(ZLIB REQUIRED)

cmake_push_check_state()

@@ -118,11 +117,7 @@ if(STDOPT)
target_compile_options(meta-definitions INTERFACE ${STDOPT})
endif()

if(ZLIB_FOUND)
target_include_directories(meta-definitions SYSTEM INTERFACE
${ZLIB_INCLUDE_DIRS})
target_compile_definitions(meta-definitions INTERFACE -DMETA_HAS_ZLIB)
endif()
target_include_directories(meta-definitions SYSTEM INTERFACE ${ZLIB_INCLUDE_DIRS})

if(LIBDL_LIBRARY)
target_link_libraries(meta-definitions INTERFACE ${LIBDL_LIBRARY})
22 changes: 11 additions & 11 deletions README.md
@@ -93,7 +93,7 @@ make
You can now test the system by running the following command:

```bash
ctest --output-on-failure
./unit-test --reporter=spec
```

If everything passes, congratulations! MeTA seems to be working on your
@@ -136,7 +136,7 @@ sudo add-apt-repository ppa:ubuntu-toolchain-r/test
sudo apt-get update

# this will probably take a while
sudo apt-get install g++ g++-4.8 git make wget libjemalloc-dev
sudo apt-get install g++ g++-4.8 git make wget libjemalloc-dev zlib1g-dev

wget http://www.cmake.org/files/v3.2/cmake-3.2.0-Linux-x86_64.sh
sudo sh cmake-3.2.0-Linux-x86_64.sh --prefix=/usr/local
@@ -193,7 +193,7 @@ make
You can now test the system by running the following command:

```bash
/usr/local/bin/ctest --output-on-failure
./unit-test --reporter=spec
```

If everything passes, congratulations! MeTA seems to be working on your
@@ -216,7 +216,7 @@ sudo add-apt-repository ppa:george-edison55/cmake-3.x
sudo apt-get update

# install dependencies
sudo apt-get install cmake libicu-dev git libjemalloc-dev
sudo apt-get install cmake libicu-dev git libjemalloc-dev zlib1g-dev
```

Once the dependencies are all installed, you should double check your
@@ -269,7 +269,7 @@ make
You can now test the system by running the following command:

```bash
ctest --output-on-failure
./unit-test --reporter=spec
```

If everything passes, congratulations! MeTA seems to be working on your
@@ -283,7 +283,7 @@ To install the dependencies, run the following commands.

```bash
sudo pacman -Sy
sudo pacman -S clang cmake git icu libc++ make jemalloc
sudo pacman -S clang cmake git icu libc++ make jemalloc zlib
```

Once the dependencies are all installed, you should be ready to build. Run
@@ -310,7 +310,7 @@ make
You can now test the system by running the following command:

```bash
ctest --output-on-failure
./unit-test --reporter=spec
```

If everything passes, congratulations! MeTA seems to be working on your
@@ -381,7 +381,7 @@ make
You can now test the system with the following command:

```bash
ctest --output-on-failure
./unit-test --reporter=spec
```

## EWS/EngrIT Build Guide
@@ -449,7 +449,7 @@ make
You can now test the system by running the following command:

```bash
ctest --output-on-failure
./unit-test --reporter=spec
```

If everything passes, congratulations! MeTA seems to be working on your
@@ -470,7 +470,7 @@ you should run the following commands to download dependencies and related
software needed for building:

```bash
pacman -Syu git make mingw-w64-x86_64-{gcc,cmake,icu,jemalloc}
pacman -Syu git make mingw-w64-x86_64-{gcc,cmake,icu,jemalloc,zlib}
```

Then, exit the shell and launch the "MinGW-w64 Win64" shell. You can obtain
@@ -497,7 +497,7 @@ make
You can now test the system by running the following command:

```bash
ctest --output-on-failure
./unit-test --reporter=spec
```

If everything passes, congratulations! MeTA seems to be working on your
4 changes: 1 addition & 3 deletions include/meta/corpus/all.h
@@ -1,7 +1,5 @@
#include "meta/corpus/corpus.h"
#if META_HAS_ZLIB
#include "meta/corpus/gz_corpus.h"
#endif
#include "meta/corpus/file_corpus.h"
#include "meta/corpus/gz_corpus.h"
#include "meta/corpus/libsvm_corpus.h"
#include "meta/corpus/line_corpus.h"
26 changes: 25 additions & 1 deletion include/meta/corpus/corpus.h
@@ -37,7 +37,17 @@ namespace corpus
* The corpus spec toml file also requires a corpus type and an optional
* encoding for the corpus text.
*
* Optional config parameters: none.
* Required config parameters:
* ~~~toml
* type = "line-corpus" # for example
* ~~~
*
* Optional config parameters:
* ~~~toml
* encoding = "utf-8" # default value
* store-full-text = false # default value; N/A for libsvm-corpus
* metadata = # metadata schema; see metadata object
* ~~~
*
* @see https://meta-toolkit.org/overview-tutorial.html
*/
@@ -80,6 +90,18 @@ class corpus
*/
const std::string& encoding() const;

/**
* @return whether this corpus will create a metadata field for full text
* (called "content")
*/
bool store_full_text() const;

/**
* @param store_full_text Tells this corpus to store full document text as
* metadata
*/
void set_store_full_text(bool store_full_text);

protected:
/**
* Helper function to be used by deriving classes in implementing
@@ -96,6 +118,8 @@ class corpus
std::string encoding_;
/// The metadata parser
util::optional<metadata_parser> mdata_parser_;
/// Whether to store the original document text
bool store_full_text_;
};

/**
101 changes: 101 additions & 0 deletions include/meta/embeddings/coocur_iterator.h
@@ -0,0 +1,101 @@
/**
* @file coocur_iterator.h
* @author Chase Geigle
*
* All files in META are dual-licensed under the MIT and NCSA licenses. For more
* details, consult the files LICENSE.mit and LICENSE.ncsa in the root of the
* project.
*/

#ifndef META_EMBEDDINGS_COOCUR_ITERATOR_H_
#define META_EMBEDDINGS_COOCUR_ITERATOR_H_

#include <cstdint>
#include <cstdio>
#include <fstream>
#include <memory>
#include <string>
#include <tuple>

#include "meta/embeddings/coocur_record.h"
#include "meta/io/filesystem.h"
#include "meta/util/shim.h"

namespace meta
{
namespace embeddings
{

/**
* An iterator over coocur_records that live in a packed file on disk.
* Satisfies the ChunkIterator concept for multiway_merge support.
*/
class coocur_iterator
{
public:
using value_type = coocur_record;

coocur_iterator(const std::string& filename)
: path_{filename},
input_{make_unique<std::ifstream>(filename, std::ios::binary)},
total_bytes_{filesystem::file_size(filename)},
bytes_read_{0}
{
++(*this);
}

coocur_iterator() = default;
coocur_iterator(coocur_iterator&&) = default;

coocur_iterator& operator++()
{
if (input_->peek() == EOF)
return *this;

bytes_read_ += record_.read(*input_);
return *this;
}

coocur_record& operator*()
{
return record_;
}

const coocur_record& operator*() const
{
return record_;
}

bool operator==(const coocur_iterator& other) const
{
if (!other.input_)
{
return !input_ || !static_cast<bool>(*input_);
}
else
{
return std::tie(path_, bytes_read_)
== std::tie(other.path_, other.bytes_read_);
}
}

uint64_t total_bytes() const
{
return total_bytes_;
}

uint64_t bytes_read() const
{
return bytes_read_;
}

private:
std::string path_;
std::unique_ptr<std::ifstream> input_;
coocur_record record_;
uint64_t total_bytes_;
uint64_t bytes_read_;
};

inline bool operator!=(const coocur_iterator& a, const coocur_iterator& b)
{
return !(a == b);
}
}
}
#endif
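
The general pattern in this header, an input iterator that pulls packed records from a binary file and compares equal to a default-constructed sentinel once the file is exhausted, can be sketched in a self-contained way as follows. The `pair_record` layout, `record_iterator`, and `read_all` are hypothetical names invented for this note; `coocur_record` is not shown in this diff and may be encoded differently. Unlike the header above, this sketch releases the stream at end of file and only supports comparison against the end sentinel.

```cpp
#include <cassert>
#include <cstdint>
#include <fstream>
#include <memory>
#include <string>
#include <vector>

// Hypothetical fixed-width record (assumed layout, native byte order).
struct pair_record
{
    std::uint32_t target = 0;
    std::uint32_t context = 0;
    double weight = 0.0;

    // Reads one record and returns the number of bytes consumed.
    std::uint64_t read(std::istream& in)
    {
        in.read(reinterpret_cast<char*>(&target), sizeof(target));
        in.read(reinterpret_cast<char*>(&context), sizeof(context));
        in.read(reinterpret_cast<char*>(&weight), sizeof(weight));
        return sizeof(target) + sizeof(context) + sizeof(weight);
    }

    void write(std::ostream& out) const
    {
        out.write(reinterpret_cast<const char*>(&target), sizeof(target));
        out.write(reinterpret_cast<const char*>(&context), sizeof(context));
        out.write(reinterpret_cast<const char*>(&weight), sizeof(weight));
    }
};

class record_iterator
{
  public:
    record_iterator() = default; // end sentinel: owns no stream

    explicit record_iterator(const std::string& filename)
        : input_(new std::ifstream(filename, std::ios::binary))
    {
        ++(*this); // load the first record, or become the sentinel
    }

    record_iterator& operator++()
    {
        if (!input_)
            return *this; // already at the end
        if (!*input_ || input_->peek() == std::char_traits<char>::eof())
            input_.reset(); // exhausted or unreadable: become the sentinel
        else
            bytes_read_ += record_.read(*input_);
        return *this;
    }

    const pair_record& operator*() const
    {
        return record_;
    }

    // Only "both at end" counts as equal in this sketch.
    bool operator==(const record_iterator& other) const
    {
        return !input_ && !other.input_;
    }

    std::uint64_t bytes_read() const
    {
        return bytes_read_;
    }

  private:
    std::unique_ptr<std::ifstream> input_;
    pair_record record_;
    std::uint64_t bytes_read_ = 0;
};

inline bool operator!=(const record_iterator& a, const record_iterator& b)
{
    return !(a == b);
}

// Collects every record in the file by walking the iterator to the end.
inline std::vector<pair_record> read_all(const std::string& filename)
{
    std::vector<pair_record> records;
    for (record_iterator it{filename}; it != record_iterator{}; ++it)
        records.push_back(*it);
    return records;
}
```

Tracking the bytes consumed alongside the file size is what lets a merge routine report progress across several such chunk files.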