golang python #815
Comments
Hello @caijielong,

With Python, you can use the heavyai package to connect to HeavyDB, run queries, and more. The package adheres to the Python DB API 2.0 standard.

To get started, follow the instructions for creating an environment and installing the packages available at this link:
For a quick start and a comprehensive introduction to the API, including Release Notes and API Documentation, please refer to:
For additional examples, you can explore our documentation site, specifically the sections on Python and Data Science:

Currently, we do not have a Golang-specific API, but you can explore using GDBC with our JDBC driver as an alternative option.
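As an illustration, here is a minimal sketch of what connecting from Python can look like with the heavyai package (DB API 2.0 style). The host, port, credentials, database name, and table name below are placeholder assumptions, not values from this thread:

# Connect to HeavyDB with the heavyai package and run a simple query.
# All connection parameters and the table name are placeholders.
from heavyai import connect

con = connect(
    user="admin",
    password="HyperInteractive",  # replace with your own credentials
    host="localhost",
    port=6274,                    # binary (Thrift) protocol port
    dbname="heavyai",
)

cur = con.cursor()
cur.execute("SELECT COUNT(*) FROM my_table")  # my_table is hypothetical
print(cur.fetchone())

con.close()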
@cdessanti Thank you for your careful and detailed answer.
@cdessanti Hello, I have a question. I now have 50 billion rows of data that I want to import into HeavyDB to test its performance. I have tried CSV, StreamImporter, KafkaImporter, and INSERT INTO, but the results are not satisfactory; the import speed is a bit slow. Is there any way to speed up the import?
The performance of data loading depends on the data's location and format. Generally, using the COPY command with local Parquet files as the source is the fastest option. Loading data using Arrow is also quite efficient, and it's an excellent choice when you need to query data during the load process. Data doesn't necessarily have to be in Parquet format, but if your ETL process involves Pandas, you can create Parquet files on-the-fly.
In the first case, using the COPY mode, we convert a Pandas DataFrame into an Arrow Table, create a Parquet file from Arrow on a local filesystem, and then use the server's COPY command to load it into a table. If your data is already in Parquet or CSV format, you can load it from your S3 storage location, move the files to the server, and load them with the COPY command using Immerse, heavysql, another client, or programmatically as shown in the example.

The loading performance depends on various factors, including the number, data types, and contents of the columns, as well as the machine you're using for data ingestion. Typically, using local compressed CSV files to load a regular table with 30 columns containing numerics, timestamps, and low-cardinality strings, you can achieve a throughput of around 4 million records per second using 12 cores. This throughput doesn't change significantly if you're using Parquet. I recommend starting by loading a small but representative subset of the data and optimizing the data types for the tables, either manually or with the available tooling.

Our documentation on data loading is accessible through this link: https://docs.heavy.ai/loading-and-exporting-data/command-line/load-data

If your data is stored in a database, you can explore the option of using the SQLImporter Java utility. However, please note that the throughput in this case may not be exceptionally high because the utility isn't primarily designed for speed.

If you'd like more assistance, please share the Data Definition Language (DDL) statements for your tables and provide additional detailed information about their formats. By the way, processing such a high number of records will require a very powerful hardware setup with ample system and GPU memory to accommodate the data.

Regards,
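A rough sketch of the DataFrame -> Arrow -> Parquet -> COPY workflow described above (this is not the exact example referenced in the comment). The table name, file path, and connection parameters are placeholders, the target table is assumed to already exist with a matching schema, the Parquet file must be readable by the server process, and the COPY WITH option names can vary between HeavyDB versions:

# Convert a Pandas DataFrame to an Arrow Table, write it as Parquet,
# then ask the server to ingest it with COPY. Paths, table name, and
# credentials are placeholders.
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
from heavyai import connect

df = pd.DataFrame({"id": range(1000), "value": [float(i) for i in range(1000)]})

arrow_table = pa.Table.from_pandas(df)
pq.write_table(arrow_table, "/path/visible/to/server/mydata.parquet")

con = connect(user="admin", password="HyperInteractive",
              host="localhost", port=6274, dbname="heavyai")
cur = con.cursor()
# Older releases used WITH (parquet='true') instead of source_type.
cur.execute("COPY mytable FROM '/path/visible/to/server/mydata.parquet' "
            "WITH (source_type='parquet_file')")

If you would rather skip the intermediate file, the heavyai connection also exposes load_table-style helpers (inherited from pymapd) that can send data over Arrow, which corresponds to the "loading data using Arrow" path mentioned earlier.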
@cdessanti Hello, the current import pipeline is Flink -> Kafka -> HeavyDB. When I import 1 billion rows and execute select * from table_name limit 10;, an error is reported! Error message: std::bad_alloc.

HeavyDB Server CPU Memory Summary:
HeavyDB Server GPU Memory Summary:
Hi @caijielong,

The error message is quite generic, and I'm having trouble pinpointing the issue, especially with such a straightforward query and no additional data.

Firstly, could you specify where you encounter this generic error? Is it causing a database crash, or is it returned by the heavysql command? If possible, please provide a screenshot of the error message.

To assist you better, it would be beneficial if you could share the DDL of your table or provide essential information such as the number of columns, the data types used, and the fragment size (if you've deviated from the default). You can easily obtain this information by running the \d <table_name> command in heavysql.

It's also crucial to know which version of the software you're using. You can find this by running the \version command.

Additionally, sharing the database logs from when the error occurs could provide insights. You can locate the logs in the directory where you store the data (typically /var/lib/heavyai/) under the storage/log directory. The log files are named like this: heavydb.INFO.20231026-152844.log.

I attempted to reproduce the error using OS 7.0, and even when I started the system with very limited cache memory, I received the expected database exception:

heavysql> select * from flights_p limit 10;
ERR_OUT_OF_CPU_MEM: Not enough host memory to execute the query
heavysql> \memory_summary
HeavyDB Server CPU Memory Summary:
MAX USE ALLOCATED FREE
2861.02 MB 2838.13 MB 2861.02 MB 22.89 MB
HeavyDB Server GPU Memory Summary:
[GPU] MAX USE ALLOCATED FREE
[0] 10627.50 MB 0.00 MB 0.00 MB 0.00 MB

In my case, the system utilized as much memory as was available before throwing the error, which suggests that in your case the SELECT operation aborted before execution. I'm eager to assist further once I have the requested information, so we can work together to resolve this issue.

Regards,
Hi @cdessanti,
1: vi /var/lib/heavyai/heavy.conf
2: sudo systemctl restart heavydb
Hi,

Many thanks for the detailed information you provided. I tried to reproduce your issue using the same database version and the very same table populated with hundreds of millions of generated records, but everything is working as expected.

While the error apparently comes from the heavysql tool, it would be better to investigate whether something is going on in the server, so you should add the parameter verbose=true to your heavy.conf file, restart the server, and then run these queries:

select * from flights_2008_tm limit 10;
select * from tweets limit 10;

After you get the error, retrieve the heavydb.INFO file located in /var/lib/heavyai/storage/log. You can get the contents of the file with the cat command.

Also, to be sure that the heavysql tool is working as expected, please try to run the same query from another client; the SQL editor of Immerse is fine for this test, located at this URL on your machine: http://localhost:6273/heavyai/sql-editor (I'm assuming that the software is installed on your workstation).

I hope to hear something from you soon.

Regards,
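For reference, a sketch of what the relevant part of heavy.conf might look like after adding the setting; only the verbose line comes from the advice above, and the other entries are typical-default placeholders that may differ on your installation:

# /var/lib/heavyai/heavy.conf (excerpt; values other than verbose are placeholders)
port = 6274
data = "/var/lib/heavyai/storage"
verbose = true

[web]
port = 6273

Then restart the server so the setting takes effect, as in the earlier comment:

sudo systemctl restart heavydb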
Hi @cdessanti,
1:
3:
4:
5:
Hi @caijielong,

It seems that you've attached the example heavy.INFO file rather than the one from your machine. To resolve the issue, I recommend restarting the server and running the SELECT * FROM tweets query. This will pinpoint the bad_alloc, enabling us to identify the problematic column.

Allow me to provide some insight into what happens internally within the server when executing a query like this; it will help you grasp the memory requirements for such operations. In heavydb, tables are locally divided into SLABs, each with a default size of 32 million rows. With your query, only the 1st SLAB is loaded into memory, representing roughly 1/32 of your table data. Approximately 3.7 GB of memory is needed for the data, in addition to server memory for the dictionaries required to decode the text-encoded values: considering your average text length of around 115 bytes and multiplying it by 32 million rows, this accounts for the memory requirement.

The tweet_text column likely demands a significant chunk of memory in the form of a large dictionary, potentially causing the bad_alloc. (This assumption is based on the usage of long, distinctly different strings. For instance, if 1 billion unique strings average around 100 characters, it would require approximately 100 GB of RAM to be loaded into memory.)

Addressing point 5: in your case, 100 million records require a maximum of 11.5 GB of memory (115 * 100,000,000). However, running SELECT * FROM table LIMIT 1 won't load 100 million records into memory but only 32 million, along with additional memory for the dictionaries of TEXT-encoded fields.

For troubleshooting, the verbose logs in the /var/lib/heavyai/storage/log/ directory, specifically the file named heavydb.INFO.[DATE]-[TIME].log (e.g., heavydb.INFO.20230828-072844.log), provide valuable information, such as the memory used by the various pools, including the dictionaries.

Kindly run your queries and share these logs here, as they contain the details necessary to address your problem.

Candido
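To make the arithmetic above concrete, here is a small back-of-the-envelope calculation; the 115-byte average length, 32-million-row chunk size, and 1 billion unique strings are the figures quoted in the comment, and the result ignores per-column overheads, so treat it only as an approximation:

# Rough memory estimates using the figures quoted above (decimal GB).
ROWS_PER_CHUNK = 32_000_000       # rows pulled in for the LIMIT query
AVG_TEXT_BYTES = 115              # average tweet_text length from the thread
UNIQUE_STRINGS = 1_000_000_000    # pessimistic dictionary cardinality
AVG_UNIQUE_LEN = 100              # average length of a unique string

GB = 1_000_000_000

print(f"text payload for the LIMIT query: ~{ROWS_PER_CHUNK * AVG_TEXT_BYTES / GB:.1f} GB")   # ~3.7 GB
print(f"dictionary, worst case:           ~{UNIQUE_STRINGS * AVG_UNIQUE_LEN / GB:.0f} GB")   # ~100 GB
print(f"100M rows of text payload:        ~{100_000_000 * AVG_TEXT_BYTES / GB:.1f} GB")      # ~11.5 GB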
How can I use programming languages such as Golang and Python to connect to HeavyDB?