Welcome to HPyCC’s documentation!

hpycc Readme

The hpycc package is intended to simplify the use of data stored on HPCC, making it easily available to both users and other servers through basic Python calls. Its long-term goal is to make access to and manipulation of HPCC data as quick and easy as any other data source.

Docker Announcement

HPCC Systems seems to have pulled the Docker images this package was built against. I've created an image for the test environment I developed against, but if you want to develop against a different version you will need to build the relevant container yourself. I've forked the old HPCC docker repo in case you need it or it gets removed: https://github.com/datamacgyver/docker-hpcc

Note that in the Dockerfiles I've tried, the wget for HPCC is broken. You can get the new packages from the HPCC Systems website, but there is a modified Dockerfile in this repo that I used for the build.

Documentation

The readme below and the package documentation are available at https://hpycc.readthedocs.io/en/latest/

The package’s github is available at: https://github.com/OdinProAgrica/hpycc

This package is released under the GNU GPLv3 licence: https://www.gnu.org/licenses/gpl-3.0.en.html

Want to use this in R? Then the reticulate package is your friend! Just save to a CSV and read it back in. Alternatively, use an R notebook with a Python chunk.

Installation

Install with:

pip install hpycc

Or, if you are still a bit old school:

python -m pip install hpycc

Current Status

Tested and working on HPCC v6.4.2 and Python 3.5.2 under Windows 10. The package has been used on Linux systems but not extensively tested there.

Dependencies

The package itself mainly uses core Python; pandas is needed for outputting dataframes.

There is a dependency on the HPCC client tools to run ECL scripts (you need ecl.exe and eclcc.exe). Make sure you install the client tools matching your HPCC version and add their directory to your system path, e.g. C:\Program Files (x86)\HPCCSystems\X.X.X\clienttools\bin.

Tests and docker container functions require docker to spin up HPCC environments.

Main Functions

The below summarises the key functions and their non-optional parameters. For specific arguments see the relevant function's documentation. Note that while retrieving a file is a multi-threaded process, running a script and getting the results is not. If your result is quite big you may therefore be better off saving the output of a script as a thor file using run.run_script() and then downloading that file with get.get_thor_file().
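The multi-threaded download strategy described above can be sketched in plain Python. Note this is only an illustration of the pattern: `fetch_chunk` and `download_rows` are hypothetical stand-ins, not part of hpycc's API.

```python
# Sketch of multi-threaded chunked download; fetch_chunk is a stand-in for
# the per-chunk HTTP request the real package makes to the HPCC instance.
from concurrent.futures import ThreadPoolExecutor


def fetch_chunk(start_row, n_rows):
    # Stand-in: pretend each chunk returns its row indices.
    return list(range(start_row, start_row + n_rows))


def download_rows(total_rows, chunk_size=2, max_workers=4):
    # Split the file into (start_row, n_rows) chunks...
    chunks = [(start, min(chunk_size, total_rows - start))
              for start in range(0, total_rows, chunk_size)]
    # ...then fetch them concurrently; pool.map preserves chunk order.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = pool.map(lambda chunk: fetch_chunk(*chunk), chunks)
    # Reassemble the chunks into a single result set.
    return [row for chunk in results for row in chunk]


print(download_rows(total_rows=7))  # [0, 1, 2, 3, 4, 5, 6]
```

Because each chunk is an independent request, chunks can be fetched in parallel; a script run, by contrast, is a single work unit and cannot be split this way.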

Connection(username, server="localhost", port=8010, repo=None, password="password", legacy=False, test_conn=True)

Create a connection to an HPCC instance. This is then passed to the other interface functions.

get_output(connection, script, …) & save_output(connection, script, path, …)

Run a given ECL script and either return the first result as a pandas dataframe or save it to file.

get_outputs(connection, script, …) & save_outputs(connection, script, …)

Run a given ECL script and return all results as a dict of pandas dataframes or save them to files.

get_thor_file(connection, logical_file, …) & save_thor_file(connection, logical_file, path, …)

Get a logical file and either return as a pandas dataframe or save it to file.

run_script(connection, script, …)

Run a given ECL script. Up to 10 result rows are retrieved but then discarded; no output is returned.

spray_file(connection, source_file, logical_file, …)

Spray a CSV file or pandas DataFrame into HPCC.

docker_tools.HPCCContainer(tag="6.4.26-1", …)

Designed for our testing but made available generally: a collection of functions for running and managing HPCC docker containers. The constructor above starts a container; see the docker_tools documentation for shutting down and other management tasks.

Examples

The code below gives an example of the package's functionality:

import hpycc
import pandas as pd
from hpycc.utils import docker_tools
from os import remove

# Start an HPCC docker image for testing
docker_tools.HPCCContainer(tag="6.4.26-1")

# Setup stuff
username = 'HPCC_dev'
test_file = 'test.csv'
f_hpcc_1 = '~temp::testfile1'
f_hpcc_2 = '~temp::testfile2'
ecl_script = 'ecl_script.ecl'

# Let's create a connection object so we can interface with the
# HPCC instance we spun up with Docker
conn = hpycc.Connection(username, server="localhost")
try:
    # So, let's spray up some data:
    pd.DataFrame({'col1': [1, 2, 3, 4], 'col2': ['a', 'b', 'c', 'd']}).to_csv(test_file, index=False)
    hpycc.spray_file(conn, test_file, f_hpcc_1, expire=7)

    # Lovely, we can now extract that as a Thor file:
    df = hpycc.get_thor_file(conn, f_hpcc_1)
    print(df)
    # Note the __fileposition__ column. This will be drop-able in future versions.

    #################################
    #   col1 col2  __fileposition__ #
    # 0    1    a                 0 #
    # 1    3    c                20 #
    # 2    2    b                10 #
    # 3    4    d                30 #
    #################################

    # If preferred data can also be extracted using an ECL script.
    with open(ecl_script, 'w') as f:
        f.write("DATASET('%s', {STRING col1; STRING col2;}, THOR);" % f_hpcc_1)
        # Note, all columns are currently string-ified by default
    df = hpycc.get_output(conn, ecl_script)
    print(df)

    ################
    #   col1 col2  #
    # 0    1    a  #
    # 1    3    c  #
    # 2    2    b  #
    # 3    4    d  #
    ################


    # get_thor_file() is optimised for large files, get_output is not (yet). To run a script and
    # download a large result you should therefore save a thor file and grab that.

    with open(ecl_script, 'w') as f:
        f.write("a := DATASET('%s', {STRING col1; STRING col2;}, THOR);"
                "OUTPUT(a, , '%s');" % (f_hpcc_1, f_hpcc_2))
    hpycc.run_script(conn, ecl_script)
    df = hpycc.get_thor_file(conn, f_hpcc_2)
    print(df)

    #################################
    #   col1 col2  __fileposition__ #
    # 0    1    a                 0 #
    # 1    3    c                20 #
    # 2    2    b                10 #
    # 3    4    d                30 #
    #################################

finally:
    # Shutdown our docker container
    docker_tools.HPCCContainer(pull=False, start=False).stop_container()
    remove(ecl_script)
    remove(test_file)

Issues, Bugs, Comments?

Please use the package’s github: https://github.com/OdinProAgrica/hpycc

Any contributions are also welcome.

hpycc package

Subpackages

hpycc.utils package

Submodules
hpycc.utils.docker_tools module

Functions to create and control HPCC docker images. Requires Docker to be installed and running.

class hpycc.utils.docker_tools.HPCCContainer(tag='latest', name='hpycc_test_img', users=None, pull=True, start=True)[source]

Bases: object

create_passwords()[source]
pull_image()[source]
put_archive(b, name, path)[source]
setup_hpcc()[source]
start_container()[source]
start_hpcc()[source]
stop_container()[source]
hpycc.utils.filechunker module

Functions that chunk an iterable.

Functions
  • make_chunks – Return tuples of start index and chunk size.
hpycc.utils.filechunker.make_chunks(num, chunk_size=10000)[source]

Return tuples of start index and chunk size.

Parameters:
  • num (int) – Total number of items.
  • chunk_size (int, optional) – Max chunk size, 10,000 by default.
Returns:

chs – List of chunks in the form [(start_index, num_items)]

Return type:

list of tuples
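Based on the contract documented above (not on hpycc's actual source), make_chunks behaves like this minimal sketch:

```python
# Minimal sketch of the documented behaviour of make_chunks: split num items
# into (start_index, num_items) tuples of at most chunk_size items each.
def make_chunks(num, chunk_size=10000):
    return [(start, min(chunk_size, num - start))
            for start in range(0, num, chunk_size)]


print(make_chunks(25000))
# [(0, 10000), (10000, 10000), (20000, 5000)]
```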

hpycc.utils.parsers module
hpycc.utils.parsers.apply_custom_dtypes(schema, dtypes)[source]
hpycc.utils.parsers.get_python_type_from_ecl_type(child)[source]

Get the python type from an hpcc schema node

Parameters:child (XML node) – Node of schema xml. See parse_schema_from_xml
Returns:type – Pythonic type. If the HPCC type cannot be mapped, is str.
Return type:type
hpycc.utils.parsers.parse_schema_from_xml(xml)[source]

Parse an ECL schema into python types.

Parameters:xml (str) – xml string returned by ecl run. This is located in the json as [“WUResultResponse][“Result”][“XmlSchema”][“xml”].
Returns:
  • OrderedDict – dict of column stats, in the form {name: Str, type: Str, is_a_set: Bool}.
  • list – Column names in order of occurrence.
hpycc.utils.parsers.parse_wuid_from_failed_response(result)[source]
hpycc.utils.parsers.parse_wuid_from_xml(result)[source]

Function retrieves a WUID for a script that has run. This retrieves it only in the cases where the request response was in XML format.

Parameters:result ('XML') – The XML response for the script that has run.
Returns:wuid – The Workunit ID from the XML.
Return type:str
hpycc.utils.parsers.parse_xml(xml)[source]

Return a DataFrame from a nested XML.

Parameters:xml (str) – xml to be parsed.
Returns:df – Parsed xml.
Return type:pd.DataFrame
Module contents

Submodules

hpycc.connection module

Object for connecting to an HPCC instance.

This module provides a Connection class to connect to an HPCC instance. This connection is used as the first input to the majority of public functions in the hpycc package.

Classes

  • Connection – HPCC connection class.
class hpycc.connection.Connection(username, server='localhost', port=8010, repo=None, password='password', legacy=False, test_conn=True)[source]

Bases: object

check_syntax(script)[source]

Run an ECL syntax check on an ECL script.

Uses eclcc to run a syntax check on script. If the syntax check fails, i.e. an error is present, a SyntaxError will be raised. Note that this requires eclcc.exe to be on the path. Attributes legacy and repo are also used.

Parameters:script (str) – path to ECL script.
Returns:
Return type:None
Raises:SyntaxError – If the script fails the syntax check.
get_chunk_from_hpcc(logical_file, start_row, n_rows, max_attempts, max_sleep)[source]

Using the HPCC instance at server:port and the credentials username and password, return the JSON response to a request for part of logical_file, starting at start_row and n_rows long.

Parameters:
  • logical_file (str) – Name of logical file.
  • start_row (int) – First row to return where 0 is the first row of the dataset.
  • n_rows (int) – Number of rows to return.
  • max_attempts (int) – Maximum number of times url should be queried in the case of an exception being raised.
  • max_sleep (int) – Maximum time, in seconds, to sleep between attempts. The true sleep time is a random int between max_sleep * 0.75 and max_sleep.
Returns:

resp – JSON formatted response containing rows and all associated metadata.

Return type:

json

get_logical_file_chunk(logical_file, start_row, n_rows, max_attempts, max_sleep)[source]

Return a chunk of a logical file from an HPCC instance.

Using the HPCC instance at server:port and the credentials username and password, return a chunk of logical_file which starts at row start_row and is n_rows long.

Parameters:
  • logical_file (str) – Name of logical file.
  • start_row (int) – First row to return where 0 is the first row of the dataset.
  • n_rows (int) – Number of rows to return.
  • max_attempts (int) – Maximum number of times url should be queried in the case of an exception being raised.
  • max_sleep (int) – Maximum time, in seconds, to sleep between attempts. The true sleep time is a random int between max_sleep * 0.75 and max_sleep.
Returns:

result_response – Rows of logical file as list of dicts. In the form [{“col1”: 1, “col2”: 2}, {“col1”: 1, “col2”: 2}, …].

Return type:

pd.DataFrame, str

run_ecl_script(script, syntax_check, delete_workunit, stored)[source]

Run an ECL script and return the stdout and stderr.

Run the ECL script script on the HPCC instance at server:port, using the credentials username and password. If syntax_check, run a syntax check before execution. Attributes legacy and repo are also used.

Parameters:
  • script (str) – path to ECL script.
  • syntax_check (bool) – If a syntax check should be run before the script is executed.
  • delete_workunit (bool) – Delete workunit once completed.
  • stored (dict or None) – Key value pairs to replace stored variables within the script. Values should be str, int or bool.
Returns:

result – NamedTuple in the form (stdout, stderr).

Return type:

namedtuple

Raises:

subprocess.CalledProcessError: – If script fails syntax check.

See also

syntax_check(), run_ecl_string()

run_ecl_string(string, syntax_check, delete_workunit, stored)[source]

Run an ECL string and return the stdout and stderr.

Run the ECL string string on the HPCC instance at server:port, using the credentials username and password. If syntax_check, run a syntax check before execution. Attributes legacy and repo are also used.

Parameters:
  • string (str) – ECL script as a string.
  • syntax_check (bool) – If a syntax check should be run before the script is executed.
  • delete_workunit (bool) – Delete workunit once completed.
  • stored (dict or None) – Key value pairs to replace stored variables within the script. Values should be str, int or bool.
Returns:

result – NamedTuple in the form (stdout, stderr).

Return type:

namedtuple

Raises:

SyntaxError: – If script fails syntax check.

See also

syntax_check(), run_ecl_script()

run_url_request(url, max_attempts, max_sleep)[source]

Return the contents of a url.

Use attributes username and password to return the contents of url. Parameter max_attempts can be used to retry if an exception is raised. Each attempt is delayed by up to max_sleep seconds, so a large number of retries may be slow.

Parameters:
  • url (str) – URL to query.
  • max_attempts (int) – Maximum number of times url should be queried in the case of an exception being raised.
  • max_sleep (int) – Maximum time, in seconds, to sleep between attempts. The true sleep time is a random int between max_sleep * 0.75 and max_sleep.
Returns:

r – Response object from url

Return type:

requests.models.Response

Raises:

requests.exceptions.RetryError: – If max_attempts is exceeded.
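The retry behaviour described for run_url_request (and the other max_attempts / max_sleep parameters) can be sketched as below. The names run_with_retries and flaky are illustrative, not hpycc functions:

```python
# Sketch of retry-with-jittered-sleep: retry up to max_attempts times,
# sleeping a random int between 0.75 * max_sleep and max_sleep seconds
# between attempts, and re-raising on the final failure.
import random
import time


def run_with_retries(func, max_attempts=3, max_sleep=0):
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: propagate the last exception
            time.sleep(random.randint(int(max_sleep * 0.75), max_sleep))


calls = {"n": 0}


def flaky():
    # Fails twice before succeeding, simulating a transient network error.
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient failure")
    return "ok"


print(run_with_retries(flaky, max_attempts=3, max_sleep=0))  # ok
```

The jitter spreads retries out in time, which avoids hammering the cluster with simultaneous re-requests when many download threads hit a transient error at once.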

test_connection()[source]

Assert that the Connection can connect to the HPCC instance.

This method attempts to connect to ECL Watch using its server and port attributes. The credentials provided are its username and password.

Returns:
Return type:True
Raises:Exception: – If the connection fails, the relevant exception is raised.

hpycc.delete module

Functions to delete things in HPCC. The first input to all functions is an instance of Connection.

functions

  • delete_logical_file – delete given logical file
  • delete_workunit – delete given workunit (based on WUID)
hpycc.delete.delete_logical_file(connection, logical_file, delete_workunit=True)[source]

Delete a logical file.

Parameters:
  • connection (Connection) – HPCC Connection instance, see also Connection.
  • logical_file (str) – Logical file to be deleted.
  • delete_workunit (bool, optional) – Delete workunit once completed. True by default.
Returns:

Return type:

None

hpycc.delete.delete_workunit(connection, wuid, max_attempts=3, max_sleep=15)[source]

Delete a workunit.

Parameters:
  • connection (Connection) – HPCC Connection instance, see also Connection.
  • wuid (string) – Workunit ID
  • max_attempts (int, optional) – Maximum number of times url should be queried in the case of an exception being raised. 3 by default.
  • max_sleep (int, optional) – Maximum time, in seconds, to sleep between attempts. The true sleep time is a random int between max_sleep * 0.75 and max_sleep. 15 by default.
Returns:

If the workunit is deleted successfully.

Return type:

True

Raises:

ValueError: – If the workunit could not be deleted.

hpycc.get module

Functions to get data out of an HPCC instance.

This module contains functions to get either the output(s) of an ECL script, or the contents of a logical file. The first input to all functions is an instance of Connection.

Functions

  • get_output – Return the first output of an ECL script.
  • get_outputs – Return all outputs of an ECL script.
  • get_thor_file – Return the contents of a thor file.
hpycc.get.get_output(connection, script, syntax_check=True, delete_workunit=True, stored=None)[source]

Return the first output of an ECL script as a pandas.DataFrame.

Note that whilst attempts are made to preserve the datatypes of the result, anything with an ambiguous type will revert to a string. If the output of the ECL script is an empty dataset (or if the script does not output anything), an empty pandas.DataFrame is returned.

Parameters:
  • connection (hpycc.Connection) – HPCC Connection instance, see also Connection.
  • script (str) – Path of script to execute.
  • syntax_check (bool, optional) – Should the script be syntax checked before execution? True by default.
  • delete_workunit (bool, optional) – Delete workunit once completed. True by default.
  • stored (dict or None, optional) – Key value pairs to replace stored variables within the script. Values should be str, int or bool. None by default.
Returns:

Return type:

pandas.DataFrame of the first output of script.

Raises:

SyntaxError: – If script fails syntax check.

See also

get_outputs(), save_output(), Connection.syntax_check()

Examples

>>> import hpycc
>>> conn = hpycc.Connection("user")
>>> with open("example.ecl", "r+") as file:
...     file.write("OUTPUT(2);")
>>> hpycc.get_output(conn, "example.ecl")
    Result_1
0          2
>>> import hpycc
>>> conn = hpycc.Connection("user")
>>> with open("example.ecl", "r+") as file:
...     file.write("OUTPUT(2);OUTPUT(3);")
>>> hpycc.get_output(conn, "example.ecl")
    Result_1
0          2
>>> import hpycc
>>> conn = hpycc.Connection("user")
>>> with open("example.ecl", "r+") as file:
...     file.write(
...     "a:= DATASET([{'1', 'a'}],"
...     "{STRING col1; STRING col2});"
...     "OUTPUT(a);")
>>> hpycc.get_output(conn, "example.ecl")
   col1 col2
0     1    a
>>> import hpycc
>>> conn = hpycc.Connection("user")
>>> with open("example.ecl", "r+") as file:
...     file.write(
...     "a:= DATASET([{'a'}],"
...     "{STRING col1;});"
...     "OUTPUT(a(col1 != 'a'));")
>>> hpycc.get_output(conn, "example.ecl")
Empty DataFrame
Columns: []
Index: []
hpycc.get.get_outputs(connection, script, syntax_check=True, delete_workunit=True, stored=None)[source]

Return all outputs of an ECL script.

Note that whilst attempts are made to preserve the datatypes of the result, anything with an ambiguous type will revert to a string.

Parameters:
  • connection (hpycc.Connection) – HPCC Connection instance, see also Connection.
  • script (str) – Path of script to execute.
  • syntax_check (bool, optional) – Should the script be syntax checked before execution? True by default.
  • delete_workunit (bool, optional) – Delete the workunit once completed. True by default.
  • stored (dict or None, optional) – Key value pairs to replace stored variables within the script. Values should be str, int or bool. None by default.
Returns:

as_dict – Outputs of script in the form {output_name: pandas.DataFrame}

Return type:

dict of pandas.DataFrames

Raises:

SyntaxError: – If script fails syntax check.

See also

get_output(), save_outputs(), Connection.syntax_check()

Examples

>>> import hpycc
>>> conn = hpycc.Connection("user")
>>> with open("example.ecl", "r+") as file:
...     file.write("OUTPUT(2);")
>>> hpycc.get_outputs(conn, "example.ecl")
{Result_1:
    Result_1
0          2
}
>>> import hpycc
>>> conn = hpycc.Connection("user")
>>> with open("example.ecl", "r+") as file:
...     file.write(
...     "a:= DATASET([{'1', 'a'}],"
...     "{STRING col1; STRING col2});"
...     "OUTPUT(a);")
>>> hpycc.get_outputs(conn, "example.ecl")
{Result_1:
   col1 col2
0     1    a
}
>>> import hpycc
>>> conn = hpycc.Connection("user")
>>> with open("example.ecl", "r+") as file:
...     file.write(
...     "a:= DATASET([{'1', 'a'}],"
...     "{STRING col1; STRING col2});"
...     "OUTPUT(a);"
...     "OUTPUT(a);")
>>> hpycc.get_outputs(conn, "example.ecl")
{Result_1:
   col1 col2
0     1    a,
Result_2:
   col1 col2
0     1    a
}
>>> import hpycc
>>> conn = hpycc.Connection("user")
>>> with open("example.ecl", "r+") as file:
...     file.write(
...     "a:= DATASET([{'1', 'a'}],"
...     "{STRING col1; STRING col2});"
...     "OUTPUT(a);"
...     "OUTPUT(a, NAMED('ds_2'));")
>>> hpycc.get_outputs(conn, "example.ecl")
{Result_1:
   col1 col2
0     1    a,
ds_2:
   col1 col2
0     1    a
}
hpycc.get.get_thor_file(connection, thor_file, max_workers=10, chunk_size='auto', max_attempts=3, max_sleep=60, dtype=None)[source]

Return a thor file as a pandas.DataFrame.

Note: Ordering of the resulting DataFrame is not deterministic and may not be the same as on the HPCC cluster.

Parameters:
  • connection (hpycc.Connection) – HPCC Connection instance, see also Connection.
  • thor_file (str) – Name of thor file to be downloaded.
  • max_workers (int, optional) – Number of concurrent threads to use when downloading file. Warning: too many may cause instability! 10 by default.
  • chunk_size (int, optional) – Size of chunks to use when downloading file. If 'auto' this is rows / workers (bounded between 100,000 and 400,000). If given explicitly, no limits are enforced. 'auto' by default.
  • max_attempts (int, optional) – Maximum number of times a chunk should attempt to be downloaded in the case of an exception being raised. 3 by default.
  • max_sleep (int, optional) – Maximum time, in seconds, to sleep between attempts. The true sleep time is a random int between max_sleep * 0.75 and max_sleep. 60 by default.
  • dtype (type name or dict of col -> type, optional) – Data type for data or columns. E.g. {‘a’: np.float64, ‘b’: np.int32}. If converters are specified, they will be applied INSTEAD of dtype conversion. If None, or columns are missing from the provided dict, they will be converted to one of bool, str or int based on the HPCC datatype. None by default.
Returns:

df – Thor file as a pandas.DataFrame.

Return type:

pandas.DataFrame

See also

save_thor_file()

Examples

>>> import hpycc
>>> import pandas
>>> conn = hpycc.Connection("user")
>>> df = pandas.DataFrame({"col1": [1, 2, 3]})
>>> df.to_csv("example.csv", index=False)
>>> hpycc.spray_file(conn,"example.csv","example")
>>> hpycc.get_thor_file(conn, "example")
    col1
0     1
1     2
2     3
>>> import hpycc
>>> import pandas
>>> conn = hpycc.Connection("user")
>>> df = pandas.DataFrame({"col1": [1, 2, 3]})
>>> df.to_csv("example.csv", index=False)
>>> hpycc.spray_file(conn,"example.csv","example")
>>> hpycc.get_thor_file(conn, "example", dtype=str)
    col1
0     '1'
1     '2'
2     '3'
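The chunk_size='auto' rule quoted above (rows / workers, bounded between 100,000 and 400,000) works out as in this hypothetical sketch; auto_chunk_size is not an hpycc function:

```python
# Hypothetical sketch of the documented chunk_size='auto' rule:
# rows / workers, clamped to the range [100000, 400000].
def auto_chunk_size(n_rows, max_workers=10):
    return int(min(max(n_rows / max_workers, 100000), 400000))


print(auto_chunk_size(50000))     # 100000 (lower bound applies)
print(auto_chunk_size(2000000))   # 200000 (rows / workers)
print(auto_chunk_size(10000000))  # 400000 (upper bound applies)
```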

hpycc.run module

Function to run an ECL script

This module provides a function, run_script, to run an ECL script using an existing Connection. This can be used to run a script that saves a logical file, which can then be accessed with get_thor_file(). The advantage of handing the download task to get_thor_file() is that it is able to multi-thread, something which get_output, get_outputs, save_output and save_outputs cannot do.

Functions

  • run_script – Run an ECL script.
hpycc.run.run_script(connection, script, syntax_check=True, delete_workunit=True, stored=None)[source]

Run an ECL script.

This function runs an ECL script using a Connection object. It does not return the result.

Parameters:
  • connection (hpycc.Connection) – HPCC Connection instance, see also Connection.
  • script (str) – Path of script to execute.
  • syntax_check (bool, optional) – Should the script be syntax checked before execution? True by default.
  • delete_workunit (bool, optional) – Delete workunit once completed. True by default.
  • stored (dict or None, optional) – Key value pairs to replace stored variables within the script. Values should be str, int or bool. None by default.
Returns:

Return type:

True

Raises:

SyntaxError: – If script fails syntax check.

hpycc.save module

TEMPORARILY DEPRECATED! Just use get and save the result. Trust us, it's cleaner.

Functions to get data out of an HPCC instance and save them to disk.

This module's functions closely mirror those in get. In fact, all they really do is wrap get's functions around CSV-writing tasks. The first input to all functions is an instance of Connection.

Functions

  • save_output – Save the first output of an ECL script.
  • save_outputs – Save all outputs of an ECL script.
  • save_thor_file – Save the contents of a thor file.
hpycc.save.save_output(connection, script, path_or_buf=None, syntax_check=True, delete_workunit=True, stored=None, **kwargs)[source]

Save the first output of an ECL script as a csv. See save_outputs() for saving multiple outputs to file and get_output() for returning as a DataFrame.

Parameters:
  • connection (Connection) – HPCC Connection instance, see also Connection.
  • script (str) – Path of script to execute.
  • path_or_buf (string or file handle, default None) – File path or object, if None is provided the result is returned as a string.
  • syntax_check (bool, optional) – Should script be syntax checked before execution. True by default.
  • delete_workunit (bool, optional) – Delete workunit once completed. True by default.
  • stored (dict or None, optional) – Key value pairs to replace stored variables within the script. Values should be str, int or bool. None by default.
  • kwargs – Additional parameters to be provided to pandas.DataFrame.to_csv().
Returns:

if path_or_buf is not None, else a string representation of the output csv.

Return type:

None or str

hpycc.save.save_thor_file(connection, thor_file, path_or_buf=None, max_workers=15, chunk_size='auto', max_attempts=3, max_sleep=60, dtype=None, **kwargs)[source]

Save a logical file to disk, see get_thor_file() for returning a DataFrame.

Parameters:
  • connection (Connection) – HPCC Connection instance, see also Connection.
  • thor_file (str) – Logical file to be downloaded
  • path_or_buf (string or file handle, default None) – File path or object, if None is provided the result is returned as a string.
  • max_workers (int, optional) – Number of concurrent threads to use when downloading. Warning: too many will likely cause either your machine or your cluster to crash! 15 by default.
  • chunk_size (int, optional) – Size of chunks to use when downloading file. 'auto' by default.
  • max_attempts (int, optional) – Maximum number of times a chunk should attempt to be downloaded in the case of an exception being raised. 3 by default.
  • max_sleep (int, optional) – Maximum time, in seconds, to sleep between attempts. The true sleep time is a random int between max_sleep * 0.75 and max_sleep. 60 by default.
  • dtype (type name or dict of col -> type, optional) – Data type for data or columns. E.g. {‘a’: np.float64, ‘b’: np.int32}. If converters are specified, they will be applied INSTEAD of dtype conversion. If None, or columns are missing from the provided dict, they will be converted to one of bool, str or int based on the HPCC datatype. None by default.
  • kwargs – Additional parameters to be provided to pandas.DataFrame.to_csv().
Returns:

if path_or_buf is not None, else a string representation of the output csv.

Return type:

None or str

hpycc.spray module

The module contains functions to send files to HPCC.

Functions

  • spray_file – Spray a given csv or pandas DataFrame to HPCC.
hpycc.spray.spray_file(connection, source_file, logical_file, overwrite=False, expire=None, chunk_size=100000, max_workers=5, delete_workunit=True)[source]

Spray a file to an HPCC logical file, bypassing the landing zone.

Parameters:
  • connection (Connection) – HPCC Connection instance, see also Connection.
  • source_file (str, pd.DataFrame) – A pandas DataFrame or the path to a csv.
  • logical_file (str) – Logical file name on THOR.
  • overwrite (bool, optional) – Should the file overwrite any pre-existing logical file. False by default.
  • chunk_size (int, optional) – Size of chunks to use when spraying file. 100000 by default.
  • max_workers (int, optional) – Number of concurrent threads to use when spraying. Warning: too many will likely cause either your machine or your cluster to crash! 5 by default.
  • expire (int, optional) – Number of days until the produced logical file expires. None (i.e. no expiry) by default.
  • delete_workunit (bool, optional) – Delete workunit once completed. True by default.
Returns:

Return type:

None
