hpycc package

Submodules

hpycc.connection module

Object for connecting to a HPCC instance.

This module provides a Connection class to connect to a HPCC instance. This connection is used as the first input to the majority of public functions in the hpycc package.

Classes

  • Connection – HPCC connection class.
class hpycc.connection.Connection(username, server='localhost', port=8010, repo=None, password='password', legacy=False, test_conn=True)[source]

Bases: object

check_syntax(script)[source]

Run an ECL syntax check on an ECL script.

Uses eclcc to run a syntax check on script. If the syntax check fails, i.e. an error is present, a SyntaxError is raised. Note that this requires eclcc.exe to be on the path. The attributes legacy and repo are also used.

Parameters:script (str) – path to ECL script.
Returns:
Return type:None
Raises:SyntaxError – If the script fails the syntax check.
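
The behaviour described for check_syntax can be sketched roughly as follows. This is an illustration only, not hpycc's actual implementation: the helper names build_syntax_check_command and check_syntax_sketch are hypothetical, and it assumes that the eclcc options -syntax, -legacy and -I are how the legacy and repo attributes are passed through, and that eclcc exits with a non-zero code when the check fails.

```python
import subprocess


def build_syntax_check_command(script, repo=None, legacy=False):
    """Build an eclcc command line that only syntax-checks `script`.

    Hypothetical helper: the mapping of `legacy` and `repo` onto the
    `-legacy` and `-I` options is an assumption, not taken from hpycc.
    """
    command = ["eclcc", "-syntax"]
    if legacy:
        command.append("-legacy")
    if repo is not None:
        command.extend(["-I", repo])
    command.append(script)
    return command


def check_syntax_sketch(script, repo=None, legacy=False):
    """Raise SyntaxError if eclcc reports a problem with `script`."""
    command = build_syntax_check_command(script, repo, legacy)
    # eclcc must be on the path for this call to succeed.
    result = subprocess.run(command, capture_output=True, text=True)
    if result.returncode != 0:
        raise SyntaxError(result.stderr)
```

Keeping the command construction separate from the subprocess call makes it possible to inspect the command without having eclcc installed.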
get_chunk_from_hpcc(logical_file, start_row, n_rows, max_attempts, max_sleep)[source]

Using the HPCC instance at server:port and the credentials username and password, return the JSON response to a request for part of logical_file, starting at row start_row and n_rows long.

Parameters:
  • logical_file (str) – Name of logical file.
  • start_row (int) – First row to return where 0 is the first row of the dataset.
  • n_rows (int) – Number of rows to return.
  • max_attempts (int) – Maximum number of times url should be queried in the case of an exception being raised.
  • max_sleep (int) – Maximum time, in seconds, to sleep between attempts. The true sleep time is a random int between max_sleep and max_sleep * 0.75.
Returns:

resp – JSON formatted response containing rows and all associated metadata.

Return type:

json

get_logical_file_chunk(logical_file, start_row, n_rows, max_attempts, max_sleep)[source]

Return a chunk of a logical file from an HPCC instance.

Using the HPCC instance at server:port and the credentials username and password, return a chunk of logical_file which starts at row start_row and is n_rows long.

Parameters:
  • logical_file (str) – Name of logical file.
  • start_row (int) – First row to return where 0 is the first row of the dataset.
  • n_rows (int) – Number of rows to return.
  • max_attempts (int) – Maximum number of times url should be queried in the case of an exception being raised.
  • max_sleep (int) – Maximum time, in seconds, to sleep between attempts. The true sleep time is a random int between max_sleep and max_sleep * 0.75.
Returns:

result_response – Rows of logical file as list of dicts. In the form [{"col1": 1, "col2": 2}, {"col1": 1, "col2": 2}, …].

Return type:

pd.DataFrame, str
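
To read a whole logical file with this method, a caller needs a sequence of (start_row, n_rows) pairs that covers every row exactly once. A minimal sketch of that bookkeeping is shown here; chunk_ranges is a hypothetical helper and is not part of hpycc.

```python
def chunk_ranges(total_rows, chunk_size):
    """Return (start_row, n_rows) pairs covering rows 0..total_rows-1."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    ranges = []
    for start_row in range(0, total_rows, chunk_size):
        # The final chunk may be shorter than chunk_size.
        n_rows = min(chunk_size, total_rows - start_row)
        ranges.append((start_row, n_rows))
    return ranges
```

For example, chunk_ranges(10, 4) returns [(0, 4), (4, 4), (8, 2)]; each pair could then be passed as start_row and n_rows.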

run_ecl_script(script, syntax_check, delete_workunit, stored)[source]

Run an ECL script and return the stdout and stderr.

Run the ECL script script on the HPCC instance at server:port, using the credentials username and password. If syntax_check, run a syntax check before execution. Attributes legacy and repo are also used.

Parameters:
  • script (str) – path to ECL script.
  • syntax_check (bool) – If a syntax check should be run before the script is executed.
  • delete_workunit (bool) – Delete workunit once completed.
  • stored (dict or None) – Key value pairs to replace stored variables within the script. Values should be str, int or bool.
Returns:

result – NamedTuple in the form (stdout, stderr).

Return type:

namedtuple

Raises:

subprocess.CalledProcessError: – If script fails syntax check.

See also

check_syntax(), run_ecl_string()

run_ecl_string(string, syntax_check, delete_workunit, stored)[source]

Run an ECL string and return the stdout and stderr.

Run the ECL string string on the HPCC instance at server:port, using the credentials username and password. If syntax_check, run a syntax check before execution. Attributes legacy and repo are also used.

Parameters:
  • string (str) – ECL script as a string.
  • syntax_check (bool) – If a syntax check should be run before the script is executed.
  • delete_workunit (bool) – Delete workunit once completed.
  • stored (dict or None) – Key value pairs to replace stored variables within the script. Values should be str, int or bool.
Returns:

result – NamedTuple in the form (stdout, stderr).

Return type:

namedtuple

Raises:

SyntaxError: – If script fails syntax check.

See also

check_syntax(), run_ecl_script()

run_url_request(url, max_attempts, max_sleep)[source]

Return the contents of a url.

Use attributes username and password to return the contents of url. Parameter max_attempts can be used to retry if an exception is raised. Each attempt is delayed by up to max_sleep seconds, so a large number of retries may be slow.

Parameters:
  • url (str) – URL to query.
  • max_attempts (int) – Maximum number of times url should be queried in the case of an exception being raised.
  • max_sleep (int) – Maximum time, in seconds, to sleep between attempts. The true sleep time is a random int between max_sleep and max_sleep * 0.75.
Returns:

r – Response object from url

Return type:

requests.models.Response

Raises:

requests.exceptions.RetryError: – If max_attempts is exceeded.
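
The retry behaviour described above (up to max_attempts tries, with a random sleep between max_sleep * 0.75 and max_sleep seconds after each failure) can be sketched in plain Python. This is an illustration under stated assumptions rather than hpycc's code: retry_with_sleep is a hypothetical name, the lower bound is rounded down to an int, and a RuntimeError stands in for requests.exceptions.RetryError so the sketch needs only the standard library.

```python
import random
import time


def retry_with_sleep(action, max_attempts, max_sleep, sleep=time.sleep):
    """Call `action` until it succeeds or `max_attempts` is used up."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return action()
        except Exception as error:
            last_error = error
            if attempt < max_attempts:
                # Random whole number of seconds in [0.75 * max_sleep, max_sleep].
                low = int(max_sleep * 0.75)
                sleep(random.randint(low, max_sleep))
    # Stand-in for requests.exceptions.RetryError (assumption, see lead-in).
    raise RuntimeError(f"gave up after {max_attempts} attempts") from last_error
```

The sleep argument is injectable so the delay can be replaced in tests; with the defaults a call with max_attempts=3 and max_sleep=15 may block for up to 30 seconds before giving up.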

test_connection()[source]

Assert that the Connection can connect to the HPCC instance.

This method attempts to connect to ECL Watch using its server and port attributes. The credentials provided are its username and password.

Returns:
Return type:True
Raises:Exception: – If the connection fails, the relevant exception is raised.

hpycc.delete module

Functions to delete things in HPCC. The first input to all functions is an instance of Connection.

Functions

  • delete_logical_file – Delete a given logical file.
  • delete_workunit – Delete a given workunit (based on WUID).
hpycc.delete.delete_logical_file(connection, logical_file, delete_workunit=True)[source]

Delete a logical file.

Parameters:
  • connection (Connection) – HPCC Connection instance, see also Connection.
  • logical_file (str) – Logical file to be deleted.
  • delete_workunit (bool, optional) – Delete workunit once completed. True by default.
Returns:

Return type:

None

hpycc.delete.delete_workunit(connection, wuid, max_attempts=3, max_sleep=15)[source]

Delete a workunit.

Parameters:
  • connection (Connection) – HPCC Connection instance, see also Connection.
  • wuid (string) – Workunit ID
  • max_attempts (int, optional) – Maximum number of times url should be queried in the case of an exception being raised. 3 by default.
  • max_sleep (int, optional) – Maximum time, in seconds, to sleep between attempts. The true sleep time is a random int between max_sleep and max_sleep * 0.75. 15 by default.
Returns:

If the workunit is deleted successfully.

Return type:

True

Raises:

ValueError: – If the workunit could not be deleted.

hpycc.get module

Functions to get data out of a HPCC instance.

This module contains functions to get either the output(s) of an ECL script, or the contents of a logical file. The first input to all functions is an instance of Connection.

Functions

  • get_output – Return the first output of an ECL script.
  • get_outputs – Return all outputs of an ECL script.
  • get_thor_file – Return the contents of a thor file.
hpycc.get.get_output(connection, script, syntax_check=True, delete_workunit=True, stored=None)[source]

Return the first output of an ECL script as a pandas.DataFrame.

Note that whilst attempts are made to preserve the datatypes of the result, anything with an ambiguous type will revert to a string. If the output of the ECL script is an empty dataset (or if the script does not output anything), an empty pandas.DataFrame is returned.

Parameters:
  • connection (hpycc.Connection) – HPCC Connection instance, see also Connection.
  • script (str) – Path of script to execute.
  • syntax_check (bool, optional) – Should the script be syntax checked before execution? True by default.
  • delete_workunit (bool, optional) – Delete workunit once completed. True by default.
  • stored (dict or None, optional) – Key value pairs to replace stored variables within the script. Values should be str, int or bool. None by default.
Returns:

Return type:

pandas.DataFrame of the first output of script.

Raises:

SyntaxError: – If script fails syntax check.

See also

get_outputs(), save_output(), Connection.check_syntax()

Examples

>>> import hpycc
>>> conn = hpycc.Connection("user")
>>> with open("example.ecl", "w") as file:
...     file.write("OUTPUT(2);")
>>> hpycc.get_output(conn, "example.ecl")
    Result_1
0          2
>>> import hpycc
>>> conn = hpycc.Connection("user")
>>> with open("example.ecl", "w") as file:
...     file.write("OUTPUT(2);OUTPUT(3);")
>>> hpycc.get_output(conn, "example.ecl")
    Result_1
0          2
>>> import hpycc
>>> conn = hpycc.Connection("user")
>>> with open("example.ecl", "w") as file:
...     file.write(
...     "a:= DATASET([{'1', 'a'}],"
...     "{STRING col1; STRING col2});"
...     "OUTPUT(a);")
>>> hpycc.get_output(conn, "example.ecl")
   col1 col2
0     1    a
>>> import hpycc
>>> conn = hpycc.Connection("user")
>>> with open("example.ecl", "w") as file:
...     file.write(
...     "a:= DATASET([{'a'}],"
...     "{STRING col1;});"
...     "OUTPUT(a(col1 != 'a'));")
>>> hpycc.get_output(conn, "example.ecl")
Empty DataFrame
Columns: []
Index: []
hpycc.get.get_outputs(connection, script, syntax_check=True, delete_workunit=True, stored=None)[source]

Return all outputs of an ECL script.

Note that whilst attempts are made to preserve the datatypes of the result, anything with an ambiguous type will revert to a string.

Parameters:
  • connection (hpycc.Connection) – HPCC Connection instance, see also Connection.
  • script (str) – Path of script to execute.
  • syntax_check (bool, optional) – Should the script be syntax checked before execution? True by default.
  • delete_workunit (bool, optional) – Delete the workunit once completed. True by default.
  • stored (dict or None, optional) – Key value pairs to replace stored variables within the script. Values should be str, int or bool. None by default.
Returns:

as_dict – Outputs of script in the form {output_name: pandas.DataFrame}

Return type:

dict of pandas.DataFrames

Raises:

SyntaxError: – If script fails syntax check.

See also

get_output(), save_outputs(), Connection.check_syntax()

Examples

>>> import hpycc
>>> conn = hpycc.Connection("user")
>>> with open("example.ecl", "w") as file:
...     file.write("OUTPUT(2);")
>>> hpycc.get_outputs(conn, "example.ecl")
{Result_1:
    Result_1
0          2
}
>>> import hpycc
>>> conn = hpycc.Connection("user")
>>> with open("example.ecl", "w") as file:
...     file.write(
...     "a:= DATASET([{'1', 'a'}],"
...     "{STRING col1; STRING col2});"
...     "OUTPUT(a);")
>>> hpycc.get_outputs(conn, "example.ecl")
{Result_1:
   col1 col2
0     1    a
}
>>> import hpycc
>>> conn = hpycc.Connection("user")
>>> with open("example.ecl", "w") as file:
...     file.write(
...     "a:= DATASET([{'1', 'a'}],"
...     "{STRING col1; STRING col2});"
...     "OUTPUT(a);"
...     "OUTPUT(a);")
>>> hpycc.get_outputs(conn, "example.ecl")
{Result_1:
   col1 col2
0     1    a,
Result_2:
   col1 col2
0     1    a
}
>>> import hpycc
>>> conn = hpycc.Connection("user")
>>> with open("example.ecl", "w") as file:
...     file.write(
...     "a:= DATASET([{'1', 'a'}],"
...     "{STRING col1; STRING col2});"
...     "OUTPUT(a);"
...     "OUTPUT(a, NAMED('ds_2'));")
>>> hpycc.get_outputs(conn, "example.ecl")
{Result_1:
   col1 col2
0     1    a,
ds_2:
   col1 col2
0     1    a
}
hpycc.get.get_thor_file(connection, thor_file, max_workers=10, chunk_size='auto', max_attempts=3, max_sleep=60, dtype=None)[source]

Return a thor file as a pandas.DataFrame.

Note: Ordering of the resulting DataFrame is not deterministic and may not be the same as on the HPCC cluster.

Parameters:
  • connection (hpycc.Connection) – HPCC Connection instance, see also Connection.
  • thor_file (str) – Name of thor file to be downloaded.
  • max_workers (int, optional) – Number of concurrent threads to use when downloading file. Warning: too many may cause instability! 10 by default.
  • chunk_size (int or 'auto', optional) – Size of chunks to use when downloading file. If 'auto', this is rows / workers (bounded between 100,000 and 400,000). If given, no limits are enforced. 'auto' by default.
  • max_attempts (int, optional) – Maximum number of times a chunk should attempt to be downloaded in the case of an exception being raised. 3 by default.
  • max_sleep (int, optional) – Maximum time, in seconds, to sleep between attempts. The true sleep time is a random int between max_sleep and max_sleep * 0.75. 60 by default.
  • dtype (type name or dict of col -> type, optional) – Data type for data or columns. E.g. {'a': np.float64, 'b': np.int32}. If converters are specified, they will be applied INSTEAD of dtype conversion. If None, or columns are missing from the provided dict, they will be converted to one of bool, str or int based on the HPCC datatype. None by default.
Returns:

df – Thor file as a pandas.DataFrame.

Return type:

pandas.DataFrame

See also

save_thor_file()

Examples

>>> import hpycc
>>> import pandas
>>> conn = hpycc.Connection("user")
>>> df = pandas.DataFrame({"col1": [1, 2, 3]})
>>> df.to_csv("example.csv", index=False)
>>> hpycc.spray_file(conn, "example.csv", "example")
>>> hpycc.get_thor_file(conn, "example")
    col1
0     1
1     2
2     3
>>> import hpycc
>>> import pandas
>>> conn = hpycc.Connection("user")
>>> df = pandas.DataFrame({"col1": [1, 2, 3]})
>>> df.to_csv("example.csv", index=False)
>>> hpycc.spray_file(conn, "example.csv", "example")
>>> hpycc.get_thor_file(conn, "example", dtype=str)
    col1
0     '1'
1     '2'
2     '3'
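
The 'auto' rule for chunk_size described in the parameter list (rows / workers, bounded between 100,000 and 400,000) amounts to a clamp. A minimal sketch follows; auto_chunk_size is a hypothetical name, and the use of integer division is an assumption, since the documentation does not say how the quotient is rounded.

```python
def auto_chunk_size(total_rows, max_workers, lower=100_000, upper=400_000):
    """Return rows / workers, clamped to the range [lower, upper]."""
    raw = total_rows // max_workers  # integer division is an assumption
    return max(lower, min(upper, raw))
```

With 10 workers, a file of 2,000,000 rows gives a chunk size of 200,000, while a file of 5,000,000 rows is capped at 400,000 and a file of 50,000 rows is raised to 100,000.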

hpycc.run module

Function to run an ECL script.

This module provides a function, run_script, to run an ECL script using an existing Connection. This can be used to run a script that saves a logical file, which can then be accessed with get_thor_file(). The advantage of giving the download task to get_thor_file() is that it can multi-thread, something the functions get_output, get_outputs, save_output and save_outputs cannot do.
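
The benefit of multi-threading a download can be illustrated with a generic sketch using the standard library's thread pool. It does not reproduce get_thor_file(): download_in_chunks and fetch_chunk are hypothetical names, and fetch_chunk stands for any callable that returns the rows for a given start row and row count.

```python
from concurrent.futures import ThreadPoolExecutor


def download_in_chunks(fetch_chunk, total_rows, chunk_size, max_workers):
    """Fetch all rows by requesting chunks concurrently."""
    starts = range(0, total_rows, chunk_size)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submit one request per chunk; the last chunk may be shorter.
        futures = [
            pool.submit(fetch_chunk, start, min(chunk_size, total_rows - start))
            for start in starts
        ]
        rows = []
        # Collect in submission order, so row order is preserved here.
        for future in futures:
            rows.extend(future.result())
    return rows
```

This sketch collects results in submission order. A version that gathers chunks as they complete would return rows in a non-deterministic order, which is the behaviour the get_thor_file() documentation warns about.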

Functions

  • run_script – Run an ECL script.
hpycc.run.run_script(connection, script, syntax_check=True, delete_workunit=True, stored=None)[source]

Run an ECL script.

This function runs an ECL script using a Connection object. It does not return the result.

Parameters:
  • connection (hpycc.Connection) – HPCC Connection instance, see also Connection.
  • script (str) – Path of script to execute.
  • syntax_check (bool, optional) – Should the script be syntax checked before execution? True by default.
  • delete_workunit (bool, optional) – Delete workunit once completed. True by default.
  • stored (dict or None, optional) – Key value pairs to replace stored variables within the script. Values should be str, int or bool. None by default.
Returns:

Return type:

True

Raises:

SyntaxError: – If script fails syntax check.

hpycc.save module

TEMPORARILY DEPRECATED! Just use get and save the result. Trust us, it’s cleaner.

Functions to get data out of an HPCC instance and save them to disk.

This module's functions closely mirror those in get. In fact, all they really do is wrap get’s functions around csv writing tasks. The first input to all functions is an instance of Connection.

Functions

  • save_output – Save the first output of an ECL script.
  • save_outputs – Save all outputs of an ECL script.
  • save_thor_file – Save the contents of a thor file.
hpycc.save.save_output(connection, script, path_or_buf=None, syntax_check=True, delete_workunit=True, stored=None, **kwargs)[source]

Save the first output of an ECL script as a csv. See save_outputs() for saving multiple outputs to file and get_output() for returning as a DataFrame.

Parameters:
  • connection (Connection) – HPCC Connection instance, see also Connection.
  • script (str) – Path of script to execute.
  • path_or_buf (string or file handle, default None) – File path or object, if None is provided the result is returned as a string.
  • syntax_check (bool, optional) – Should script be syntax checked before execution. True by default.
  • delete_workunit (bool, optional) – Delete workunit once completed. True by default.
  • stored (dict or None, optional) – Key value pairs to replace stored variables within the script. Values should be str, int or bool. None by default.
  • kwargs – Additional parameters to be provided to pandas.DataFrame.to_csv().
Returns:

None if path_or_buf is not None, else a string representation of the output csv.

Return type:

None or str
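
The path_or_buf convention (write to a path or file handle when one is given, otherwise return the csv as a string) can be sketched with the standard library alone. This is an illustration only: the real functions delegate to pandas.DataFrame.to_csv(), whereas rows_to_csv is a hypothetical helper built on the csv module that takes a list of dicts.

```python
import csv
import io


def rows_to_csv(rows, path_or_buf=None):
    """Write `rows` (a list of dicts) as csv; return a str if no target."""
    fieldnames = list(rows[0].keys()) if rows else []

    def write(handle):
        writer = csv.DictWriter(handle, fieldnames=fieldnames, lineterminator="\n")
        writer.writeheader()
        writer.writerows(rows)

    if path_or_buf is None:
        # No target given: build the csv in memory and return it.
        buffer = io.StringIO()
        write(buffer)
        return buffer.getvalue()
    if isinstance(path_or_buf, str):
        # A string is treated as a file path.
        with open(path_or_buf, "w", newline="") as handle:
            write(handle)
        return None
    # Anything else is assumed to be an open, writable file handle.
    write(path_or_buf)
    return None
```

Calling rows_to_csv([{"col1": 1, "col2": "a"}]) returns the string "col1,col2\n1,a\n", while passing a path or handle writes the same text and returns None.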

hpycc.save.save_thor_file(connection, thor_file, path_or_buf=None, max_workers=15, chunk_size='auto', max_attempts=3, max_sleep=60, dtype=None, **kwargs)[source]

Save a logical file to disk, see get_thor_file() for returning a DataFrame.

Parameters:
  • connection (Connection) – HPCC Connection instance, see also Connection.
  • thor_file (str) – Logical file to be downloaded
  • path_or_buf (string or file handle, default None) – File path or object, if None is provided the result is returned as a string.
  • max_workers (int, optional) – Number of concurrent threads to use when downloading. Warning: too many will likely cause either your machine or your cluster to crash! 15 by default.
  • chunk_size (int or 'auto', optional) – Size of chunks to use when downloading file. 'auto' by default.
  • max_attempts (int, optional) – Maximum number of times a chunk should attempt to be downloaded in the case of an exception being raised. 3 by default.
  • max_sleep (int, optional) – Maximum time, in seconds, to sleep between attempts. The true sleep time is a random int between max_sleep and max_sleep * 0.75.
  • dtype (type name or dict of col -> type, optional) – Data type for data or columns. E.g. {'a': np.float64, 'b': np.int32}. If converters are specified, they will be applied INSTEAD of dtype conversion. If None, or columns are missing from the provided dict, they will be converted to one of bool, str or int based on the HPCC datatype. None by default.
  • kwargs – Additional parameters to be provided to pandas.DataFrame.to_csv().
Returns:

None if path_or_buf is not None, else a string representation of the output csv.

Return type:

None or str

hpycc.spray module

This module contains functions to send files to HPCC.

Functions

  • spray_file – Spray a given csv or pandas DataFrame to HPCC.
hpycc.spray.spray_file(connection, source_file, logical_file, overwrite=False, expire=None, chunk_size=100000, max_workers=5, delete_workunit=True)[source]

Spray a file to a HPCC logical file, bypassing the landing zone.

Parameters:
  • connection (Connection) – HPCC Connection instance, see also Connection.
  • source_file (str, pd.DataFrame) – A pandas DataFrame or the path to a csv.
  • logical_file (str) – Logical file name on THOR.
  • overwrite (bool, optional) – Should the file overwrite any pre-existing logical file. False by default.
  • chunk_size (int, optional) – Size of chunks to use when spraying file. 100000 by default.
  • max_workers (int, optional) – Number of concurrent threads to use when spraying. Warning: too many will likely cause either your machine or your cluster to crash! 5 by default.
  • expire (int, optional) – Number of days until the produced logical file expires. None (i.e. no expiry) by default.
  • delete_workunit (bool, optional) – Delete workunit once completed. True by default.
Returns:

Return type:

None