Set up data for import

This topic is part of the manual Genesys Predictive Routing Deployment and Operations Guide for version 9.0.0 of Genesys Predictive Routing.

Supported types of data

In general, you need the following types of data:

Interaction data

Data Loader automatically extracts interaction data from the Genesys Info Mart database to create Datasets.

Agent Profile data

Data Loader automatically extracts Agent data from the Genesys Info Mart Database. You can optionally add agent data from other sources by providing a CSV file.

Customer Profile data

To create the Customer Profile, you must create a CSV file and upload it, using Data Loader, the GPR web application, or the GPR API.

Outcome Data

To use outcome data, for example from an after-call survey application, you must create a CSV file and upload it, using Data Loader, the GPR web application, or the GPR API.

See Configure Data Loader to upload data for how to configure Data Loader to upload both Genesys Info Mart data and CSV data.

See the relevant portion of the Help for how to use the GPR application to upload CSV files.

CSV file size requirements

Use the following guidelines to construct CSV files for data uploads:

Data Loader uploads data in 512 MB chunks. If your Dataset is larger than 512 MB, Data Loader automatically breaks it into chunks for upload.
The maximum number of columns in a Dataset is 100; the maximum number of rows is 2.5 million. If you upload a file with more than 2.5 million rows, only the first 2.5 million are uploaded—the remainder are discarded.
The maximum length of a single column name in a CSV file to be uploaded is 127 characters.
The maximum length of a single column name that is to be anonymized is 120 characters.
The maximum number of rows in the Agent Profile is 20 thousand.
The maximum number of rows in the Customer Profile is 20 million.

If you try to upload more data than the data size limits allow, GPR generates an error and the remaining rows are discarded.

When you have reached the size limit, GPR does not add or update records. To continue, delete some records using the GPR API */purge endpoints.
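
The Dataset limits above can be checked locally before an upload. The following is a minimal sketch, not part of GPR: the function name `check_dataset_csv` is hypothetical, and it assumes that the 2.5 million row limit counts data rows only (the header excluded), which the guidelines do not state explicitly.

```python
import csv
import io

# Limits quoted from the guidelines above.
MAX_COLUMNS = 100
MAX_ROWS = 2_500_000
MAX_COLUMN_NAME_LENGTH = 127


def check_dataset_csv(stream, delimiter=","):
    """Return a list of problems found in a Dataset CSV (empty list if none).

    `stream` is any text file object; open real files with newline="".
    Hypothetical helper for illustration, not a GPR API.
    """
    reader = csv.reader(stream, delimiter=delimiter)
    header = next(reader, None)
    if header is None:
        return ["file is empty"]

    problems = []
    if len(header) > MAX_COLUMNS:
        problems.append(f"{len(header)} columns exceed the limit of {MAX_COLUMNS}")
    for name in header:
        if len(name) > MAX_COLUMN_NAME_LENGTH:
            problems.append(
                f"column name starting {name[:20]!r} is longer than "
                f"{MAX_COLUMN_NAME_LENGTH} characters"
            )

    # Assumption: the row limit applies to data rows, header excluded.
    row_count = sum(1 for _ in reader)
    if row_count > MAX_ROWS:
        problems.append(f"{row_count} rows exceed the limit of {MAX_ROWS}")
    return problems


# Usage with an in-memory file; a real file works the same way.
sample = io.StringIO("agent_key,skill\n1001,sales\n")
print(check_dataset_csv(sample))
```

The check streams the file row by row, so it does not need to hold a large Dataset in memory.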

Data size for Models and scoring

The following size limits apply to Model creation:

Minimum number of records needed to train a DISJOINT model for an agent - 10
Maximum number of active Models per Tenant - 50
Total cardinality limit for Model training - there is no specific limit on the column count; training has been tested with up to 250 columns.
- Total cardinality should be less than 2 to the power of 29.
- Total Cardinality = the number of numeric columns plus the sum of the number of unique values across all string columns within a given Dataset.
Record count limit for GLOBAL Model training - not applicable; from a Model-training perspective there is virtually no limit on the number of columns. The constraining issue is the possibility of compromising the Global Model quality by ending up with a reduced number of samples for training.
- The total number of records should be less than 2 to the power of 29 (that is, 536870912) divided by the total cardinality as defined above.
- Example 1: You are required to use ALL of the data for training the GLOBAL Model (note that the GLOBAL Model is trained even if you select DISJOINT, so that the scoring engine can rank agents who do not yet have data). If the Dataset contains 1 million records, the maximum total cardinality is 536 (536870912 divided by 1 million, rounded down).
- Example 2: You can undersample the data for training the GLOBAL Model, that is, use fewer than the ideal number of records for training. If the total cardinality is 10,000, only 53,687 of your total of 1 million records are used for training. The calculation is 536870912 divided by 10,000, rounded down to 53,687, so that 10,000 * 53,687 stays within the 536870912 limit.
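
The cardinality arithmetic above can be reproduced with a few lines of Python. This is a sketch under stated assumptions: the function names are hypothetical, rows are represented as plain dictionaries, and floor division is used because it matches the rounding in the two examples (for an exact multiple, the strict "less than" wording would call for one fewer).

```python
LIMIT = 2 ** 29  # 536870912, the bound quoted above


def total_cardinality(rows, numeric_columns, string_columns):
    """Number of numeric columns plus the sum of unique values
    across all string columns. `rows` is a list of dicts."""
    unique_values = sum(
        len({row[column] for row in rows}) for column in string_columns
    )
    return len(numeric_columns) + unique_values


def max_training_records(cardinality):
    """Largest record count that keeps records * cardinality within the limit."""
    return LIMIT // cardinality


def max_cardinality(record_count):
    """Largest total cardinality that allows training on all records."""
    return LIMIT // record_count


# Illustrative rows only; the column names are made up for this sketch.
rows = [
    {"aht": 120, "skill": "sales", "site": "east"},
    {"aht": 95, "skill": "billing", "site": "east"},
    {"aht": 300, "skill": "sales", "site": "west"},
]
print(total_cardinality(rows, ["aht"], ["skill", "site"]))  # 1 + 2 + 2 = 5
print(max_cardinality(1_000_000))    # Example 1: 536
print(max_training_records(10_000))  # Example 2: 53687
```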

The following limitation applies to scoring requests:

Maximum number of agents that can be scored in one scoring request - 1,000.
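
If a target group contains more than 1,000 agents, one way to stay within this limit is to split the agent list into batches on the client side. The helper below is a hypothetical sketch, not a GPR function; how the scores from separate requests are combined afterwards is left to the caller and is not covered by this guide.

```python
MAX_AGENTS_PER_REQUEST = 1000  # limit quoted above


def batch_agents(agent_ids, batch_size=MAX_AGENTS_PER_REQUEST):
    """Yield consecutive slices of `agent_ids`, each at most `batch_size` long."""
    for start in range(0, len(agent_ids), batch_size):
        yield agent_ids[start:start + batch_size]


# Example: 2,500 made-up agent IDs split into three batches.
agents = [f"agent_{n}" for n in range(2500)]
print([len(batch) for batch in batch_agents(agents)])  # [1000, 1000, 500]
```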

Data formatting requirements

When you create the CSV data file for a Dataset, Agent Profile, or Customer Profile, do not include the following in the column name for the field to be used as the ID_FIELD, Agent ID, or Customer ID: ID, _id, or any variant of these that changes only the capitalization. Using these strings in the column name results in an error when you try to upload your data.
When you create the CSV data file for a Dataset, Agent Profile, or Customer Profile, do not include the following reserved names in column names:
- created_at
- tenant_id
- updated_at
- acl
In the Agent Profile, if you are using skill names that include a dot (period) or a space, enclose the skill name in double quotation marks. For example, a skill named fluent spanish.8 should be entered as "fluent spanish.8".
GPR supports UTF-8 encoding. All responses and returned data are provided in UTF-8 encoding.
If you use a Microsoft editor to create your CSV file, remove the carriage return (^M) character before uploading. Microsoft editors such as Excel, WordPad, and Notepad automatically insert this character. For tips on removing the character from Excel files, refer to How to remove carriage returns (line breaks) from cells in Excel 2016, 2013, 2010.
Only one-dimensional dictionaries are supported, with up to 200 key-value pairs where the key is a string and the value is int, float, or Boolean.
If you have dictionary-type fields that use comma separators, use tab separators for your CSV file.
Fields of the dictionary (DICT) type are discovered correctly only if the quotes appear as in the following example, with double quotation marks outside a dictionary entry and single quotation marks for the values within it. This applies to DICT fields in both Datasets and Agent and Customer Profiles.
- "{'vq_1':0.54,'vq_2':6.43}"
In SQL queries, the use of angle brackets (<>) to signify "not equal" is not supported. Genesys recommends that you use != instead.
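
Several of the formatting rules above lend themselves to a pre-upload check. The following sketch uses hypothetical helper names and makes these assumptions, none of which are confirmed by this guide: ID names and reserved names are compared against whole column names, Boolean values are written as Python `True`/`False`, and every carriage return in the file can be removed safely.

```python
RESERVED_NAMES = {"created_at", "tenant_id", "updated_at", "acl"}
FORBIDDEN_ID_NAMES = {"id", "_id"}  # compared case-insensitively
MAX_DICT_PAIRS = 200


def check_column_names(column_names, id_column):
    """Return a list of problems with the ID column and reserved names."""
    problems = []
    # Assumption: the rule applies to the whole name of the ID column.
    if id_column.lower() in FORBIDDEN_ID_NAMES:
        problems.append(f"ID column must not be named {id_column!r}")
    for name in column_names:
        if name in RESERVED_NAMES:
            problems.append(f"{name!r} is a reserved column name")
    return problems


def format_dict_field(values):
    """Render a flat dict as a DICT field: double quotation marks outside,
    single quotation marks around keys, as in the example above."""
    if len(values) > MAX_DICT_PAIRS:
        raise ValueError(f"more than {MAX_DICT_PAIRS} key-value pairs")
    parts = []
    for key, value in values.items():
        if not isinstance(key, str):
            raise TypeError("dictionary keys must be strings")
        # Nested dicts and lists are rejected here: one dimension only.
        if not isinstance(value, (bool, int, float)):
            raise TypeError("values must be int, float, or Boolean")
        parts.append(f"'{key}':{value!r}")
    return '"{' + ",".join(parts) + '}"'


def strip_carriage_returns(text):
    """Remove the carriage return (^M) characters added by Microsoft editors."""
    return text.replace("\r", "")


print(check_column_names(["customer_key", "acl", "score"], "customer_key"))
print(format_dict_field({"vq_1": 0.54, "vq_2": 6.43}))
print(repr(strip_carriage_returns("a\tb\r\n1\t2\r\n")))
```

Because DICT fields contain commas, the last line uses a tab-separated sample, in line with the guideline on separators.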

Data anonymization

PII (personally identifiable information) and sensitive data, such as passwords, must be hidden when you upload them to the GPR Core platform. To ensure that such data is secured, instruct GPR to anonymize the fields that contain it. Note the following points about anonymized data in GPR:

You can anonymize up to 20 fields.
You cannot anonymize fields after you have uploaded data.
Once you have uploaded data with anonymized fields, you cannot de-anonymize them.
Anonymizing Numeric or Boolean fields changes them to String fields. This change has some effect on how the fields are weighted in the Feature Analysis report and during scoring.
Each Tenant has its own salt for anonymization.

NOTE: If you anonymize a field, you must anonymize it in every dataset in which it appears. For example, if you anonymize a customer phone number in the Customer Profile, you must also anonymize it in any Dataset in which it appears. If there is an inconsistency, GPR cannot correctly map agents and, as a result, no Local models can be built for such agents, negatively affecting Disjoint and Hybrid Model performance. In such cases, the Global Model is used for all agents.
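
Because anonymization cannot be changed after upload, it can help to verify the plan for all data sources beforehand. The sketch below is a hypothetical helper, not a GPR feature. It assumes that the 20-field limit applies per data source, which this guide does not specify, and it uses the 120-character limit for anonymized column names given under CSV file size requirements.

```python
MAX_ANONYMIZED_FIELDS = 20
MAX_ANONYMIZED_NAME_LENGTH = 120


def check_anonymization(sources):
    """Check an anonymization plan.

    `sources` maps a source name (a Dataset or a Profile) to a tuple
    (all_columns, anonymized_columns). Returns a list of problems.
    """
    problems = []
    anonymized_anywhere = set()
    for name, (columns, anonymized) in sources.items():
        anonymized_anywhere.update(anonymized)
        # Assumption: the limit of 20 fields is counted per source.
        if len(anonymized) > MAX_ANONYMIZED_FIELDS:
            problems.append(f"{name}: more than {MAX_ANONYMIZED_FIELDS} anonymized fields")
        for column in anonymized:
            if len(column) > MAX_ANONYMIZED_NAME_LENGTH:
                problems.append(f"{name}: anonymized column name {column[:20]!r} is too long")

    # A field anonymized in one source must be anonymized wherever it appears.
    for name, (columns, anonymized) in sources.items():
        missing = set(columns) & (anonymized_anywhere - set(anonymized))
        for column in sorted(missing):
            problems.append(f"{name}: {column!r} is anonymized elsewhere but not here")
    return problems


# Made-up column names for illustration.
plan = {
    "Customer Profile": (["customer_key", "phone", "segment"], ["phone"]),
    "Interactions": (["phone", "agent_key", "outcome"], []),
}
print(check_anonymization(plan))
```

In this example the check reports that phone is anonymized in the Customer Profile but not in the Interactions Dataset, which is the inconsistency the NOTE warns about.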

Unsupported characters in column names

The following characters are not supported for column names. If GPR encounters these characters in a CSV file, it reads them as column delimiters and parses the data accordingly.

| (the pipe character)
\t (the TAB character)
, (the comma)

Workaround: To use these characters in column names, add double quotation marks (" ") around the entire affected column name, except in the following situations:

If you have a comma-delimited CSV file, add double quotation marks around commas within column names; you do not need quotation marks for the \t (TAB) character.
If you have a TAB-delimited CSV file, add double quotation marks around TAB characters within column names; you do not need quotation marks for the , (comma) character.
You must always use double quotation marks for the pipe character.
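
The workaround can be applied when generating the header row. The following sketch uses hypothetical function names and takes the simplest safe approach: it quotes the entire column name whenever any of the three characters is present, which covers the required cases and also quotes in the two situations where quotation marks are optional. It assumes column names contain no double quotation marks of their own.

```python
SPECIAL_CHARACTERS = ("|", "\t", ",")


def quote_column_name(name):
    """Wrap the whole column name in double quotation marks if it contains
    a pipe, TAB, or comma; otherwise return it unchanged."""
    if any(character in name for character in SPECIAL_CHARACTERS):
        return f'"{name}"'
    return name


def build_header(column_names, delimiter=","):
    """Join the (quoted where needed) column names into a header row."""
    return delimiter.join(quote_column_name(name) for name in column_names)


# Made-up column names for illustration.
print(build_header(["agent_key", "city, state", "queue|priority"]))
# agent_key,"city, state","queue|priority"
```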

Data retention policies

GPR follows standard Genesys data retention guidelines for PureEngage Cloud.

Most objects and data are deleted automatically after 90 days during which they are inactive. These include the following:

Dataset data and the Dataset object - Deleted after 90 days of idle time, which means no new files were appended and the Dataset was not used to generate any data for Predictors in that period.
File upload object - Deleted after 90 days of idle time. Here idle time means this file was not used to generate any data for Predictors in that period.
Agent / Customer Profiles - Deleted after 90 days of idle time, which means the Profile was not updated in the last 90 days.
Model - Deleted after 90 days of idle time, which means the Model was not used for any score requests in the last 90 days.
Predictor generated data and the Predictor object - Deleted after 90 days of idle time, which means that no associated Model was used for a score request in the last 90 days.

The following data uses different retention policies:

Uploaded anonymized files - Deleted 7 days after upload.
Files stored for billing purposes - Deleted 60 days after creation.
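
For planning purposes, the retention periods above can be turned into an earliest deletion date. This is a rough sketch with hypothetical names; the exact time of day at which GPR deletes objects is not specified in this guide, so treat the result as an estimate.

```python
from datetime import datetime, timedelta

# Retention periods quoted above; the dictionary keys are made up for this sketch.
RETENTION = {
    "dataset": timedelta(days=90),
    "file_upload": timedelta(days=90),
    "profile": timedelta(days=90),
    "model": timedelta(days=90),
    "predictor": timedelta(days=90),
    "anonymized_file": timedelta(days=7),
    "billing_file": timedelta(days=60),
}


def earliest_deletion(kind, reference_time):
    """Return reference_time plus the retention period for `kind`.

    For the 90-day objects, `reference_time` is the last activity;
    for anonymized and billing files, it is the upload or creation time.
    """
    return reference_time + RETENTION[kind]


print(earliest_deletion("model", datetime(2020, 1, 1)))            # 2020-03-31 00:00:00
print(earliest_deletion("anonymized_file", datetime(2020, 1, 1)))  # 2020-01-08 00:00:00
```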