Deduplicate Sheet Records
A Flatfile plugin that provides functionality to find and remove duplicate records from a sheet based on specified field values or custom logic.
The Dedupe plugin provides functionality to find and remove duplicate records from a sheet within Flatfile. It is designed to be used as a server-side listener that is triggered by a custom action configured on a specific sheet.
The primary use case is data cleaning. For example, when importing a list of contacts, you can use this plugin to automatically remove entries that have the same email address. The plugin is flexible, allowing you to specify which field to check for duplicates (e.g., ‘email’, ‘orderId’). You can configure it to keep either the first or the last occurrence of a duplicate record. For more complex deduplication logic, you can provide your own custom function.
Installation
Install the plugin using npm:
Configuration & Parameters
The dedupePlugin
function takes two parameters:
Parameters
- jobOperation (string, required): The operation name that you define in a Sheet-level action. The plugin will only run when an action with this exact operation name is triggered.
- opts (PluginOptions, required): An object containing the configuration options for the plugin.
Configuration Options
Option | Type | Default | Description |
---|---|---|---|
on | string | undefined | The field key (e.g., ‘email’) to check for duplicate values. Required when using the keep option. |
keep | ’first’ | ‘last’ | undefined | Determines which record to keep when a duplicate is found. ‘first’ keeps the first record encountered, ‘last’ keeps the last record encountered. |
custom | function | undefined | A custom function that receives a batch of records and a set of unique values. Should return an array of record IDs to be deleted. Overrides the keep option. |
debug | boolean | false | When set to true , the plugin may output additional logging for development and debugging purposes. |
Default Behavior
By default, without any configuration in the opts
object, the plugin will not perform any deduplication. You must provide either the keep
and on
options or a custom
function for the plugin to work. If the keep
option is used, the on
option becomes mandatory.
Usage Examples
Basic Usage
This example configures the plugin to trigger when a sheet action with the operation “dedupe-email” is clicked. It will find duplicates in the ‘email’ field and keep the last record found, deleting any previous ones.
Custom Deduplication Logic
This example shows how to use a custom function for more complex deduplication logic. The custom function identifies records to delete based on the ‘email’ field and returns their IDs.
Keep First Record
This example keeps the first record encountered for each unique email value and deletes subsequent duplicates.
Troubleshooting
Common Issues
Plugin Not Triggering
- Check for a mismatch between the
jobOperation
string in your listener code and theoperation
value in your Sheet configuration’s action. They must be identical.
Incorrect Field Error
- Ensure the field key passed to the
on
option exists in your Sheet configuration and is spelled correctly.
No Duplicates Removed
- Verify your data to ensure duplicates actually exist for the specified
on
field. - If using a
custom
function, add logging to debug its logic.
Error Scenarios
The plugin will throw descriptive errors for common misconfigurations:
- Missing
on
option:Error: \
on` is required when `keep` is first` - Field not found:
Error: Field "non_existent_field" not found
- Invalid context:
Error: Dedupe must be called from a sheet-level action
Notes
Requirements and Limitations
- Server-Side Requirement: This plugin must be deployed in a server-side listener. It is not intended for client-side use.
- Sheet Action Requirement: The plugin is triggered by a job. To trigger this job, you must configure a Sheet-level action in your Sheet configuration. The
operation
property of this action must exactly match thejobOperation
string passed to thededupePlugin
function. - Large Dataset Limitation: The
keep: 'last'
option may not function as expected on very large datasets where duplicate records are spread across different pages of data. Thekeep: 'first'
option is generally more reliable for large datasets as it correctly tracks unique values across all pages. For a reliable “keep last” implementation on large datasets, acustom
function should be used.
Error Handling
The plugin is wrapped in the jobHandler
utility, which provides standardized job management. Any error thrown during the dedupe function’s execution will be caught, and the job will be marked as ‘failed’ with the corresponding error message. The plugin also performs its own configuration checks and will throw descriptive errors for common misconfigurations.