Overview
NumPy provides vectorized set operations for 1D arrays and multidimensional subarrays. These tools allow for deduplication, membership testing, and finding differences/intersections between datasets.
When to Use
- •Deduplicating rows in a large feature matrix.
- •Filtering a dataset to exclude a list of forbidden values.
- •Synchronizing two datasets by finding their intersection.
- •Compressing data by storing unique values and their index mappings.
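As an illustration of the intersection use case, here is a minimal sketch using np.intersect1d with return_indices=True. The arrays ids_a and ids_b are hypothetical ID columns invented for this example.

```python
import numpy as np

# Hypothetical ID columns from two datasets (illustrative values only).
ids_a = np.array([10, 20, 30, 40, 50])
ids_b = np.array([50, 30, 70, 10])

# return_indices=True also reports where each common value first occurs
# in each input, which lets you align rows of the two datasets.
common, idx_a, idx_b = np.intersect1d(ids_a, ids_b, return_indices=True)

print(common)  # sorted common values: [10 30 50]
print(idx_a)   # positions in ids_a:   [0 2 4]
print(idx_b)   # positions in ids_b:   [3 1 0]
```

Indexing each dataset with idx_a and idx_b yields rows in the same order, keyed by the shared IDs.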
Decision Tree
- •Need the distinct values of an array (duplicates removed)?
  - •Use np.unique.
- •Need to reconstruct the original array from unique values?
  - •Set return_inverse=True in np.unique.
- •Checking if elements exist in another list?
  - •Use np.isin(data, target_list).
Workflows
- •Finding Unique Rows in a Dataset
  - •Create a 2D array.
  - •Call np.unique(arr, axis=0).
  - •Inspect the result to see the deduplicated records.
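The unique-rows workflow can be sketched as follows. The matrix arr is made-up sample data, and the return_index step is an optional extra (not part of the workflow itself) for when the original row order matters.

```python
import numpy as np

# Small feature matrix with one repeated row (illustrative data).
arr = np.array([[3, 1],
                [1, 2],
                [3, 1],
                [1, 0]])

# Deduplicate along axis 0; rows come back in lexicographic order.
unique_rows = np.unique(arr, axis=0)
print(unique_rows)  # [[1 0] [1 2] [3 1]]

# Optional: recover the first-occurrence order instead of sorted order.
_, first_idx = np.unique(arr, axis=0, return_index=True)
in_original_order = arr[np.sort(first_idx)]
print(in_original_order)  # [[3 1] [1 2] [1 0]]
```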
- •Reconstructing an Array from Sets
  - •Call u, inv = np.unique(arr, return_inverse=True).
  - •Store 'u' and 'inv' separately (useful for data compression).
  - •Rebuild the original array using u[inv].
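A minimal sketch of the reconstruction workflow, using a made-up 1D array of labels. The reshape call is a defensive assumption: for multidimensional inputs the shape of the inverse array has differed between NumPy versions, and reshaping to arr.shape works either way.

```python
import numpy as np

# Illustrative label array with repeats.
arr = np.array(["b", "a", "c", "a", "b", "b"])

# u holds the sorted unique values; inv maps each original position
# to the index of its value in u.
u, inv = np.unique(arr, return_inverse=True)
print(u)    # ['a' 'b' 'c']
print(inv)  # [1 0 2 0 1 1]

# Rebuild the original array from the two stored pieces.
rebuilt = u[inv].reshape(arr.shape)
print(np.array_equal(rebuilt, arr))  # True
```

Storing u plus a small-integer inv can take less space than arr when values are long and heavily repeated.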
- •Filtering by Membership
  - •Define the 'forbidden' values as a list or array (np.isin does not handle a Python set as expected).
  - •Generate a boolean mask: mask = ~np.isin(data, forbidden).
  - •Filter the data: clean_data = data[mask].
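The membership filter can be sketched like this. The values in data and forbidden are invented for illustration.

```python
import numpy as np

# Illustrative data and forbidden values.
data = np.array([4, 7, 1, 7, 9, 4, 2])
# Use a list or array here: np.isin treats a Python set as a single
# object rather than as a collection of values.
forbidden = [7, 4]

# True where the element is NOT one of the forbidden values.
mask = ~np.isin(data, forbidden)
clean_data = data[mask]

print(mask)        # [False False  True False  True False  True]
print(clean_data)  # [1 9 2]
```

If the forbidden values start out as a Python set, converting with list(forbidden) before the call avoids the pitfall noted in the comment.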
Non-Obvious Insights
- •Flattening by Default: Set operations such as np.unique work on flattened 1D versions of input arrays unless an axis is explicitly specified. np.isin is an exception: its result keeps the shape of its first argument.
- •NaN Handling: Like sorting, unique treats NaN as a value and sorts it to the end of the unique output.
- •Lexicographic Row Sort: When axis=0 is used in unique, the resulting unique rows are sorted lexicographically.
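The behaviors above can be checked with a short sketch. The arrays are made-up examples, and no claim is made about how many NaN entries survive deduplication, since that detail has varied across NumPy versions.

```python
import numpy as np

grid = np.array([[2, 1],
                 [2, 1],
                 [0, 5]])

# No axis: the input is flattened before deduplication.
flat_unique = np.unique(grid)
print(flat_unique)  # [0 1 2 5]

# axis=0: whole rows are compared and returned in lexicographic order.
row_unique = np.unique(grid, axis=0)
print(row_unique)   # [[0 5] [2 1]]

# NaN is sorted after all other values.
nan_unique = np.unique(np.array([3.0, np.nan, 1.0]))
print(np.isnan(nan_unique[-1]))  # True

# np.isin keeps the shape of its first argument.
member = np.isin(grid, [1, 5])
print(member.shape)  # (3, 2)
```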
Evidence
- •"Returns the sorted unique elements of an array." Source
- •"isin(element, test_elements...)... broadcasting over element only." Source
Scripts
- •scripts/numpy-set-ops_tool.py: Routines for unique row detection and inverse reconstruction.
- •scripts/numpy-set-ops_tool.js: Simulated set intersection logic.
Dependencies
- •numpy (Python)