@EventDriven
@SideEffectFree
@SupportsBatching
@Tags(value={"split","text"})
@InputRequirement(value=INPUT_REQUIRED)
@CapabilityDescription(value="Splits a text file into multiple smaller text files on line boundaries limited by maximum number of lines or total size of fragment. Each output split file will contain no more than the configured number of lines or bytes. If both Line Split Count and Maximum Fragment Size are specified, the split occurs at whichever limit is reached first. If the first line of a fragment exceeds the Maximum Fragment Size, that line will be output in a single split file which exceeds the configured maximum size limit. This component also allows one to specify that each split should include header lines. Header lines can be computed either by specifying the number of lines that should constitute a header or by using a header marker to match against the read lines. If such a match happens, the corresponding line will be treated as a header. Keep in mind that upon the first failure of a header marker match, no more matches will be performed and the rest of the data will be parsed as regular lines for a given split. If after computation of the header there is no more data, the resulting split will consist of only header lines.")
@WritesAttribute(attribute="text.line.count",description="The number of lines of text from the original FlowFile that were copied to this FlowFile")
@WritesAttribute(attribute="fragment.size",description="The number of bytes from the original FlowFile that were copied to this FlowFile, including header, if applicable, which is duplicated in each split FlowFile")
@WritesAttribute(attribute="fragment.identifier",description="All split FlowFiles produced from the same parent FlowFile will have the same randomly generated UUID added for this attribute")
@WritesAttribute(attribute="fragment.index",description="A one-up number that indicates the ordering of the split FlowFiles that were created from a single parent FlowFile")
@WritesAttribute(attribute="fragment.count",description="The number of split FlowFiles generated from the parent FlowFile")
@WritesAttribute(attribute="segment.original.filename ",description="The filename of the parent FlowFile")
@SeeAlso(value=MergeContent.class)
@SystemResourceConsideration(resource=MEMORY, description="The FlowFile with its attributes is stored in memory, not the content of the FlowFile. If many splits are generated due to the size of the content, or how the content is configured to be split, a two-phase approach may be necessary to avoid excessive use of memory.")
public class SplitText extends AbstractProcessor
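The splitting rule in the capability description (close a fragment at whichever limit is reached first, and let an oversized first line stand alone) can be illustrated with a minimal sketch. This is not the SplitText implementation: the class `SplitBoundarySketch`, the method `split`, and the convention that a limit of 0 means "not configured" are all assumptions made for illustration, and the sketch works on in-memory strings rather than FlowFile content.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration of the line/size limit rule; not NiFi code. */
final class SplitBoundarySketch {

    /**
     * Groups lines (each including its line terminator) into fragments.
     * A fragment is closed when it already holds maxLines lines, or when adding
     * the next line would push it past maxBytes. A limit of 0 disables that
     * check (an assumption of this sketch).
     */
    static List<List<String>> split(List<String> lines, int maxLines, long maxBytes) {
        List<List<String>> fragments = new ArrayList<>();
        List<String> current = new ArrayList<>();
        long currentBytes = 0;
        for (String line : lines) {
            long lineBytes = line.getBytes(StandardCharsets.UTF_8).length;
            boolean lineLimitHit = maxLines > 0 && current.size() >= maxLines;
            // The non-empty check lets an oversized first line form its own fragment.
            boolean sizeLimitHit = maxBytes > 0 && !current.isEmpty()
                    && currentBytes + lineBytes > maxBytes;
            if (lineLimitHit || sizeLimitHit) {
                fragments.add(current);
                current = new ArrayList<>();
                currentBytes = 0;
            }
            current.add(line);
            currentBytes += lineBytes;
        }
        if (!current.isEmpty()) {
            fragments.add(current);
        }
        return fragments;
    }
}
```

With a size limit of 5 bytes, a 12-byte first line still produces a fragment of its own, which matches the documented exception to the size limit.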
| Modifier and Type | Class and Description |
|---|---|
| private class | SplitText.SplitInfo<br>Container for hosting meta-information pertaining to the split so it can be used later to create FlowFile representing the split. |
| Modifier and Type | Field and Description |
|---|---|
| static String | FRAGMENT_COUNT |
| static String | FRAGMENT_ID |
| static String | FRAGMENT_INDEX |
| static PropertyDescriptor | FRAGMENT_MAX_SIZE |
| static String | FRAGMENT_SIZE |
| static PropertyDescriptor | HEADER_LINE_COUNT |
| static PropertyDescriptor | HEADER_MARKER |
| private int | headerLineCount |
| private String | headerMarker |
| static PropertyDescriptor | LINE_SPLIT_COUNT |
| private int | lineCount |
| private long | maxSplitSize |
| private static List<PropertyDescriptor> | properties |
| static Relationship | REL_FAILURE |
| static Relationship | REL_ORIGINAL |
| static Relationship | REL_SPLITS |
| private static Set<Relationship> | relationships |
| static PropertyDescriptor | REMOVE_TRAILING_NEWLINES |
| private boolean | removeTrailingNewLines |
| static String | SEGMENT_ORIGINAL_FILENAME |
| static String | SPLIT_LINE_COUNT |
| Constructor and Description |
|---|
| SplitText() |
| Modifier and Type | Method and Description |
|---|---|
| private SplitText.SplitInfo | computeHeader(TextLineDemarcator demarcator, long startOffset, long splitMaxLineCount, byte[] startsWithFilter, SplitText.SplitInfo previousSplitInfo)<br>Will generate SplitText.SplitInfo for the next fragment that represents the header of the future split. |
| private FlowFile | concatenateContents(FlowFile sourceFlowFile, ProcessSession session, FlowFile... flowFiles) |
| protected Collection<ValidationResult> | customValidate(ValidationContext validationContext) |
| private List<FlowFile> | generateSplitFlowFiles(String fragmentId, FlowFile sourceFlowFile, SplitText.SplitInfo splitInfo, List<SplitText.SplitInfo> computedSplitsInfo, ProcessSession processSession)<br>Generates the list of FlowFiles representing splits. |
| Set<Relationship> | getRelationships() |
| protected List<PropertyDescriptor> | getSupportedPropertyDescriptors() |
| private SplitText.SplitInfo | nextSplit(TextLineDemarcator demarcator, long startOffset, long splitMaxLineCount, SplitText.SplitInfo remainderSplitInfo, long startingLength)<br>Will generate SplitText.SplitInfo for the next split. |
| void | onSchedule(ProcessContext context) |
| void | onTrigger(ProcessContext context, ProcessSession processSession)<br>Will split the incoming stream releasing all splits as FlowFile at once. |
| private FlowFile | updateAttributes(ProcessSession processSession, FlowFile splitFlowFile, long splitLineCount, long splitFlowFileSize, String splitId, int splitIndex, String origFileName) |
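The header rule that computeHeader serves, as stated in the class description, is that a header is either a fixed number of leading lines or the run of leading lines that match a marker, with matching abandoned at the first non-matching line. The sketch below illustrates that rule only. `HeaderSketch` and `headerLength` are hypothetical names, the marker is treated as a plain string prefix (suggested by the `startsWithFilter` parameter name, but an assumption), and giving the line count precedence over the marker is also an assumption of this sketch.

```java
import java.util.List;

/** Hypothetical illustration of header detection; not NiFi code. */
final class HeaderSketch {

    /**
     * Returns how many leading lines form the header.
     * If headerLineCount is positive it wins (assumed precedence); otherwise
     * lines are matched against headerMarker until the first line that does
     * not start with it, after which no further matching is attempted.
     */
    static int headerLength(List<String> lines, int headerLineCount, String headerMarker) {
        if (headerLineCount > 0) {
            // The input may be shorter than the configured header length.
            return Math.min(headerLineCount, lines.size());
        }
        int count = 0;
        if (headerMarker != null && !headerMarker.isEmpty()) {
            for (String line : lines) {
                if (!line.startsWith(headerMarker)) {
                    break; // first failed match: the rest is regular data
                }
                count++;
            }
        }
        return count;
    }
}
```

Note that a later line beginning with the marker is not counted once matching has stopped, so in `["#h1", "#h2", "data", "#late"]` with marker `#` only the first two lines are header lines.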
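Methods inherited from class AbstractProcessor: onTrigger
Methods inherited from class AbstractSessionFactoryProcessor: getControllerServiceLookup, getIdentifier, getLogger, getNodeTypeProvider, init, initialize, isConfigurationRestored, isScheduled, toString, updateConfiguredRestoredTrue, updateScheduledFalse, updateScheduledTrue
Methods inherited from class AbstractConfigurableComponent: equals, getPropertyDescriptor, getPropertyDescriptors, getSupportedDynamicPropertyDescriptor, hashCode, onPropertyModified, validate
Methods inherited from class java.lang.Object: clone, finalize, getClass, notify, notifyAll, wait, wait, wait
Methods inherited from interface Processor: isStateful
Methods inherited from interface ConfigurableComponent: getPropertyDescriptor, getPropertyDescriptors, onPropertyModified, validate
public static final String SPLIT_LINE_COUNT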
public static final String FRAGMENT_SIZE
public static final String FRAGMENT_ID
public static final String FRAGMENT_INDEX
public static final String FRAGMENT_COUNT
public static final String SEGMENT_ORIGINAL_FILENAME
public static final PropertyDescriptor LINE_SPLIT_COUNT
public static final PropertyDescriptor FRAGMENT_MAX_SIZE
public static final PropertyDescriptor HEADER_LINE_COUNT
public static final PropertyDescriptor HEADER_MARKER
public static final PropertyDescriptor REMOVE_TRAILING_NEWLINES
public static final Relationship REL_ORIGINAL
public static final Relationship REL_SPLITS
public static final Relationship REL_FAILURE
private static final List<PropertyDescriptor> properties
private static final Set<Relationship> relationships
private volatile boolean removeTrailingNewLines
private volatile long maxSplitSize
private volatile int lineCount
private volatile int headerLineCount
private volatile String headerMarker
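The fragment attributes written to each split (`fragment.identifier`, `fragment.index`, `fragment.count`) let a downstream consumer check that it holds every split of one parent and restore their order. The following is a minimal sketch of such a consumer working on plain attribute maps. `FragmentOrderSketch`, `orderSplits`, and the `attrs` helper are hypothetical names, and the attribute keys are taken from the @WritesAttribute annotations rather than from the constant values, which this page does not show.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;

/** Hypothetical consumer of the split attributes; not NiFi code. */
final class FragmentOrderSketch {

    /** Builds a demo attribute map for one split (helper for this sketch only). */
    static Map<String, String> attrs(String fragmentId, int index, int count) {
        Map<String, String> m = new HashMap<>();
        m.put("fragment.identifier", fragmentId);
        m.put("fragment.index", Integer.toString(index));
        m.put("fragment.count", Integer.toString(count));
        return m;
    }

    /**
     * Returns the splits sorted by fragment.index. Rejects input that mixes
     * parents or whose size differs from the advertised fragment.count.
     */
    static List<Map<String, String>> orderSplits(List<Map<String, String>> splits) {
        if (splits.isEmpty()) {
            return new ArrayList<>();
        }
        String id = splits.get(0).get("fragment.identifier");
        int expected = Integer.parseInt(splits.get(0).get("fragment.count"));
        for (Map<String, String> a : splits) {
            if (!Objects.equals(id, a.get("fragment.identifier"))) {
                throw new IllegalArgumentException("splits come from different parents");
            }
        }
        if (splits.size() != expected) {
            throw new IllegalArgumentException("expected " + expected + " splits, got " + splits.size());
        }
        List<Map<String, String>> ordered = new ArrayList<>(splits);
        ordered.sort(Comparator.comparingInt(
                (Map<String, String> a) -> Integer.parseInt(a.get("fragment.index"))));
        return ordered;
    }
}
```

The sketch sorts on the index without assuming its starting value, since the documentation only describes it as a one-up number.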
public Set<Relationship> getRelationships()
Specified by: getRelationships in interface Processor
Overrides: getRelationships in class AbstractSessionFactoryProcessor
@OnScheduled public void onSchedule(ProcessContext context)
public void onTrigger(ProcessContext context, ProcessSession processSession) throws ProcessException
Specified by: onTrigger in class AbstractProcessor
Throws: ProcessException
protected Collection<ValidationResult> customValidate(ValidationContext validationContext)
Overrides: customValidate in class AbstractConfigurableComponent
protected List<PropertyDescriptor> getSupportedPropertyDescriptors()
Overrides: getSupportedPropertyDescriptors in class AbstractConfigurableComponent
private List<FlowFile> generateSplitFlowFiles(String fragmentId, FlowFile sourceFlowFile, SplitText.SplitInfo splitInfo, List<SplitText.SplitInfo> computedSplitsInfo, ProcessSession processSession)
Generates the list of FlowFiles representing splits. If the SplitText.SplitInfo provided as an argument to this operation is not null, it signifies the header information, and its contents will be included in each and every computed split.
private FlowFile concatenateContents(FlowFile sourceFlowFile, ProcessSession session, FlowFile... flowFiles)
Concatenates the contents of the provided FlowFiles into a single FlowFile. While this operation is as general as that description suggests, in the context of this processor there can only be two FlowFiles: the first represents the header content of the split and the second represents the split itself.
private FlowFile updateAttributes(ProcessSession processSession, FlowFile splitFlowFile, long splitLineCount, long splitFlowFileSize, String splitId, int splitIndex, String origFileName)
private SplitText.SplitInfo computeHeader(TextLineDemarcator demarcator, long startOffset, long splitMaxLineCount, byte[] startsWithFilter, SplitText.SplitInfo previousSplitInfo) throws IOException
Will generate SplitText.SplitInfo for the next fragment that represents the header of the future split.
If split size is controlled by the number of lines in the split, then the resulting SplitText.SplitInfo line count will always be <= 'splitMaxLineCount'. It can only be less if it reaches the EOF.
If split size is controlled by maxSplitSize, then the resulting SplitText.SplitInfo line count will vary, but the length of the split will never be > maxSplitSize; otherwise an IllegalStateException will be thrown.
This method also allows one to provide 'startsWithFilter' so that headers can be determined via such a filter (see HEADER_MARKER).
Throws: IOException
private SplitText.SplitInfo nextSplit(TextLineDemarcator demarcator, long startOffset, long splitMaxLineCount, SplitText.SplitInfo remainderSplitInfo, long startingLength) throws IOException
Will generate SplitText.SplitInfo for the next split.
If split size is controlled by the number of lines in the split, then the resulting SplitText.SplitInfo line count will always be <= 'splitMaxLineCount'.
If split size is controlled by maxSplitSize, then the resulting SplitText.SplitInfo line count will vary, but the length of the split will never be > maxSplitSize.
Throws: IOException
Copyright © 2023 Apache NiFi Project. All rights reserved.