How to Safely Process Large Data with Spring Batch - Introductory Guide to Job/Step/Chunk Processing

2026-02-02

Written by: Zuko

Updated: 2026-05-26

About this article

This article is a beginner-friendly guide to implementing large-scale batch processing with Spring Batch, complete with sample code. It covers the basic structure of Job/Step/ItemReader/ItemWriter, memory-efficient implementation using chunk processing, transaction management, and error handling with skip and retry, all using the current syntax for Spring Boot 3.x.

Enterprise systems commonly need to process tens of thousands or even millions of records in a single batch run. Many developers know about scheduled execution with @Scheduled but struggle with how to process such large volumes of data efficiently.

Spring Batch is a framework specialized for such large-scale data processing. This article explains Spring Batch from its basic structure to actual implementation methods, with practical examples.

What is Spring Batch

When to Use and When Not to Use Spring Batch

Spring Batch is not suited for every kind of scheduled processing. It shines in scenarios like the following.

Processing large amounts of data (tens of thousands to millions of records) in a single run
Wanting to resume from where it left off after a failure
Wanting to scope transactions per chunk
Needing robust error handling such as skip / retry

On the other hand, for processing with small record counts or workloads requiring immediate response, @Scheduled plus regular service methods are sufficient. For streaming processing that requires real-time characteristics, messaging platforms like Kafka are more appropriate.

Comparison with Other Schedulers and ETL Tools

Tool	Suitable Use Case	Difference from Spring Batch
@Scheduled	Lightweight scheduled processing	No large data processing or rerun control
Quartz	Complex scheduling	Has no data processing features
Apache Airflow / Embulk	ETL / Workflow	Runs outside the Spring application, requires separate operational infrastructure
Spring Batch	Large-scale data batches within JVM	High affinity with Spring Boot, state managed via JobRepository

For large-scale data processing that completes within a Spring application, Spring Batch is usually the option that adds the least operational overhead, because it runs inside the existing application and needs no separate platform.

Spring Batch is a framework optimized for reading, processing, and writing large amounts of data. While @Scheduled controls “when to execute,” Spring Batch provides “how to process large amounts of data.”

Spring Batch has the following characteristics.

Memory-efficient data processing via chunk processing
Transaction management per chunk
Retry and skip error handling features built in by default
Rerun control and metadata management

You can safely process millions of records without loading all data into memory.

The Four Main Components of Spring Batch

Spring Batch consists of the following components: four core ones (Job, Step, ItemReader, ItemWriter) plus the optional ItemProcessor.

Job - The top-level concept representing the entire batch process
Step - A processing unit that makes up a Job (multiple Steps can be defined in a single Job)
ItemReader - Reads data one record at a time from a data source
ItemProcessor - Processes and transforms the read data (optional)
ItemWriter - Writes the processed data in bulk

The basic flow is “read with ItemReader → process with ItemProcessor → write with ItemWriter.” This cycle is repeated per chunk.

How Chunk Processing Works

Chunk processing is the core mechanism of Spring Batch. It reads the specified chunk size worth of data and then writes it in bulk.

For example, if the chunk size is set to 100, it operates as follows.

ItemReader reads records one at a time until 100 have been collected
ItemProcessor processes each of the 100 records
ItemWriter writes the 100 records in bulk
The transaction is committed

Because 1 chunk = 1 transaction, commits and rollbacks happen per chunk.

A reasonable chunk size is around 100 to 1000 records. Too large causes out-of-memory errors, while too small causes performance degradation, so adjust according to data characteristics.
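The read, process, write, commit cycle described above can be sketched in plain Java. This is only a simplified model of the idea, not Spring Batch's actual implementation: the class name ChunkLoopSketch and its run method are made up for illustration, an Iterator stands in for the ItemReader, and a counter stands in for the transaction commit.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Function;

// Plain-Java model of a chunk-oriented step. Not Spring Batch code.
final class ChunkLoopSketch {

    // Returns the number of chunks written, i.e. the number of "commits".
    static <I, O> int run(Iterator<I> reader, Function<I, O> processor,
                          Consumer<List<O>> writer, int chunkSize) {
        int commits = 0;
        while (reader.hasNext()) {
            // 1. Read one item at a time until the chunk is full or input ends.
            List<I> items = new ArrayList<>(chunkSize);
            while (items.size() < chunkSize && reader.hasNext()) {
                items.add(reader.next());
            }
            // 2. Process each item of the chunk.
            List<O> processed = new ArrayList<>(items.size());
            for (I item : items) {
                processed.add(processor.apply(item));
            }
            // 3. Write the whole chunk in one call.
            writer.accept(processed);
            // 4. A real step would commit the transaction here.
            commits++;
        }
        return commits;
    }

    public static void main(String[] args) {
        List<Integer> input = new ArrayList<>();
        for (int i = 1; i <= 250; i++) {
            input.add(i);
        }
        List<Integer> writeSizes = new ArrayList<>();
        int commits = run(input.iterator(), i -> i * 2,
                chunk -> writeSizes.add(chunk.size()), 100);
        // prints "3 commits, write sizes [100, 100, 50]"
        System.out.println(commits + " commits, write sizes " + writeSizes);
    }
}
```

Only one chunk is held in memory at a time, which is why the chunk size, rather than the total record count, determines the memory footprint.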

Adding Dependencies

First, add the Spring Batch dependencies.

dependencies {
    implementation 'org.springframework.boot:spring-boot-starter-batch'
    implementation 'org.springframework.boot:spring-boot-starter-data-jpa'
    runtimeOnly 'com.h2database:h2'
}

Spring Batch requires a data source because it manages batch processing metadata via a mechanism called JobRepository. Use H2 during development, and PostgreSQL or MySQL in production.

Note that with Spring Boot 3.x, @EnableBatchProcessing is no longer needed for the default setup. Batch features are enabled via auto-configuration, and adding @EnableBatchProcessing causes that auto-configuration to back off, so leave it out unless you intend to take over the configuration yourself.

Also, from Spring Batch 5.0 onward, JobBuilderFactory/StepBuilderFactory are deprecated and have been changed to use JobBuilder/StepBuilder directly. This article uses the new notation.

Importing Data from CSV File to DB

As a basic example, let’s implement a process that reads a CSV file and inserts it into the DB.

The target entity class is as follows.

public class User {
    private Long id;
    private String name;
    private String email;
    // getter/setter omitted
}

Reading CSV with FlatFileItemReader

@Bean
public FlatFileItemReader<User> csvReader() {
    return new FlatFileItemReaderBuilder<User>()
        .name("csvReader")
        .resource(new ClassPathResource("users.csv"))
        .delimited()
        .names("id", "name", "email")
        .targetType(User.class)
        .build();
}

FlatFileItemReader is an ItemReader for reading CSV or TSV files. Specify column names with names() and the mapping target class with targetType(), and it automatically converts to objects when field names match.

If more flexible mapping is needed, you can define your own mapping logic using FieldSetMapper.

Writing to DB with JdbcBatchItemWriter

@Bean
public JdbcBatchItemWriter<User> dbWriter(DataSource dataSource) {
    return new JdbcBatchItemWriterBuilder<User>()
        .dataSource(dataSource)
        .sql("INSERT INTO users (id, name, email) VALUES (:id, :name, :email)")
        .beanMapped()
        .build();
}

JdbcBatchItemWriter uses JDBC’s batch update feature to INSERT multiple records in bulk. When you specify beanMapped(), the SQL named parameters (:id, :name, :email) are resolved from the item’s bean properties, so the property names must match the parameter names.

Defining Job and Step

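import org.springframework.batch.core.Job;
import org.springframework.batch.core.Step;
import org.springframework.batch.core.job.builder.JobBuilder;
import org.springframework.batch.core.repository.JobRepository;
import org.springframework.batch.core.step.builder.StepBuilder;
import org.springframework.batch.item.database.JdbcBatchItemWriter;
import org.springframework.batch.item.file.FlatFileItemReader;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.transaction.PlatformTransactionManager;

@Configuration
public class CsvImportJobConfig {

    // The Job consists of a single Step.
    @Bean
    public Job csvImportJob(JobRepository jobRepository, Step csvImportStep) {
        return new JobBuilder("csvImportJob", jobRepository)
            .start(csvImportStep)
            .build();
    }

    // Chunk-oriented Step: read from CSV, write to the DB, 100 records per transaction.
    @Bean
    public Step csvImportStep(JobRepository jobRepository,
                              PlatformTransactionManager transactionManager,
                              FlatFileItemReader<User> csvReader,
                              JdbcBatchItemWriter<User> dbWriter) {
        return new StepBuilder("csvImportStep", jobRepository)
            .<User, User>chunk(100, transactionManager)
            .reader(csvReader)
            .writer(dbWriter)
            .build();
    }
}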

Assemble the Job and Step. JobRepository and PlatformTransactionManager are injected as method arguments from the beans that Spring Boot auto-configures. With chunk(100, transactionManager), items are processed in chunks of 100 and each chunk runs in its own transaction.

Bulk Data Transformation from DB to DB

Next is an example of reading from one table, processing with business logic, and writing to another table.

public class OrderEntity {
    private Long id;
    private Long customerId;
    private BigDecimal amount;
    private String status;
    // getter/setter omitted
}

public class ProcessedOrder {
    private Long orderId;
    private BigDecimal finalAmount;
    private String status;
    // getter/setter omitted
}

Reading from DB with JdbcCursorItemReader

@Bean
public JdbcCursorItemReader<OrderEntity> orderReader(DataSource dataSource) {
    return new JdbcCursorItemReaderBuilder<OrderEntity>()
        .name("orderReader")
        .dataSource(dataSource)
        .sql("SELECT id, customer_id, amount, status FROM orders WHERE status = 'PENDING'")
        .rowMapper(new BeanPropertyRowMapper<>(OrderEntity.class))
        .build();
}

JdbcCursorItemReader uses a SQL cursor to read data one record at a time. It can sequentially process large amounts of data without loading everything into memory.

Processing Data with ItemProcessor

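import java.math.BigDecimal;

import org.springframework.batch.item.ItemProcessor;
import org.springframework.stereotype.Component;

@Component
public class OrderProcessor implements ItemProcessor<OrderEntity, ProcessedOrder> {

    @Override
    public ProcessedOrder process(OrderEntity order) throws Exception {
        ProcessedOrder processed = new ProcessedOrder();
        processed.setOrderId(order.getId());
        // Example business rule: multiply the amount by 0.9 (a 10% discount).
        processed.setFinalAmount(order.getAmount().multiply(new BigDecimal("0.9")));
        processed.setStatus("PROCESSED");
        return processed;
    }
}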

Apply business logic one record at a time with ItemProcessor. Returning null here filters out that record so it is not passed to the ItemWriter.
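The filtering behavior can be illustrated with a small plain-Java sketch. It is a model of the idea rather than Spring Batch code: the class FilteringSketch, its methods, and the rule "drop amounts that are zero or negative" are all assumptions made up for this example.

```java
import java.math.BigDecimal;
import java.util.ArrayList;
import java.util.List;

// Plain-Java model of "returning null filters the item". Not Spring Batch code.
final class FilteringSketch {

    // Hypothetical rule: non-positive amounts are filtered out,
    // everything else gets the 0.9 multiplier from the article.
    static BigDecimal process(BigDecimal amount) {
        if (amount.signum() <= 0) {
            return null; // filtered: this item should not reach the writer
        }
        return amount.multiply(new BigDecimal("0.9"));
    }

    // Mimics what happens between processor and writer for one chunk:
    // null results are dropped, everything else is passed on.
    static List<BigDecimal> processChunk(List<BigDecimal> items) {
        List<BigDecimal> toWrite = new ArrayList<>();
        for (BigDecimal item : items) {
            BigDecimal result = process(item);
            if (result != null) {
                toWrite.add(result);
            }
        }
        return toWrite;
    }
}
```

Filtering is different from skipping: a filtered record is a normal outcome, not an error, so it does not count toward any skip limit.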

Error Handling - skip and retry

Pre and Post Processing for Job Execution with JobExecutionListener

When you want to insert common processing before and after a Job, use JobExecutionListener. It is useful for measuring execution time, sending Slack notifications, sending metrics, and more.

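import java.time.Duration;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.batch.core.BatchStatus;
import org.springframework.batch.core.JobExecution;
import org.springframework.batch.core.JobExecutionListener;
import org.springframework.stereotype.Component;

@Component
public class LoggingJobListener implements JobExecutionListener {

    private static final Logger log = LoggerFactory.getLogger(LoggingJobListener.class);

    @Override
    public void beforeJob(JobExecution jobExecution) {
        log.info("Job started: {} params={}", jobExecution.getJobInstance().getJobName(), jobExecution.getJobParameters());
    }

    @Override
    public void afterJob(JobExecution jobExecution) {
        BatchStatus status = jobExecution.getStatus();
        // Start and end times are recorded on the JobExecution by Spring Batch.
        Duration duration = Duration.between(jobExecution.getStartTime(), jobExecution.getEndTime());
        log.info("Job finished: status={} duration={}ms", status, duration.toMillis());
        if (status == BatchStatus.FAILED) {
            // Send alerts on failure, etc.
        }
    }
}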

Register the listener with JobBuilder using .listener(listener). When you want pre/post processing per Step, use StepExecutionListener.

Parallel Processing - Multi-threaded Step and Partitioning

When single-threaded chunk processing cannot keep up with processing time, Spring Batch provides two main parallelization methods.

Multi-threaded Step (Processing the Same Step with Multiple Threads)

@Bean
public Step multiThreadedStep(JobRepository jobRepository,
                              PlatformTransactionManager transactionManager,
                              ItemReader<User> reader,
                              ItemWriter<User> writer) {
    return new StepBuilder("multiThreadedStep", jobRepository)
        .<User, User>chunk(100, transactionManager)
        .reader(reader)
        .writer(writer)
        .taskExecutor(new SimpleAsyncTaskExecutor("batch-thread-"))
        .build();
}

Simply setting taskExecutor parallelizes chunk processing. However, the ItemReader must be thread-safe. Since JdbcCursorItemReader and others are not thread-safe, wrap them with SynchronizedItemStreamReader or use JdbcPagingItemReader.
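To make the thread-safety requirement concrete, here is a plain-Java sketch of the idea behind a synchronized reader wrapper. It is not the Spring Batch class itself: SynchronizedReaderSketch and its nested SynchronizedReader are hypothetical names, and an Iterator stands in for a non-thread-safe ItemReader.

```java
import java.util.Iterator;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Plain-Java model of wrapping a non-thread-safe reader. Not Spring Batch code.
final class SynchronizedReaderSketch {

    // Only one thread at a time may touch the delegate.
    // Returns null when the input is exhausted, like ItemReader.read().
    static final class SynchronizedReader<T> {
        private final Iterator<T> delegate;

        SynchronizedReader(Iterator<T> delegate) {
            this.delegate = delegate;
        }

        synchronized T read() {
            return delegate.hasNext() ? delegate.next() : null;
        }
    }

    // Drains the reader from several threads and reports whether every
    // item was read exactly once. Assumes the items are distinct.
    static boolean readsEachItemOnce(List<Integer> items, int threads) {
        SynchronizedReader<Integer> reader = new SynchronizedReader<>(items.iterator());
        Set<Integer> seen = ConcurrentHashMap.newKeySet();
        AtomicInteger reads = new AtomicInteger();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int t = 0; t < threads; t++) {
            pool.submit(() -> {
                Integer item;
                while ((item = reader.read()) != null) {
                    seen.add(item);
                    reads.incrementAndGet();
                }
            });
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(30, TimeUnit.SECONDS)) {
                throw new IllegalStateException("workers did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return reads.get() == items.size() && seen.size() == items.size();
    }
}
```

Synchronizing the read keeps the data correct, but the reader becomes a serial section, so the speedup comes only from running processing and writing in parallel.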

Partitioning (Splitting Data and Running in Parallel)

Partitioning divides the input data into ranges and runs them in parallel via multiple Worker Steps. For example, you might split “IDs 1-10000 go to Worker1, 10001-20000 go to Worker2.” Because each worker processes its own range independently, it generally scales better than a multi-threaded Step, but it requires implementing a Partitioner. It is a common choice for production batches that periodically process large amounts of data.
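The range-splitting logic at the heart of a Partitioner can be sketched in plain Java. In Spring Batch, Partitioner.partition(int gridSize) returns a Map of partition names to ExecutionContext objects; the sketch below uses a long[] of {minId, maxId} in place of ExecutionContext so that it runs without any dependencies. The class name RangePartitionSketch and the partition naming are assumptions for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Plain-Java model of ID-range partitioning. Not Spring Batch code.
final class RangePartitionSketch {

    // Splits the inclusive range [minId, maxId] into at most gridSize
    // contiguous ranges. Each value is {rangeStart, rangeEnd}, both inclusive.
    static Map<String, long[]> partition(long minId, long maxId, int gridSize) {
        if (gridSize <= 0) {
            throw new IllegalArgumentException("gridSize must be positive");
        }
        Map<String, long[]> result = new LinkedHashMap<>();
        long total = maxId - minId + 1;
        long size = (total + gridSize - 1) / gridSize; // ceiling division
        long start = minId;
        for (int i = 0; i < gridSize && start <= maxId; i++) {
            long end = Math.min(start + size - 1, maxId);
            result.put("partition" + i, new long[] {start, end});
            start = end + 1;
        }
        return result;
    }
}
```

With partition(1, 20000, 2) this yields the split from the example above: 1-10000 and 10001-20000. A worker's reader would then use its range in a WHERE clause so that no two workers touch the same rows.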

Restart (Resuming a Failed Job from Where It Left Off)

Since Spring Batch records the progress of each Step in the JobRepository, you can resume a failed Job from where it failed.

When you rerun with the same JobParameters, Spring Batch references the previous JobExecution, skips completed Steps, and resumes processing from where the failed Step left off. In the case of chunk processing, it resumes from the chunk following the last committed one.

However, for the ItemReader to retain its resume position, it must save its state in the ExecutionContext. FlatFileItemReader and JdbcCursorItemReader provide this mechanism by default. If you use your own custom Reader, implement ItemStream.
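The following plain-Java sketch shows the idea of a reader that saves and restores its position. A Map stands in for the ExecutionContext, and the open and update methods loosely mirror the ItemStream callbacks of the same names; the class RestartableReaderSketch and the key "reader.position" are hypothetical.

```java
import java.util.List;
import java.util.Map;

// Plain-Java model of a restartable reader. Not Spring Batch code.
final class RestartableReaderSketch {

    private static final String POSITION_KEY = "reader.position"; // hypothetical key

    private final List<String> lines;
    private int position;

    RestartableReaderSketch(List<String> lines) {
        this.lines = lines;
    }

    // Called before reading starts: restore the saved position, if any.
    void open(Map<String, Object> context) {
        Object saved = context.get(POSITION_KEY);
        position = (saved == null) ? 0 : (Integer) saved;
    }

    // Called at each chunk boundary: record how far reading has progressed.
    void update(Map<String, Object> context) {
        context.put(POSITION_KEY, position);
    }

    // Returns the next line, or null when the input is exhausted.
    String read() {
        return position < lines.size() ? lines.get(position++) : null;
    }
}
```

If the process dies after update has stored position 2 but before the next chunk commits, a fresh reader opened with the same context starts again at the third line, so only the uncommitted chunk is repeated.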

Spring Batch Admin and Operational Monitoring

Spring Batch Admin was once an official project for managing Job execution status via a Web UI, but it is currently deprecated. Modern operational monitoring mainly uses the following methods.

Send Job execution metrics to Prometheus with Spring Boot Actuator + Micrometer
Directly monitor JobRepository tables (BATCH_JOB_EXECUTION, etc.) with SQL
Introduce Spring Cloud Data Flow as a Job platform

In production, the basic pattern is to periodically check the STATUS column of BATCH_JOB_EXECUTION and fire an alert when FAILED is detected.

Configuring skip and retry

Even when some records cannot be processed because of data inconsistencies or similar problems, you may want to continue rather than stop the whole Job. With skip, records that raise specific exceptions are skipped and processing moves on to the next record.

@Bean
public Step resilientStep(JobRepository jobRepository,
                          PlatformTransactionManager transactionManager,
                          ItemReader<User> reader,
                          ItemWriter<User> writer) {
    return new StepBuilder("resilientStep", jobRepository)
        .<User, User>chunk(100, transactionManager)
        .reader(reader)
        .writer(writer)
        .faultTolerant()
        .skip(ValidationException.class)
        .skipLimit(10)
        .retry(TransientDataAccessException.class)
        .retryLimit(3)
        .build();
}

faultTolerant() enables the error handling features. skip() specifies the exceptions to skip, and skipLimit(10) allows up to 10 skipped records; when the limit is exceeded, the Step fails.

retry() configures retries for temporary failures such as network errors, and retryLimit(3) caps the number of attempts for an item at 3. skip and retry can be combined.
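How skip and retry interact can be sketched in plain Java. This is a deliberately simplified, item-by-item model and not how Spring Batch implements fault tolerance internally (the real framework also deals with chunk rollback). The class SkipRetrySketch, its two exception types, and the run method are hypothetical names for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Plain-Java model of skip and retry semantics. Not Spring Batch code.
final class SkipRetrySketch {

    // Stand-ins for a skippable and a retryable exception type.
    static final class SkippableException extends RuntimeException {}
    static final class TransientException extends RuntimeException {}

    static <I, O> List<O> run(List<I> items, Function<I, O> processor,
                              int skipLimit, int maxAttempts) {
        List<O> written = new ArrayList<>();
        int skips = 0;
        for (I item : items) {
            int attempt = 0;
            while (true) {
                attempt++;
                try {
                    written.add(processor.apply(item));
                    break; // success, move on to the next item
                } catch (TransientException e) {
                    if (attempt >= maxAttempts) {
                        throw e; // attempts exhausted: the step fails
                    }
                    // otherwise loop and try the same item again
                } catch (SkippableException e) {
                    skips++;
                    if (skips > skipLimit) {
                        throw new IllegalStateException("skip limit exceeded", e);
                    }
                    break; // skip this item and continue
                }
            }
        }
        return written;
    }
}
```

A retryable failure is attempted again on the same item, while a skippable failure drops the item and counts against the skip limit; once either budget is used up, the run fails.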

Execution Control with JobParameters

Using JobParameters, you can flexibly control processing by passing parameters at runtime.

@Bean
@StepScope
public FlatFileItemReader<User> parameterizedReader(
        @Value("#{jobParameters['inputFile']}") String inputFile) {
    return new FlatFileItemReaderBuilder<User>()
        .name("parameterizedReader")
        .resource(new FileSystemResource(inputFile))
        .delimited()
        .names("id", "name", "email")
        .targetType(User.class)
        .build();
}

By adding @StepScope, the Bean is created at Step execution time via lazy evaluation, allowing it to receive JobParameters values. This allows different files to be processed on each execution.

JobParameters are also used to identify executions. Since the same JobParameters are treated as the same Job instance, a successfully completed Job cannot be rerun. To rerun, you need to change the parameters or use RunIdIncrementer to change parameters automatically.
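The identity rule can be modeled in a few lines of plain Java. This is only an illustration of the concept: JobInstanceSketch and its methods are hypothetical, parameters are simplified to strings, and where this sketch returns false, Spring Batch throws a JobInstanceAlreadyCompleteException.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Plain-Java model of "job name + parameters = job instance". Not Spring Batch code.
final class JobInstanceSketch {

    private final Set<String> completedInstances = new HashSet<>();

    // Sorting the parameters makes the key independent of insertion order.
    static String instanceKey(String jobName, Map<String, String> params) {
        return jobName + new TreeMap<>(params);
    }

    // Returns true if the run is accepted (and, in this model, completes);
    // returns false if an instance with the same key already completed.
    boolean tryRun(String jobName, Map<String, String> params) {
        return completedInstances.add(instanceKey(jobName, params));
    }
}
```

Calling tryRun twice with inputFile=a.csv accepts the first call and rejects the second, while a call with inputFile=b.csv is accepted as a new instance.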

How to Execute the Batch

There are several ways to execute the implemented batch.

Combine with @Scheduled for Periodic Execution

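import org.springframework.batch.core.Job;
import org.springframework.batch.core.JobParameters;
import org.springframework.batch.core.JobParametersBuilder;
import org.springframework.batch.core.launch.JobLauncher;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;

// Requires @EnableScheduling on a configuration class for @Scheduled to take effect.
@Component
public class BatchScheduler {

    private final JobLauncher jobLauncher;
    private final Job csvImportJob;

    public BatchScheduler(JobLauncher jobLauncher, Job csvImportJob) {
        this.jobLauncher = jobLauncher;
        this.csvImportJob = csvImportJob;
    }

    // Runs every day at 02:00.
    @Scheduled(cron = "0 0 2 * * *")
    public void runBatch() throws Exception {
        // A changing parameter makes each run a new Job instance.
        JobParameters params = new JobParametersBuilder()
            .addLong("time", System.currentTimeMillis())
            .toJobParameters();

        jobLauncher.run(csvImportJob, params);
    }
}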

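By combining @Scheduled and JobLauncher, you can easily achieve periodic execution. Passing a different parameter each time (such as the current time) makes every run a new Job instance, so the Job can run again even after it has completed successfully. Note that restarting a failed execution, as described earlier, requires the same parameters as the failed run.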

Implementation Considerations

Review the DB connection pool settings - Large-scale data processing acquires a connection per chunk. The default settings can lead to pool exhaustion, so be sure to also review HikariCP tuning.

Be careful about graceful shutdown for long-running batches - If a batch is forcibly terminated during a Kubernetes rolling update, it may leave a half-finished state. Configuring graceful shutdown so it can stop safely brings peace of mind.

Combine Job failure notifications with exception handlers - Relying solely on logs for failure detection delays awareness. As with a GlobalExceptionHandler in a web application, sending alerts from one central place, here a JobExecutionListener, makes operations easier.

If real-time integration is needed, instead of batches, also consider asynchronous messaging via Kafka Producer/Consumer.

Here are a few more practical considerations for real-world use.

Start with a chunk size of 100 - Adjustment based on data characteristics is needed, but starting at 100 is a safe bet.

Be careful about log output volume in large-scale data processing - Outputting logs for all records bloats the log file. Thin out the log output every 1000 records or so.

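Use a persistent DB for JobRepository in production - Use H2 only for development, and use PostgreSQL or similar in production. With Spring Boot, the metadata tables are created automatically only for embedded databases by default; for an external database, either set spring.batch.jdbc.initialize-schema=always or apply the schema scripts shipped with Spring Batch yourself.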
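For the point about log volume, a minimal plain-Java sketch of thinning out progress logs might look like the following. The class ProgressLogSketch and its method are hypothetical, and System.out stands in for a real logger.

```java
// Plain-Java sketch of logging progress only every N records.
final class ProgressLogSketch {

    // Processes totalRecords records and logs once per interval.
    // Returns the number of log lines written.
    static int processWithProgressLog(int totalRecords, int interval) {
        int logged = 0;
        for (int processed = 1; processed <= totalRecords; processed++) {
            // ... per-record work would happen here ...
            if (processed % interval == 0) {
                System.out.println("processed " + processed + " records");
                logged++;
            }
        }
        return logged;
    }
}
```

With 2500 records and an interval of 1000, this writes two progress lines instead of 2500 per-record lines.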

Summary

With Spring Batch, you can process large amounts of data safely and efficiently. By understanding the four components Job/Step/ItemReader/ItemWriter and grasping how chunk processing works, you can implement practical batch processing.

Since transaction management and error handling are also provided by default, you can confidently integrate this into business systems. Start with small batches first and gradually tackle more complex processing.
