Newest 'bigdata' Questions

0 votes

0 answers

51 views

fread() takes 60GB of RAM to load a 22GB CSV dataset [duplicate]

I am loading a CSV file into RStudio using fread(), and although the file is only 22 GB, my memory usage reaches 60 GB of my 64 GB. Why is that? This becomes a problem right afterward, as I need to join ...

Marti

101

asked yesterday

0 votes

0 answers

44 views

Dealing in R with too many columns [closed]

I am facing issues due to low memory caused by the use of too many columns. I have 900 data frames (df) each with 2 million rows. Each df contains values for one individual. I tried to merge all the ...

Kostas

1

asked 2 days ago

1 vote

0 answers

23 views

Hash a sparse vector from CountVectorizer

I am very new to Spark, so bear with me. I am currently trying to hash feature vectors generated by the CountVectorizer. So for the following example with a hash size of 50: +---+--------------------+--...
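The excerpt is cut off before it shows the expected output, so the following is only a rough sketch of one way to fold a sparse count vector into 50 buckets. It runs in plain Python rather than Spark, uses index modulo as a stand-in for a real hash function, and sums colliding counts; the function name `hash_sparse_vector` and the `(indices, values)` input layout are assumptions made for illustration, not part of the question.

```python
from collections import defaultdict


def hash_sparse_vector(indices, values, num_buckets=50):
    """Fold a sparse vector into num_buckets buckets (hypothetical helper).

    Assumption: index modulo num_buckets stands in for a hash function,
    and counts that land in the same bucket are summed.
    """
    if len(indices) != len(values):
        raise ValueError("indices and values must have the same length")
    buckets = defaultdict(float)
    for index, value in zip(indices, values):
        buckets[index % num_buckets] += value
    hashed_indices = sorted(buckets)
    return hashed_indices, [buckets[i] for i in hashed_indices]


# Indices 3 and 53 collide in bucket 3; index 120 lands in bucket 20.
hashed = hash_sparse_vector([3, 53, 120], [1.0, 2.0, 4.0])
```

In Spark itself the same per-row logic would typically be wrapped in a user-defined function; that wiring is left out here because the question's schema is not visible in the excerpt.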

EyedBread

21

asked 2 days ago

0 votes

0 answers

10 views

ES: Why Refresh delete Segment with "committed=false"?

Background: There is an index named "i_dm_f_da_enterprise" in ES, and I disabled automatic refresh by setting "refresh_interval" : -1. I notice that the refresh operation of the ...

Junbo Wang

1

asked Jul 15 at 12:21

0 votes

0 answers

35 views

Finding Top Users with Common Records in a Growing Dataset

I’m working on a project where I have a large dataset containing billions of records. Each user can have one or more records, and each record can be associated with multiple users. Given a specific ...
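The excerpt is truncated, so the exact requirement is not visible; going by the title, one plausible reading is "given a user, find the users who share the most records with them". A minimal in-memory sketch of that reading, using an inverted index from record to users, might look like the following. The function name, the `user_records` mapping, and the parameter `k` are all hypothetical, and a dataset with billions of records would need a distributed or database-backed version of the same idea.

```python
from collections import Counter, defaultdict


def top_users_with_common_records(user_records, target, k=3):
    """Return up to k (user, shared_count) pairs for the given target user.

    Assumption: user_records maps each user to the set of record IDs
    associated with that user.
    """
    # Inverted index: record ID -> set of users associated with it.
    record_to_users = defaultdict(set)
    for user, records in user_records.items():
        for record in records:
            record_to_users[record].add(user)

    # Count how many of the target's records each other user shares.
    shared = Counter()
    for record in user_records.get(target, ()):
        for other in record_to_users[record]:
            if other != target:
                shared[other] += 1
    return shared.most_common(k)


example = {"a": {1, 2, 3}, "b": {2, 3}, "c": {3}, "d": {9}}
result = top_users_with_common_records(example, "a", k=2)
```

Because the inverted index is keyed by record, new records can be added incrementally without rescanning existing users, which matters for a dataset that keeps growing.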

Herman Streltsov

31

asked Jul 14 at 15:42

0 votes

0 answers

19 views

How to append time-series data with PyArrow Datasets?

Problem: I'm looking to store time-series data that's being aggregated live to Parquet Datasets via PyArrow. I receive live batched data, for example, video view count each hour for the last 24 hours. ...

humanlikely

325

asked Jul 12 at 17:55

0 votes

0 answers

31 views

Optimizing Data Model for Frequent IoT Sensor Data Updates in Data Warehouse

I am working on a data engineering project involving IoT sensor data. In this project, I have a use case where: We have a table in the data warehouse to store sensor data generated every second by 10 ...

Harshit Chandani

1

asked Jul 9 at 18:03

0 votes

0 answers

24 views

Standalone spark 3.3.0 java application throws access denied exception when reading from files on mounted drive

I'm using Spark 3.3.0 on a standalone cluster, and I have a mounted drive from which I need to read files that arrive periodically. As the Spark application runs as the spark user and the mounted drive is ...

Petar Markov

1

asked Jul 9 at 10:46

1 vote

0 answers

39 views

How to load .dat file to Hive with additional columns?

I want to load a .dat file (without headers) into a Hive external table. But the Hive table has extra columns, such as cob_date, region, and file_name, that are not present in the .dat file. cob_date will be the ...

Big data Pyspark

71

asked Jul 8 at 9:55

-1 votes

0 answers

14 views

How to get MetricQueryService URL?

@Override public CompletableFuture<MetricQueryServiceGateway> retrieveService(String rpcServiceAddress) { return rpcService.connect(rpcServiceAddress, MetricQueryServiceGateway.class); } I ...

seedoilz

19

asked Jul 4 at 9:00

-1 votes

0 answers

14 views

Assistance with Integrating Open Data Cube and Cesium for 3D Topographical Earth Model

I am currently working on a project that involves developing an integrated big data processing system and visualizing the data in a 3D topographical Earth model. I am seeking assistance with the ...

Angel Zaldivar

1

asked Jul 3 at 16:17

-1 votes

0 answers

10 views

Why Livy Server delete request doesn't terminate the yarn application if session in starting state

I'm seeing the following Livy Server behavior: if a Livy session is in the starting state and the delete API is called for that session, Livy Server returns a success status saying it was deleted, and the Livy Server UI also doesn't ...

agarwal_achhnera

2,446

asked Jul 2 at 8:56

0 votes

0 answers

21 views

How to build ActorSystem in Flink 1.13.5?

This is how I build ActorSystem in Flink 1.8.5. public static ActorSystem createNewActorSystem() throws Exception { String ip = HostPortUtil.getLocalIp(); Configuration configuration = new ...

seedoilz

19

asked Jul 2 at 6:34

-1 votes

0 answers

17 views

Save data into hive from worker nodes using apache spark

I am working on a data backloading task using Apache Spark and Hive. I am loading the data from Hive, mapping it to another schema, and storing the new result in Hive. Now there might be a few failures ...

Destravna

29

asked Jun 30 at 20:13

0 votes

1 answer

28 views

Comparing two types of data in bigQuery

We have a very big dataset, and I need to get all the values that map from the source attributes to the normalized attributes in my JSON. The relation between normalized and source is that if the ...

Ankita Prasad

23

asked Jun 29 at 22:04
